Models, code, and papers for "Yu-Wing Tai":

Adversarial Attacks Beyond the Image Space

Sep 10, 2018
Xiaohui Zeng, Chenxi Liu, Yu-Siang Wang, Weichao Qiu, Lingxi Xie, Yu-Wing Tai, Chi Keung Tang, Alan L. Yuille

Generating adversarial examples is an intriguing problem and an important way of understanding the working mechanism of deep neural networks. Most existing approaches generated perturbations in the image space, i.e., each pixel can be modified independently. However, in this paper we pay special attention to the subset of adversarial examples that are physically authentic -- those corresponding to actual changes in 3D physical properties (like surface normals, illumination condition, etc.). These adversaries arguably pose a more serious concern, as they demonstrate the possibility of causing neural network failure by small perturbations of real-world 3D objects and scenes. In the contexts of object classification and visual question answering, we augment state-of-the-art deep neural networks that receive 2D input images with a rendering module (either differentiable or not) in front, so that a 3D scene (in the physical space) is rendered into a 2D image (in the image space), and then mapped to a prediction (in the output space). The adversarial perturbations can now go beyond the image space, and have clear meanings in the 3D physical world. Through extensive experiments, we found that a vast majority of image-space adversaries cannot be explained by adjusting parameters in the physical space, i.e., they are usually physically inauthentic. But it is still possible to successfully attack beyond the image space on the physical space (such that authenticity is enforced), though this is more difficult than image-space attacks, reflected in lower success rates and heavier perturbations required.

* 10 pages, 4 figures (new method and experiments added beyond v2) 

  Click for Model/Code and Paper
Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector

Aug 06, 2019
Qi Fan, Wei Zhuo, Yu-Wing Tai

Conventional methods for object detection usually requires substantial amount of training data and to prepare such high quality training data is labor intensive. In this paper, we propose few-shot object detection which aims to detect objects of unseen class with a few training examples. Central to our method is the Attention-RPN and the multi-relation module which fully exploit the similarity between the few shot training examples and the test set to detect novel objects while suppressing the false detection in background. To train our network, we have prepared a new dataset which contains 1000 categories of varies objects with high quality annotations. To the best of our knowledge, this is also the first dataset specifically designed for few shot object detection. Once our network is trained, we can apply object detection for unseen classes without further training or fine tuning. This is also the major advantage of few shot object detection. Our method is general, and has a wide range of applications. We demonstrate the effectiveness of our method quantitatively and qualitatively on different datasets. The dataset link is:

  Click for Model/Code and Paper
Deep Saliency with Encoded Low level Distance Map and High Level Features

Apr 19, 2016
Gayoung Lee, Yu-Wing Tai, Junmo Kim

Recent advances in saliency detection have utilized deep learning to obtain high level features to detect salient regions in a scene. These advances have demonstrated superior results over previous works that utilize hand-crafted low level features for saliency detection. In this paper, we demonstrate that hand-crafted features can provide complementary information to enhance performance of saliency detection that utilizes only high level features. Our method utilizes both high level and low level features for saliency detection under a unified deep learning framework. The high level features are extracted using the VGG-net, and the low level features are compared with other parts of an image to form a low level distance map. The low level distance map is then encoded using a convolutional neural network(CNN) with multiple 1X1 convolutional and ReLU layers. We concatenate the encoded low level distance map and the high level features, and connect them to a fully connected neural network classifier to evaluate the saliency of a query region. Our experiments show that our method can further improve the performance of state-of-the-art deep learning-based saliency detection methods.

* Accepted by IEEE Conference on Computer Vision and Pattern Recognition(CVPR) 2016. Project page: 

  Click for Model/Code and Paper
Conditional CycleGAN for Attribute Guided Face Image Generation

May 28, 2017
Yongyi Lu, Yu-Wing Tai, Chi-Keung Tang

State-of-the-art techniques in Generative Adversarial Networks (GANs) such as cycleGAN is able to learn the mapping of one image domain $X$ to another image domain $Y$ using unpaired image data. We extend the cycleGAN to ${\it Conditional}$ cycleGAN such that the mapping from $X$ to $Y$ is subjected to attribute condition $Z$. Using face image generation as an application example, where $X$ is a low resolution face image, $Y$ is a high resolution face image, and $Z$ is a set of attributes related to facial appearance (e.g. gender, hair color, smile), we present our method to incorporate $Z$ into the network, such that the hallucinated high resolution face image $Y'$ not only satisfies the low resolution constrain inherent in $X$, but also the attribute condition prescribed by $Z$. Using face feature vector extracted from face verification network as $Z$, we demonstrate the efficacy of our approach on identity-preserving face image super-resolution. Our approach is general and applicable to high-quality face image generation where specific facial attributes can be controlled easily in the automatically generated results.

  Click for Model/Code and Paper
A Unified Approach of Multi-scale Deep and Hand-crafted Features for Defocus Estimation

Apr 28, 2017
Jinsun Park, Yu-Wing Tai, Donghyeon Cho, In So Kweon

In this paper, we introduce robust and synergetic hand-crafted features and a simple but efficient deep feature from a convolutional neural network (CNN) architecture for defocus estimation. This paper systematically analyzes the effectiveness of different features, and shows how each feature can compensate for the weaknesses of other features when they are concatenated. For a full defocus map estimation, we extract image patches on strong edges sparsely, after which we use them for deep and hand-crafted feature extraction. In order to reduce the degree of patch-scale dependency, we also propose a multi-scale patch extraction strategy. A sparse defocus map is generated using a neural network classifier followed by a probability-joint bilateral filter. The final defocus map is obtained from the sparse defocus map with guidance from an edge-preserving filtered input image. Experimental results show that our algorithm is superior to state-of-the-art algorithms in terms of defocus estimation. Our work can be used for applications such as segmentation, blur magnification, all-in-focus image generation, and 3-D estimation.

* 10 pages, 14 figures. To appear in CVPR 2017. Project page : 

  Click for Model/Code and Paper
Refining Geometry from Depth Sensors using IR Shading Images

Aug 18, 2016
Gyeongmin Choe, Jaesik Park, Yu-Wing Tai, In So Kweon

We propose a method to refine geometry of 3D meshes from a consumer level depth camera, e.g. Kinect, by exploiting shading cues captured from an infrared (IR) camera. A major benefit to using an IR camera instead of an RGB camera is that the IR images captured are narrow band images that filter out most undesired ambient light, which makes our system robust against natural indoor illumination. Moreover, for many natural objects with colorful textures in the visible spectrum, the subjects appear to have a uniform albedo in the IR spectrum. Based on our analyses on the IR projector light of the Kinect, we define a near light source IR shading model that describes the captured intensity as a function of surface normals, albedo, lighting direction, and distance between light source and surface points. To resolve the ambiguity in our model between the normals and distances, we utilize an initial 3D mesh from the Kinect fusion and multi-view information to reliably estimate surface details that were not captured and reconstructed by the Kinect fusion. Our approach directly operates on the mesh model for geometry refinement. We ran experiments on our algorithm for geometries captured by both the Kinect I and Kinect II, as the depth acquisition in Kinect I is based on a structured-light technique and that of the Kinect II is based on a time-of-flight (ToF) technology. The effectiveness of our approach is demonstrated through several challenging real-world examples. We have also performed a user study to evaluate the quality of the mesh models before and after our refinements.

* Accepted to the International Journal of Computer Vision (IJCV) 

  Click for Model/Code and Paper
SF-Net: Structured Feature Network for Continuous Sign Language Recognition

Aug 04, 2019
Zhaoyang Yang, Zhenmei Shi, Xiaoyong Shen, Yu-Wing Tai

Continuous sign language recognition (SLR) aims to translate a signing sequence into a sentence. It is very challenging as sign language is rich in vocabulary, while many among them contain similar gestures and motions. Moreover, it is weakly supervised as the alignment of signing glosses is not available. In this paper, we propose Structured Feature Network (SF-Net) to address these challenges by effectively learn multiple levels of semantic information in the data. The proposed SF-Net extracts features in a structured manner and gradually encodes information at the frame level, the gloss level and the sentence level into the feature representation. The proposed SF-Net can be trained end-to-end without the help of other models or pre-training. We tested the proposed SF-Net on two large scale public SLR datasets collected from different continuous SLR scenarios. Results show that the proposed SF-Net clearly outperforms previous sequence level supervision based methods in terms of both accuracy and adaptability.

* 12 pages, 8 figures 

  Click for Model/Code and Paper
Fast Randomized Singular Value Thresholding for Low-rank Optimization

Aug 22, 2016
Tae-Hyun Oh, Yasuyuki Matsushita, Yu-Wing Tai, In So Kweon

Rank minimization can be converted into tractable surrogate problems, such as Nuclear Norm Minimization (NNM) and Weighted NNM (WNNM). The problems related to NNM, or WNNM, can be solved iteratively by applying a closed-form proximal operator, called Singular Value Thresholding (SVT), or Weighted SVT, but they suffer from high computational cost of Singular Value Decomposition (SVD) at each iteration. We propose a fast and accurate approximation method for SVT, that we call fast randomized SVT (FRSVT), with which we avoid direct computation of SVD. The key idea is to extract an approximate basis for the range of the matrix from its compressed matrix. Given the basis, we compute partial singular values of the original matrix from the small factored matrix. In addition, by developping a range propagation method, our method further speeds up the extraction of approximate basis at each iteration. Our theoretical analysis shows the relationship between the approximation bound of SVD and its effect to NNM via SVT. Along with the analysis, our empirical results quantitatively and qualitatively show that our approximation rarely harms the convergence of the host algorithms. We assess the efficiency and accuracy of the proposed method on various computer vision problems, e.g., subspace clustering, weather artifact removal, and simultaneous multi-image alignment and rectification.

* Appeared in CVPR 2015, and under major revision of TPAMI. Source code is available on 

  Click for Model/Code and Paper
DAWN: Dual Augmented Memory Network for Unsupervised Video Object Tracking

Aug 08, 2019
Zhenmei Shi, Haoyang Fang, Yu-Wing Tai, Chi-Keung Tang

Psychological studies have found that human visual tracking system involves learning, memory, and planning. Despite recent successes, not many works have focused on memory and planning in deep learning based tracking. We are thus interested in memory augmented network, where an external memory remembers the evolving appearance of the target (foreground) object without backpropagation for updating weights. Our Dual Augmented Memory Network (DAWN) is unique in remembering both target and background, and using an improved attention LSTM memory to guide the focus on memorized features. DAWN is effective in unsupervised tracking in handling total occlusion, severe motion blur, abrupt changes in target appearance, multiple object instances, and similar foreground and background features. We present extensive quantitative and qualitative experimental comparison with state-of-the-art methods including top contenders in recent VOT challenges. Notably, despite the straightforward implementation, DAWN is ranked third in both VOT2016 and VOT2017 challenges with excellent success rate among all VOT fast trackers running at fps > 10 in unsupervised tracking in both challenges. We propose DAWN-RPN, where we simply augment our memory and attention LSTM modules to the state-of-the-art SiamRPN, and report immediate performance gain, thus demonstrating DAWN can work well with and directly benefit other models to handle difficult cases as well.

* Zhenmei and Haoyang have equal contribution 

  Click for Model/Code and Paper
Pairwise Body-Part Attention for Recognizing Human-Object Interactions

Jul 28, 2018
Hao-Shu Fang, Jinkun Cao, Yu-Wing Tai, Cewu Lu

In human-object interactions (HOI) recognition, conventional methods consider the human body as a whole and pay a uniform attention to the entire body region. They ignore the fact that normally, human interacts with an object by using some parts of the body. In this paper, we argue that different body parts should be paid with different attention in HOI recognition, and the correlations between different body parts should be further considered. This is because our body parts always work collaboratively. We propose a new pairwise body-part attention model which can learn to focus on crucial parts, and their correlations for HOI recognition. A novel attention based feature selection method and a feature representation scheme that can capture pairwise correlations between body parts are introduced in the model. Our proposed approach achieved 4% improvement over the state-of-the-art results in HOI recognition on the HICO dataset. We will make our model and source codes publicly available.

  Click for Model/Code and Paper
Image Generation from Sketch Constraint Using Contextual GAN

Jul 26, 2018
Yongyi Lu, Shangzhe Wu, Yu-Wing Tai, Chi-Keung Tang

In this paper we investigate image generation guided by hand sketch. When the input sketch is badly drawn, the output of common image-to-image translation follows the input edges due to the hard condition imposed by the translation process. Instead, we propose to use sketch as weak constraint, where the output edges do not necessarily follow the input edges. We address this problem using a novel joint image completion approach, where the sketch provides the image context for completing, or generating the output image. We train a generated adversarial network, i.e, contextual GAN to learn the joint distribution of sketch and the corresponding image by using joint images. Our contextual GAN has several advantages. First, the simple joint image representation allows for simple and effective learning of joint distribution in the same image-sketch space, which avoids complicated issues in cross-domain learning. Second, while the output is related to its input overall, the generated features exhibit more freedom in appearance and do not strictly align with the input features as previous conditional GANs do. Third, from the joint image's point of view, image and sketch are of no difference, thus exactly the same deep joint image completion network can be used for image-to-sketch generation. Experiments evaluated on three different datasets show that our contextual GAN can generate more realistic images than state-of-the-art conditional GANs on challenging inputs and generalize well on common categories.

* ECCV 2018 

  Click for Model/Code and Paper
Deep High Dynamic Range Imaging with Large Foreground Motions

Jul 24, 2018
Shangzhe Wu, Jiarui Xu, Yu-Wing Tai, Chi-Keung Tang

This paper proposes the first non-flow-based deep framework for high dynamic range (HDR) imaging of dynamic scenes with large-scale foreground motions. In state-of-the-art deep HDR imaging, input images are first aligned using optical flows before merging, which are still error-prone due to occlusion and large motions. In stark contrast to flow-based methods, we formulate HDR imaging as an image translation problem without optical flows. Moreover, our simple translation network can automatically hallucinate plausible HDR details in the presence of total occlusion, saturation and under-exposure, which are otherwise almost impossible to recover by conventional optimization approaches. Our framework can also be extended for different reference images. We performed extensive qualitative and quantitative comparisons to show that our approach produces excellent results where color artifacts and geometric distortions are significantly reduced compared to existing state-of-the-art methods, and is robust across various inputs, including images without radiometric calibration.

* ECCV 2018 

  Click for Model/Code and Paper
RMPE: Regional Multi-person Pose Estimation

Feb 04, 2018
Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, Cewu Lu

Multi-person pose estimation in the wild is challenging. Although state-of-the-art human detectors have demonstrated good performance, small errors in localization and recognition are inevitable. These errors can cause failures for a single-person pose estimator (SPPE), especially for methods that solely depend on human detection results. In this paper, we propose a novel regional multi-person pose estimation (RMPE) framework to facilitate pose estimation in the presence of inaccurate human bounding boxes. Our framework consists of three components: Symmetric Spatial Transformer Network (SSTN), Parametric Pose Non-Maximum-Suppression (NMS), and Pose-Guided Proposals Generator (PGPG). Our method is able to handle inaccurate bounding boxes and redundant detections, allowing it to achieve a 17% increase in mAP over the state-of-the-art methods on the MPII (multi person) dataset.Our model and source codes are publicly available.

* Models & Codes available at or 

  Click for Model/Code and Paper
Deep Video Generation, Prediction and Completion of Human Action Sequences

Dec 08, 2017
Haoye Cai, Chunyan Bai, Yu-Wing Tai, Chi-Keung Tang

Current deep learning results on video generation are limited while there are only a few first results on video prediction and no relevant significant results on video completion. This is due to the severe ill-posedness inherent in these three problems. In this paper, we focus on human action videos, and propose a general, two-stage deep framework to generate human action videos with no constraints or arbitrary number of constraints, which uniformly address the three problems: video generation given no input frames, video prediction given the first few frames, and video completion given the first and last frames. To make the problem tractable, in the first stage we train a deep generative model that generates a human pose sequence from random noise. In the second stage, a skeleton-to-image network is trained, which is used to generate a human action video given the complete human pose sequence generated in the first stage. By introducing the two-stage strategy, we sidestep the original ill-posed problems while producing for the first time high-quality video generation/prediction/completion results of much longer duration. We present quantitative and qualitative evaluation to show that our two-stage approach outperforms state-of-the-art methods in video generation, prediction and video completion. Our video result demonstration can be viewed at

* Under review for CVPR 2018. Haoye and Chunyan have equal contribution 

  Click for Model/Code and Paper
MAVOT: Memory-Augmented Video Object Tracking

Nov 26, 2017
Boyu Liu, Yanzhao Wang, Yu-Wing Tai, Chi-Keung Tang

We introduce a one-shot learning approach for video object tracking. The proposed algorithm requires seeing the object to be tracked only once, and employs an external memory to store and remember the evolving features of the foreground object as well as backgrounds over time during tracking. With the relevant memory retrieved and updated in each tracking, our tracking model is capable of maintaining long-term memory of the object, and thus can naturally deal with hard tracking scenarios including partial and total occlusion, motion changes and large scale and shape variations. In our experiments we use the ImageNet ILSVRC2015 video detection dataset to train and use the VOT-2016 benchmark to test and compare our Memory-Augmented Video Object Tracking (MAVOT) model. From the results, we conclude that given its oneshot property and simplicity in design, MAVOT is an attractive approach in visual tracking because it shows good performance on VOT-2016 benchmark and is among the top 5 performers in accuracy and robustness in occlusion, motion changes and empty target.

* Submitted to CVPR2018 

  Click for Model/Code and Paper
Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures

Jul 12, 2016
Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang

State-of-the-art neural networks are getting deeper and wider. While their performance increases with the increasing number of layers and neurons, it is crucial to design an efficient deep architecture in order to reduce computational and memory costs. Designing an efficient neural network, however, is labor intensive requiring many experiments, and fine-tunings. In this paper, we introduce network trimming which iteratively optimizes the network by pruning unimportant neurons based on analysis of their outputs on a large dataset. Our algorithm is inspired by an observation that the outputs of a significant portion of neurons in a large network are mostly zero, regardless of what inputs the network received. These zero activation neurons are redundant, and can be removed without affecting the overall accuracy of the network. After pruning the zero activation neurons, we retrain the network using the weights before pruning as initialization. We alternate the pruning and retraining to further reduce zero activations in a network. Our experiments on the LeNet and VGG-16 show that we can achieve high compression ratio of parameters without losing or even achieving higher accuracy than the original network.

  Click for Model/Code and Paper
Reflective Decoding Network for Image Captioning

Aug 30, 2019
Lei Ke, Wenjie Pei, Ruiyu Li, Xiaoyong Shen, Yu-Wing Tai

State-of-the-art image captioning methods mostly focus on improving visual features, less attention has been paid to utilizing the inherent properties of language to boost captioning performance. In this paper, we show that vocabulary coherence between words and syntactic paradigm of sentences are also important to generate high-quality image caption. Following the conventional encoder-decoder framework, we propose the Reflective Decoding Network (RDN) for image captioning, which enhances both the long-sequence dependency and position perception of words in a caption decoder. Our model learns to collaboratively attend on both visual and textual features and meanwhile perceive each word's relative position in the sentence to maximize the information delivered in the generated caption. We evaluate the effectiveness of our RDN on the COCO image captioning datasets and achieve superior performance over the previous methods. Further experiments reveal that our approach is particularly advantageous for hard cases with complex scenes to describe by captions.

* ICCV 2019 

  Click for Model/Code and Paper
StableNet: Semi-Online, Multi-Scale Deep Video Stabilization

Jul 24, 2019
Chia-Hung Huang, Hang Yin, Yu-Wing Tai, Chi-Keung Tang

Video stabilization algorithms are of greater importance nowadays with the prevalence of hand-held devices which unavoidably produce videos with undesirable shaky motions. In this paper we propose a data-driven online video stabilization method along with a paired dataset for deep learning. The network processes each unsteady frame progressively in a multi-scale manner, from low resolution to high resolution, and then outputs an affine transformation to stabilize the frame. Different from conventional methods which require explicit feature tracking or optical flow estimation, the underlying stabilization process is learned implicitly from the training data, and the stabilization process can be done online. Since there are limited public video stabilization datasets available, we synthesized unstable videos with different extent of shake that simulate real-life camera movement. Experiments show that our method is able to outperform other stabilization methods in several unstable samples while remaining comparable in general. Also, our method is tested on complex contents and found robust enough to dampen these samples to some extent even it was not explicitly trained in the contents.

* Chia-Hung and Hang have equal contribution 

  Click for Model/Code and Paper
Weakly- and Self-Supervised Learning for Content-Aware Deep Image Retargeting

Aug 09, 2017
Donghyeon Cho, Jinsun Park, Tae-Hyun Oh, Yu-Wing Tai, In So Kweon

This paper proposes a weakly- and self-supervised deep convolutional neural network (WSSDCNN) for content-aware image retargeting. Our network takes a source image and a target aspect ratio, and then directly outputs a retargeted image. Retargeting is performed through a shift map, which is a pixel-wise mapping from the source to the target grid. Our method implicitly learns an attention map, which leads to a content-aware shift map for image retargeting. As a result, discriminative parts in an image are preserved, while background regions are adjusted seamlessly. In the training phase, pairs of an image and its image-level annotation are used to compute content and structure losses. We demonstrate the effectiveness of our proposed method for a retargeting application with insightful analyses.

* 10 pages, 11 figures. To appear in ICCV 2017, Spotlight Presentation 

  Click for Model/Code and Paper
Weakly and Semi Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer

May 11, 2018
Hao-Shu Fang, Guansong Lu, Xiaolin Fang, Jianwen Xie, Yu-Wing Tai, Cewu Lu

Human body part parsing, or human semantic part segmentation, is fundamental to many computer vision tasks. In conventional semantic segmentation methods, the ground truth segmentations are provided, and fully convolutional networks (FCN) are trained in an end-to-end scheme. Although these methods have demonstrated impressive results, their performance highly depends on the quantity and quality of training data. In this paper, we present a novel method to generate synthetic human part segmentation data using easily-obtained human keypoint annotations. Our key idea is to exploit the anatomical similarity among human to transfer the parsing results of a person to another person with similar pose. Using these estimated results as additional training data, our semi-supervised model outperforms its strong-supervised counterpart by 6 mIOU on the PASCAL-Person-Part dataset, and we achieve state-of-the-art human parsing results. Our approach is general and can be readily extended to other object/animal parsing task assuming that their anatomical similarity can be annotated by keypoints. The proposed model and accompanying source code are available at

* CVPR'18 spotlight 

  Click for Model/Code and Paper