Models, code, and papers for "Naiyan Wang":

Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video

Oct 03, 2019
Jia-Wang Bian, Zhichao Li, Naiyan Wang, Huangying Zhan, Chunhua Shen, Ming-Ming Cheng, Ian Reid

Recent work has shown that CNN-based depth and ego-motion estimators can be learned using unlabelled monocular videos. However, the performance is limited by unidentified moving objects that violate the underlying static scene assumption in geometric image reconstruction. More significantly, due to lack of proper constraints, networks output scale-inconsistent results over different samples, i.e., the ego-motion network cannot provide full camera trajectories over a long video sequence because of the per-frame scale ambiguity. This paper tackles these challenges by proposing a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions. Since we do not leverage multi-task learning like recent works, our framework is much simpler and more efficient. Comprehensive evaluation results demonstrate that our depth estimator achieves the state-of-the-art performance on the KITTI dataset. Moreover, we show that our ego-motion network is able to predict a globally scale-consistent camera trajectory for long video sequences, and the resulting visual odometry accuracy is competitive with the recent model that is trained using stereo videos. To the best of our knowledge, this is the first work to show that deep networks trained using unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.

* Accepted to NeurIPS 2019. Code is available at 

  Click for Model/Code and Paper
Data-Driven Sparse Structure Selection for Deep Neural Networks

Sep 05, 2018
Zehao Huang, Naiyan Wang

Deep convolutional neural networks have liberated its extraordinary power on various tasks. However, it is still very challenging to deploy state-of-the-art models into real-world applications due to their high computational complexity. How can we design a compact and effective network without massive experiments and expert knowledge? In this paper, we propose a simple and effective framework to learn and prune deep models in an end-to-end manner. In our framework, a new type of parameter -- scaling factor is first introduced to scale the outputs of specific structures, such as neurons, groups or residual blocks. Then we add sparsity regularizations on these factors, and solve this optimization problem by a modified stochastic Accelerated Proximal Gradient (APG) method. By forcing some of the factors to zero, we can safely remove the corresponding structures, thus prune the unimportant parts of a CNN. Comparing with other structure selection methods that may need thousands of trials or iterative fine-tuning, our method is trained fully end-to-end in one training pass without bells and whistles. We evaluate our method, Sparse Structure Selection with several state-of-the-art CNNs, and demonstrate very promising results with adaptive depth and width selection.

* ECCV Camera ready version 

  Click for Model/Code and Paper
Like What You Like: Knowledge Distill via Neuron Selectivity Transfer

Dec 18, 2017
Zehao Huang, Naiyan Wang

Despite deep neural networks have demonstrated extraordinary power in various applications, their superior performances are at expense of high storage and computational costs. Consequently, the acceleration and compression of neural networks have attracted much attention recently. Knowledge Transfer (KT), which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the popular solutions. In this paper, we propose a novel knowledge transfer method by treating it as a distribution matching problem. Particularly, we match the distributions of neuron selectivity patterns between teacher and student networks. To achieve this goal, we devise a new KT loss function by minimizing the Maximum Mean Discrepancy (MMD) metric between these distributions. Combined with the original loss function, our method can significantly improve the performance of student networks. We validate the effectiveness of our method across several datasets, and further combine it with other KT methods to explore the best possible results. Last but not least, we fine-tune the model to other tasks such as object detection. The results are also encouraging, which confirm the transferability of the learned features.

  Click for Model/Code and Paper
On The Stability of Video Detection and Tracking

Apr 04, 2017
Hong Zhang, Naiyan Wang

In this paper, we study an important yet less explored aspect in video detection and tracking -- stability. Surprisingly, there is no prior work that tried to study it. As a result, we start our work by proposing a novel evaluation metric for video detection which considers both stability and accuracy. For accuracy, we extend the existing accuracy metric mean Average Precision (mAP). For stability, we decompose it into three terms: fragment error, center position error, scale and ratio error. Each error represents one aspect of stability. Furthermore, we demonstrate that the stability metric has low correlation with accuracy metric. Thus, it indeed captures a different perspective of quality. Lastly, based on this metric, we evaluate several existing methods for video detection and show how they affect accuracy and stability. We believe our work can provide guidance and solid baselines for future researches in the related areas.

  Click for Model/Code and Paper
Multi-shot Pedestrian Re-identification via Sequential Decision Making

May 09, 2018
Jianfu Zhang, Naiyan Wang, Liqing Zhang

Multi-shot pedestrian re-identification problem is at the core of surveillance video analysis. It matches two tracks of pedestrians from different cameras. In contrary to existing works that aggregate single frames features by time series model such as recurrent neural network, in this paper, we propose an interpretable reinforcement learning based approach to this problem. Particularly, we train an agent to verify a pair of images at each time. The agent could choose to output the result (same or different) or request another pair of images to verify (unsure). By this way, our model implicitly learns the difficulty of image pairs, and postpone the decision when the model does not accumulate enough evidence. Moreover, by adjusting the reward for unsure action, we can easily trade off between speed and accuracy. In three open benchmarks, our method are competitive with the state-of-the-art methods while only using 3% to 6% images. These promising results demonstrate that our method is favorable in both efficiency and performance.

  Click for Model/Code and Paper
You Only Search Once: Single Shot Neural Architecture Search via Direct Sparse Optimization

Nov 05, 2018
Xinbang Zhang, Zehao Huang, Naiyan Wang

Recently Neural Architecture Search (NAS) has aroused great interest in both academia and industry, however it remains challenging because of its huge and non-continuous search space. Instead of applying evolutionary algorithm or reinforcement learning as previous works, this paper proposes a Direct Sparse Optimization NAS (DSO-NAS) method. In DSO-NAS, we provide a novel model pruning view to NAS problem. In specific, we start from a completely connected block, and then introduce scaling factors to scale the information flow between operations. Next, we impose sparse regularizations to prune useless connections in the architecture. Lastly, we derive an efficient and theoretically sound optimization method to solve it. Our method enjoys both advantages of differentiability and efficiency, therefore can be directly applied to large datasets like ImageNet. Particularly, On CIFAR-10 dataset, DSO-NAS achieves an average test error 2.84\%, while on the ImageNet dataset DSO-NAS achieves 25.4\% test error under 600M FLOPs with 8 GPUs in 18 hours.

* ICLR2019 Submission 

  Click for Model/Code and Paper
DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer

Dec 18, 2017
Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang

We have witnessed rapid evolution of deep neural network architecture design in the past years. These latest progresses greatly facilitate the developments in various areas such as computer vision and natural language processing. However, along with the extraordinary performance, these state-of-the-art models also bring in expensive computational cost. Directly deploying these models into applications with real-time requirement is still infeasible. Recently, Hinton etal. have shown that the dark knowledge within a powerful teacher model can significantly help the training of a smaller and faster student network. These knowledge are vastly beneficial to improve the generalization ability of the student model. Inspired by their work, we introduce a new type of knowledge -- cross sample similarities for model compression and acceleration. This knowledge can be naturally derived from deep metric learning model. To transfer them, we bring the "learning to rank" technique into deep metric learning formulation. We test our proposed DarkRank method on various metric learning tasks including pedestrian re-identification, image retrieval and image clustering. The results are quite encouraging. Our method can improve over the baseline method by a large margin. Moreover, it is fully compatible with other existing methods. When combined, the performance can be further boosted.

  Click for Model/Code and Paper
Collaborative Deep Learning for Recommender Systems

Jun 18, 2015
Hao Wang, Naiyan Wang, Dit-Yan Yeung

Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendation. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse. To address this problem, we generalize recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose in this paper a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.

  Click for Model/Code and Paper
Cross View Fusion for 3D Human Pose Estimation

Sep 03, 2019
Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, Wenjun Zeng

We present an approach to recover absolute 3D human poses from multi-view images by incorporating multi-view geometric priors in our model. It consists of two separate steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D poses from the multi-view 2D poses. First, we introduce a cross-view fusion scheme into CNN to jointly estimate 2D poses for multiple views. Consequently, the 2D pose estimation for each view already benefits from other views. Second, we present a recursive Pictorial Structure Model to recover the 3D pose from the multi-view 2D poses. It gradually improves the accuracy of 3D pose with affordable computational cost. We test our method on two public datasets H36M and Total Capture. The Mean Per Joint Position Errors on the two datasets are 26mm and 29mm, which outperforms the state-of-the-arts remarkably (26mm vs 52mm, 29mm vs 35mm). Our code is released at \url{}.

* Accepted by ICCV 2019 

  Click for Model/Code and Paper
Sequence Level Semantics Aggregation for Video Object Detection

Aug 20, 2019
Haiping Wu, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang

Video objection detection (VID) has been a rising research direction in recent years. A central issue of VID is the appearance degradation of video frames caused by fast motion. This problem is essentially ill-posed for a single frame. Therefore, aggregating features from other frames becomes a natural choice. Existing methods rely heavily on optical flow or recurrent neural networks for feature aggregation. However, these methods emphasize more on the temporally nearby frames. In this work, we argue that aggregating features in the full-sequence level will lead to more discriminative and robust features for video object detection. To achieve this goal, we devise a novel Sequence Level Semantics Aggregation (SELSA) module. We further demonstrate the close relationship between the proposed method and the classic spectral clustering method, providing a novel view for understanding the VID problem. We test the proposed method on the ImageNet VID and the EPIC KITCHENS dataset and achieve new state-of-the-art results. Our method does not need complicated postprocessing methods such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean.

* ICCV 2019 camera ready 

  Click for Model/Code and Paper
Revisiting Feature Alignment for One-stage Object Detection

Aug 05, 2019
Yuntao Chen, Chenxia Han, Naiyan Wang, Zhaoxiang Zhang

Recently, one-stage object detectors gain much attention due to their simplicity in practice. Its fully convolutional nature greatly reduces the difficulty of training and deployment compared with two-stage detectors which require NMS and sorting for the proposal stage. However, a fundamental issue lies in all one-stage detectors is the misalignment between anchor boxes and convolutional features, which significantly hinders the performance of one-stage detectors. In this work, we first reveal the deep connection between the widely used im2col operator and the RoIAlign operator. Guided by this illuminating observation, we propose a RoIConv operator which aligns the features and its corresponding anchors in one-stage detection in a principled way. We then design a fully convolutional AlignDet architecture which combines the flexibility of learned anchors and the preciseness of aligned features. Specifically, our AlignDet achieves a state-of-the-art mAP of 44.1 on the COCO test-dev with ResNeXt-101 backbone.

  Click for Model/Code and Paper
Scale-Aware Trident Networks for Object Detection

Jan 07, 2019
Yanghao Li, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang

Scale variation is one of the key challenges in object detection. In this work, we first present a controlled experiment to investigate the effect of receptive fields on the detection of different scale objects. Based on the findings from the exploration experiments, we propose a novel Trident Network (TridentNet) aiming to generate scale-specific feature maps with a uniform representational power. We construct a parallel multi-branch architecture in which each branch shares the same transformation parameters but with different receptive fields. Then, we propose a scale-aware training scheme to specialize each branch by sampling object instances of proper scales for training. As a bonus, a fast approximation version of TridentNet could achieve significant improvements without any additional parameters and computational cost. On the COCO dataset, our TridentNet with ResNet-101 backbone achieves state-of-the-art single-model results by obtaining an mAP of 48.4. Code will be made publicly available.

  Click for Model/Code and Paper
Spectral Feature Transformation for Person Re-identification

Nov 28, 2018
Chuanchen Luo, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang

With the surge of deep learning techniques, the field of person re-identification has witnessed rapid progress in recent years. Deep learning based methods focus on learning a feature space where samples are clustered compactly according to their corresponding identities. Most existing methods rely on powerful CNNs to transform the samples individually. In contrast, we propose to consider the sample relations in the transformation. To achieve this goal, we incorporate spectral clustering technique into CNN. We derive a novel module named Spectral Feature Transformation and seamlessly integrate it into existing CNN pipeline with negligible cost,which makes our method enjoy the best of two worlds. Empirical studies show that the proposed approach outperforms previous state-of-the-art methods on four public benchmarks by a considerable margin without bells and whistles.

* Tech Report 

  Click for Model/Code and Paper
Factorized Bilinear Models for Image Recognition

Sep 04, 2017
Yanghao Li, Naiyan Wang, Jiaying Liu, Xiaodi Hou

Although Deep Convolutional Neural Networks (CNNs) have liberated their power in various computer vision tasks, the most important components of CNN, convolutional layers and fully connected layers, are still limited to linear transformations. In this paper, we propose a novel Factorized Bilinear (FB) layer to model the pairwise feature interactions by considering the quadratic terms in the transformations. Compared with existing methods that tried to incorporate complex non-linearity structures into CNNs, the factorized parameterization makes our FB layer only require a linear increase of parameters and affordable computational cost. To further reduce the risk of overfitting of the FB layer, a specific remedy called DropFactor is devised during the training process. We also analyze the connection between FB layer and some existing models, and show FB layer is a generalization to them. Finally, we validate the effectiveness of FB layer on several widely adopted datasets including CIFAR-10, CIFAR-100 and ImageNet, and demonstrate superior results compared with various state-of-the-art deep models.

* Accepted by ICCV 2017 

  Click for Model/Code and Paper
Demystifying Neural Style Transfer

Jul 01, 2017
Yanghao Li, Naiyan Wang, Jiaying Liu, Xiaodi Hou

Neural Style Transfer has recently demonstrated very exciting results which catches eyes in both academia and industry. Despite the amazing results, the principle of neural style transfer, especially why the Gram matrices could represent style remains unclear. In this paper, we propose a novel interpretation of neural style transfer by treating it as a domain adaptation problem. Specifically, we theoretically show that matching the Gram matrices of feature maps is equivalent to minimize the Maximum Mean Discrepancy (MMD) with the second order polynomial kernel. Thus, we argue that the essence of neural style transfer is to match the feature distributions between the style images and the generated images. To further support our standpoint, we experiment with several other distribution alignment methods, and achieve appealing results. We believe this novel interpretation connects these two important research fields, and could enlighten future researches.

* Accepted by IJCAI 2017 

  Click for Model/Code and Paper
Empirical Evaluation of Rectified Activations in Convolutional Network

Nov 27, 2015
Bing Xu, Naiyan Wang, Tianqi Chen, Mu Li

In this paper we investigate the performance of different types of rectified activation functions in convolutional neural network: standard rectified linear unit (ReLU), leaky rectified linear unit (Leaky ReLU), parametric rectified linear unit (PReLU) and a new randomized leaky rectified linear units (RReLU). We evaluate these activation function on standard image classification task. Our experiments suggest that incorporating a non-zero slope for negative part in rectified activation units could consistently improve the results. Thus our findings are negative on the common belief that sparsity is the key of good performance in ReLU. Moreover, on small scale dataset, using deterministic negative slope or learning it are both prone to overfitting. They are not as effective as using their randomized counterpart. By using RReLU, we achieved 75.68\% accuracy on CIFAR-100 test set without multiple test or ensemble.

  Click for Model/Code and Paper
Understanding and Diagnosing Visual Tracking Systems

Apr 23, 2015
Naiyan Wang, Jianping Shi, Dit-Yan Yeung, Jiaya Jia

Several benchmark datasets for visual tracking research have been proposed in recent years. Despite their usefulness, whether they are sufficient for understanding and diagnosing the strengths and weaknesses of different trackers remains questionable. To address this issue, we propose a framework by breaking a tracker down into five constituent parts, namely, motion model, feature extractor, observation model, model updater, and ensemble post-processor. We then conduct ablative experiments on each component to study how it affects the overall result. Surprisingly, our findings are discrepant with some common beliefs in the visual tracking research community. We find that the feature extractor plays the most important role in a tracker. On the other hand, although the observation model is the focus of many studies, we find that it often brings no significant improvement. Moreover, the motion model and model updater contain many details that could affect the result. Also, the ensemble post-processor can improve the result substantially when the constituent trackers have high diversity. Based on our findings, we put together some very elementary building blocks to give a basic tracker which is competitive in performance to the state-of-the-art trackers. We believe our framework can provide a solid baseline when conducting controlled experiments for visual tracking research.

  Click for Model/Code and Paper
Transferring Rich Feature Hierarchies for Robust Visual Tracking

Apr 23, 2015
Naiyan Wang, Siyi Li, Abhinav Gupta, Dit-Yan Yeung

Convolutional neural network (CNN) models have demonstrated great success in various computer vision tasks including image classification and object detection. However, some equally important tasks such as visual tracking remain relatively unexplored. We believe that a major hurdle that hinders the application of CNN to visual tracking is the lack of properly labeled training data. While existing applications that liberate the power of CNN often need an enormous amount of training data in the order of millions, visual tracking applications typically have only one labeled example in the first frame of each video. We address this research issue here by pre-training a CNN offline and then transferring the rich feature hierarchies learned to online tracking. The CNN is also fine-tuned during online tracking to adapt to the appearance of the tracked target specified in the first video frame. To fit the characteristics of object tracking, we first pre-train the CNN to recognize what is an object, and then propose to generate a probability map instead of producing a simple class label. Using two challenging open benchmarks for performance evaluation, our proposed tracker has demonstrated substantial improvement over other state-of-the-art trackers.

  Click for Model/Code and Paper