Models, code, and papers for "Zhaoxiang Zhang":
Visual and audio modalities are two symbiotic modalities underlying videos, which contain both common and complementary information. If they can be mined and fused sufficiently, performances of related video tasks can be significantly enhanced. However, due to the environmental interference or sensor fault, sometimes, only one modality exists while the other is abandoned or missing. By recovering the missing modality from the existing one based on the common information shared between them and the prior information of the specific modality, great bonus will be gained for various vision tasks. In this paper, we propose a Cross-Modal Cycle Generative Adversarial Network (CMCGAN) to handle cross-modal visual-audio mutual generation. Specifically, CMCGAN is composed of four kinds of subnetworks: audio-to-visual, visual-to-audio, audio-to-audio and visual-to-visual subnetworks respectively, which are organized in a cycle architecture. CMCGAN has several remarkable advantages. Firstly, CMCGAN unifies visual-audio mutual generation into a common framework by a joint corresponding adversarial loss. Secondly, through introducing a latent vector with Gaussian distribution, CMCGAN can handle dimension and structure asymmetry over visual and audio modalities effectively. Thirdly, CMCGAN can be trained end-to-end to achieve better convenience. Benefiting from CMCGAN, we develop a dynamic multimodal classification network to handle the modality missing problem. Abundant experiments have been conducted and validate that CMCGAN obtains the state-of-the-art cross-modal visual-audio generation results. Furthermore, it is shown that the generated modality achieves comparable effects with those of original modality, which demonstrates the effectiveness and advantages of our proposed method.
Weakly supervised semantic segmentation based on image-level labels aims for alleviating the data scarcity problem by training with coarse labels. State-of-the-art methods rely on image-level labels to generate proxy segmentation masks, then train the segmentation network on these masks with various constraints. These methods consider each image independently and lack the exploration of cross-image relationships. We argue the cross-image relationship is vital to weakly supervised learning. We propose an end-to-end affinity module for explicitly modeling the relationship among a group of images. By means of this, one image can benefit from the complementary information from other images, and the supervision guidance can be shared in the group. The proposed method improves over the baseline with a large margin. Our method achieves 64.1\% mIOU score on Pascal VOC 2012 validation set, and 64.7\% mIOU score on test set, which is a new state-of-the-art by only using image-level labels, demonstrating the effectiveness of the method.
We have witnessed rapid evolution of deep neural network architecture design in the past years. These latest progresses greatly facilitate the developments in various areas such as computer vision and natural language processing. However, along with the extraordinary performance, these state-of-the-art models also bring in expensive computational cost. Directly deploying these models into applications with real-time requirement is still infeasible. Recently, Hinton etal. have shown that the dark knowledge within a powerful teacher model can significantly help the training of a smaller and faster student network. These knowledge are vastly beneficial to improve the generalization ability of the student model. Inspired by their work, we introduce a new type of knowledge -- cross sample similarities for model compression and acceleration. This knowledge can be naturally derived from deep metric learning model. To transfer them, we bring the "learning to rank" technique into deep metric learning formulation. We test our proposed DarkRank method on various metric learning tasks including pedestrian re-identification, image retrieval and image clustering. The results are quite encouraging. Our method can improve over the baseline method by a large margin. Moreover, it is fully compatible with other existing methods. When combined, the performance can be further boosted.
Video caption refers to generating a descriptive sentence for a specific short video clip automatically, which has achieved remarkable success recently. However, most of the existing methods focus more on visual information while ignoring the synchronized audio cues. We propose three multimodal deep fusion strategies to maximize the benefits of visual-audio resonance information. The first one explores the impact on cross-modalities feature fusion from low to high order. The second establishes the visual-audio short-term dependency by sharing weights of corresponding front-end networks. The third extends the temporal dependency to long-term through sharing multimodal memory across visual and audio modalities. Extensive experiments have validated the effectiveness of our three cross-modalities fusion strategies on two benchmark datasets, including Microsoft Research Video to Text (MSRVTT) and Microsoft Video Description (MSVD). It is worth mentioning that sharing weight can coordinate visual-audio feature fusion effectively and achieve the state-of-art performance on both BELU and METEOR metrics. Furthermore, we first propose a dynamic multimodal feature fusion framework to deal with the part modalities missing case. Experimental results demonstrate that even in the audio absence mode, we can still obtain comparable results with the aid of the additional audio modality inference module.
Pedestrian attribute recognition has been an emerging research topic in the area of video surveillance. To predict the existence of a particular attribute, it is demanded to localize the regions related to the attribute. However, in this task, the region annotations are not available. How to carve out these attribute-related regions remains challenging. Existing methods applied attribute-agnostic visual attention or heuristic body-part localization mechanisms to enhance the local feature representations, while neglecting to employ attributes to define local feature areas. We propose a flexible Attribute Localization Module (ALM) to adaptively discover the most discriminative regions and learns the regional features for each attribute at multiple levels. Moreover, a feature pyramid architecture is also introduced to enhance the attribute-specific localization at low-levels with high-level semantic guidance. The proposed framework does not require additional region annotations and can be trained end-to-end with multi-level deep supervision. Extensive experiments show that the proposed method achieves state-of-the-art results on three pedestrian attribute datasets, including PETA, RAP, and PA-100K.
Video objection detection (VID) has been a rising research direction in recent years. A central issue of VID is the appearance degradation of video frames caused by fast motion. This problem is essentially ill-posed for a single frame. Therefore, aggregating features from other frames becomes a natural choice. Existing methods rely heavily on optical flow or recurrent neural networks for feature aggregation. However, these methods emphasize more on the temporally nearby frames. In this work, we argue that aggregating features in the full-sequence level will lead to more discriminative and robust features for video object detection. To achieve this goal, we devise a novel Sequence Level Semantics Aggregation (SELSA) module. We further demonstrate the close relationship between the proposed method and the classic spectral clustering method, providing a novel view for understanding the VID problem. We test the proposed method on the ImageNet VID and the EPIC KITCHENS dataset and achieve new state-of-the-art results. Our method does not need complicated postprocessing methods such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean.
Recently, one-stage object detectors gain much attention due to their simplicity in practice. Its fully convolutional nature greatly reduces the difficulty of training and deployment compared with two-stage detectors which require NMS and sorting for the proposal stage. However, a fundamental issue lies in all one-stage detectors is the misalignment between anchor boxes and convolutional features, which significantly hinders the performance of one-stage detectors. In this work, we first reveal the deep connection between the widely used im2col operator and the RoIAlign operator. Guided by this illuminating observation, we propose a RoIConv operator which aligns the features and its corresponding anchors in one-stage detection in a principled way. We then design a fully convolutional AlignDet architecture which combines the flexibility of learned anchors and the preciseness of aligned features. Specifically, our AlignDet achieves a state-of-the-art mAP of 44.1 on the COCO test-dev with ResNeXt-101 backbone.
Scale variation is one of the key challenges in object detection. In this work, we first present a controlled experiment to investigate the effect of receptive fields on the detection of different scale objects. Based on the findings from the exploration experiments, we propose a novel Trident Network (TridentNet) aiming to generate scale-specific feature maps with a uniform representational power. We construct a parallel multi-branch architecture in which each branch shares the same transformation parameters but with different receptive fields. Then, we propose a scale-aware training scheme to specialize each branch by sampling object instances of proper scales for training. As a bonus, a fast approximation version of TridentNet could achieve significant improvements without any additional parameters and computational cost. On the COCO dataset, our TridentNet with ResNet-101 backbone achieves state-of-the-art single-model results by obtaining an mAP of 48.4. Code will be made publicly available.
With the surge of deep learning techniques, the field of person re-identification has witnessed rapid progress in recent years. Deep learning based methods focus on learning a feature space where samples are clustered compactly according to their corresponding identities. Most existing methods rely on powerful CNNs to transform the samples individually. In contrast, we propose to consider the sample relations in the transformation. To achieve this goal, we incorporate spectral clustering technique into CNN. We derive a novel module named Spectral Feature Transformation and seamlessly integrate it into existing CNN pipeline with negligible cost,which makes our method enjoy the best of two worlds. Empirical studies show that the proposed approach outperforms previous state-of-the-art methods on four public benchmarks by a considerable margin without bells and whistles.
Most of convolutional neural networks share the same characteristic: each convolutional layer is followed by a nonlinear activation layer where Rectified Linear Unit (ReLU) is the most widely used. In this paper, we argue that the designed structure with the equal ratio between these two layers may not be the best choice since it could result in the poor generalization ability. Thus, we try to investigate a more suitable method on using ReLU to explore the better network architectures. Specifically, we propose a proportional module to keep the ratio between convolution and ReLU amount to be N:M (N>M). The proportional module can be applied in almost all networks with no extra computational cost to improve the performance. Comprehensive experimental results indicate that the proposed method achieves better performance on different benchmarks with different network architectures, thus verify the superiority of our work.
Recently, Neural Architecture Search has achieved great success in large-scale image classification. In contrast, there have been limited works focusing on architecture search for object detection, mainly because the costly ImageNet pre-training is always required for detectors. Training from scratch, as a substitute, demands more epochs to converge and brings no computation saving. To overcome this obstacle, we introduce a practical neural architecture transformation search(NATS)algorithm for object detection in this paper. Instead of searching and constructing an entire network, NATS explores the architecture space on the base of existing network and reusing its weights. We propose a novel neural architecture search strategy in channel-level instead of path-level and devise a search space specially targeting at object detection. With the combination of these two designs, an architecture transformation scheme could be discovered to adapt a network designed for image classification to task of object detection. Since our method is gradient-based and only searches for a transformation scheme, the weights of models pretrained inImageNet could be utilized in both searching and retraining stage, which makes the whole process very efficient. The transformed network requires no extra parameters and FLOPs, and is friendly to hardware optimization, which is practical to use in real-time application. In experiments, we demonstrate the effectiveness of NATSon networks like ResNet and ResNeXt. Our transformed networks, combined with various detection frameworks, achieve significant improvements on the COCO dataset while keeping fast.
Scale-sensitive object detection remains a challenging task, where most of the existing methods could not learn it explicitly and are not robust to scale variance. In addition, the most existing methods are less efficient during training or slow during inference, which are not friendly to real-time applications. In this paper, we propose a practical object detection method with scale-sensitive network.Our method first predicts a global continuous scale ,which is shared by all position, for each convolution filter of each network stage. To effectively learn the scale, we average the spatial features and distill the scale from channels. For fast-deployment, we propose a scale decomposition method that transfers the robust fractional scale into combination of fixed integral scales for each convolution filter, which exploits the dilated convolution. We demonstrate it on one-stage and two-stage algorithms under different configurations. For practical applications, training of our method is of efficiency and simplicity which gets rid of complex data sampling or optimize strategy. During test-ing, the proposed method requires no extra operation and is very supportive of hardware acceleration like TensorRT and TVM. On the COCO test-dev, our model could achieve a 41.5 mAP on one-stage detector and 42.1 mAP on two-stage detectors based on ResNet-101, outperforming base-lines by 2.4 and 2.1 respectively without extra FLOPS.
This paper presents an efficient module named spatial bottleneck for accelerating the convolutional layers in deep neural networks. The core idea is to decompose convolution into two stages, which first reduce the spatial resolution of the feature map, and then restore it to the desired size. This operation decreases the sampling density in the spatial domain, which is independent yet complementary to network acceleration approaches in the channel domain. Using different sampling rates, we can tradeoff between recognition accuracy and model complexity. As a basic building block, spatial bottleneck can be used to replace any single convolutional layer, or the combination of two convolutional layers. We empirically verify the effectiveness of spatial bottleneck by applying it to the deep residual networks. Spatial bottleneck achieves 2x and 1.4x speedup on the regular and channel-bottlenecked residual blocks, respectively, with the accuracies retained in recognizing low-resolution images, and even improved in recognizing high-resolution images.
With the fast development of effective and low-cost human skeleton capture systems, skeleton-based action recognition has attracted much attention recently. Most existing methods use Convolutional Neural Network(CNN) and Recurrent Neural Network(RNN) to extract spatio-temporal information embedded in the skeleton sequences for action recognition. However, these approaches are limited in the ability of relational modeling in a single skeleton, due to the loss of important structural information when converting the raw skeleton data to adapt to the CNN or RNN input. In this paper, we propose an Attentional Recurrent Relational Network-LSTM(ARRN-LSTM) to simultaneously model spatial configurations and temporal dynamics in skeletons for action recognition. The spatial patterns embedded in a single skeleton are learned by a Recurrent Relational Network, followed by a multi-layer LSTM to extract temporal features in the skeleton sequences. To exploit the complementarity between different geometries in the skeleton for sufficient relational modeling, we design a two-stream architecture to learn the relationship among joints and explore the underlying patterns among lines simultaneously. We also introduce an adaptive attentional module for focusing on potential discriminative parts of the skeleton towards a certain action. Extensive experiments are performed on several popular action recognition datasets and the results show that the proposed approach achieves competitive results with the state-of-the-art methods.
Projective analysis is an important solution for 3D shape retrieval, since human visual perceptions of 3D shapes rely on various 2D observations from different view points. Although multiple informative and discriminative views are utilized, most projection-based retrieval systems suffer from heavy computational cost, thus cannot satisfy the basic requirement of scalability for search engines. In this paper, we present a real-time 3D shape search engine based on the projective images of 3D shapes. The real-time property of our search engine results from the following aspects: (1) efficient projection and view feature extraction using GPU acceleration; (2) the first inverted file, referred as F-IF, is utilized to speed up the procedure of multi-view matching; (3) the second inverted file (S-IF), which captures a local distribution of 3D shapes in the feature manifold, is adopted for efficient context-based re-ranking. As a result, for each query the retrieval task can be finished within one second despite the necessary cost of IO overhead. We name the proposed 3D shape search engine, which combines GPU acceleration and Inverted File Twice, as GIFT. Besides its high efficiency, GIFT also outperforms the state-of-the-art methods significantly in retrieval accuracy on various shape benchmarks and competitions.
Sufficient training data normally is required to train deeply learned models. However, due to the expensive manual process for labelling large number of images, the amount of available training data is always limited. To produce more data for training a deep network, Generative Adversarial Network (GAN) can be used to generate artificial sample data. However, the generated data usually does not have annotation labels. To solve this problem, in this paper, we propose a virtual label called Multi-pseudo Regularized Label (MpRL) and assign it to the generated data. With MpRL, the generated data will be used as the supplementary of real training data to train a deep neural network in a semi-supervised learning fashion. To build the corresponding relationship between the real data and generated data, MpRL assigns each generated data a proper virtual label which reflects the likelihood of the affiliation of the generated data to pre-defined training classes in the real data domain. Unlike the traditional label which usually is a single integral number, the virtual label proposed in this work is a set of weight-based values each individual of which is a number in (0,1] called multi-pseudo label and reflects the degree of relation between each generated data to every pre-defined class of real data. A comprehensive evaluation is carried out by adopting two state-of-the-art convolutional neural networks (CNNs) in our experiments to verify the effectiveness of MpRL. Experiments demonstrate that by assigning MpRL to generated data, we can further improve the person re-ID performance on five re-ID datasets, i.e., Market-1501, DukeMTMC-reID, CUHK03, VIPeR, and CUHK01. The proposed method obtains +6.29%, +6.30%, +5.58%, +5.84%, and +3.48% improvements in rank-1 accuracy over a strong CNN baseline on the five datasets respectively, and outperforms state-of-the-art methods.
Person re-identification (re-ID) is a highly challenging task due to large variations of pose, viewpoint, illumination, and occlusion. Deep metric learning provides a satisfactory solution to person re-ID by training a deep network under supervision of metric loss, e.g., triplet loss. However, the performance of deep metric learning is greatly limited by traditional sampling methods. To solve this problem, we propose a Hard-Aware Point-to-Set (HAP2S) loss with a soft hard-mining scheme. Based on the point-to-set triplet loss framework, the HAP2S loss adaptively assigns greater weights to harder samples. Several advantageous properties are observed when compared with other state-of-the-art loss functions: 1) Accuracy: HAP2S loss consistently achieves higher re-ID accuracies than other alternatives on three large-scale benchmark datasets; 2) Robustness: HAP2S loss is more robust to outliers than other losses; 3) Flexibility: HAP2S loss does not rely on a specific weight function, i.e., different instantiations of HAP2S loss are equally effective. 4) Generality: In addition to person re-ID, we apply the proposed method to generic deep metric learning benchmarks including CUB-200-2011 and Cars196, and also achieve state-of-the-art results.
Regressing the illumination of a scene from the representations of object appearances is popularly adopted in computational color constancy. However, it's still challenging due to intrinsic appearance and label ambiguities caused by unknown illuminants, diverse reflection property of materials and extrinsic imaging factors (such as different camera sensors). In this paper, we introduce a novel algorithm by Cascading Convolutional Color Constancy (in short, C4) to improve robustness of regression learning and achieve stable generalization capability across datasets (different cameras and scenes) in a unique framework. The proposed C4 method ensembles a series of dependent illumination hypotheses from each cascade stage via introducing a weighted multiply-accumulate loss function, which can inherently capture different modes of illuminations and explicitly enforce coarse-to-fine network optimization. Experimental results on the public Color Checker and NUS 8-Camera benchmarks demonstrate superior performance of the proposed algorithm in comparison with the state-of-the-art methods, especially for more difficult scenes.
Object detection and instance recognition play a central role in many AI applications like autonomous driving, video surveillance and medical image analysis. However, training object detection models on large scale datasets remains computationally expensive and time consuming. This paper presents an efficient and open source object detection framework called SimpleDet which enables the training of state-of-the-art detection models on consumer grade hardware at large scale. SimpleDet supports up-to-date detection models with best practice. SimpleDet also supports distributed training with near linear scaling out of box. Codes, examples and documents of SimpleDet can be found at https://github.com/tusimple/simpledet .