Models, code, and papers for "Jingdong Wang":

Collaborative Quantization for Cross-Modal Similarity Search

Feb 02, 2019
Ting Zhang, Jingdong Wang

Cross-modal similarity search is the problem of designing a search system that supports querying across content modalities, e.g., using an image to search for texts or using a text to search for images. This paper presents a compact coding solution for efficient search, with a focus on the quantization approach, which has already shown superior performance over hashing solutions in single-modal similarity search. We propose a cross-modal quantization approach, which is among the early attempts to introduce quantization into cross-modal search. The major contribution lies in jointly learning the quantizers for both modalities by aligning the quantized representations for each pair of image and text belonging to a document. In addition, our approach simultaneously learns the common space for both modalities in which quantization is conducted, enabling efficient and effective search using the Euclidean distance computed in the common space with fast distance table lookup. Experimental results compared with several competitive algorithms over three benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance.

* CVPR 2016 

OCNet: Object Context Network for Scene Parsing

Sep 04, 2018
Yuhui Yuan, Jingdong Wang

Context is essential for various computer vision tasks. State-of-the-art scene parsing methods have exploited the effectiveness of context defined at the image level. Such context carries a mixture of objects belonging to different categories. Since the label of each pixel $\mathit{P}$ is defined as the category of the object it belongs to, we propose the pixel-wise Object Context, which consists of the objects belonging to the same category as pixel $\mathit{P}$. The representation of pixel $\mathit{P}$'s object context is the aggregation of the features of all pixels sharing the same category as $\mathit{P}$. Since the ground-truth object that pixel $\mathit{P}$ belongs to is unavailable, we employ the self-attention method to approximate the objects by learning a pixel-wise similarity map. We further propose the Pyramid Object Context and the Atrous Spatial Pyramid Object Context to capture context at multiple scales. Based on the object context, we introduce OCNet and show that it achieves state-of-the-art performance on both the Cityscapes and ADE20K benchmarks. The code of OCNet will be made available at

Composite Quantization

Dec 04, 2017
Jingdong Wang, Ting Zhang

This paper studies the compact coding approach to approximate nearest neighbor search. We introduce a composite quantization framework. It uses the composition of several ($M$) elements, each of which is selected from a different dictionary, to accurately approximate a $D$-dimensional vector, thus yielding accurate search, and represents the data vector by a short code composed of the indices of the selected elements in the corresponding dictionaries. Our key contribution lies in introducing a near-orthogonality constraint, which guarantees search efficiency: the cost of the distance computation is reduced from $O(D)$ to $O(M)$ through a distance table lookup scheme. The resulting approach is called near-orthogonal composite quantization. We theoretically justify the equivalence between near-orthogonal composite quantization and minimizing an upper bound of a function formed by jointly considering the quantization error and the search cost according to a generalized triangle inequality. We empirically show the efficacy of the proposed approach over several benchmark datasets. In addition, we demonstrate superior performance in three other applications: combination with inverted multi-index, quantizing the query for mobile search, and inner-product similarity search.
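
The distance table lookup mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the dictionaries are random, and all function and variable names are made up for illustration. It relies on the identity $\|q - \sum_m c_m\|^2 = \sum_m \|q - c_m\|^2 - (M-1)\|q\|^2 + \sum_{i \neq j} c_i^\top c_j$. The second term is the same for every database vector, so if the cross term is also (nearly) constant across codes, ranking needs only the $M$ table lookups.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_tables(query, dictionaries):
    # tables[m][k] = ||q - c_{m,k}||^2, computed once per query.
    return [[sq_dist(query, elem) for elem in d] for d in dictionaries]

def lookup_sum(tables, code):
    # O(M) work per database vector: one table lookup per dictionary.
    return sum(tables[m][k] for m, k in enumerate(code))

def reconstruct(dictionaries, code):
    # The approximation is the sum of the selected dictionary elements.
    dim = len(dictionaries[0][0])
    out = [0.0] * dim
    for m, k in enumerate(code):
        for i in range(dim):
            out[i] += dictionaries[m][k][i]
    return out

def cross_term(dictionaries, code):
    # Sum of inner products between distinct selected elements.
    elems = [dictionaries[m][k] for m, k in enumerate(code)]
    n = len(elems)
    return sum(dot(elems[i], elems[j])
               for i in range(n) for j in range(n) if i != j)

# Hypothetical toy setup: D = 8 dimensions, M = 3 dictionaries, K = 4 elements each.
rng = random.Random(0)
D, M, K = 8, 3, 4
dictionaries = [[[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
                for _ in range(M)]
query = [rng.gauss(0, 1) for _ in range(D)]
code = (2, 0, 3)  # short code: one index per dictionary

tables = build_tables(query, dictionaries)
exact = sq_dist(query, reconstruct(dictionaries, code))
via_tables = (lookup_sum(tables, code)
              - (M - 1) * dot(query, query)
              + cross_term(dictionaries, code))
```

In this toy setup the cross term is computed explicitly to check the identity; the sketch does not learn dictionaries that make it constant.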

Inner Product Similarity Search using Compositional Codes

Jun 20, 2014
Chao Du, Jingdong Wang

This paper addresses the nearest neighbor search problem under inner product similarity and introduces a compact code-based approach. The idea is to approximate a vector using the composition of several elements selected from a source dictionary and to represent this vector by a short code composed of the indices of the selected elements. The inner product between a query vector and a database vector is efficiently estimated from the query vector and the short code of the database vector. We show the superior performance of the proposed group $M$-selection algorithm that selects $M$ elements from $M$ source dictionaries for vector approximation in terms of search accuracy and efficiency for compact codes of the same length via theoretical and empirical analysis. Experimental results on large-scale datasets ($1M$ and $1B$ SIFT features, $1M$ linear models and Netflix) demonstrate the superiority of the proposed approach.
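
As a minimal sketch of how the inner product can be estimated from a short code: if a database vector is approximated by a sum of dictionary elements, linearity gives $q^\top \sum_m c_m = \sum_m q^\top c_m$, so per-query tables of $q^\top c_{m,k}$ suffice. The dictionaries, the code, and the function names below are hypothetical toy values, not taken from the paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def inner_product_tables(query, dictionaries):
    # tables[m][k] = <q, c_{m,k}>, computed once per query.
    return [[dot(query, elem) for elem in d] for d in dictionaries]

def estimate_inner_product(tables, code):
    # One lookup per dictionary; the database vector itself is never touched.
    return sum(tables[m][k] for m, k in enumerate(code))

# Hypothetical toy data: M = 2 dictionaries with 2 elements each, in 2 dimensions.
dictionaries = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [-0.5, 0.5]],
]
query = [2.0, 4.0]
code = (1, 0)  # approximated vector = [0, 1] + [0.5, 0.5] = [0.5, 1.5]

tables = inner_product_tables(query, dictionaries)
estimate = estimate_inner_product(tables, code)
direct = dot(query, [0.5, 1.5])
```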

* The approach presented in this paper (ECCV14 submission) is closely related to multi-stage vector quantization and residual quantization. Thanks to the reviewers (CVPR14 and ECCV14) for pointing out the relationship to the two algorithms. Related paper:, which also adopts the summation of vectors for vector approximation 

Object-Contextual Representations for Semantic Segmentation

Sep 24, 2019
Yuhui Yuan, Xilin Chen, Jingdong Wang

In this paper, we address the problem of semantic segmentation and focus on the context aggregation strategy for robust segmentation. Our motivation is that the label of a pixel is the category of the object that the pixel belongs to. We present a simple yet effective approach, object-contextual representations, characterizing a pixel by exploiting the representation of the corresponding object class. First, we construct object regions based on a feature map supervised by the ground-truth segmentation, and then compute the object region representations. Second, we compute the representation similarity between each pixel and each object region, and augment the representation of each pixel with an object contextual representation, which is a weighted aggregation of all the object region representations according to their similarities with the pixel. We empirically demonstrate that the proposed approach achieves competitive performance on six challenging semantic segmentation benchmarks: Cityscapes, ADE20K, LIP, PASCAL VOC 2012, PASCAL-Context and COCO-Stuff. Notably, we achieved 2nd place on the Cityscapes leaderboard with a single model.
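
The second step described above (pixel-to-region similarity followed by weighted aggregation) can be sketched as follows. This is a simplified illustration, not the released model: it assumes the object region representations are already given, uses a plain dot product for similarity, normalizes with a softmax, and augments by concatenation. All of these choices and names are assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def object_contextual_representation(pixel, regions):
    # Similarity between the pixel and each object region representation.
    weights = softmax([dot(pixel, r) for r in regions])
    dim = len(regions[0])
    # Weighted aggregation of all region representations.
    context = [sum(w * r[i] for w, r in zip(weights, regions))
               for i in range(dim)]
    # Augment the pixel representation (here: simple concatenation).
    return pixel + context, weights

# Hypothetical toy input: one 2-D pixel feature and two region representations.
pixel = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0]]
augmented, weights = object_contextual_representation(pixel, regions)
```

The pixel feature is more similar to the first region, so that region receives the larger weight in the aggregated context.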

* Project Page: 

Orthogonal and Idempotent Transformations for Learning Deep Neural Networks

Jul 19, 2017
Jingdong Wang, Yajie Xing, Kexin Zhang, Cha Zhang

Identity transformations, used as skip-connections in residual networks, directly connect convolutional layers close to the input and those close to the output in deep neural networks, improving information flow and thus easing the training. In this paper, we introduce two alternative linear transforms, orthogonal transformation and idempotent transformation. By the definition and properties of orthogonal and idempotent matrices, the product of multiple orthogonal matrices (or of multiple copies of the same idempotent matrix), used to form linear transformations, is equal to a single orthogonal (idempotent) matrix, so that information flow is improved and training is eased. One interesting point is that the success essentially stems from feature reuse and gradient reuse in forward and backward propagation, which maintain the information during flow and eliminate the gradient vanishing problem thanks to the express way through skip-connections. We empirically demonstrate the effectiveness of the two proposed transformations: similar performance in single-branch networks and even superior performance in multi-branch networks in comparison to identity transformations.
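
The matrix property this argument relies on can be checked numerically. The sketch below uses 2x2 rotation matrices as example orthogonal matrices and a projection onto the direction (1, 1) as an example idempotent matrix; these particular matrices and the helper names are chosen only for illustration.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def close(a, b, tol=1e-9):
    return all(abs(x - y) <= tol
               for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def rotation(theta):
    # A 2x2 rotation matrix is orthogonal: R^T R = I.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

# Product of two orthogonal matrices is again orthogonal.
q = matmul(rotation(0.3), rotation(1.1))
q_is_orthogonal = close(matmul(transpose(q), q), IDENTITY)

# Repeated products of the same idempotent matrix collapse to that matrix.
p = [[0.5, 0.5], [0.5, 0.5]]  # projection onto the direction (1, 1)
p_cubed = matmul(matmul(p, p), p)
p_is_idempotent = close(p_cubed, p)
```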

Cross View Fusion for 3D Human Pose Estimation

Sep 03, 2019
Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, Wenjun Zeng

We present an approach to recover absolute 3D human poses from multi-view images by incorporating multi-view geometric priors in our model. It consists of two separate steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D poses from the multi-view 2D poses. First, we introduce a cross-view fusion scheme into CNN to jointly estimate 2D poses for multiple views. Consequently, the 2D pose estimation for each view already benefits from other views. Second, we present a recursive Pictorial Structure Model to recover the 3D pose from the multi-view 2D poses. It gradually improves the accuracy of 3D pose with affordable computational cost. We test our method on two public datasets, H36M and Total Capture. The Mean Per Joint Position Errors on the two datasets are 26mm and 29mm, which substantially outperform the state of the art (26mm vs 52mm, 29mm vs 35mm). Our code is released at \url{}.

* Accepted by ICCV 2019 

Beyond Intra-modality Discrepancy: A Comprehensive Survey of Heterogeneous Person Re-identification

May 24, 2019
Zheng Wang, Zhixiang Wang, Yang Wu, Jingdong Wang, Shin'ichi Satoh

An effective and efficient person re-identification (ReID) algorithm will alleviate painful video watching, and accelerate the investigation progress. Recently, with the explosive requirements of practical applications, a lot of research efforts have been dedicated to heterogeneous person re-identification (He-ReID). In this paper, we review the state-of-the-art methods comprehensively with respect to four main application scenarios -- low-resolution, infrared, sketch and text. We begin with a comparison between He-ReID and the general Homogeneous ReID (Ho-ReID) task. Then, we survey the models that have been widely employed in He-ReID. Available existing datasets for performing evaluation are briefly described. We then summarize and compare the representative approaches. Finally, we discuss some future research directions.

Deep High-Resolution Representation Learning for Human Pose Estimation

Feb 25, 2019
Ke Sun, Bin Xiao, Dong Liu, Jingdong Wang

This is an official PyTorch implementation of Deep High-Resolution Representation Learning for Human Pose Estimation. In this work, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. The code and models are publicly available at \url{}.

* Accepted by CVPR 2019 

IGCV3: Interleaved Low-Rank Group Convolutions for Efficient Deep Neural Networks

Jul 20, 2018
Ke Sun, Mingjie Li, Dong Liu, Jingdong Wang

In this paper, we are interested in building lightweight and efficient convolutional neural networks. Inspired by the success of two design patterns, composition of structured sparse kernels, e.g., interleaved group convolutions (IGC), and composition of low-rank kernels, e.g., bottleneck modules, we study the combination of these two design patterns, using the composition of structured sparse low-rank kernels, to form a convolutional kernel. Rather than introducing a complementary condition over channels, we introduce a loose complementary condition, which is formulated by imposing the complementary condition over super-channels, to guide the design for generating a dense convolutional kernel. The resulting network is called IGCV3. We empirically demonstrate that the combination of low-rank and sparse kernels boosts the performance, and show the superiority of our proposed approach over the state-of-the-art IGCV2 and MobileNetV2 on image classification on CIFAR and ImageNet and object detection on COCO.

* 10 pages, 2 figures, accepted by BMVC 2018 

Deeply-Learned Part-Aligned Representations for Person Re-Identification

Jul 23, 2017
Liming Zhao, Xi Li, Jingdong Wang, Yueting Zhuang

In this paper, we address the problem of person re-identification, which refers to associating the persons captured from different cameras. We propose a simple yet effective human part-aligned representation for handling the body part misalignment problem. Our approach decomposes the human body into regions (parts) which are discriminative for person matching, accordingly computes the representations over the regions, and aggregates the similarities computed between the corresponding regions of a pair of probe and gallery images as the overall matching score. Our formulation, inspired by attention models, is a deep neural network modeling the three steps together, which is learnt through minimizing the triplet loss function without requiring body part labeling information. Unlike most existing deep learning algorithms that learn a global or spatial partition-based local representation, our approach performs human body partition, and thus is more robust to pose changes and various human spatial distributions in the person bounding box. Our approach shows state-of-the-art results over standard datasets, Market-1501, CUHK03, CUHK01 and VIPeR.

* Accepted by ICCV 2017 

Deeply-Fused Nets

May 25, 2016
Jingdong Wang, Zhen Wei, Ting Zhang, Wenjun Zeng

In this paper, we present a novel deep learning approach, deeply-fused nets. The central idea of our approach is deep fusion, i.e., combining the intermediate representations of base networks, where the fused output serves as the input of the remaining part of each base network, and performing such combinations deeply over several intermediate representations. The resulting deeply fused net enjoys several benefits. First, it is able to learn multi-scale representations as it enjoys the benefits of more base networks, which could form the same fused network, other than the initial group of base networks. Second, in our suggested fused net formed by one deep and one shallow base network, the flows of the information from the earlier intermediate layer of the deep base network to the output and from the input to the later intermediate layer of the deep base network are both improved. Last, the deep and shallow base networks are jointly learnt and can benefit from each other. More interestingly, the essential depth of a fused net composed from a deep base network and a shallow base network is reduced because the fused net could be composed from a less deep base network, and thus training the fused net is less difficult than training the initial deep base network. Empirical results demonstrate that our approach achieves superior performance over two closely related methods, ResNet and Highway, and competitive performance compared to the state of the art.

Deep Regression for Face Alignment

Sep 18, 2014
Baoguang Shi, Xiang Bai, Wenyu Liu, Jingdong Wang

In this paper, we present a deep regression approach for face alignment. The deep architecture consists of a global layer and multi-stage local layers. We apply the back-propagation algorithm with the dropout strategy to jointly optimize the regression parameters. We show that the resulting deep regressor gradually and evenly approaches the true facial landmarks stage by stage, avoiding the tendency to yield overly strong early-stage regressors and overly weak later-stage regressors. Experimental results show that our approach achieves the state-of-the-art

Low-rank SIFT: An Affine Invariant Feature for Place Recognition

Aug 07, 2014
Chao Yang, Shengnan Caih, Jingdong Wang, Long Quan

In this paper, we present a novel affine-invariant feature based on SIFT, leveraging the regular appearance of man-made objects. The feature achieves full affine invariance without needing to simulate over the affine parameter space. Low-rank SIFT, as we name the feature, is based on our observation that local tilts, which are caused by changes of camera axis orientation, can be normalized by converting local patches to standard low-rank forms. Rotation, translation and scaling invariance can be achieved in ways similar to SIFT. As an extension of SIFT, our method adds a prior to solve the ill-posed affine parameter estimation problem and normalizes the parameters directly, and is applicable to objects with regular structures. Furthermore, owing to recent breakthroughs in convex optimization, such parameters can be computed efficiently. We demonstrate its effectiveness in place recognition as our major application. As extra contributions, we also describe our pipeline for constructing a geotagged building database from the ground up, as well as an efficient scheme for automatic feature selection.

Deep Triplet Quantization

Feb 01, 2019
Bin Liu, Yue Cao, Mingsheng Long, Jianmin Wang, Jingdong Wang

Deep hashing establishes efficient and effective image retrieval by end-to-end learning of deep representations and hash codes from similarity data. We present a compact coding solution, focusing on the deep learning-to-quantization approach, which has shown superior performance over hashing solutions for similarity retrieval. We propose Deep Triplet Quantization (DTQ), a novel approach to learning deep quantization models from similarity triplets. To enable more effective triplet training, we design a new triplet selection approach, Group Hard, that randomly selects hard triplets in each image group. To generate compact binary codes, we further apply a triplet quantization with weak orthogonality during triplet training. The quantization loss reduces the codebook redundancy and enhances the quantizability of deep representations through back-propagation. Extensive experiments demonstrate that DTQ can generate high-quality and compact binary codes, which yield state-of-the-art image retrieval performance on three benchmark datasets, NUS-WIDE, CIFAR-10, and MS-COCO.

* Accepted by ACM Multimedia 2018 as oral paper 

Rethink ReLU to Training Better CNNs

Aug 31, 2018
Gangming Zhao, Zhaoxiang Zhang, He Guan, Peng Tang, Jingdong Wang

Most convolutional neural networks share the same characteristic: each convolutional layer is followed by a nonlinear activation layer, of which the Rectified Linear Unit (ReLU) is the most widely used. In this paper, we argue that a structure with an equal ratio between these two layers may not be the best choice, since it could result in poor generalization ability. Thus, we investigate a more suitable way of using ReLU to explore better network architectures. Specifically, we propose a proportional module that keeps the ratio between the numbers of convolution and ReLU layers at N:M (N>M). The proportional module can be applied in almost all networks with no extra computational cost to improve the performance. Comprehensive experimental results indicate that the proposed method achieves better performance on different benchmarks with different network architectures, thus verifying the superiority of our work.

* 8 pages, 10 figures, conference 

Interleaved Group Convolutions for Deep Neural Networks

Jul 18, 2017
Ting Zhang, Guo-Jun Qi, Bin Xiao, Jingdong Wang

In this paper, we present a simple and modularized neural network architecture, named interleaved group convolutional neural networks (IGCNets). The main point lies in a novel building block, a pair of two successive interleaved group convolutions: primary group convolution and secondary group convolution. The two group convolutions are complementary: (i) the convolution on each partition in primary group convolution is a spatial convolution, while on each partition in secondary group convolution, the convolution is a point-wise convolution; (ii) the channels in the same secondary partition come from different primary partitions. We discuss one representative advantage: wider than a regular convolution with the number of parameters and the computation complexity preserved. We also show that regular convolutions, group convolution with summation fusion, and the Xception block are special cases of interleaved group convolutions. Empirical results over standard benchmarks, CIFAR-10, CIFAR-100, SVHN and ImageNet demonstrate that our networks are more efficient in using parameters and computation complexity with similar or higher accuracy.
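
Property (ii) above, that channels in the same secondary partition come from different primary partitions, can be illustrated with a small channel-indexing sketch. The layout assumed here (L primary partitions of M consecutive channels, and M secondary partitions of L channels each) and the function names are assumptions for illustration; no convolution is performed.

```python
def primary_partitions(num_primary, channels_per_primary):
    # Primary partition l holds consecutive channels l*M .. (l+1)*M - 1.
    m = channels_per_primary
    return [list(range(l * m, (l + 1) * m)) for l in range(num_primary)]

def secondary_partitions(num_primary, channels_per_primary):
    # Secondary partition j takes the j-th channel of every primary partition,
    # so its channels all come from different primary partitions.
    m = channels_per_primary
    return [[l * m + j for l in range(num_primary)] for j in range(m)]

# Hypothetical sizes: L = 2 primary partitions with M = 3 channels each.
L, M = 2, 3
primary = primary_partitions(L, M)      # [[0, 1, 2], [3, 4, 5]]
secondary = secondary_partitions(L, M)  # [[0, 3], [1, 4], [2, 5]]
```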

* To appear in ICCV 2017 

DisturbLabel: Regularizing CNN on the Loss Layer

Apr 30, 2016
Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, Qi Tian

For a long time, we have been combating over-fitting in the CNN training process with model regularization, including weight decay, model averaging, data augmentation, etc. In this paper, we present DisturbLabel, an extremely simple algorithm which randomly replaces a part of the labels with incorrect values in each iteration. Although it seems weird to intentionally generate incorrect training labels, we show that DisturbLabel prevents the network training from over-fitting by implicitly averaging over exponentially many networks which are trained with different label sets. To the best of our knowledge, DisturbLabel is the first work that adds noise on the loss layer. Meanwhile, DisturbLabel cooperates well with Dropout to provide complementary regularization functions. Experiments demonstrate competitive recognition results on several popular image recognition datasets.
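
A minimal sketch of the label-disturbing step, under stated assumptions: each label is disturbed independently with a noise rate `alpha`, and a disturbed label is drawn uniformly from all classes (so it may occasionally coincide with the true label). The function name, the parameter names, and the uniform sampling scheme are illustrative choices, not taken from the abstract.

```python
import random

def disturb_labels(labels, num_classes, alpha, rng):
    # Called once per training iteration on the labels of the current batch.
    disturbed = []
    for y in labels:
        if rng.random() < alpha:
            # Assumption: replacement drawn uniformly over all classes.
            disturbed.append(rng.randrange(num_classes))
        else:
            disturbed.append(y)
    return disturbed

# Hypothetical batch of labels for a 10-class problem.
rng = random.Random(0)
batch_labels = [3, 1, 4, 1, 5, 9, 2, 6]
untouched = disturb_labels(batch_labels, 10, 0.0, rng)
fully_disturbed = disturb_labels(batch_labels, 10, 1.0, rng)
```

Because a fresh set of disturbed labels is drawn at every iteration, each sample keeps its correct label most of the time when `alpha` is small.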

* To appear in CVPR 2016 (10 pages, 10 figures) 

Good Practice in CNN Feature Transfer

Apr 01, 2016
Liang Zheng, Yali Zhao, Shengjin Wang, Jingdong Wang, Qi Tian

The objective of this paper is the effective transfer of the Convolutional Neural Network (CNN) feature in image search and classification. Systematically, we study three facts in CNN transfer. 1) We demonstrate the advantage of using images with a properly large size as input to CNN instead of the conventionally resized one. 2) We benchmark the performance of different CNN layers improved by average/max pooling on the feature maps. Our observation suggests that the Conv5 feature yields very competitive accuracy under such a pooling step. 3) We find that the simple combination of pooled features extracted across various CNN layers is effective in collecting evidence from both low- and high-level descriptors. Following these good practices, we are capable of improving the state of the art on a number of benchmarks by a large margin.

* 9 pages. It will be submitted to an appropriate journal 

Hashing for Similarity Search: A Survey

Aug 13, 2014
Jingdong Wang, Heng Tao Shen, Jingkuan Song, Jianqiu Ji

Similarity search (nearest neighbor search) is the problem of finding, in a large database, the data items whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently a lot of effort has been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.
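
As an illustration of the first category (hash functions designed without looking at the data), the sketch below uses random-hyperplane hashing, one well-known locality sensitive hashing family for angular similarity. It is a generic example rather than a method singled out by the survey abstract, and all names are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_hyperplanes(num_bits, dim, rng):
    # Data-independent: hyperplane normals are drawn at random.
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]

def hash_code(vector, hyperplanes):
    # One bit per hyperplane: which side of the hyperplane the vector lies on.
    return tuple(1 if dot(h, vector) >= 0 else 0 for h in hyperplanes)

def hamming(a, b):
    # Distance measure in the hash coding space.
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)
planes = make_hyperplanes(16, 4, rng)
v = [0.5, -1.0, 2.0, 0.25]
code_v = hash_code(v, planes)
code_scaled = hash_code([2 * x for x in v], planes)  # same direction as v
code_neg = hash_code([-x for x in v], planes)        # opposite direction
```

Vectors pointing in the same direction receive identical codes, while the opposite direction flips every bit; vectors at intermediate angles differ in a fraction of bits that grows with the angle.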
