Models, code, and papers for "Sheng Jin":

##### Deep Saliency Hashing

Jul 04, 2018
Sheng Jin

In recent years, hashing methods have been proved efficient for large-scale Web media search. However, existing general hashing methods have limited discriminative power for describing fine-grained objects that share similar overall appearance but have subtle difference. To solve this problem, we for the first time introduce attention mechanism to the learning of hashing codes. Specifically, we propose a novel deep hashing model, named deep saliency hashing (DSaH), which automatically mines salient regions and learns semantic-preserving hashing codes simultaneously. DSaH is a two-step end-to-end model consisting of an attention network and a hashing network. Our loss function contains three basic components, including the semantic loss, the saliency loss, and the quantization loss. The saliency loss guides the attention network to mine discriminative regions from pairs of images. We conduct extensive experiments on both fine-grained and general retrieval datasets for performance evaluation. Experimental results on Oxford Flowers-17 and Stanford Dogs-120 demonstrate that our DSaH performs the best for fine-grained retrieval task and beats the existing best retrieval performance (DPSH) by approximately 12%. DSaH also outperforms several state-of-the-art hashing methods on general datasets, including CIFAR-10 and NUS-WIDE.

##### Unsupervised Semantic Deep Hashing

Mar 19, 2018
Sheng Jin

In recent years, deep hashing methods have been proved to be efficient since it employs convolutional neural network to learn features and hashing codes simultaneously. However, these methods are mostly supervised. In real-world application, it is a time-consuming and overloaded task for annotating a large number of images. In this paper, we propose a novel unsupervised deep hashing method for large-scale image retrieval. Our method, namely unsupervised semantic deep hashing (\textbf{USDH}), uses semantic information preserved in the CNN feature layer to guide the training of network. We enforce four criteria on hashing codes learning based on VGG-19 model: 1) preserving relevant information of feature space in hashing space; 2) minimizing quantization loss between binary-like codes and hashing codes; 3) improving the usage of each bit in hashing codes by using maximum information entropy, and 4) invariant to image rotation. Extensive experiments on CIFAR-10, NUSWIDE have demonstrated that \textbf{USDH} outperforms several state-of-the-art unsupervised hashing methods for image retrieval. We also conduct experiments on Oxford 17 datasets for fine-grained classification to verify its efficiency for other computer vision tasks.

##### HoMM: Higher-order Moment Matching for Unsupervised Domain Adaptation

Minimizing the discrepancy of feature distributions between different domains is one of the most promising directions in unsupervised domain adaptation. From the perspective of distribution matching, most existing discrepancy-based methods are designed to match the second-order or lower statistics, which however, have limited expression of statistical characteristic for non-Gaussian distributions. In this work, we explore the benefits of using higher-order statistics (mainly refer to third-order and fourth-order statistics) for domain matching. We propose a Higher-order Moment Matching (HoMM) method, and further extend the HoMM into reproducing kernel Hilbert spaces (RKHS). In particular, our proposed HoMM can perform arbitrary-order moment tensor matching, we show that the first-order HoMM is equivalent to Maximum Mean Discrepancy (MMD) and the second-order HoMM is equivalent to Correlation Alignment (CORAL). Moreover, the third-order and the fourth-order moment tensor matching are expected to perform comprehensive domain alignment as higher-order statistics can approximate more complex, non-Gaussian distributions. Besides, we also exploit the pseudo-labeled target samples to learn discriminative representations in the target domain, which further improves the transfer performance. Extensive experiments are conducted, showing that our proposed HoMM consistently outperforms the existing moment matching methods by a large margin. Codes are available at \url{https://github.com/chenchao666/HoMM-Master}

* Accept by AAAI-2020, codes are available at https://github.com/chenchao666/HoMM-Master
##### 6D Object Pose Estimation without PnP

Feb 06, 2019
Jin Liu, Sheng He

In this paper, we propose an efficient end-to-end algorithm to tackle the problem of estimating the 6D pose of objects from a single RGB image. Our system trains a fully convolutional network to regress the 3D rotation and the 3D translation in region layer. On this basis, a special layer, Collinear Equation Layer, is added next to region layer to output the 2D projections of the 3D bounding boxs corners. In the back propagation stage, the 6D pose network are adjusted according to the error of the 2D projections. In the detection phase, we directly output the position and pose through the region layer. Besides, we introduce a novel and concise representation of 3D rotation to make the regression more precise and easier. Experiments show that our method outperforms base-line and state of the art methods both at accuracy and efficiency. In the LineMod dataset, our algorithm achieves less than 18 ms/object on a GeForce GTX 1080Ti GPU, while the translational error and rotational error are less than 1.67 cm and 2.5 degree.

##### 6D Object Pose Estimation Based on 2D Bounding Box

Jan 27, 2019
Jin Liu, Sheng He

In this paper, we present a simple but powerful method to tackle the problem of estimating the 6D pose of objects from a single RGB image. Our system trains a novel convolutional neural network to regress the unit quaternion, which represents the 3D rotation, from the partial image inside the bounding box returned by 2D detection systems. Then we propose an algorithm we call Bounding Box Equation to efficiently and accurately obtain the 3D translation, using 3D rotation and 2D bounding box. Considering that the quadratic sum of the quaternion's four elements equals to one, we add a normalization layer to keep the network's output on the unit sphere and put forward a special loss function for unit quaternion regression. We evaluate our method on the LineMod dataset and experiment shows that our approach outperforms base-line and some state of the art methods.

##### Detecting Curve Text in the Wild: New Dataset and New Solution

Dec 06, 2017
Liu Yuliang, Jin Lianwen, Zhang Shuaitao, Zhang Sheng

Scene text detection has been made great progress in recent years. The detection manners are evolving from axis-aligned rectangle to rotated rectangle and further to quadrangle. However, current datasets contain very little curve text, which can be widely observed in scene images such as signboard, product name and so on. To raise the concerns of reading curve text in the wild, in this paper, we construct a curve text dataset named CTW1500, which includes over 10k text annotations in 1,500 images (1000 for training and 500 for testing). Based on this dataset, we pioneering propose a polygon based curve text detector (CTD) which can directly detect curve text without empirical combination. Moreover, by seamlessly integrating the recurrent transverse and longitudinal offset connection (TLOC), the proposed method can be end-to-end trainable to learn the inherent connection among the position offsets. This allows the CTD to explore context information instead of predicting points independently, resulting in more smooth and accurate detection. We also propose two simple but effective post-processing methods named non-polygon suppress (NPS) and polygonal non-maximum suppression (PNMS) to further improve the detection accuracy. Furthermore, the proposed approach in this paper is designed in an universal manner, which can also be trained with rectangular or quadrilateral bounding boxes without extra efforts. Experimental results on CTW-1500 demonstrate our method with only a light backbone can outperform state-of-the-art methods with a large margin. By evaluating only in the curve or non-curve subset, the CTD + TLOC can still achieve the best results. Code is available at https://github.com/Yuliang-Liu/Curve-Text-Detector.

* 9 pages
##### RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning

Feb 08, 2020
Nan Jiang, Sheng Jin, Zhiyao Duan, Changshui Zhang

This paper presents a deep reinforcement learning algorithm for online accompaniment generation, with potential for real-time interactive human-machine duet improvisation. Different from offline music generation and harmonization, online music accompaniment requires the algorithm to respond to human input and generate the machine counterpart in a sequential order. We cast this as a reinforcement learning problem, where the generation agent learns a policy to generate a musical note (action) based on previously generated context (state). The key of this algorithm is the well-functioning reward model. Instead of defining it using music composition rules, we learn this model from monophonic and polyphonic training data. This model considers the compatibility of the machine-generated note with both the machine-generated context and the human-generated context. Experiments show that this algorithm is able to respond to the human part and generate a melodic, harmonic and diverse machine part. Subjective evaluations on preferences show that the proposed algorithm generates music pieces of higher quality than the baseline method.

##### Multi-person Articulated Tracking with Spatial and Temporal Embeddings

Mar 21, 2019
Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian

We propose a unified framework for multi-person pose estimation and tracking. Our framework consists of two main components,~\ie~SpatialNet and TemporalNet. The SpatialNet accomplishes body part detection and part-level data association in a single frame, while the TemporalNet groups human instances in consecutive frames into trajectories. Specifically, besides body part detection heatmaps, SpatialNet also predicts the Keypoint Embedding (KE) and Spatial Instance Embedding (SIE) for body part association. We model the grouping procedure into a differentiable Pose-Guided Grouping (PGG) module to make the whole part detection and grouping pipeline fully end-to-end trainable. TemporalNet extends spatial grouping of keypoints to temporal grouping of human instances. Given human proposals from two consecutive frames, TemporalNet exploits both appearance features encoded in Human Embedding (HE) and temporally consistent geometric features embodied in Temporal Instance Embedding (TIE) for robust tracking. Extensive experiments demonstrate the effectiveness of our proposed model. Remarkably, we demonstrate substantial improvements over the state-of-the-art pose tracking method from 65.4\% to 71.8\% Multi-Object Tracking Accuracy (MOTA) on the ICCV'17 PoseTrack Dataset.

* CVPR2019
##### Feature Enhancement Network: A Refined Scene Text Detector

Nov 12, 2017
Sheng Zhang, Yuliang Liu, Lianwen Jin, Canjie Luo

In this paper, we propose a refined scene text detector with a \textit{novel} Feature Enhancement Network (FEN) for Region Proposal and Text Detection Refinement. Retrospectively, both region proposal with \textit{only} $3\times 3$ sliding-window feature and text detection refinement with \textit{single scale} high level feature are insufficient, especially for smaller scene text. Therefore, we design a new FEN network with \textit{task-specific}, \textit{low} and \textit{high} level semantic features fusion to improve the performance of text detection. Besides, since \textit{unitary} position-sensitive RoI pooling in general object detection is unreasonable for variable text regions, an \textit{adaptively weighted} position-sensitive RoI pooling layer is devised for further enhancing the detecting accuracy. To tackle the \textit{sample-imbalance} problem during the refinement stage, we also propose an effective \textit{positives mining} strategy for efficiently training our network. Experiments on ICDAR 2011 and 2013 robust text detection benchmarks demonstrate that our method can achieve state-of-the-art results, outperforming all reported methods in terms of F-measure.

* 8 pages, 5 figures, 2 tables. This paper is accepted to appear in AAAI 2018
##### The MBPEP: a deep ensemble pruning algorithm providing high quality uncertainty prediction

Feb 25, 2019
Ruihan Hu, Qijun Huang, Sheng Chang, Hao Wang, Jin He

Machine learning algorithms have been effectively applied into various real world tasks. However, it is difficult to provide high-quality machine learning solutions to accommodate an unknown distribution of input datasets; this difficulty is called the uncertainty prediction problems. In this paper, a margin-based Pareto deep ensemble pruning (MBPEP) model is proposed. It achieves the high-quality uncertainty estimation with a small value of the prediction interval width (MPIW) and a high confidence of prediction interval coverage probability (PICP) by using deep ensemble networks. In addition to these networks, unique loss functions are proposed, and these functions make the sub-learners available for standard gradient descent learning. Furthermore, the margin criterion fine-tuning-based Pareto pruning method is introduced to optimize the ensembles. Several experiments including predicting uncertainties of classification and regression are conducted to analyze the performance of MBPEP. The experimental results show that MBPEP achieves a small interval width and a low learning error with an optimal number of ensembles. For the real-world problems, MBPEP performs well on input datasets with unknown distributions datasets incomings and improves learning performance on a multi task problem when compared to that of each single model.

* Applied Intelligence(2019)
* 20 pages, 7 figures
##### Non-Spike Timing-Dependent Plasticity based Unsupervised Memristive Neural Networks with High Hardware Compatibility

Jul 23, 2019
Zhiri Tang, Yanhua Chen, Hao Wang, Jin He, Qijun Huang, Sheng Chang

With the development of research on memristor, memristive neural networks (MNNs) have become a hot research topic recently. Because memristor can mimic the spike timing-dependent plasticity (STDP), the research on STDP based MNNs is rapidly increasing. However, although state-of-the-art works on STDP based MNNs have many applications such as pattern recognition, STDP mechanism brings relatively complex hardware framework and low processing speed, which block MNNs' hardware realization. A non-STDP based unsupervised MNN is constructed in this paper. Through the comparison with STDP method on the basis of two common structures including feedforward and crossbar, non-STDP based MNNs not only remain the same advantages as STDP based MNNs including high accuracy and convergence speed in pattern recognition, but also better hardware performance as few hardware resources and higher processing speed. By virtue of the combination of memristive character and simple mechanism, non-STDP based MNNs have better hardware compatibility, which may give a new viewpoint for memristive neural networks' engineering applications.

* 9 pages, 10 figures
##### TRB: A Novel Triplet Representation for Understanding 2D Human Body

Oct 25, 2019
Haodong Duan, KwanYee Lin, Sheng Jin, Wentao Liu, Chen Qian, Wanli Ouyang

Human pose and shape are two important components of 2D human body. However, how to efficiently represent both of them in images is still an open question. In this paper, we propose the Triplet Representation for Body (TRB) -- a compact 2D human body representation, with skeleton keypoints capturing human pose information and contour keypoints containing human shape information. TRB not only preserves the flexibility of skeleton keypoint representation, but also contains rich pose and human shape information. Therefore, it promises broader application areas, such as human shape editing and conditional image generation. We further introduce the challenging problem of TRB estimation, where joint learning of human pose and shape is required. We construct several large-scale TRB estimation datasets, based on popular 2D pose datasets: LSP, MPII, COCO. To effectively solve TRB estimation, we propose a two-branch network (TRB-net) with three novel techniques, namely X-structure (Xs), Directional Convolution (DC) and Pairwise Mapping (PM), to enforce multi-level message passing for joint feature learning. We evaluate our proposed TRB-net and several leading approaches on our proposed TRB datasets, and demonstrate the superiority of our method through extensive evaluations.

* Accepted by ICCV2019
##### Omnidirectional Scene Text Detection with Sequential-free Box Discretization

Jun 07, 2019
Yuliang Liu, Sheng Zhang, Lianwen Jin, Lele Xie, Yaqiang Wu, Zhepeng Wang

Scene text in the wild is commonly presented with high variant characteristics. Using quadrilateral bounding box to localize the text instance is nearly indispensable for detection methods. However, recent researches reveal that introducing quadrilateral bounding box for scene text detection will bring a label confusion issue which is easily overlooked, and this issue may significantly undermine the detection performance. To address this issue, in this paper, we propose a novel method called Sequential-free Box Discretization (SBD) by discretizing the bounding box into key edges (KE) which can further derive more effective methods to improve detection performance. Experiments showed that the proposed method can outperform state-of-the-art methods in many popular scene text benchmarks, including ICDAR 2015, MLT, and MSRA-TD500. Ablation study also showed that simply integrating the SBD into Mask R-CNN framework, the detection performance can be substantially improved. Furthermore, an experiment on the general object dataset HRSC2016 (multi-oriented ships) showed that our method can outperform recent state-of-the-art methods by a large margin, demonstrating its powerful generalization ability.

* Accepted by IJCAI2019
##### Towards Self-similarity Consistency and Feature Discrimination for Unsupervised Domain Adaptation

Apr 13, 2019
Chao Chen, Zhihang Fu, Zhihong Chen, Zhaowei Cheng, Xinyu Jin, Xian-Sheng Hua

Recent advances in unsupervised domain adaptation mainly focus on learning shared representations by global distribution alignment without considering class information across domains. The neglect of class information, however, may lead to partial alignment (or even misalignment) and poor generalization performance. For comprehensive alignment, we argue that the similarities across different features in the source domain should be consistent with that of in the target domain. Based on this assumption, we propose a new domain discrepancy metric, i.e., Self-similarity Consistency (SSC), to enforce the feature structure being consistent across domains. The renowned correlation alignment (CORAL) is proven to be a special case, and a sub-optimal measure of our proposed SSC. Furthermore, we also propose to mitigate the side effect of the partial alignment and misalignment by incorporating the discriminative information of the deep representations. Specifically, an embarrassingly simple and effective feature norm constraint is exploited to enlarge the discrepancy of inter-class samples. It relieves the requirements of strict alignment when performing adaptation, therefore improving the adaptation performance significantly. Extensive experiments on visual domain adaptation tasks demonstrate the effectiveness of our proposed SSC metric and feature discrimination approach.

* This paper has been submitted to ACMMMM 2019
##### DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression

Jan 26, 2019
Sian Jin, Sheng Di, Xin Liang, Jiannan Tian, Dingwen Tao, Franck Cappello

DNNs have been quickly and broadly exploited to improve the data analysis quality in many complex science and engineering applications. Today's DNNs are becoming deeper and wider because of increasing demand on the analysis quality and more and more complex applications to resolve. The wide and deep DNNs, however, require large amounts of resources, significantly restricting their utilization on resource-constrained systems. Although some network simplification methods have been proposed to address this issue, they suffer from either low compression ratios or high compression errors, which may introduce a costly retraining process for the target accuracy. In this paper, we propose DeepSZ: an accuracy-loss bounded neural network compression framework, which involves four key steps: network pruning, error bound assessment, optimization for error bound configuration, and compressed model generation, featuring a high compression ratio and low encoding time. The contribution is three-fold. (1) We develop an adaptive approach to select the feasible error bounds for each layer. (2) We build a model to estimate the overall loss of accuracy based on the accuracy degradation caused by individual decompressed layers. (3) We develop an efficient optimization algorithm to determine the best-fit configuration of error bounds in order to maximize the compression ratio under the user-set accuracy constraint. Experiments show that DeepSZ can compress AlexNet and VGG-16 on the ImageNet by a compression ratio of 46X and 116X, respectively, and compress LeNet-300-100 and LeNet-5 on the MNIST by a compression ratio of 57X and 56X, respectively, with only up to 0.3% loss of accuracy. Compared with other state-of-the-art methods, DeepSZ can improve the compression ratio by up to 1.43X, the DNN encoding performance by up to 4.0X (with four Nvidia Tesla V100 GPUs), and the decoding performance by up to 6.2X.

* 12 pages, 6 figures, submitted to HPDC'19
##### SSAH: Semi-supervised Adversarial Deep Hashing with Self-paced Hard Sample Generation

Deep hashing methods have been proved to be effective and efficient for large-scale Web media search. The success of these data-driven methods largely depends on collecting sufficient labeled data, which is usually a crucial limitation in practical cases. The current solutions to this issue utilize Generative Adversarial Network (GAN) to augment data in semi-supervised learning. However, existing GAN-based methods treat image generations and hashing learning as two isolated processes, leading to generation ineffectiveness. Besides, most works fail to exploit the semantic information in unlabeled data. In this paper, we propose a novel Semi-supervised Self-pace Adversarial Hashing method, named SSAH to solve the above problems in a unified framework. The SSAH method consists of an adversarial network (A-Net) and a hashing network (H-Net). To improve the quality of generative images, first, the A-Net learns hard samples with multi-scale occlusions and multi-angle rotated deformations which compete against the learning of accurate hashing codes. Second, we design a novel self-paced hard generation policy to gradually increase the hashing difficulty of generated samples. To make use of the semantic information in unlabeled ones, we propose a semi-supervised consistent loss. The experimental results show that our method can significantly improve state-of-the-art models on both the widely-used hashing datasets and fine-grained datasets.

##### A Hardware Friendly Unsupervised Memristive Neural Network with Weight Sharing Mechanism

Jan 01, 2019
Zhiri Tang, Ruohua Zhu, Peng Lin, Jin He, Hao Wang, Qijun Huang, Sheng Chang, Qiming Ma

Memristive neural networks (MNNs), which use memristors as neurons or synapses, have become a hot research topic recently. However, most memristors are not compatible with mainstream integrated circuit technology and their stabilities in large-scale are not very well so far. In this paper, a hardware friendly MNN circuit is introduced, in which the memristive characteristics are implemented by digital integrated circuit. Through this method, spike timing dependent plasticity (STDP) and unsupervised learning are realized. A weight sharing mechanism is proposed to bridge the gap of network scale and hardware resource. Experiment results show the hardware resource is significantly saved with it, maintaining good recognition accuracy and high speed. Moreover, the tendency of resource increase is slower than the expansion of network scale, which infers our method's potential on large scale neuromorphic network's realization.

* Neurocomputing 2019
* 10 pages, 11 figures
##### Sharp Attention Network via Adaptive Sampling for Person Re-identification

In this paper, we present novel sharp attention networks by adaptively sampling feature maps from convolutional neural networks (CNNs) for person re-identification (re-ID) problem. Due to the introduction of sampling-based attention models, the proposed approach can adaptively generate sharper attention-aware feature masks. This greatly differs from the gating-based attention mechanism that relies soft gating functions to select the relevant features for person re-ID. In contrast, the proposed sampling-based attention mechanism allows us to effectively trim irrelevant features by enforcing the resultant feature masks to focus on the most discriminative features. It can produce sharper attentions that are more assertive in localizing subtle features relevant to re-identifying people across cameras. For this purpose, a differentiable Gumbel-Softmax sampler is employed to approximate the Bernoulli sampling to train the sharp attention networks. Extensive experimental evaluations demonstrate the superiority of this new sharp attention model for person re-ID over the other state-of-the-art methods on three challenging benchmarks including CUHK03, Market-1501, and DukeMTMC-reID.

* accepted by IEEE Transactions on Circuits and Systems for Video Technology(T-CSVT)
##### Learning Structured Communication for Multi-agent Reinforcement Learning

This work explores the large-scale multi-agent communication mechanism under a multi-agent reinforcement learning (MARL) setting. We summarize the general categories of topology for communication structures in MARL literature, which are often manually specified. Then we propose a novel framework termed as Learning Structured Communication (LSC) by using a more flexible and efficient communication topology. Our framework allows for adaptive agent grouping to form different hierarchical formations over episodes, which is generated by an auxiliary task combined with a hierarchical routing protocol. Given each formed topology, a hierarchical graph neural network is learned to enable effective message information generation and propagation among inter- and intra-group communications. In contrast to existing communication mechanisms, our method has an explicit while learnable design for hierarchical communication. Experiments on challenging tasks show the proposed LSC enjoys high communication efficiency, scalability, and global cooperation capability.