Models, code, and papers for "Xinghao Chen":

Two-stream convolutional neural network for accurate RGB-D fingertip detection using depth and edge information

Dec 23, 2016
Hengkai Guo, Guijin Wang, Xinghao Chen

Accurate detection of fingertips in depth image is critical for human-computer interaction. In this paper, we present a novel two-stream convolutional neural network (CNN) for RGB-D fingertip detection. Firstly edge image is extracted from raw depth image using random forest. Then the edge information is combined with depth information in our CNN structure. We study several fusion approaches and suggest a slow fusion strategy as a promising way of fingertip detection. As shown in our experiments, our real-time algorithm outperforms state-of-the-art fingertip detection methods on the public dataset HandNet with an average 3D error of 9.9mm, and shows comparable accuracy of fingertip estimation on NYU hand dataset.

* Accepted by ICIP 2016 

  Click for Model/Code and Paper
Pose Guided Structured Region Ensemble Network for Cascaded Hand Pose Estimation

Jun 24, 2018
Xinghao Chen, Guijin Wang, Hengkai Guo, Cairong Zhang

Hand pose estimation from a single depth image is an essential topic in computer vision and human computer interaction. Despite recent advancements in this area promoted by convolutional neural network, accurate hand pose estimation is still a challenging problem. In this paper we propose a Pose guided structured Region Ensemble Network (Pose-REN) to boost the performance of hand pose estimation. The proposed method extracts regions from the feature maps of convolutional neural network under the guide of an initially estimated pose, generating more optimal and representative features for hand pose estimation. The extracted feature regions are then integrated hierarchically according to the topology of hand joints by employing tree-structured fully connections. A refined estimation of hand pose is directly regressed by the proposed network and the final hand pose is obtained by utilizing an iterative cascaded method. Comprehensive experiments on public hand pose datasets demonstrate that our proposed method outperforms state-of-the-art algorithms.

* Accepted by Neurocomputing 

  Click for Model/Code and Paper
Motion Feature Augmented Recurrent Neural Network for Skeleton-based Dynamic Hand Gesture Recognition

Aug 10, 2017
Xinghao Chen, Hengkai Guo, Guijin Wang, Li Zhang

Dynamic hand gesture recognition has attracted increasing interests because of its importance for human computer interaction. In this paper, we propose a new motion feature augmented recurrent neural network for skeleton-based dynamic hand gesture recognition. Finger motion features are extracted to describe finger movements and global motion features are utilized to represent the global movement of hand skeleton. These motion features are then fed into a bidirectional recurrent neural network (RNN) along with the skeleton sequence, which can augment the motion features for RNN and improve the classification performance. Experiments demonstrate that our proposed method is effective and outperforms start-of-the-art methods.

* Accepted by ICIP 2017 

  Click for Model/Code and Paper
Towards Good Practices for Deep 3D Hand Pose Estimation

Jul 23, 2017
Hengkai Guo, Guijin Wang, Xinghao Chen, Cairong Zhang

3D hand pose estimation from single depth image is an important and challenging problem for human-computer interaction. Recently deep convolutional networks (ConvNet) with sophisticated design have been employed to address it, but the improvement over traditional random forest based methods is not so apparent. To exploit the good practice and promote the performance for hand pose estimation, we propose a tree-structured Region Ensemble Network (REN) for directly 3D coordinate regression. It first partitions the last convolution outputs of ConvNet into several grid regions. The results from separate fully-connected (FC) regressors on each regions are then integrated by another FC layer to perform the estimation. By exploitation of several training strategies including data augmentation and smooth $L_1$ loss, proposed REN can significantly improve the performance of ConvNet to localize hand joints. The experimental results demonstrate that our approach achieves the best performance among state-of-the-art algorithms on three public hand pose datasets. We also experiment our methods on fingertip detection and human pose datasets and obtain state-of-the-art accuracy.

* Extended version of arXiv:1702.02447 

  Click for Model/Code and Paper
Revisiting Distributed Synchronous SGD

Mar 21, 2017
Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz

Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise introduced from asynchrony. In contrast, the synchronous approach is often thought to be impractical due to idle time wasted on waiting for straggling workers. We revisit these conventional beliefs in this paper, and examine the weaknesses of both approaches. We demonstrate that a third approach, synchronous optimization with backup workers, can avoid asynchronous noise while mitigating for the worst stragglers. Our approach is empirically validated and shown to converge faster and to better test accuracies.

* 10 pages 

  Click for Model/Code and Paper
Bi-stream Pose Guided Region Ensemble Network for Fingertip Localization from Stereo Images

Feb 26, 2019
Guijin Wang, Cairong Zhang, Xinghao Chen, Xiangyang Ji, Jing-Hao Xue, Hang Wang

In human-computer interaction, it is important to accurately estimate the hand pose especially fingertips. However, traditional approaches for fingertip localization mainly rely on depth images and thus suffer considerably from the noise and missing values. Instead of depth images, stereo images can also provide 3D information of hands and promote 3D hand pose estimation. There are nevertheless limitations on the dataset size, global viewpoints, hand articulations and hand shapes in the publicly available stereo-based hand pose datasets. To mitigate these limitations and promote further research on hand pose estimation from stereo images, we propose a new large-scale binocular hand pose dataset called THU-Bi-Hand, offering a new perspective for fingertip localization. In the THU-Bi-Hand dataset, there are 447k pairs of stereo images of different hand shapes from 10 subjects with accurate 3D location annotations of the wrist and five fingertips. Captured with minimal restriction on the range of hand motion, the dataset covers large global viewpoint space and hand articulation space. To better present the performance of fingertip localization on THU-Bi-Hand, we propose a novel scheme termed Bi-stream Pose Guided Region Ensemble Network (Bi-Pose-REN). It extracts more representative feature regions around joint points in the feature maps under the guidance of the previously estimated pose. The feature regions are integrated hierarchically according to the topology of hand joints to regress the refined hand pose. Bi-Pose-REN and several existing methods are evaluated on THU-Bi-Hand so that benchmarks are provided for further research. Experimental results show that our new method has achieved the best performance on THU-Bi-Hand.

* Cairong Zhang and Xinghao Chen are equally contributed 

  Click for Model/Code and Paper
Interactive Hand Pose Estimation: Boosting accuracy in localizing extended finger joints

Jul 25, 2018
Cairong Zhang, Guijin Wang, Hengkai Guo, Xinghao Chen, Fei Qiao, Huazhong Yang

Accurate 3D hand pose estimation plays an important role in Human Machine Interaction (HMI). In the reality of HMI, joints in fingers stretching out, especially corresponding fingertips, are much more important than other joints. We propose a novel method to refine stretching-out finger joint locations after obtaining rough hand pose estimation. It first detects which fingers are stretching out, then neighbor pixels of certain joint vote for its new location based on random forests. The algorithm is tested on two public datasets: MSRA15 and ICVL. After the refinement stage of stretching-out fingers, errors of predicted HMI finger joint locations are significantly reduced. Mean error of all fingertips reduces around 5mm (relatively more than 20%). Stretching-out fingertip locations are even more precise, which in MSRA15 reduces 10.51mm (relatively 41.4%).

* Electronic Imaging, Visual Information Processing and Communication IX (2018), pp. 251-1-251-6(6) 
* Original publication available on 

  Click for Model/Code and Paper
Two-Stream Binocular Network: Accurate Near Field Finger Detection Based On Binocular Images

Apr 26, 2018
Yi Wei, Guijin Wang, Cairong Zhang, Hengkai Guo, Xinghao Chen, Huazhong Yang

Fingertip detection plays an important role in human computer interaction. Previous works transform binocular images into depth images. Then depth-based hand pose estimation methods are used to predict 3D positions of fingertips. Different from previous works, we propose a new framework, named Two-Stream Binocular Network (TSBnet) to detect fingertips from binocular images directly. TSBnet first shares convolutional layers for low level features of right and left images. Then it extracts high level features in two-stream convolutional networks separately. Further, we add a new layer: binocular distance measurement layer to improve performance of our model. To verify our scheme, we build a binocular hand image dataset, containing about 117k pairs of images in training set and 10k pairs of images in test set. Our methods achieve an average error of 10.9mm on our test set, outperforming previous work by 5.9mm (relatively 35.1%).

* Visual Communications and Image Processing (VCIP), 2017 IEEE (2017) 1-4 
* Published in: Visual Communications and Image Processing (VCIP), 2017 IEEE. Original IEEE publication available on Dataset available on 

  Click for Model/Code and Paper
Region Ensemble Network: Improving Convolutional Network for Hand Pose Estimation

May 09, 2017
Hengkai Guo, Guijin Wang, Xinghao Chen, Cairong Zhang, Fei Qiao, Huazhong Yang

Hand pose estimation from monocular depth images is an important and challenging problem for human-computer interaction. Recently deep convolutional networks (ConvNet) with sophisticated design have been employed to address it, but the improvement over traditional methods is not so apparent. To promote the performance of directly 3D coordinate regression, we propose a tree-structured Region Ensemble Network (REN), which partitions the convolution outputs into regions and integrates the results from multiple regressors on each regions. Compared with multi-model ensemble, our model is completely end-to-end training. The experimental results demonstrate that our approach achieves the best performance among state-of-the-arts on two public datasets.

* Accepted to ICIP 2017. Project: 

  Click for Model/Code and Paper
CARS: Continuous Evolution for Efficient Neural Architecture Search

Sep 17, 2019
Zhaohui Yang, Yunhe Wang, Xinghao Chen, Boxin Shi, Chao Xu, Chunjing Xu, Qi Tian, Chang Xu

Searching techniques in most of existing neural architecture search (NAS) algorithms are mainly dominated by differentiable methods for the efficiency reason. In contrast, we develop an efficient continuous evolutionary approach for searching neural networks. Architectures in the population which share parameters within one supernet in the latest iteration will be tuned over the training dataset with a few epochs. The searching in the next evolution iteration will directly inherit both the supernet and the population, which accelerates the optimal network generation. The non-dominated sorting strategy is further applied to preserve only results on the Pareto front for accurately updating the supernet. Several neural networks with different model sizes and performance will be produced after the continuous search with only 0.4 GPU days. As a result, our framework provides a series of networks with the number of parameters ranging from 3.7M to 5.1M under mobile settings. These networks surpass those produced by the state-of-the-art methods on the benchmark ImageNet dataset.

  Click for Model/Code and Paper
Unsupervised Adversarial Graph Alignment with Graph Embedding

Jul 01, 2019
Chaoqi Chen, Weiping Xie, Tingyang Xu, Yu Rong, Wenbing Huang, Xinghao Ding, Yue Huang, Junzhou Huang

Graph alignment, also known as network alignment, is a fundamental task in social network analysis. Many recent works have relied on partially labeled cross-graph node correspondences, i.e., anchor links. However, due to the privacy and security issue, the manual labeling of anchor links for diverse scenarios may be prohibitive. Aligning two graphs without any anchor links is a crucial and challenging task. In this paper, we propose an Unsupervised Adversarial Graph Alignment (UAGA) framework to learn a cross-graph alignment between two embedding spaces of different graphs in a fully unsupervised fashion (\emph{i.e.,} no existing anchor links and no users' personal profile or attribute information is available). The proposed framework learns the embedding spaces of each graph, and then attempts to align the two spaces via adversarial training, followed by a refinement procedure. We further extend our UAGA method to incremental UAGA (iUAGA) that iteratively reveals the unobserved user links based on the pseudo anchor links. This can be used to further improve both the embedding quality and the alignment accuracy. Moreover, the proposed methods will benefit some real-world applications, \emph{e.g.,} link prediction in social networks. Comprehensive experiments on real-world data demonstrate the effectiveness of our proposed approaches UAGA and iUAGA for unsupervised graph alignment.

* 10 pages, 4 figures 

  Click for Model/Code and Paper
Progressive Feature Alignment for Unsupervised Domain Adaptation

Nov 21, 2018
Chaoqi Chen, Weiping Xie, Tingyang Xu, Wenbing Huang, Yu Rong, Xinghao Ding, Yue Huang, Junzhou Huang

Unsupervised domain adaptation (UDA) transfers knowledge from a label-rich source domain to a fully-unlabeled target domain. To tackle this task, recent approaches resort to discriminative domain transfer in virtue of pseudo-labels to enforce the class-level distribution alignment across the source and target domains. These methods, however, are vulnerable to the error accumulation and thus incapable of preserving cross-domain category consistency, as the pseudo-labeling accuracy is not guaranteed explicitly. In this paper, we propose the Progressive Feature Alignment Network (PFAN) to align the discriminative features across domains progressively and effectively, via exploiting the intra-class variation in the target domain. To be specific, we first develop an Easy-to-Hard Transfer Strategy (EHTS) and an Adaptive Prototype Alignment (APA) step to train our model iteratively and alternatively. Moreover, upon observing that a good domain adaptation usually requires a non-saturated source classifier, we consider a simple yet efficient way to retard the convergence speed of the source classification loss by further involving a temperature variate into the soft-max function. The extensive experimental results reveal that the proposed PFAN exceeds the state-of-the-art performance on three UDA datasets.

* 10 pages, 5 figures 

  Click for Model/Code and Paper
High Efficient Reconstruction of Single-shot T2 Mapping from OverLapping-Echo Detachment Planar Imaging Based on Deep Residual Network

Aug 17, 2017
Congbo Cai, Yiqing Zeng, Chao Wang, Shuhui Cai, Jun Zhang, Zhong Chen, Xinghao Ding, Jianhui Zhong

Purpose: An end-to-end deep convolutional neural network (CNN) based on deep residual network (ResNet) was proposed to efficiently reconstruct reliable T2 mapping from single-shot OverLapping-Echo Detachment (OLED) planar imaging. Methods: The training dataset was obtained from simulations carried out on SPROM software developed by our group. The relationship between the original OLED image containing two echo signals and the corresponded T2 mapping was learned by ResNet training. After the ResNet was trained, it was applied to reconstruct the T2 mapping from simulation and in vivo human brain data. Results: Though the ResNet was trained entirely on simulated data, the trained network was generalized well to real human brain data. The results from simulation and in vivo human brain experiments show that the proposed method significantly outperformed the echo-detachment-based method. Reliable T2 mapping was achieved within tens of milliseconds after the network had been trained while the echo-detachment-based OLED reconstruction method took minutes. Conclusion: The proposed method will greatly facilitate real-time dynamic and quantitative MR imaging via OLED sequence, and ResNet has the potential to reconstruct images from complex MRI sequence efficiently.

* 18 pages, 7 figures 

  Click for Model/Code and Paper
Depth-Based 3D Hand Pose Estimation: From Current Achievements to Future Goals

Mar 29, 2018
Shanxin Yuan, Guillermo Garcia-Hernando, Bjorn Stenger, Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee, Pavlo Molchanov, Jan Kautz, Sina Honari, Liuhao Ge, Junsong Yuan, Xinghao Chen, Guijin Wang, Fan Yang, Kai Akiyama, Yang Wu, Qingfu Wan, Meysam Madadi, Sergio Escalera, Shile Li, Dongheui Lee, Iason Oikonomidis, Antonis Argyros, Tae-Kyun Kim

In this paper, we strive to answer two questions: What is the current state of 3D hand pose estimation from depth images? And, what are the next challenges that need to be tackled? Following the successful Hands In the Million Challenge (HIM2017), we investigate the top 10 state-of-the-art methods on three tasks: single frame 3D pose estimation, 3D hand tracking, and hand pose estimation during object interaction. We analyze the performance of different CNN structures with regard to hand shape, joint visibility, view point and articulation distributions. Our findings include: (1) isolated 3D hand pose estimation achieves low mean errors (10 mm) in the view point range of [70, 120] degrees, but it is far from being solved for extreme view points; (2) 3D volumetric representations outperform 2D CNNs, better capturing the spatial structure of the depth data; (3) Discriminative methods still generalize poorly to unseen hand shapes; (4) While joint occlusions pose a challenge for most methods, explicit modeling of structure constraints can significantly narrow the gap between errors on visible and occluded joints.

  Click for Model/Code and Paper
Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes

Apr 13, 2019
Chengquan Zhang, Borong Liang, Zuming Huang, Mengyi En, Junyu Han, Errui Ding, Xinghao Ding

Previous scene text detection methods have progressed substantially over the past years. However, limited by the receptive field of CNNs and the simple representations like rectangle bounding box or quadrangle adopted to describe text, previous methods may fall short when dealing with more challenging text instances, such as extremely long text and arbitrarily shaped text. To address these two problems, we present a novel text detector namely LOMO, which localizes the text progressively for multiple times (or in other word, LOok More than Once). LOMO consists of a direct regressor (DR), an iterative refinement module (IRM) and a shape expression module (SEM). At first, text proposals in the form of quadrangle are generated by DR branch. Next, IRM progressively perceives the entire long text by iterative refinement based on the extracted feature blocks of preliminary proposals. Finally, a SEM is introduced to reconstruct more precise representation of irregular text by considering the geometry properties of text instance, including text region, text center line and border offsets. The state-of-the-art results on several public benchmarks including ICDAR2017-RCTW, SCUT-CTW1500, Total-Text, ICDAR2015 and ICDAR17-MLT confirm the striking robustness and effectiveness of LOMO.

* Accepted by CVPR19 

  Click for Model/Code and Paper
Uncertainty-Guided Domain Alignment for Layer Segmentation in OCT Images

Aug 30, 2019
Jiexiang Wang, Cheng Bian, Meng Li, Xin Yang, Kai Ma, Wenao Ma, Jin Yuan, Xinghao Ding, Yefeng Zheng

Automatic and accurate segmentation for retinal and choroidal layers of Optical Coherence Tomography (OCT) is crucial for detection of various ocular diseases. However, because of the variations in different equipments, OCT data obtained from different manufacturers might encounter appearance discrepancy, which could lead to performance fluctuation to a deep neural network. In this paper, we propose an uncertainty-guided domain alignment method to aim at alleviating this problem to transfer discriminative knowledge across distinct domains. We disign a novel uncertainty-guided cross-entropy loss for boosting the performance over areas with high uncertainty. An uncertainty-guided curriculum transfer strategy is developed for the self-training (ST), which regards uncertainty as efficient and effective guidance to optimize the learning process in target domain. Adversarial learning with feature recalibration module (FRM) is applied to transfer informative knowledge from the domain feature spaces adaptively. The experiments on two OCT datasets show that the proposed methods can obtain significant segmentation improvements compared with the baseline models.

  Click for Model/Code and Paper