Research papers and code for "Xiaohua Huang":
Action Unit (AU) detection plays an important role for facial expression recognition. To the best of our knowledge, there is little research about AU analysis for micro-expressions. In this paper, we focus on AU detection in micro-expressions. Microexpression AU detection is challenging due to the small quantity of micro-expression databases, low intensity, short duration of facial muscle change, and class imbalance. In order to alleviate the problems, we propose a novel Spatio-Temporal Adaptive Pooling (STAP) network for AU detection in micro-expressions. Firstly, STAP is aggregated by a series of convolutional filters of different sizes. In this way, STAP can obtain multi-scale information on spatial and temporal domains. On the other hand, STAP contains less parameters, thus it has less computational cost and is suitable for micro-expression AU detection on very small databases. Furthermore, STAP module is designed to pool discriminative information for micro-expression AUs on spatial and temporal domains.Finally, Focal loss is employed to prevent the vast number of negatives from overwhelming the microexpression AU detector. In experiments, we firstly polish the AU annotations on three commonly used databases. We conduct intensive experiments on three micro-expression databases, and provide several baseline results on micro-expression AU detection. The results show that our proposed approach outperforms the basic Inflated inception-v1 (I3D) in terms of an average of F1- score. We also evaluate the performance of our proposed method on cross-database protocol. It demonstrates that our proposed approach is feasible for cross-database micro-expression AU detection. Importantly, the results on three micro-expression databases and cross-database protocol provide extensive baseline results for future research on micro-expression AU detection.

* 10 pages, 4 figures
Click to Read Paper and Get Code
Most of current person re-identification (ReID) methods neglect a spatial-temporal constraint. Given a query image, conventional methods compute the feature distances between the query image and all the gallery images and return a similarity ranked table. When the gallery database is very large in practice, these approaches fail to obtain a good performance due to appearance ambiguity across different camera views. In this paper, we propose a novel two-stream spatial-temporal person ReID (st-ReID) framework that mines both visual semantic information and spatial-temporal information. To this end, a joint similarity metric with Logistic Smoothing (LS) is introduced to integrate two kinds of heterogeneous information into a unified framework. To approximate a complex spatial-temporal probability distribution, we develop a fast Histogram-Parzen (HP) method. With the help of the spatial-temporal constraint, the st-ReID model eliminates lots of irrelevant images and thus narrows the gallery database. Without bells and whistles, our st-ReID method achieves rank-1 accuracy of 98.1\% on Market-1501 and 94.4\% on DukeMTMC-reID, improving from the baselines 91.2\% and 83.8\%, respectively, outperforming all previous state-of-the-art methods by a large margin.

* AAAI 2019
Click to Read Paper and Get Code
With the progress in automatic human behavior understanding, analysing the perceived affect of multiple people has been recieved interest in affective computing community. Unlike conventional facial expression analysis, this paper primarily focuses on analysing the behaviour of multiple people in an image. The proposed method is based on support vector regression with the combined global alignment kernels (GAKs) to estimate the happiness intensity of a group of people. We first exploit Riesz-based volume local binary pattern (RVLBP) and deep convolutional neural network (CNN) based features for characterizing facial images. Furthermore, we propose to use the GAK for RVLBP and deep CNN features, respectively for explicitly measuring the similarity of two group-level images. Specifically, we exploit the global weight sort scheme to sort the face images from group-level image according to their spatial weights, making an efficient data structure to GAK. Lastly, we propose Multiple kernel learning based on three combination strategies for combining two respective GAKs based on RVLBP and deep CNN features, such that enhancing the discriminative ability of each GAK. Intensive experiments are performed on the challenging group-level happiness intensity database, namely HAPPEI. Our experimental results demonstrate that the proposed approach achieves promising performance for group happiness intensity analysis, when compared with the recent state-of-the-art methods.

Click to Read Paper and Get Code
In this paper, we investigate the cross-database micro-expression recognition problem, where the training and testing samples are from two different micro-expression databases. Under this setting, the training and testing samples would have different feature distributions and hence the performance of most existing micro-expression recognition methods may decrease greatly. To solve this problem, we propose a simple yet effective method called Target Sample Re-Generator (TSRG) in this paper. By using TSRG, we are able to re-generate the samples from target micro-expression database and the re-generated target samples would share same or similar feature distributions with the original source samples. For this reason, we can then use the classifier learned based on the labeled source samples to accurately predict the micro-expression categories of the unlabeled target samples. To evaluate the performance of the proposed TSRG method, extensive cross-database micro-expression recognition experiments designed based on SMIC and CASME II databases are conducted. Compared with recent state-of-the-art cross-database emotion recognition methods, the proposed TSRG achieves more promising results.

* To appear at ACM Multimedia 2017
Click to Read Paper and Get Code
In typical neural machine translation~(NMT), the decoder generates a sentence word by word, packing all linguistic granularities in the same time-scale of RNN. In this paper, we propose a new type of decoder for NMT, which splits the decode state into two parts and updates them in two different time-scales. Specifically, we first predict a chunk time-scale state for phrasal modeling, on top of which multiple word time-scale states are generated. In this way, the target sentence is translated hierarchically from chunks to words, with information in different granularities being leveraged. Experiments show that our proposed model significantly improves the translation performance over the state-of-the-art NMT model.

* Accepted as a short paper by ACL 2017
Click to Read Paper and Get Code
Recently, there are increasing interests in inferring mirco-expression from facial image sequences. Due to subtle facial movement of micro-expressions, feature extraction has become an important and critical issue for spontaneous facial micro-expression recognition. Recent works usually used spatiotemporal local binary pattern for micro-expression analysis. However, the commonly used spatiotemporal local binary pattern considers dynamic texture information to represent face images while misses the shape attribute of face images. On the other hand, their works extracted the spatiotemporal features from the global face regions, which ignore the discriminative information between two micro-expression classes. The above-mentioned problems seriously limit the application of spatiotemporal local binary pattern on micro-expression recognition. In this paper, we propose a discriminative spatiotemporal local binary pattern based on an improved integral projection to resolve the problems of spatiotemporal local binary pattern for micro-expression recognition. Firstly, we develop an improved integral projection for preserving the shape attribute of micro-expressions. Furthermore, an improved integral projection is incorporated with local binary pattern operators across spatial and temporal domains. Specifically, we extract the novel spatiotemporal features incorporating shape attributes into spatiotemporal texture features. For increasing the discrimination of micro-expressions, we propose a new feature selection based on Laplacian method to extract the discriminative information for facial micro-expression recognition. Intensive experiments are conducted on three availably published micro-expression databases. We compare our method with the state-of-the-art algorithms. Experimental results demonstrate that our proposed method achieves promising performance for micro-expression recognition.

* 13pages, 8 figures, 5 tables, submitted to IEEE Transactions on Image Processing
Click to Read Paper and Get Code
Micro-expressions (MEs) are rapid, involuntary facial expressions which reveal emotions that people do not intend to show. Studying MEs is valuable as recognizing them has many important applications, particularly in forensic science and psychotherapy. However, analyzing spontaneous MEs is very challenging due to their short duration and low intensity. Automatic ME analysis includes two tasks: ME spotting and ME recognition. For ME spotting, previous studies have focused on posed rather than spontaneous videos. For ME recognition, the performance of previous studies is low. To address these challenges, we make the following contributions: (i)We propose the first method for spotting spontaneous MEs in long videos (by exploiting feature difference contrast). This method is training free and works on arbitrary unseen videos. (ii)We present an advanced ME recognition framework, which outperforms previous work by a large margin on two challenging spontaneous ME databases (SMIC and CASMEII). (iii)We propose the first automatic ME analysis system (MESR), which can spot and recognize MEs from spontaneous video data. Finally, we show our method outperforms humans in the ME recognition task by a large margin, and achieves comparable performance to humans at the very challenging task of spotting and then recognizing spontaneous MEs.

Click to Read Paper and Get Code
Millions of images on the web enable us to explore images from social events such as a family party, thus it is of interest to understand and model the affect exhibited by a group of people in images. But analysis of the affect expressed by multiple people is challenging due to varied indoor and outdoor settings, and interactions taking place between various numbers of people. A few existing works on Group-level Emotion Recognition (GER) have investigated on face-level information. Due to the challenging environments, face may not provide enough information to GER. Relatively few studies have investigated multi-modal GER. Therefore, we propose a novel multi-modal approach based on a new feature description for understanding emotional state of a group of people in an image. In this paper, we firstly exploit three kinds of rich information containing face, upperbody and scene in a group-level image. Furthermore, in order to integrate multiple person's information in a group-level image, we propose an information aggregation method to generate three features for face, upperbody and scene, respectively. We fuse face, upperbody and scene information for robustness of GER against the challenging environments. Intensive experiments are performed on two challenging group-level emotion databases to investigate the role of face, upperbody and scene as well as multi-modal framework. Experimental results demonstrate that our framework achieves very promising performance for GER.

* 11 pages. Submitted to the IEEE Transactions on Cybernetics
Click to Read Paper and Get Code
Sparse coding has achieved a great success in various image processing studies. However, there is not any benchmark to measure the sparsity of image patch/group because sparse discriminant conditions cannot keep unchanged. This paper analyzes the sparsity of group based on the strategy of the rank minimization. Firstly, an adaptive dictionary for each group is designed. Then, we prove that group-based sparse coding is equivalent to the rank minimization problem, and thus the sparse coefficient of each group is measured by estimating the singular values of each group. Based on that measurement, the weighted Schatten $p$-norm minimization (WSNM) has been found to be the closest solution to the real singular values of each group. Thus, WSNM can be equivalently transformed into a non-convex $\ell_p$-norm minimization problem in group-based sparse coding. To make the proposed scheme tractable and robust, the alternating direction method of multipliers (ADMM) is used to solve the $\ell_p$-norm minimization problem. Experimental results on two applications: image inpainting and image compressive sensing (CS) recovery have shown that the proposed scheme outperforms many state-of-the-art methods.

* arXiv admin note: text overlap with arXiv:1702.04463
Click to Read Paper and Get Code
Group sparsity has shown great potential in various low-level vision tasks (e.g, image denoising, deblurring and inpainting). In this paper, we propose a new prior model for image denoising via group sparsity residual constraint (GSRC). To enhance the performance of group sparse-based image denoising, the concept of group sparsity residual is proposed, and thus, the problem of image denoising is translated into one that reduces the group sparsity residual. To reduce the residual, we first obtain some good estimation of the group sparse coefficients of the original image by the first-pass estimation of noisy image, and then centralize the group sparse coefficients of noisy image to the estimation. Experimental results have demonstrated that the proposed method not only outperforms many state-of-the-art denoising methods such as BM3D and WNNM, but results in a faster speed.

Click to Read Paper and Get Code
In recent years, a growing body of research has focused on the problem of person re-identification (re-id). The re-id techniques attempt to match the images of pedestrians from disjoint non-overlapping camera views. A major challenge of re-id is the serious intra-class variations caused by changing viewpoints. To overcome this challenge, we propose a deep neural network-based framework which utilizes the view information in the feature extraction stage. The proposed framework learns a view-specific network for each camera view with a cross-view Euclidean constraint (CV-EC) and a cross-view center loss (CV-CL). We utilize CV-EC to decrease the margin of the features between diverse views and extend the center loss metric to a view-specific version to better adapt the re-id problem. Moreover, we propose an iterative algorithm to optimize the parameters of the view-specific networks from coarse to fine. The experiments demonstrate that our approach significantly improves the performance of the existing deep networks and outperforms the state-of-the-art methods on the VIPeR, CUHK01, CUHK03, SYSU-mReId, and Market-1501 benchmarks.

* Feng Z, Lai J, Xie X. Learning View-Specific Deep Networks for Person Re-Identification[J]. IEEE Transactions on Image Processing, 2018
* 12 pages, 8 figures, accepted by IEEE Transactions on image processing
Click to Read Paper and Get Code
We present a novel method of integrating motion and appearance cues for foreground object segmentation in unconstrained videos. Unlike conventional methods encoding motion and appearance patterns individually, our method puts particular emphasis on their mutual assistance. Specifically, we propose using an interactively constrained encoding (ICE) scheme to incorporate motion and appearance patterns into a graph that leads to a spatiotemporal energy optimization. The reason of utilizing ICE is that both motion and appearance cues for the same target share underlying correlative structure, thus can be exploited in a deeply collaborative manner. We perform ICE not only in the initialization but also in the refinement stage of a two-layer framework for object segmentation. This scheme allows our method to consistently capture structural patterns about object perceptions throughout the whole framework. Our method can be operated on superpixels instead of raw pixels to reduce the number of graph nodes by two orders of magnitude. Moreover, we propose to partially explore the multi-object localization problem with inter-occlusion by weighted bipartite graph matching. Comprehensive experiments on three benchmark datasets (i.e., SegTrack, MOViCS, and GaTech) demonstrate the effectiveness of our approach compared with extensive state-of-the-art methods.

* 11 pages, 7 figures
Click to Read Paper and Get Code
Modeling the underlying person structure for person re-identification (re-ID) is difficult due to diverse deformable poses, changeable camera views and imperfect person detectors. How to exploit underlying person structure information without extra annotations to improve the performance of person re-ID remains largely unexplored. To address this problem, we propose a novel Relative Local Distance (RLD) method that integrates a relative local distance constraint into convolutional neural networks (CNNs) in an end-to-end way. It is the first time that the relative local constraint is proposed to guide the global feature representation learning. Specially, a relative local distance matrix is computed by using feature maps and then regarded as a regularizer to guide CNNs to learn a structure-aware feature representation. With the discovered underlying person structure, the RLD method builds a bridge between the global and local feature representation and thus improves the capacity of feature representation for person re-ID. Furthermore, RLD also significantly accelerates deep network training compared with conventional methods. The experimental results show the effectiveness of RLD on the CUHK03, Market-1501, and DukeMTMC-reID datasets. Code is available at \url{https://github.com/Wanggcong/RLD_codes}.

* 9 pages
Click to Read Paper and Get Code
We present a learning model that makes full use of boundary information for salient object segmentation. Specifically, we come up with a novel loss function, i.e., Contour Loss, which leverages object contours to guide models to perceive salient object boundaries. Such a boundary-aware network can learn boundary-wise distinctions between salient objects and background, hence effectively facilitating the saliency detection. Yet the Contour Loss emphasizes on the local saliency. We further propose the hierarchical global attention module (HGAM), which forces the model hierarchically attend to global contexts, thus captures the global visual saliency. Comprehensive experiments on six benchmark datasets show that our method achieves superior performance over state-of-the-art ones. Moreover, our model has a real-time speed of 26 fps on a TITAN X GPU.

* 12 pages, 7 figures
Click to Read Paper and Get Code