Research papers and code for "Dacheng Tao":
Slow Feature Analysis (SFA) extracts slowly varying features from a quickly varying input signal. It has been successfully applied to modeling the visual receptive fields of the cortical neurons. Sufficient experimental results in neuroscience suggest that the temporal slowness principle is a general learning principle in visual perception. In this paper, we introduce the SFA framework to the problem of human action recognition by incorporating the discriminative information with SFA learning and considering the spatial relationship of body parts. In particular, we consider four kinds of SFA learning strategies, including the original unsupervised SFA (U-SFA), the supervised SFA (S-SFA), the discriminative SFA (D-SFA), and the spatial discriminative SFA (SD-SFA), to extract slow feature functions from a large amount of training cuboids which are obtained by random sampling in motion boundaries. Afterward, to represent action sequences, the squared first order temporal derivatives are accumulated over all transformed cuboids into one feature vector, which is termed the Accumulated Squared Derivative (ASD) feature. The ASD feature encodes the statistical distribution of slow features in an action sequence. Finally, a linear support vector machine (SVM) is trained to classify actions represented by ASD features. We conduct extensive experiments, including two sets of control experiments, two sets of large scale experiments on the KTH and Weizmann databases, and two sets of experiments on the CASIA and UT-interaction databases, to demonstrate the effectiveness of SFA for human action recognition.

* IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 34, NO. 3, MARCH 2012
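
The unsupervised SFA step and the ASD accumulation described in the abstract above can be illustrated with a minimal NumPy sketch. It assumes plain linear SFA (whitening followed by an eigendecomposition of the derivative covariance) on a toy two-channel signal. The function names `linear_sfa` and `asd_feature` are hypothetical, and the sketch does not reproduce the supervised, discriminative, or spatial variants, nor the cuboid sampling used in the paper.

```python
import numpy as np

def linear_sfa(x, n_features):
    """Linear SFA sketch: return (mean, W) so that y = (x - mean) @ W is slow."""
    mean = x.mean(axis=0)
    xc = x - mean
    # Whiten the input so that every projection has unit variance.
    cov = xc.T @ xc / len(xc)
    evals, evecs = np.linalg.eigh(cov)
    whitener = evecs / np.sqrt(evals)
    z = xc @ whitener
    # The slowest directions have the smallest derivative variance.
    dz = np.diff(z, axis=0)
    dcov = dz.T @ dz / len(dz)
    _, dvecs = np.linalg.eigh(dcov)  # eigh returns eigenvalues in ascending order
    return mean, whitener @ dvecs[:, :n_features]

def asd_feature(cuboids, mean, W):
    """Accumulate squared first-order temporal derivatives over all cuboids."""
    asd = np.zeros(W.shape[1])
    for c in cuboids:  # each cuboid is assumed flattened to (time, input_dim)
        y = (c - mean) @ W
        asd += (np.diff(y, axis=0) ** 2).sum(axis=0)
    return asd

# Toy data (an assumption for illustration): a slow and a fast sinusoid, linearly mixed.
t = np.linspace(0, 2 * np.pi, 2000)
sources = np.stack([np.sin(t), np.sin(11 * t)], axis=1)
x = sources @ np.array([[1.0, 0.5], [0.7, 1.2]])

mean, W = linear_sfa(x, n_features=1)
slow = ((x - mean) @ W)[:, 0]
asd = asd_feature([x[:1000], x[1000:]], mean, W)
```

On this toy signal the extracted feature should track the slow sinusoid up to sign and scale. In the pipeline the abstract describes, one ASD vector per action sequence would then be passed to a linear SVM.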
Single image dehazing is a critical image pre-processing step for subsequent high-level computer vision tasks. However, it remains challenging due to its ill-posed nature. Existing dehazing models tend to suffer from model overcomplexity and computational inefficiency or have limited representation capacity. To tackle these challenges, here we propose a fast and accurate multi-scale end-to-end dehazing network called FAMED-Net, which comprises encoders at three scales and a fusion module to efficiently and directly learn the haze-free image. Each encoder consists of cascaded and densely connected point-wise convolutional layers and pooling layers. Since no larger convolutional kernels are used and features are reused layer-by-layer, FAMED-Net is lightweight and computationally efficient. Thorough empirical studies on public synthetic datasets (including RESIDE) and real-world hazy images demonstrate the superiority of FAMED-Net over other representative state-of-the-art models with respect to model complexity, computational efficiency, restoration accuracy, and cross-set generalization. The code will be made publicly available.

* 13 pages, 9 figures, To appear in IEEE Transactions on Image Processing. The code is available at https://github.com/chaimi2013/FAMED-Net
This paper describes the University of Sydney's submission to the WMT 2019 shared news translation task. We participated in the Finnish$\rightarrow$English direction and achieved the best BLEU score (33.0) among all participants. Our system is based on self-attentional Transformer networks, into which we integrated the most recent effective strategies from academic research (e.g., BPE, back translation, multi-features data selection, data augmentation, greedy model ensemble, reranking, ConMBR system combination, and post-processing). Furthermore, we propose a novel augmentation method, $Cycle Translation$, and a data mixture strategy, $Big$/$Small$ parallel construction, to fully exploit the synthetic corpus. Extensive experiments show that adding the above techniques yields continuous improvements in BLEU, and the best result outperforms the baseline (a Transformer ensemble model trained with the original parallel corpus) by approximately 5.3 BLEU points, achieving state-of-the-art performance.

* To appear in WMT2019
The rapid development of computer hardware and Internet technology makes large scale data dependent models computationally tractable, and opens a bright avenue for annotating images through innovative machine learning algorithms. Semi-supervised learning (SSL) has consequently received intensive attention in recent years and has been successfully deployed in image annotation. One representative work in SSL is Laplacian regularization (LR), which smooths the conditional distribution for classification along the manifold encoded in the graph Laplacian. However, it has been observed that LR biases the classification function towards a constant function, which possibly results in poor generalization. In addition, LR is developed to handle uniformly distributed data (or single view data), although instances or objects, such as images and videos, are usually represented by multiview features, such as color, shape and texture. In this paper, we present multiview Hessian regularization (mHR) to address the above two problems in LR-based image annotation. In particular, mHR optimally combines multiple Hessian regularizations, each of which is obtained from a particular view of instances, and steers the classification function to vary linearly along the data manifold. We apply mHR to kernel least squares and support vector machines as two examples for image annotation. Extensive experiments on the PASCAL VOC'07 dataset validate the effectiveness of mHR by comparing it with baseline algorithms, including LR and HR.

* IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2676 - 2687, 2013
Hashing methods have been widely investigated for fast approximate nearest neighbor searching in large data sets. Most existing methods use binary vectors in lower dimensional spaces to represent data points that are usually real vectors of higher dimensionality. We divide the hashing process into two steps. Data points are first embedded in a low-dimensional space, and the global positioning system method is subsequently introduced but modified for binary embedding. We devise data-independent and data-dependent methods to distribute the satellites at appropriate locations. Our methods are based on finding the tradeoff between the information losses in these two steps. Experiments show that our data-dependent method outperforms other methods in different-sized data sets from 100k to 10M. By incorporating the orthogonality of the code matrix, both our data-independent and data-dependent methods are particularly impressive in experiments on longer bits.

Blur in facial images significantly impedes the efficiency of recognition approaches. However, most existing blind deconvolution methods cannot generate satisfactory results due to their dependence on strong edges, which are sufficient in natural images but not in facial images. In this paper, we represent point spread functions (PSFs) by the linear combination of a set of pre-defined orthogonal PSFs, and similarly, an estimated intrinsic (EI) sharp face image is represented by the linear combination of a set of pre-defined orthogonal face images. In doing so, PSF and EI estimation is simplified to discovering two sets of linear combination coefficients, which are simultaneously found by our proposed coupled learning algorithm. To make our method robust to different types of blurry face images, we generate several candidate PSFs and EIs for a test image, and then, a non-blind deconvolution method is adopted to generate more EIs by those candidate PSFs. Finally, we deploy a blind image quality assessment metric to automatically select the optimal EI. Thorough experiments on the facial recognition technology database, extended Yale face database B, CMU pose, illumination, and expression (PIE) database, and face recognition grand challenge database version 2.0 demonstrate that the proposed approach effectively restores intrinsic sharp face images and, consequently, improves the performance of face recognition.

Multiple sclerosis (MS) is an inflammatory demyelinating disease of the central nervous system (CNS) that results in focal injury to the grey and white matter. The presence of white matter lesions biases morphometric analyses such as registration, individual longitudinal measurements and tissue segmentation for brain volume measurements. Lesion-inpainting with intensities derived from surrounding healthy tissue represents one approach to alleviate such problems. However, existing methods inpaint lesions based on texture information derived from local surrounding tissue, often leading to inconsistent inpainting and the generation of artifacts such as intensity discrepancy and blurriness. Based on these observations, we propose non-local partial convolutions (NLPC), which integrate a Unet-like network with the non-local module. The non-local module is exploited to capture long range dependencies between the lesion area and remaining normal-appearing brain regions. Then, the lesion area is filled by referring to normal-appearing regions with more similar features. This method generates inpainted regions that appear more realistic and natural. Our quantitative experimental results also demonstrate the superiority of this technique over existing state-of-the-art inpainting methods.

Face detection is essential to facial analysis tasks such as facial reenactment and face recognition. Both cascade face detectors and anchor-based face detectors have translated shining demos into practice and received intensive attention from the community. However, cascade face detectors often suffer from a low detection accuracy, while anchor-based face detectors rely heavily on very large networks pre-trained on large scale image classification datasets such as ImageNet [1], which is computationally inefficient for both training and deployment. In this paper, we devise an efficient anchor-based cascade framework called anchor cascade. To improve the detection accuracy by exploring contextual information, we further propose a context pyramid maxout mechanism for anchor cascade. As a result, anchor cascade can train very efficient face detection models with a high detection accuracy. Specifically, compared with a popular CNN-based cascade face detector MTCNN [2], our anchor cascade face detector greatly improves the detection accuracy, e.g., from 0.9435 to 0.9704 at 1k false positives on FDDB, while it still runs at a comparable speed. Experimental results on two widely used face detection benchmarks, FDDB and WIDER FACE, demonstrate the effectiveness of the proposed framework.

This paper explores the non-convex composition optimization in the form including inner and outer finite-sum functions with a large number of component functions. This problem arises in some important applications such as nonlinear embedding and reinforcement learning. Although existing approaches such as stochastic gradient descent (SGD) and stochastic variance reduced gradient (SVRG) descent can be applied to solve this problem, their query complexity tends to be high, especially when the number of inner component functions is large. In this paper, we apply the variance-reduced technique to derive two variance reduced algorithms that significantly improve the query complexity if the number of inner component functions is large. To the best of our knowledge, this is the first work that establishes the query complexity analysis for non-convex stochastic composition. Experiments validate the proposed algorithms and theoretical analysis.

We consider the composition optimization with two expected-value functions in the form of $\frac{1}{n}\sum\nolimits_{i = 1}^n F_i(\frac{1}{m}\sum\nolimits_{j = 1}^m G_j(x))+R(x)$, which formulates many important problems in statistical learning and machine learning such as solving Bellman equations in reinforcement learning and nonlinear embedding. Full Gradient or classical stochastic gradient descent based optimization algorithms are unsuitable or computationally expensive to solve this problem due to the inner expectation $\frac{1}{m}\sum\nolimits_{j = 1}^m G_j(x)$. We propose a duality-free based stochastic composition method that combines variance reduction methods to address the stochastic composition problem. We apply SVRG and SAGA based methods to estimate the inner function, and duality-free method to estimate the outer function. We prove the linear convergence rate not only for the convex composition problem, but also for the case that the individual outer functions are non-convex while the objective function is strongly-convex. We also provide the results of experiments that show the effectiveness of our proposed methods.
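
Why the inner expectation is costly can be seen from the chain rule. The following is a short worked step, assuming every $F_i$ and $G_j$ is differentiable and treating $R(x)$ separately; the shorthand $G(x)$ and $\partial G(x)$ (the Jacobian of $G$) is introduced here for illustration and is not notation from the abstract.

```latex
G(x) = \frac{1}{m}\sum_{j=1}^{m} G_j(x), \qquad
\partial G(x) = \frac{1}{m}\sum_{j=1}^{m} \partial G_j(x),
\qquad
\nabla\!\left[\frac{1}{n}\sum_{i=1}^{n} F_i\big(G(x)\big)\right]
  = \big(\partial G(x)\big)^{\top}\,\frac{1}{n}\sum_{i=1}^{n} \nabla F_i\big(G(x)\big).
```

Evaluating $\nabla F_i$ at the exact $G(x)$ requires a pass over all $m$ inner components. Replacing $G(x)$ with a single sampled $G_j(x)$ generally gives a biased gradient estimate, because $\nabla F_i$ is nonlinear in its argument, which is the kind of difficulty that variance-reduced estimates of the inner function are meant to address.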

Human faces in surveillance videos often suffer from severe image blur, dramatic pose variations, and occlusion. In this paper, we propose a comprehensive framework based on Convolutional Neural Networks (CNN) to overcome challenges in video-based face recognition (VFR). First, to learn blur-robust face representations, we artificially blur training data composed of clear still images to account for a shortfall in real-world video training data. Using training data composed of both still images and artificially blurred data, CNN is encouraged to learn blur-insensitive features automatically. Second, to enhance robustness of CNN features to pose variations and occlusion, we propose a Trunk-Branch Ensemble CNN model (TBE-CNN), which extracts complementary information from holistic face images and patches cropped around facial components. TBE-CNN is an end-to-end model that extracts features efficiently by sharing the low- and middle-level convolutional layers between the trunk and branch networks. Third, to further promote the discriminative power of the representations learnt by TBE-CNN, we propose an improved triplet loss function. Systematic experiments justify the effectiveness of the proposed techniques. Most impressively, TBE-CNN achieves state-of-the-art performance on three popular video face databases: PaSC, COX Face, and YouTube Faces. With the proposed techniques, we also obtain the first place in the BTAS 2016 Video Person Recognition Evaluation.

* Accepted Version to IEEE T-PAMI
A well-designed fine-grained categorization system usually has three contradictory requirements: accuracy (the ability to identify objects among subordinate categories); interpretability (the ability to provide human-understandable explanation of recognition system behavior); and efficiency (the speed of the system). To handle the trade-off between accuracy and interpretability, we propose a novel "Deeper Part-Stacked CNN" architecture armed with interpretability by modeling subtle differences between object parts. The proposed architecture consists of a part localization network, a two-stream classification network that simultaneously encodes object-level and part-level cues, and a feature vector fusion component. Specifically, the part localization network is implemented by exploring a new paradigm for key point localization that first samples a small number of representable pixels and then determines their labels via a convolutional layer followed by a softmax layer. We also use a cropping layer to extract part features and propose a scale mean-max layer for feature fusion learning. Experimentally, our proposed method outperforms state-of-the-art approaches in both the part localization task and the classification task on Caltech-UCSD Birds-200-2011. Moreover, by adopting a set of sharing strategies between the computation of multiple object parts, our single model is fairly efficient, running at 32 frames/sec.

* arXiv admin note: text overlap with arXiv:1512.08086
Here we study non-convex composite optimization: first, a finite-sum of smooth but non-convex functions, and second, a general function that admits a simple proximal mapping. Most research on stochastic methods for composite optimization assumes convexity or strong convexity of each function. In this paper, we extend this problem into the non-convex setting using variance reduction techniques, such as prox-SVRG and prox-SAGA. We prove that, with a constant step size, both prox-SVRG and prox-SAGA are suitable for non-convex composite optimization, and converge to a stationary point within $O(1/\epsilon)$ iterations. That is similar to the convergence rate seen with the state-of-the-art RSAG method and faster than stochastic gradient descent. Our analysis is also extended into the mini-batch setting, which linearly accelerates the convergence. To the best of our knowledge, this is the first analysis of the convergence rate of variance-reduced proximal stochastic gradient for non-convex composite optimization.

* This paper has been withdrawn by the author due to an error in the proof of the convergence rate. They will modify this proof as soon as possible
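
The prox-SVRG update named in the abstract above can be sketched as follows. This is a minimal illustration on a convex toy problem (least squares with an L1 penalty, whose proximal mapping is soft-thresholding), swapped in for the non-convex setting the paper analyzes. The step size, epoch count, and all function names are assumptions chosen for the example and carry no convergence guarantee from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(A, b, w, lam):
    """Finite-sum least squares plus an L1 penalty."""
    return 0.5 * np.mean((A @ w - b) ** 2) + lam * np.abs(w).sum()

def prox_svrg(A, b, lam, step, epochs, rng):
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(epochs):
        # Snapshot point and its full gradient, recomputed once per epoch.
        snapshot = w.copy()
        full_grad = A.T @ (A @ snapshot - b) / n
        for _ in range(n):
            i = rng.integers(n)
            g_w = A[i] * (A[i] @ w - b[i])
            g_snap = A[i] * (A[i] @ snapshot - b[i])
            # Variance-reduced gradient estimate, then a proximal step.
            v = g_w - g_snap + full_grad
            w = soft_threshold(w - step * v, step * lam)
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
b = A @ w_true + 0.01 * rng.standard_normal(200)

lam = 0.01
step = 0.1 / np.max(np.sum(A ** 2, axis=1))  # conservative constant step size
w_hat = prox_svrg(A, b, lam, step, epochs=20, rng=rng)
```

Comparing `objective(A, b, w_hat, lam)` with the value at the zero vector gives a quick sanity check that the iterates make progress. A mini-batch variant would replace the single index `i` with a small random batch and average the two per-sample gradients over it.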
The capacity to recognize faces under varied poses is a fundamental human ability that presents a unique challenge for computer vision systems. Compared to frontal face recognition, which has been intensively studied and has gradually matured in the past few decades, pose-invariant face recognition (PIFR) remains a largely unsolved problem. However, PIFR is crucial to realizing the full potential of face recognition for real-world applications, since face recognition is intrinsically a passive biometric technology for recognizing uncooperative subjects. In this paper, we discuss the inherent difficulties in PIFR and present a comprehensive review of established techniques. Existing PIFR methods can be grouped into four categories, i.e., pose-robust feature extraction approaches, multi-view subspace learning approaches, face synthesis approaches, and hybrid approaches. The motivations, strategies, pros/cons, and performance of representative approaches are described and compared. Moreover, promising directions for future research are discussed.

* final version, ACM Transactions on Intelligent Systems and Technology, 2016
Face images appearing in multimedia applications, e.g., social networks and digital entertainment, usually exhibit dramatic pose, illumination, and expression variations, resulting in considerable performance degradation for traditional face recognition algorithms. This paper proposes a comprehensive deep learning framework to jointly learn face representation using multimodal information. The proposed deep learning structure is composed of a set of elaborately designed convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE). The set of CNNs extracts complementary facial features from multimodal data. Then, the extracted features are concatenated to form a high-dimensional feature vector, whose dimension is compressed by SAE. All the CNNs are trained using a subset of 9,000 subjects from the publicly available CASIA-WebFace database, which ensures the reproducibility of this work. Using the proposed single CNN architecture and limited training data, a 98.43% verification rate is achieved on the LFW database. Benefiting from the complementary information contained in multimodal data, our small ensemble system achieves a recognition rate higher than 99.0% on LFW using a publicly available training set.

* To appear in IEEE Trans. Multimedia
In this paper, we study a classification problem in which sample labels are randomly corrupted. In this scenario, there is an unobservable sample with noise-free labels. However, before being observed, the true labels are independently flipped with a probability $\rho\in[0,0.5)$, and the random label noise can be class-conditional. Here, we address two fundamental problems raised by this scenario. The first is how to best use the abundant surrogate loss functions designed for the traditional classification problem when there is label noise. We prove that any surrogate loss function can be used for classification with noisy labels by using importance reweighting, with consistency assurance that the label noise does not ultimately hinder the search for the optimal classifier of the noise-free sample. The other is the open problem of how to obtain the noise rate $\rho$. We show that the rate is upper bounded by the conditional probability $P(y|x)$ of the noisy sample. Consequently, the rate can be estimated, because the upper bound can be easily reached in classification problems. Experimental results on synthetic and real datasets confirm the efficiency of our methods.
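
The reweighting idea and the noise-rate bound can be made concrete with a small sketch. It is derived here from the flip model stated in the abstract (labels flipped independently with class-conditional rates) and is not taken from the paper's text: if $P_\rho(y|x)$ denotes the noisy posterior, then $P_\rho(y|x) = \rho_{-y} + (1-\rho_{+1}-\rho_{-1})P(y|x)$, so the clean posterior and the weight $P(y|x)/P_\rho(y|x)$ follow by inversion, and the minimum of $P_\rho(y|x)$ over $x$ reaches $\rho_{-y}$ wherever $P(y|x)=0$. The function and parameter names are hypothetical.

```python
import numpy as np

def clean_posterior(p_noisy, rho_other, rho_own):
    """Invert the flip model to get P(y|x) from the noisy posterior.

    rho_own:   probability that a true label y is flipped away from y.
    rho_other: probability that the opposite true label is flipped to y.
    """
    return (p_noisy - rho_other) / (1.0 - rho_own - rho_other)

def importance_weight(p_noisy, rho_other, rho_own):
    """Weight P(y|x) / P_rho(y|x) applied to the surrogate loss of a noisy example."""
    return clean_posterior(p_noisy, rho_other, rho_own) / p_noisy

def estimate_flip_rate(p_noisy_values):
    """Estimate rho_other as the smallest noisy posterior observed for label y."""
    return float(np.min(p_noisy_values))

# Toy check with assumed rates: true +1 flips with 0.1, true -1 flips with 0.2.
rho_plus, rho_minus = 0.1, 0.2
p_clean = np.linspace(0.0, 1.0, 11)                      # P(y=+1 | x) on a grid
p_noisy = (1 - rho_plus) * p_clean + rho_minus * (1 - p_clean)

recovered = clean_posterior(p_noisy, rho_other=rho_minus, rho_own=rho_plus)
weights = importance_weight(p_noisy, rho_other=rho_minus, rho_own=rho_plus)
rho_minus_hat = estimate_flip_rate(p_noisy)
```

In practice the noisy posterior would itself have to be estimated from data, for example with a probabilistic classifier or a density ratio estimator, so the quality of both the weights and the rate estimate depends on that step.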

This paper comprehensively reviews the recent development of image deblurring, including non-blind/blind, spatially invariant/variant deblurring techniques. Indeed, these techniques share the same objective of inferring a latent sharp image from one or several corresponding blurry images, while the blind deblurring techniques are also required to derive an accurate blur kernel. Considering the critical role of image restoration in modern imaging systems to provide high-quality images under complex environments such as motion, undesirable lighting conditions, and imperfect system components, image deblurring has attracted growing attention in recent years. From the viewpoint of how to handle the ill-posedness, which is a crucial issue in deblurring tasks, existing methods can be grouped into five categories: Bayesian inference framework, variational methods, sparse representation-based methods, homography-based modeling, and region-based methods. In spite of achieving a certain level of development, image deblurring, especially the blind case, is limited in its success by complex application conditions which make the blur kernel spatially variant and hard to obtain. We provide a holistic understanding and deep insight into image deblurring in this review. An analysis of the empirical evidence for representative methods, practical issues, as well as a discussion of promising future directions are also presented.

* 53 pages, 17 figures
Editing faces in videos is a popular yet challenging aspect of computer vision and graphics, which encompasses several applications including facial attractiveness enhancement, makeup transfer, face replacement, and expression manipulation. Simply applying image-based warping algorithms to video-based face editing produces temporal incoherence in the synthesized videos because it is impossible to consistently localize facial features in two frames representing two different faces in two different videos (or even two consecutive frames representing the same face in one video). Therefore, high performance face editing usually requires significant manual manipulation. In this paper we propose a novel temporal-spatial-smooth warping (TSSW) algorithm to effectively exploit the temporal information in two consecutive frames, as well as the spatial smoothness within each frame. TSSW precisely estimates two control lattices in the horizontal and vertical directions respectively from the corresponding control lattices in the previous frame, by minimizing a novel energy function that unifies a data-driven term, a smoothness term, and feature point constraints. Corresponding warping surfaces then precisely map source frames to the target frames. Experimental testing on facial attractiveness enhancement, makeup transfer, face replacement, and expression manipulation demonstrates that the proposed approaches can effectively preserve spatial smoothness and temporal coherence in editing facial geometry, skin detail, identity, and expression, outperforming existing face editing methods. In particular, TSSW is robust to subtly inaccurate localization of feature points and is a vast improvement over image-based warping methods.

Learning big data by matrix decomposition always suffers from expensive computation, mixing of complicated structures and noise. In this paper, we study more adaptive models and efficient algorithms that decompose a data matrix as the sum of semantic components with incoherent structures. We first introduce "GO decomposition (GoDec)", an alternating projection method estimating the low-rank part $L$ and the sparse part $S$ from data matrix $X=L+S+G$ corrupted by noise $G$. Two acceleration strategies are proposed to obtain a scalable unmixing algorithm on big data: 1) Bilateral random projection (BRP) is developed to speed up the update of $L$ in GoDec by a closed-form built from left and right random projections of $X-S$ in lower dimensions; 2) Greedy bilateral (GreB) paradigm updates the left and right factors of $L$ in a mutually adaptive and greedy incremental manner, and achieves significant improvement in both time and sample complexities. Then we propose three nontrivial variants of GoDec that generalize GoDec to more general data types and whose fast algorithms can be derived from the two strategies......

* 42 pages, 5 figures, 4 tables, 5 algorithms
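
The alternating projection at the core of GoDec, as described in the abstract above, can be sketched in a few lines. This is a naive version that assumes a truncated SVD for the low-rank step and hard thresholding (keeping the largest-magnitude entries) for the sparse step. It does not implement the BRP or GreB accelerations, and the parameter names `rank` and `card` are hypothetical.

```python
import numpy as np

def godec(X, rank, card, n_iter=50):
    """Naive alternating projection for X = L + S + G.

    rank: target rank of L; card: maximum number of nonzeros in S.
    """
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    for _ in range(n_iter):
        # Low-rank step: best rank-`rank` approximation of X - S.
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse step: keep the `card` largest-magnitude entries of X - L.
        R = X - L
        S = np.zeros_like(X)
        idx = np.argpartition(np.abs(R).ravel(), -card)[-card:]
        S.flat[idx] = R.flat[idx]
    return L, S

# Toy data (an assumption for illustration): rank-2 matrix, 60 large outliers, small noise.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 30))
S0 = np.zeros((40, 30))
S0.flat[rng.choice(S0.size, size=60, replace=False)] = 10.0
X = L0 + S0 + 0.01 * rng.standard_normal((40, 30))

L, S = godec(X, rank=2, card=60)
residual = np.linalg.norm(X - L - S)
```

Each step is the exact minimizer of $\|X-L-S\|_F$ over one block with the other fixed, so the residual cannot increase from one iteration to the next. The full SVD in the low-rank step is the expensive part that the BRP strategy in the abstract is designed to replace.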