Sliced inverse regression (SIR) is a pioneering tool for supervised dimension reduction. It identifies the effective dimension reduction space, the subspace of significant factors with intrinsic lower dimensionality. In this paper, we propose to refine the SIR algorithm through an overlapping slicing scheme. The new algorithm, called overlapping sliced inverse regression (OSIR), estimates the effective dimension reduction space and determines the number of effective factors more accurately. We show that such an overlapping procedure has the potential to identify the information contained in the derivatives of the inverse regression curve, which helps to explain the superiority of OSIR. We also prove that the OSIR algorithm is $\sqrt{n}$-consistent and verify its effectiveness by simulations and real applications.
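The core computation can be sketched as follows: standard SIR with the slices allowed to overlap. This is a minimal illustration assuming a fixed slice width and overlap fraction, not the paper's exact estimator.

```python
# A minimal sketch of SIR with an overlapping slicing scheme.
# The slice count and overlap fraction are illustrative choices.
import numpy as np

def osir(X, y, n_slices=10, overlap=0.5, n_directions=2):
    n, p = X.shape
    # Standardize the predictors so Cov(Z) = I.
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    L = np.linalg.cholesky(np.linalg.inv(Sigma))
    Z = (X - mu) @ L
    # Sort by the response and form overlapping slices.
    order = np.argsort(y)
    width = n // n_slices
    step = max(1, int(width * (1 - overlap)))  # overlapping stride
    M = np.zeros((p, p))
    total = 0
    for start in range(0, n - width + 1, step):
        idx = order[start:start + width]
        m = Z[idx].mean(axis=0)           # slice mean of standardized X
        M += len(idx) * np.outer(m, m)    # weighted outer product
        total += len(idx)
    M /= total
    # Top eigenvectors of M span the estimated EDR space (in Z scale).
    vals, vecs = np.linalg.eigh(M)
    beta = L @ vecs[:, ::-1][:, :n_directions]  # map back to X scale
    return beta / np.linalg.norm(beta, axis=0)
```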

Existing temporal relation (TempRel) annotation schemes often have low inter-annotator agreement (IAA) even between experts, suggesting that the current annotation task needs a better definition. This paper proposes a new multi-axis modeling scheme to better capture the temporal structure of events. In addition, we identify event end-points as a major source of confusion in annotation, so we also propose to annotate TempRels based on start-points only. A pilot expert annotation using the proposed scheme shows a significant improvement in IAA, from the conventional 60s to the 80s (Cohen's Kappa). This better-defined annotation scheme further enables the use of crowdsourcing to alleviate the labor intensity for each annotator. We hope that this work can foster more interesting studies toward event understanding.

* [Final Version] 14 pages, accepted by ACL'18
Annotating temporal relations (TempRel) between events described in natural language is known to be labor intensive, partly because the total number of TempRels is quadratic in the number of events. As a result, only a small number of documents are typically annotated, limiting the coverage of various lexical/semantic phenomena. One way to improve existing approaches is to make use of the readily available, partially annotated data (P, as in partial) that cover more documents. However, missing annotations in P are known to hurt, rather than help, existing systems. This work is a case study exploring various usages of P for TempRel extraction. Results show that despite missing annotations, P is still a useful supervision signal for this task within a constrained bootstrapping learning framework. The system described in this paper is publicly available.

* [Final Version] short paper accepted by *SEM'18
Extracting temporal relations (before, after, overlapping, etc.) is a key aspect of understanding events described in natural language. We argue that this task would benefit from a resource that provides prior knowledge in the form of the temporal order that events usually follow. This paper develops such a resource -- a probabilistic knowledge base acquired in the news domain -- by extracting temporal relations between events from New York Times (NYT) articles over a 20-year span (1987--2007). We show that existing temporal extraction systems can be improved via this resource. As a byproduct, we also show that interesting statistics can be retrieved from this resource, which can potentially benefit other time-aware tasks. The proposed system and resource are both publicly available.
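A minimal sketch of how such a probabilistic prior can be built from extracted triples; the triple format and smoothing constant are illustrative assumptions, not the paper's exact construction.

```python
# Turn extracted (event1, event2, relation) triples into a smoothed
# prior over the typical temporal order of an event pair.
from collections import Counter, defaultdict

def build_prior(triples, alpha=1.0, relations=("before", "after")):
    counts = defaultdict(Counter)
    for e1, e2, rel in triples:
        counts[(e1, e2)][rel] += 1
    prior = {}
    for pair, c in counts.items():
        total = sum(c.values()) + alpha * len(relations)
        prior[pair] = {r: (c[r] + alpha) / total for r in relations}
    return prior

# Toy usage: the "before" probability should dominate for this pair.
prior = build_prior([("arrest", "convict", "before")] * 9 +
                    [("arrest", "convict", "after")])
print(prior[("arrest", "convict")])
```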

* 13 pages, 3 figures, accepted by NAACL'18
Text simplification (TS) can be viewed as a monolingual translation task, translating between text variations within a single language. Recent neural TS models draw on insights from neural machine translation to learn lexical simplification and content reduction using an encoder-decoder model. Unlike neural machine translation, however, TS lacks sufficient ordinary and simplified sentence pairs, which are expensive and time-consuming to build. Target-side simplified sentences play an important role in boosting fluency for statistical TS, and we investigate the use of simplified sentences for training, with no changes to the network architecture. We propose to pair each simple training sentence with a synthetic ordinary sentence via back-translation, and to treat this synthetic data as additional training data. We train the encoder-decoder model on both synthetic and original sentence pairs, obtaining substantial improvements on the available WikiLarge and WikiSmall data compared with state-of-the-art methods.
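A high-level sketch of the back-translation augmentation loop; the train_seq2seq and translate helpers are hypothetical stand-ins for any encoder-decoder toolkit, not the paper's implementation.

```python
# Augment TS training data by back-translating monolingual simple
# sentences into synthetic ordinary sentences.
def augment_with_back_translation(parallel_pairs, mono_simple,
                                  train_seq2seq, translate):
    # 1. Train a reverse model: simple -> ordinary.
    reverse_model = train_seq2seq(
        src=[simple for ordinary, simple in parallel_pairs],
        tgt=[ordinary for ordinary, simple in parallel_pairs])
    # 2. Back-translate each monolingual simple sentence into a
    #    synthetic ordinary sentence, forming new pairs.
    synthetic_pairs = [(translate(reverse_model, s), s) for s in mono_simple]
    # 3. The forward TS model is then trained on the union of original
    #    and synthetic pairs, with no architecture changes.
    return parallel_pairs + synthetic_pairs
```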

We propose an approach to reduce the bias of ridge regression and the regularization kernel network. When applied to a single data set, the new algorithms have learning performance comparable to the original ones. When applied to incremental learning with block-wise streaming data, the new algorithms are more efficient due to bias reduction. Both theoretical characterizations and simulation studies are used to verify the effectiveness of these new algorithms.
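One classical device for reducing ridge bias is to refit the ridge residuals and add the correction (sometimes called "twicing"). The sketch below illustrates that idea; it is not necessarily the exact correction proposed in the paper.

```python
# A minimal sketch of bias-reduced ridge regression via residual refitting.
import numpy as np

def ridge(X, y, lam):
    n, p = X.shape
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def bias_reduced_ridge(X, y, lam):
    beta = ridge(X, y, lam)           # first-stage ridge estimate
    residual = y - X @ beta           # part of the signal lost to shrinkage
    return beta + ridge(X, residual, lam)  # second-stage correction
```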

The developments of deep neural networks (DNNs) in recent years have ushered in a new era of artificial intelligence. DNNs have proved excellent at solving very complex problems, e.g., visual recognition and text understanding, to the extent of competing with or even surpassing humans. Despite the inspiring and encouraging success of DNNs, thorough theoretical analyses to unravel the mystery behind their success are still lacking. The design of DNN structure is dominated by empirical results in terms of network depth, number of neurons, and activations. A few remarkable works published recently in an attempt to interpret DNNs have offered the first glimpses of their internal mechanisms. Nevertheless, research on how DNNs operate is still at an initial stage with plenty of room for refinement. In this paper, we extend previous research on neural networks with piecewise linear activations (PLNNs) concerning bounds on the number of linear regions. We present (i) the exact maximal number of linear regions for single-layer PLNNs; (ii) an upper bound for multi-layer PLNNs; and (iii) a tighter upper bound for the maximal number of linear regions of rectifier networks. The derived bounds also indirectly explain why deep models are more powerful than their shallow counterparts, and how the non-linearity of activation functions impacts the expressiveness of networks.
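For intuition, the classical hyperplane-arrangement count (Zaslavsky's theorem) already gives the single-layer ReLU case: $n$ hyperplanes in general position partition $\mathbb{R}^d$ into $\sum_{j=0}^{d} \binom{n}{j}$ regions. The sketch below computes this baseline quantity, which the paper's multi-layer and general PLNN bounds refine.

```python
# Maximal number of linear regions of a single-layer ReLU network:
# each of the n units contributes one hyperplane in the input space.
from math import comb

def single_layer_relu_regions(n_units, input_dim):
    # sum_{j=0}^{d} C(n, j); comb returns 0 when j > n, so the sum is safe.
    return sum(comb(n_units, j) for j in range(input_dim + 1))

print(single_layer_relu_regions(n_units=3, input_dim=2))  # -> 7
```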

* Counting linear regions of neural networks
Multidimensional scaling is an important dimension reduction tool in statistics and machine learning. Yet few theoretical results characterizing its statistical performance exist, let alone any in high dimensions. By considering a unified framework that includes low, moderate, and high dimensions, we study multidimensional scaling in the setting of clustering noisy data. Our results suggest that, in order to achieve consistent estimation of the embedding scheme, classical multidimensional scaling needs to be modified, especially as the noise level increases. To this end, we propose {\it modified multidimensional scaling}, which applies a nonlinear transformation to the sample eigenvalues. The nonlinear transformation depends on the dimensionality, the sample size, and an unknown moment. We show that modified multidimensional scaling followed by various clustering algorithms can achieve exact recovery, i.e., all cluster labels can be recovered correctly with probability tending to one. Numerical simulations and two real data applications lend strong support to our proposed methodology. As a byproduct, we unify and improve existing results on the $\ell_{\infty}$ bound for eigenvectors under only bounded low-order moment conditions, which may be of independent interest.
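A minimal sketch of classical MDS with an eigenvalue transformation inserted before the embedding step; the hard-threshold-and-shrink rule below is an illustrative stand-in for the paper's moment-dependent nonlinear transformation.

```python
# Classical MDS (double-centering + eigendecomposition) followed by
# a simple eigenvalue shrinkage, given a pairwise distance matrix D.
import numpy as np

def modified_mds(D, k, tau=0.0):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    vals, vecs = vals[::-1], vecs[:, ::-1]   # sort eigenvalues descending
    shrunk = np.maximum(vals[:k] - tau, 0.0) # illustrative shrinkage step
    return vecs[:, :k] * np.sqrt(shrunk)     # k-dimensional embedding
```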

* 31 pages, 4 figures
Edge features contain important information about graphs. However, current state-of-the-art neural network models designed for graph learning do not consider incorporating edge features, especially multi-dimensional edge features. In this paper, we propose an attention mechanism which combines both node features and edge features. Guided by the edge features, the attention between a pair of graph nodes depends not only on the node contents, but also adjusts automatically with respect to the properties of the edge connecting these two nodes. Moreover, the edge features are adjusted by the attention function and fed to the next layer, which means our edge features are adaptive across network layers. As a result, our proposed adaptive edge-feature-guided graph attention model can consolidate a rich source of graph information that current state-of-the-art graph learning methods cannot. We apply our proposed model to graph node classification, and experimental results on three citation network datasets and a biological network dataset show that our method outperforms the current state-of-the-art methods, testifying to the discriminative capability of edge features and the effectiveness of our adaptive edge-feature-guided attention model. An additional ablation study further shows that both the edge features and the adaptiveness component contribute to our model.
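A minimal sketch of attention scores that condition on edge features; the shapes, the tanh scoring function, and the projections are illustrative assumptions and do not reproduce the paper's exact architecture.

```python
# Edge-aware attention: the score for pair (i, j) mixes projected node
# features with the projected features of the connecting edge.
import numpy as np

def edge_guided_attention(h, e, adj, W, U, a):
    """h: (N, F) node features; e: (N, N, P) edge features;
    adj: (N, N) 0/1 adjacency (each node assumed to have >= 1 neighbor);
    W, U: projection matrices; a: scoring vector."""
    N = h.shape[0]
    scores = np.full((N, N), -np.inf)
    for i in range(N):
        for j in range(N):
            if adj[i, j]:
                z = np.concatenate([W @ h[i], W @ h[j], U @ e[i, j]])
                scores[i, j] = np.tanh(a @ z)     # edge-aware score
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = np.where(adj, alpha, 0.0)
    alpha /= alpha.sum(axis=1, keepdims=True)     # softmax over neighbors
    return alpha @ h                              # attended node features
```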

Multi-Task Learning (MTL) is a learning paradigm in machine learning whose aim is to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks. In this paper, we present a survey of MTL. First, we classify different MTL algorithms into several categories, including the feature learning approach, low-rank approach, task clustering approach, task relation learning approach, and decomposition approach, and then discuss the characteristics of each approach. To further improve the performance of learning tasks, MTL can be combined with other learning paradigms, including semi-supervised learning, active learning, unsupervised learning, reinforcement learning, multi-view learning, and graphical models. When the number of tasks is large or the data dimensionality is high, batch MTL models have difficulty handling this situation, so we review online, parallel, and distributed MTL models, as well as dimensionality reduction and feature hashing, to reveal their computational and storage advantages. Many real-world applications use MTL to boost their performance, and we review representative works. Finally, we present theoretical analyses and discuss several future directions for MTL.

Machine Learning as a Service (MLaaS), such as Microsoft Azure and Amazon AWS, offers effective DNN models to complete machine learning tasks for small businesses and individuals who lack data and computing power. However, an issue arises: user privacy is exposed to the MLaaS server, since users need to upload their sensitive data to it. To preserve their privacy, users can encrypt their data before uploading, but this makes it difficult to run the DNN model, which is not designed to operate in the ciphertext domain. In this paper, using the Paillier homomorphic cryptosystem, we present a new privacy-preserving deep neural network model that we call 2P-DNN. This model can fulfill the machine learning task in the ciphertext domain. By using 2P-DNN, MLaaS is able to provide a privacy-preserving machine learning service for users. We build our 2P-DNN model based on LeNet-5 and test it with the encrypted MNIST dataset. The classification accuracy is more than 97%, which is close to the accuracy of LeNet-5 running on the plaintext MNIST dataset and higher than that of other existing privacy-preserving machine learning models.
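The key enabler is that Paillier is additively homomorphic: encrypted values can be added, and multiplied by plaintext scalars, so a linear layer can be evaluated on ciphertexts. The sketch below uses the open-source `phe` package to show that principle only; it is not the full 2P-DNN protocol.

```python
# A homomorphic dot product: the server combines its plaintext weights
# with the user's encrypted inputs without ever seeing the plaintext.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

x = [0.5, -1.2, 3.0]                         # user's private input
enc_x = [public_key.encrypt(v) for v in x]   # encrypted before upload

w, b = [0.2, 0.4, -0.1], 0.3                 # server's plaintext weights
enc_y = sum(wi * xi for wi, xi in zip(w, enc_x)) + b  # on ciphertexts

print(private_key.decrypt(enc_y))            # user decrypts: w.x + b = -0.38
```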

The locations of the fiducial facial landmark points around facial components and the facial contour capture the rigid and non-rigid facial deformations due to head movements and facial expressions. They are hence important for various facial analysis tasks. Many facial landmark detection algorithms have been developed to automatically detect those key points over the years, and in this paper, we perform an extensive review of them. We classify the facial landmark detection algorithms into three major categories: holistic methods, Constrained Local Model (CLM) methods, and regression-based methods. They differ in the ways they utilize facial appearance and shape information. The holistic methods explicitly build models to represent the global facial appearance and shape information. The CLMs explicitly leverage the global shape model but build local appearance models. The regression-based methods implicitly capture facial shape and appearance information. For algorithms within each category, we discuss their underlying theories as well as their differences. We also compare their performances on both controlled and in-the-wild benchmark datasets, under varying facial expressions, head poses, and occlusion. Based on the evaluations, we point out their respective strengths and weaknesses. There is also a separate section reviewing the latest deep learning-based algorithms. The survey also includes a listing of the benchmark databases and existing software. Finally, we identify future research directions, including combining methods in different categories to leverage their respective strengths to solve landmark detection "in-the-wild".

* International Journal on Computer Vision, 2017
Feature learning with deep models has achieved impressive results for both data representation and classification for various vision tasks. Deep feature learning, however, typically requires a large amount of training data, which may not be feasible for some application domains. Transfer learning can alleviate this problem by transferring data from a data-rich source domain to a data-scarce target domain. Existing transfer learning methods typically perform one-shot transfer and often ignore the specific properties that the transferred data must satisfy. To address these issues, we introduce a constrained deep transfer feature learning method that performs transfer learning and feature learning simultaneously: transfer learning is carried out iteratively in a progressively improving feature space, which better narrows the gap between the target and source domains for effective transfer of data from source to target. Furthermore, we propose to exploit target domain knowledge and incorporate such prior knowledge as a constraint during transfer learning, ensuring that the transferred data satisfies certain properties of the target domain. To demonstrate the effectiveness of the proposed constrained deep transfer feature learning method, we apply it to thermal feature learning for eye detection by transferring from the visible domain. We also apply the proposed method to cross-view facial expression recognition as a second application. The experimental results demonstrate the effectiveness of the proposed method for both applications.
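A high-level sketch of the iterative transfer-then-learn loop described above; transfer_samples, learn_features, and satisfies_constraints are hypothetical helpers standing in for the paper's model-specific components.

```python
# Alternate between transferring constrained source samples and
# re-learning the feature space on the augmented target data.
def constrained_deep_transfer(source_data, target_data, n_iters,
                              transfer_samples, learn_features,
                              satisfies_constraints):
    feature_space = learn_features(target_data)   # initial target features
    for _ in range(n_iters):
        # Transfer source samples into the current feature space,
        # keeping only those that satisfy target-domain constraints.
        transferred = [s for s in transfer_samples(source_data, feature_space)
                       if satisfies_constraints(s)]
        # Re-learn features on target data plus accepted transfers,
        # progressively narrowing the source-target gap.
        feature_space = learn_features(target_data + transferred)
    return feature_space
```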

* International Conference on Computer Vision and Pattern Recognition, 2016
We propose a number of new algorithms for learning deep energy models and demonstrate their properties. We show that our SteinCD performs well in terms of test likelihood, while SteinGAN performs well in terms of generating realistic-looking images. Our results suggest promising directions for learning better models by combining GAN-style methods with traditional energy-based learning.

In distributed or privacy-preserving learning, we are often given a set of probabilistic models estimated from different local repositories and asked to combine them into a single model that gives efficient statistical estimation. A simple method is to linearly average the parameters of the local models, which, however, tends to be degenerate or inapplicable for non-convex models or models with different parameter dimensions. A more practical strategy is to generate bootstrap samples from the local models and then learn a joint model based on the combined bootstrap set. Unfortunately, the bootstrap procedure introduces additional noise and can significantly deteriorate performance. In this work, we propose two variance reduction methods to correct the bootstrap noise, including a weighted M-estimator that is both statistically efficient and practically powerful. Both theoretical and empirical analyses are provided to demonstrate our methods.

* This paper is about variance reduction on Monte Carlo estimation of KL divergence, NIPS, 2016
We propose a simple algorithm to train stochastic neural networks to draw samples from given target distributions for probabilistic inference. Our method is based on iteratively adjusting the neural network parameters so that the output changes along a Stein variational gradient that maximally decreases the KL divergence with the target distribution. Our method works for any target distribution specified by its unnormalized density function, and can train any black-box architecture that is differentiable in terms of the parameters we want to adapt. As an application of our method, we propose an amortized MLE algorithm for training deep energy models, where a neural sampler is adaptively trained to approximate the likelihood function. Our method mimics an adversarial game between the deep energy model and the neural sampler, and obtains realistic-looking images competitive with state-of-the-art results.
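For reference, the Stein variational gradient with kernel $k$ is $\phi(x_i) = \frac{1}{n}\sum_j \left[ k(x_j, x_i)\nabla_{x_j}\log p(x_j) + \nabla_{x_j} k(x_j, x_i) \right]$. The sketch below computes this update for an RBF kernel; the network parameters would then be adjusted so its outputs move along $\phi$, which is the amortized step not shown here.

```python
# Stein variational gradient for a set of particles under an RBF kernel.
import numpy as np

def svgd_gradient(particles, grad_log_p, bandwidth=1.0):
    n, d = particles.shape
    diffs = particles[:, None, :] - particles[None, :, :]  # x_i - x_j
    sq = (diffs ** 2).sum(-1)
    K = np.exp(-sq / (2 * bandwidth ** 2))                 # k(x_i, x_j)
    grads = np.stack([grad_log_p(x) for x in particles])   # (n, d)
    # Attraction toward high density plus a repulsive kernel term.
    phi = (K @ grads
           + (K[..., None] * diffs).sum(axis=1) / bandwidth ** 2) / n
    return phi
```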

* Under review as a conference paper at ICLR 2017
We report the "Recurrent Deterioration" (RD) phenomenon observed in online recommender systems. The RD phenomenon is reflected in the trend of performance degradation when the recommendation model is repeatedly trained on users' feedback on its previous recommendations. There are several reasons for recommender systems to encounter the RD phenomenon, including the lack of negative training data and the evolution of users' interests. To tackle the problems causing the RD phenomenon, we propose the POMDP-Rec framework, a neural-optimized Partially Observable Markov Decision Process algorithm for recommender systems. We show that the POMDP-Rec framework effectively uses the accumulated historical data from real-world recommender systems and automatically achieves results comparable with those of models fine-tuned exhaustively by domain experts on public datasets.

We present a dictionary learning approach to compensate for the transformation of faces due to changes in view point, illumination, resolution, etc. The key idea of our approach is to force domain-invariant sparse coding, i.e., to design a consistent sparse representation of the same face in different domains. In this way, classifiers trained on the sparse codes in a source domain consisting of frontal faces, for example, can be applied to the target domain (consisting of faces in different poses, illumination conditions, etc.) without much loss in recognition accuracy. The approach first learns a domain base dictionary and then describes each domain shift (identity, pose, illumination) using a sparse representation over the base dictionary. The dictionary adapted to each domain is expressed as sparse linear combinations of the base dictionary. In the context of face recognition, with the proposed compositional dictionary approach, a face image can be decomposed into sparse representations for a given subject, pose, and illumination, respectively. This approach has three advantages. First, the extracted sparse representation for a subject is consistent across domains and enables pose- and illumination-insensitive face recognition. Second, the sparse representations for pose and illumination can subsequently be used to estimate the pose and illumination condition of a face image. Finally, by composing sparse representations for a subject and the different domains, we can also perform pose alignment and illumination normalization. Extensive experiments using two public face datasets are presented to demonstrate the effectiveness of our approach for face recognition.
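A minimal sketch of the base-dictionary idea using scikit-learn: learn a dictionary on source-domain faces, then sparse-code faces from any domain over it so that classifiers trained on source codes transfer. The random arrays are stand-ins for real face features, and the compositional (subject/pose/illumination) decomposition itself is beyond this sketch.

```python
# Sparse coding over a shared base dictionary learned on the source domain.
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
source_faces = rng.standard_normal((200, 64))  # stand-in for frontal faces
target_faces = rng.standard_normal((50, 64))   # stand-in for posed faces

dico = DictionaryLearning(n_components=32, alpha=1.0, max_iter=20,
                          random_state=0).fit(source_faces)
codes_src = sparse_encode(source_faces, dico.components_, alpha=1.0)
codes_tgt = sparse_encode(target_faces, dico.components_, alpha=1.0)
# A classifier fit on codes_src can then be applied to codes_tgt.
```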

* Transactions on Image Processing, 2015
Concept hierarchy is the backbone of an ontology, and concept hierarchy acquisition has been a hot topic in the field of ontology learning. This paper proposes a hyponymy extraction method for domain ontology concepts based on cascaded conditional random fields (CCRFs) and hierarchical clustering. It takes free text as input and adopts CCRFs to identify domain concepts. First, the lower layer of the CCRFs is used to identify simple domain concepts; the results are then passed to the higher layer, in which nested concepts are recognized. Next, we adopt hierarchical clustering to identify the hyponymy relations between domain ontology concepts. The experimental results demonstrate that the proposed method is effective.

Deep directed generative models have attracted much attention recently due to their expressive representation power and their ability to support ancestral sampling. One major difficulty in learning directed models with many latent variables is intractable inference. To address this problem, most existing algorithms make assumptions that render the latent variables independent of each other, either by designing specific priors or by approximating the true posterior with a factorized distribution. We believe the correlations among latent variables are crucial for faithful data representation. Driven by this idea, we propose an inference method based on the conditional pseudo-likelihood that preserves the dependencies among the latent variables. For learning, we propose to employ the hard Expectation Maximization (EM) algorithm, which avoids the intractability of traditional EM by maximizing over the latent variables (max-out) instead of summing them out when computing the data likelihood. Qualitative and quantitative evaluations of our model against state-of-the-art deep models on benchmark datasets demonstrate the effectiveness of the proposed algorithm for data representation and reconstruction.
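A minimal sketch of hard EM: the E-step replaces the intractable sum over latent variables with a single maximizing assignment. The argmax_latent and maximize_params routines are hypothetical model-specific stand-ins, not the paper's exact inference procedure.

```python
# Hard EM: alternate between MAP latent assignments and parameter updates.
def hard_em(data, params, argmax_latent, maximize_params, n_iters=50):
    for _ in range(n_iters):
        # Hard E-step: z* = argmax_z p(z | x, params) for each x.
        assignments = [argmax_latent(x, params) for x in data]
        # M-step: maximize the complete-data log-likelihood
        # sum_x log p(x, z* | params) over the parameters.
        params = maximize_params(data, assignments)
    return params
```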
