When building large-scale machine learning (ML) programs, such as big topic models or deep neural nets, one usually assumes such tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners or academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens -- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude faster than current state-of-the-art Gibbs samplers; 2) a structure-aware model-parallel scheme, which leverages dependencies within the topic model, yielding a sampling strategy that is frugal on machine memory and network communication; 3) a differential data-structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed; and 4) a bounded asynchronous data-parallel scheme, which allows efficient distributed processing of massive data via a parameter server. Our distribution strategy is an instance of the model-and-data-parallel programming model underlying the Petuum framework for general distributed ML, and was implemented on top of the Petuum open-source system. We provide experimental evidence showing how this development puts massive models within reach on a small cluster while still enjoying proportional time cost reductions with increasing cluster size, in comparison with alternative options.
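
The Metropolis-Hastings idea in contribution 1 can be illustrated with a generic independence sampler over topic indices. This is only a minimal sketch, not the paper's sampler: the function name `mh_sample_topics`, the fixed proposal, and the toy weights are assumptions made for illustration. The draw from the proposal here uses `random.choices`, which is not constant-time; an O(1) draw would need a precomputed structure such as an alias table, which is omitted.

```python
import random


def mh_sample_topics(target, proposal, n_steps, seed=0):
    """Return n_steps topic indices whose long-run frequencies follow `target`.

    target:   unnormalised positive weights p(k) of the desired distribution.
    proposal: positive weights q(k) of a cheap proposal (hypothetical, fixed).
    """
    rng = random.Random(seed)
    topics = range(len(target))
    current = rng.choices(topics, weights=proposal)[0]
    samples = []
    for _ in range(n_steps):
        candidate = rng.choices(topics, weights=proposal)[0]
        # Independence-sampler acceptance ratio: p(t) q(s) / (p(s) q(t)).
        ratio = (target[candidate] * proposal[current]) / (
            target[current] * proposal[candidate]
        )
        if rng.random() < min(1.0, ratio):
            current = candidate
        samples.append(current)
    return samples


if __name__ == "__main__":
    draws = mh_sample_topics([1.0, 2.0, 7.0], [1 / 3, 1 / 3, 1 / 3], 60000)
    print([draws.count(k) / len(draws) for k in range(3)])
```

The acceptance test only evaluates the target at two indices, so its cost does not depend on the number of topics; that is the property a constant-time proposal would preserve end to end.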

Probabilistic point-set registration methods have been gaining more attention for their robustness to noise, outliers and occlusions. However, these methods tend to be much slower than the popular iterative closest point (ICP) algorithms, which severely limits their usability. In this paper, we contribute a novel probabilistic registration method that achieves state-of-the-art robustness as well as substantially faster computational performance than modern ICP implementations. This is achieved using a rigorous yet computationally-efficient probabilistic formulation. Point-set registration is cast as a maximum likelihood estimation and solved using the EM algorithm. We show that with a simple augmentation, the E step can be formulated as a filtering problem, allowing us to leverage advances in efficient Gaussian filtering methods. We also propose a customized permutohedral filter for improved efficiency while retaining sufficient accuracy for our task. Additionally, we present a simple and efficient twist parameterization that generalizes our method to the registration of articulated and deformable objects. For articulated objects, the complexity of our method is almost independent of the Degrees Of Freedom (DOFs), which makes it highly efficient even for high DOF systems. The results demonstrate the proposed method consistently outperforms many competitive baselines on a variety of registration tasks.

* The video demo and source code are on https://sites.google.com/view/filterreg/home
We present an optical mapping near-eye (OMNI) three-dimensional display method for wearable devices. By dividing a display screen into different sub-panels and optically mapping them to various depths, we create a multiplane volumetric image with correct focus cues for depth perception. The resultant system can drive the eye's accommodation to the distance that is consistent with binocular stereopsis, thereby alleviating the vergence-accommodation conflict, the primary cause for eye fatigue and discomfort. Compared with the previous methods, the OMNI display offers prominent advantages in adaptability, image dynamic range, and refresh rate.

* 5 pages, 6 figures, 2 tables, short article for Optics Letters
AUC (area under the ROC curve) is an important evaluation criterion, which has been widely used in many learning tasks such as class-imbalance learning, cost-sensitive learning, learning to rank, etc. Many learning approaches try to optimize AUC, but owing to the non-convexity and discontinuity of AUC, almost all approaches work with surrogate loss functions. Thus, the consistency of these surrogate losses with AUC is crucial; however, it has been almost untouched before. In this paper, we provide a sufficient condition for the asymptotic consistency of learning approaches based on surrogate loss functions. Based on this result, we prove that exponential loss and logistic loss are consistent with AUC, but hinge loss is inconsistent. Then, we derive the $q$-norm hinge loss and general hinge loss that are consistent with AUC. We also derive the consistent bounds for exponential loss and logistic loss, and obtain the consistent bounds for many surrogate loss functions under the noise-free setting. Further, we disclose an equivalence between the exponential surrogate loss of AUC and the exponential surrogate loss of accuracy; one straightforward consequence of this finding is that AdaBoost and RankBoost are equivalent.
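
To make the relation between AUC and its pairwise surrogates concrete, the sketch below computes the empirical AUC and the average exponential, logistic and hinge losses over all (positive, negative) score pairs. The function names and toy scores are illustrative assumptions; the code does not reproduce any consistency result from the abstract.

```python
import math


def pairwise_margins(pos_scores, neg_scores):
    """Score differences f(x+) - f(x-) over every positive/negative pair."""
    return [p - n for p in pos_scores for n in neg_scores]


def empirical_auc(pos_scores, neg_scores):
    """Fraction of pairs ranked correctly; ties count as one half."""
    margins = pairwise_margins(pos_scores, neg_scores)
    correct = sum(1.0 if m > 0 else 0.5 if m == 0 else 0.0 for m in margins)
    return correct / len(margins)


# Convex surrogates of the 0/1 pairwise ranking loss, as functions of the margin.
SURROGATES = {
    "exponential": lambda m: math.exp(-m),
    "logistic": lambda m: math.log1p(math.exp(-m)),
    "hinge": lambda m: max(0.0, 1.0 - m),
}


def surrogate_risk(pos_scores, neg_scores, name):
    """Average surrogate loss over all positive/negative pairs."""
    margins = pairwise_margins(pos_scores, neg_scores)
    loss = SURROGATES[name]
    return sum(loss(m) for m in margins) / len(margins)


if __name__ == "__main__":
    pos, neg = [0.9, 0.8, 0.4], [0.7, 0.3, 0.1]
    print(empirical_auc(pos, neg))
    for name in SURROGATES:
        print(name, surrogate_risk(pos, neg, name))
```

The empirical AUC is a step function of the scores, whereas each surrogate is continuous and convex in the pairwise margin, which is why optimization is carried out on the surrogates.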

Great successes of deep neural networks have been witnessed in various real applications. Many algorithmic and implementation techniques have been developed; however, theoretical understanding of many aspects of deep neural networks is far from clear. A particularly interesting issue is the usefulness of dropout, which was motivated by the intuition of preventing complex co-adaptation of feature detectors. In this paper, we study the Rademacher complexity of different types of dropout, and our theoretical results disclose that for shallow neural networks (with one or no hidden layer) dropout is able to reduce the Rademacher complexity polynomially, whereas for deep neural networks it can lead to an exponential reduction of the Rademacher complexity.

* 20 pages
Margin theory provides one of the most popular explanations for the success of \texttt{AdaBoost}, where the central point lies in the recognition that \textit{margin} is the key for characterizing the performance of \texttt{AdaBoost}. This theory has been very influential, e.g., it has been used to argue that \texttt{AdaBoost} usually does not overfit since it tends to enlarge the margin even after the training error reaches zero. Previously the \textit{minimum margin bound} was established for \texttt{AdaBoost}; however, \cite{Breiman1999} pointed out that maximizing the minimum margin does not necessarily lead to better generalization. Later, \cite{Reyzin:Schapire2006} emphasized that the margin distribution rather than the minimum margin is crucial to the performance of \texttt{AdaBoost}. In this paper, we first present the \textit{$k$th margin bound} and further study its relationship to previous work such as the minimum margin bound and the Emargin bound. Then, we improve the previous empirical Bernstein bounds \citep{Maurer:Pontil2009,Audibert:Munos:Szepesvari2009}, and based on such findings, we defend the margin-based explanation against Breiman's doubts by proving a new generalization error bound that considers exactly the same factors as \cite{Schapire:Freund:Bartlett:Lee1998} but is sharper than \cite{Breiman1999}'s minimum margin bound. By incorporating factors such as average margin and variance, we present a generalization error bound that is heavily related to the whole margin distribution. We also provide margin distribution bounds for the generalization error of voting classifiers in finite VC-dimension space.
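
The quantities the abstract refers to (minimum margin, $k$th margin, average margin and margin variance) can be computed from a voting classifier as in the sketch below. It assumes the usual normalised margin of a weighted vote with non-negative weights and base predictions in {-1, +1}; the function names and toy ensemble are hypothetical, and no bound from the paper is evaluated.

```python
def margins(alphas, votes, labels):
    """Normalised margins y * sum_t(alpha_t * h_t(x)) / sum_t(alpha_t).

    alphas: non-negative weights of the base classifiers (assumed).
    votes:  votes[i][t] in {-1, +1} is base classifier t's output on example i.
    labels: true labels in {-1, +1}.
    """
    total = sum(alphas)
    return [
        y * sum(a * h for a, h in zip(alphas, row)) / total
        for row, y in zip(votes, labels)
    ]


def margin_summary(ms, k=1):
    """Minimum, k-th smallest, mean and (population) variance of the margins."""
    ordered = sorted(ms)
    mean = sum(ms) / len(ms)
    variance = sum((m - mean) ** 2 for m in ms) / len(ms)
    return {"min": ordered[0], "kth": ordered[k - 1], "mean": mean, "variance": variance}


if __name__ == "__main__":
    alphas = [0.5, 0.3, 0.2]
    votes = [[1, 1, 1], [1, -1, 1], [-1, -1, 1]]
    labels = [1, 1, -1]
    print(margin_summary(margins(alphas, votes, labels), k=2))
```

With `k=1` the $k$th margin coincides with the minimum margin, so the minimum margin is the special case at one end of the sorted margin distribution.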

* Artificial Intelligence 203:1-18 2013
* 35 pages
Recently, convolutional neural networks have achieved significant results in many challenging machine learning fields. Because designing a network by hand is costly, hand-crafted neural networks no longer satisfy our requirements, and automatically generating architectures has attracted increasing attention. Some research on auto-generated networks has achieved promising results. However, these methods mainly aim at picking a series of single layers, such as convolution or pooling layers, one by one. There are many elegant and creative designs in carefully hand-crafted neural networks, such as the Inception block in GoogLeNet, the residual block in the residual network and the dense block in the dense convolutional network. Based on reinforcement learning and taking advantage of the strengths of these networks, we propose a novel automatic process to design a multi-block neural network, whose architecture contains multiple types of the blocks mentioned above, with the purpose of learning the structure of deep neural networks and exploring whether different blocks can be composed to form a well-behaved neural network. The optimal network is created by a Q-learning agent that is trained to sequentially pick different types of blocks. To verify the validity of our proposed method, we use the auto-generated multi-block neural network to conduct image classification experiments on the benchmark datasets MNIST, SVHN and CIFAR-10 with restricted computational resources. The results demonstrate that our method is very effective, achieving comparable or better performance than hand-crafted networks and advanced auto-generated neural networks.

Argument component detection (ACD) is an important sub-task in argumentation mining. ACD aims at detecting and classifying different argument components in natural language texts. Historical annotations (HAs) are important features that human annotators consider when they manually perform the ACD task. However, HAs are largely ignored by existing automatic ACD techniques. Reinforcement learning (RL) has proven to be an effective method for using HAs in some natural language processing tasks. In this work, we propose an RL-based ACD technique, and evaluate its performance on two well-annotated corpora. Results suggest that, in terms of classification accuracy, HAs-augmented RL outperforms plain RL by up to 17.85%, and outperforms the state-of-the-art supervised learning algorithm by up to 11.94%.

Nearest neighbor has always been one of the most appealing non-parametric approaches in machine learning, pattern recognition, computer vision, etc. Previous empirical studies partly show that nearest neighbor is resistant to noise, yet there is a lack of deep analysis. This work presents finite-sample and distribution-dependent bounds on the consistency of nearest neighbor in the random noise setting. The theoretical results show that, for asymmetric noises, k-nearest neighbor is robust enough to classify most data correctly, except for a handful of examples whose labels are totally misled by random noises. For symmetric noises, however, k-nearest neighbor achieves the same consistency rate as in the noise-free setting, which verifies the resistance of k-nearest neighbor to random noisy labels. Motivated by the theoretical analysis, we propose the Robust k-Nearest Neighbor (RkNN) approach to deal with noisy labels. The basic idea is to make unilateral corrections to examples whose labels are totally misled by random noises, and to classify the others directly by utilizing the robustness of k-nearest neighbor. We verify the effectiveness of the proposed algorithm both theoretically and empirically.
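
The robustness of majority voting to an isolated flipped label can be seen in a plain k-nearest-neighbor classifier. The sketch below is ordinary kNN, not the RkNN correction step described in the abstract; the function name `knn_predict` and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority label among the k training points closest to `query`.

    train_x: list of coordinate tuples; train_y: matching list of labels.
    """
    by_distance = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Clean rule: label 0 for x < 5, label 1 otherwise.
    xs = [(float(v),) for v in range(10)]
    ys = [0 if v < 5 else 1 for v in range(10)]
    ys[2] = 1  # one label flipped by noise
    print(knn_predict(xs, ys, (1.9,), k=1))  # follows the noisy neighbor
    print(knn_predict(xs, ys, (1.9,), k=3))  # outvoted by clean neighbors
```

With `k=1` the prediction copies the corrupted label of the closest point, while with `k=3` the two clean neighbors outvote it.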

* 35 pages
In recent years, research on image generation methods has been developing fast. The auto-encoding variational Bayes method (VAEs) was proposed in 2013, which uses variational inference to learn a latent space from the image database and then generates images using the decoder. The generative adversarial networks (GANs) came out as a promising framework, which uses adversarial training to improve the generative ability of the generator. However, the pictures generated by GANs are generally blurry. The deep convolutional generative adversarial networks (DCGANs) were then proposed to improve the quality of generated images. Since the input noise vectors are randomly sampled from a Gaussian distribution, the generator has to map from a whole normal distribution to the images. This makes DCGANs unable to reflect the inherent structure of the training data. In this paper, we propose a novel deep model, called generative adversarial networks with decoder-encoder output noise (DE-GANs), which takes advantage of both adversarial training and variational Bayesian inference to improve the performance of image generation. DE-GANs use a pre-trained decoder-encoder architecture to map the random Gaussian noise vectors to informative ones and pass them to the generator of the adversarial networks. Since the decoder-encoder architecture is trained on the same images as the generators, the output vectors could carry the intrinsic distribution information of the original images. Moreover, the loss function of DE-GANs is different from that of GANs and DCGANs. A hidden-space loss function is added to the adversarial loss function to enhance the robustness of the model. Extensive empirical results show that DE-GANs can accelerate the convergence of the adversarial training process and improve the quality of the generated images.

Although the performance of person Re-Identification (ReID) has been significantly boosted, many challenging issues in real scenarios have not been fully investigated, e.g., the complex scenes and lighting variations, viewpoint and pose changes, and the large number of identities in a camera network. To facilitate the research towards conquering those issues, this paper contributes a new dataset called MSMT17 with many important features, e.g., 1) the raw videos are taken by a 15-camera network deployed in both indoor and outdoor scenes, 2) the videos cover a long period of time and present complex lighting variations, and 3) it contains currently the largest number of annotated identities, i.e., 4,101 identities and 126,441 bounding boxes. We also observe that a domain gap commonly exists between datasets, which causes a severe performance drop when training and testing on different datasets. As a result, available training data cannot be effectively leveraged for new testing domains. To relieve the expensive costs of annotating new training samples, we propose a Person Transfer Generative Adversarial Network (PTGAN) to bridge the domain gap. Comprehensive experiments show that the domain gap could be substantially narrowed by the PTGAN.

* 10 pages, 9 figures; accepted in CVPR 2018
The human behavior of evaluating other individuals with respect to their personality traits and intelligence by evaluating their faces plays a crucial role in human relations. These trait judgments might influence important social outcomes in our lives such as elections and court sentences. Previous studies have reported that humans can make valid inferences for at least four personality traits. In addition, some studies have demonstrated that facial trait evaluation can be learned accurately using machine learning methods. In this work, we experimentally explore whether self-reported personality traits and intelligence can be predicted reliably from a facial image. More specifically, the prediction problem is cast in two separate parts: a classification task and a regression task. A facial structural feature is constructed from the relations among facial salient points, and an appearance feature is built by five texture descriptors. In addition, a minutia-based fingerprint feature from a fingerprint image is also explored. The classification results show that the personality traits "Rule-consciousness" and "Vigilance" can be predicted reliably, and that the traits of females can be predicted more accurately than those of males. However, the regression experiments show that it is difficult to predict scores for individual personality traits and intelligence. The residual plots and the correlation results indicate no evident linear correlation between the measured scores and the predicted scores. Both the classification and the regression results reveal that "Rule-consciousness" and "Tension" can be reliably predicted from the facial features, while "Social boldness" gets the worst prediction results. The experimental results show that it is difficult to predict intelligence from either the facial features or the fingerprint feature, a finding that is in agreement with previous studies.

Adaptive filtering algorithms operating in reproducing kernel Hilbert spaces have demonstrated superiority over their linear counterpart for nonlinear system identification. Unfortunately, an undesirable characteristic of these methods is that the order of the filters grows linearly with the number of input data. This dramatically increases the computational burden and memory requirement. A variety of strategies based on dictionary learning have been proposed to overcome this severe drawback. Few, if any, of these works analyze the problem of updating the dictionary in a time-varying environment. In this paper, we present an analytical study of the convergence behavior of the Gaussian least-mean-square algorithm in the case where the statistics of the dictionary elements only partially match the statistics of the input data. This allows us to emphasize the need for updating the dictionary in an online way, by discarding the obsolete elements and adding appropriate ones. We introduce a kernel least-mean-square algorithm with L1-norm regularization to automatically perform this task. The stability in the mean of this method is analyzed, and its performance is tested with experiments.
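
For orientation, a kernel least-mean-square filter with a Gaussian kernel and a fixed dictionary can be sketched as below. This is a simplified illustration under stated assumptions: the dictionary is fixed in advance, there is no L1-norm regularization or online dictionary update as proposed in the abstract, and the names, step size and bandwidth are hypothetical choices.

```python
import math
import random


def gaussian_kernel(x, centre, bandwidth):
    """Gaussian kernel between a scalar input and a dictionary element."""
    return math.exp(-((x - centre) ** 2) / (2.0 * bandwidth ** 2))


def klms_fixed_dictionary(stream, dictionary, step_size=0.2, bandwidth=0.3):
    """Run LMS on kernelised inputs; the filter order equals len(dictionary).

    stream: iterable of (input, desired_output) pairs.
    Returns the final coefficients and the squared a-priori errors.
    """
    alpha = [0.0] * len(dictionary)
    squared_errors = []
    for x, desired in stream:
        k = [gaussian_kernel(x, c, bandwidth) for c in dictionary]
        error = desired - sum(a * ki for a, ki in zip(alpha, k))
        # Stochastic-gradient step on the instantaneous squared error.
        alpha = [a + step_size * error * ki for a, ki in zip(alpha, k)]
        squared_errors.append(error * error)
    return alpha, squared_errors


if __name__ == "__main__":
    rng = random.Random(1)
    inputs = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
    data = [(x, math.sin(3.0 * x)) for x in inputs]  # toy nonlinear system
    grid = [-1.0 + 0.2 * i for i in range(11)]       # fixed dictionary
    _, errors = klms_fixed_dictionary(data, grid)
    print(sum(errors[:50]) / 50, sum(errors[-500:]) / 500)
```

Because the dictionary is fixed, the filter order stays constant instead of growing with the number of inputs; the price is that the dictionary must match the input statistics, which is the situation the abstract analyzes.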

Recommender systems have attracted much attention during the past decade. Many attack detection algorithms have been developed for better recommendations, mostly focusing on shilling attacks, where an attack organizer produces a large number of user profiles by the same strategy to promote or demote an item. This work considers a different attack style: unorganized malicious attacks, where attackers individually utilize a small number of user profiles to attack different items without any organizer. This attack style occurs in many real applications, yet it remains largely unstudied. We first formulate the detection of unorganized malicious attacks as a matrix completion problem, and propose the Unorganized Malicious Attacks detection (UMA) approach, a proximal alternating splitting augmented Lagrangian method. We verify, both theoretically and empirically, the effectiveness of our proposed approach.

Single image super-resolution aims to generate a high-resolution image from a single low-resolution image, which is of great significance in a wide range of applications. Because the problem is ill-posed, numerous methods have been proposed to reconstruct the missing image details based on exemplars or priors. In this paper, we propose a fast and simple single image super-resolution strategy that uses a patch-wise sigmoid transformation as an imposed sharpening regularization term in the reconstruction, and achieves strong reconstruction performance. Extensive experiments comparing it with other state-of-the-art approaches demonstrate the superior effectiveness and efficiency of the proposed algorithm.

Despite being so vital to the success of Support Vector Machines, the principle of separating margin maximisation is not used in deep learning. We show that minimisation of margin variance, and not maximisation of the margin, is more suitable for improving generalisation in deep architectures. We propose the Halfway loss function that minimises the Normalised Margin Variance (NMV) at the output of a deep learning model and evaluate its performance against the Softmax Cross-Entropy loss on the MNIST, smallNORB and CIFAR-10 datasets.

AUC is an important performance measure and many algorithms have been devoted to AUC optimization, mostly by minimizing a surrogate convex loss on a training data set. In this work, we focus on one-pass AUC optimization that requires going through the training data only once without storing the entire training dataset, where conventional online learning algorithms cannot be applied directly because AUC is measured by a sum of losses defined over pairs of instances from different classes. We develop a regression-based algorithm which only needs to maintain the first and second order statistics of training data in memory, resulting in a storage requirement independent of the size of training data. To efficiently handle high dimensional data, we develop a randomized algorithm that approximates the covariance matrices by low rank matrices. We verify, both theoretically and empirically, the effectiveness of the proposed algorithm.
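
Why first and second order statistics can suffice is easiest to see for a pairwise square loss with a linear scorer, which is the assumption made in this sketch (the abstract only says "regression-based"). Under that assumption, the average of (1 - w·(x⁺ - x⁻))² over all positive/negative pairs equals (1 - w·(μ⁺ - μ⁻))² + wᵀ(Σ⁺ + Σ⁻)w, where μ and Σ are the per-class means and population covariances. The class and function names are hypothetical.

```python
class ClassStats:
    """Running count, sum vector and sum of outer products for one class."""

    def __init__(self, dim):
        self.n = 0
        self.total = [0.0] * dim
        self.outer = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        """Absorb one instance; memory use does not grow with the stream."""
        self.n += 1
        for i, xi in enumerate(x):
            self.total[i] += xi
            for j, xj in enumerate(x):
                self.outer[i][j] += xi * xj

    def mean(self):
        return [s / self.n for s in self.total]

    def covariance(self):
        """Population covariance E[x x^T] - mean mean^T."""
        m = self.mean()
        d = len(m)
        return [[self.outer[i][j] / self.n - m[i] * m[j] for j in range(d)] for i in range(d)]


def pairwise_square_loss_from_stats(w, pos, neg):
    """Average of (1 - w.(x+ - x-))^2 over all pairs, from statistics only."""
    mu_p, mu_n = pos.mean(), neg.mean()
    gap = sum(wi * (a - b) for wi, a, b in zip(w, mu_p, mu_n))
    cov_p, cov_n = pos.covariance(), neg.covariance()
    d = len(w)
    spread = sum(
        w[i] * (cov_p[i][j] + cov_n[i][j]) * w[j] for i in range(d) for j in range(d)
    )
    return (1.0 - gap) ** 2 + spread


if __name__ == "__main__":
    pos_stats, neg_stats = ClassStats(2), ClassStats(2)
    for x in [(1.0, 2.0), (2.0, 0.0), (3.0, 1.0)]:
        pos_stats.update(x)
    for x in [(0.0, 0.0), (1.0, -1.0)]:
        neg_stats.update(x)
    print(pairwise_square_loss_from_stats((0.3, -0.2), pos_stats, neg_stats))
```

Each `update` touches only a vector and a matrix of fixed size, so storage depends on the dimension but not on the number of instances seen.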

* Proceedings of the 30th International Conference on Machine Learning
Neural generative models have become popular and achieved promising performance on short-text conversation tasks. They are generally trained to build a 1-to-1 mapping from the input post to its output response. However, a given post is often associated with multiple replies simultaneously in real applications. Previous research on this task mainly focuses on improving the relevance and informativeness of the top one generated response for each post. Very few works study generating multiple accurate and diverse responses for the same post. In this paper, we propose a novel response generation model, which considers a set of responses jointly and generates multiple diverse responses simultaneously. A reinforcement learning algorithm is designed to solve our model. Experiments on two short-text conversation tasks validate that the multiple responses generated by our model obtain higher quality and larger diversity compared with various state-of-the-art generative models.

Real-world scenarios demand reasoning about the process, more than prediction of the final outcome, to discover latent causal chains and better understand complex systems. This requires learning algorithms to offer both accurate predictions and clear interpretations. We design a set of trajectory reasoning tasks on graphs with only the source and the destination observed. We present the attention flow mechanism to explicitly model the reasoning process, leveraging the relational inductive biases by basing our models on graph networks. We study the way attention flow can effectively act on the underlying information flow implemented by message passing. Experiments demonstrate that the attention flow driven by and interacting with graph networks can provide higher accuracy in prediction and better interpretation for trajectory reasoning.

Recent learning-based super-resolution (SR) methods often focus on dictionary learning or network training. In this paper, we discuss in detail a new SR method based on local patch encoding (LPE) instead of traditional dictionary learning. The proposed method consists of a learning stage and a reconstructing stage. In the learning stage, image patches are classified into different classes by means of the proposed LPE, and then a projection matrix is computed for each class by utilizing a simple constraint. In the reconstructing stage, an input LR patch can be simply reconstructed by computing its LPE code and then multiplying it by the corresponding projection matrix. Furthermore, we discuss the relationship between the proposed method and the anchored neighborhood regression methods; we also analyze the extensibility of the proposed method. The experimental results on several image sets demonstrate the effectiveness of the LPE-based methods.

* Y. Zhao, R. Wang, W. Jia, J. Yang, W. Wang, W. Gao, Local patch encoding-based method for single image super-resolution, Information Sciences, vol. 433, pp. 292-305, 2018
* 20 pages, 8 figures