Models, code, and papers for "Yuan Yao":

Sparse model selection is ubiquitous from linear regression to graphical models where regularization paths, as a family of estimators upon the regularization parameter varying, are computed when the regularization parameter is unknown or decided data-adaptively. Traditional computational methods rely on solving a set of optimization problems where the regularization parameters are fixed on a grid that might be inefficient. In this paper, we introduce a simple iterative regularization path, which follows the dynamics of a sparse Mirror Descent algorithm or a generalization of Linearized Bregman Iterations with nonlinear loss. Its performance is competitive to \texttt{glmnet} with a further bias reduction. A path consistency theory is presented that under the Restricted Strong Convexity (RSC) and the Irrepresentable Condition (IRR), the path will first evolve in a subspace with no false positives and reach an estimator that is sign-consistent or of minimax optimal $\ell_2$ error rate. Early stopping regularization is required to prevent overfitting. Application examples are given in sparse logistic regression and Ising models for NIPS coauthorship.

This note describes the details of our solution to the dense-captioning events in videos task of ActivityNet Challenge 2018. Specifically, we solve this problem with a two-stage way, i.e., first temporal event proposal and then sentence generation. For temporal event proposal, we directly leverage the three-stage workflow in [13, 16]. For sentence generation, we capitalize on LSTM-based captioning framework with temporal attention mechanism (dubbed as LSTM-T). Moreover, the input visual sequence to the LSTM-based video captioning model is comprised of RGB and optical flow images. At inference, we adopt a late fusion scheme to fuse the two LSTM-based captioning models for sentence generation.

We propose a scalable approach to learn video-based question answering (QA): answer a "free-form natural language question" about a video content. Our approach automatically harvests a large number of videos and descriptions freely available online. Then, a large number of candidate QA pairs are automatically generated from descriptions rather than manually annotated. Next, we use these candidate QA pairs to train a number of video-based QA methods extended fromMN (Sukhbaatar et al. 2015), VQA (Antol et al. 2015), SA (Yao et al. 2015), SS (Venugopalan et al. 2015). In order to handle non-perfect candidate QA pairs, we propose a self-paced learning procedure to iteratively identify them and mitigate their effects in training. Finally, we evaluate performance on manually generated video-based QA pairs. The results show that our self-paced learning procedure is effective, and the extended SS model outperforms various baselines.

This paper presents a semi-supervised learning framework for a customized semantic segmentation task using multiview image streams. A key challenge of the customized task lies in the limited accessibility of the labeled data due to the requirement of prohibitive manual annotation effort. We hypothesize that it is possible to leverage multiview image streams that are linked through the underlying 3D geometry, which can provide an additional supervisionary signal to train a segmentation model. We formulate a new cross-supervision method using a shape belief transfer---the segmentation belief in one image is used to predict that of the other image through epipolar geometry analogous to shape-from-silhouette. The shape belief transfer provides the upper and lower bounds of the segmentation for the unlabeled data where its gap approaches asymptotically to zero as the number of the labeled views increases. We integrate this theory to design a novel network that is agnostic to camera calibration, network model, and semantic category and bypasses the intermediate process of suboptimal 3D reconstruction. We validate this network by recognizing a customized semantic category per pixel from realworld visual data including non-human species and a subject of interest in social videos where attaining large-scale annotation data is infeasible.

Nonconvex optimization problems arise in different research fields and arouse lots of attention in signal processing, statistics and machine learning. In this work, we explore the accelerated proximal gradient method and some of its variants which have been shown to converge under nonconvex context recently. We show that a novel variant proposed here, which exploits adaptive momentum and block coordinate update with specific update rules, further improves the performance of a broad class of nonconvex problems. In applications to sparse linear regression with regularizations like Lasso, grouped Lasso, capped $\ell_1$ and SCAP, the proposed scheme enjoys provable local linear convergence, with experimental justification.

Knowledge graphs are important resources for many artificial intelligence tasks but often suffer from incompleteness. In this work, we propose to use pre-trained language models for knowledge graph completion. We treat triples in knowledge graphs as textual sequences and propose a novel framework named Knowledge Graph Bidirectional Encoder Representations from Transformer (KG-BERT) to model these triples. Our method takes entity and relation descriptions of a triple as input and computes scoring function of the triple with the KG-BERT language model. Experimental results on multiple benchmark knowledge graphs show that our method can achieve state-of-the-art performance in triple classification, link prediction and relation prediction tasks.

Synthesis of high resolution images using Generative Adversarial Networks (GANs) is challenging, which usually requires numbers of high-end graphic cards with large memory and long time of training. In this paper, we propose a two-stage framework to accelerate the training process of synthesizing high resolution images. High resolution images are first transformed to small codes via the trained encoder and decoder networks. The code in latent space is times smaller than the original high resolution images. Then, we train a code generation network to learn the distribution of the latent codes. In this way, the generator only learns to generate small latent codes instead of large images. Finally, we decode the generated latent codes to image space via the decoder networks so as to output the synthesized high resolution images. Experimental results show that the proposed method accelerates the training process significantly and increases the quality of the generated samples. The proposed acceleration framework makes it possible to generate high resolution images using less training time with limited hardware resource. After using the proposed acceleration method, it takes only 3 days to train a 1024 *1024 image generator on Celeba-HQ dataset using just one NVIDIA P100 graphic card.

Cross-domain recommendation has long been one of the major topics in recommender systems. Recently, various deep models have been proposed to transfer the learned knowledge across domains, but most of them focus on extracting abstract transferable features from auxilliary contents, e.g., images and review texts, and the patterns in the rating matrix itself is rarely touched. In this work, inspired by the concept of domain adaptation, we proposed a deep domain adaptation model (DARec) that is capable of extracting and transferring patterns from rating matrices {\em only} without relying on any auxillary information. We empirically demonstrate on public datasets that our method achieves the best performance among several state-of-the-art alternative cross-domain recommendation models.

Semi-supervised learning is sought for leveraging the unlabelled data when labelled data is difficult or expensive to acquire. Deep generative models (e.g., Variational Autoencoder (VAE)) and semisupervised Generative Adversarial Networks (GANs) have recently shown promising performance in semi-supervised classification for the excellent discriminative representing ability. However, the latent code learned by the traditional VAE is not exclusive (repeatable) for a specific input sample, which prevents it from excellent classification performance. In particular, the learned latent representation depends on a non-exclusive component which is stochastically sampled from the prior distribution. Moreover, the semi-supervised GAN models generate data from pre-defined distribution (e.g., Gaussian noises) which is independent of the input data distribution and may obstruct the convergence and is difficult to control the distribution of the generated data. To address the aforementioned issues, we propose a novel Adversarial Variational Embedding (AVAE) framework for robust and effective semi-supervised learning to leverage both the advantage of GAN as a high quality generative model and VAE as a posterior distribution learner. The proposed approach first produces an exclusive latent code by the model which we call VAE++, and meanwhile, provides a meaningful prior distribution for the generator of GAN. The proposed approach is evaluated over four different real-world applications and we show that our method outperforms the state-of-the-art models, which confirms that the combination of VAE++ and GAN can provide significant improvements in semisupervised classification.

Laboratory testing and medication prescription are two of the most important routines in daily clinical practice. Developing an artificial intelligence system that can automatically make lab test imputations and medication recommendations can save cost on potentially redundant lab tests and inform physicians in more effective prescription. We present an intelligent model that can automatically recommend the patients' medications based on their incomplete lab tests, and can even accurately estimate the lab values that have not been taken. We model the complex relations between multiple types of medical entities with their inherent features in a heterogeneous graph. Then we learn a distributed representation for each entity in the graph based on graph convolutional networks to make the representations integrate information from multiple types of entities. Since the entity representations incorporate multiple types of medical information, they can be used for multiple medical tasks. In our experiments, we construct a graph to associate patients, encounters, lab tests and medications, and conduct the two tasks: medication recommendation and lab test imputation. The experimental results demonstrate that our model can outperform the state-of-the-art models in both tasks.

Image representation is a fundamental task in computer vision. However, most of the existing approaches for image representation ignore the relations between images and consider each input image independently. Intuitively, relations between images can help to understand the images and maintain model consistency over related images. In this paper, we consider modeling the image-level relations to generate more informative image representations, and propose ImageGCN, an end-to-end graph convolutional network framework for multi-relational image modeling. We also apply ImageGCN to chest X-ray (CXR) images where rich relational information is available for disease identification. Unlike previous image representation models, ImageGCN learns the representation of an image using both its original pixel features and the features of related images. Besides learning informative representations for images, ImageGCN can also be used for object detection in a weakly supervised manner. The Experimental results on ChestX-ray14 dataset demonstrate that ImageGCN can outperform respective baselines in both disease identification and localization tasks and can achieve comparable and often better results than the state-of-the-art methods.

Robust scatter estimation is a fundamental task in statistics. The recent discovery on the connection between robust estimation and generative adversarial nets (GANs) by Gao et al. (2018) suggests that it is possible to compute depth-like robust estimators using similar techniques that optimize GANs. In this paper, we introduce a general learning via classification framework based on the notion of proper scoring rules. This framework allows us to understand both matrix depth function and various GANs through the lens of variational approximations of $f$-divergences induced by proper scoring rules. We then propose a new class of robust scatter estimators in this framework by carefully constructing discriminators with appropriate neural network structures. These estimators are proved to achieve the minimax rate of scatter estimation under Huber's contamination model. Our numerical results demonstrate its good performance under various settings against competitors in the literature.

Margin enlargement over training data has been an important strategy since perceptrons in machine learning for the purpose of boosting the robustness of classifiers toward a good generalization ability. Yet Breiman shows a dilemma (Breiman, 1999) that a uniform improvement on margin distribution \emph{does not} necessarily reduces generalization errors. In this paper, we revisit Breiman's dilemma in deep neural networks with recently proposed spectrally normalized margins. A novel perspective is provided to explain Breiman's dilemma based on phase transitions in dynamics of normalized margin distributions, that reflects the trade-off between expressive power of models and complexity of data. When data complexity is comparable to the model expressiveness in the sense that both training and test data share similar phase transitions in normalized margin dynamics, two efficient ways are derived to predict the trend of generalization or test error via classic margin-based generalization bounds with restricted Rademacher complexities. On the other hand, over-expressive models that exhibit uniform improvements on training margins, as a distinct phase transition to test margin dynamics, may lose such a prediction power and fail to prevent the overfitting. Experiments are conducted to show the validity of the proposed method with some basic convolutional networks, AlexNet, VGG-16, and ResNet-18, on several datasets including Cifar10/100 and mini-ImageNet.

Text Classification is an important and classical problem in natural language processing. There have been a number of studies that applied convolutional neural networks (convolution on regular grid, e.g., sequence) to classification. However, only a limited number of studies have explored the more flexible graph convolutional neural networks (convolution on non-grid, e.g., arbitrary graph) for the task. In this work, we propose to use graph convolutional networks for text classification. We build a single text graph for a corpus based on word co-occurrence and document word relations, then learn a Text Graph Convolutional Network (Text GCN) for the corpus. Our Text GCN is initialized with one-hot representation for word and document, it then jointly learns the embeddings for both words and documents, as supervised by the known class labels for documents. Our experimental results on multiple benchmark datasets demonstrate that a vanilla Text GCN without any external word embeddings or knowledge outperforms state-of-the-art methods for text classification. On the other hand, Text GCN also learns predictive word and document embeddings. In addition, experimental results show that the improvement of Text GCN over state-of-the-art comparison methods become more prominent as we lower the percentage of training data, suggesting the robustness of Text GCN to less training data in text classification.

In open set learning, a model must be able to generalize to novel classes when it encounters a sample that does not belong to any of the classes it has seen before. Open set learning poses a realistic learning scenario that is receiving growing attention. Existing studies on open set learning mainly focused on detecting novel classes, but few studies tried to model them for differentiating novel classes. We recognize that novel classes should be different from each other, and propose distribution networks for open set learning that can learn and model different novel classes. We hypothesize that, through a certain mapping, samples from different classes with the same classification criterion should follow different probability distributions from the same distribution family. We estimate the probability distribution for each known class and a novel class is detected when a sample is not likely to belong to any of the known distributions. Due to the large feature dimension in the original feature space, the probability distributions in the original feature space are difficult to estimate. Distribution networks map the samples in the original feature space to a latent space where the distributions of known classes can be jointly learned with the network. In the latent space, we also propose a distribution parameter transfer strategy for novel class detection and modeling. By novel class modeling, the detected novel classes can serve as known classes to the subsequent classification. Our experimental results on image datasets MNIST and CIFAR10 and text dataset Ohsumed show that the distribution networks can detect novel classes accurately and model them well for the subsequent classification tasks.

Clinical text classification is an important problem in medical natural language processing. Existing studies have conventionally focused on rules or knowledge sources-based feature engineering, but only a few have exploited effective feature learning capability of deep learning methods. In this study, we propose a novel approach which combines rule-based features and knowledge-guided deep learning techniques for effective disease classification. Critical Steps of our method include identifying trigger phrases, predicting classes with very few examples using trigger phrases and training a convolutional neural network with word embeddings and Unified Medical Language System (UMLS) entity embeddings. We evaluated our method on the 2008 Integrating Informatics with Biology and the Bedside (i2b2) obesity challenge. The results show that our method outperforms the state of the art methods.

Gaussian processes (GPs) provide a nonparametric representation of functions. However, classical GP inference suffers from high computational cost and it is difficult to design nonstationary GP priors in practice. In this paper, we propose a sparse Gaussian process model, EigenGP, based on the Karhunen-Loeve (KL) expansion of a GP prior. We use the Nystrom approximation to obtain data dependent eigenfunctions and select these eigenfunctions by evidence maximization. This selection reduces the number of eigenfunctions in our model and provides a nonstationary covariance function. To handle nonlinear likelihoods, we develop an efficient expectation propagation (EP) inference algorithm, and couple it with expectation maximization for eigenfunction selection. Because the eigenfunctions of a Gaussian kernel are associated with clusters of samples - including both the labeled and unlabeled - selecting relevant eigenfunctions enables EigenGP to conduct semi-supervised learning. Our experimental results demonstrate improved predictive performance of EigenGP over alternative state-of-the-art sparse GP and semisupervised learning methods for regression, classification, and semisupervised classification.

Efficient training of deep neural networks (DNNs) is a challenge due to the associated highly nonconvex optimization. The alternating direction method of multipliers (ADMM) has attracted rising attention in deep learning for its potential of distributed computing. However, it remains an open problem to establish the convergence of ADMM in DNN training due to the nonlinear constraints involved. In this paper, we provide an answer to this problem by establishing the convergence of some nonlinearly constrained ADMM for DNNs with smooth activations. To be specific, we establish the global convergence to a Karush-Kuhn-Tucker (KKT) point at a ${\cal O}(1/k)$ rate. To achieve this goal, the key development lies in a new local linear approximation technique which enables us to overcome the hurdle of nonlinear constraints in ADMM for DNNs.

This paper presents MONET---an end-to-end semi-supervised learning framework for a pose detector using multiview image streams. What differentiates MONET from existing models is its capability of detecting general subjects including non-human species without a pre-trained model. A key challenge of such subjects lies in the limited availability of expert manual annotations, which often leads to a large bias in the detection model. We address this challenge by using the epipolar constraint embedded in the unlabeled data in two ways. First, given a set of the labeled data, the keypoint trajectories can be reliably reconstructed in 3D using multiview optical flows, resulting in considerable data augmentation in space and time from nearly exhaustive views. Second, the detection across views must geometrically agree with each other. We introduce a new measure of geometric consistency in keypoint distributions called epipolar divergence---a generalized distance from the epipolar lines to the corresponding keypoint distribution. Epipolar divergence characterizes when two view keypoint distributions produces zero reprojection error. We design a twin network that minimizes the epipolar divergence through stereo rectification that can significantly alleviate computational complexity and sampling aliasing in training. We demonstrate that our framework can localize customized keypoints of diverse species, e.g., humans, dogs, and monkeys.