Models, code, and papers for "Yuan Cao":

We propose a novel parameter estimation procedure that works efficiently for conditional random fields (CRF). This algorithm is an extension to the maximum likelihood estimation (MLE), using loss functions defined by Bregman divergences which measure the proximity between the model expectation and the empirical mean of the feature vectors. This leads to a flexible training framework from which multiple update strategies can be derived using natural gradient descent (NGD). We carefully choose the convex function inducing the Bregman divergence so that the types of updates are reduced, while making the optimization procedure more effective by transforming the gradients of the log-likelihood loss function. The derived algorithms are very simple and can be easily implemented on top of the existing stochastic gradient descent (SGD) optimization procedure, yet it is very effective as illustrated by experimental results.

We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., number of hidden nodes per layer) is much larger than the number of training data points. We show that, the expected $0$-$1$ loss of a wide enough ReLU network trained with stochastic gradient descent (SGD) and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which we call a neural tangent random feature (NTRF) model. For data distributions that can be classified by NTRF model with sufficiently small error, our result yields a generalization error bound in the order of $\tilde{\mathcal{O}}(n^{-1/2})$ that is independent of the network width. Our result is more general and sharper than many existing generalization error bounds for over-parameterized neural networks. In addition, we establish a strong connection between our generalization error bound and the neural tangent kernel (NTK) proposed in recent work.

Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. While a line of recent work explains in theory that with over-parameterization and proper random initialization, gradient-based methods can find the global minima of the training loss for DNNs, it does not explain the good generalization performance of the gradient-based methods for learning over-parameterized DNNs. In this work, we take a step further, and prove that under certain assumption on the data distribution that is milder than linear separability, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error (i.e., population error). This leads to an algorithmic-dependent generalization error bound for deep learning. To the best of our knowledge, this is the first result of its kind that can explain the good generalization performance of over-parameterized deep neural networks learned by gradient descent.

Empirical studies show that gradient based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. While a line of recent work explains in theory that gradient-based methods with proper random initialization can find the global minima of the training loss in over-parameterized DNNs, it does not explain the good generalization performance of the gradient-based methods for learning over-parameterized DNNs. In this work, we take a step further, and prove that under certain assumption on the data distribution that is milder than linear separability, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error (i.e., population error). This leads to an algorithmic-dependent generalization error bound for deep learning. To the best of our knowledge, this is the first result of its kind that can explain the good generalization performance of over-parameterized deep neural networks learned by gradient descent.

The skip-connections used in residual networks have become a standard architecture choice in deep learning due to the increased training stability and generalization performance with this architecture, although there has been limited theoretical understanding for this improvement. In this work, we analyze overparameterized deep residual networks trained by gradient descent following random initialization, and demonstrate that (i) the class of networks learned by gradient descent constitutes a small subset of the entire neural network function class, and (ii) this subclass of networks is sufficiently large to guarantee small training error. By showing (i) we are able to demonstrate that deep residual networks trained with gradient descent have a small generalization gap between training and test error, and together with (ii) this guarantees that the test error will be small. Our optimization and generalization guarantees require overparameterization that is only logarithmic in the depth of the network, while all known generalization bounds for deep non-residual networks have overparameterization requirements that are at least polynomial in the depth. This provides an explanation for why residual networks are preferable to non-residual ones.

In this paper we propose a novel neural approach for automatic decipherment of lost languages. To compensate for the lack of strong supervision signal, our model design is informed by patterns in language change documented in historical linguistics. The model utilizes an expressive sequence-to-sequence model to capture character-level correspondences between cognates. To effectively train the model in an unsupervised manner, we innovate the training procedure by formalizing it as a minimum-cost flow problem. When applied to the decipherment of Ugaritic, we achieve a 5.5% absolute improvement over state-of-the-art results. We also report the first automatic results in deciphering Linear B, a syllabic language related to ancient Greek, where our model correctly translates 67.3% of cognates.

Facial caricature is an art form of drawing faces in an exaggerated way to convey humor or sarcasm. In this paper, we propose the first Generative Adversarial Network (GAN) for unpaired photo-to-caricature translation, which we call "CariGANs". It explicitly models geometric exaggeration and appearance stylization using two components: CariGeoGAN, which only models the geometry-to-geometry transformation from face photos to caricatures, and CariStyGAN, which transfers the style appearance from caricatures to face photos without any geometry deformation. In this way, a difficult cross-domain translation problem is decoupled into two easier tasks. The perceptual study shows that caricatures generated by our CariGANs are closer to the hand-drawn ones, and at the same time better persevere the identity, compared to state-of-the-art methods. Moreover, our CariGANs allow users to control the shape exaggeration degree and change the color/texture style by tuning the parameters or giving an example caricature.

This paper studies structure detection problems in high temperature ferromagnetic (positive interaction only) Ising models. The goal is to distinguish whether the underlying graph is empty, i.e., the model consists of independent Rademacher variables, versus the alternative that the underlying graph contains a subgraph of a certain structure. We give matching upper and lower minimax bounds under which testing this problem is possible/impossible respectively. Our results reveal that a key quantity called graph arboricity drives the testability of the problem. On the computational front, under a conjecture of the computational hardness of sparse principal component analysis, we prove that, unless the signal is strong enough, there are no polynomial time linear tests on the sample covariance matrix which are capable of testing this problem.

In this paper, we propose a novel approach to recover the missing entries of incomplete data represented by a high-dimension tensor. Tensor-train decomposition, which has powerful tensor representation ability and is free from `the curse of dimensionality', is employed in our approach. By observed entries of incomplete data, we consider to find the factors which can capture the latent features of the data and then reconstruct the missing entries. With low-rank assumption to the original data, tensor completion problem is cast into solving optimization models. Gradient descent methods are applied to optimize the core tensors of tensor-train decomposition. We propose two algorithms: Tensor-train Weighted Optimization (TT-WOPT) and Tensor-train Stochastic Gradient Descent (TT-SGD) to solve tensor completion problems. A high-order tensorization method named visual data tensorization (VDT) is proposed to transform visual data to higher-order forms by which the performance of our algorithms can be improved. The synthetic data experiments and visual data experiments show that our algorithms outperform the state-of-the-art completion algorithms. Especially in high-dimension, high missing rate and large-scale data cases, significant performance can be obtained from our algorithms.

In this paper, we aim at the problem of tensor data completion. Tensor-train decomposition is adopted because of its powerful representation ability and linear scalability to tensor order. We propose an algorithm named Sparse Tensor-train Optimization (STTO) which considers incomplete data as sparse tensor and uses first-order optimization method to find the factors of tensor-train decomposition. Our algorithm is shown to perform well in simulation experiments at both low-order cases and high-order cases. We also employ a tensorization method to transform data to a higher-order form to enhance the performance of our algorithm. The results of image recovery experiments in various cases manifest that our method outperforms other completion algorithms. Especially when the missing rate is very high, e.g., 90\% to 99\%, our method is significantly better than the state-of-the-art methods.

In this paper, we aim at the completion problem of high order tensor data with missing entries. The existing tensor factorization and completion methods suffer from the curse of dimensionality when the order of tensor N>>3. To overcome this problem, we propose an efficient algorithm called TT-WOPT (Tensor-train Weighted OPTimization) to find the latent core tensors of tensor data and recover the missing entries. Tensor-train decomposition, which has the powerful representation ability with linear scalability to tensor order, is employed in our algorithm. The experimental results on synthetic data and natural image completion demonstrate that our method significantly outperforms the other related methods. Especially when the missing rate of data is very high, e.g., 85% to 99%, our algorithm can achieve much better performance than other state-of-the-art algorithms.

Extreme learning machine (ELM), proposed by Huang et al., has been shown a promising learning algorithm for single-hidden layer feedforward neural networks (SLFNs). Nevertheless, because of the random choice of input weights and biases, the ELM algorithm sometimes makes the hidden layer output matrix H of SLFN not full column rank, which lowers the effectiveness of ELM. This paper discusses the effectiveness of ELM and proposes an improved algorithm called EELM that makes a proper selection of the input weights and bias before calculating the output weights, which ensures the full column rank of H in theory. This improves to some extend the learning rate (testing accuracy, prediction accuracy, learning time) and the robustness property of the networks. The experimental results based on both the benchmark function approximation and real-world problems including classification and regression applications show the good performances of EELM.

Dimensionality reduction is an essential technique for multi-way large-scale data, i.e., tensor. Tensor ring (TR) decomposition has become popular due to its high representation ability and flexibility. However, the traditional TR decomposition algorithms suffer from high computational cost when facing large-scale data. In this paper, taking advantages of the recently proposed tensor random projection method, we propose two TR decomposition algorithms. By employing random projection on every mode of the large-scale tensor, the TR decomposition can be processed at a much smaller scale. The simulation experiment shows that the proposed algorithms are $4-25$ times faster than traditional algorithms without loss of accuracy, and our algorithms show superior performance in deep learning dataset compression and hyperspectral image reconstruction experiments compared to other randomized algorithms.

We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activiation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centering around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on understanding the optimization of deep learning, and pave the way to study the optimization dynamics of training modern deep neural networks.

In crowdsourced preference aggregation, it is often assumed that all the annotators are subject to a common preference or utility function which generates their comparison behaviors in experiments. However, in reality annotators are subject to variations due to multi-criteria, abnormal, or a mixture of such behaviors. In this paper, we propose a parsimonious mixed-effects model based on HodgeRank, which takes into account both the fixed effect that the majority of annotators follows a common linear utility model, and the random effect that a small subset of annotators might deviate from the common significantly and exhibits strongly personalized preferences. HodgeRank has been successfully applied to subjective quality evaluation of multimedia and resolves pairwise crowdsourced ranking data into a global consensus ranking and cyclic conflicts of interests. As an extension, our proposed methodology further explores the conflicts of interests through the random effect in annotator specific variations. The key algorithm in this paper establishes a dynamic path from the common utility to individual variations, with different levels of parsimony or sparsity on personalization, based on newly developed Linearized Bregman Algorithms with Inverse Scale Space method. Finally the validity of the methodology are supported by experiments with both simulated examples and three real-world crowdsourcing datasets, which shows that our proposed method exhibits better performance (i.e. smaller test error) compared with HodgeRank due to its parsimonious property.

With the rapid growth of crowdsourcing platforms it has become easy and relatively inexpensive to collect a dataset labeled by multiple annotators in a short time. However due to the lack of control over the quality of the annotators, some abnormal annotators may be affected by position bias which can potentially degrade the quality of the final consensus labels. In this paper we introduce a statistical framework to model and detect annotator's position bias in order to control the false discovery rate (FDR) without a prior knowledge on the amount of biased annotators - the expected fraction of false discoveries among all discoveries being not too high, in order to assure that most of the discoveries are indeed true and replicable. The key technical development relies on some new knockoff filters adapted to our problem and new algorithms based on the Inverse Scale Space dynamics whose discretization is potentially suitable for large scale crowdsourcing data analysis. Our studies are supported by experiments with both simulated examples and real-world data. The proposed framework provides us a useful tool for quantitatively studying annotator's abnormal behavior in crowdsourcing data arising from machine learning, sociology, computer vision, multimedia, etc.

This paper proposes a unified framework to quantify local and global inferential uncertainty for high dimensional nonparanormal graphical models. In particular, we consider the problems of testing the presence of a single edge and constructing a uniform confidence subgraph. Due to the presence of unknown marginal transformations, we propose a pseudo likelihood based inferential approach. In sharp contrast to the existing high dimensional score test method, our method is free of tuning parameters given an initial estimator, and extends the scope of the existing likelihood based inferential framework. Furthermore, we propose a U-statistic multiplier bootstrap method to construct the confidence subgraph. We show that the constructed subgraph is contained in the true graph with probability greater than a given nominal level. Compared with existing methods for constructing confidence subgraphs, our method does not rely on Gaussian or sub-Gaussian assumptions. The theoretical properties of the proposed inferential methods are verified by thorough numerical experiments and real data analysis.

Video prediction, which aims to synthesize new consecutive frames subsequent to an existing video. However, its performance suffers from uncertainty of the future. As a potential weather application for video prediction, short time precipitation nowcasting is a more challenging task than other ones as its uncertainty is highly influenced by temperature, atmospheric, wind, humidity and such like. To address this issue, we propose a star-bridge neural network (StarBriNet). Specifically, we first construct a simple yet effective star-shape information bridge for RNN to transfer features across time-steps. We also propose a novel loss function designed for precipitaion nowcasting task. Furthermore, we utilize group normalization to refine the predictive performance of our network. Experiments in a Moving-Digital dataset and a weather predicting dataset demonstrate that our model outperforms the state-of-the-art algorithms for video prediction and precipitation nowcasting, achieving satisfied weather forecasting performance.