Models, code, and papers for "Mia X. Chen":

Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been used in collaboration by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.

In a recent paper, the authors proposed a general methodology for probabilistic learning on manifolds. The method was used to generate numerical samples that are statistically consistent with an existing dataset construed as a realization from a non-Gaussian random vector. The manifold structure is learned using diffusion manifolds and the statistical sample generation is accomplished using a projected Ito stochastic differential equation. This probabilistic learning approach has been extended to polynomial chaos representation of databases on manifolds and to probabilistic nonconvex constrained optimization with a fixed budget of function evaluations. The methodology introduces an isotropic-diffusion kernel with hyperparameter {\epsilon}. Currently, {\epsilon} is more or less arbitrarily chosen. In this paper, we propose a selection criterion for identifying an optimal value of {\epsilon}, based on a maximum entropy argument. The result is a comprehensive, closed, probabilistic model for characterizing data sets with hidden constraints. This entropy argument ensures that out of all possible models, this is the one that is the most uncertain beyond any specified constraints, which is selected. Applications are presented for several databases.

In machine learning, we are given a dataset of the form $\{(\mathbf{x}_j,y_j)\}_{j=1}^M$, drawn as i.i.d. samples from an unknown probability distribution $\mu$; the marginal distribution for the $\mathbf{x}_j$'s being $\mu^*$. We propose that rather than using a positive kernel such as the Gaussian for estimation of these measures, using a non-positive kernel that preserves a large number of moments of these measures yields an optimal approximation. We use multi-variate Hermite polynomials for this purpose, and prove optimal and local approximation results in a supremum norm in a probabilistic sense. Together with a permutation test developed with the same kernel, we prove that the kernel estimator serves as a `witness function' in classification problems. Thus, if the value of this estimator at a point $\mathbf{x}$ exceeds a certain threshold, then the point is reliably in a certain class. This approach can be used to modify pretrained algorithms, such as neural networks or nonlinear dimension reduction techniques, to identify in-class vs out-of-class regions for the purposes of generative models, classification uncertainty, or finding robust centroids. This fact is demonstrated in a number of real world data sets including MNIST, CIFAR10, Science News documents, and LaLonde data sets.

This paper considers the matrix completion problem. We show that it is not necessary to assume joint incoherence, which is a standard but unintuitive and restrictive condition that is imposed by previous studies. This leads to a sample complexity bound that is order-wise optimal with respect to the incoherence parameter (as well as to the rank $r$ and the matrix dimension $n$ up to a log factor). As a consequence, we improve the sample complexity of recovering a semidefinite matrix from $O(nr^{2}\log^{2}n)$ to $O(nr\log^{2}n)$, and the highest allowable rank from $\Theta(\sqrt{n}/\log n)$ to $\Theta(n/\log^{2}n)$. The key step in proof is to obtain new bounds on the $\ell_{\infty,2}$-norm, defined as the maximum of the row and column norms of a matrix. To illustrate the applicability of our techniques, we discuss extensions to SVD projection, structured matrix completion and semi-supervised clustering, for which we provide order-wise improvements over existing results. Finally, we turn to the closely-related problem of low-rank-plus-sparse matrix decomposition. We show that the joint incoherence condition is unavoidable here for polynomial-time algorithms conditioned on the Planted Clique conjecture. This means it is intractable in general to separate a rank-$\omega(\sqrt{n})$ positive semidefinite matrix and a sparse matrix. Interestingly, our results show that the standard and joint incoherence conditions are associated respectively with the information (statistical) and computational aspects of the matrix decomposition problem.

For node level graph encoding, a recent important state-of-art method is the graph convolutional networks (GCN), which nicely integrate local vertex features and graph topology in the spectral domain. However, current studies suffer from several drawbacks: (1) graph CNNs relies on Chebyshev polynomial approximation which results in oscillatory approximation at jump discontinuities; (2) Increasing the order of Chebyshev polynomial can reduce the oscillations issue, but also incurs unaffordable computational cost; (3) Chebyshev polynomials require degree $\Omega$(poly(1/$\epsilon$)) to approximate a jump signal such as $|x|$, while rational function only needs $\mathcal{O}$(poly log(1/$\epsilon$))\cite{liang2016deep,telgarsky2017neural}. However, it's non-trivial to apply rational approximation without increasing computational complexity due to the denominator. In this paper, the superiority of rational approximation is exploited for graph signal recovering. RatioanlNet is proposed to integrate rational function and neural networks. We show that rational function of eigenvalues can be rewritten as a function of graph Laplacian, which can avoid multiplication by the eigenvector matrix. Focusing on the analysis of approximation on graph convolution operation, a graph signal regression task is formulated. Under graph signal regression task, its time complexity can be significantly reduced by graph Fourier transform. To overcome the local minimum problem of neural networks model, a relaxed Remez algorithm is utilized to initialize the weight parameters. Convergence rate of RatioanlNet and polynomial based methods on jump signal is analyzed for a theoretical guarantee. The extensive experimental results demonstrated that our approach could effectively characterize the jump discontinuities, outperforming competing methods by a substantial margin on both synthetic and real-world graphs.

We consider two closely related problems: planted clustering and submatrix localization. The planted clustering problem assumes that a random graph is generated based on some underlying clusters of the nodes; the task is to recover these clusters given the graph. The submatrix localization problem concerns locating hidden submatrices with elevated means inside a large real-valued random matrix. Of particular interest is the setting where the number of clusters/submatrices is allowed to grow unbounded with the problem size. These formulations cover several classical models such as planted clique, planted densest subgraph, planted partition, planted coloring, and stochastic block model, which are widely used for studying community detection and clustering/bi-clustering. For both problems, we show that the space of the model parameters (cluster/submatrix size, cluster density, and submatrix mean) can be partitioned into four disjoint regions corresponding to decreasing statistical and computational complexities: (1) the \emph{impossible} regime, where all algorithms fail; (2) the \emph{hard} regime, where the computationally expensive Maximum Likelihood Estimator (MLE) succeeds; (3) the \emph{easy} regime, where the polynomial-time convexified MLE succeeds; (4) the \emph{simple} regime, where a simple counting/thresholding procedure succeeds. Moreover, we show that each of these algorithms provably fails in the previous harder regimes. Our theorems establish the minimax recovery limit, which are tight up to constants and hold with a growing number of clusters/submatrices, and provide a stronger performance guarantee than previously known for polynomial-time algorithms. Our study demonstrates the tradeoffs between statistical and computational considerations, and suggests that the minimax recovery limit may not be achievable by polynomial-time algorithms.

In this paper, we consider the problem of learning high-dimensional tensor regression problems with low-rank structure. One of the core challenges associated with learning high-dimensional models is computation since the underlying optimization problems are often non-convex. While convex relaxations could lead to polynomial-time algorithms they are often slow in practice. On the other hand, limited theoretical guarantees exist for non-convex methods. In this paper we provide a general framework that provides theoretical guarantees for learning high-dimensional tensor regression models under different low-rank structural assumptions using the projected gradient descent algorithm applied to a potentially non-convex constraint set $\Theta$ in terms of its \emph{localized Gaussian width}. We juxtapose our theoretical results for non-convex projected gradient descent algorithms with previous results on regularized convex approaches. The two main differences between the convex and non-convex approach are: (i) from a computational perspective whether the non-convex projection operator is computable and whether the projection has desirable contraction properties and (ii) from a statistical upper bound perspective, the non-convex approach has a superior rate for a number of examples. We provide three concrete examples of low-dimensional structure which address these issues and explain the pros and cons for the non-convex and convex approaches. We supplement our theoretical results with simulations which show that, under several common settings of generalized low rank tensor regression, the projected gradient descent approach is superior both in terms of statistical error and run-time provided the step-sizes of the projected descent algorithm are suitably chosen.

The common approach to radial distortion is by the means of polynomial approximation, which introduces distortion-specific parameters into the camera model and requires estimation of these distortion parameters. The task of estimating radial distortion is to find a radial distortion model that allows easy undistortion as well as satisfactory accuracy. This paper presents a new piecewise radial distortion model with easy analytical undistortion formula. The motivation for seeking a piecewise radial distortion model is that, when a camera is resulted in a low quality during manufacturing, the nonlinear radial distortion can be complex. Using low order polynomials to approximate the radial distortion might not be precise enough. On the other hand, higher order polynomials suffer from the inverse problem. With the new piecewise radial distortion function, more flexibility is obtained and the radial undistortion can be performed analytically. Experimental results are presented to show that with this new piecewise radial distortion model, better performance is achieved than that using the single function. Furthermore, a comparable performance with the conventional polynomial model using 2 coefficients can also be accomplished.

We present MMDetection, an object detection toolbox that contains a rich set of object detection and instance segmentation methods as well as related components and modules. The toolbox started from a codebase of MMDet team who won the detection track of COCO Challenge 2018. It gradually evolves into a unified platform that covers many popular detection methods and contemporary modules. It not only includes training and inference codes, but also provides weights for more than 200 network models. We believe this toolbox is by far the most complete detection toolbox. In this paper, we introduce the various features of this toolbox. In addition, we also conduct a benchmarking study on different methods, components, and their hyper-parameters. We wish that the toolbox and benchmark could serve the growing research community by providing a flexible toolkit to reimplement existing methods and develop their own new detectors. Code and models are available at https://github.com/open-mmlab/mmdetection. The project is under active development and we will keep this document updated.

We study dual volume sampling, a method for selecting k columns from an n x m short and wide matrix (n <= k <= m) such that the probability of selection is proportional to the volume spanned by the rows of the induced submatrix. This method was proposed by Avron and Boutsidis (2013), who showed it to be a promising method for column subset selection and its multiple applications. However, its wider adoption has been hampered by the lack of polynomial time sampling algorithms. We remove this hindrance by developing an exact (randomized) polynomial time sampling algorithm as well as its derandomization. Thereafter, we study dual volume sampling via the theory of real stable polynomials and prove that its distribution satisfies the "Strong Rayleigh" property. This result has numerous consequences, including a provably fast-mixing Markov chain sampler that makes dual volume sampling much more attractive to practitioners. This sampler is closely related to classical algorithms for popular experimental design methods that are to date lacking theoretical analysis but are known to empirically work well.