Research papers and code for "Martin J. Wainwright":
We introduce and analyze a form of variance-reduced $Q$-learning. For $\gamma$-discounted MDPs with finite state space $\mathcal{X}$ and action space $\mathcal{U}$, we prove that it yields an $\epsilon$-accurate estimate of the optimal $Q$-function in the $\ell_\infty$-norm using $\mathcal{O} \left(\left(\frac{D}{ \epsilon^2 (1-\gamma)^3} \right) \; \log \left( \frac{D}{(1-\gamma)} \right) \right)$ samples, where $D = |\mathcal{X}| \times |\mathcal{U}|$. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity, and is the first form of model-free $Q$-learning proven to achieve the worst-case optimal cubic scaling in the discount complexity parameter $1/(1-\gamma)$ accompanied by optimal linear scaling in the state and action space sizes. By contrast, our past work shows that ordinary $Q$-learning has worst-case quartic scaling in the discount complexity.
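
As a rough illustration of the variance-reduction idea described above, here is a minimal Python sketch of an SVRG-style recentered Q-update on a toy tabular MDP with a generative model; the MDP, epoch lengths, stepsize schedule, and sample sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP with a generative model (all sizes and constants are illustrative).
n_states, n_actions, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, u] = next-state distribution
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # deterministic rewards

def sampled_bellman(Q, n_samples):
    """Monte Carlo estimate of the Bellman optimality operator T(Q),
    using n_samples generative-model draws per (state, action) pair."""
    TQ = np.zeros_like(Q)
    for x in range(n_states):
        for u in range(n_actions):
            nxt = rng.choice(n_states, size=n_samples, p=P[x, u])
            TQ[x, u] = r[x, u] + gamma * np.mean(np.max(Q[nxt], axis=1))
    return TQ

def one_sample_pair(Q, Q_bar):
    """Single-sample empirical Bellman operators at Q and Q_bar, using the SAME
    next-state draw for both (this coupling is what reduces the variance)."""
    TQ, TQb = np.zeros_like(Q), np.zeros_like(Q_bar)
    for x in range(n_states):
        for u in range(n_actions):
            nxt = rng.choice(n_states, p=P[x, u])
            TQ[x, u] = r[x, u] + gamma * np.max(Q[nxt])
            TQb[x, u] = r[x, u] + gamma * np.max(Q_bar[nxt])
    return TQ, TQb

def vr_epoch(Q, Q_bar, n_recenter=1000, n_inner=200):
    """One epoch of an SVRG-style variance-reduced Q-learning scheme: recenter with an
    accurate estimate of T(Q_bar), then run stochastic updates whose noise is the
    difference of two coupled single-sample operators."""
    T_bar = sampled_bellman(Q_bar, n_samples=n_recenter)
    for k in range(1, n_inner + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)            # rescaled linear stepsize
        TQ, TQb = one_sample_pair(Q, Q_bar)
        Q = (1.0 - lam) * Q + lam * (TQ - TQb + T_bar)   # variance-reduced update
    return Q

# Reference: the optimal Q-function from exact value iteration.
Q_star = np.zeros((n_states, n_actions))
for _ in range(2000):
    Q_star = r + gamma * P @ np.max(Q_star, axis=1)

Q = np.zeros((n_states, n_actions))
for epoch in range(4):
    Q = vr_epoch(Q, Q_bar=Q.copy())
    print(f"epoch {epoch}: sup-norm error = {np.max(np.abs(Q - Q_star)):.3f}")
```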

Motivated by the study of $Q$-learning algorithms in reinforcement learning, we study a class of stochastic approximation procedures based on operators that satisfy monotonicity and quasi-contractivity conditions with respect to an underlying cone. We prove a general sandwich relation on the iterate error at each time, and use it to derive non-asymptotic bounds on the error in terms of a cone-induced gauge norm. These results are derived within a deterministic framework, requiring no assumptions on the noise. We illustrate these general bounds with an application to synchronous $Q$-learning for discounted Markov decision processes with discrete state-action spaces, in particular deriving non-asymptotic bounds on the $\ell_\infty$-norm of the error for a range of stepsizes. These bounds are the sharpest known to date, and we show via simulation that their dependence cannot be improved in a worst-case sense. They also show that, relative to model-based $Q$-iteration, the $\ell_\infty$-based sample complexity of $Q$-learning is suboptimal in terms of the discount factor $\gamma$.
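
The synchronous $Q$-learning setting analyzed here is easy to simulate. The sketch below, on an arbitrary small MDP and with two common stepsize schedules, only illustrates the kind of $\ell_\infty$-error behavior discussed above; the MDP and schedules are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synchronous Q-learning on a small random MDP (illustrative setup): every (state, action)
# pair is updated at each iteration using one fresh generative-model sample.
S, A, gamma = 5, 2, 0.95
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))

Q_star = np.zeros((S, A))
for _ in range(5000):                                   # exact value iteration, for reference
    Q_star = r + gamma * P @ np.max(Q_star, axis=1)

def run(stepsize, T=5000):
    Q = np.zeros((S, A))
    for k in range(1, T + 1):
        x_next = np.array([[rng.choice(S, p=P[x, u]) for u in range(A)] for x in range(S)])
        T_hat = r + gamma * np.max(Q[x_next], axis=2)   # one-sample empirical Bellman operator
        Q += stepsize(k) * (T_hat - Q)                  # synchronous Q-learning update
    return np.max(np.abs(Q - Q_star))                   # l_infty error

# Two stepsize schedules of the kind covered by such analyses: polynomial and rescaled linear.
print("alpha_k = k^(-2/3):        ", run(lambda k: k ** (-2.0 / 3.0)))
print("alpha_k = 1/(1+(1-gamma)k):", run(lambda k: 1.0 / (1.0 + (1.0 - gamma) * k)))
```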

Discussion of "Latent variable graphical model selection via convex optimization" by Venkat Chandrasekaran, Pablo A. Parrilo and Alan S. Willsky [arXiv:1008.1290].

* Annals of Statistics 2012, Vol. 40, No. 4, 1978-1983
* Published at http://dx.doi.org/10.1214/12-AOS981 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the \emph{same convex variational relaxation} is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the ``wrong'' model even in the infinite data limit) can be provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product.

* UC Berkeley, Department of Statistics; Technical Report 690
We consider the problem of finding critical points of functions that are non-convex and non-smooth. Studying a fairly broad class of such problems, we analyze the behavior of three gradient-based methods (gradient descent, proximal update, and Frank-Wolfe update). For each of these methods, we establish rates of convergence for general problems, and also prove faster rates for continuous sub-analytic functions. We also show that our algorithms can escape strict saddle points for a class of non-smooth functions, thereby generalizing known results for smooth functions. Our analysis leads to a simplification of the popular CCCP algorithm, used for optimizing functions that can be written as a difference of two convex functions. Our simplified algorithm retains all the convergence properties of CCCP, along with a significantly lower cost per iteration. We illustrate our methods and theory via applications to the problems of best subset selection, robust estimation, mixture density estimation, and shape-from-shading reconstruction.
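
A minimal sketch of the kind of proximal update discussed above, applied to an illustrative difference-of-convex objective (not an example from the paper): classic CCCP would fully solve a convex subproblem at each iteration, whereas the simplified variant takes a single proximal-gradient step.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative difference-of-convex objective (not from the paper):
#   f(x) = 0.5*||Ax - b||^2 + lam*||x||_1 - (mu/2)*||x||^2 = g(x) - h(x),
# with g convex and nonsmooth, and h(x) = (mu/2)*||x||^2 convex and smooth.
n, d = 100, 20
A = rng.standard_normal((n, d))
b = A @ (rng.standard_normal(d) * (rng.random(d) < 0.3)) + 0.1 * rng.standard_normal(n)
lam, mu = 0.5, 0.2

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x)) - 0.5 * mu * np.sum(x ** 2)

# Classic CCCP would solve min_x g(x) - <grad h(x_t), x> (a full Lasso problem) at every
# iteration; the simplified variant below takes a single proximal-gradient step instead.
eta = 1.0 / np.linalg.norm(A, 2) ** 2            # stepsize from the smooth part's curvature
x = np.zeros(d)
for t in range(500):
    grad_smooth = A.T @ (A @ x - b) - mu * x     # gradient of the smooth terms (h linearized)
    x = soft_threshold(x - eta * grad_smooth, eta * lam)   # one prox step per iteration
print("objective value:", f(x), "| nonzeros:", int(np.sum(x != 0)))
```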

* 50 pages, 2 figures
Optimization problems with rank constraints arise in many applications, including matrix regression, structured PCA, matrix completion and matrix decomposition problems. An attractive heuristic for solving such problems is to factorize the low-rank matrix, and to run projected gradient descent on the nonconvex factorized optimization problem. The goal of this paper is to provide a general theoretical framework for understanding when such methods work well, and to characterize the nature of the resulting fixed point. We provide a simple set of conditions under which projected gradient descent, when given a suitable initialization, converges geometrically to a statistically useful solution. Our results are applicable even when the initial solution is outside any region of local convexity, and even when the problem is globally concave. Working in a non-asymptotic framework, we show that our conditions are satisfied for a wide range of concrete models, including matrix regression, structured PCA, matrix completion with real and quantized observations, matrix decomposition, and graph clustering problems. Simulation results show excellent agreement with the theoretical predictions.
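
A minimal sketch of the factorized approach for one concrete instance, noiseless symmetric matrix completion; the spectral initialization, Frobenius-ball projection, and stepsize below are simplified, illustrative choices rather than the conditions used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(3)

# Factorized projected gradient descent for noiseless symmetric matrix completion.
d, r, p_obs = 50, 3, 0.4
U_star = rng.standard_normal((d, r))
M_star = U_star @ U_star.T                        # rank-r target matrix
W = (rng.random((d, d)) < p_obs).astype(float)
W = np.triu(W) + np.triu(W, 1).T                  # symmetric observation mask
Y = W * M_star                                    # observed entries (zeros elsewhere)

# Spectral initialization: top-r eigenpairs of the rescaled observed matrix.
vals, vecs = np.linalg.eigh(Y / p_obs)
idx = np.argsort(vals)[::-1][:r]
U = vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

radius = 1.5 * np.linalg.norm(U, "fro")           # crude Frobenius-ball constraint set
eta = 0.25 / np.linalg.norm(M_star, 2)            # stepsize (would have to be estimated in practice)
for t in range(500):
    grad = 2.0 * (W * (U @ U.T - M_star)) @ U     # gradient of 0.5*||W .* (UU^T - M*)||_F^2
    U -= eta * grad
    nrm = np.linalg.norm(U, "fro")                # projection onto the Frobenius ball
    if nrm > radius:
        U *= radius / nrm
print("relative error:",
      np.linalg.norm(U @ U.T - M_star, "fro") / np.linalg.norm(M_star, "fro"))
```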

We propose a randomized second-order method for optimization known as the Newton Sketch: it is based on performing an approximate Newton step using a randomly projected or sub-sampled Hessian. For self-concordant functions, we prove that the algorithm has super-linear convergence with exponentially high probability, with convergence and complexity guarantees that are independent of condition numbers and related problem-dependent quantities. Given a suitable initialization, similar guarantees also hold for strongly convex and smooth objectives without self-concordance. When implemented using randomized projections based on a sub-sampled Hadamard basis, the algorithm typically has substantially lower complexity than Newton's method. We also describe extensions of our methods to programs involving convex constraints that are equipped with self-concordant barriers. We discuss and illustrate applications to linear programs, quadratic programs with convex constraints, logistic regression and other generalized linear models, as well as semidefinite programs.
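
A minimal sketch of a Newton Sketch-style iteration for $\ell_2$-regularized logistic regression, using a dense Gaussian sketch of the Hessian square root and a simple backtracking line search; the problem sizes and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Newton-Sketch-style solver for l2-regularized logistic regression (illustrative sizes).
n, d, m, reg = 2000, 50, 300, 1e-3
A = rng.standard_normal((n, d))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-A @ rng.standard_normal(d)))).astype(float)

def loss_grad(x):
    z = A @ x
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)) + 0.5 * reg * x @ x
    grad = A.T @ (p - y) / n + reg * x
    return loss, grad, p

x = np.zeros(d)
for it in range(15):
    loss, grad, p = loss_grad(x)
    w = p * (1.0 - p) / n                          # per-sample Hessian weights
    B = np.sqrt(w)[:, None] * A                    # Hessian square root: H = B'B + reg*I
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sketch of the square root
    SB = S @ B
    H_sketch = SB.T @ SB + reg * np.eye(d)         # sketched Hessian approximation
    step = np.linalg.solve(H_sketch, grad)         # approximate Newton direction
    t = 1.0                                        # backtracking (Armijo) line search
    while loss_grad(x - t * step)[0] > loss - 1e-4 * t * grad @ step:
        t *= 0.5
    x = x - t * step
    print(f"iter {it}: loss = {loss:.6f}, grad norm = {np.linalg.norm(grad):.2e}")
```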

We study randomized sketching methods for approximately solving a least-squares problem with a general convex constraint. The quality of a least-squares approximation can be assessed in different ways: either in terms of the value of the quadratic objective function (cost approximation), or in terms of some distance measure between the approximate minimizer and the true minimizer (solution approximation). Focusing on the latter criterion, our first main result provides a general lower bound on any randomized method that sketches both the data matrix and vector in a least-squares problem; as a surprising consequence, the most widely used least-squares sketch is sub-optimal for solution approximation. We then present a new method known as the iterative Hessian sketch, and show that it can be used to obtain approximations to the original least-squares problem using a projection dimension proportional to the statistical complexity of the least-squares minimizer, and a logarithmic number of iterations. We illustrate our general theory with simulations for both unconstrained and constrained versions of least-squares, including $\ell_1$-regularization and nuclear norm constraints. We also numerically demonstrate the practicality of our approach in a real facial expression classification experiment.
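
A minimal sketch contrasting the classical "sketch both $A$ and $b$" approach with an iterative Hessian sketch for unconstrained least squares; problem sizes and sketch dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Iterative Hessian sketch (IHS) versus a single classical sketch for unconstrained least squares.
n, d, m = 5000, 30, 150
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + rng.standard_normal(n)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]        # exact least-squares solution (reference)

# Classical sketch: solve min ||S(Ax - b)||^2 once with a single random projection.
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_classical = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]

# IHS: re-sketch only the data matrix each round and correct using the exact gradient.
x = np.zeros(d)
for t in range(5):
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SA = S @ A
    g = A.T @ (b - A @ x)                          # exact (negative) gradient of 0.5*||Ax - b||^2
    x = x + np.linalg.solve(SA.T @ SA, g)          # sketched-Hessian correction step

print("||x_classical - x_ls|| =", np.linalg.norm(x_classical - x_ls))
print("||x_ihs       - x_ls|| =", np.linalg.norm(x - x_ls))
```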

Random projection (RP) is a classical technique for reducing storage and computational costs. We analyze RP-based approximations of convex programs, in which the original optimization problem is approximated by the solution of a lower-dimensional problem. Such dimensionality reduction is essential in computation-limited settings, since the complexity of general convex programming can be quite high (e.g., cubic for quadratic programs, and substantially higher for semidefinite programs). In addition to computational savings, random projection is also useful for reducing memory usage, and has useful properties for privacy-sensitive optimization. We prove that the approximation ratio of this procedure can be bounded in terms of the geometry of the constraint set. For a broad class of random projections, including those based on various sub-Gaussian distributions as well as randomized Hadamard and Fourier transforms, the data matrix defining the cost function can be projected down to the statistical dimension of the tangent cone of the constraints at the original solution, which is often substantially smaller than the original dimension. We illustrate consequences of our theory for various cases, including unconstrained and $\ell_1$-constrained least squares, support vector machines, and low-rank matrix estimation, and we discuss implications for privacy-sensitive optimization as well as connections with de-noising and compressed sensing.
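
A minimal sketch of the random-projection idea for an $\ell_1$-constrained least-squares program: the data matrix and response are projected to a lower dimension, and both the original and projected programs are solved by projected gradient descent. All dimensions and the solver are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# l1-constrained least squares: solve the original program and a randomly projected version.
n, d, m, radius = 2000, 200, 400, 5.0
A = rng.standard_normal((n, d))
x_true = np.zeros(d); x_true[:5] = 1.0            # sparse signal on the boundary of the l1 ball
b = A @ x_true + 0.5 * rng.standard_normal(n)

def project_l1(v, r):
    """Euclidean projection onto the l1 ball of radius r (sort-based algorithm)."""
    if np.sum(np.abs(v)) <= r:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * (np.arange(len(v)) + 1) > css - r)[0][-1]
    theta = (css[rho] - r) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def solve_constrained_ls(M, y, iters=2000):
    """Projected gradient descent for min 0.5*||Mx - y||^2 subject to ||x||_1 <= radius."""
    x, eta = np.zeros(d), 1.0 / np.linalg.norm(M, 2) ** 2
    for _ in range(iters):
        x = project_l1(x - eta * M.T @ (M @ x - y), radius)
    return x

x_full = solve_constrained_ls(A, b)               # original program
S = rng.standard_normal((m, n)) / np.sqrt(m)      # random projection of the data
x_sketch = solve_constrained_ls(S @ A, S @ b)     # lower-dimensional program

cost = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
print("cost approximation ratio:", cost(x_sketch) / cost(x_full))
print("relative solution error :", np.linalg.norm(x_sketch - x_full) / np.linalg.norm(x_full))
```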

We consider the sampling problem for functional PCA (fPCA), where the simplest example is the case of taking time samples of the underlying functional components. More generally, we model the sampling operation as a continuous linear map from $\mathcal{H}$ to $\mathbb{R}^m$, where the functional components are assumed to lie in some Hilbert subspace $\mathcal{H}$ of $L^2$, such as a reproducing kernel Hilbert space of smooth functions. This model includes time and frequency sampling as special cases. In contrast to the classical approach to fPCA, in which access to entire functions is assumed, having a limited number $m$ of functional samples places limitations on the performance of statistical procedures. We study these effects by analyzing the rate of convergence of an M-estimator for the subspace spanned by the leading components in a multi-spiked covariance model. The estimator takes the form of regularized PCA, and hence is computationally attractive. We analyze the behavior of this estimator within a nonasymptotic framework, and provide bounds that hold with high probability as a function of the number of statistical samples $n$ and the number of functional samples $m$. We also derive lower bounds showing that the rates obtained are minimax optimal.

* Annals of Statistics 2012, Vol. 40, No. 5, 2483-2510
* Published at http://dx.doi.org/10.1214/12-AOS1033 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series expansion, and a stochastic approximation via Monte Carlo estimates of the integral updates of the basis coefficients. We prove that the SOSMP iterates converge to a $\delta$-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graph with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy $\delta$ and the smoothness of the compatibility functions. We illustrate our theory with simulated examples and an application to optical flow estimation.

* Portions of the results were presented at the International Symposium on Information Theory 2012. The results were also submitted to the Journal of Machine Learning Research on December 16th 2012
The sum-product or belief propagation (BP) algorithm is a widely used message-passing algorithm for computing marginal distributions in graphical models with discrete variables. At the core of the BP message updates, when applied to a graphical model with pairwise interactions, lies a matrix-vector product with complexity that is quadratic in the state dimension $d$; moreover, each node must transmit a $(d-1)$-dimensional vector of real numbers (messages) to each of its neighbors. Since various applications involve very large state dimensions, these computation and communication requirements can be prohibitive. In this paper, we propose a low-complexity variant of BP, referred to as stochastic belief propagation (SBP). As suggested by the name, it is an adaptively randomized version of the BP message updates in which each node passes randomly chosen information to each of its neighbors. The SBP message updates reduce the computational complexity (per iteration) from quadratic to linear in $d$, without assuming any particular structure of the potentials, and also significantly reduce the communication complexity, requiring only $\log d$ bits of transmission per edge. Moreover, we establish a number of theoretical guarantees for the performance of SBP, showing that it converges almost surely to the BP fixed point for any tree-structured graph, and for graphs with cycles satisfying a contractivity condition. In addition, for these graphical models, we provide non-asymptotic upper bounds on the convergence rate, showing that the $\ell_{\infty}$ norm of the error vector decays no slower than $O(1/\sqrt{t})$ with the number of iterations $t$ on trees, and that the mean-squared error decays as $O(1/t)$ for general graphs. These analyses show that SBP can provably yield reductions in computational and communication complexity for various classes of graphical models.
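
A minimal sketch of the randomized-update idea on a three-node chain: the $O(d^2)$ matrix-vector product in the exact BP update is replaced by mixing in a single randomly chosen row of the edge potential with a decaying stepsize. The potentials, stepsize, and iteration count are illustrative, and the full SBP algorithm (which runs such updates over all edges) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stochastic update of a single BP message on the chain w - u - v with d states per node.
d = 200
psi_w, psi_u = rng.random(d) + 0.1, rng.random(d) + 0.1               # node potentials
Psi_wu, Psi_uv = rng.random((d, d)) + 0.1, rng.random((d, d)) + 0.1   # edge potentials

# Exact BP messages (each one is an O(d^2) matrix-vector product).
m_wu = Psi_wu.T @ psi_w
m_wu /= m_wu.sum()
m_uv_exact = Psi_uv.T @ (psi_u * m_wu)
m_uv_exact /= m_uv_exact.sum()

# Stochastic version of the u -> v update: sample one state of u from the normalized product
# of psi_u and the incoming message, then mix in a single row of Psi_uv -- O(d) work per
# iteration instead of O(d^2), at the price of Robbins-Monro averaging over iterations.
p = psi_u * m_wu
p /= p.sum()
m_uv = np.full(d, 1.0 / d)
for t in range(1, 20001):
    j = rng.choice(d, p=p)                 # randomly chosen information to pass
    lam = 1.0 / t                          # decaying stepsize
    m_uv = (1.0 - lam) * m_uv + lam * Psi_uv[j, :]
m_uv /= m_uv.sum()
print("l_infty error of stochastic message:", np.max(np.abs(m_uv - m_uv_exact)))
```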

* Portions of the results were initially reported at the Allerton Conference on Communications, Control, and Computing (September 2011). The work was also submitted to the IEEE Transactions on Information Theory in November 2011
We consider a class of operator-induced norms, acting as finite-dimensional surrogates for the $L^2$ norm, and study their approximation properties over Hilbert subspaces of $L^2$. The class includes, as a special case, the usual empirical norm encountered, for example, in the context of nonparametric regression in reproducing kernel Hilbert spaces (RKHS). Our results have implications for the analysis of M-estimators in models based on finite-dimensional linear approximation of functions, and also for some related packing problems.

The problem of graphical model selection is to correctly estimate the graph structure of a Markov random field given samples from the underlying distribution. We analyze the information-theoretic limitations of the problem of graph selection for binary Markov random fields under high-dimensional scaling, in which the graph size $p$, the number of edges $k$, and/or the maximal node degree $d$ are allowed to increase to infinity as a function of the sample size $n$. For pairwise binary Markov random fields, we derive both necessary and sufficient conditions for correct graph selection over the class $\mathcal{G}_{p,k}$ of graphs on $p$ vertices with at most $k$ edges, and over the class $\mathcal{G}_{p,d}$ of graphs on $p$ vertices with maximum degree at most $d$. For the class $\mathcal{G}_{p, k}$, we establish the existence of constants $c$ and $c'$ such that if $n < c k \log p$, any method has error probability at least 1/2 uniformly over the family, and we demonstrate a graph decoder that succeeds with high probability uniformly over the family for sample sizes $n > c' k^2 \log p$. Similarly, for the class $\mathcal{G}_{p,d}$, we exhibit constants $c$ and $c'$ such that for $n < c d^2 \log p$, any method fails with probability at least 1/2, and we demonstrate a graph decoder that succeeds with high probability for $n > c' d^3 \log p$.

* 27 pages
We consider the problem of estimating the support of a vector $\beta^* \in \mathbb{R}^{p}$ based on observations contaminated by noise. A significant body of work has studied the behavior of $\ell_1$-relaxations when applied to measurement matrices drawn from standard dense ensembles (e.g., Gaussian, Bernoulli). In this paper, we analyze \emph{sparsified} measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction $\gamma$ of non-zero entries, and the statistical efficiency, as measured by the minimal number of observations $n$ required for exact support recovery with probability converging to one. Our main result is to prove that it is possible to let $\gamma \to 0$ at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions.
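
A minimal simulation sketch of the trade-off described above: a sparsified Gaussian measurement matrix with a fraction $\gamma$ of non-zero entries, with the $\ell_1$-relaxation solved by plain ISTA; the dimensions and regularization level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)

# Sparsified Gaussian measurement ensemble: each entry of X is zeroed out independently with
# probability 1 - gamma and rescaled so that column variances match the dense ensemble.
n, p, k, gamma, sigma = 300, 1000, 10, 0.1, 0.5
beta_star = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
beta_star[support] = rng.choice([-1.0, 1.0], size=k)

mask = rng.random((n, p)) < gamma
X = mask * rng.standard_normal((n, p)) / np.sqrt(gamma)
y = X @ beta_star + sigma * rng.standard_normal(n)

# l1-relaxation (Lasso) solved by plain ISTA.
lam = 2.0 * sigma * np.sqrt(2.0 * np.log(p) / n)
eta = n / np.linalg.norm(X, 2) ** 2
beta = np.zeros(p)
for _ in range(2000):
    z = beta - eta * X.T @ (X @ beta - y) / n
    beta = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)

print("fraction of nonzero entries in X:", mask.mean())
print("exact support recovery:", set(np.nonzero(beta)[0]) == set(support))
```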

We describe message-passing and decimation approaches for lossy source coding using low-density generator matrix (LDGM) codes. In particular, this paper addresses the problem of encoding a Bernoulli(0.5) source: for randomly generated LDGM codes with suitably irregular degree distributions, our methods yield performance very close to the rate distortion limit over a range of rates. Our approach is inspired by the survey propagation (SP) algorithm, originally developed by Mezard et al. for solving random satisfiability problems. Previous work by Maneva et al. shows how SP can be understood as belief propagation (BP) for an alternative representation of satisfiability problems. In analogy to this connection, our approach is to define a family of Markov random fields over generalized codewords, from which local message-passing rules can be derived in the standard way. The overall source encoding method is based on message-passing, setting a subset of bits to their preferred values (decimation), and reducing the code.

* To appear in the Proceedings of the International Symposium on Information Theory, Adelaide, Australia; September, 2005
We consider data in the form of pairwise comparisons of n items, with the goal of precisely identifying the top k items for some value of k < n, or alternatively, recovering a ranking of all the items. We analyze the Copeland counting algorithm that ranks the items in order of the number of pairwise comparisons won, and show it has three attractive features: (a) its computational efficiency leads to speed-ups of several orders of magnitude in computation time as compared to prior work; (b) it is robust in that theoretical guarantees impose no conditions on the underlying matrix of pairwise-comparison probabilities, in contrast to some prior work that applies only to the BTL parametric model; and (c) it is an optimal method up to constant factors, meaning that it achieves the information-theoretic limits for recovering the top k-subset. We extend our results to obtain sharp guarantees for approximate recovery under the Hamming distortion metric, and more generally, to any arbitrary error requirement that satisfies a simple and natural monotonicity condition.
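
A minimal sketch of the counting-based estimator: rank items by total wins and return the top $k$. The comparison model used to generate data below is purely illustrative; the estimator itself uses only the observed win counts.

```python
import numpy as np

rng = np.random.default_rng(9)

# Counting-based top-k recovery from noisy pairwise comparisons.
n_items, k, comparisons_per_pair = 20, 5, 30
quality = np.sort(rng.random(n_items))[::-1]             # item 0 is best, item n-1 is worst

wins = np.zeros(n_items)
for i in range(n_items):
    for j in range(i + 1, n_items):
        # Probability that i beats j (any model may generate the data; none is assumed).
        p_ij = 0.5 + 0.4 * (quality[i] - quality[j])
        w_ij = rng.binomial(comparisons_per_pair, p_ij)  # wins of i over j
        wins[i] += w_ij
        wins[j] += comparisons_per_pair - w_ij

top_k = np.argsort(wins)[::-1][:k]                       # rank items by total wins
print("estimated top-k items:", sorted(top_k.tolist()))
print("true top-k items     :", list(range(k)))
```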

* Changes in version 2: In addition to recovery in the exact and Hamming metrics, v2 analyzes a general, abstract recovery criterion based on a notion of "allowed sets"
We provide novel theoretical results regarding local optima of regularized $M$-estimators, allowing for nonconvexity in both loss and penalty functions. Under restricted strong convexity on the loss and suitable regularity conditions on the penalty, we prove that \emph{any stationary point} of the composite objective function will lie within statistical precision of the underlying parameter vector. Our theory covers many nonconvex objective functions of interest, including the corrected Lasso for errors-in-variables linear models; regression for generalized linear models with nonconvex penalties such as SCAD, MCP, and capped-$\ell_1$; and high-dimensional graphical model estimation. We quantify statistical accuracy by providing bounds on the $\ell_1$-, $\ell_2$-, and prediction error between stationary points and the population-level optimum. We also propose a simple modification of composite gradient descent that may be used to obtain a near-global optimum within statistical precision $\epsilon$ in $\log(1/\epsilon)$ steps, which is the fastest possible rate of any first-order method. We provide simulation studies illustrating the sharpness of our theoretical results.
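
A minimal sketch of composite gradient descent for one of the examples mentioned above, the corrected Lasso for errors-in-variables regression, whose quadratic loss is indefinite when $p > n$; the penalty level, side-constraint radius, and data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(10)

# Corrected Lasso for errors-in-variables regression: the bias-corrected quadratic loss uses
# Gamma = Z'Z/n - sigma_w^2 I, which is indefinite when p > n, so the program is nonconvex.
n, p, k, sigma_w = 200, 400, 8, 0.3
beta_star = np.zeros(p); beta_star[:k] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta_star + 0.3 * rng.standard_normal(n)
Z = X + sigma_w * rng.standard_normal((n, p))       # covariates observed with additive noise

Gamma = Z.T @ Z / n - sigma_w ** 2 * np.eye(p)
gamma_vec = Z.T @ y / n
lam, R = 0.15, 1.1 * np.sum(np.abs(beta_star))      # penalty level and l1 side-constraint radius

def prox_pen_and_ball(z, thresh, r):
    """Prox of thresh*||.||_1 plus the indicator of {||b||_1 <= r}: soft-threshold at thresh,
    raising the threshold by bisection if the side constraint is violated."""
    st = lambda t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
    if np.sum(np.abs(st(thresh))) <= r:
        return st(thresh)
    lo, hi = thresh, np.max(np.abs(z))
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.sum(np.abs(st(mid))) > r else (lo, mid)
    return st(hi)

eta = 1.0 / np.linalg.norm(Gamma, 2)                # stepsize from the loss curvature
beta = np.zeros(p)
for _ in range(2000):
    grad = Gamma @ beta - gamma_vec                 # gradient of 0.5*b'Gamma b - gamma'b
    beta = prox_pen_and_ball(beta - eta * grad, eta * lam, R)
print("l2 estimation error:", np.linalg.norm(beta - beta_star))
```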

* 58 pages, 13 figures. To appear in JMLR
We demonstrate that the primal-dual witness proof method may be used to establish variable selection consistency and $\ell_\infty$-bounds for sparse regression problems, even when the loss function and/or regularizer are nonconvex. Using this method, we derive two theorems concerning support recovery and $\ell_\infty$-guarantees for the regression estimator in a general setting. Our results provide rigorous theoretical justification for the use of nonconvex regularization: For certain nonconvex regularizers with vanishing derivative away from the origin, support recovery consistency may be guaranteed without requiring the typical incoherence conditions present in $\ell_1$-based methods. We then derive several corollaries that illustrate the wide applicability of our method to analyzing composite objective functions involving losses such as least squares, nonconvex modified least squares for errors-in-variables linear regression, the negative log likelihood for generalized linear models, and the graphical Lasso. We conclude with empirical studies to corroborate our theoretical predictions.

* 51 pages, 13 figures
We investigate the relationship between the structure of a discrete graphical model and the support of the inverse of a generalized covariance matrix. We show that for certain graph structures, the support of the inverse covariance matrix of indicator variables on the vertices of a graph reflects the conditional independence structure of the graph. Our work extends results that have previously been established only in the context of multivariate Gaussian graphical models, thereby addressing an open question about the significance of the inverse covariance matrix of a non-Gaussian distribution. The proof exploits a combination of ideas from the geometry of exponential families, junction tree theory and convex analysis. These population-level results have various consequences for graph selection methods, both known and novel, including a novel method for structure estimation for missing or corrupted observations. We provide nonasymptotic guarantees for such methods and illustrate the sharpness of these predictions via simulations.
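
A minimal simulation sketch of the population-level phenomenon for a binary chain-structured model: with enough samples, the inverse of the ordinary covariance matrix is (approximately) supported on the chain's edges. The coupling strength and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)

# Empirical check that the inverse covariance of a binary chain-structured Markov random field
# is (approximately) supported on the chain's edges.
p, n, theta = 6, 200000, 0.8
prob_same = np.exp(theta) / (np.exp(theta) + np.exp(-theta))   # P(X_{t+1} = X_t) on each edge

X = np.empty((n, p))
X[:, 0] = rng.choice([-1.0, 1.0], size=n)
for t in range(1, p):                                  # exact forward sampling along the chain
    same = rng.random(n) < prob_same
    X[:, t] = np.where(same, X[:, t - 1], -X[:, t - 1])

Theta_hat = np.linalg.inv(np.cov(X, rowvar=False))     # inverse of the sample covariance
np.set_printoptions(precision=2, suppress=True)
print(np.abs(Theta_hat))   # large entries only on the diagonal and the chain edges (i, i+1)
```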

* Annals of Statistics 2013, Vol. 41, No. 6, 3022-3049
* Published at http://dx.doi.org/10.1214/13-AOS1162 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)