We introduce and analyze a form of variance-reduced $Q$-learning. For $\gamma$-discounted MDPs with finite state space $\mathcal{X}$ and action space $\mathcal{U}$, we prove that it yields an $\epsilon$-accurate estimate of the optimal $Q$-function in the $\ell_\infty$-norm using $\mathcal{O} \left(\left(\frac{D}{ \epsilon^2 (1-\gamma)^3} \right) \; \log \left( \frac{D}{(1-\gamma)} \right) \right)$ samples, where $D = |\mathcal{X}| \times |\mathcal{U}|$. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity, and is the first form of model-free $Q$-learning proven to achieve the worst-case optimal cubic scaling in the discount complexity parameter $1/(1-\gamma)$ accompanied by optimal linear scaling in the state and action space sizes. By contrast, our past work shows that ordinary $Q$-learning has worst-case quartic scaling in the discount complexity.

Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

May 15, 2019

Martin J. Wainwright

Motivated by the study of $Q$-learning algorithms in reinforcement learning, we study a class of stochastic approximation procedures based on operators that satisfy monotonicity and quasi-contractivity conditions with respect to an underlying cone. We prove a general sandwich relation on the iterate error at each time, and use it to derive non-asymptotic bounds on the error in terms of a cone-induced gauge norm. These results are derived within a deterministic framework, requiring no assumptions on the noise. We illustrate these general bounds in application to synchronous $Q$-learning for discounted Markov decision processes with discrete state-action spaces, in particular by deriving non-asymptotic bounds on the $\ell_\infty$-norm for a range of stepsizes. These results are the sharpest known to date, and we show via simulation that the dependence of our bounds cannot be improved in a worst-case sense. These results show that relative to a model-based $Q$-iteration, the $\ell_\infty$-based sample complexity of $Q$-learning is suboptimal in terms of the discount factor $\gamma$.
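
The synchronous update analyzed in this paper is simple enough to state in code. Below is a minimal sketch of synchronous $Q$-learning on a tabular MDP, assuming a hypothetical array format for the transition kernel and rewards, and using one illustrative stepsize from the polynomial family studied in this line of work; it is not the paper's reference implementation.

```python
import numpy as np

def synchronous_q_learning(P, r, gamma, n_iters, seed=0):
    """Synchronous Q-learning on a tabular MDP (illustrative sketch).

    P: transition probabilities, shape (S, A, S)  -- hypothetical input format
    r: rewards, shape (S, A)
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for k in range(1, n_iters + 1):
        # One stepsize choice of the form studied in this literature (illustrative).
        alpha = 1.0 / (1.0 + (1.0 - gamma) * k)
        # Synchronous model: draw one next-state sample for every (state, action) pair.
        next_states = np.array(
            [[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)]
        )
        # Empirical Bellman update, averaged into the current iterate.
        target = r + gamma * Q[next_states].max(axis=-1)
        Q = (1.0 - alpha) * Q + alpha * target
    return Q
```

On a one-state, one-action MDP with reward 1 and $\gamma = 0.5$, the iterates approach the fixed point $Q^* = 1/(1-\gamma) = 2$.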

Discussion: Latent variable graphical model selection via convex optimization

Nov 05, 2012

Martin J. Wainwright

Discussion of "Latent variable graphical model selection via convex optimization" by Venkat Chandrasekaran, Pablo A. Parrilo and Alan S. Willsky [arXiv:1008.1290].

* Annals of Statistics 2012, Vol. 40, No. 4, 1978-1983

* Published at http://dx.doi.org/10.1214/12-AOS981 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Inconsistent parameter estimation in Markov random fields: Benefits in the computation-limited setting

Feb 27, 2006

Martin J. Wainwright

Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the \emph{same convex variational relaxation} is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the ``wrong'' model even in the infinite data limit) can be provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product.

* UC Berkeley, Department of Statistics; Technical Report 690

Convergence guarantees for a class of non-convex and non-smooth optimization problems

Apr 25, 2018

Koulik Khamaru, Martin J. Wainwright

We consider the problem of finding critical points of functions that are non-convex and non-smooth. Studying a fairly broad class of such problems, we analyze the behavior of three gradient-based methods (gradient descent, proximal update, and Frank-Wolfe update). For each of these methods, we establish rates of convergence for general problems, and also prove faster rates for continuous sub-analytic functions. We also show that our algorithms can escape strict saddle points for a class of non-smooth functions, thereby generalizing known results for smooth functions. Our analysis leads to a simplification of the popular CCCP algorithm, used for optimizing functions that can be written as a difference of two convex functions. Our simplified algorithm retains all the convergence properties of CCCP, along with a significantly lower cost per iteration. We illustrate our methods and theory via applications to the problems of best subset selection, robust estimation, mixture density estimation, and shape-from-shading reconstruction.

* 50 pages, 2 figures

Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees

Sep 10, 2015

Yudong Chen, Martin J. Wainwright

Optimization problems with rank constraints arise in many applications, including matrix regression, structured PCA, matrix completion and matrix decomposition problems. An attractive heuristic for solving such problems is to factorize the low-rank matrix, and to run projected gradient descent on the nonconvex factorized optimization problem. The goal of this paper is to provide a general theoretical framework for understanding when such methods work well, and to characterize the nature of the resulting fixed point. We provide a simple set of conditions under which projected gradient descent, when given a suitable initialization, converges geometrically to a statistically useful solution. Our results are applicable even when the initial solution is outside any region of local convexity, and even when the problem is globally concave. Working in a non-asymptotic framework, we show that our conditions are satisfied for a wide range of concrete models, including matrix regression, structured PCA, matrix completion with real and quantized observations, matrix decomposition, and graph clustering problems. Simulation results show excellent agreement with the theoretical predictions.
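
To make the factorized heuristic concrete, here is a toy instance on a symmetric rank-one target: gradient descent on the factor $U$ of $UU^\top \approx Y$. The stepsize, initialization, and function name are illustrative assumptions, not the paper's prescription (which in particular assumes a suitable initialization).

```python
import numpy as np

def factored_gd(Y, rank, step=0.05, n_iters=3000, seed=0):
    """Gradient descent on the factorization U U^T ~= Y (toy sketch).

    Y: symmetric matrix to approximate at the given rank.
    Stepsize and random small initialization are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    d = Y.shape[0]
    U = 0.1 * rng.standard_normal((d, rank))  # random init (hypothetical choice)
    for _ in range(n_iters):
        # Gradient of the objective 0.25 * ||Y - U U^T||_F^2 with respect to U.
        grad = (U @ U.T - Y) @ U
        U = U - step * grad
    return U
```

For a rank-one $Y = uu^\top$, the iterates drive the residual $\|UU^\top - Y\|_F$ toward zero despite the nonconvexity.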

Newton Sketch: A Linear-time Optimization Algorithm with Linear-Quadratic Convergence

May 09, 2015

Mert Pilanci, Martin J. Wainwright

We propose a randomized second-order method for optimization known as the Newton Sketch: it is based on performing an approximate Newton step using a randomly projected or sub-sampled Hessian. For self-concordant functions, we prove that the algorithm has super-linear convergence with exponentially high probability, with convergence and complexity guarantees that are independent of condition numbers and related problem-dependent quantities. Given a suitable initialization, similar guarantees also hold for strongly convex and smooth objectives without self-concordance. When implemented using randomized projections based on a sub-sampled Hadamard basis, the algorithm typically has substantially lower complexity than Newton's method. We also describe extensions of our methods to programs involving convex constraints that are equipped with self-concordant barriers. We discuss and illustrate applications to linear programs, quadratic programs with convex constraints, logistic regression and other generalized linear models, as well as semidefinite programs.
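
A minimal sketch of the idea for logistic regression appears below: at each step, the square-root Hessian is compressed with a Gaussian sketch and the Newton system is solved against the sketched Hessian with the exact gradient. The function name, sketch size `m`, full (undamped) step, and tiny ridge term are illustrative assumptions; the paper also covers other sketch ensembles and line-search safeguards omitted here.

```python
import numpy as np

def newton_sketch_logistic(X, y, m, n_iters=15, seed=0):
    """Newton Sketch for logistic regression with labels in {0, 1} (toy sketch).

    m is the sketch dimension; a fresh Gaussian sketch is drawn each iteration.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        z = np.clip(X @ w, -30, 30)          # clip logits for numerical safety
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T @ (p - y) / n             # exact gradient
        # Square-root Hessian: row i is sqrt(p_i (1 - p_i) / n) * x_i.
        root = np.sqrt(p * (1.0 - p) / n)[:, None] * X
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        SA = S @ root                         # sketched square-root Hessian
        H = SA.T @ SA + 1e-8 * np.eye(d)      # tiny ridge (illustrative safeguard)
        w = w - np.linalg.solve(H, grad)
    return w
```

On synthetic non-separable data, the iterates reduce the logistic loss well below its value at the all-zeros initialization.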

Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares

Nov 03, 2014

Mert Pilanci, Martin J. Wainwright

We study randomized sketching methods for approximately solving a least-squares problem with a general convex constraint. The quality of a least-squares approximation can be assessed in different ways: either in terms of the value of the quadratic objective function (cost approximation), or in terms of some distance measure between the approximate minimizer and the true minimizer (solution approximation). Focusing on the latter criterion, our first main result provides a general lower bound on any randomized method that sketches both the data matrix and vector in a least-squares problem; as a surprising consequence, the most widely used least-squares sketch is sub-optimal for solution approximation. We then present a new method known as the iterative Hessian sketch, and show that it can be used to obtain approximations to the original least-squares problem using a projection dimension proportional to the statistical complexity of the least-squares minimizer, and a logarithmic number of iterations. We illustrate our general theory with simulations for both unconstrained and constrained versions of least-squares, including $\ell_1$-regularization and nuclear norm constraints. We also numerically demonstrate the practicality of our approach in a real face expression classification experiment.
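
For the unconstrained case, the iteration is short enough to sketch directly: each round draws a fresh sketch of the data matrix and solves a small subproblem against the exact gradient. The Gaussian sketch, sketch size `m`, and function name are illustrative assumptions.

```python
import numpy as np

def iterative_hessian_sketch(A, y, m, n_iters=20, seed=0):
    """Iterative Hessian sketch for min_x ||Ax - y||^2, unconstrained case (toy sketch).

    m is the projection dimension; a fresh Gaussian sketch is used per iteration.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iters):
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        SA = S @ A                       # m x d sketched data matrix
        g = A.T @ (y - A @ x)            # exact gradient of the residual
        # Sketched Newton-type correction: solve against the sketched Hessian.
        x = x + np.linalg.solve(SA.T @ SA, g)
    return x
```

Because the gradient is exact, the error contracts geometrically toward the true least-squares solution even though each Hessian is only a noisy sketch.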

Randomized Sketches of Convex Programs with Sharp Guarantees

Apr 29, 2014

Mert Pilanci, Martin J. Wainwright

Random projection (RP) is a classical technique for reducing storage and computational costs. We analyze RP-based approximations of convex programs, in which the original optimization problem is approximated by the solution of a lower-dimensional problem. Such dimensionality reduction is essential in computation-limited settings, since the complexity of general convex programming can be quite high (e.g., cubic for quadratic programs, and substantially higher for semidefinite programs). In addition to computational savings, random projection is also useful for reducing memory usage, and has useful properties for privacy-sensitive optimization. We prove that the approximation ratio of this procedure can be bounded in terms of the geometry of the constraint set. For a broad class of random projections, including those based on various sub-Gaussian distributions as well as randomized Hadamard and Fourier transforms, the data matrix defining the cost function can be projected down to the statistical dimension of the tangent cone of the constraints at the original solution, which is often substantially smaller than the original dimension. We illustrate consequences of our theory for various cases, including unconstrained and $\ell_1$-constrained least squares, support vector machines, low-rank matrix estimation, and discuss implications for privacy-sensitive optimization and some connections with de-noising and compressed sensing.

Sampled forms of functional PCA in reproducing kernel Hilbert spaces

Feb 13, 2013

Arash A. Amini, Martin J. Wainwright

We consider the sampling problem for functional PCA (fPCA), where the simplest example is the case of taking time samples of the underlying functional components. More generally, we model the sampling operation as a continuous linear map from $\mathcal{H}$ to $\mathbb{R}^m$, where the functional components are assumed to lie in some Hilbert subspace $\mathcal{H}$ of $L^2$, such as a reproducing kernel Hilbert space of smooth functions. This model includes time and frequency sampling as special cases. In contrast to the classical approach to fPCA, in which access to entire functions is assumed, having a limited number $m$ of functional samples places limitations on the performance of statistical procedures. We study these effects by analyzing the rate of convergence of an M-estimator for the subspace spanned by the leading components in a multi-spiked covariance model. The estimator takes the form of regularized PCA, and hence is computationally attractive. We analyze the behavior of this estimator within a nonasymptotic framework, and provide bounds that hold with high probability as a function of the number of statistical samples $n$ and the number of functional samples $m$. We also derive lower bounds showing that the rates obtained are minimax optimal.

* Annals of Statistics 2012, Vol. 40, No. 5, 2483-2510

* Published at http://dx.doi.org/10.1214/12-AOS1033 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

Dec 16, 2012

Nima Noorshams, Martin J. Wainwright

The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series expansion, and a stochastic approximation via Monte Carlo estimates of the integral updates of the basis coefficients. We prove that the SOSMP iterates converge to a $\delta$-neighborhood of the unique BP fixed point for any tree-structured graph, and for graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy $\delta$ and the smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation.

* Portions of the results were presented at the International Symposium on Information Theory 2012. The results were also submitted to the Journal of Machine Learning Research on December 16th 2012

Stochastic Belief Propagation: A Low-Complexity Alternative to the Sum-Product Algorithm

May 25, 2012

Nima Noorshams, Martin J. Wainwright

The sum-product or belief propagation (BP) algorithm is a widely-used message-passing algorithm for computing marginal distributions in graphical models with discrete variables. At the core of the BP message updates, when applied to a graphical model with pairwise interactions, lies a matrix-vector product with complexity that is quadratic in the state dimension $d$, along with the transmission of a $(d-1)$-dimensional vector of real numbers (messages) from each node to its neighbors. Since various applications involve very large state dimensions, such computation and communication complexities can be prohibitive. In this paper, we propose a low-complexity variant of BP, referred to as stochastic belief propagation (SBP). As suggested by the name, it is an adaptively randomized version of the BP message updates in which each node passes randomly chosen information to each of its neighbors. The SBP message updates reduce the computational complexity (per iteration) from quadratic to linear in $d$, without assuming any particular structure of the potentials, and also reduce the communication complexity significantly, requiring only $\log d$ bits of transmission per edge. Moreover, we establish a number of theoretical guarantees for the performance of SBP, showing that it converges almost surely to the BP fixed point for any tree-structured graph, and for graphs with cycles satisfying a contractivity condition. In addition, for these graphical models, we provide non-asymptotic upper bounds on the convergence rate, showing that the $\ell_{\infty}$ norm of the error vector decays no slower than $O(1/\sqrt{t})$ with the number of iterations $t$ on trees, and that the mean square error decays as $O(1/t)$ for general graphs. These analyses show that SBP can provably yield reductions in computational and communication complexities for various classes of graphical models.

* Portions of the results were initially reported at the Allerton Conference on Communications, Control, and Computing (September 2011). The work was also submitted to IEEE Transaction on Information Theory in November 2011

Approximation properties of certain operator-induced norms on Hilbert spaces

May 31, 2011

Arash A. Amini, Martin J. Wainwright

We consider a class of operator-induced norms, acting as finite-dimensional surrogates to the $L^2$ norm, and study their approximation properties over Hilbert subspaces of $L^2$. The class includes, as a special case, the usual empirical norm encountered, for example, in the context of nonparametric regression in reproducing kernel Hilbert spaces (RKHS). Our results have implications for the analysis of M-estimators in models based on finite-dimensional linear approximation of functions, and also for some related packing problems.

Information-theoretic limits of selecting binary graphical models in high dimensions

May 16, 2009

Narayana Santhanam, Martin J. Wainwright

The problem of graphical model selection is to correctly estimate the graph structure of a Markov random field given samples from the underlying distribution. We analyze the information-theoretic limitations of the problem of graph selection for binary Markov random fields under high-dimensional scaling, in which the graph size $p$ and the number of edges $k$, and/or the maximal node degree $d$ are allowed to increase to infinity as a function of the sample size $n$. For pairwise binary Markov random fields, we derive both necessary and sufficient conditions for correct graph selection over the class $\mathcal{G}_{p,k}$ of graphs on $p$ vertices with at most $k$ edges, and over the class $\mathcal{G}_{p,d}$ of graphs on $p$ vertices with maximum degree at most $d$. For the class $\mathcal{G}_{p, k}$, we establish the existence of constants $c$ and $c'$ such that if $n < c k \log p$, any method has error probability at least 1/2 uniformly over the family, and we demonstrate a graph decoder that succeeds with high probability uniformly over the family for sample sizes $n > c' k^2 \log p$. Similarly, for the class $\mathcal{G}_{p,d}$, we exhibit constants $c$ and $c'$ such that for $n < c d^2 \log p$, any method fails with probability at least 1/2, and we demonstrate a graph decoder that succeeds with high probability for $n > c' d^3 \log p$.

* 27 pages

High-dimensional subset recovery in noise: Sparsified measurements without loss of statistical efficiency

May 20, 2008

Dapo Omidiran, Martin J. Wainwright

We consider the problem of estimating the support of a vector $\beta^* \in \mathbb{R}^{p}$ based on observations contaminated by noise. A significant body of work has studied the behavior of $\ell_1$-relaxations when applied to measurement matrices drawn from standard dense ensembles (e.g., Gaussian, Bernoulli). In this paper, we analyze \emph{sparsified} measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction $\gamma$ of non-zero entries, and the statistical efficiency, as measured by the minimal number of observations $n$ required for exact support recovery with probability converging to one. Our main result is to prove that it is possible to let $\gamma \to 0$ at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions.

Lossy source encoding via message-passing and decimation over generalized codewords of LDGM codes

Aug 15, 2005

Martin J. Wainwright, Elitza Maneva

We describe message-passing and decimation approaches for lossy source coding using low-density generator matrix (LDGM) codes. In particular, this paper addresses the problem of encoding a Bernoulli(0.5) source: for randomly generated LDGM codes with suitably irregular degree distributions, our methods yield performance very close to the rate distortion limit over a range of rates. Our approach is inspired by the survey propagation (SP) algorithm, originally developed by Mezard et al. for solving random satisfiability problems. Previous work by Maneva et al. shows how SP can be understood as belief propagation (BP) for an alternative representation of satisfiability problems. In analogy to this connection, our approach is to define a family of Markov random fields over generalized codewords, from which local message-passing rules can be derived in the standard way. The overall source encoding method is based on message-passing, setting a subset of bits to their preferred values (decimation), and reducing the code.

* To appear in the Proceedings of the International Symposium on Information Theory, Adelaide, Australia; September, 2005

Simple, Robust and Optimal Ranking from Pairwise Comparisons

Apr 27, 2016

Nihar B. Shah, Martin J. Wainwright

We consider data in the form of pairwise comparisons of n items, with the goal of precisely identifying the top k items for some value of k < n, or alternatively, recovering a ranking of all the items. We analyze the Copeland counting algorithm that ranks the items in order of the number of pairwise comparisons won, and show it has three attractive features: (a) its computational efficiency leads to speed-ups of several orders of magnitude in computation time as compared to prior work; (b) it is robust in that theoretical guarantees impose no conditions on the underlying matrix of pairwise-comparison probabilities, in contrast to some prior work that applies only to the BTL parametric model; and (c) it is an optimal method up to constant factors, meaning that it achieves the information-theoretic limits for recovering the top k-subset. We extend our results to obtain sharp guarantees for approximate recovery under the Hamming distortion metric, and more generally, to any arbitrary error requirement that satisfies a simple and natural monotonicity condition.
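
The Copeland counting rule described above is simple enough to state in a few lines: score each item by its total number of pairwise wins and take the $k$ highest-scoring items. The function name and the wins-matrix input format are hypothetical.

```python
import numpy as np

def copeland_topk(wins, k):
    """Copeland counting rule (toy sketch).

    wins[i, j] = number of comparisons in which item i beat item j
    (hypothetical input format). Returns the indices of the top-k items
    ranked by total win count.
    """
    scores = wins.sum(axis=1)            # total wins per item
    return set(np.argsort(-scores)[:k])  # k items with the most wins
```

For example, with three items where item 0 wins most comparisons and item 2 fewest, the rule returns `{0, 1}` as the top-2 set.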

* Changes in version 2: In addition to recovery in the exact and Hamming metrics, v2 analyzes a general, abstract recovery criterion based on a notion of "allowed sets"

Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima

Jan 01, 2015

Po-Ling Loh, Martin J. Wainwright

We provide novel theoretical results regarding local optima of regularized $M$-estimators, allowing for nonconvexity in both loss and penalty functions. Under restricted strong convexity on the loss and suitable regularity conditions on the penalty, we prove that \emph{any stationary point} of the composite objective function will lie within statistical precision of the underlying parameter vector. Our theory covers many nonconvex objective functions of interest, including the corrected Lasso for errors-in-variables linear models; regression for generalized linear models with nonconvex penalties such as SCAD, MCP, and capped-$\ell_1$; and high-dimensional graphical model estimation. We quantify statistical accuracy by providing bounds on the $\ell_1$-, $\ell_2$-, and prediction error between stationary points and the population-level optimum. We also propose a simple modification of composite gradient descent that may be used to obtain a near-global optimum within statistical precision $\epsilon$ in $\log(1/\epsilon)$ steps, which is the fastest possible rate of any first-order method. We provide simulation studies illustrating the sharpness of our theoretical results.

* 58 pages, 13 figures. To appear in JMLR

Support recovery without incoherence: A case for nonconvex regularization

Dec 17, 2014

Po-Ling Loh, Martin J. Wainwright

We demonstrate that the primal-dual witness proof method may be used to establish variable selection consistency and $\ell_\infty$-bounds for sparse regression problems, even when the loss function and/or regularizer are nonconvex. Using this method, we derive two theorems concerning support recovery and $\ell_\infty$-guarantees for the regression estimator in a general setting. Our results provide rigorous theoretical justification for the use of nonconvex regularization: For certain nonconvex regularizers with vanishing derivative away from the origin, support recovery consistency may be guaranteed without requiring the typical incoherence conditions present in $\ell_1$-based methods. We then derive several corollaries that illustrate the wide applicability of our method to analyzing composite objective functions involving losses such as least squares, nonconvex modified least squares for errors-in-variables linear regression, the negative log likelihood for generalized linear models, and the graphical Lasso. We conclude with empirical studies to corroborate our theoretical predictions.

* 51 pages, 13 figures

Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

Jan 06, 2014

Po-Ling Loh, Martin J. Wainwright

We investigate the relationship between the structure of a discrete graphical model and the support of the inverse of a generalized covariance matrix. We show that for certain graph structures, the support of the inverse covariance matrix of indicator variables on the vertices of a graph reflects the conditional independence structure of the graph. Our work extends results that have previously been established only in the context of multivariate Gaussian graphical models, thereby addressing an open question about the significance of the inverse covariance matrix of a non-Gaussian distribution. The proof exploits a combination of ideas from the geometry of exponential families, junction tree theory and convex analysis. These population-level results have various consequences for graph selection methods, both known and novel, including a novel method for structure estimation for missing or corrupted observations. We provide nonasymptotic guarantees for such methods and illustrate the sharpness of these predictions via simulations.

* Annals of Statistics 2013, Vol. 41, No. 6, 3022-3049

* Published at http://dx.doi.org/10.1214/13-AOS1162 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
