The modern data analyst must cope with data encoded in various forms: vectors, matrices, strings, graphs, and more. Consequently, statistical and machine learning models tailored to different data encodings are important. We focus on data encoded as normalized vectors, so that their "direction" is more important than their magnitude. Specifically, we consider high-dimensional vectors that lie either on the surface of the unit hypersphere or on the real projective plane. For such data, we briefly review common mathematical models prevalent in machine learning, while also outlining some technical aspects, software, applications, and open mathematical challenges.

* 12 pages, slightly modified version of submitted book chapter

Positive definite matrices abound in a dazzling variety of applications. This ubiquity can be attributed in part to their rich geometric structure: positive definite matrices form a self-dual convex cone whose strict interior is a Riemannian manifold. The manifold view is endowed with a "natural" distance function while the conic view is not. Nevertheless, drawing motivation from the conic view, we introduce the S-divergence as a "natural" distance-like function on the open cone of positive definite matrices. We motivate the S-divergence via a sequence of results that connect it to the Riemannian distance. In particular, we show that (a) this divergence is the square of a distance; and (b) it has several geometric properties similar to those of the Riemannian distance, though without being as computationally demanding. The S-divergence is even more intriguing: although nonconvex, we can still compute matrix means and medians using it to global optimality. We complement our results with some numerical experiments illustrating our theorems and our optimization algorithm for computing matrix medians.
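
The abstract does not state the formula, so the following is a minimal sketch that assumes the definition usually given for the S-divergence, S(X, Y) = log det((X + Y)/2) - (1/2) log det(XY). The helper names `logdet_spd` and `s_divergence` are illustrative and not taken from the paper; the log-determinant is computed through a plain Cholesky factorization so that the example needs only the standard library.

```python
import math


def logdet_spd(A):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Subtract the contribution of the already-computed columns.
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    # det(A) = prod(diag(L))^2, hence the factor 2.
    return 2.0 * sum(math.log(L[i][i]) for i in range(n))


def s_divergence(X, Y):
    """Assumed definition: log det((X+Y)/2) - (1/2) log det(X) - (1/2) log det(Y)."""
    n = len(X)
    mid = [[(X[i][j] + Y[i][j]) / 2.0 for j in range(n)] for i in range(n)]
    return logdet_spd(mid) - 0.5 * (logdet_spd(X) + logdet_spd(Y))


X = [[2.0, 0.0], [0.0, 1.0]]
Y = [[1.0, 0.0], [0.0, 2.0]]
print(s_divergence(X, Y))  # log(1.125) for this pair
print(s_divergence(X, X))  # zero when both arguments coincide
```

Only determinants of X, Y, and their arithmetic mean are needed, which is consistent with the abstract's remark that the divergence is less demanding to compute than the Riemannian distance.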

* 24 pages with several new results; a fraction of this paper also appeared at the Neural Information Processing Systems (NIPS) Conference, Dec. 2012

Within the unmanageably large class of nonconvex optimization, we consider the rich subclass of nonsmooth problems that have composite objectives; this subclass already includes the extensively studied convex, composite objective problems as a special case. For this subclass, we introduce a powerful, new framework that permits asymptotically non-vanishing perturbations. In particular, we develop perturbation-based batch and incremental (online-like) nonconvex proximal splitting algorithms. To our knowledge, this is the first time that such perturbation-based nonconvex splitting algorithms are being proposed and analyzed. While the main contribution of the paper is the theoretical framework, we complement our results by presenting some empirical results on matrix factorization.

* revised version 12 pages, 2 figures; superset of shorter counterpart in NIPS 2012

* Preprint of paper under review

* Published in 31st Annual Conference on Learning Theory (COLT'18)

Riemannian Frank-Wolfe with application to the geometric mean of positive definite matrices

May 09, 2018

Melanie Weber, Suvrit Sra

* Under review; 21 pages, 2 figures

Modular proximal optimization for multidimensional total-variation regularization

Dec 30, 2017

Álvaro Barbero, Suvrit Sra

We study *TV regularization*, a widely used technique for eliciting structured sparsity. In particular, we propose efficient algorithms for computing prox-operators for $\ell_p$-norm TV. The most important among these is $\ell_1$-norm TV, for whose prox-operator we present a new geometric analysis which unveils a hitherto unknown connection to taut-string methods. This connection turns out to be remarkably useful as it shows how our geometry-guided implementation results in efficient weighted and unweighted 1D-TV solvers, surpassing state-of-the-art methods. Our 1D-TV solvers provide the backbone for building more complex (two or higher-dimensional) TV solvers within a modular proximal optimization approach. We review the literature for an array of methods exploiting this strategy, and illustrate the benefits of our modular design through an extensive suite of experiments on (i) image denoising, (ii) image deconvolution, (iii) four variants of fused-lasso, and (iv) video denoising. To underscore our claims and permit easy reproducibility, we provide all the reviewed and our new TV solvers in an easy-to-use multi-threaded C++, Matlab and Python library.

* 67 pages, 32 figures, new non-iterative fast TV algorithm, extensive new experiments, corresponds to the github proxtv repository now

An Alternative to EM for Gaussian Mixture Models: Batch and Stochastic Riemannian Optimization

Jun 10, 2017

Reshad Hosseini, Suvrit Sra

* 21 pages, 6 figures

Diversity Networks: Neural Network Compression Using Determinantal Point Processes

Apr 18, 2017

Zelda Mariet, Suvrit Sra

* This paper appeared under the shorter title Diversity Networks at ICLR 2016 (http://www.iclr.cc/doku.php?id=iclr2016:main#accepted_papers_conference_track)

* 21 pages

Riemannian Dictionary Learning and Sparse Coding for Positive Definite Matrices

Dec 17, 2015

Anoop Cherian, Suvrit Sra

Fixed-point algorithms for learning determinantal point processes

Oct 08, 2015

Zelda Mariet, Suvrit Sra

Determinantal point processes (DPPs) offer an elegant tool for encoding probabilities over subsets of a ground set. Discrete DPPs are parametrized by a positive semidefinite matrix (called the DPP kernel), and estimating this kernel is key to learning DPPs from observed data. We consider the task of learning the DPP kernel, and develop for it a surprisingly simple yet effective new algorithm. Our algorithm offers the following benefits over previous approaches: (a) it is much simpler; (b) it yields equally good and sometimes even better local maxima; and (c) it runs an order of magnitude faster on large problems. We present experimental results on both real and simulated data to illustrate the numerical performance of our technique.
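
To make "probabilities over subsets" concrete, here is a minimal sketch that assumes the common L-ensemble parametrization, in which a subset Y of the ground set has probability det(L_Y) / det(L + I). The abstract does not say which parametrization the paper uses, and the function names and the toy kernel `L_kernel` below are illustrative only.

```python
from itertools import combinations


def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    n = len(M)
    if n == 0:
        return 1.0  # determinant of the empty matrix
    A = [[float(v) for v in row] for row in M]
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[p][c] == 0.0:
            return 0.0
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d  # a row swap flips the sign
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d


def dpp_probability(L, subset):
    """P(Y = subset) under the assumed L-ensemble form det(L_Y) / det(L + I)."""
    n = len(L)
    sub = [[L[i][j] for j in subset] for i in subset]
    l_plus_i = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return det(sub) / det(l_plus_i)


# Toy kernel: items 0 and 1 are highly similar, item 2 is unrelated to both.
L_kernel = [[1.0, 0.9, 0.0],
            [0.9, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

subsets = [list(s) for k in range(4) for s in combinations(range(3), k)]
total = sum(dpp_probability(L_kernel, s) for s in subsets)
print(total)                               # probabilities over all subsets sum to one
print(dpp_probability(L_kernel, [0, 1]))   # similar pair: less likely
print(dpp_probability(L_kernel, [0, 2]))   # dissimilar pair: more likely
```

Learning the kernel, the task the abstract addresses, amounts to choosing `L_kernel` so that observed subsets receive high probability under this model.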

* ICML, 2015

* 19 pages

Statistical estimation for optimization problems on graphs

Nov 29, 2013

Mikhail Langovoy, Suvrit Sra

Large graphs abound in machine learning, data mining, and several related areas. A useful step towards analyzing such graphs is that of obtaining certain summary statistics, e.g., the expected length of a shortest path between two nodes, or the expected weight of a minimum spanning tree of the graph. These statistics provide insight into the structure of a graph, and they can help predict global properties of a graph. Motivated thus, we propose to study statistical properties of structured subgraphs (of a given graph), in particular, to estimate the expected objective function value of a combinatorial optimization problem over these subgraphs. The general task is very difficult, if not unsolvable; so for concreteness we describe a more specific statistical estimation problem based on spanning trees. We hope that our position paper encourages others to also study other types of graphical structures for which one can prove nontrivial statistical estimates.

* Paper for the NIPS Workshop on Discrete Optimization for Machine Learning (DISCML) (2011): Uncertainty, Generalization and Feedback

Sparse Inverse Covariance Estimation via an Adaptive Gradient-Based Method

Jun 25, 2011

Suvrit Sra, Dongmin Kim

We study the problem of estimating, from data, a sparse approximation to the inverse covariance matrix. Estimating a sparsity-constrained inverse covariance matrix is a key component in Gaussian graphical model learning, but one that is numerically very challenging. We address this challenge by developing a new adaptive gradient-based method that carefully combines gradient information with an adaptive step-scaling strategy, which results in a scalable, highly competitive method. Our algorithm, like its predecessors, maximizes an $\ell_1$-norm penalized log-likelihood and has the same per-iteration arithmetic complexity as the best methods in its class. Our experiments reveal that our approach outperforms state-of-the-art competitors, often significantly so, for large problems.
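
For reference, the $\ell_1$-norm penalized log-likelihood mentioned above is usually written as follows. The symbols are assumptions, since the abstract does not fix notation: $S$ denotes the sample covariance matrix, $X$ the estimate of the inverse covariance, and $\lambda > 0$ the regularization weight. Some formulations penalize only the off-diagonal entries of $X$; the abstract does not say which variant the paper uses.

```latex
\max_{X \succ 0} \; \log\det X \;-\; \operatorname{tr}(S X) \;-\; \lambda \,\|X\|_1,
\qquad \|X\|_1 = \sum_{i,j} |X_{ij}|
```

The first two terms are the Gaussian log-likelihood up to constants and scaling, and the penalty term drives entries of $X$ to zero, which corresponds to missing edges in the Gaussian graphical model.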

* 13 pages

R-SPIDER: A Fast Riemannian Stochastic Optimization Algorithm with Curvature Independent Rate

Nov 28, 2018

Jingzhao Zhang, Hongyi Zhang, Suvrit Sra

* arXiv admin note: text overlap with arXiv:1605.07147

A long-standing problem in the theory of stochastic gradient descent (SGD) is to prove that its without-replacement version RandomShuffle converges faster than the usual with-replacement version. We present the first (to our knowledge) non-asymptotic solution to this problem, which shows that after a "reasonable" number of epochs RandomShuffle indeed converges faster than SGD. Specifically, we prove that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective, and T is the total number of iterations. This result shows that after a reasonable number of epochs RandomShuffle is strictly better than SGD (which converges as O(1/T)). The key step toward showing this better dependence on T is the introduction of n into the bound; and as our analysis will show, in general a dependence on n is unavoidable without further changes to the algorithm. We show that for sparse data RandomShuffle has the rate O(1/T^2), again strictly better than SGD. Furthermore, we discuss extensions to nonconvex gradient dominated functions, as well as non-strongly convex settings.
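
The difference between the two sampling schemes can be sketched on a toy problem. The code below is an illustration only, not the paper's experiment: the one-dimensional least-squares objective, the step-size schedule `2 / (t + 10)`, and the names `make_problem` and `run` are all assumptions chosen to keep the example short. It does not verify the stated rates; it only shows how with-replacement SGD and RandomShuffle differ in the order in which components are visited.

```python
import random


def make_problem(n, seed=0):
    """Toy objective f(x) = (1/n) * sum_i 0.5 * (a_i * x - b_i)^2 with scalar x."""
    rng = random.Random(seed)
    a = [rng.uniform(0.5, 1.5) for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Closed-form minimizer of the least-squares objective.
    x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
    return a, b, x_star


def run(a, b, epochs, shuffle, seed=1):
    """Run `epochs` passes of n steps each; `shuffle` selects RandomShuffle."""
    rng = random.Random(seed)
    n = len(a)
    x, t = 0.0, 0
    for _ in range(epochs):
        if shuffle:
            # RandomShuffle: a fresh permutation, each component used once per epoch.
            order = list(range(n))
            rng.shuffle(order)
        else:
            # Plain SGD: n independent draws with replacement.
            order = [rng.randrange(n) for _ in range(n)]
        for i in order:
            t += 1
            step = 2.0 / (t + 10)  # assumed diminishing step size
            x -= step * a[i] * (a[i] * x - b[i])  # gradient of component i
    return x


a, b, x_star = make_problem(20)
x_sgd = run(a, b, epochs=200, shuffle=False)
x_rs = run(a, b, epochs=200, shuffle=True)
print(abs(x_sgd - x_star), abs(x_rs - x_star))
```

Here n is the number of components and T = n * epochs is the total number of iterations, matching the roles of n and T in the rates quoted in the abstract.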

Learning Determinantal Point Processes by Sampling Inferred Negatives

Nov 02, 2018

Zelda Mariet, Mike Gartrell, Suvrit Sra

Finite sample expressive power of small-width ReLU networks

Oct 17, 2018

Chulhee Yun, Suvrit Sra, Ali Jadbabaie

* 17 pages, 2 figures
