Models, code, and papers for "Stefanie Jegelka":

Distributionally Robust Optimization and Generalization in Kernel Methods

May 27, 2019
Matthew Staib, Stefanie Jegelka

Distributionally robust optimization (DRO) has attracted attention in machine learning due to its connections to regularization, generalization, and robustness. Existing work has considered uncertainty sets based on phi-divergences and Wasserstein distances, each of which has drawbacks. In this paper, we study DRO with uncertainty sets measured via maximum mean discrepancy (MMD). We show that MMD DRO is roughly equivalent to regularization by the Hilbert norm and, as a byproduct, reveal deep connections to classic results in statistical learning. In particular, we obtain an alternative proof of a generalization bound for Gaussian kernel ridge regression through a DRO lens. The proof also suggests a new regularizer. Our results apply beyond kernel methods: we derive a generically applicable approximation of MMD DRO, and show that it generalizes recent work on variance-based regularization.
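
To make the Hilbert-norm connection above concrete, here is a plain NumPy sketch of Gaussian kernel ridge regression, where that norm appears explicitly as the penalty: the fitted function has squared Hilbert norm $\alpha^\top K \alpha$. The bandwidth and regularization weight are arbitrary illustrative choices, and the sketch does not implement the paper's MMD-DRO reformulation or its new regularizer.

    import numpy as np

    def gaussian_kernel(X, Z, bandwidth=1.0):
        # k(x, z) = exp(-||x - z||^2 / (2 * bandwidth^2))
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))

    def fit_krr(X, y, lam=0.1, bandwidth=1.0):
        # Solve (K + lam * n * I) alpha = y; the penalty lam * ||f||_H^2
        # equals lam * alpha^T K alpha at the optimum.
        n = len(y)
        K = gaussian_kernel(X, X, bandwidth)
        return np.linalg.solve(K + lam * n * np.eye(n), y)

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(50, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
    alpha = fit_krr(X, y)
    print("||f||_H^2 =", alpha @ gaussian_kernel(X, X) @ alpha)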


ResNet with one-neuron hidden layers is a Universal Approximator

Jul 04, 2018
Hongzhou Lin, Stefanie Jegelka

We demonstrate that a very deep ResNet whose stacked modules have one neuron per hidden layer and ReLU activation functions can uniformly approximate any Lebesgue-integrable function in $d$ dimensions, i.e. $\ell_1(\mathbb{R}^d)$. Because of the identity mapping inherent to ResNets, our network has alternating layers of dimension one and $d$. This stands in sharp contrast to fully connected networks, which are not universal approximators if their width is the input dimension $d$ [Lu et al., 2017; Hanin and Sellke, 2017]. Hence, our result implies an increase in representational power for narrow deep networks by the ResNet architecture.
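
The alternating-width structure is easy to see in code. Below is a minimal NumPy forward pass for a network of the kind described above: each residual block has a single ReLU unit, and its scaled output is added back onto the $d$-dimensional state via the identity skip. The weights are random placeholders; this only illustrates the shape of the computation, not the approximating construction from the proof.

    import numpy as np

    def one_neuron_resnet(x, blocks):
        # x: (d,) input; blocks: list of (v, b, u) with v, u of shape (d,), b a scalar
        h = x.copy()
        for v, b, u in blocks:
            hidden = max(0.0, float(v @ h + b))   # single ReLU unit (dimension one)
            h = h + u * hidden                    # identity skip keeps dimension d
        return h

    d, depth = 3, 5
    rng = np.random.default_rng(0)
    blocks = [(rng.standard_normal(d), float(rng.standard_normal()), rng.standard_normal(d))
              for _ in range(depth)]
    print(one_neuron_resnet(rng.standard_normal(d), blocks))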


Max-value Entropy Search for Efficient Bayesian Optimization

Jan 02, 2018
Zi Wang, Stefanie Jegelka

Entropy Search (ES) and Predictive Entropy Search (PES) are popular and empirically successful Bayesian Optimization techniques. Both rely on a compelling information-theoretic motivation, and maximize the information gained about the $\arg\max$ of the unknown function; yet, both are plagued by the expensive computation for estimating entropies. We propose a new criterion, Max-value Entropy Search (MES), that instead uses the information about the maximum function value. We show relations of MES to other Bayesian optimization methods, and establish a regret bound. We observe that MES maintains or improves the good empirical performance of ES/PES, while tremendously lightening the computational burden. In particular, MES is much more robust to the number of samples used for computing the entropy, and hence more efficient for higher dimensional problems.

* Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017 
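
For intuition, here is a hedged sketch of a max-value-style acquisition function: given the GP posterior mean and standard deviation at candidate points and a handful of samples of the unknown maximum value $y^*$, it averages the entropy-reduction term of a truncated Gaussian. The exact estimator and the way $y^*$ is sampled in the paper may differ; this only illustrates the "maximum value instead of $\arg\max$" idea from the abstract.

    import numpy as np
    from scipy.stats import norm

    def max_value_acquisition(mu, sigma, ystar_samples):
        # mu, sigma: posterior mean / std at candidate points; ystar_samples: sampled maxima
        sigma = np.maximum(sigma, 1e-9)
        gamma = (ystar_samples[:, None] - mu[None, :]) / sigma[None, :]
        vals = gamma * norm.pdf(gamma) / (2 * norm.cdf(gamma)) - norm.logcdf(gamma)
        return vals.mean(axis=0)                  # higher means more information about the max

    mu = np.array([0.0, 0.5, 0.9])
    sigma = np.array([1.0, 0.3, 0.1])
    ystar = np.array([1.2, 1.5, 1.1])             # toy samples of the maximum value
    print(max_value_acquisition(mu, sigma, ystar))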

Robust Budget Allocation via Continuous Submodular Functions

Jun 13, 2017
Matthew Staib, Stefanie Jegelka

The optimal allocation of resources for maximizing influence, spread of information, or coverage has gained attention in recent years, in particular in machine learning and data mining. But in applications, the parameters of the problem are rarely known exactly, and using wrong parameters can lead to undesirable outcomes. We hence revisit a continuous version of the Budget Allocation or Bipartite Influence Maximization problem introduced by Alon et al. (2012) from a robust optimization perspective, where an adversary may choose the least favorable parameters within a confidence set. The resulting problem is a nonconvex-concave saddle point problem (or game). We show that this nonconvex problem can be solved exactly by leveraging connections to continuous submodular functions, and by solving a constrained submodular minimization problem. Although constrained submodular minimization is hard in general, here, we establish conditions under which such a problem can be solved to arbitrary precision $\epsilon$.

* ICML 2017 
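
As a concrete instance of the objective, the sketch below evaluates a standard continuous budget-allocation (bipartite influence) value under an independent-activation model: channel $s$ with budget $y_s$ fails to reach customer $t$ with probability $(1 - p_{st})^{y_s}$. The parameters are toy values, and the robust saddle-point formulation and submodular-minimization machinery from the paper are not shown.

    import numpy as np

    def expected_influence(y, p):
        # y: (S,) budgets per channel; p: (S, T) per-contact transmission probabilities
        miss = np.prod((1.0 - p) ** y[:, None], axis=0)   # prob. customer t is never reached
        return np.sum(1.0 - miss)                         # expected number of reached customers

    rng = np.random.default_rng(0)
    p = rng.uniform(0.0, 0.3, size=(3, 5))
    y = np.array([2.0, 1.0, 0.5])
    print(expected_influence(y, p))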

Graph Cuts with Interacting Edge Costs - Examples, Approximations, and Algorithms

Mar 26, 2016
Stefanie Jegelka, Jeff Bilmes

We study an extension of the classical graph cut problem, wherein we replace the modular (sum of edge weights) cost function by a submodular set function defined over graph edges. Special cases of this problem have appeared in different applications in signal processing, machine learning, and computer vision. In this paper, we connect these applications via the generic formulation of "cooperative graph cuts", for which we study complexity, algorithms, and connections to polymatroidal network flows. Finally, we compare the proposed algorithms empirically.

* 46 pages 
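
The following toy snippet contrasts a cooperative cut cost with the usual modular one: the cut edges are scored by a submodular function over edge groups. The square-root discount per group is purely an illustrative choice, not the paper's specific model.

    import numpy as np

    def cooperative_cut_cost(labels, edges, edge_group, n_groups):
        # labels: dict node -> 0/1; edges: list of (u, v); edge_group: group id per edge
        counts = np.zeros(n_groups)
        for (u, v), g in zip(edges, edge_group):
            if labels[u] != labels[v]:       # the edge is cut by the labeling
                counts[g] += 1
        return np.sqrt(counts).sum()         # submodular: extra cut edges in a group cost less

    edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
    edge_group = [0, 0, 1, 1]
    labels = {0: 0, 1: 0, 2: 1, 3: 1}
    print(cooperative_cut_cost(labels, edges, edge_group, n_groups=2))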

Minimizing approximately submodular functions

May 29, 2019
Marwa El Halabi, Stefanie Jegelka

The problem of minimizing a submodular function is well studied; several polynomial-time algorithms have been developed to solve it exactly or up to arbitrary accuracy. However, in many applications, the objective functions are not exactly submodular. In this paper, we show that a classical algorithm used for submodular minimization performs well even for a class of non-submodular functions, namely weakly DR-submodular functions. We provide the first approximation guarantee for non-submodular minimization. This broadly expands the range of applications of submodular minimization techniques.
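
For readers unfamiliar with the continuous machinery behind such algorithms, the snippet below evaluates the Lovász extension by the standard greedy ordering; it underlies classical submodular-minimization methods in general and is shown only as background, not as the specific algorithm analyzed in the paper.

    import numpy as np

    def lovasz_extension(F, x):
        # F: set function with F(empty set) = 0; x: point in [0, 1]^n
        order = np.argsort(-x)               # visit coordinates in decreasing order of x
        val, S, prev = 0.0, set(), 0.0
        for i in order:
            S.add(int(i))
            cur = F(S)
            val += x[i] * (cur - prev)       # weight each marginal gain by x_i
            prev = cur
        return val

    F = lambda S: np.sqrt(len(S))            # a simple submodular example
    print(lovasz_extension(F, np.array([0.2, 0.9, 0.5])))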


Flexible Modeling of Diversity with Strongly Log-Concave Distributions

Jun 12, 2019
Joshua Robinson, Suvrit Sra, Stefanie Jegelka

Strongly log-concave (SLC) distributions are a rich class of discrete probability distributions over subsets of some ground set. They are strictly more general than strongly Rayleigh (SR) distributions such as the well-known determinantal point process. While SR distributions offer elegant models of diversity, they lack an easy control over how they express diversity. We propose SLC as the right extension of SR that enables easier, more intuitive control over diversity, illustrating this via examples of practical importance. We develop two fundamental tools needed to apply SLC distributions to learning and inference: sampling and mode finding. For sampling we develop an MCMC sampler and give theoretical mixing time bounds. For mode finding, we establish a weak log-submodularity property for SLC functions and derive optimization guarantees for a distorted greedy algorithm.
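
As a generic illustration of subset sampling by a local Markov chain (not the paper's sampler or its mixing-time analysis), here is a simple add/remove Metropolis-Hastings chain targeting an unnormalized distribution $\pi(S) \propto \exp(g(S))$; the target used below is a toy diversity-style weight.

    import numpy as np

    def mh_subset_sampler(log_pi, n, steps, rng):
        # Flip the membership of a uniformly chosen element; accept by the Metropolis rule.
        S = set()
        for _ in range(steps):
            i = int(rng.integers(n))
            T = set(S) ^ {i}
            if np.log(rng.random()) < log_pi(T) - log_pi(S):
                S = T
        return S

    rng = np.random.default_rng(0)
    log_pi = lambda S: np.sqrt(len(S))       # toy target: pi(S) proportional to exp(sqrt(|S|))
    print(mh_subset_sampler(log_pi, n=6, steps=1000, rng=rng))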


Optimization as Estimation with Gaussian Processes in Bandit Settings

Aug 12, 2018
Zi Wang, Bolei Zhou, Stefanie Jegelka

Recently, there has been rising interest in Bayesian optimization -- the optimization of an unknown function with assumptions usually expressed by a Gaussian Process (GP) prior. We study an optimization strategy that directly uses an estimate of the argmax of the function. This strategy offers both practical and theoretical advantages: no tradeoff parameter needs to be selected, and, moreover, we establish close connections to the popular GP-UCB and GP-PI strategies. Our approach can be understood as automatically and adaptively trading off exploration and exploitation in GP-UCB and GP-PI. We illustrate the effects of this adaptive tuning via bounds on the regret as well as an extensive empirical evaluation on robotics and vision tasks, demonstrating the robustness of this strategy for a range of performance criteria.

* Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain 
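
A hedged sketch of the core idea: with an estimate $\hat{m}$ of the maximum value in hand, pick the candidate most likely to exceed it, i.e. a probability-of-improvement rule with an adaptively chosen target. How $\hat{m}$ is estimated, and the precise relation to GP-UCB and GP-PI, are developed in the paper and not reproduced here.

    import numpy as np
    from scipy.stats import norm

    def estimate_then_optimize(mu, sigma, m_hat):
        # Equivalent to minimizing (m_hat - mu) / sigma over the candidates.
        sigma = np.maximum(sigma, 1e-9)
        return norm.cdf((mu - m_hat) / sigma)

    mu = np.array([0.2, 0.6, 0.4])
    sigma = np.array([0.5, 0.2, 0.4])
    print(estimate_then_optimize(mu, sigma, m_hat=1.0))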

Distributionally Robust Submodular Maximization

Jun 06, 2018
Matthew Staib, Bryan Wilder, Stefanie Jegelka

Submodular functions have applications throughout machine learning, but in many settings, we do not have direct access to the underlying function $f$. We focus on stochastic functions that are given as an expectation of functions over a distribution $P$. In practice, we often have only a limited set of samples $f_i$ from $P$. The standard approach indirectly optimizes $f$ by maximizing the sum of $f_i$. However, this ignores generalization to the true (unknown) distribution. In this paper, we achieve better performance on the actual underlying function $f$ by directly optimizing a combination of bias and variance. Algorithmically, we accomplish this by showing how to carry out distributionally robust optimization (DRO) for submodular functions, providing efficient algorithms backed by theoretical guarantees which leverage several novel contributions to the general theory of DRO. We also show compelling empirical evidence that DRO improves generalization to the unknown stochastic submodular function.
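
The baseline that the abstract argues against is easy to state in code: greedily maximize the empirical average of the sampled functions $f_i$ under a cardinality constraint. The paper's distributionally robust reweighting of the samples is not implemented here; this sketch only fixes ideas, using toy coverage-style samples.

    def greedy_average(fs, ground, k):
        # fs: sampled set functions; ground: items; k: cardinality budget
        S = set()
        for _ in range(k):
            best, best_gain = None, float("-inf")
            for e in set(ground) - S:
                gain = sum(f(S | {e}) - f(S) for f in fs) / len(fs)
                if gain > best_gain:
                    best, best_gain = e, gain
            S.add(best)
        return S

    fs = [lambda S: len(S & {0, 1, 2}), lambda S: len(S & {2, 3, 4})]   # toy samples
    print(greedy_average(fs, range(5), k=2))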


Robust GANs against Dishonest Adversaries

Feb 27, 2018
Zhi Xu, Chengtao Li, Stefanie Jegelka

Robustness of deep learning models is a property that has recently gained increasing attention. We formally define a notion of robustness for generative adversarial models, and show that, perhaps surprisingly, the GAN in its original form is not robust. Indeed, the discriminator in GANs may be viewed as merely offering "teaching feedback". Our notion of robustness relies on a dishonest discriminator, or noisy, adversarial interference with its feedback. We explore, theoretically and empirically, the effect of model and training properties on this robustness. In particular, we show theoretical conditions for robustness that are supported by empirical evidence. We also test the effect of regularization. Our results suggest variations of GANs that are indeed more robust to noisy attacks, and have overall more stable training behavior.


Polynomial Time Algorithms for Dual Volume Sampling

Nov 16, 2017
Chengtao Li, Stefanie Jegelka, Suvrit Sra

We study dual volume sampling, a method for selecting $k$ columns from an $n \times m$ short and wide matrix ($n \le k \le m$) such that the probability of selection is proportional to the volume spanned by the rows of the induced submatrix. This method was proposed by Avron and Boutsidis (2013), who showed it to be a promising method for column subset selection and its multiple applications. However, its wider adoption has been hampered by the lack of polynomial time sampling algorithms. We remove this hindrance by developing an exact (randomized) polynomial time sampling algorithm as well as its derandomization. Thereafter, we study dual volume sampling via the theory of real stable polynomials and prove that its distribution satisfies the "Strong Rayleigh" property. This result has numerous consequences, including a provably fast-mixing Markov chain sampler that makes dual volume sampling much more attractive to practitioners. This sampler is closely related to classical algorithms for popular experimental design methods that are to date lacking theoretical analysis but are known to empirically work well.
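
To pin down the distribution being sampled, here is a brute-force illustration for a tiny matrix: a column subset $S$ of size $k$ is drawn with probability proportional to $\det(A_S A_S^\top)$, the Gram determinant of the selected submatrix's rows. Exhaustive enumeration is only feasible for very small $m$; the polynomial-time and fast-mixing samplers are precisely the paper's contribution and are not reproduced here.

    import itertools
    import numpy as np

    def dual_volume_sample(A, k, rng):
        # Enumerate all k-column subsets and sample one in proportion to its squared volume.
        n, m = A.shape
        subsets = list(itertools.combinations(range(m), k))
        vols = np.array([np.linalg.det(A[:, list(S)] @ A[:, list(S)].T) for S in subsets])
        return subsets[rng.choice(len(subsets), p=vols / vols.sum())]

    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 5))          # n = 2 rows, m = 5 columns
    print(dual_volume_sample(A, k=3, rng=rng))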


Fast Mixing Markov Chains for Strongly Rayleigh Measures, DPPs, and Constrained Sampling

Jan 08, 2017
Chengtao Li, Stefanie Jegelka, Suvrit Sra

We study probability measures induced by set functions with constraints. Such measures arise in a variety of real-world settings, where prior knowledge, resource limitations, or other pragmatic considerations impose constraints. We consider the task of rapidly sampling from such constrained measures, and develop fast Markov chain samplers for them. Our first main result is for MCMC sampling from Strongly Rayleigh (SR) measures, for which we present sharp polynomial bounds on the mixing time. As a corollary, this result yields a fast mixing sampler for Determinantal Point Processes (DPPs), yielding (to our knowledge) the first provably fast MCMC sampler for DPPs since their inception over four decades ago. Beyond SR measures, we develop MCMC samplers for probabilistic models with hard constraints and identify sufficient conditions under which their chains mix rapidly. We illustrate our claims by empirically verifying the dependence of mixing times on the key factors governing our theoretical bounds.

* The present version subsumes arXiv:1607.03559 

Fast Sampling for Strongly Rayleigh Measures with Application to Determinantal Point Processes

Jul 13, 2016
Chengtao Li, Stefanie Jegelka, Suvrit Sra

In this note we consider sampling from (non-homogeneous) strongly Rayleigh probability measures. As an important corollary, we obtain a fast mixing Markov Chain sampler for Determinantal Point Processes.


Gauss quadrature for matrix inverse forms with applications

May 28, 2016
Chengtao Li, Suvrit Sra, Stefanie Jegelka

We present a framework for accelerating a spectrum of machine learning algorithms that require computation of bilinear inverse forms $u^\top A^{-1}u$, where $A$ is a positive definite matrix and $u$ a given vector. Our framework is built on Gauss-type quadrature and easily scales to large, sparse matrices. Further, it allows retrospective computation of lower and upper bounds on $u^\top A^{-1}u$, which in turn accelerates several algorithms. We prove that these bounds tighten iteratively and converge at a linear (geometric) rate. To our knowledge, ours is the first work to demonstrate these key properties of Gauss-type quadrature, which is a classical and deeply studied topic. We illustrate empirical consequences of our results by using quadrature to accelerate machine learning tasks involving determinantal point processes and submodular optimization, and observe tremendous speedups in several instances.
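
The quantity in question is easy to approach iteratively. As a hedged illustration (plain conjugate gradients, not the paper's Gauss-quadrature framework or its retrospective upper bounds), the sketch below reports $u^\top x_k$ for the CG iterates of $Ax = u$ started at zero, which increases toward $u^\top A^{-1} u$.

    import numpy as np

    def cg_bilinear_estimates(A, u, iters):
        # Plain conjugate gradients on A x = u from x0 = 0; report u^T x_k per iteration.
        x = np.zeros_like(u)
        r = u.copy()
        p = r.copy()
        estimates = []
        for _ in range(iters):
            Ap = A @ p
            alpha = (r @ r) / (p @ Ap)
            x = x + alpha * p
            r_new = r - alpha * Ap
            p = r_new + ((r_new @ r_new) / (r @ r)) * p
            r = r_new
            estimates.append(float(u @ x))
        return estimates

    rng = np.random.default_rng(0)
    M = rng.standard_normal((6, 6))
    A = M @ M.T + 6 * np.eye(6)              # positive definite test matrix
    u = rng.standard_normal(6)
    print(cg_bilinear_estimates(A, u, iters=4), "exact:", float(u @ np.linalg.solve(A, u)))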


Fast DPP Sampling for Nyström with Application to Kernel Methods

May 28, 2016
Chengtao Li, Stefanie Jegelka, Suvrit Sra

The Nystr\"om method has long been popular for scaling up kernel methods. Its theoretical guarantees and empirical performance rely critically on the quality of the landmarks selected. We study landmark selection for Nystr\"om using Determinantal Point Processes (DPPs), discrete probability models that allow tractable generation of diverse samples. We prove that landmarks selected via DPPs guarantee bounds on approximation errors; subsequently, we analyze implications for kernel ridge regression. Contrary to prior reservations due to cubic complexity of DPPsampling, we show that (under certain conditions) Markov chain DPP sampling requires only linear time in the size of the data. We present several empirical results that support our theoretical analysis, and demonstrate the superior performance of DPP-based landmark selection compared with existing approaches.


Efficient Sampling for k-Determinantal Point Processes

May 28, 2016
Chengtao Li, Stefanie Jegelka, Suvrit Sra

Determinantal Point Processes (DPPs) are elegant probabilistic models of repulsion and diversity over discrete sets of items. But their applicability to large sets is hindered by expensive cubic-complexity matrix operations for basic tasks such as sampling. In light of this, we propose a new method for approximate sampling from discrete $k$-DPPs. Our method takes advantage of the diversity property of subsets sampled from a DPP, and proceeds in two stages: first it constructs coresets for the ground set of items; thereafter, it efficiently samples subsets based on the constructed coresets. As opposed to previous approaches, our algorithm aims to minimize the total variation distance to the original distribution. Experiments on both synthetic and real datasets indicate that our sampling algorithm works efficiently on large data sets, and yields more accurate samples than previous approaches.


Submodular meets Structured: Finding Diverse Subsets in Exponentially-Large Structured Item Sets

Nov 06, 2014
Adarsh Prasad, Stefanie Jegelka, Dhruv Batra

To cope with the high level of ambiguity faced in domains such as Computer Vision or Natural Language Processing, robust prediction methods often search for a diverse set of high-quality candidate solutions or proposals. In structured prediction problems, this becomes a daunting task, as the solution space (image labelings, sentence parses, etc.) is exponentially large. We study greedy algorithms for finding a diverse subset of solutions in structured-output spaces by drawing new connections between submodular functions over combinatorial item sets and High-Order Potentials (HOPs) studied for graphical models. Specifically, we show via examples that when marginal gains of submodular diversity functions allow structured representations, this enables efficient (sub-linear time) approximate maximization by reducing the greedy augmentation step to inference in a factor graph with appropriately constructed HOPs. We discuss benefits, tradeoffs, and show that our constructions lead to significantly better proposals.


On the Convergence Rate of Decomposable Submodular Function Minimization

Nov 05, 2014
Robert Nishihara, Stefanie Jegelka, Michael I. Jordan

Submodular functions describe a variety of discrete problems in machine learning, signal processing, and computer vision. However, minimizing submodular functions poses a number of algorithmic challenges. Recent work introduced an easy-to-use, parallelizable algorithm for minimizing submodular functions that decompose as the sum of "simple" submodular functions. Empirically, this algorithm performs extremely well, but no theoretical analysis was given. In this paper, we show that the algorithm converges linearly, and we provide upper and lower bounds on the rate of convergence. Our proof relies on the geometry of submodular polyhedra and draws on results from spectral graph theory.

* Neural Information Processing Systems 27, 2014 
* 17 pages, 3 figures 

Reflection methods for user-friendly submodular optimization

Nov 18, 2013
Stefanie Jegelka, Francis Bach, Suvrit Sra

Recently, it has become evident that submodularity naturally captures widely occurring concepts in machine learning, signal processing and computer vision. Consequently, there is need for efficient optimization procedures for submodular functions, especially for minimization problems. While general submodular minimization is challenging, we propose a new method that exploits existing decomposability of submodular functions. In contrast to previous approaches, our method is neither approximate, nor impractical, nor does it need any cumbersome parameter tuning. Moreover, it is easy to implement and parallelize. A key component of our method is a formulation of the discrete submodular minimization problem as a continuous best approximation problem that is solved through a sequence of reflections, and its solution can be easily thresholded to obtain an optimal discrete solution. This method solves both the continuous and discrete formulations of the problem, and therefore has applications in learning, inference, and reconstruction. In our experiments, we illustrate the benefits of our method on two image segmentation tasks.

* Neural Information Processing Systems (NIPS), United States (2013) 
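
To convey the flavor of a reflection method (only the flavor: the paper's best-approximation formulation for decomposable submodular functions is more specific), here is a Douglas-Rachford-style iteration that alternates a reflection across a box with a projection onto a hyperplane to find a point in their intersection.

    import numpy as np

    def project_box(x, lo=0.0, hi=1.0):
        return np.clip(x, lo, hi)

    def project_hyperplane(x, a, b):
        # Projection onto {x : a^T x = b}.
        return x - a * ((a @ x - b) / (a @ a))

    def douglas_rachford(z, steps, a, b):
        for _ in range(steps):
            x = project_box(z)
            y = project_hyperplane(2 * x - z, a, b)   # reflect across the box, then project
            z = z + y - x
        return project_box(z)

    a, b = np.array([1.0, 1.0, 1.0]), 1.5
    print(douglas_rachford(np.array([2.0, -1.0, 0.3]), steps=200, a=a, b=b))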

Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions

Nov 08, 2013
Rishabh Iyer, Stefanie Jegelka, Jeff Bilmes

We investigate three related and important problems connected to machine learning: approximating a submodular function everywhere, learning a submodular function (in a PAC-like setting [53]), and constrained minimization of submodular functions. We show that the complexity of all three problems depends on the 'curvature' of the submodular function, and provide lower and upper bounds that refine and improve previous results [3, 16, 18, 52]. Our proof techniques are fairly generic. We either use a black-box transformation of the function (for approximation and learning), or a transformation of algorithms to use an appropriate surrogate function (for minimization). Curiously, curvature has been known to influence approximations for submodular maximization [7, 55], but its effect on minimization, approximation and learning has hitherto been open. We complete this picture, and also support our theoretical claims by empirical results.

* 21 pages. A shorter version appeared in Advances in Neural Information Processing Systems (NIPS) 2013 
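
One standard notion of the 'curvature' referred to above, for a monotone submodular $F$ with $F(\emptyset) = 0$ and positive singleton values, is $\kappa = 1 - \min_j [F(V) - F(V \setminus \{j\})] / F(\{j\})$; the helper below evaluates it on a toy example. The refined bounds and algorithms that depend on curvature are the paper's contribution and are not shown.

    import numpy as np

    def total_curvature(F, ground):
        # kappa = 1 - min_j [F(V) - F(V \ {j})] / F({j}), assuming F({j}) > 0 for all j.
        V = set(ground)
        return 1.0 - min((F(V) - F(V - {j})) / F({j}) for j in V)

    F = lambda S: np.sqrt(len(S))            # toy monotone submodular function
    print(total_curvature(F, range(5)))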
