Models, code, and papers for "Greg Ver Steeg":

Unsupervised Learning via Total Correlation Explanation

Jun 27, 2017
Greg Ver Steeg

Learning by children and animals occurs effortlessly and largely without obvious supervision. Successes in automating supervised learning have not translated to the more ambiguous realm of unsupervised learning where goals and labels are not provided. Barlow (1961) suggested that the signal that brains leverage for unsupervised learning is dependence, or redundancy, in the sensory environment. Dependence can be characterized using the information-theoretic multivariate mutual information measure called total correlation. The principle of Total Cor-relation Ex-planation (CorEx) is to learn representations of data that "explain" as much dependence in the data as possible. We review some manifestations of this principle along with successes in unsupervised learning problems across diverse domains including human behavior, biology, and language.
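
For reference, total correlation (the multivariate mutual information mentioned above) and the CorEx objective can be written as follows; the notation mirrors the companion papers listed further down this page.

$$
TC(X) \;=\; \sum_{i=1}^{n} H(X_i) \;-\; H(X_1, \ldots, X_n) \;=\; D_{KL}\!\left(p(x) \,\Big\|\, \prod_{i=1}^{n} p(x_i)\right),
\qquad
\max_{p(y \mid x)} \; TC(X) - TC(X \mid Y),
$$

where $TC(X) - TC(X \mid Y)$ is the dependence in $X$ "explained" by the learned representation $Y$.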

* Invited contribution for IJCAI 2017 Early Career Spotlight. 5 pages, 1 figure 

Low Complexity Gaussian Latent Factor Models and a Blessing of Dimensionality

Jan 19, 2018
Greg Ver Steeg, Aram Galstyan

Learning the structure of graphical models from data usually incurs a heavy curse of dimensionality that renders this problem intractable in many real-world situations. The rare cases where the curse becomes a blessing provide insight into the limits of the efficiently computable and augment the scarce options for treating very under-sampled, high-dimensional data. We study a special class of Gaussian latent factor models where each (non-iid) observed variable depends on at most one of a set of latent variables. We derive information-theoretic lower bounds on the sample complexity for structure recovery that suggest complexity actually decreases as the dimensionality increases. Contrary to this prediction, we observe that existing structure recovery methods deteriorate with increasing dimension. Therefore, we design a new approach to learning Gaussian latent factor models that benefits from dimensionality. Our approach relies on an unconstrained information-theoretic objective whose global optima correspond to structured latent factor generative models. In addition to improved structure recovery, we also show that we are able to outperform state-of-the-art approaches for covariance estimation on both synthetic and real data in the very under-sampled, high-dimensional regime.
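
To make the model class concrete, the sketch below generates synthetic data in which each observed variable loads on exactly one latent factor plus independent noise, in the under-sampled regime (many more variables than samples). Names and dimensions are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_latent, n_obs, n_samples = 3, 100, 30   # many more variables than samples

# Assign each observed variable to exactly one latent parent.
parent = rng.integers(0, n_latent, size=n_obs)
loading = rng.uniform(0.5, 1.0, size=n_obs)      # strength of dependence on the parent
noise_std = rng.uniform(0.1, 0.5, size=n_obs)    # independent observation noise

z = rng.standard_normal((n_samples, n_latent))   # latent factors
x = loading * z[:, parent] + noise_std * rng.standard_normal((n_samples, n_obs))

# Structure recovery here means identifying `parent` (and `loading`) from `x` alone.
print(x.shape)   # (30, 100)
```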

* 15 pages, 7 figures. Fixed some typos in equations, revised presentation 

The Information Sieve

Jun 09, 2016
Greg Ver Steeg, Aram Galstyan

We introduce a new framework for unsupervised learning of representations based on a novel hierarchical decomposition of information. Intuitively, data is passed through a series of progressively fine-grained sieves. Each layer of the sieve recovers a single latent factor that is maximally informative about multivariate dependence in the data. The data is transformed after each pass so that the remaining unexplained information trickles down to the next layer. Ultimately, we are left with a set of latent factors explaining all the dependence in the original data and remainder information consisting of independent noise. We present a practical implementation of this framework for discrete variables and apply it to a variety of fundamental tasks in unsupervised learning including independent component analysis, lossy and lossless compression, and predicting missing values in data.
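
Schematically, the hierarchical decomposition has an incremental form along these lines (my notation; see the paper for the precise statement): the dependence in the data splits into per-layer contributions plus whatever dependence remains after $r$ layers,

$$
TC(X) \;=\; \sum_{k=1}^{r} TC\!\left(X^{(k-1)}; Y_k\right) \;+\; TC\!\left(X^{(r)}\right),
$$

where $X^{(0)} = X$ is the original data, $Y_k$ is the latent factor extracted at layer $k$, and $X^{(k)}$ is the remainder information passed on to layer $k+1$.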

* Appearing in Proceedings of the International Conference on Machine Learning (ICML), 2016. Updated reference to continuous version: http://arxiv.org/abs/1606.02307 

Maximally Informative Hierarchical Representations of High-Dimensional Data

Jan 31, 2015
Greg Ver Steeg, Aram Galstyan

We consider a set of probabilistic functions of some input variables as a representation of the inputs. We present bounds on how informative a representation is about input data. We extend these bounds to hierarchical representations so that we can quantify the contribution of each layer towards capturing the information in the original data. The special form of these bounds leads to a simple, bottom-up optimization procedure to construct hierarchical representations that are also maximally informative about the data. This optimization has linear computational complexity and constant sample complexity in the number of variables. These results establish a new approach to unsupervised learning of deep representations that is both principled and practical. We demonstrate the usefulness of the approach on both synthetic and real-world data.

* 13 pages, 8 figures. Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015 

Discovering Structure in High-Dimensional Data Through Correlation Explanation

Oct 31, 2014
Greg Ver Steeg, Aram Galstyan

We introduce a method to learn a hierarchy of successively more abstract representations of complex data based on optimizing an information-theoretic objective. Intuitively, the optimization searches for a set of latent factors that best explain the correlations in the data as measured by multivariate mutual information. The method is unsupervised, requires no model assumptions, and scales linearly with the number of variables which makes it an attractive approach for very high dimensional systems. We demonstrate that Correlation Explanation (CorEx) automatically discovers meaningful structure for data from diverse sources including personality tests, DNA, and human language.
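
A minimal usage sketch with the open-source discrete CorEx implementation linked from the paper (github.com/gregversteeg/CorEx); the scikit-learn-style interface shown here (Corex, n_hidden, fit, clusters, tcs) is assumed from the repository README, so check it for the current API.

```python
import numpy as np
import corex as ce   # https://github.com/gregversteeg/CorEx (API assumed from its README)

# Toy binary data: variables 0-2 move together, variables 3-4 move together.
X = np.array([[0, 0, 0, 0, 0],
              [0, 0, 0, 1, 1],
              [1, 1, 1, 0, 0],
              [1, 1, 1, 1, 1]], dtype=int)

layer1 = ce.Corex(n_hidden=2)   # ask for two latent factors
layer1.fit(X)

print(layer1.clusters)  # factor assignment per variable: variables 0-2 and 3-4 should separate
print(layer1.tcs)       # total correlation explained by each factor
```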

* 15 pages, 6 figures. Includes supplementary material and link to code. Published in the proceedings of the 28th Annual Conference on Neural Information Processing Systems, NIPS 2014 

A Sequence of Relaxations Constraining Hidden Variable Models

Jul 20, 2011
Greg Ver Steeg, Aram Galstyan

Many widely studied graphical models with latent variables lead to nontrivial constraints on the distribution of the observed variables. Inspired by the Bell inequalities in quantum mechanics, we refer to any linear inequality whose violation rules out some latent variable model as a "hidden variable test" for that model. Our main contribution is to introduce a sequence of relaxations which provides progressively tighter hidden variable tests. We demonstrate applicability to mixtures of sequences of i.i.d. variables, Bell inequalities, and homophily models in social networks. For the last, we demonstrate that our method provides a test that is able to rule out latent homophily as the sole explanation for correlations on a real social network that are known to be due to influence.
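
The Bell-inequality analogy can be made concrete with the CHSH inequality, a classic "hidden variable test": any local hidden variable model for $\pm 1$-valued observables must satisfy the linear constraint below, so observing a violation rules that model out.

$$
\left| E(a, b) + E(a, b') + E(a', b) - E(a', b') \right| \;\le\; 2.
$$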

* UAI 2011 Best Paper Runner-Up; Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011) 

A Forest Mixture Bound for Block-Free Parallel Inference

May 17, 2018
Neal Lawton, Aram Galstyan, Greg Ver Steeg

Coordinate ascent variational inference is an important algorithm for inference in probabilistic models, but it is slow because it updates only a single variable at a time. Block coordinate methods perform inference faster by updating blocks of variables in parallel. However, the speed and stability of these algorithms depends on how the variables are partitioned into blocks. In this paper, we give a stable parallel algorithm for inference in deep exponential families that doesn't require the variables to be partitioned into blocks. We achieve this by lower bounding the ELBO by a new objective we call the forest mixture bound (FM bound) that separates the inference problem for variables within a hidden layer. We apply this to the simple case when all random variables are Gaussian and show empirically that the algorithm converges faster for models that are inherently more forest-like.
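
For context, the evidence lower bound (ELBO) that the forest mixture bound further relaxes is the standard variational objective; coordinate ascent maximizes it one factor $q(z_i)$ at a time, which is the serial bottleneck addressed here.

$$
\log p(x) \;\ge\; \mathrm{ELBO}(q) \;=\; \mathbb{E}_{q(z)}\!\left[\log p(x, z)\right] \;-\; \mathbb{E}_{q(z)}\!\left[\log q(z)\right].
$$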


Stochastic Learning of Nonstationary Kernels for Natural Language Modeling

Feb 01, 2018
Sahil Garg, Greg Ver Steeg, Aram Galstyan

Natural language processing often involves computations with semantic or syntactic graphs to facilitate sophisticated reasoning based on structural relationships. While convolution kernels provide a powerful tool for comparing graph structure based on node (word) level relationships, they are difficult to customize and can be computationally expensive. We propose a generalization of convolution kernels, with a nonstationary model, for better expressibility of natural languages in supervised settings. For scalable learning of the parameters introduced by our model, we propose a novel algorithm that leverages stochastic sampling on k-nearest neighbor graphs, along with approximations based on locality-sensitive hashing. We demonstrate the advantages of our approach on a challenging real-world (structured inference) problem of automatically extracting biological models from the text of scientific papers.
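
As a rough illustration of the locality-sensitive hashing ingredient (this is generic random-hyperplane LSH, not the paper's kernelized scheme): nearby vectors receive identical bit signatures with high probability, so candidate neighbors can be found by bucketing signatures instead of scanning all pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_signatures(vectors, n_bits=16, rng=rng):
    """Random-hyperplane LSH: vectors with high cosine similarity
    tend to receive the same bit signature."""
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    return vectors @ planes > 0              # boolean signature per vector

# Toy "feature vectors" standing in for graph/kernel representations of sentences.
X = rng.standard_normal((1000, 64))
sigs = lsh_signatures(X)

# Candidate neighbors of item 0: anything sharing its full signature (same bucket).
bucket_mates = np.flatnonzero((sigs == sigs[0]).all(axis=1))
print(bucket_mates[:10])
```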


Variational Information Maximization for Feature Selection

Jun 09, 2016
Shuyang Gao, Greg Ver Steeg, Aram Galstyan

Feature selection is one of the most fundamental problems in machine learning. An extensive body of work on information-theoretic feature selection exists which is based on maximizing mutual information between subsets of features and class labels. Practical methods are forced to rely on approximations due to the difficulty of estimating mutual information. We demonstrate that approximations made by existing methods are based on unrealistic assumptions. We formulate a more flexible and general class of assumptions based on variational distributions and use them to tractably generate lower bounds for mutual information. These bounds define a novel information-theoretic framework for feature selection, which we prove to be optimal under tree graphical models with proper choice of variational distributions. Our experiments demonstrate that the proposed method strongly outperforms existing information-theoretic feature selection approaches.
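
The variational lower bounds referenced here have the familiar Barber–Agakov form for any variational distribution $q$ (the paper builds its tractable objectives from assumptions of this general type):

$$
I(X_S; Y) \;\ge\; H(Y) \;+\; \mathbb{E}_{p(x_S, y)}\!\left[\log q(y \mid x_S)\right],
$$

with equality when $q(y \mid x_S) = p(y \mid x_S)$.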

* 15 pages, 9 figures 

Understanding confounding effects in linguistic coordination: an information-theoretic approach

Aug 27, 2015
Shuyang Gao, Greg Ver Steeg, Aram Galstyan

We suggest an information-theoretic approach for measuring stylistic coordination in dialogues. The proposed measure has a simple predictive interpretation and can account for various confounding factors through proper conditioning. We revisit some of the previous studies that reported strong signatures of stylistic accommodation, and find that a significant part of the observed coordination can be attributed to a simple confounding effect - length coordination. Specifically, longer utterances tend to be followed by longer responses, which gives rise to spurious correlations in the other stylistic features. We propose a test to distinguish correlations in length due to contextual factors (topic of conversation, user verbosity, etc.) and turn-by-turn coordination. We also suggest a test to identify whether stylistic coordination persists even after accounting for length coordination and contextual factors.
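
One schematic way to read the proposed conditioning (my paraphrase, not the paper's exact estimator) is as a conditional mutual information between stylistic markers with utterance length conditioned out:

$$
\mathrm{coordination} \;\approx\; I\!\left(\text{marker in reply} \,;\, \text{marker in prompt} \;\middle|\; \text{prompt length}\right).
$$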

* PLoS ONE 10(6): e0130167, 2015 

Efficient Estimation of Mutual Information for Strongly Dependent Variables

Mar 05, 2015
Shuyang Gao, Greg Ver Steeg, Aram Galstyan

We demonstrate that a popular class of nonparametric mutual information (MI) estimators based on k-nearest-neighbor graphs requires a number of samples that scales exponentially with the true MI. Consequently, accurate estimation of MI between two strongly dependent variables is possible only for prohibitively large sample sizes. This important yet overlooked shortcoming of the existing estimators is due to their implicit reliance on local uniformity of the underlying joint distribution. We introduce a new estimator that is robust to local non-uniformity, works well with limited data, and is able to capture relationship strengths over many orders of magnitude. We demonstrate the superior performance of the proposed estimator on both synthetic and real-world data.
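
A quick way to see this failure mode is to hand a nearly deterministic pair to a k-NN based estimator (scikit-learn's `mutual_info_regression` implements a Kraskov-style estimator) and compare with the analytic value. A sketch, with numbers that will vary by seed and sample size:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n, sigma = 1000, 1e-4

x = rng.standard_normal(n)
y = x + sigma * rng.standard_normal(n)          # strongly dependent pair

true_mi = 0.5 * np.log(1.0 + 1.0 / sigma**2)    # analytic MI for this Gaussian pair (nats)
est_mi = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=3)[0]

print(f"true MI ~ {true_mi:.2f} nats, kNN estimate ~ {est_mi:.2f} nats")
# The kNN estimate typically falls well short of the true value at this sample size.
```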

* 13 pages, to appear in International Conference on Artificial Intelligence and Statistics (AISTATS) 2015 

Measures of Tractography Convergence

Jun 12, 2018
Daniel Moyer, Paul M. Thompson, Greg Ver Steeg

In the present work, we use information theory to understand the empirical convergence rate of tractography, a widely-used approach to reconstruct anatomical fiber pathways in the living brain. Based on diffusion MRI data, tractography is the starting point for many methods to study brain connectivity. Of the available methods to perform tractography, most reconstruct a finite set of streamlines, or 3D curves, representing probable connections between anatomical regions, yet relatively little is known about how the sampling of this set of streamlines affects downstream results, and how exhaustive the sampling should be. Here we provide a method to measure the information theoretic surprise (self-cross entropy) for tract sampling schema. We then empirically assess four streamline methods. We demonstrate that the relative information gain is very low after a moderate number of streamlines have been generated for each tested method. The results give rise to several guidelines for optimal sampling in brain connectivity analyses.
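
For reference, the cross entropy behind the "surprise" measure is, schematically, the expected negative log-probability of newly sampled streamlines under a density fit to those already generated; convergence corresponds to this quantity changing little as sampling continues (my paraphrase of the abstract, not the paper's exact estimator):

$$
H(p, q) \;=\; -\,\mathbb{E}_{p}\!\left[\log q\right].
$$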

* 11 pages 

Statistical Mechanics of Semi-Supervised Clustering in Sparse Graphs

Oct 31, 2011
Greg Ver Steeg, Aram Galstyan, Armen E. Allahverdyan

We theoretically study semi-supervised clustering in sparse graphs in the presence of pairwise constraints on the cluster assignments of nodes. We focus on bi-cluster graphs, and study the impact of semi-supervision for varying constraint density and overlap between the clusters. Recent results for unsupervised clustering in sparse graphs indicate that there is a critical ratio of within-cluster and between-cluster connectivities below which clusters cannot be recovered with better than random accuracy. The goal of this paper is to examine the impact of pairwise constraints on the clustering accuracy. Our results suggest that the addition of constraints does not provide automatic improvement over the unsupervised case. When the density of the constraints is sufficiently small, their only impact is to shift the detection threshold while preserving the criticality. Conversely, if the density of (hard) constraints is above the percolation threshold, the criticality is suppressed and the detection threshold disappears.
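
For orientation, the unsupervised detectability threshold referenced in the abstract is usually stated, for two balanced clusters with mean within- and between-cluster degrees $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$, as (a standard result from the sparse blockmodel literature, quoted here for context rather than from this paper):

$$
\left(c_{\mathrm{in}} - c_{\mathrm{out}}\right)^2 \;>\; 2\left(c_{\mathrm{in}} + c_{\mathrm{out}}\right).
$$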

* J. Stat. Mech. (2011) P08009 
* 8 pages, 4 figures 

Co-evolution of Selection and Influence in Social Networks

Jun 14, 2011
Yoon-Sik Cho, Greg Ver Steeg, Aram Galstyan

Many networks are complex dynamical systems, where both attributes of nodes and topology of the network (link structure) can change with time. We propose a model of co-evolving networks where both node attributes and network structure evolve under mutual influence. Specifically, we consider a mixed membership stochastic blockmodel, where the probability of observing a link between two nodes depends on their current membership vectors, while those membership vectors themselves evolve in the presence of a link between the nodes. Thus, the network is shaped by the interaction of stochastic processes describing the nodes, while the processes themselves are influenced by the changing network structure. We derive an efficient variational inference procedure for our model, and validate the model on both synthetic and real-world data.
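
Schematically, the mixed membership blockmodel ingredient gives each pair a link probability that is bilinear in the current membership vectors (with $B$ the block affinity matrix); in the co-evolving model these memberships are in turn updated when links are observed:

$$
P\!\left(\text{link } i \leftrightarrow j \text{ at time } t\right) \;=\; \pi_i(t)^{\top} B \,\pi_j(t).
$$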

* In Proc. of the Twenty-Fifth Conference on Artificial Intelligence (AAAI-11) 

Improving Generalization by Controlling Label-Noise Information in Neural Network Weights

Feb 19, 2020
Hrayr Harutyunyan, Kyle Reing, Greg Ver Steeg, Aram Galstyan

In the presence of noisy or incorrect labels, neural networks have the undesirable tendency to memorize information about the noise. Standard regularization techniques such as dropout, weight decay or data augmentation sometimes help, but do not prevent this behavior. If one considers neural network weights as random variables that depend on the data and stochasticity of training, the amount of memorized information can be quantified with the Shannon mutual information between weights and the vector of all training labels given inputs, $I(w : \mathbf{y} \mid \mathbf{x})$. We show that for any training algorithm, low values of this term correspond to reduction in memorization of label-noise and better generalization bounds. To obtain these low values, we propose training algorithms that employ an auxiliary network that predicts gradients in the final layers of a classifier without accessing labels. We illustrate the effectiveness of our approach on versions of MNIST, CIFAR-10, and CIFAR-100 corrupted with various noise models, and on a large-scale dataset Clothing1M that has noisy labels.
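
A minimal sketch of the kind of uniform label-noise corruption used in such experiments (a generic corruption model, not necessarily the paper's exact noise models):

```python
import numpy as np

def corrupt_labels_uniform(labels, noise_rate, n_classes, rng=None):
    """Replace a fraction `noise_rate` of labels with labels drawn
    uniformly at random from the other classes."""
    rng = rng or np.random.default_rng(0)
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    # Draw a random offset 1..n_classes-1 so the new label always differs.
    offset = rng.integers(1, n_classes, size=flip.sum())
    labels[flip] = (labels[flip] + offset) % n_classes
    return labels

y = np.arange(10) % 3                      # toy labels with 3 classes
print(corrupt_labels_uniform(y, 0.4, 3))   # roughly 40% of labels flipped
```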


Nearly-Unsupervised Hashcode Representations for Relation Extraction

Sep 09, 2019
Sahil Garg, Aram Galstyan, Greg Ver Steeg, Guillermo Cecchi

Recently, kernelized locality sensitive hashcodes have been successfully employed as representations of natural language text, especially showing high relevance to biomedical relation extraction tasks. In this paper, we propose to optimize the hashcode representations in a nearly unsupervised manner, in which we only use data points, but not their class labels, for learning. The optimized hashcode representations are then fed to a supervised classifier following the prior work. This nearly unsupervised approach allows fine-grained optimization of each hash function, which is particularly suitable for building hashcode representations generalizing from a training set to a test set. We empirically evaluate the proposed approach for biomedical relation extraction tasks, obtaining significant accuracy improvements w.r.t. state-of-the-art supervised and semi-supervised approaches.

* Proceedings of EMNLP-19 

Exact Rate-Distortion in Autoencoders via Echo Noise

Apr 15, 2019
Rob Brekelmans, Daniel Moyer, Aram Galstyan, Greg Ver Steeg

Compression is at the heart of effective representation learning. However, lossy compression is typically achieved through simple parametric models like Gaussian noise to preserve analytic tractability, and the limitations this imposes on learning are largely unexplored. Further, the Gaussian prior assumptions in models such as variational autoencoders (VAEs) provide only an upper bound on the compression rate in general. We introduce a new noise channel, Echo noise, that admits a simple, exact expression for mutual information for arbitrary input distributions. The noise is constructed in a data-driven fashion that does not require restrictive distributional assumptions. With its complex encoding mechanism and exact rate regularization, Echo leads to improved bounds on log-likelihood and dominates $\beta$-VAEs across the achievable range of rate-distortion trade-offs. Further, we show that Echo noise can outperform state-of-the-art flow methods without the need to train complex distributional transformations.
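
Schematically, the Echo channel adds data-derived rather than Gaussian noise, and its appeal is an exact rate term; the form below is a paraphrase of the construction, so consult the paper for the precise statement:

$$
z \;=\; f(x) \;+\; S(x)\,\epsilon, \qquad I(x; z) \;=\; -\,\mathbb{E}_{x}\!\left[\log\left|\det S(x)\right|\right],
$$

where the noise $\epsilon$ is built from the encodings of other data points (an "echo" of the encoding distribution) instead of being drawn from a fixed parametric family.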


Auto-Encoding Total Correlation Explanation

Feb 16, 2018
Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, Aram Galstyan

Advances in unsupervised learning enable reconstruction and generation of samples from complex distributions, but this success is marred by the inscrutability of the representations learned. We propose an information-theoretic approach to characterizing disentanglement and dependence in representation learning using multivariate mutual information, also called total correlation. The principle of total Cor-relation Ex-planation (CorEx) has motivated successful unsupervised learning applications across a variety of domains, but under some restrictive assumptions. Here we relax those restrictions by introducing a flexible variational lower bound to CorEx. Surprisingly, we find that this lower bound is equivalent to the one in variational autoencoders (VAE) under certain conditions. This information-theoretic view of VAE deepens our understanding of hierarchical VAE and motivates a new algorithm, AnchorVAE, that makes latent codes more interpretable through information maximization and enables generation of richer and more realistic samples.


Disentangled Representations via Synergy Minimization

Oct 10, 2017
Greg Ver Steeg, Rob Brekelmans, Hrayr Harutyunyan, Aram Galstyan

Scientists often seek simplified representations of complex systems to facilitate prediction and understanding. If the factors comprising a representation allow us to make accurate predictions about our system, but obscuring any subset of the factors destroys our ability to make predictions, we say that the representation exhibits informational synergy. We argue that synergy is an undesirable feature in learned representations and that explicitly minimizing synergy can help disentangle the true factors of variation underlying data. We explore different ways of quantifying synergy, deriving new closed-form expressions in some cases, and then show how to modify learning to produce representations that are minimally synergistic. We introduce a benchmark task to disentangle separate characters from images of words. We demonstrate that Minimally Synergistic (MinSyn) representations correctly disentangle characters while methods relying on statistical independence fail.
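
The abstract notes that several quantifications of synergy are explored; a common starting point in the literature is the whole-minus-sum form (quoted for context, not as the paper's final measure), which is large when the factors jointly predict the target far better than they do individually:

$$
\mathrm{Syn}\!\left(T; Z_1,\dots,Z_m\right) \;=\; I\!\left(T; Z_1,\dots,Z_m\right) \;-\; \sum_{j=1}^{m} I\!\left(T; Z_j\right).
$$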

* 8 pages, 4 figures, 55th Annual Allerton Conference on Communication, Control, and Computing, 2017 

Unifying Local and Global Change Detection in Dynamic Networks

Oct 09, 2017
Wenzhe Li, Dong Guo, Greg Ver Steeg, Aram Galstyan

Many real-world networks are complex dynamical systems, where both local (e.g., changing node attributes) and global (e.g., changing network topology) processes unfold over time. Local dynamics may provoke global changes in the network, and the ability to detect such effects could have profound implications for a number of real-world problems. Most existing techniques focus individually on either local or global aspects of the problem or treat the two in isolation from each other. In this paper we propose a novel network model that simultaneously accounts for both local and global dynamics. To the best of our knowledge, this is the first attempt at modeling and detecting local and global change points on dynamic networks via a unified generative framework. Our model is built upon the popular mixed membership stochastic blockmodels (MMSB) with sparse co-evolving patterns. We derive an efficient stochastic gradient Langevin dynamics (SGLD) sampler for our proposed model, which allows it to scale to potentially very large networks. Finally, we validate our model on both synthetic and real-world data and demonstrate its superiority over several baselines.
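
For readers unfamiliar with the sampler named here, a generic stochastic gradient Langevin dynamics update looks like the sketch below (a toy illustration of SGLD itself, not the paper's model-specific derivation):

```python
import numpy as np

def sgld_step(theta, grad_log_post, step_size, rng):
    """One SGLD update: a half-step of gradient ascent on the log posterior
    plus Gaussian noise scaled so iterates approximately sample from it."""
    noise = rng.standard_normal(theta.shape)
    return theta + 0.5 * step_size * grad_log_post(theta) + np.sqrt(step_size) * noise

# Toy target: standard normal posterior, so grad log p(theta) = -theta.
rng = np.random.default_rng(0)
theta = np.zeros(2)
samples = []
for _ in range(5000):
    theta = sgld_step(theta, lambda t: -t, step_size=0.01, rng=rng)
    samples.append(theta.copy())
print(np.std(samples, axis=0))   # should be close to 1 in each coordinate
```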

