Models, code, and papers for "Robert C":

Recent neural network architectures such as the basic recurrent neural network (RNN) and Gated Recurrent Unit (GRU) have gained prominence as end-to-end learning architectures for natural language processing tasks. But what is the computational power of such systems? We prove that finite precision RNNs with one hidden layer and ReLU activation and finite precision GRUs are exactly as computationally powerful as deterministic finite automata. Allowing arbitrary precision, we prove that RNNs with one hidden layer and ReLU activation are at least as computationally powerful as pushdown automata. If we also allow infinite precision, infinite edge weights, and nonlinear output activation functions, we prove that GRUs are at least as computationally powerful as pushdown automata. All results are shown constructively.

While long short-term memory (LSTM) neural net architectures are designed to capture sequence information, human language is generally composed of hierarchical structures. This raises the question as to whether LSTMs can learn hierarchical structures. We explore this question with a well-formed bracket prediction task using two types of brackets modeled by an LSTM. Demonstrating that such a system is learnable by an LSTM is the first step in demonstrating that the entire class of CFLs is also learnable. We observe that the model requires exponential memory in terms of the number of characters and embedded depth, where a sub-linear memory should suffice. Still, the model does more than memorize the training input. It learns how to distinguish between relevant and irrelevant information. On the other hand, we also observe that the model does not generalize well. We conclude that LSTMs do not learn the relevant underlying context-free rules, suggesting the good overall performance is attained rather by an efficient way of evaluating nuisance variables. LSTMs are a way to quickly reach good results for many natural language tasks, but to understand and generate natural language one has to investigate other concepts that can make more direct use of natural language's structural nature.

We consider the setting of prediction with expert advice; a learner makes predictions by aggregating those of a group of experts. Under this setting, and for the right choice of loss function and "mixing" algorithm, it is possible for the learner to achieve a constant regret regardless of the number of prediction rounds. For example, a constant regret can be achieved for \emph{mixable} losses using the \emph{aggregating algorithm}. The \emph{Generalized Aggregating Algorithm} (GAA) is a name for a family of algorithms parameterized by convex functions on simplices (entropies), which reduce to the aggregating algorithm when using the \emph{Shannon entropy} $\operatorname{S}$. For a given entropy $\Phi$, losses for which a constant regret is possible using the \textsc{GAA} are called $\Phi$-mixable. Which losses are $\Phi$-mixable was previously left as an open question. We fully characterize $\Phi$-mixability and answer other open questions posed by \cite{Reid2015}. We show that the Shannon entropy $\operatorname{S}$ is fundamental in nature when it comes to mixability; any $\Phi$-mixable loss is necessarily $\operatorname{S}$-mixable, and the lowest worst-case regret of the \textsc{GAA} is achieved using the Shannon entropy. Finally, by leveraging the connection between the \emph{mirror descent algorithm} and the update step of the GAA, we suggest a new \emph{adaptive} generalized aggregating algorithm and analyze its performance in terms of the regret bound.

We revisit the initialization of deep residual networks (ResNets) by introducing a novel analytical tool in free probability to the community of deep learning. This tool deals with non-Hermitian random matrices, rather than their conventional Hermitian counterparts in the literature. As a consequence, this new tool enables us to evaluate the singular value spectrum of the input-output Jacobian of a fully- connected deep ResNet for both linear and nonlinear cases. With the powerful tool of free probability, we conduct an asymptotic analysis of the spectrum on the single-layer case, and then extend this analysis to the multi-layer case of an arbitrary number of layers. In particular, we propose to rescale the classical random initialization by the number of residual units, so that the spectrum has the order of $O(1)$, when compared with the large width and depth of the network. We empirically demonstrate that the proposed initialization scheme learns at a speed of orders of magnitudes faster than the classical ones, and thus attests a strong practical relevance of this investigation.

The cost-sensitive classification problem plays a crucial role in mission-critical machine learning applications, and differs with traditional classification by taking the misclassification costs into consideration. Although being studied extensively in the literature, the fundamental limits of this problem are still not well understood. We investigate the hardness of this problem by extending the standard minimax lower bound of balanced binary classification problem (due to \cite{massart2006risk}), and emphasize the impact of cost terms on the hardness.

Empirical risk minimization (ERM) is a fundamental learning rule for statistical learning problems where the data is generated according to some unknown distribution $\mathsf{P}$ and returns a hypothesis $f$ chosen from a fixed class $\mathcal{F}$ with small loss $\ell$. In the parametric setting, depending upon $(\ell, \mathcal{F},\mathsf{P})$ ERM can have slow $(1/\sqrt{n})$ or fast $(1/n)$ rates of convergence of the excess risk as a function of the sample size $n$. There exist several results that give sufficient conditions for fast rates in terms of joint properties of $\ell$, $\mathcal{F}$, and $\mathsf{P}$, such as the margin condition and the Bernstein condition. In the non-statistical prediction with expert advice setting, there is an analogous slow and fast rate phenomenon, and it is entirely characterized in terms of the mixability of the loss $\ell$ (there being no role there for $\mathcal{F}$ or $\mathsf{P}$). The notion of stochastic mixability builds a bridge between these two models of learning, reducing to classical mixability in a special case. The present paper presents a direct proof of fast rates for ERM in terms of stochastic mixability of $(\ell,\mathcal{F}, \mathsf{P})$, and in so doing provides new insight into the fast-rates phenomenon. The proof exploits an old result of Kemperman on the solution to the general moment problem. We also show a partial converse that suggests a characterization of fast rates for ERM in terms of stochastic mixability is possible.

Kernel method is a very powerful tool in machine learning. The trick of kernel has been effectively and extensively applied in many areas of machine learning, such as support vector machine (SVM) and kernel principal component analysis (kernel PCA). Kernel trick is to define a kernel function which relies on the inner-product of data in the feature space without knowing these feature space data. In this paper, the kernel trick will be employed to extend the algorithm of spectrum sensing with leading eigenvector under the framework of PCA to a higher dimensional feature space. Namely, the leading eigenvector of the sample covariance matrix in the feature space is used for spectrum sensing without knowing the leading eigenvector explicitly. Spectrum sensing with leading eigenvector under the framework of kernel PCA is proposed with the inner-product as a measure of similarity. A modified kernel GLRT algorithm based on matched subspace model will be the first time applied to spectrum sensing. The experimental results on simulated sinusoidal signal show that spectrum sensing with kernel PCA is about 4 dB better than PCA, besides, kernel GLRT is also better than GLRT. The proposed algorithms are also tested on the measured DTV signal. The simulation results show that kernel methods are 4 dB better than the corresponding linear methods. The leading eigenvector of the sample covariance matrix learned by kernel PCA is more stable than that learned by PCA for different segments of DTV signal.

In Phys. Rev. Letters (73:2, 5 Dec. 94), Mantegna et al. conclude on the basis of Zipf rank frequency data that noncoding DNA sequence regions are more like natural languages than coding regions. We argue on the contrary that an empirical fit to Zipf's ``law'' cannot be used as a criterion for similarity to natural languages. Although DNA is a presumably an ``organized system of signs'' in Mandelbrot's (1961) sense, an observation of statistical features of the sort presented in the Mantegna et al. paper does not shed light on the similarity between DNA's ``grammar'' and natural language grammars, just as the observation of exact Zipf-like behavior cannot distinguish between the underlying processes of tossing an $M$ sided die or a finite-state branching process.

Ensuring that classifiers are non-discriminatory or fair with respect to a sensitive feature (e.g., race or gender) is a topical problem. Progress in this task requires fixing a definition of fairness, and there have been several proposals in this regard over the past few years. Several of these, however, assume either binary sensitive features (thus precluding categorical or real-valued sensitive groups), or result in non-convex objectives (thus adversely affecting the optimisation landscape). In this paper, we propose a new definition of fairness that generalises some existing proposals, while allowing for generic sensitive features and resulting in a convex objective. The key idea is to enforce that the expected losses (or risks) across each subgroup induced by the sensitive feature are commensurate. We show how this relates to the rich literature on risk measures from mathematical finance. As a special case, this leads to a new convex fairness-aware objective based on minimising the conditional value at risk (CVaR).

We study the problem of learning classifiers with a fairness constraint, with three main contributions towards the goal of quantifying the problem's inherent tradeoffs. First, we relate two existing fairness measures to cost-sensitive risks. Second, we show that for cost-sensitive classification and fairness measures, the optimal classifier is an instance-dependent thresholding of the class-probability function. Third, we show how the tradeoff between accuracy and fairness is determined by the alignment between the class-probabilities for the target and sensitive features. Underpinning our analysis is a general framework that casts the problem of learning with a fairness requirement as one of minimising the difference of two statistical risks.

In supervised learning one wishes to identify a pattern present in a joint distribution $P$, of instances, label pairs, by providing a function $f$ from instances to labels that has low risk $\mathbb{E}_{P}\ell(y,f(x))$. To do so, the learner is given access to $n$ iid samples drawn from $P$. In many real world problems clean samples are not available. Rather, the learner is given access to samples from a corrupted distribution $\tilde{P}$ from which to learn, while the goal of predicting the clean pattern remains. There are many different types of corruption one can consider, and as of yet there is no general means to compare the relative ease of learning under these different corruption processes. In this paper we develop a general framework for tackling such problems as well as introducing upper and lower bounds on the risk for learning in the presence of corruption. Our ultimate goal is to be able to make informed economic decisions in regards to the acquisition of data sets. For a certain subclass of corruption processes (those that are \emph{reconstructible}) we achieve this goal in a particular sense. Our lower bounds are in terms of the coefficient of ergodicity, a simple to calculate property of stochastic matrices. Our upper bounds proceed via a generalization of the method of unbiased estimators appearing in recent work of Natarajan et al and implicit in the earlier work of Kearns.

Feature Learning aims to extract relevant information contained in data sets in an automated fashion. It is driving force behind the current deep learning trend, a set of methods that have had widespread empirical success. What is lacking is a theoretical understanding of different feature learning schemes. This work provides a theoretical framework for feature learning and then characterizes when features can be learnt in an unsupervised fashion. We also provide means to judge the quality of features via rate-distortion theory and its generalizations.

"Deep Learning" methods attempt to learn generic features in an unsupervised fashion from a large unlabelled data set. These generic features should perform as well as the best hand crafted features for any learning problem that makes use of this data. We provide a definition of generic features, characterize when it is possible to learn them and provide methods closely related to the autoencoder and deep belief network of deep learning. In order to do so we use the notion of deficiency and illustrate its value in studying certain general learning problems.

Concept Analysis provides a principled approach to effective management of wide area information systems, such as the Nebula File System and Interface. This not only offers evidence to support the assertion that a digital library is a bounded collection of incommensurate information sources in a logical space, but also sheds light on techniques for collaboration through coordinated access to the shared organization of knowledge.

We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and ``classification calibrated'' losses. We also consider the question of the ``best'' surrogate binary loss. We introduce a precise notion of ``best'' and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of ``surrogate tuning''. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise.

We unify f-divergences, Bregman divergences, surrogate loss bounds (regret bounds), proper scoring rules, matching losses, cost curves, ROC-curves and information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their primitives which all are related to cost-sensitive binary classification. As well as clarifying relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate loss bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates Maximum Mean Discrepancy to Fisher Linear Discriminants. It also suggests new techniques for estimating f-divergences.

Machine learning methods may have the potential to significantly accelerate drug discovery. However, the increasing rate of new methodological approaches being published in the literature raises the fundamental question of how models should be benchmarked and validated. We reanalyze the data generated by a recently published large-scale comparison of machine learning models for bioactivity prediction and arrive at a somewhat different conclusion. We show that the performance of support vector machines is competitive with that of deep learning methods. Additionally, using a series of numerical experiments, we question the relevance of area under the receiver operating characteristic curve as a metric in virtual screening, and instead suggest that area under the precision-recall curve should be used in conjunction with the receiver operating characteristic. Our numerical experiments also highlight challenges in estimating the uncertainty in model performance via scaffold-split nested cross validation.

The study of a machine learning problem is in many ways is difficult to separate from the study of the loss function being used. One avenue of inquiry has been to look at these loss functions in terms of their properties as scoring rules via the proper-composite representation, in which predictions are mapped to probability distributions which are then scored via a scoring rule. However, recent research so far has primarily been concerned with analysing the (typically) finite-dimensional conditional risk problem on the output space, leaving aside the larger total risk minimisation. We generalise a number of these results to an infinite dimensional setting and in doing so we are able to exploit the familial resemblance of density and conditional density estimation to provide a simple characterisation of the canonical link.

Since the introduction of Generative Adversarial Networks (GANs) and Variational Autoencoders (VAE), the literature on generative modelling has witnessed an overwhelming resurgence. The impressive, yet elusive empirical performance of GANs has lead to the rise of many GAN-VAE hybrids, with the hopes of GAN level performance and additional benefits of VAE, such as an encoder for feature reduction, which is not offered by GANs. Recently, the Wasserstein Autoencoder (WAE) was proposed, achieving performance similar to that of GANs, yet it is still unclear whether the two are fundamentally different or can be further improved into a unified model. In this work, we study the $f$-GAN and WAE models and make two main discoveries. First, we find that the $f$-GAN objective is equivalent to an autoencoder-like objective, which has close links, and is in some cases equivalent to the WAE objective - we refer to this as the $f$-WAE. This equivalence allows us to explicate the success of WAE. Second, the equivalence result allows us to, for the first time, prove generalization bounds for Autoencoder models (WAE and $f$-WAE), which is a pertinent problem when it comes to theoretical analyses of generative models. Furthermore, we show that the $f$-WAE objective is related to other statistical quantities such as the $f$-divergence and in particular, upper bounded by the Wasserstein distance, which then allows us to tap into existing efficient (regularized) OT solvers to minimize $f$-WAE. Our findings thus recommend the $f$-WAE as a tighter alternative to WAE, comment on generalization abilities and make a step towards unifying these models.