Models, code, and papers for "Robert West":

Privacy-Preserving Distributed Learning with Secret Gradient Descent

Jun 27, 2019
Valentin Hartmann, Robert West

In many important application domains of machine learning, data is a privacy-sensitive resource. In addition, due to the growing complexity of the models, single actors typically do not have sufficient data to train a model on their own. Motivated by these challenges, we propose Secret Gradient Descent (SecGD), a method for training machine learning models on data that is spread over different clients while preserving the privacy of the training data. We achieve this by letting each client add temporary noise to the information they send to the server during the training process. They also share this noise in separate messages with the server, which can then subtract it from the previously received values. By routing all data through an anonymization network such as Tor, we prevent the server from knowing which messages originate from the same client, which in turn allows us to show that breaking a client's privacy is computationally intractable as it would require solving a hard instance of the subset sum problem. This setup allows SecGD to work in the presence of only two honest clients and a malicious server, and without the need for peer-to-peer connections.
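
The masking idea described above can be illustrated with a minimal sketch. It is not the authors' protocol: the message tags, the number of noise shares, the integer (fixed-point) encoding of gradients, and the function names `client_messages` and `server_aggregate` are all assumptions made for illustration, and a shuffle stands in for the anonymization network.

```python
import random

def client_messages(gradient, num_noise_shares=3, noise_range=10**6, rng=random):
    """Build one client's messages: a masked value plus the noise, sent separately."""
    # Assumption: gradients are integers (e.g. fixed-point encoded).
    shares = [rng.randrange(-noise_range, noise_range) for _ in range(num_noise_shares)]
    masked = gradient + sum(shares)
    return [("masked", masked)] + [("noise", s) for s in shares]

def server_aggregate(messages):
    """Add masked values and subtract noise; only the sum of gradients remains."""
    total = 0
    for kind, value in messages:
        total += value if kind == "masked" else -value
    return total

rng = random.Random(0)
gradients = [4, -7, 12]           # hypothetical per-client values
pool = []
for g in gradients:
    pool.extend(client_messages(g, rng=rng))
rng.shuffle(pool)                 # stand-in for unlinkable delivery (e.g. via Tor)
aggregate = server_aggregate(pool)
```

In this sketch the server recovers only the aggregate; linking a masked value to its own noise shares would require picking the right subset of messages from the pool.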

* 13 pages, 1 figure
Reverse-Engineering Satire, or "Paper on Computational Humor Accepted Despite Making Serious Advances"

Jan 10, 2019
Robert West, Eric Horvitz

* Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019
Learning High Order Feature Interactions with Fine Control Kernels

Feb 09, 2020

We provide a methodology for learning sparse statistical models that use as features all possible multiplicative interactions among an underlying atomic set of features. While the resulting optimization problems are exponentially sized, our methodology leads to algorithms that can often solve these problems exactly or provide approximate solutions based on combining highly correlated features. We also introduce an algorithmic paradigm, the Fine Control Kernel framework, so named because it is based on Fenchel Duality and is reminiscent of kernel methods. Its theory is tailored to large sparse learning problems, and it leads to efficient feature screening rules for interactions. These rules are inspired by the Apriori algorithm for market basket analysis (which also falls under the purview of Fine Control Kernels) and can be applied to a plurality of learning problems including the Lasso and sparse matrix estimation. Experiments on biomedical datasets demonstrate the efficacy of our methodology in deriving algorithms that efficiently produce interaction models which achieve state-of-the-art accuracy and are interpretable.

Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping

Apr 07, 2018
Dario Pavllo, Tiziano Piccardi, Robert West

We propose Quootstrap, a method for extracting quotations, as well as the names of the speakers who uttered them, from large news corpora. Whereas prior work has addressed this problem primarily with supervised machine learning, our approach follows a fully unsupervised bootstrapping paradigm. It leverages the redundancy present in large news corpora, more precisely, the fact that the same quotation often appears across multiple news articles in slightly different contexts. Starting from a few seed patterns, such as ["Q", said S.], our method extracts a set of quotation-speaker pairs (Q, S), which are in turn used for discovering new patterns expressing the same quotations; the process is then repeated with the larger pattern set. Our algorithm is highly scalable, which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus. Validating our results against a crowdsourced ground truth, we obtain 90% precision at 40% recall using a single seed pattern, with significantly higher recall values for more frequently reported (and thus likely more interesting) quotations. Finally, we showcase the usefulness of our algorithm's output for computational social science by analyzing the sentiment expressed in our extracted quotations.
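
The bootstrapping loop described above can be sketched as follows. This is a toy illustration, not the Quootstrap implementation: the `{Q}`/`{S}` template syntax, the speaker regex, the whole-sentence pattern generalization, and the three-sentence corpus are assumptions.

```python
import re

QUOTE = r'"(?P<q>[^"]+)"'                            # a quoted span
SPEAKER = r'(?P<s>[A-Z][a-z]+(?: [A-Z][a-z]+)+)'     # assumed: capitalized full name

def compile_pattern(template):
    """Turn a template containing {Q} and {S} placeholders into a regex."""
    regex = re.escape(template)
    regex = regex.replace(re.escape("{Q}"), QUOTE)
    regex = regex.replace(re.escape("{S}"), SPEAKER)
    return re.compile(regex)

def bootstrap(corpus, seed_templates, rounds=2):
    templates = set(seed_templates)
    pairs = set()
    for _ in range(rounds):
        # Extraction: apply every known pattern to collect (quotation, speaker) pairs.
        for template in templates:
            pattern = compile_pattern(template)
            for sentence in corpus:
                match = pattern.search(sentence)
                if match:
                    pairs.add((match.group("q"), match.group("s")))
        # Discovery: a sentence containing a known pair yields a new pattern.
        for quote, speaker in list(pairs):
            quoted = f'"{quote}"'
            for sentence in corpus:
                if quoted in sentence and speaker in sentence:
                    templates.add(sentence.replace(quoted, "{Q}").replace(speaker, "{S}"))
    return pairs, templates

corpus = [
    '"We will win", said Jane Doe.',
    'Jane Doe declared: "We will win"',
    'John Roe declared: "Taxes are too high"',
]
pairs, templates = bootstrap(corpus, ["{Q}, said {S}."])
```

Here the seed pattern finds the first pair, the repeated quotation exposes the "declared:" pattern, and the second round uses that pattern to extract a pair the seed could not reach.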

* Accepted at the 12th International Conference on Web and Social Media (ICWSM), 2018
Behavior Cloning in OpenAI using Case Based Reasoning

Feb 23, 2020

Learning from Observation (LfO), also known as Behavioral Cloning, is an approach for building software agents by recording the behavior of an expert (human or artificial) and using the recorded data to generate the required behavior. jLOAF is a platform that uses Case-Based Reasoning to achieve LfO. In this paper we interface jLOAF with the popular OpenAI Gym environment. Our experimental results show how our approach can be used to provide a baseline for comparison in this domain, as well as identify the strengths and weaknesses when dealing with environmental complexity.

Privacy-Preserving Classification with Secret Vector Machines

Jul 08, 2019
Valentin Hartmann, Konark Modi, Josep M. Pujol, Robert West

Today, large amounts of valuable data are distributed among millions of user-held devices, such as personal computers, phones, or Internet-of-things devices. Many companies collect such data with the goal of using it for training machine learning models allowing them to improve their services. However, user-held data is often sensitive, and collecting it is problematic in terms of privacy. We address this issue by proposing a novel way of training a supervised classifier in a distributed setting akin to the recently proposed federated learning paradigm (McMahan et al. 2017), but under the stricter privacy requirement that the server that trains the model is assumed to be untrusted and potentially malicious; we thus preserve user privacy by design, rather than by trust. In particular, our framework, called secret vector machine (SecVM), provides an algorithm for training linear support vector machines (SVM) in a setting in which data-holding clients communicate with an untrusted server by exchanging messages designed to not reveal any personally identifiable information. We evaluate our model in two ways. First, in an offline evaluation, we train SecVM to predict user gender from tweets, showing that we can preserve user privacy without sacrificing classification performance. Second, we implement SecVM's distributed framework for the Cliqz web browser and deploy it for predicting user gender in a large-scale online evaluation with thousands of clients, outperforming baselines by a large margin and thus showcasing that SecVM is practicable in production environments. Overall, this work demonstrates the feasibility of machine learning on data from thousands of users without collecting any personal data. We believe this is an innovative approach that will help reconcile machine learning with data privacy.

* 10 pages, 7 figures
Crosslingual Document Embedding as Reduced-Rank Ridge Regression

There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding documents written in any language into a single, language-independent vector space. For training, our approach leverages a multilingual corpus where the same concept is covered in multiple languages (but not necessarily via exact translations), such as Wikipedia. Our method, Cr5 (Crosslingual reduced-rank ridge regression), starts by training a ridge-regression-based classifier that uses language-specific bag-of-word features in order to predict the concept that a given document is about. We show that, when constraining the learned weight matrix to be of low rank, it can be factored to obtain the desired mappings from language-specific bags-of-words to language-independent embeddings. As opposed to most prior methods, which use pretrained monolingual word vectors, postprocess them to make them crosslingual, and finally average word vectors to obtain document vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as document-level. Moreover, since our algorithm uses the singular value decomposition as its core operation, it is highly scalable. Experiments show that our method achieves state-of-the-art performance on a crosslingual document retrieval task. Finally, although not trained for embedding sentences and words, it also achieves competitive performance on crosslingual sentence and word retrieval tasks.

* In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19)
Exploiting Social Network Structure for Person-to-Person Sentiment Analysis

Sep 08, 2014
Robert West, Hristo S. Paskov, Jure Leskovec, Christopher Potts

Person-to-person evaluations are prevalent in all kinds of discourse and important for establishing reputations, building social bonds, and shaping public opinion. Such evaluations can be analyzed separately using signed social networks and textual sentiment analysis, but this misses the rich interactions between language and social context. To capture such interactions, we develop a model that predicts individual A's opinion of individual B by synthesizing information from the signed social network in which A and B are embedded with sentiment analysis of the evaluative texts relating A to B. We prove that this problem is NP-hard but can be relaxed to an efficiently solvable hinge-loss Markov random field, and we show that this implementation outperforms text-only and network-only versions in two very different datasets involving community-level decision-making: the Wikipedia Requests for Adminship corpus and the Convote U.S. Congressional speech corpus.

Expanding the Text Classification Toolbox with Cross-Lingual Embeddings

Most work in text classification and Natural Language Processing (NLP) focuses on English or a handful of other languages that have text corpora of hundreds of millions of words. This is creating a new version of the digital divide: the artificial intelligence (AI) divide. Transfer-based approaches, such as Cross-Lingual Text Classification (CLTC), the task of categorizing texts written in different languages into a common taxonomy, are a promising solution to the emerging AI divide. Recent work on CLTC has focused on demonstrating the benefits of using bilingual word embeddings as features, relegating the CLTC problem to a mere benchmark based on a simple averaged perceptron. In this paper, we explore more extensively and systematically two flavors of the CLTC problem: news topic classification and textual churn intent detection (TCID) in social media. In particular, we test the hypothesis that embeddings with context are more effective, by multi-tasking the learning of multilingual word embeddings and text classification; we explore neural architectures for CLTC; and we move from bi- to multi-lingual word embeddings. For all architectures, types of word embeddings, and datasets, we observe a consistent gain in favor of multilingual joint training, especially for low-resource languages.

Robust Cross-lingual Embeddings from Parallel Sentences

Recent advances in cross-lingual word embeddings have primarily relied on mapping-based methods, which project pretrained word embeddings from different languages into a shared space through a linear transformation. However, these approaches assume word embedding spaces are isomorphic between different languages, which has been shown not to hold in practice (Søgaard et al., 2018), and fundamentally limits their performance. This motivates investigating joint learning methods which can overcome this impediment, by simultaneously learning embeddings across languages via a cross-lingual term in the training objective. Given the abundance of parallel data available (Tiedemann, 2012), we propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word and sentence representations. Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches, as well as convincingly outscores mapping methods while maintaining parity with jointly trained methods on word-translation. It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task, requiring far fewer computational resources for training and inference. As an additional advantage, our bilingual method also improves the quality of monolingual word vectors despite training on much smaller datasets. We make our code and models publicly available.

Churn Intent Detection in Multilingual Chatbot Conversations and Social Media

We propose a new method to detect when users express the intent to leave a service, also known as churn. While previous work focuses solely on social media, we show that this intent can be detected in chatbot conversations. As companies increasingly rely on chatbots they need an overview of potentially churny users. To this end, we crowdsource and publish a dataset of churn intent expressions in chatbot interactions in German and English. We show that classifiers trained on social media data can detect the same intent in the context of chatbots. We introduce a classification architecture that outperforms existing work on churn intent detection in social media. Moreover, we show that, using bilingual word embeddings, a system trained on combined English and German data outperforms monolingual approaches. As the only existing dataset is in English, we crowdsource and publish a novel dataset of German tweets. We thus underline the universal aspect of the problem, as examples of churn intent in English help us identify churn in German tweets and chatbot conversations.

* 10 pages
Deep Learning for Prostate Pathology

The current study detects different morphologies related to prostate pathology using deep learning models; these models were evaluated on 2,121 hematoxylin and eosin (H&E) stain histology images captured using bright field microscopy, which spanned a variety of image qualities, origins (whole slide, tissue micro array, whole mount, Internet), scanning machines, timestamps, H&E staining protocols, and institutions. For case usage, these models were applied for the annotation tasks in clinician-oriented pathology reports for prostatectomy specimens. The true positive rate (TPR) for slides with prostate cancer was 99.7% at a false positive rate of 0.785%. The F1-scores of Gleason patterns reported in pathology reports ranged from 0.795 to 1.0 at the case level. TPR was 93.6% for the cribriform morphology and 72.6% for the ductal morphology. The correlation between the ground truth and the prediction for the relative tumor volume was 0.987. Our models cover the major components of prostate pathology and successfully accomplish the annotation tasks.

Ultrametric Distance in Syntax

Jul 08, 2001
Mark D. Roberts

Phrase structure trees have a hierarchical structure. In many subjects, most notably in taxonomy, such tree structures have been studied using ultrametrics. Here syntactical hierarchical phrase trees are subject to a similar analysis, which is much simpler as the branching structure is more readily discernible and switched. The occurrence of hierarchical structure elsewhere in linguistics is mentioned. The phrase tree can be represented by a matrix and the elements of the matrix can be represented by triangles. The height at which branching occurs is not prescribed in previous syntactic models, but it is by using the ultrametric matrix. The ambiguity of which branching height to choose is resolved by postulating that branching occurs at the lowest height available. An ultrametric produces a measure of the complexity of sentences: presumably the complexity of sentences increases as a language is acquired, so this can be tested. All ultrametric triangles are equilateral or isosceles; here it is shown that X-bar structure implies that there are no equilateral triangles. Restricting attention to simple syntax, a minimum ultrametric distance between lexical categories is calculated. This ultrametric distance is shown to be different from the matrix obtained from features. It is shown that the definition of c-command can be replaced by an equivalent ultrametric definition. The new definition invokes a minimum distance between nodes and this is more aesthetically satisfying than previous varieties of definitions. From the new definition of c-command follows a new definition of government.
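
As a small illustration of an ultrametric on a phrase tree, the sketch below takes the distance between two words to be the height of their lowest common ancestor and checks that every triangle is isosceles or equilateral. The toy tree, the height convention (leaves at 0, each node one above its tallest child), and the function names are assumptions, not taken from the paper.

```python
from itertools import combinations

# Hypothetical toy phrase tree: (label, children); leaves are plain strings.
TREE = ("S", [("NP", ["the", "dog"]), ("VP", ["barked"])])

def height(node):
    """Leaves have height 0; a node sits one level above its tallest child."""
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in node[1])

def leaves(node):
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1] for leaf in leaves(child)]

def distances(node, table=None):
    """Distance between two leaves = height of their lowest common ancestor."""
    if table is None:
        table = {}
    if isinstance(node, str):
        return table
    for left, right in combinations(node[1], 2):
        for a in leaves(left):
            for b in leaves(right):
                table[frozenset((a, b))] = height(node)
    for child in node[1]:
        distances(child, table)
    return table

def is_ultrametric(words, table):
    """In an ultrametric, the two largest sides of every triangle are equal."""
    for a, b, c in combinations(words, 3):
        sides = sorted(table[frozenset(pair)] for pair in ((a, b), (a, c), (b, c)))
        if sides[1] != sides[2]:
            return False
    return True

words = leaves(TREE)          # assumes each word occurs once in the sentence
table = distances(TREE)
ultrametric = is_ultrametric(words, table)
```

For this tree, "the" and "dog" are at distance 1 while each is at distance 2 from "barked", giving an isosceles triangle.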

* Prague Bulletin of Mathematical Linguistics 103 (2015) 111-130
* 28 pages, 55508 bytes, 16 eps diagrams, 39 references, some small changes from the previous version, matrices reset, background to this work can be found at: http://cosmology.mth.uct.ac.za/~roberts/pastresearch/ultrametric.html
Constant Regret, Generalized Mixability, and Mirror Descent

Oct 31, 2018
Zakaria Mhammedi, Robert C. Williamson

We consider the setting of prediction with expert advice; a learner makes predictions by aggregating those of a group of experts. Under this setting, and for the right choice of loss function and "mixing" algorithm, it is possible for the learner to achieve a constant regret regardless of the number of prediction rounds. For example, a constant regret can be achieved for mixable losses using the aggregating algorithm. The Generalized Aggregating Algorithm (GAA) is a name for a family of algorithms parameterized by convex functions on simplices (entropies), which reduce to the aggregating algorithm when using the Shannon entropy $\operatorname{S}$. For a given entropy $\Phi$, losses for which a constant regret is possible using the GAA are called $\Phi$-mixable. Which losses are $\Phi$-mixable was previously left as an open question. We fully characterize $\Phi$-mixability and answer other open questions posed by Reid et al. (2015). We show that the Shannon entropy $\operatorname{S}$ is fundamental in nature when it comes to mixability; any $\Phi$-mixable loss is necessarily $\operatorname{S}$-mixable, and the lowest worst-case regret of the GAA is achieved using the Shannon entropy. Finally, by leveraging the connection between the mirror descent algorithm and the update step of the GAA, we suggest a new adaptive generalized aggregating algorithm and analyze its performance in terms of the regret bound.
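
To make "constant regret" concrete, here is a minimal sketch of the classical aggregating algorithm for log loss with learning rate 1, where the learner predicts the weight-averaged expert probability and the regret against the best of N experts is at most ln N regardless of the number of rounds. It does not implement the GAA or the adaptive variant from the paper; the expert predictions and outcomes are made-up data.

```python
import math

def log_loss(p, y):
    """Log loss of predicting probability p for a binary outcome y."""
    return -math.log(p if y == 1 else 1.0 - p)

def aggregate(expert_preds, outcomes):
    """Run the mixture forecaster; return learner loss and per-expert losses."""
    n = len(expert_preds[0])
    weights = [1.0 / n] * n
    learner_loss = 0.0
    expert_loss = [0.0] * n
    for preds, y in zip(expert_preds, outcomes):
        # Learner's prediction: weighted average of the experts' probabilities.
        p = sum(w * q for w, q in zip(weights, preds))
        learner_loss += log_loss(p, y)
        losses = [log_loss(q, y) for q in preds]
        expert_loss = [total + loss for total, loss in zip(expert_loss, losses)]
        # Exponential-weights update with learning rate 1, then renormalize.
        weights = [w * math.exp(-loss) for w, loss in zip(weights, losses)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return learner_loss, expert_loss

# Hypothetical data: three experts, five rounds.
expert_preds = [[0.9, 0.5, 0.2], [0.8, 0.5, 0.3], [0.3, 0.5, 0.9],
                [0.7, 0.5, 0.1], [0.9, 0.5, 0.4]]
outcomes = [1, 1, 0, 1, 1]
learner_loss, expert_loss = aggregate(expert_preds, outcomes)
regret = learner_loss - min(expert_loss)
```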

* 48 pages, accepted to NIPS 2018
Bayesian Neural Networks at Finite Temperature

Apr 08, 2019
Robert J. N. Baldock, Nicola Marzari

We recapitulate the Bayesian formulation of neural network based classifiers and show that, while sampling from the posterior does indeed lead to better generalisation than is obtained by standard optimisation of the cost function, even better performance can in general be achieved by sampling finite temperature ($T$) distributions derived from the posterior. Taking the example of two different deep (3 hidden layers) classifiers for MNIST data, we find quite different $T$ values to be appropriate in each case. In particular, for a typical neural network classifier a clear minimum of the test error is observed at $T>0$. This suggests an early stopping criterion for full batch simulated annealing: cool until the average validation error starts to increase, then revert to the parameters with the lowest validation error. As $T$ is increased classifiers transition from accurate classifiers to classifiers that have higher training error than assigning equal probability to each class. Efficient studies of these temperature-induced effects are enabled using a replica-exchange Hamiltonian Monte Carlo simulation technique. Finally, we show how thermodynamic integration can be used to perform model selection for deep neural networks. Similar to the Laplace approximation, this approach assumes that the posterior is dominated by a single mode. Crucially, however, no assumption is made about the shape of that mode and it is not required to precisely compute and invert the Hessian.

* 11 pages, 4 figures

Apr 29, 2014
Matthew L. Malloy, Robert D. Nowak

This paper proposes a simple adaptive sensing and group testing algorithm for sparse signal recovery. The algorithm, termed Compressive Adaptive Sense and Search (CASS), is shown to be near-optimal in that it succeeds at the lowest possible signal-to-noise-ratio (SNR) levels, improving on previous work in adaptive compressed sensing. Like traditional compressed sensing based on random non-adaptive design matrices, the CASS algorithm requires only k log n measurements to recover a k-sparse signal of dimension n. However, CASS succeeds at SNR levels that are a factor log n less than required by standard compressed sensing. From the point of view of constructing and implementing the sensing operation as well as computing the reconstruction, the proposed algorithm is substantially less computationally intensive than standard compressed sensing. CASS is also demonstrated to perform considerably better in practice through simulation. To the best of our knowledge, this is the first demonstration of an adaptive compressed sensing algorithm with near-optimal theoretical guarantees and excellent practical performance. This paper also shows that methods like compressed sensing, group testing, and pooling have an advantage beyond simply reducing the number of measurements or tests -- adaptive versions of such methods can also improve detection and estimation performance when compared to non-adaptive direct (uncompressed) sensing.
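
The flavor of adaptive sensing with pooled measurements can be shown with a deliberately simplified sketch. It is not the CASS algorithm: it assumes noiseless measurements and nonnegative signal entries (so pooled sums cannot cancel), and `adaptive_search` is a hypothetical name. Each measurement is the sum over a block of coordinates, and only blocks with a nonzero sum are split further.

```python
def adaptive_search(signal):
    """Locate nonzero entries using noiseless aggregate (pooled) measurements."""
    measurements = 0
    support = []
    stack = [(0, len(signal))]
    while stack:
        lo, hi = stack.pop()
        measurements += 1                  # one pooled measurement of signal[lo:hi]
        if sum(signal[lo:hi]) == 0:
            continue                       # block is empty: discard it entirely
        if hi - lo == 1:
            support.append(lo)             # isolated a nonzero coordinate
            continue
        mid = (lo + hi) // 2
        stack.append((mid, hi))            # refine both halves of an active block
        stack.append((lo, mid))
    return sorted(support), measurements

# Hypothetical 2-sparse signal of dimension 64.
signal = [0] * 64
signal[5] = 3
signal[40] = 1
support, measurements = adaptive_search(signal)
```

In this toy setting the number of measurements grows roughly like k log n instead of n, which mirrors the measurement count quoted above; the SNR advantage discussed in the abstract is not modeled here.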

Applying Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) spectral indices for geological mapping and mineral identification on the Tibetan Plateau

Jul 18, 2011
Robert Corrie, Yoshiki Ninomiya, Jonathan Aitchison

The Tibetan Plateau holds clues to understanding the dynamics and mechanisms associated with continental growth. Part of the region is characterized by zones of ophiolitic melange believed to represent the remnants of ancient oceanic crust and underlying upper mantle emplaced during oceanic closures. However, due to the remoteness of the region and the inhospitable terrain many areas have not received detailed investigation. Increased spatial and spectral resolution of satellite sensors have made it possible to map in greater detail the mineralogy and lithology than in the past. Recent work by Yoshiki Ninomiya of the Geological Survey of Japan has pioneered the use of several spectral indices for the mapping of quartzose, carbonate, and silicate rocks using Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) thermal infrared (TIR) data. In this study, ASTER TIR indices have been applied to a region in western-central Tibet for the purposes of assessing their effectiveness for differentiating ophiolites and other lithologies. The results agree well with existing geological maps and other published data. The study area was chosen due to its diverse range of rock types, including an ophiolitic melange, associated with the Bangong-Nujiang suture (BNS) that crops out on the northern shores of Lagkor Tso and Dong Tso ("Tso" is Tibetan for lake). The techniques highlighted in this paper could be applied to other geographical regions where similar geological questions need to be resolved. The results of this study aim to show the utility of ASTER TIR imagery for geological mapping in semi-arid and sparsely vegetated areas on the Tibetan Plateau.

* International Archives of the Photogrammetry, Remote Sensing, and Spatial Information Science, XXXVIII (2010) 464-469
* 6 pages, 4 figures, 2 tables, Published in the International Archives of the Photogrammetry, Remote Sensing, and Spatial Information Science, Volume XXXVIII, pp. 464-469. For associated web page, see http://www.isprs.org/proceedings/XXXVIII/part8/headline/PS-1%20Interactive%20PresentationWG%20VIII5.html
PCA 4 DCA: The Application Of Principal Component Analysis To The Dendritic Cell Algorithm

Apr 20, 2010
Feng Gu, Julie Greensmith, Robert Oates, Uwe Aickelin

As one of the newest members in the field of artificial immune systems (AIS), the Dendritic Cell Algorithm (DCA) is based on behavioural models of natural dendritic cells (DCs). Unlike other AIS, the DCA does not rely on training data; instead, domain or expert knowledge is required to predetermine the mapping between input signals from a particular instance to the three categories used by the DCA. This data preprocessing phase has received the criticism of having manually over-fitted the data to the algorithm, which is undesirable. Therefore, in this paper we have attempted to ascertain if it is possible to use principal component analysis (PCA) techniques to automatically categorise input data while still generating useful and accurate classification results. The integrated system is tested with a biometrics dataset for the stress recognition of automobile drivers. The experimental results have shown the application of PCA to the DCA for the purpose of automated data preprocessing is successful.

* Proceedings of the 9th Annual Workshop on Computational Intelligence (UKCI 2009), Nottingham, UK, 2009
* 6 pages, 4 figures, 3 tables, (UKCI 2009)
Biomedical Mention Disambiguation using a Deep Learning Approach

Sep 23, 2019
Chih-Hsuan Wei, Kyubum Lee, Robert Leaman, Zhiyong Lu

Automatically locating named entities in natural language text - named entity recognition - is an important task in the biomedical domain. Many named entity mentions are ambiguous between several bioconcept types, however, causing text spans to be annotated as more than one type when simultaneously recognizing multiple entity types. The straightforward solution is a rule-based approach applying a priority order based on the precision of each entity tagger (from highest to lowest). While this method is straightforward and useful, imprecise disambiguation remains a significant source of error. We address this issue by generating a partially labeled corpus of ambiguous concept mentions. We first collect named entity mentions from multiple human-curated databases (e.g. CTDbase, gene2pubmed), then correlate them with the text-mined span from PubTator to provide the context where the mention appears. Our corpus contains more than 3 million concept mentions that are ambiguous between one or more concept types in PubTator (about 3% of all mentions). We approached this task as a classification problem and developed a deep learning-based method which uses the semantics of the span being classified and the surrounding words to identify the most likely bioconcept type. More specifically, we develop a convolutional neural network (CNN) and a long short-term memory (LSTM) network to handle the semantic and syntax features, respectively, then concatenate these within a fully connected layer for final classification. The priority ordering rule-based approach demonstrated F1-scores of 71.29% (micro-averaged) and 41.19% (macro-averaged), while the new disambiguation method demonstrated F1-scores of 91.94% (micro-averaged) and 85.42% (macro-averaged), a very substantial increase.
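
The priority-order baseline and the two F1 averages mentioned above can be sketched as follows. The tagger precisions, the four example mentions, and the function names are hypothetical; the point is only to show how the rule picks a type and why micro- and macro-averaged F1 can differ on the same predictions.

```python
def disambiguate(candidate_types, tagger_precision):
    """Priority rule: keep the type whose tagger has the highest precision."""
    return max(candidate_types, key=lambda t: tagger_precision[t])

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred):
    labels = sorted(set(gold) | set(pred))
    counts = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        counts[label] = (tp, fp, fn)
    # Micro: pool the counts over all types, then compute one F1.
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: compute F1 per type, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro

# Hypothetical tagger precisions and ambiguous mentions (candidates, gold type).
precision = {"Gene": 0.9, "Chemical": 0.8, "Disease": 0.7}
mentions = [
    (("Gene", "Chemical"), "Gene"),
    (("Gene", "Disease"), "Disease"),
    (("Chemical", "Disease"), "Chemical"),
    (("Gene", "Chemical"), "Chemical"),
]
gold = [g for _, g in mentions]
pred = [disambiguate(c, precision) for c, _ in mentions]
micro, macro = micro_macro_f1(gold, pred)
```

Because the rule always favors the highest-precision tagger, lower-priority types are rarely predicted, which pulls the macro average below the micro average in this toy example.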

Measuring an Artificial Intelligence System's Performance on a Verbal IQ Test For Young Children

Sep 11, 2015
Stellan Ohlsson, Robert H. Sloan, György Turán, Aaron Urasky

We administered the Verbal IQ (VIQ) part of the Wechsler Preschool and Primary Scale of Intelligence (WPPSI-III) to the ConceptNet 4 AI system. The test questions (e.g., "Why do we shake hands?") were translated into ConceptNet 4 inputs using a combination of the simple natural language processing tools that come with ConceptNet together with short Python programs that we wrote. The question answering used a version of ConceptNet based on spectral methods. The ConceptNet system scored a WPPSI-III VIQ that is average for a four-year-old child, but below average for five- to seven-year-olds. Large variations among subtests indicate potential areas of improvement. In particular, results were strongest for the Vocabulary and Similarities subtests, intermediate for the Information subtest, and lowest for the Comprehension and Word Reasoning subtests. Comprehension is the subtest most strongly associated with common sense. The large variations among subtests and ordinary common sense strongly suggest that the WPPSI-III VIQ results do not show that "ConceptNet has the verbal abilities of a four-year-old." Rather, children's IQ tests offer one objective metric for the evaluation and comparison of AI systems. Also, this work continues previous research on Psychometric AI.

* 17 pages, 3 figures