We describe classical analogues to quantum algorithms for principal component analysis and nearest-centroid clustering. Given sampling assumptions, our classical algorithms run in time polylogarithmic in input, matching the runtime of the quantum algorithms with only polynomial slowdown. These algorithms are evidence that their corresponding problems do not yield exponential quantum speedups. To build our classical algorithms, we use the same techniques as applied in our previous work dequantizing a quantum recommendation systems algorithm. Thus, we provide further evidence for the strength of classical $\ell^2$-norm sampling assumptions when replacing quantum state preparation assumptions, in the machine learning domain. Click to Read Paper
A recommendation system suggests products to users based on data about user preferences. It is typically modeled by a problem of completing an $m\times n$ matrix of small rank $k$. We give the first classical algorithm to produce a recommendation in $O(\text{poly}(k)\text{polylog}(m,n))$ time, which is an exponential improvement on previous algorithms that run in time linear in $m$ and $n$. Our strategy is inspired by a quantum algorithm by Kerenidis and Prakash: like the quantum algorithm, instead of reconstructing a user's full list of preferences, we only seek a randomized sample from the user's preferences. Our main result is an algorithm that samples high-weight entries from a low-rank approximation of the input matrix in time independent of $m$ and $n$, given natural sampling assumptions on that input matrix. As a consequence, we show that Kerenidis and Prakash's quantum machine learning (QML) algorithm, one of the strongest candidates for provably exponential speedups in QML, does not in fact give an exponential speedup over classical algorithms. Click to Read Paper
Segments that span contiguous parts of inputs, such as phonemes in speech, named entities in sentences, actions in videos, occur frequently in sequence prediction problems. Segmental models, a class of models that explicitly hypothesizes segments, have allowed the exploration of rich segment features for sequence prediction. However, segmental models suffer from slow decoding, hampering the use of computationally expensive features. In this thesis, we introduce discriminative segmental cascades, a multi-pass inference framework that allows us to improve accuracy by adding higher-order features and neural segmental features while maintaining efficiency. We also show that instead of including more features to obtain better accuracy, segmental cascades can be used to speed up training and decoding. Segmental models, similarly to conventional speech recognizers, are typically trained in multiple stages. In the first stage, a frame classifier is trained with manual alignments, and then in the second stage, segmental models are trained with manual alignments and the out- puts of the frame classifier. However, obtaining manual alignments are time-consuming and expensive. We explore end-to-end training for segmental models with various loss functions, and show how end-to-end training with marginal log loss can eliminate the need for detailed manual alignments. We draw the connections between the marginal log loss and a popular end-to-end training approach called connectionist temporal classification. We present a unifying framework for various end-to-end graph search-based models, such as hidden Markov models, connectionist temporal classification, and segmental models. Finally, we discuss possible extensions of segmental models to large-vocabulary sequence prediction tasks. Click to Read Paper
We derive the limiting distribution for the largest eigenvalues of the adjacency matrix for a stochastic blockmodel graph when the number of vertices tends to infinity. We show that, in the limit, these eigenvalues are jointly multivariate normal with bounded covariances. Our result extends the classic result of F\"{u}redi and Koml\'{o}s on the fluctuation of the largest eigenvalue for Erd\H{o}s-R\'{e}nyi graphs. Click to Read Paper
This paper reviews the state-of-the-art of semantic change computation, one emerging research field in computational linguistics, proposing a framework that summarizes the literature by identifying and expounding five essential components in the field: diachronic corpus, diachronic word sense characterization, change modelling, evaluation data and data visualization. Despite the potential of the field, the review shows that current studies are mainly focused on testifying hypotheses proposed in theoretical linguistics and that several core issues remain to be solved: the need for diachronic corpora of languages other than English, the need for comprehensive evaluation data for evaluation, the comparison and construction of approaches to diachronic word sense characterization and change modelling, and further exploration of data visualization techniques for hypothesis justification. Click to Read Paper
TF.Learn is a high-level Python module for distributed machine learning inside TensorFlow. It provides an easy-to-use Scikit-learn style interface to simplify the process of creating, configuring, training, evaluating, and experimenting a machine learning model. TF.Learn integrates a wide range of state-of-art machine learning algorithms built on top of TensorFlow's low level APIs for small to large-scale supervised and unsupervised problems. This module focuses on bringing machine learning to non-specialists using a general-purpose high-level language as well as researchers who want to implement, benchmark, and compare their new methods in a structured environment. Emphasis is put on ease of use, performance, documentation, and API consistency. Click to Read Paper
This work focuses on top-k recommendation in domains where underlying data distribution shifts overtime. We propose to learn a time-dependent bias for each item over whatever existing recommendation engine. Such a bias learning process alleviates data sparsity in constructing the engine, and at the same time captures recent trend shift observed in data. We present an alternating optimization framework to resolve the bias learning problem, and develop methods to handle a variety of commonly used recommendation evaluation criteria, as well as large number of items and users in practice. The proposed algorithm is examined, both offline and online, using real world data sets collected from the largest retailer worldwide. Empirical results demonstrate that the bias learning can almost always boost recommendation performance. We encourage other practitioners to adopt it as a standard component in recommender systems where temporal dynamics is a norm. Click to Read Paper
Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these "deep learning" models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results using L2-SVMs show that by simply replacing softmax with linear SVMs gives significant gains on popular deep learning datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge. Click to Read Paper
This paper revisits the classic iterative proportional scaling (IPS) from a modern optimization perspective. In contrast to the criticisms made in the literature, we show that based on a coordinate descent characterization, IPS can be slightly modified to deliver coefficient estimates, and from a majorization-minimization standpoint, IPS can be extended to handle log-affine models with features not necessarily binary-valued or nonnegative. Furthermore, some state-of-the-art optimization techniques such as block-wise computation, randomization and momentum-based acceleration can be employed to provide more scalable IPS algorithms, as well as some regularized variants of IPS for concurrent feature selection. Click to Read Paper
This paper presents a simple but effective density-based outlier detection approach with the local kernel density estimation (KDE). A Relative Density-based Outlier Score (RDOS) is introduced to measure the local outlierness of objects, in which the density distribution at the location of an object is estimated with a local KDE method based on extended nearest neighbors of the object. Instead of using only $k$ nearest neighbors, we further consider reverse nearest neighbors and shared nearest neighbors of an object for density distribution estimation. Some theoretical properties of the proposed RDOS including its expected value and false alarm probability are derived. A comprehensive experimental study on both synthetic and real-life data sets demonstrates that our approach is more effective than state-of-the-art outlier detection methods. Click to Read Paper
In this paper, we present a new wrapper feature selection approach based on Jensen-Shannon (JS) divergence, termed feature selection with maximum JS-divergence (FSMJ), for text categorization. Unlike most existing feature selection approaches, the proposed FSMJ approach is based on real-valued features which provide more information for discrimination than binary-valued features used in conventional approaches. We show that the FSMJ is a greedy approach and the JS-divergence monotonically increases when more features are selected. We conduct several experiments on real-life data sets, compared with the state-of-the-art feature selection approaches for text categorization. The superior performance of the proposed FSMJ approach demonstrates its effectiveness and further indicates its wide potential applications on data mining. Click to Read Paper
In recent years, we have witnessed a dramatic shift towards techniques driven by neural networks for a variety of NLP tasks. Undoubtedly, neural language models (NLMs) have reduced perplexity by impressive amounts. This progress, however, comes at a substantial cost in performance, in terms of inference latency and energy consumption, which is particularly of concern in deployments on mobile devices. This paper, which examines the quality-performance tradeoff of various language modeling techniques, represents to our knowledge the first to make this observation. We compare state-of-the-art NLMs with "classic" Kneser-Ney (KN) LMs in terms of energy usage, latency, perplexity, and prediction accuracy using two standard benchmarks. On a Raspberry Pi, we find that orders of increase in latency and energy usage correspond to less change in perplexity, while the difference is much less pronounced on a desktop. Click to Read Paper
Recurrent neural networks have been the dominant models for many speech and language processing tasks. However, we understand little about the behavior and the class of functions recurrent networks can realize. Moreover, the heuristics used during training complicate the analyses. In this paper, we study recurrent networks' ability to learn long-term dependency in the context of speech recognition. We consider two decoding approaches, online and batch decoding, and show the classes of functions to which the decoding approaches correspond. We then draw a connection between batch decoding and a popular training approach for recurrent networks, truncated backpropagation through time. Changing the decoding approach restricts the amount of past history recurrent networks can use for prediction, allowing us to analyze their ability to remember. Empirically, we utilize long-term dependency in subphonetic states, phonemes, and words, and show how the design decisions, such as the decoding approach, lookahead, context frames, and consecutive prediction, characterize the behavior of recurrent networks. Finally, we draw a connection between Markov processes and vanishing gradients. These results have implications for studying the long-term dependency in speech data and how these properties are learned by recurrent networks. Click to Read Paper
Acoustics-to-word models are end-to-end speech recognizers that use words as targets without relying on pronunciation dictionaries or graphemes. These models are notoriously difficult to train due to the lack of linguistic knowledge. It is also unclear how the amount of training data impacts the optimization and generalization of such models. In this work, we study the optimization and generalization of acoustics-to-word models under different amounts of training data. In addition, we study three types of inductive bias, leveraging a pronunciation dictionary, word boundary annotations, and constraints on word durations. We find that constraining word durations leads to the most improvement. Finally, we analyze the word embedding space learned by the model, and find that the space has a structure dominated by the pronunciation of words. This suggests that the contexts of words, instead of their phonetic structure, should be the future focus of inductive bias in acoustics-to-word models. Click to Read Paper
We propose to improve trust region policy search with normalizing flows policy. We illustrate that when the trust region is constructed by KL divergence constraint, normalizing flows policy can generate samples far from the 'center' of the previous policy iterate, which potentially enables better exploration and helps avoid bad local optima. We show that normalizing flows policy significantly improves upon factorized Gaussian policy baseline, with both TRPO and ACKTR, especially on tasks with complex dynamics such as Humanoid. Click to Read Paper
Neural language models (NLMs) exist in an accuracy-efficiency tradeoff space where better perplexity typically comes at the cost of greater computation complexity. In a software keyboard application on mobile devices, this translates into higher power consumption and shorter battery life. This paper represents the first attempt, to our knowledge, in exploring accuracy-efficiency tradeoffs for NLMs. Building on quasi-recurrent neural networks (QRNNs), we apply pruning techniques to provide a "knob" to select different operating points. In addition, we propose a simple technique to recover some perplexity using a negligible amount of memory. Our empirical evaluations consider both perplexity as well as energy consumption on a Raspberry Pi, where we demonstrate which methods provide the best perplexity-power consumption operating point. At one operating point, one of the techniques is able to provide energy savings of 40% over the state of the art with only a 17% relative increase in perplexity. Click to Read Paper
Action recognition has attracted increasing attention from RGB input in computer vision partially due to potential applications on somatic simulation and statistics of sport such as virtual tennis game and tennis techniques and tactics analysis by video. Recently, deep learning based methods have achieved promising performance for action recognition. In this paper, we propose weighted Long Short-Term Memory adopted with convolutional neural network representations for three dimensional tennis shots recognition. First, the local two-dimensional convolutional neural network spatial representations are extracted from each video frame individually using a pre-trained Inception network. Then, a weighted Long Short-Term Memory decoder is introduced to take the output state at time t and the historical embedding feature at time t-1 to generate feature vector using a score weighting scheme. Finally, we use the adopted CNN and weighted LSTM to map the original visual features into a vector space to generate the spatial-temporal semantical description of visual sequences and classify the action video content. Experiments on the benchmark demonstrate that our method using only simple raw RGB video can achieve better performance than the state-of-the-art baselines for tennis shot recognition. Click to Read Paper
We explore the application of deep residual learning and dilated convolutions to the keyword spotting task, using the recently-released Google Speech Commands Dataset as our benchmark. Our best residual network (ResNet) implementation significantly outperforms Google's previous convolutional neural networks in terms of accuracy. By varying model depth and width, we can achieve compact models that also outperform previous small-footprint variants. To our knowledge, we are the first to examine these approaches for keyword spotting, and our results establish an open-source state-of-the-art reference to support the development of future speech-based interfaces. Click to Read Paper
We propose a novel way to train ranking models, such as recommender systems, that are both effective and efficient. Knowledge distillation (KD) was shown to be successful in image recognition to achieve both effectiveness and efficiency. We propose a KD technique for learning to rank problems, called \emph{ranking distillation (RD)}. Specifically, we train a smaller student model to learn to rank documents/items from both the training data and the supervision of a larger teacher model. The student model achieves a similar ranking performance to that of the large teacher model, but its smaller model size makes the online inference more efficient. RD is flexible because it is orthogonal to the choices of ranking models for the teacher and student. We address the challenges of RD for ranking problems. The experiments on public data sets and state-of-the-art recommendation models showed that RD achieves its design purposes: the student model learnt with RD has a model size less than half of the teacher model while achieving a ranking performance similar to the teacher model and much better than the student model learnt without RD. Click to Read Paper
Top-$N$ sequential recommendation models each user as a sequence of items interacted in the past and aims to predict top-$N$ ranked items that a user will likely interact in a `near future'. The order of interaction implies that sequential patterns play an important role where more recent items in a sequence have a larger impact on the next item. In this paper, we propose a Convolutional Sequence Embedding Recommendation Model (\emph{Caser}) as a solution to address this requirement. The idea is to embed a sequence of recent items into an `image' in the time and latent spaces and learn sequential patterns as local features of the image using convolutional filters. This approach provides a unified and flexible network structure for capturing both general preferences and sequential patterns. The experiments on public datasets demonstrated that Caser consistently outperforms state-of-the-art sequential recommendation methods on a variety of common evaluation metrics. Click to Read Paper