Models, code, and papers for "Larry Heck":
Several methods exist for a computer to generate music based on data, including Markov chains, recurrent neural networks, recombinancy, and grammars. We explore the use of unit selection and concatenation as a means of generating music using a procedure based on ranking, where we consider a unit to be a variable-length number of measures of music. We first examine whether a unit selection method that is restricted to a finite-size unit library can be sufficient for encompassing a wide spectrum of music. We do this by developing a deep autoencoder that encodes a musical input and reconstructs the input by selecting from the library. We then describe a generative model that combines a deep structured semantic model (DSSM) with an LSTM to predict the next unit, where units consist of four, two, and one measures of music. We evaluate the generative model using objective metrics, including mean rank and accuracy, and with a subjective listening test in which expert musicians are asked to complete a forced-choice ranking task. We compare our model to a note-level generative baseline that consists of a stacked LSTM trained to predict forward by one note.
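A minimal sketch of the ranking step described above, assuming precomputed unit embeddings and placeholder sequence-model scores; the 64-dimensional vectors, library size, and equal blend weights are illustrative assumptions, not the paper's settings:

```python
# Sketch: rank candidate units by blending DSSM-style semantic similarity
# to the current context with an LSTM-style next-unit score (hypothetical).
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def rank_units(context_vec, unit_library, seq_scores):
    """Return unit indices ordered best-first for concatenation."""
    scores = [0.5 * cosine(context_vec, u) + 0.5 * s
              for u, s in zip(unit_library, seq_scores)]
    return np.argsort(scores)[::-1]

rng = np.random.default_rng(0)
library = rng.normal(size=(100, 64))   # 100 units, 64-dim embeddings
context = rng.normal(size=64)          # embedding of the music so far
lstm_scores = rng.uniform(size=100)    # placeholder sequence-model scores
print(rank_units(context, library, lstm_scores)[:5])
```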
The goal of this paper is to use multi-task learning to efficiently scale slot filling models for natural language understanding to handle multiple target tasks or domains. The key to scalability is reducing the amount of training data needed to learn a model for a new task. The proposed multi-task model delivers better performance with less data by leveraging patterns that it learns from the other tasks. The approach supports an open vocabulary, allowing the models to generalize to unseen words, which is particularly important when very little training data is available. A newly collected crowd-sourced dataset covering four different domains is used to demonstrate the effectiveness of the domain adaptation and open vocabulary techniques.
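A rough sketch of the multi-task idea: a shared encoder is learned across tasks with a lightweight per-domain output head, so a new domain adds only a small head. The domain names, tag counts, and layer sizes below are hypothetical:

```python
# Sketch: multi-task slot filler with a shared encoder and per-domain heads.
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, vocab_size, emb=100, hidden=128,
                 domain_tags={"alarm": 5, "weather": 7}):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True,
                               bidirectional=True)   # shared across domains
        self.heads = nn.ModuleDict(                  # one tagger per domain
            {d: nn.Linear(2 * hidden, n) for d, n in domain_tags.items()})

    def forward(self, token_ids, domain):
        h, _ = self.encoder(self.embed(token_ids))
        return self.heads[domain](h)   # per-token slot-tag logits

model = MultiTaskTagger(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 12)), domain="weather")
print(logits.shape)  # (2, 12, 7)
```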
Entity Disambiguation aims to link mentions of ambiguous entities to a knowledge base (e.g., Wikipedia). Modeling topical coherence is crucial for this task, based on the assumption that information from the same semantic context tends to belong to the same topic. This paper presents a novel deep semantic relatedness model (DSRM) based on deep neural networks (DNN) and semantic knowledge graphs (KGs) to measure entity semantic relatedness for topical coherence modeling. The DSRM is directly trained on large-scale KGs, and it maps heterogeneous types of knowledge of an entity from KGs to numerical feature vectors in a latent space such that the distance between two semantically related entities is minimized. Compared with the state-of-the-art relatedness approach proposed by Milne and Witten (2008a), the DSRM obtains 19.4% and 24.5% reductions in entity disambiguation errors on two publicly available datasets, respectively.
Dialogue state tracking (DST) is a key component of task-oriented dialogue systems. DST estimates the user's goal at each user turn given the interaction until then. State-of-the-art approaches for state tracking rely on deep learning methods and represent dialogue state as a distribution over all possible slot values for each slot present in the ontology. Such a representation is not scalable when the set of possible values is unbounded (e.g., date, time or location) or dynamic (e.g., movies or usernames). Furthermore, training such models requires labeled data in which each user turn is annotated with the dialogue state, which makes building models for new domains challenging. In this paper, we present a scalable multi-domain deep learning based approach for DST. We introduce a novel framework for state tracking that is independent of the slot value set, and represent the dialogue state as a distribution over a set of values of interest (the candidate set) derived from the dialogue history or knowledge. Restricting these candidate sets to be bounded in size addresses the problem of slot scalability. Furthermore, by leveraging the slot-independent architecture and transfer learning, we show that our proposed approach facilitates quick adaptation to new domains.
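A minimal sketch of the candidate-set idea, assuming candidate values have already been extracted from the dialogue history or knowledge and embedded; the scoring network and dimensions are illustrative, not the paper's architecture:

```python
# Sketch: score a bounded candidate set for one slot, independent of the
# ontology's full value list.
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                   nn.Linear(dim, 1))

    def forward(self, dialogue_vec, candidate_vecs):
        # dialogue_vec: (dim,); candidate_vecs: (k, dim) from history/knowledge
        k = candidate_vecs.size(0)
        pairs = torch.cat([dialogue_vec.expand(k, -1), candidate_vecs], dim=-1)
        return self.score(pairs).squeeze(-1).softmax(dim=0)  # dist. over k

scorer = CandidateScorer()
dist = scorer(torch.randn(128), torch.randn(5, 128))  # k = 5 candidates
print(dist.sum())  # ~1.0
```

Because the model only ever sees k candidate embeddings, the same scorer can be reused for any slot, which is what makes the representation independent of the slot value set.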
Training statistical dialog models in spoken dialog systems (SDS) requires large amounts of annotated data. The lack of scalable methods for data mining and annotation poses a significant hurdle for state-of-the-art statistical dialog managers. This paper presents an approach that directly leverages billions of web search and browse sessions to overcome this hurdle. The key insight is that task completion through web search and browse sessions is (a) predictable and (b) generalizes to spoken dialog task completion. The new method automatically mines behavioral search and browse patterns from web logs and translates them into spoken dialog models. We experiment with naturally occurring spoken dialogs and large-scale web logs. Our session-based models outperform the state-of-the-art method on the entity extraction task in SDS. We also achieve better performance on both entity and relation extraction from web search queries when compared with nontrivial baselines.
In this work we describe and evaluate methods for learning musical embeddings. Each embedding is a vector that represents four contiguous beats of music and is derived from a symbolic representation. We consider autoencoding-based methods, including denoising autoencoders and context reconstruction, and evaluate the resulting embeddings on a forward prediction task and a classification task.
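As a flavor of the autoencoding approach, here is a small denoising autoencoder over a binary piano-roll encoding of four beats; the 4 x 128 input shape, noise rate, and layer sizes are assumptions for illustration:

```python
# Sketch: denoising autoencoder whose bottleneck is the musical embedding.
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, inp=4 * 128, emb=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(inp, 256), nn.ReLU(),
                                 nn.Linear(256, emb))
        self.dec = nn.Sequential(nn.Linear(emb, 256), nn.ReLU(),
                                 nn.Linear(256, inp))

    def forward(self, x):
        noisy = x * (torch.rand_like(x) > 0.2).float()  # drop ~20% of notes
        z = self.enc(noisy)                             # the embedding
        return self.dec(z), z

model = DenoisingAE()
x = (torch.rand(8, 4 * 128) > 0.95).float()  # toy binary piano rolls
recon, z = model(x)
loss = nn.functional.binary_cross_entropy_with_logits(recon, x)
loss.backward()
```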
We present a framework combining hierarchical and multi-agent deep reinforcement learning approaches to solve coordination problems among a multitude of agents using a semi-decentralized model. The framework extends the multi-agent learning setup by introducing a meta-controller that guides the communication between agent pairs, enabling agents to focus on communicating with only one other agent at any step. This hierarchical decomposition of the task allows for efficient exploration to learn policies that identify globally optimal solutions even as the number of collaborating agents increases. We show promising initial experimental results on a simulated distributed scheduling problem.
Spoken Language Understanding (SLU) is a key component of goal-oriented dialogue systems that parses user utterances into semantic frame representations. Traditionally, SLU does not utilize the dialogue history beyond the previous system turn, and contextual ambiguities are resolved by downstream components. In this paper, we explore novel approaches for modeling dialogue context in a recurrent neural network (RNN) based language understanding system. We propose the Sequential Dialogue Encoder Network, which allows encoding context from the dialogue history in chronological order. We compare the performance of our proposed architecture with two context models: one that uses just the previous turn's context, and another that encodes dialogue context in a memory network but loses the order of utterances in the dialogue history. Experiments with a multi-domain dialogue dataset demonstrate that the proposed architecture results in reduced semantic frame error rates.
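A compact sketch of the chronological encoding idea: a turn-level RNN embeds each utterance, and a session-level RNN consumes the turn vectors in order, so the final state summarizes the history without losing utterance order. Dimensions and RNN choices are illustrative, not the paper's:

```python
# Sketch: hierarchical dialogue-context encoder.
import torch
import torch.nn as nn

class SequentialDialogueEncoder(nn.Module):
    def __init__(self, vocab=5000, emb=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.turn_rnn = nn.GRU(emb, hidden, batch_first=True)
        self.session_rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, history):          # history: (num_turns, max_tokens)
        _, h = self.turn_rnn(self.embed(history))  # h: (1, num_turns, hidden)
        _, context = self.session_rnn(h)           # read turns in order
        return context.squeeze(0).squeeze(0)       # final dialogue context

enc = SequentialDialogueEncoder()
ctx = enc(torch.randint(0, 5000, (3, 10)))  # 3 past turns, 10 tokens each
print(ctx.shape)  # (128,)
```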
State-of-the-art slot filling models for goal-oriented human/machine conversational language understanding systems rely on deep learning methods. While multi-task training of such models alleviates the need for large in-domain annotated datasets, bootstrapping a semantic parsing model for a new domain using only the semantic frame, such as the back-end API or knowledge graph schema, remains one of the holy grail tasks of language understanding for dialogue systems. This paper proposes a deep learning based approach that can use just the slot description in context, without the need for any labeled or unlabeled in-domain examples, to quickly bootstrap a new domain. The main idea is to leverage the encoding of the slot names and descriptions within a multi-task deep-learned slot filling model to implicitly align slots across domains. The proposed approach is promising for solving the domain scaling problem and eliminating the need for any manually annotated data or explicit schema alignment. Furthermore, our experiments on multiple domains show that this approach results in significantly better slot-filling performance when compared to using only in-domain data, especially in the low data regime.
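A hedged sketch of conditioning a tagger on a slot-description embedding, so slots from an unseen domain can be filled by feeding in their descriptions at test time; the encoders and the per-slot IOB formulation are assumptions for illustration:

```python
# Sketch: description-conditioned slot tagger for zero-shot domains.
import torch
import torch.nn as nn

class DescriptionConditionedTagger(nn.Module):
    def __init__(self, vocab=5000, emb=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.utt_enc = nn.LSTM(emb, hidden, batch_first=True,
                               bidirectional=True)
        self.desc_enc = nn.LSTM(emb, hidden, batch_first=True)
        # predict IOB tags {Outside, Begin, Inside} for this one slot
        self.out = nn.Linear(2 * hidden + hidden, 3)

    def forward(self, utterance, slot_description):
        tok, _ = self.utt_enc(self.embed(utterance))      # (B, T, 2*hidden)
        _, (d, _) = self.desc_enc(self.embed(slot_description))
        d = d.squeeze(0).unsqueeze(1).expand(-1, tok.size(1), -1)
        return self.out(torch.cat([tok, d], dim=-1))      # per-token logits

m = DescriptionConditionedTagger()
logits = m(torch.randint(0, 5000, (2, 12)), torch.randint(0, 5000, (2, 4)))
print(logits.shape)  # (2, 12, 3)
```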
We propose a novel zero-shot learning method for semantic utterance classification (SUC). It learns a classifier $f: X \to Y$ for problems where none of the semantic categories $Y$ are present in the training set. The framework uncovers the link between categories and utterances using a semantic space. We show that this semantic space can be learned by deep neural networks trained on large amounts of search engine query log data. More precisely, we propose a novel method that can learn discriminative semantic features without supervision. It uses the zero-shot learning framework to guide the learning of the semantic features. We demonstrate the effectiveness of the zero-shot semantic learning algorithm on the SUC dataset collected by Tur (2012). Furthermore, we achieve state-of-the-art results by combining the semantic features with a supervised method.
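A minimal sketch of zero-shot classification in a shared semantic space: embed the utterance and every category name, then pick the nearest category by cosine similarity. The random vectors below are stand-ins for the query-log-trained DNN features:

```python
# Sketch: nearest-category classification in a shared semantic space.
import numpy as np

def classify_zero_shot(utterance_vec, category_vecs, category_names):
    sims = category_vecs @ utterance_vec / (
        np.linalg.norm(category_vecs, axis=1)
        * np.linalg.norm(utterance_vec) + 1e-9)
    return category_names[int(np.argmax(sims))]

rng = np.random.default_rng(1)
cats = ["restaurants", "flights", "weather"]  # none seen during training
cat_vecs = rng.normal(size=(3, 64))           # semantic-space embeddings
print(classify_zero_shot(rng.normal(size=64), cat_vecs, cats))
```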
In this work, we present a hybrid learning method for training task-oriented dialogue systems through online user interactions. Popular methods for learning task-oriented dialogues include applying reinforcement learning with user feedback on supervised pre-training models. The efficiency of such learning methods may suffer from the mismatch of dialogue state distributions between the offline training and online interactive learning stages. To address this challenge, we propose a hybrid imitation and reinforcement learning method, with which a dialogue agent can effectively learn from its interaction with users by learning from human teaching and feedback. We design a neural network based task-oriented dialogue agent that can be optimized end-to-end with the proposed learning method. Experimental results show that our end-to-end dialogue agent can learn effectively from the mistakes it makes via imitation learning from user teaching. Applying reinforcement learning with user feedback after the imitation learning stage further improves the agent's capability to successfully complete a task.
In this paper, we present a neural network based task-oriented dialogue system that can be optimized end-to-end with deep reinforcement learning (RL). The system is able to track dialogue state, interface with knowledge bases, and incorporate query results into the agent's responses to successfully complete task-oriented dialogues. Dialogue policy learning is conducted with hybrid supervised and deep RL methods. We first train the dialogue agent in a supervised manner by learning directly from task-oriented dialogue corpora, and further optimize it with deep RL during its interaction with users. In experiments on two different dialogue task domains, our model demonstrates robust performance in tracking dialogue state and producing reasonable system responses. We show that deep RL based optimization leads to significant improvement in task success rate and reduction in dialogue length compared to the supervised training model. We further show the benefits of training the task-oriented dialogue model end-to-end compared to component-wise optimization, with experimental results on dialogue simulations and human evaluations.
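As a sketch of the RL stage, a bare-bones REINFORCE update driven by a terminal task-success reward could look like the following; the toy policy, reward definition, and absence of a baseline are simplifications, not the paper's exact method:

```python
# Sketch: REINFORCE fine-tuning of a dialogue policy after supervised
# pretraining, using task success as the reward signal.
import torch

def reinforce_update(policy, optimizer, dialogue_states, actions, success):
    reward = 1.0 if success else -1.0          # terminal task-success reward
    log_probs = torch.stack([
        torch.distributions.Categorical(logits=policy(s)).log_prob(a)
        for s, a in zip(dialogue_states, actions)])
    loss = -(reward * log_probs).sum()         # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

policy = torch.nn.Linear(64, 10)               # toy state -> action logits
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
states = [torch.randn(64) for _ in range(5)]
acts = [torch.tensor(i % 10) for i in range(5)]
reinforce_update(policy, opt, states, acts, success=True)
```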
Object detection models shipped with camera-equipped mobile devices cannot cover the objects of interest for every user. Therefore, incremental learning capability is a critical feature for a robust and personalized mobile object detection system that many applications would rely on. In this paper, we present an efficient yet practical system, IMOD, to incrementally train an existing object detection model such that it can detect new object classes without losing its capability to detect old classes. The key component of IMOD is a novel incremental learning algorithm that trains one-stage deep object detection models end-to-end using only training data of the new object classes. Specifically, to avoid catastrophic forgetting, the algorithm distills three types of knowledge from the old model to mimic the old model's behavior on object classification, bounding box regression, and feature extraction. In addition, since training data for the new classes may not be available, a real-time dataset construction pipeline is designed to collect training images on the fly and automatically label them with both category and bounding box annotations. We have implemented IMOD under both mobile-cloud and mobile-only setups. Experimental results show that the proposed system can learn to detect a new object class in just a few minutes, including both dataset construction and model training. In comparison, traditional fine-tuning based methods may take a few hours for training, and in most cases would also need a tedious and costly manual dataset labeling step.
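A hedged sketch of the three distillation terms, combining a softened classification match, a box-regression match, and a feature match against the old model's outputs; the loss choices, temperature, and weights are plausible stand-ins rather than IMOD's exact formulation:

```python
# Sketch: three-part distillation loss to mimic the old detector's behavior.
import torch
import torch.nn.functional as F

def distillation_loss(new_cls, old_cls, new_box, old_box, new_feat, old_feat,
                      T=2.0, w=(1.0, 1.0, 1.0)):
    # (1) classification: match softened old-class logits
    cls = F.kl_div(F.log_softmax(new_cls / T, dim=-1),
                   F.softmax(old_cls / T, dim=-1),
                   reduction="batchmean") * T * T
    box = F.smooth_l1_loss(new_box, old_box)   # (2) bounding-box regression
    feat = F.mse_loss(new_feat, old_feat)      # (3) backbone features
    return w[0] * cls + w[1] * box + w[2] * feat

loss = distillation_loss(torch.randn(8, 20), torch.randn(8, 20),
                         torch.randn(8, 4), torch.randn(8, 4),
                         torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```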
Documents exhibit sequential structure at multiple levels of abstraction (e.g., sentences, paragraphs, sections). These abstractions constitute a natural hierarchy for representing the context in which to infer the meaning of words and larger fragments of text. In this paper, we present CLSTM (Contextual LSTM), an extension of the recurrent neural network LSTM (Long Short-Term Memory) model, where we incorporate contextual features (e.g., topics) into the model. We evaluate CLSTM on three specific NLP tasks: word prediction, next sentence selection, and sentence topic prediction. Results from experiments run on two corpora, English documents in Wikipedia and a subset of articles from a recent snapshot of English Google News, indicate that using both words and topics as features improves performance of the CLSTM models over baseline LSTM models for these tasks. For example, on the next sentence selection task, we get relative accuracy improvements of 21% for the Wikipedia dataset and 18% for the Google News dataset. This clearly demonstrates the significant benefit of using context appropriately in natural language (NL) tasks, with implications for a wide variety of NL applications like question answering, sentence completion, paraphrase generation, and next utterance prediction in dialog systems.
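A minimal sketch of the CLSTM idea: the LSTM input at each step is the concatenation of a word embedding and a topic embedding. How topics are assigned (e.g., one topic id per sentence or paragraph) is abstracted away here, and all sizes are illustrative:

```python
# Sketch: LSTM whose input concatenates word and topic embeddings.
import torch
import torch.nn as nn

class ContextualLSTM(nn.Module):
    def __init__(self, vocab=10000, n_topics=100, emb=128, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, emb)
        self.topic_emb = nn.Embedding(n_topics, emb)
        self.lstm = nn.LSTM(2 * emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)    # next-word prediction head

    def forward(self, words, topics):          # both: (batch, seq)
        x = torch.cat([self.word_emb(words), self.topic_emb(topics)], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)                     # (batch, seq, vocab) logits

m = ContextualLSTM()
logits = m(torch.randint(0, 10000, (2, 20)), torch.randint(0, 100, (2, 20)))
print(logits.shape)
```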
The key challenge of generative Visual Dialogue (VD) systems is to respond to human queries with informative answers in a natural and continuous conversation flow. Traditional Maximum Likelihood Estimation (MLE)-based methods only learn from positive responses but ignore the negative responses, and consequently tend to yield safe or generic responses. To address this issue, we propose a novel training scheme in conjunction with a weighted likelihood estimation (WLE) method. Furthermore, an adaptive multi-modal reasoning module is designed to accommodate various dialogue scenarios automatically and select relevant information accordingly. The experimental results on the VisDial benchmark demonstrate the superiority of our proposed algorithm over other state-of-the-art approaches, with an improvement of 5.81% on recall@10.
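To make the contrast with MLE concrete, a simplified weighted-likelihood objective might reward likelihood on the positive response while penalizing likelihood on sampled negatives; the specific weighting below is an illustrative stand-in, not the paper's scheme:

```python
# Sketch: likelihood objective that also pushes down negative responses.
import torch

def weighted_likelihood_loss(pos_logprob, neg_logprobs, neg_weight=0.5):
    # pos_logprob: scalar log p(positive answer | question, image, history)
    # neg_logprobs: (n,) log-likelihoods of sampled negative answers
    return -pos_logprob + neg_weight * neg_logprobs.mean()

loss = weighted_likelihood_loss(torch.tensor(-2.3),
                                torch.tensor([-1.1, -4.0, -0.7]))
print(float(loss))
```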
We propose Machines Talking To Machines (M2M), a framework combining automation and crowdsourcing to rapidly bootstrap end-to-end dialogue agents for goal-oriented dialogues in arbitrary domains. M2M scales to new tasks with just a task schema and an API client from the dialogue system developer, but it is also customizable to cater to task-specific interactions. Compared to the Wizard-of-Oz approach for data collection, M2M achieves greater diversity and coverage of salient dialogue flows while maintaining the naturalness of individual utterances. In the first phase, a simulated user bot and a domain-agnostic system bot converse to exhaustively generate dialogue "outlines", i.e. sequences of template utterances and their semantic parses. In the second phase, crowd workers provide contextual rewrites of the dialogues to make the utterances more natural while preserving their meaning. The entire process can finish within a few hours. We present a new corpus of 3,000 dialogues spanning two domains collected with M2M, and compare it with popular dialogue datasets on the quality and diversity of the surface forms and dialogue flows.
Deep neural networks (DNNs) often suffer from "catastrophic forgetting" during incremental learning (IL): an abrupt degradation of performance on the original set of classes when the training objective is adapted to a newly added set of classes. Existing IL approaches tend to produce a model that is biased towards either the old classes or the new classes, unless aided by exemplars of the old data. To address this issue, we propose a class-incremental learning paradigm called Deep Model Consolidation (DMC), which works well even when the original training data is not available. The idea is to first train a separate model only for the new classes, and then combine the two individual models trained on data of two distinct sets of classes (old classes and new classes) via a novel dual distillation training objective. The two existing models are consolidated by exploiting publicly available unlabeled auxiliary data. This overcomes the potential difficulties due to the unavailability of the original training data. Compared to the state-of-the-art techniques, DMC demonstrates significantly better performance on the CIFAR-100 image classification and PASCAL VOC 2007 object detection benchmarks in the single-headed IL setting.
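A hedged sketch of the dual distillation objective on unlabeled auxiliary data: the consolidated student regresses the old model's logits on old classes and the new model's logits on new classes, with per-teacher centering as a stand-in for the paper's normalization:

```python
# Sketch: dual distillation from two teachers into one consolidated student.
import torch
import torch.nn.functional as F

def dual_distillation_loss(student_logits, old_logits, new_logits):
    # center each teacher's logits so the two are on a comparable scale
    old_t = old_logits - old_logits.mean(dim=1, keepdim=True)
    new_t = new_logits - new_logits.mean(dim=1, keepdim=True)
    target = torch.cat([old_t, new_t], dim=1)   # (B, n_old + n_new)
    return F.mse_loss(student_logits, target)   # regress the combined logits

loss = dual_distillation_loss(torch.randn(8, 30),   # student: 30 classes
                              torch.randn(8, 20),   # old teacher: 20 classes
                              torch.randn(8, 10))   # new teacher: 10 classes
print(float(loss))
```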
In this paper, we present an end-to-end training framework for building state-of-the-art end-to-end speech recognition systems. Our training system utilizes a cluster of Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Data reading, large-scale data augmentation, and neural network parameter updates are all performed "on-the-fly". We use vocal tract length perturbation and an acoustic simulator for data augmentation. The processed features and labels are sent to the GPU cluster, where the Horovod allreduce approach is employed to train the neural network parameters. We evaluated the effectiveness of our system on the standard LibriSpeech corpus and the 10,000-hour anonymized Bixby English dataset. Our end-to-end speech recognition system built using this training infrastructure showed a 2.44% WER on the LibriSpeech test-clean set after applying shallow fusion with a Transformer language model (LM). For the proprietary English Bixby open domain test set, we obtained a WER of 7.92% using a Bidirectional Full Attention (BFA) end-to-end model after applying shallow fusion with an RNN-LM. When the monotonic chunkwise attention (MoChA) based approach is employed for streaming speech recognition, we obtained a WER of 9.95% on the same Bixby open domain test set.
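The Horovod data-parallel pattern referenced above looks roughly like this in PyTorch; the model and synthetic features are placeholders, not the paper's recognizer, and the snippet assumes Horovod is installed with one GPU per process:

```python
# Sketch: Horovod allreduce training loop (placeholder model and data).
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())          # one GPU per process

model = torch.nn.Linear(80, 512).cuda()          # stand-in acoustic model
opt = torch.optim.Adam(model.parameters(), lr=1e-3 * hvd.size())
opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)  # sync init weights

for step in range(100):
    feats = torch.randn(32, 80).cuda()           # stand-in on-the-fly features
    loss = model(feats).pow(2).mean()            # placeholder loss
    opt.zero_grad()
    loss.backward()                              # gradients allreduced here
    opt.step()
```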
We report on a project to use a theorem prover to find proofs of the theorems in Tarskian geometry. These theorems start with fundamental properties of betweenness, proceed through the derivations of several famous theorems due to Gupta, and end with the derivation from Tarski's axioms of Hilbert's 1899 axioms for geometry. They include the four challenge problems left unsolved by Quaife, who two decades ago found some OTTER proofs in Tarskian geometry (solving challenges issued in Wos's 1998 book). There are 212 theorems in this collection. We were able to find OTTER proofs of all of them. We developed a methodology for the automated preparation and checking of the input files for those theorems, to ensure that no human error has corrupted the formal development of an entire theory as embodied in two hundred input files and proofs. We distinguish between proofs that were found completely mechanically (without reference to the steps of a book proof) and proofs that were constructed by some technique that involved a human knowing the steps of a book proof. Proofs of length 40-100, roughly speaking, are difficult exercises for a human, and proofs of 100-250 steps belong in a Ph.D. thesis or publication. 29 of the proofs in our collection are longer than 40 steps, and ten are longer than 90 steps. We were able to derive completely mechanically all but 26 of the 183 theorems that have "short" proofs (40 or fewer deduction steps). We found proofs of the rest, as well as the 29 "hard" theorems, using a method that requires consulting the book proof at the outset. Our "subformula strategy" enabled us to prove four of the 29 hard theorems completely mechanically. These are Ph.D.-level proofs, of length up to 108.
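For flavor, two of Tarski's axioms in the usual notation, where B(x, y, z) means y lies between x and z and xy \equiv zu denotes segment congruence; these are standard statements, though axiom numbering varies by presentation:

```latex
% Identity axiom for betweenness: a point between x and itself is x.
\forall x\, y\; \bigl( B(x, y, x) \rightarrow x = y \bigr)
% Reflexivity of equidistance: segment xy is congruent to segment yx.
\forall x\, y\; \bigl( xy \equiv yx \bigr)
```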
When data analysts train a classifier and check if its accuracy is significantly different from random guessing, they are implicitly and indirectly performing a hypothesis test (two-sample testing), and it is important to ask whether this indirect method of testing is statistically optimal. Given that hypothesis tests attempt to maximize statistical power subject to a bound on the allowable false positive rate, while prediction attempts to minimize statistical risk on future predictions on unseen data, we wish to study whether a predictive approach for the ultimate aim of testing is prudent. We formalize this problem by considering the two-sample mean-testing setting, where one must determine if the means of two Gaussians (with known and equal covariance) are the same or not, but the analyst does so indirectly by checking whether the accuracy achieved by Fisher's LDA classifier is significantly different from chance. Unexpectedly, we find that the asymptotic power of LDA's sample-splitting classification accuracy is actually minimax rate-optimal in terms of problem-dependent parameters. Since prediction is commonly thought to be harder than testing, it might come as a surprise to some that solving a harder problem does not create an information-theoretic bottleneck for the easier one. On the flip side, even though the power is rate-optimal, our derivation suggests that it may be worse by a small constant factor; hence practitioners must be wary of using (admittedly flexible) prediction methods on disguised testing problems.
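The indirect procedure studied here can be reproduced in a few lines: split the sample, fit Fisher's LDA on one half, and run a binomial test of held-out accuracy against chance. The mean shift of 0.3 in the synthetic data is an arbitrary illustration:

```python
# Sketch: two-sample testing via sample-splitting LDA classification accuracy.
import numpy as np
from scipy.stats import binomtest
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, d = 400, 5
X = np.vstack([rng.normal(0.0, 1, (n, d)),   # sample 1: mean 0
               rng.normal(0.3, 1, (n, d))])  # sample 2: shifted mean
y = np.repeat([0, 1], n)
idx = rng.permutation(2 * n)
train, test = idx[:n], idx[n:]               # sample splitting

clf = LinearDiscriminantAnalysis().fit(X[train], y[train])
correct = int((clf.predict(X[test]) == y[test]).sum())
print(binomtest(correct, len(test), p=0.5).pvalue)  # small p => means differ
```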