For building question answering systems and natural language interfaces, semantic parsing has emerged as an important and powerful paradigm. Semantic parsers map natural language into logical forms, the classic representation for many important linguistic phenomena. The modern twist is that we are interested in learning semantic parsers from data, which introduces a new layer of statistical and computational issues. This article lays out the components of a statistical semantic parser, highlighting the key challenges. We will see that semantic parsing is a rich fusion of the logical and the statistical world, and that this fusion will play an integral role in the future of natural language understanding systems.

* Accepted to the Communications of the ACM
Click to Read Paper
This short note presents a new formal language, lambda dependency-based compositional semantics (lambda DCS) for representing logical forms in semantic parsing. By eliminating variables and making existential quantification implicit, lambda DCS logical forms are generally more compact than those in lambda calculus.

Click to Read Paper
We tackle the problem of generating a pun sentence given a pair of homophones (e.g., "died" and "dyed"). Supervised text generation is inappropriate due to the lack of a large corpus of puns, and even if such a corpus existed, mimicry is at odds with generating novel content. In this paper, we propose an unsupervised approach to pun generation using a corpus of unhumorous text and what we call the local-global surprisal principle: we posit that in a pun sentence, there is a strong association between the pun word (e.g., "dyed") and the distant context, as well as a strong association between the alternative word (e.g., "died") and the immediate context. This contrast creates surprise and thus humor. We instantiate this principle for pun generation in two ways: (i) as a measure based on the ratio of probabilities under a language model, and (ii) a retrieve-and-edit approach based on words suggested by a skip-gram model. Human evaluation shows that our retrieve-and-edit approach generates puns successfully 31% of the time, tripling the success rate of a neural generation baseline.

* NAACL 2019
Click to Read Paper
Adversarial perturbations dramatically decrease the accuracy of state-of-the-art image classifiers. In this paper, we propose and analyze a simple and computationally efficient defense strategy: inject random Gaussian noise, discretize each pixel, and then feed the result into any pre-trained classifier. Theoretically, we show that our randomized discretization strategy reduces the KL divergence between original and adversarial inputs, leading to a lower bound on the classification accuracy of any classifier against any (potentially whitebox) $\ell_\infty$-bounded adversarial attack. Empirically, we evaluate our defense on adversarial examples generated by a strong iterative PGD attack. On ImageNet, our defense is more robust than adversarially-trained networks and the winning defenses of the NIPS 2017 Adversarial Attacks & Defenses competition.

* In proceedings of the 22nd International Conference on Artificial Intelligence and Statistics
Click to Read Paper
Uncertainty sampling, a popular active learning algorithm, is used to reduce the amount of data required to learn a classifier, but it has been observed in practice to converge to different parameters depending on the initialization and sometimes to even better parameters than standard training on all the data. In this work, we give a theoretical explanation of this phenomenon, showing that uncertainty sampling on a convex loss can be interpreted as performing a preconditioned stochastic gradient step on a smoothed version of the population zero-one loss that converges to the population zero-one loss. Furthermore, uncertainty sampling moves in a descent direction and converges to stationary points of the smoothed population zero-one loss. Experiments on synthetic and real datasets support this connection.

* NeurIPS 2018
Click to Read Paper
While active learning offers potential cost savings, the actual data efficiency---the reduction in amount of labeled data needed to obtain the same error rate---observed in practice is mixed. This paper poses a basic question: when is active learning actually helpful? We provide an answer for logistic regression with the popular active learning algorithm, uncertainty sampling. Empirically, on 21 datasets from OpenML, we find a strong inverse correlation between data efficiency and the error rate of the final classifier. Theoretically, we show that for a variant of uncertainty sampling, the asymptotic data efficiency is within a constant factor of the inverse error rate of the limiting classifier.

Click to Read Paper
In sequential hypothesis testing, Generalized Binary Search (GBS) greedily chooses the test with the highest information gain at each step. It is known that GBS obtains the gold standard query cost of $O(\log n)$ for problems satisfying the $k$-neighborly condition, which requires any two tests to be connected by a sequence of tests where neighboring tests disagree on at most $k$ hypotheses. In this paper, we introduce a weaker condition, split-neighborly, which requires that for the set of hypotheses two neighbors disagree on, any subset is splittable by some test. For four problems that are not $k$-neighborly for any constant $k$, we prove that they are split-neighborly, which allows us to obtain the optimal $O(\log n)$ worst-case query cost.

* AISTATS 2018
Click to Read Paper
Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of $75\%$ F1 score to $36\%$; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to $7\%$. We hope our insights will motivate the development of new models that understand language more precisely.

* EMNLP 2017
Click to Read Paper
A core problem in learning semantic parsers from denotations is picking out consistent logical forms--those that yield the correct denotation--from a combinatorially large space. To control the search space, previous work relied on restricted set of rules, which limits expressivity. In this paper, we consider a much more expressive class of logical forms, and show how to use dynamic programming to efficiently represent the complete set of consistent logical forms. Expressivity also introduces many more spurious logical forms which are consistent with the correct denotation but do not represent the meaning of the utterance. To address this, we generate fictitious worlds and use crowdsourced denotations on these worlds to filter out spurious logical forms. On the WikiTableQuestions dataset, we increase the coverage of answerable questions from 53.5% to 76%, and the additional crowdsourced supervision lets us rule out 92.1% of spurious logical forms.

* Published at the Association for Computational Linguistics (ACL) conference, 2016
Click to Read Paper
We show how to estimate a model's test error from unlabeled data, on distributions very different from the training distribution, while assuming only that certain conditional independencies are preserved between train and test. We do not need to assume that the optimal predictor is the same between train and test, or that the true distribution lies in any parametric family. We can also efficiently differentiate the error estimate to perform unsupervised discriminative learning. Our technical tool is the method of moments, which allows us to exploit conditional independencies in the absence of a fully-specified model. Our framework encompasses a large family of losses including the log and exponential loss, and extends to structured output settings such as hidden Markov models.

* 15 pages
Click to Read Paper
Modeling crisp logical regularities is crucial in semantic parsing, making it difficult for neural models with no task-specific prior knowledge to achieve good results. In this paper, we introduce data recombination, a novel framework for injecting such prior knowledge into a model. From the training data, we induce a high-precision synchronous context-free grammar, which captures important conditional independence properties commonly found in semantic parsing. We then train a sequence-to-sequence recurrent network (RNN) model with a novel attention-based copying mechanism on datapoints sampled from this grammar, thereby teaching the model about these structural properties. Data recombination improves the accuracy of our RNN model on three semantic parsing datasets, leading to new state-of-the-art performance on the standard GeoQuery dataset for models with comparable supervision.

* ACL 2016
Click to Read Paper
Two important aspects of semantic parsing for question answering are the breadth of the knowledge source and the depth of logical compositionality. While existing work trades off one aspect for another, this paper simultaneously makes progress on both fronts through a new task: answering complex questions on semi-structured tables using question-answer pairs as supervision. The central challenge arises from two compounding factors: the broader domain results in an open-ended set of relations, and the deeper compositionality results in a combinatorial explosion in the space of logical forms. We propose a logical-form driven parsing algorithm guided by strong typing constraints and show that it obtains significant improvements over natural baselines. For evaluation, we created a new dataset of 22,033 complex questions on Wikipedia tables, which is made publicly available.

Click to Read Paper
Markov Chain Monte Carlo (MCMC) algorithms are often used for approximate inference inside learning, but their slow mixing can be difficult to diagnose and the approximations can seriously degrade learning. To alleviate these issues, we define a new model family using strong Doeblin Markov chains, whose mixing times can be precisely controlled by a parameter. We also develop an algorithm to learn such models, which involves maximizing the data likelihood under the induced stationary distribution of these chains. We show empirical improvements on two challenging inference tasks.

Click to Read Paper
A classic tension exists between exact inference in a simple model and approximate inference in a complex model. The latter offers expressivity and thus accuracy, but the former provides coverage of the space, an important property for confidence estimation and learning with indirect supervision. In this work, we introduce a new approach, reified context models, to reconcile this tension. Specifically, we let the amount of context (the arity of the factors in a graphical model) be chosen "at run-time" by reifying it---that is, letting this choice itself be a random variable inside the model. Empirically, we show that our approach obtains expressivity and coverage on three natural language tasks.

Click to Read Paper
How can we explain the predictions of a black-box model? In this paper, we use influence functions -- a classic technique from robust statistics -- to trace a model's prediction through the learning algorithm and back to its training data, thereby identifying training points most responsible for a given prediction. To scale up influence functions to modern machine learning settings, we develop a simple, efficient implementation that requires only oracle access to gradients and Hessian-vector products. We show that even on non-convex and non-differentiable models where the theory breaks down, approximations to influence functions can still provide valuable information. On linear models and convolutional neural networks, we demonstrate that influence functions are useful for multiple purposes: understanding model behavior, debugging models, detecting dataset errors, and even creating visually-indistinguishable training-set attacks.

* International Conference on Machine Learning, 2017
Click to Read Paper
How much is 131 million US dollars? To help readers put such numbers in context, we propose a new task of automatically generating short descriptions known as perspectives, e.g. "$131 million is about the cost to employ everyone in Texas over a lunch period". First, we collect a dataset of numeric mentions in news articles, where each mention is labeled with a set of rated perspectives. We then propose a system to generate these descriptions consisting of two steps: formula construction and description generation. In construction, we compose formulae from numeric facts in a knowledge base and rank the resulting formulas based on familiarity, numeric proximity and semantic compatibility. In generation, we convert a formula into natural language using a sequence-to-sequence recurrent neural network. Our system obtains a 15.2% F1 improvement over a non-compositional baseline at formula construction and a 12.5 BLEU point improvement over a baseline description generation.

* ACL (2016), 578-587
Click to Read Paper
Discriminative latent-variable models are typically learned using EM or gradient-based optimization, which suffer from local optima. In this paper, we develop a new computationally efficient and provably consistent estimator for a mixture of linear regressions, a simple instance of a discriminative latent-variable model. Our approach relies on a low-rank linear regression to recover a symmetric tensor, which can be factorized into the parameters using a tensor power method. We prove rates of convergence for our estimator and provide an empirical evaluation illustrating its strengths relative to local optimization (EM).

* Accepted at ICML 2013. Includes supplementary material
Click to Read Paper
Despite their impressive performance on diverse tasks, neural networks fail catastrophically in the presence of adversarial inputs---imperceptibly but adversarially perturbed versions of natural inputs. We have witnessed an arms race between defenders who attempt to train robust networks and attackers who try to construct adversarial examples. One promise of ending the arms race is developing certified defenses, ones which are provably robust against all attackers in some family. These certified defenses are based on convex relaxations which construct an upper bound on the worst case loss over all attackers in the family. Previous relaxations are loose on networks that are not trained against the respective relaxation. In this paper, we propose a new semidefinite relaxation for certifying robustness that applies to arbitrary ReLU networks. We show that our proposed relaxation is tighter than previous relaxations and produces meaningful robustness guarantees on three different "foreign networks" whose training objectives are agnostic to our proposed relaxation.

* To appear at NIPS 2018
Click to Read Paper
Existing datasets for natural language inference (NLI) have propelled research on language understanding. We propose a new method for automatically deriving NLI datasets from the growing abundance of large-scale question answering datasets. Our approach hinges on learning a sentence transformation model which converts question-answer pairs into their declarative forms. Despite being primarily trained on a single QA dataset, we show that it can be successfully applied to a variety of other QA resources. Using this system, we automatically derive a new freely available dataset of over 500k NLI examples (QA-NLI), and show that it exhibits a wide range of inference phenomena rarely seen in previous NLI datasets.

* 11 pages, 6 figures
Click to Read Paper
We consider negotiation settings in which two agents use natural language to bargain on goods. Agents need to decide on both high-level strategy (e.g., proposing \$50) and the execution of that strategy (e.g., generating "The bike is brand new. Selling for just \$50."). Recent work on negotiation trains neural models, but their end-to-end nature makes it hard to control their strategy, and reinforcement learning tends to lead to degenerate solutions. In this paper, we propose a modular approach based on coarse di- alogue acts (e.g., propose(price=50)) that decouples strategy and generation. We show that we can flexibly set the strategy using supervised learning, reinforcement learning, or domain-specific knowledge without degeneracy, while our retrieval-based generation can maintain context-awareness and produce diverse utterances. We test our approach on the recently proposed DEALORNODEAL game, and we also collect a richer dataset based on real items on Craigslist. Human evaluation shows that our systems achieve higher task success rate and more human-like negotiation behavior than previous approaches.

* EMNLP 2018
Click to Read Paper