Research papers and code for "Xiaodong Liu":
In this paper, we present a Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks. MT-DNN not only leverages large amounts of cross-task data, but also benefits from a regularization effect that leads to more general representations in order to adapt to new tasks and domains. MT-DNN extends the model proposed in Liu et al. (2015) by incorporating a pre-trained bidirectional transformer language model, known as BERT (Devlin et al., 2018). MT-DNN obtains new state-of-the-art results on ten NLU tasks, including SNLI, SciTail, and eight out of nine GLUE tasks, pushing the GLUE benchmark to 82.2% (1.8% absolute improvement). We also demonstrate using the SNLI and SciTail datasets that the representations learned by MT-DNN allow domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations. Our code and pre-trained models will be made publicly available.

* 10 pages, 2 figures and 5 tables
Click to Read Paper and Get Code
This paper explores the use of knowledge distillation to improve a Multi-Task Deep Neural Network (MT-DNN) (Liu et al., 2019) for learning text representations across multiple natural language understanding tasks. Although ensemble learning can improve model performance, serving an ensemble of large DNNs such as MT-DNN can be prohibitively expensive. Here we apply the knowledge distillation method (Hinton et al., 2015) in the multi-task learning setting. For each task, we train an ensemble of different MT-DNNs (teacher) that outperforms any single model, and then train a single MT-DNN (student) via multi-task learning to \emph{distill} knowledge from these ensemble teachers. We show that the distilled MT-DNN significantly outperforms the original MT-DNN on 7 out of 9 GLUE tasks, pushing the GLUE benchmark (single model) to 83.7\% (1.5\% absolute improvement\footnote{ Based on the GLUE leaderboard at https://gluebenchmark.com/leaderboard as of April 1, 2019.}). The code and pre-trained models will be made publicly available at https://github.com/namisan/mt-dnn.

* 8 pages, 2 figures and 3 tables
Click to Read Paper and Get Code
The analysis of nonconvex matrix completion has recently attracted much attention in the community of machine learning thanks to its computational convenience. Existing analysis on this problem, however, usually relies on $\ell_{2,\infty}$ projection or regularization that involves unknown model parameters, although they are observed to be unnecessary in numerical simulations, see, e.g. Zheng and Lafferty [2016]. In this paper, we extend the analysis of the vanilla gradient descent for positive semidefinite matrix completion proposed in Ma et al. [2017] to the rectangular case, and more significantly, improve the required sampling complexity from $\widetilde{O}(r^3)$ to $\widetilde{O}(r^2)$. Our technical ideas and contributions are potentially useful in improving the leave-one-out analysis in other related problems.

Click to Read Paper and Get Code
We propose a stochastic answer network (SAN) to explore multi-step inference strategies in Natural Language Inference. Rather than directly predicting the results given the inputs, the model maintains a state and iteratively refines its predictions. Our experiments show that SAN achieves the state-of-the-art results on three benchmarks: Stanford Natural Language Inference (SNLI) dataset, MultiGenre Natural Language Inference (MultiNLI) dataset and Quora Question Pairs dataset.

* 5 pages, 1 figures
Click to Read Paper and Get Code
This paper proposes a hybrid neural network (HNN) model for commonsense reasoning. An HNN consists of two component models, a masked language model and a semantic similarity model, which share a BERT-based contextual encoder but use different model-specific input and output layers. HNN obtains new state-of-the-art results on three classic commonsense reasoning tasks, pushing the WNLI benchmark to 89%, the Winograd Schema Challenge (WSC) benchmark to 75.1%, and the PDP60 benchmark to 90.0%. An ablation study shows that language models and semantic similarity models are complementary approaches to commonsense reasoning, and HNN effectively combines the strengths of both. The code and pre-trained models will be publicly available at https://github.com/namisan/mt-dnn.

* 9 pages, 3 figures, 6 tables
Click to Read Paper and Get Code
Unmanned aerial vehicles (UAVs) are envisioned to complement the 5G communication infrastructure in future smart cities. Hot spots easily appear in road intersections, where effective communication among vehicles is challenging. UAVs may serve as relays with the advantages of low price, easy deployment, line-of-sight links, and flexible mobility. In this paper, we study a UAV-assisted vehicular network where the UAV jointly adjusts its transmission power and bandwidth allocation under 3D flight to maximize the total throughput. First, we formulate a Markov Decision Process (MDP) problem by modeling the mobility of the UAV/vehicles and the state transitions. Secondly, we solve the target problem using a deep reinforcement learning method, namely, the deep deterministic policy gradient, and propose three solutions with different control objectives. Then we extend the proposed solutions by considering the energy consumption of 3D flight. Thirdly, in a simplified model with small state space and action space, we verify the optimality of proposed algorithms. Comparing with two baseline schemes, we demonstrate the effectiveness of proposed algorithms in a realistic model.

* 13 pages, 13 figures
Click to Read Paper and Get Code
Unmanned aerial vehicles (UAVs) are envisioned to complement the 5G communication infrastructure in future smart cities. Hot spots easily appear in road intersections, where effective communication among vehicles is challenging. UAVs may serve as relays with the advantages of low price, easy deployment, line-of-sight links, and flexible mobility. In this paper, we study a UAV-assisted vehicular network where the UAV jointly adjusts its transmission power and bandwidth allocation under 3D flight to maximize the total throughput. First, we formulate a Markov Decision Process (MDP) problem by modeling the mobility of the UAV/vehicles and the state transitions. Secondly, we solve the target problem using a deep reinforcement learning method, namely, the deep deterministic policy gradient, and propose three solutions with different control objectives. Then we extend the proposed solutions considering of the energy consumption of 3D flight. Thirdly, in a simplified model with small state space and action space, we verify the optimality of proposed algorithms. Comparing with two baseline schemes, we demonstrate the effectiveness of proposed algorithms in a realistic model.

* 12 pages, 13 figures
Click to Read Paper and Get Code
As efficient traffic-management platforms, public vehicle (PV) systems are envisioned to be a promising approach to solving traffic congestions and pollutions for future smart cities. PV systems provide online/dynamic peer-to-peer ride-sharing services with the goal of serving sufficient number of customers with minimum number of vehicles and lowest possible cost. A key component of the PV system is the online ride-sharing scheduling strategy. In this paper, we propose an efficient path planning strategy that focuses on a limited potential search area for each vehicle by filtering out the requests that violate passenger service quality level, so that the global search is reduced to local search. We analyze the performance of the proposed solution such as reduction ratio of computational complexity. Simulations based on the Manhattan taxi data set show that, the computing time is reduced by 22% compared with the exhaustive search method under the same service quality performance.

* 12 pages
Click to Read Paper and Get Code
We propose a simple yet robust stochastic answer network (SAN) that simulates multi-step reasoning in machine reading comprehension. Compared to previous work such as ReasoNet which used reinforcement learning to determine the number of steps, the unique feature is the use of a kind of stochastic prediction dropout on the answer module (final layer) of the neural network during the training. We show that this simple trick improves robustness and achieves results competitive to the state-of-the-art on the Stanford Question Answering Dataset (SQuAD), the Adversarial SQuAD, and the Microsoft MAchine Reading COmprehension Dataset (MS MARCO).

* 11 pages, 5 figures, Accepted to ACL 2018
Click to Read Paper and Get Code
Reading comprehension (RC) is a challenging task that requires synthesis of information across sentences and multiple turns of reasoning. Using a state-of-the-art RC model, we empirically investigate the performance of single-turn and multiple-turn reasoning on the SQuAD and MS MARCO datasets. The RC model is an end-to-end neural network with iterative attention, and uses reinforcement learning to dynamically control the number of turns. We find that multiple-turn reasoning outperforms single-turn reasoning for all question and answer types; further, we observe that enabling a flexible number of turns generally improves upon a fixed multiple-turn strategy. %across all question types, and is particularly beneficial to questions with lengthy, descriptive answers. We achieve results competitive to the state-of-the-art on these two datasets.

Click to Read Paper and Get Code
Without discourse connectives, classifying implicit discourse relations is a challenging task and a bottleneck for building a practical discourse parser. Previous research usually makes use of one kind of discourse framework such as PDTB or RST to improve the classification performance on discourse relations. Actually, under different discourse annotation frameworks, there exist multiple corpora which have internal connections. To exploit the combination of different discourse corpora, we design related discourse classification tasks specific to a corpus, and propose a novel Convolutional Neural Network embedded multi-task learning system to synthesize these tasks by learning both unique and shared representations for each task. The experimental results on the PDTB implicit discourse relation classification task demonstrate that our model achieves significant gains over baseline systems.

* This is the pre-print version of a paper accepted by AAAI-16
Click to Read Paper and Get Code
This paper proposes a new architecture - Attentive Tensor Product Learning (ATPL) - to represent grammatical structures in deep learning models. ATPL is a new architecture to bridge this gap by exploiting Tensor Product Representations (TPR), a structured neural-symbolic model developed in cognitive science, aiming to integrate deep learning with explicit language structures and rules. The key ideas of ATPL are: 1) unsupervised learning of role-unbinding vectors of words via TPR-based deep neural network; 2) employing attention modules to compute TPR; and 3) integration of TPR with typical deep learning architectures including Long Short-Term Memory (LSTM) and Feedforward Neural Network (FFNN). The novelty of our approach lies in its ability to extract the grammatical structure of a sentence by using role-unbinding vectors, which are obtained in an unsupervised manner. This ATPL approach is applied to 1) image captioning, 2) part of speech (POS) tagging, and 3) constituency parsing of a sentence. Experimental results demonstrate the effectiveness of the proposed approach.

Click to Read Paper and Get Code
We propose a multi-task learning framework to jointly train a Machine Reading Comprehension (MRC) model on multiple datasets across different domains. Key to the proposed method is to learn robust and general contextual representations with the help of out-domain data in a multi-task framework. Empirical study shows that the proposed approach is orthogonal to the existing pre-trained representation models, such as word embedding and language models. Experiments on the Stanford Question Answering Dataset (SQuAD), the Microsoft MAchine Reading COmprehension Dataset (MS MARCO), NewsQA and other datasets show that our multi-task learning approach achieves significant improvement over state-of-the-art models in most MRC tasks.

* 9 pages, 2 figures, 7 tables
Click to Read Paper and Get Code
We investigate two crucial and closely related aspects of CNNs for optical flow estimation: models and training. First, we design a compact but effective CNN model, called PWC-Net, according to simple and well-established principles: pyramidal processing, warping, and cost volume processing. PWC-Net is 17 times smaller in size, 2 times faster in inference, and 11\% more accurate on Sintel final than the recent FlowNet2 model. It is the winning entry in the optical flow competition of the robust vision challenge. Next, we experimentally analyze the sources of our performance gains. In particular, we use the same training procedure of PWC-Net to retrain FlowNetC, a sub-network of FlowNet2. The retrained FlowNetC is 56\% more accurate on Sintel final than the previously trained one and even 5\% more accurate than the FlowNet2 model. We further improve the training procedure and increase the accuracy of PWC-Net on Sintel by 10\% and on KITTI 2012 and 2015 by 20\%. Our newly trained model parameters and training protocols will be available on https://github.com/NVlabs/PWC-Net

Click to Read Paper and Get Code
We present a compact but effective CNN model for optical flow, called PWC-Net. PWC-Net has been designed according to simple and well-established principles: pyramidal processing, warping, and the use of a cost volume. Cast in a learnable feature pyramid, PWC-Net uses the cur- rent optical flow estimate to warp the CNN features of the second image. It then uses the warped features and features of the first image to construct a cost volume, which is processed by a CNN to estimate the optical flow. PWC-Net is 17 times smaller in size and easier to train than the recent FlowNet2 model. Moreover, it outperforms all published optical flow methods on the MPI Sintel final pass and KITTI 2015 benchmarks, running at about 35 fps on Sintel resolution (1024x436) images. Our models are available on https://github.com/NVlabs/PWC-Net.

* CVPR 2018 camera ready version (with github link to Caffe and PyTorch code)
Click to Read Paper and Get Code
Neural language models (NLMs) have recently gained a renewed interest by achieving state-of-the-art performance across many natural language processing (NLP) tasks. However, NLMs are very computationally demanding largely due to the computational cost of the softmax layer over a large vocabulary. We observe that, in decoding of many NLP tasks, only the probabilities of the top-K hypotheses need to be calculated preciously and K is often much smaller than the vocabulary size. This paper proposes a novel softmax layer approximation algorithm, called Fast Graph Decoder (FGD), which quickly identifies, for a given context, a set of K words that are most likely to occur according to a NLM. We demonstrate that FGD reduces the decoding time by an order of magnitude while attaining close to the full softmax baseline accuracy on neural machine translation and language modeling tasks. We also prove the theoretical guarantee on the softmax approximation quality.

Click to Read Paper and Get Code
We investigate the problem of Language-Based Image Editing (LBIE). Given a source image and a natural language description, we want to generate a target image by editing the source image based on the description. We propose a generic modeling framework for two sub-tasks of LBIE: language-based image segmentation and image colorization. The framework uses recurrent attentive models to fuse image and language features. Instead of using a fixed step size, we introduce for each region of the image a termination gate to dynamically determine after each inference step whether to continue extrapolating additional information from the textual description. The effectiveness of the framework is validated on three datasets. First, we introduce a synthetic dataset, called CoSaL, to evaluate the end-to-end performance of our LBIE system. Second, we show that the framework leads to state-of-the-art performance on image segmentation on the ReferIt dataset. Third, we present the first language-based colorization result on the Oxford-102 Flowers dataset.

* Accepted to CVPR 2018 as a Spotlight
Click to Read Paper and Get Code