Research papers and code for "Jingjing Liu":
Automatic evaluation of semantic rationality is an important yet challenging task, and current automatic techniques cannot well identify whether a sentence is semantically rational. The methods based on the language model do not measure the sentence by rationality but by commonness. The methods based on the similarity with human written sentences will fail if human-written references are not available. In this paper, we propose a novel model called Sememe-Word-Matching Neural Network (SWM-NN) to tackle semantic rationality evaluation by taking advantage of sememe knowledge base HowNet. The advantage is that our model can utilize a proper combination of sememes to represent the fine-grained semantic meanings of a word within the specific contexts. We use the fine-grained semantic representation to help the model learn the semantic dependency among words. To evaluate the effectiveness of the proposed model, we build a large-scale rationality evaluation dataset. Experimental results on this dataset show that the proposed model outperforms the competitive baselines with a 5.4\% improvement in accuracy.

Click to Read Paper and Get Code
Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: ($i$) PKD-Last: learning from the last $k$ layers; and ($ii$) PKD-Skip: learning from every $k$ layers. These two patient distillation schemes enable the exploitation of rich information in the teacher's hidden layers, and encourage the student model to patiently learn from and imitate the teacher through a multi-layer distillation process. Empirically, this translates into improved results on multiple NLP tasks with significant gain in training efficiency, without sacrificing model accuracy.

* Accepted to EMNLP 2019
Click to Read Paper and Get Code
In order to answer semantically-complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism, to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations that represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations that capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible to existing VQA architectures, and can be used as a generic relation encoder to boost the model performance for VQA.

Click to Read Paper and Get Code
To protect image contents, most existing encryption algorithms are designed to transform an original image into a texture-like or noise-like image, which is, however, an obvious visual sign indicating the presence of an encrypted image, results in a significantly large number of attacks. To solve this problem, in this paper, we propose a new image encryption method to generate a visually same image as the original one by sending a meaning-normal and independent image to a corresponding well-trained generative model to achieve the effect of disguising the original image. This image disguise method not only solves the problem of obvious visual implication, but also guarantees the security of the information.

* 4 pages,9 figures
Click to Read Paper and Get Code
We propose a multi-task learning framework to jointly train a Machine Reading Comprehension (MRC) model on multiple datasets across different domains. Key to the proposed method is to learn robust and general contextual representations with the help of out-domain data in a multi-task framework. Empirical study shows that the proposed approach is orthogonal to the existing pre-trained representation models, such as word embedding and language models. Experiments on the Stanford Question Answering Dataset (SQuAD), the Microsoft MAchine Reading COmprehension Dataset (MS MARCO), NewsQA and other datasets show that our multi-task learning approach achieves significant improvement over state-of-the-art models in most MRC tasks.

* 9 pages, 2 figures, 7 tables
Click to Read Paper and Get Code
We investigate the problem of Language-Based Image Editing (LBIE). Given a source image and a natural language description, we want to generate a target image by editing the source image based on the description. We propose a generic modeling framework for two sub-tasks of LBIE: language-based image segmentation and image colorization. The framework uses recurrent attentive models to fuse image and language features. Instead of using a fixed step size, we introduce for each region of the image a termination gate to dynamically determine after each inference step whether to continue extrapolating additional information from the textual description. The effectiveness of the framework is validated on three datasets. First, we introduce a synthetic dataset, called CoSaL, to evaluate the end-to-end performance of our LBIE system. Second, we show that the framework leads to state-of-the-art performance on image segmentation on the ReferIt dataset. Third, we present the first language-based colorization result on the Oxford-102 Flowers dataset.

* Accepted to CVPR 2018 as a Spotlight
Click to Read Paper and Get Code
This paper presents a novel neural model - Dynamic Fusion Network (DFN), for machine reading comprehension (MRC). DFNs differ from most state-of-the-art models in their use of a dynamic multi-strategy attention process, in which passages, questions and answer candidates are jointly fused into attention vectors, along with a dynamic multi-step reasoning module for generating answers. With the use of reinforcement learning, for each input sample that consists of a question, a passage and a list of candidate answers, an instance of DFN with a sample-specific network architecture can be dynamically constructed by determining what attention strategy to apply and how many reasoning steps to take. Experiments show that DFNs achieve the best result reported on RACE, a challenging MRC dataset that contains real human reading questions in a wide variety of types. A detailed empirical analysis also demonstrates that DFNs can produce attention vectors that summarize information from questions, passages and answer candidates more effectively than other popular MRC models.

* 13 pages, 5 figures, 5 tables
Click to Read Paper and Get Code
Multispectral pedestrian detection is essential for around-the-clock applications, e.g., surveillance and autonomous driving. We deeply analyze Faster R-CNN for multispectral pedestrian detection task and then model it into a convolutional network (ConvNet) fusion problem. Further, we discover that ConvNet-based pedestrian detectors trained by color or thermal images separately provide complementary information in discriminating human instances. Thus there is a large potential to improve pedestrian detection by using color and thermal images in DNNs simultaneously. We carefully design four ConvNet fusion architectures that integrate two-branch ConvNets on different DNNs stages, all of which yield better performance compared with the baseline detector. Our experimental results on KAIST pedestrian benchmark show that the Halfway Fusion model that performs fusion on the middle-level convolutional features outperforms the baseline method by 11% and yields a missing rate 3.5% lower than the other proposed architectures.

* 13 pages, 8 figures, BMVC 2016 oral
Click to Read Paper and Get Code
In this paper, we focus on unsupervised domain adaptation for Machine Reading Comprehension (MRC), where the source domain has a large amount of labeled data, while only unlabeled passages are available in the target domain. To this end, we propose an Adversarial Domain Adaptation framework (AdaMRC), where ($i$) pseudo questions are first generated for unlabeled passages in the target domain, and then ($ii$) a domain classifier is incorporated into an MRC model to predict which domain a given passage-question pair comes from. The classifier and the passage-question encoder are jointly trained using adversarial learning to enforce domain-invariant representation learning. Comprehensive evaluations demonstrate that our approach ($i$) is generalizable to different MRC models and datasets, ($ii$) can be combined with pre-trained large-scale language models (such as ELMo and BERT), and ($iii$) can be extended to semi-supervised learning.

* Accepted to EMNLP 2019
Click to Read Paper and Get Code
Graph based clustering is one of the major clustering methods. Most of it work in three separate steps: similarity graph construction, clustering label relaxing and label discretization with k-means. Such common practice has three disadvantages: 1) the predefined similarity graph is often fixed and may not be optimal for the subsequent clustering. 2) the relaxing process of cluster labels may cause significant information loss. 3) label discretization may deviate from the real clustering result since k-means is sensitive to the initialization of cluster centroids. To tackle these problems, in this paper, we propose an effective discrete optimal graph clustering (DOGC) framework. A structured similarity graph that is theoretically optimal for clustering performance is adaptively learned with a guidance of reasonable rank constraint. Besides, to avoid the information loss, we explicitly enforce a discrete transformation on the intermediate continuous label, which derives a tractable optimization problem with discrete solution. Further, to compensate the unreliability of the learned labels and enhance the clustering accuracy, we design an adaptive robust module that learns prediction function for the unseen data based on the learned discrete cluster labels. Finally, an iterative optimization strategy guaranteed with convergence is developed to directly solve the clustering results. Extensive experiments conducted on both real and synthetic datasets demonstrate the superiority of our proposed methods compared with several state-of-the-art clustering approaches.

Click to Read Paper and Get Code
In this paper, we introduce a new task - interactive image editing via conversational language, where users can guide an agent to edit images via multi-turn dialogue in natural language. In each dialogue turn, the agent takes a source image and a natural language description from the user as the input, and generates a target image following the textual description. Two new datasets are created for this task,Zap-Seq and DeepFashion-Seq, collected via crowdsourcing. For this task, we propose a new Sequential Attention Genrative Adversarial Network (SeqAttnGAN) framework, which applies a neural state tracker to encode both source image and textual descriptions, and generates high quality images in each dialogue turn. To achieve better region specific text-to-image generation, we also introducean attention mechanism into the model. Experiments on the two datasets, including quantitative evaluation and user study, show that our model outperforms state-of-the-art ap-proaches in both image quality and text-to-image consistency.

Click to Read Paper and Get Code
Attribute recognition has become crucial because of its wide applications in many computer vision tasks, such as person re-identification. Like many object recognition problems, variations in viewpoints, illumination, and recognition at far distance, all make this task challenging. In this work, we propose a joint CTC-Attention model (JCM), which maps attribute labels into sequences to learn the semantic relationship among attributes. Besides, this network uses neural network to encode images into sequences, and employs connectionist temporal classification (CTC) loss to train the network with the aim of improving the encoding performance of the network. At the same time, it adopts the attention model to decode the sequences, which can realize aligning the sequences and better learning the semantic information from attributes. Extensive experiments on three public datasets, i.e., Market-1501 attribute dataset, Duke attribute dataset and PETA dataset, demonstrate the effectiveness of the proposed method.

Click to Read Paper and Get Code
Training task-completion dialogue agents with reinforcement learning usually requires a large number of real user experiences. The Dyna-Q algorithm extends Q-learning by integrating a world model, and thus can effectively boost training efficiency using simulated experiences generated by the world model. The effectiveness of Dyna-Q, however, depends on the quality of the world model - or implicitly, the pre-specified ratio of real vs. simulated experiences used for Q-learning. To this end, we extend the recently proposed Deep Dyna-Q (DDQ) framework by integrating a switcher that automatically determines whether to use a real or simulated experience for Q-learning. Furthermore, we explore the use of active learning for improving sample efficiency, by encouraging the world model to generate simulated experiences in the state-action space where the agent has not (fully) explored. Our results show that by combining switcher and active learning, the new framework named as Switch-based Active Deep Dyna-Q (Switch-DDQ), leads to significant improvement over DDQ and Q-learning baselines in both simulation and human evaluations.

* 8 pages, 9 figures, AAAI 2019
Click to Read Paper and Get Code
Recent unsupervised approaches to domain adaptation primarily focus on minimizing the gap between the source and the target domains through refining the feature generator, in order to learn a better alignment between the two domains. This minimization can be achieved via a domain classifier to detect target-domain features that are divergent from source-domain features. However, by optimizing via such domain classification discrepancy, ambiguous target samples that are not smoothly distributed on the low-dimensional data manifold are often missed. To solve this issue, we propose a novel Contrastively Smoothed Class Alignment (CoSCA) model, that explicitly incorporates both intra- and inter-class domain discrepancy to better align ambiguous target samples with the source domain. CoSCA estimates the underlying label hypothesis of target samples, and simultaneously adapts their feature representations by optimizing a proposed contrastive loss. In addition, Maximum Mean Discrepancy (MMD) is utilized to directly match features between source and target samples for better global alignment. Experiments on several benchmark datasets demonstrate that CoSCA can outperform state-of-the-art approaches for unsupervised domain adaptation by producing more discriminative features.

Click to Read Paper and Get Code
Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE and CIDEr. In this paper, we re-examine this problem from a different angle, by looking deep into what defines a realistically-natural and topically-coherent story. To this end, we propose three assessment criteria: relevance, coherence and expressiveness, which we observe through empirical analysis could constitute a "high-quality" story to the human eye. Following this quality guideline, we propose a reinforcement learning framework, ReCo-RL, with reward functions designed to capture the essence of these quality criteria. Experiments on the Visual Storytelling Dataset (VIST) with both automatic and human evaluations demonstrate that our ReCo-RL model achieves better performance than state-of-the-art baselines on both traditional metrics and the proposed new criteria.

Click to Read Paper and Get Code
We present a large-scale dataset, ReCoRD, for machine reading comprehension requiring commonsense reasoning. Experiments on this dataset demonstrate that the performance of state-of-the-art MRC systems fall far behind human performance. ReCoRD represents a challenge for future research to bridge the gap between human and machine commonsense reading comprehension. ReCoRD is available at http://nlp.jhu.edu/record.

* 14 pages
Click to Read Paper and Get Code
This proposal introduces a Dialogue Challenge for building end-to-end task-completion dialogue systems, with the goal of encouraging the dialogue research community to collaborate and benchmark on standard datasets and unified experimental environment. In this special session, we will release human-annotated conversational data in three domains (movie-ticket booking, restaurant reservation, and taxi booking), as well as an experiment platform with built-in simulators in each domain, for training and evaluation purposes. The final submitted systems will be evaluated both in simulated setting and by human judges.

* SLT 2018 Special Session: http://www.slt2018.org/news/
Click to Read Paper and Get Code
This paper presents a Discriminative Deep Dyna-Q (D3Q) approach to improving the effectiveness and robustness of Deep Dyna-Q (DDQ), a recently proposed framework that extends the Dyna-Q algorithm to integrate planning for task-completion dialogue policy learning. To obviate DDQ's high dependency on the quality of simulated experiences, we incorporate an RNN-based discriminator in D3Q to differentiate simulated experience from real user experience in order to control the quality of training data. Experiments show that D3Q significantly outperforms DDQ by controlling the quality of simulated experience used for planning. The effectiveness and robustness of D3Q is further demonstrated in a domain extension setting, where the agent's capability of adapting to a changing environment is tested.

* 11 pages, 10 figures, EMNLP 2018 long paper
Click to Read Paper and Get Code
Commonsense reasoning is fundamental to natural language understanding. While traditional methods rely heavily on human-crafted features and knowledge bases, we explore learning commonsense knowledge from a large amount of raw text via unsupervised learning. We propose two neural network models based on the Deep Structured Semantic Models (DSSM) framework to tackle two classic commonsense reasoning tasks, Winograd Schema challenges (WSC) and Pronoun Disambiguation (PDP). Evaluation shows that the proposed models effectively capture contextual information in the sentence and co-reference information between pronouns and nouns, and achieve significant improvement over previous state-of-the-art approaches.

* To appear in NAACL 2019, 10 pages
Click to Read Paper and Get Code
This paper presents Recurrent Dual Attention Network (ReDAN) for visual dialog, using multi-step reasoning to answer a series of questions about an image. In each turn of the dialog, ReDAN infers answers progressively through multiple steps. In each step, a recurrently-updated semantic representation of the (refined) query is used for iterative reasoning over both the image and previous dialog history. Experimental results on VisDial v1.0 dataset show that the proposed ReDAN model outperforms prior state-of-the-art approaches across multiple evaluation metrics. Visualization on the iterative reasoning process further demonstrates that ReDAN can locate context-relevant visual and textual clues leading to the correct answers step-by-step.

Click to Read Paper and Get Code