Models, code, and papers for "Jingjing Liu":
Automatic evaluation of semantic rationality is an important yet challenging task, and current automatic techniques cannot reliably identify whether a sentence is semantically rational. Methods based on language models measure a sentence by its commonness rather than its rationality. Methods based on similarity to human-written sentences fail when human-written references are not available. In this paper, we propose a novel model called Sememe-Word-Matching Neural Network (SWM-NN) to tackle semantic rationality evaluation by taking advantage of the sememe knowledge base HowNet. The advantage is that our model can use a proper combination of sememes to represent the fine-grained semantic meanings of a word within its specific context. We use this fine-grained semantic representation to help the model learn the semantic dependency among words. To evaluate the effectiveness of the proposed model, we build a large-scale rationality evaluation dataset. Experimental results on this dataset show that the proposed model outperforms competitive baselines with a 5.4% improvement in accuracy.
Recently BERT has been adopted in state-of-the-art text summarization models for document encoding. However, such BERT-based extractive models use the sentence as the minimal selection unit, which often results in redundant or uninformative phrases in the generated summaries. As BERT is pre-trained on sentence pairs, not documents, the long-range dependencies between sentences are not well captured. To address these issues, we present a graph-based discourse-aware neural summarization model, DiscoBert. By utilizing discourse segmentation to extract discourse units (instead of sentences) as candidates, DiscoBert provides a finer granularity for extractive selection, which helps reduce redundancy in extracted summaries. Based on this, two discourse graphs are further proposed: (i) RST Graph based on RST discourse trees; and (ii) Coreference Graph based on coreference mentions in the document. DiscoBert first encodes the extracted discourse units with BERT, and then uses a graph convolutional network to capture the long-range dependencies among discourse units through the constructed graphs. Experimental results on two popular summarization datasets demonstrate that DiscoBert outperforms state-of-the-art methods by a significant margin.
Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: (i) PKD-Last: learning from the last k layers; and (ii) PKD-Skip: learning from every k layers. These two patient distillation schemes enable the exploitation of rich information in the teacher's hidden layers, and encourage the student model to patiently learn from and imitate the teacher through a multi-layer distillation process. Empirically, this translates into improved results on multiple NLP tasks with significant gain in training efficiency, without sacrificing model accuracy.
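The two layer-selection strategies named above can be sketched as index computations. This is a minimal illustration that follows only the wording of the abstract ("the last k layers", "every k layers"); the function names, the 1-based layer numbering, and the 12-layer teacher in the usage lines are assumptions, and the paper's exact layer choice may differ.

```python
from typing import List


def pkd_last_layers(num_teacher_layers: int, k: int) -> List[int]:
    """Return 1-based indices of the last k teacher layers (hypothetical helper)."""
    if not 0 < k <= num_teacher_layers:
        raise ValueError("k must be between 1 and num_teacher_layers")
    return list(range(num_teacher_layers - k + 1, num_teacher_layers + 1))


def pkd_skip_layers(num_teacher_layers: int, k: int) -> List[int]:
    """Return 1-based indices of every k-th teacher layer (hypothetical helper)."""
    if not 0 < k <= num_teacher_layers:
        raise ValueError("k must be between 1 and num_teacher_layers")
    return list(range(k, num_teacher_layers + 1, k))


# Assumed setting: a 12-layer teacher.
print(pkd_last_layers(12, 5))  # [8, 9, 10, 11, 12]
print(pkd_skip_layers(12, 2))  # [2, 4, 6, 8, 10, 12]
```

The student would then be trained to match the hidden states of the selected teacher layers in addition to the usual output-level distillation.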
In order to answer semantically-complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism, to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations that represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations that capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible with existing VQA architectures, and can be used as a generic relation encoder to boost the model performance for VQA.
To protect image contents, most existing encryption algorithms transform an original image into a texture-like or noise-like image. Such an image, however, is an obvious visual sign that an encrypted image is present, and it therefore attracts a significantly large number of attacks. To solve this problem, we propose a new image encryption method that generates an image visually the same as the original one by sending a meaning-normal and independent image to a corresponding well-trained generative model, thereby disguising the original image. This image disguise method not only solves the problem of the obvious visual indication, but also guarantees the security of the information.
We propose a multi-task learning framework to jointly train a Machine Reading Comprehension (MRC) model on multiple datasets across different domains. Key to the proposed method is to learn robust and general contextual representations with the help of out-domain data in a multi-task framework. Empirical study shows that the proposed approach is orthogonal to the existing pre-trained representation models, such as word embedding and language models. Experiments on the Stanford Question Answering Dataset (SQuAD), the Microsoft MAchine Reading COmprehension Dataset (MS MARCO), NewsQA and other datasets show that our multi-task learning approach achieves significant improvement over state-of-the-art models in most MRC tasks.
We investigate the problem of Language-Based Image Editing (LBIE). Given a source image and a natural language description, we want to generate a target image by editing the source image based on the description. We propose a generic modeling framework for two sub-tasks of LBIE: language-based image segmentation and image colorization. The framework uses recurrent attentive models to fuse image and language features. Instead of using a fixed step size, we introduce for each region of the image a termination gate to dynamically determine after each inference step whether to continue extrapolating additional information from the textual description. The effectiveness of the framework is validated on three datasets. First, we introduce a synthetic dataset, called CoSaL, to evaluate the end-to-end performance of our LBIE system. Second, we show that the framework leads to state-of-the-art performance on image segmentation on the ReferIt dataset. Third, we present the first language-based colorization result on the Oxford-102 Flowers dataset.
This paper presents a novel neural model, Dynamic Fusion Network (DFN), for machine reading comprehension (MRC). DFNs differ from most state-of-the-art models in their use of a dynamic multi-strategy attention process, in which passages, questions and answer candidates are jointly fused into attention vectors, along with a dynamic multi-step reasoning module for generating answers. With the use of reinforcement learning, for each input sample that consists of a question, a passage and a list of candidate answers, an instance of DFN with a sample-specific network architecture can be dynamically constructed by determining what attention strategy to apply and how many reasoning steps to take. Experiments show that DFNs achieve the best result reported on RACE, a challenging MRC dataset that contains real human reading questions in a wide variety of types. A detailed empirical analysis also demonstrates that DFNs can produce attention vectors that summarize information from questions, passages and answer candidates more effectively than other popular MRC models.
Multispectral pedestrian detection is essential for around-the-clock applications, e.g., surveillance and autonomous driving. We analyze Faster R-CNN in depth for the multispectral pedestrian detection task and then model it as a convolutional network (ConvNet) fusion problem. Further, we discover that ConvNet-based pedestrian detectors trained separately on color or thermal images provide complementary information in discriminating human instances. Thus there is a large potential to improve pedestrian detection by using color and thermal images in DNNs simultaneously. We carefully design four ConvNet fusion architectures that integrate two-branch ConvNets at different DNN stages, all of which yield better performance compared with the baseline detector. Our experimental results on the KAIST pedestrian benchmark show that the Halfway Fusion model, which performs fusion on the middle-level convolutional features, outperforms the baseline method by 11% and yields a miss rate 3.5% lower than the other proposed architectures.
Transformer has been successfully applied to many natural language processing tasks. However, for textual sequence matching, simple matching between the representation of a pair of sequences might bring in unnecessary noise. In this paper, we propose a new approach to sequence pair matching with Transformer, by learning head-wise matching representations on multiple levels. Experiments show that our proposed approach can achieve new state-of-the-art performance on multiple tasks that rely only on pre-computed sequence-vector-representation, such as SNLI, MNLI-match, MNLI-mismatch, QQP, and SQuAD-binary.
Large-scale pre-trained language models, such as BERT, have recently achieved great success in a wide range of language understanding tasks. However, it remains an open question how to utilize BERT for text generation tasks. In this paper, we present a novel approach to addressing this challenge in a generic sequence-to-sequence (Seq2Seq) setting. We first propose a new task, Conditional Masked Language Modeling (C-MLM), to enable fine-tuning of BERT on a target text-generation dataset. The fine-tuned BERT (i.e., teacher) is then exploited as extra supervision to improve conventional Seq2Seq models (i.e., student) for text generation. By leveraging BERT's idiosyncratic bidirectional nature, distilling the knowledge learned from BERT can encourage auto-regressive Seq2Seq models to plan ahead, imposing global sequence-level supervision for coherent text generation. Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple text generation tasks, including machine translation (MT) and text summarization. Our proposed model also achieves new state-of-the-art results on the IWSLT German-English and English-Vietnamese MT datasets.
In this paper, we focus on unsupervised domain adaptation for Machine Reading Comprehension (MRC), where the source domain has a large amount of labeled data, while only unlabeled passages are available in the target domain. To this end, we propose an Adversarial Domain Adaptation framework (AdaMRC), where (i) pseudo questions are first generated for unlabeled passages in the target domain, and then (ii) a domain classifier is incorporated into an MRC model to predict which domain a given passage-question pair comes from. The classifier and the passage-question encoder are jointly trained using adversarial learning to enforce domain-invariant representation learning. Comprehensive evaluations demonstrate that our approach (i) is generalizable to different MRC models and datasets, (ii) can be combined with pre-trained large-scale language models (such as ELMo and BERT), and (iii) can be extended to semi-supervised learning.
Graph-based clustering is one of the major clustering methods. Most such methods work in three separate steps: similarity graph construction, cluster label relaxation, and label discretization with k-means. This common practice has three disadvantages: 1) the predefined similarity graph is often fixed and may not be optimal for the subsequent clustering; 2) the relaxation of cluster labels may cause significant information loss; and 3) label discretization may deviate from the real clustering result, since k-means is sensitive to the initialization of cluster centroids. To tackle these problems, we propose an effective discrete optimal graph clustering (DOGC) framework. A structured similarity graph that is theoretically optimal for clustering performance is adaptively learned under the guidance of a reasonable rank constraint. Besides, to avoid the information loss, we explicitly enforce a discrete transformation on the intermediate continuous label, which yields a tractable optimization problem with a discrete solution. Further, to compensate for the unreliability of the learned labels and enhance the clustering accuracy, we design an adaptive robust module that learns a prediction function for unseen data based on the learned discrete cluster labels. Finally, an iterative optimization strategy with guaranteed convergence is developed to directly solve for the clustering results. Extensive experiments conducted on both real and synthetic datasets demonstrate the superiority of our proposed methods compared with several state-of-the-art clustering approaches.
In this paper, we introduce a new task, interactive image editing via conversational language, where users can guide an agent to edit images via multi-turn dialogue in natural language. In each dialogue turn, the agent takes a source image and a natural language description from the user as the input, and generates a target image following the textual description. Two new datasets are created for this task, Zap-Seq and DeepFashion-Seq, collected via crowdsourcing. For this task, we propose a new Sequential Attention Generative Adversarial Network (SeqAttnGAN) framework, which applies a neural state tracker to encode both the source image and the textual descriptions, and generates high-quality images in each dialogue turn. To achieve better region-specific text-to-image generation, we also introduce an attention mechanism into the model. Experiments on the two datasets, including quantitative evaluation and a user study, show that our model outperforms state-of-the-art approaches in both image quality and text-to-image consistency.
Attribute recognition has become crucial because of its wide applications in many computer vision tasks, such as person re-identification. As in many object recognition problems, variations in viewpoint and illumination, and recognition at a far distance, all make this task challenging. In this work, we propose a joint CTC-Attention model (JCM), which maps attribute labels into sequences to learn the semantic relationship among attributes. The network uses a neural network to encode images into sequences, and employs the connectionist temporal classification (CTC) loss during training with the aim of improving the encoding performance. At the same time, it adopts an attention model to decode the sequences, which aligns the sequences and better learns the semantic information from attributes. Extensive experiments on three public datasets, i.e., the Market-1501 attribute dataset, the Duke attribute dataset and the PETA dataset, demonstrate the effectiveness of the proposed method.
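For readers unfamiliar with CTC: the loss sums the probability of every frame-level path that collapses to the target label sequence, where collapsing means merging consecutive repeats and then dropping a special blank symbol. The sketch below shows only that standard collapse rule; the integer attribute IDs and the choice of 0 as the blank are assumptions for illustration and are not taken from the paper.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_collapse(path: Sequence[int], blank: int = 0) -> List[int]:
    """Map a frame-level CTC path to its label sequence.

    Step 1 merges runs of identical consecutive symbols;
    step 2 removes the blank symbol.
    """
    merged = [label for label, _ in groupby(path)]
    return [label for label in merged if label != blank]


# Hypothetical path over 8 encoder steps; 3 and 5 stand for attribute labels.
print(ctc_collapse([0, 3, 3, 0, 3, 5, 5, 0]))  # [3, 3, 5]
```

Note that the blank between the two runs of 3 is what allows a repeated label to survive the merge step.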
Training task-completion dialogue agents with reinforcement learning usually requires a large number of real user experiences. The Dyna-Q algorithm extends Q-learning by integrating a world model, and thus can effectively boost training efficiency using simulated experiences generated by the world model. The effectiveness of Dyna-Q, however, depends on the quality of the world model, or implicitly, on the pre-specified ratio of real vs. simulated experiences used for Q-learning. To this end, we extend the recently proposed Deep Dyna-Q (DDQ) framework by integrating a switcher that automatically determines whether to use a real or simulated experience for Q-learning. Furthermore, we explore the use of active learning for improving sample efficiency, by encouraging the world model to generate simulated experiences in regions of the state-action space that the agent has not (fully) explored. Our results show that by combining the switcher and active learning, the new framework, named Switch-based Active Deep Dyna-Q (Switch-DDQ), leads to significant improvement over DDQ and Q-learning baselines in both simulation and human evaluations.
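As background, here is a minimal sketch of classic tabular Dyna-Q, not of DDQ or Switch-DDQ. The five-state chain environment, the hyperparameters, and all function names are assumptions chosen for illustration. The `planning_steps` argument plays the role of the pre-specified ratio of simulated to real experiences that the abstract refers to.

```python
import random

N_STATES = 5      # toy chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Toy deterministic environment (an assumption, not a dialogue task)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def dyna_q(episodes=30, planning_steps=5, alpha=0.5, gamma=0.9,
           epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned world model: (state, action) -> (reward, next_state, done)

    def q_update(s, a, r, s2, done):
        best_next = 0.0 if done else max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            best = max(q[(state, a)] for a in ACTIONS)
            greedy = [a for a in ACTIONS if q[(state, a)] == best]
            explore = rng.random() < epsilon
            action = rng.choice(ACTIONS) if explore else rng.choice(greedy)

            next_state, reward, done = step(state, action)
            q_update(state, action, reward, next_state, done)    # real experience
            model[(state, action)] = (reward, next_state, done)  # update world model

            # Planning: replay simulated experiences drawn from the model.
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                q_update(ps, pa, pr, ps2, pdone)
            state = next_state
    return q


q_values = dyna_q()
policy = [max(ACTIONS, key=lambda a: q_values[(s, a)]) for s in range(N_STATES - 1)]
print(policy)  # greedy action per non-terminal state
```

In this sketch the number of planning updates per real step is fixed in advance; the switcher described in the abstract replaces that fixed choice with a learned decision.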
Objective: We propose an end-to-end CNN-based locating model for pulmonary tuberculosis (TB) diagnosis in radiographs. This model makes full use of chest radiographs (X-rays) for their improved accessibility, reduced cost and high accuracy for TB disease. Methods: Several specialized improvements are proposed for the detection task in the medical field. A false positive (FP) restrictor head is introduced for FP reduction. Anchor-oriented network heads are proposed in the position regression section. An optimized loss function is designed for hard example mining. Results: The experimental results show that when the intersection over union (IoU) threshold is set to 0.3, the average precision (AP) on two test data sets provided by different hospitals reaches 0.9023 and 0.9332. Ablation experiments show that hard example mining and the change of regressor heads contribute most in this work, but FP restriction is necessary in a CAD diagnosis system. Conclusion: The results demonstrate the high precision and good generalization ability of our proposed model compared to previous works. Significance: We are the first to make full use of the feature extraction ability of CNNs in the TB diagnostic field and to explore the localization of TB, whereas previous works focus on the weaker task of healthy-sick subject classification.
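To make the IoU threshold concrete, the following sketch computes intersection over union for two axis-aligned boxes and checks it against a threshold of 0.3. The `(x_min, y_min, x_max, y_max)` box format, the function names, and the example coordinates are assumptions; the abstract does not specify how boxes are represented.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(predicted, ground_truth, threshold=0.3):
    """A prediction counts as a hit when its IoU reaches the threshold."""
    return iou(predicted, ground_truth) >= threshold


# Hypothetical boxes: overlap 50, union 150, so IoU is 1/3.
predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
print(is_match(predicted, ground_truth))  # True
```

A threshold of 0.3 is more lenient than the 0.5 often used in general object detection, so a prediction that covers only part of a lesion can still count as a correct localization.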
There are two main lines of research on visual reasoning: neural module network (NMN) with explicit multi-hop reasoning through handcrafted neural modules, and monolithic network with implicit reasoning in the latent feature space. The former excels in interpretability and compositionality, while the latter usually achieves better performance due to model flexibility and parameter efficiency. In order to bridge the gap between the two and leverage the merits of both, we present Meta Module Network (MMN), a novel hybrid approach that can utilize a Meta Module to perform versatile functionalities, while preserving compositionality and interpretability through modularized design. The proposed model first parses an input question into a functional program through a Program Generator. Instead of handcrafting a task-specific network to represent each function, as in traditional NMN, we propose a Meta Module, which can read a recipe (function specifications) to dynamically instantiate the task-specific Instance Modules for compositional reasoning. To endow different instance modules with designated functionalities, we design a symbolic teacher which can execute against provided scene graphs to generate guidelines for the instantiated modules (student) to follow during training. Experiments conducted on the GQA benchmark demonstrate that MMN outperforms both NMN and monolithic network baselines, with good generalization ability to handle unseen functions.
In this paper, we present a Hierarchical Graph Network (HGN) for multi-hop question answering. To aggregate clues from scattered texts across multiple paragraphs, a hierarchical graph is created by constructing nodes at different levels of granularity (i.e., questions, paragraphs, sentences, and entities), the representations of which are initialized with BERT-based context encoders. By weaving heterogeneous nodes into an integral unified graph, this hierarchical differentiation of node granularity enables HGN to support different question answering sub-tasks simultaneously (e.g., paragraph selection, supporting facts extraction, and answer prediction). Given a constructed hierarchical graph for each question, the initial node representations are updated through graph propagation; and for each sub-task, multi-hop reasoning is performed by traversing through graph edges. Extensive experiments on the HotpotQA benchmark demonstrate that the proposed HGN approach outperforms prior state-of-the-art methods by a large margin in both Distractor and Fullwiki settings.
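The four node levels listed above can be pictured with a small graph-construction sketch. The abstract names the node types but not the edge types, so the edges used here (question to paragraph, paragraph to sentence, sentence to entity) are assumptions, as are the input format, the function name, and the toy example; node representations and graph propagation are omitted.

```python
from collections import defaultdict


def build_hierarchical_graph(question, paragraphs):
    """Build an undirected adjacency map over question, paragraph,
    sentence and entity nodes (edge types are assumed, see lead-in).

    paragraphs: list of paragraphs, each a list of
    {"text": str, "entities": [str, ...]} sentence records.
    """
    edges = defaultdict(set)

    def connect(u, v):
        edges[u].add(v)
        edges[v].add(u)

    q_node = ("question", question)
    for p_idx, sentences in enumerate(paragraphs):
        p_node = ("paragraph", p_idx)
        connect(q_node, p_node)
        for s_idx, sentence in enumerate(sentences):
            s_node = ("sentence", p_idx, s_idx)
            connect(p_node, s_node)
            for entity in sentence["entities"]:
                # Entity nodes are shared across paragraphs by surface form.
                connect(s_node, ("entity", entity))
    return edges


paragraphs = [
    [{"text": "The Eiffel Tower is in France.",
      "entities": ["Eiffel Tower", "France"]}],
    [{"text": "Paris is the capital of France.",
      "entities": ["Paris", "France"]}],
]
graph = build_hierarchical_graph(
    "What is the capital of the country where the Eiffel Tower stands?",
    paragraphs,
)
# The shared entity links sentences from two different paragraphs.
print(sorted(graph[("entity", "France")]))
# [('sentence', 0, 0), ('sentence', 1, 0)]
```

The shared entity node is what makes a two-hop path between the paragraphs possible, which is the kind of traversal multi-hop reasoning relies on.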