Models, code, and papers for "Hai Zhao":
Head-driven phrase structure grammar (HPSG) enjoys a uniform formalism representing rich contextual syntactic and even semantic meanings. This paper makes the first attempt to formulate a simplified HPSG by integrating constituent and dependency formal representations into head-driven phrase structure. Then two parsing algorithms are respectively proposed for two converted tree representations, division span and joint span. As HPSG encodes both constituent and dependency structure information, the proposed HPSG parsers may be regarded as a sort of joint decoder for both types of structures and thus are evaluated in terms of extracted or converted constituent and dependency parsing trees. Our parser achieves new state-of-the-art performance for both parsing tasks on Penn Treebank (PTB) and Chinese Penn Treebank, verifying the effectiveness of joint learning constituent and dependency structures. In details, we report 95.84 F1 of constituent parsing and 97.00% UAS of dependency parsing on PTB.
In this paper, we propose a span based model combined with syntactic information for n-ary open information extraction. The advantage of span model is that it can leverage span level features, which is difficult in token based BIO tagging methods. We also improve the previous bootstrap method to construct training corpus. Experiments show that our model outperforms previous open information extraction systems. Our code and data are publicly available at https://github.com/zhanjunlang/Span_OIE
Chemical information extraction is to convert chemical knowledge in text into true chemical database, which is a text processing task heavily relying on chemical compound name identification and standardization. Once a systematic name for a chemical compound is given, it will naturally and much simply convert the name into the eventually required molecular formula. However, for many chemical substances, they have been shown in many other names besides their systematic names which poses a great challenge for this task. In this paper, we propose a framework to do the auto standardization from the non-systematic names to the corresponding systematic names by using the spelling error correction, byte pair encoding tokenization and neural sequence to sequence model. Our framework is trained end to end and is fully data-driven. Our standardization accuracy on the test dataset achieves 54.04% which has a great improvement compared to previous state-of-the-art result.
Easy-first parsing relies on subtree re-ranking to build the complete parse tree. Whereas the intermediate state of parsing processing are represented by various subtrees, whose internal structural information is the key lead for later parsing action decisions, we explore a better representation for such subtrees. In detail, this work introduces a bottom-up subtree encoder based on the child-sum tree-LSTM. Starting from an easy-first dependency parser without other handcraft features, we show that the effective subtree encoder does promote the parsing process, and is able to make a greedy search easy-first parser achieve promising results on benchmark treebanks compared to state-of-the-art baselines.
Chinese pinyin input method engine (IME) converts pinyin into character so that Chinese characters can be conveniently inputted into computer through common keyboard. IMEs work relying on its core component, pinyin-to-character conversion (P2C). Usually Chinese IMEs simply predict a list of character sequences for user choice only according to user pinyin input at each turn. However, Chinese inputting is a multi-turn online procedure, which can be supposed to be exploited for further user experience promoting. This paper thus for the first time introduces a sequence-to-sequence model with gated-attention mechanism for the core task in IMEs. The proposed neural P2C model is learned by encoding previous input utterance as extra context to enable our IME capable of predicting character sequence with incomplete pinyin input. Our model is evaluated in different benchmark datasets showing great user experience improvement compared to traditional models, which demonstrates the first engineering practice of building Chinese aided IME.
For different language pairs, word-level neural machine translation (NMT) models with a fixed-size vocabulary suffer from the same problem of representing out-of-vocabulary (OOV) words. The common practice usually replaces all these rare or unknown words with a <UNK> token, which limits the translation performance to some extent. Most of recent work handled such a problem by splitting words into characters or other specially extracted subword units to enable open-vocabulary translation. Byte pair encoding (BPE) is one of the successful attempts that has been shown extremely competitive by providing effective subword segmentation for NMT systems. In this paper, we extend the BPE style segmentation to a general unsupervised framework with three statistical measures: frequency (FRQ), accessor variety (AV) and description length gain (DLG). We test our approach on two translation tasks: German to English and Chinese to English. The experimental results show that AV and DLG enhanced systems outperform the FRQ baseline in the frequency weighted schemes at different significant levels.
Implicit discourse relation recognition is a challenging task as the relation prediction without explicit connectives in discourse parsing needs understanding of text spans and cannot be easily derived from surface features from the input sentence pairs. Thus, properly representing the text is very crucial to this task. In this paper, we propose a model augmented with different grained text representations, including character, subword, word, sentence, and sentence pair levels. The proposed deeper model is evaluated on the benchmark treebank and achieves state-of-the-art accuracy with greater than 48% in 11-way and $F_1$ score greater than 50% in 4-way classifications for the first time according to our best knowledge.
Answering questions from university admission exams (Gaokao in Chinese) is a challenging AI task since it requires effective representation to capture complicated semantic relations between questions and answers. In this work, we propose a hybrid neural model for deep question-answering task from history examinations. Our model employs a cooperative gated neural network to retrieve answers with the assistance of extra labels given by a neural turing machine labeler. Empirical study shows that the labeler works well with only a small training dataset and the gated mechanism is good at fetching the semantic representation of lengthy answers. Experiments on question answering demonstrate the proposed model obtains substantial performance gains over various neural model baselines in terms of multiple evaluation metrics.
Most previous approaches to Chinese word segmentation formalize this problem as a character-based sequence labeling task where only contextual information within fixed sized local windows and simple interactions between adjacent tags can be captured. In this paper, we propose a novel neural framework which thoroughly eliminates context windows and can utilize complete segmentation history. Our model employs a gated combination neural network over characters to produce distributed representations of word candidates, which are then given to a long short-term memory (LSTM) language scoring model. Experiments on the benchmark datasets show that without the help of feature engineering as most existing approaches, our models achieve competitive or better performances with previous state-of-the-art methods.
This paper presents generalized probabilistic models for high-order projective dependency parsing and an algorithmic framework for learning these statistical models involving dependency trees. Partition functions and marginals for high-order dependency trees can be computed efficiently, by adapting our algorithms which extend the inside-outside algorithm to higher-order cases. To show the effectiveness of our algorithms, we perform experiments on three languages---English, Chinese and Czech, using maximum conditional likelihood estimation for model training and L-BFGS for parameter estimation. Our methods achieve competitive performance for English, and outperform all previously reported dependency parsers for Chinese and Czech.
Recently, semantic role labeling (SRL) has earned a series of success with even higher performance improvements, which can be mainly attributed to syntactic integration and enhanced word representation. However, most of these efforts focus on English, while SRL on multiple languages more than English has received relatively little attention so that is kept underdevelopment. Thus this paper intends to fill the gap on multilingual SRL with special focus on the impact of syntax and contextualized word representation. Unlike existing work, we propose a novel method guided by syntactic rule to prune arguments, which enables us to integrate syntax into multilingual SRL model simply and effectively. We present a unified SRL model designed for multiple languages together with the proposed uniform syntax enhancement. Our model achieves new state-of-the-art results on the CoNLL-2009 benchmarks of all seven languages. Besides, we pose a discussion on the syntactic role among different languages and verify the effectiveness of deep enhanced representation for multilingual SRL.
Implicit discourse relation recognition is a challenging task due to the absence of the necessary informative clue from explicit connectives. The prediction of relations requires a deep understanding of the semantic meanings of sentence pairs. As implicit discourse relation recognizer has to carefully tackle the semantic similarity of the given sentence pairs and the severe data sparsity issue exists in the meantime, it is supposed to be beneficial from mastering the entire training data. Thus in this paper, we propose a novel memory mechanism to tackle the challenges for further performance improvement. The memory mechanism is adequately memorizing information by pairing representations and discourse relations of all training instances, which right fills the slot of the data-hungry issue in the current implicit discourse relation recognizer. Our experiments show that our full model with memorizing the entire training set reaches new state-of-the-art against strong baselines, which especially for the first time exceeds the milestone of 60% accuracy in the 4-way task.
This paper presents a smart sliding Chinese pinyin Input Method Editor (IME) for touchscreen devices which allows user finger sliding from one key to another on the touchscreen instead of tapping keys one by one, while the target Chinese character sequence will be predicted during the sliding process to help user input Chinese characters efficiently. Moreover, the layout of the virtual keyboard of our IME adapts to user sliding for more efficient inputting. The layout adaption process is utilized with Recurrent Neural Networks (RNN) and deep reinforcement learning. The pinyin-to-character converter is implemented with a sequence-to-sequence (Seq2Seq) model to predict the target Chinese sequence. A sliding simulator is built to automatically produce sliding samples for model training and virtual keyboard test. The key advantage of our proposed IME is that nearly all its built-in tactics can be optimized automatically with deep learning algorithms only following user behavior. Empirical studies verify the effectiveness of the proposed model and show a better user input efficiency.
Deep neural network models have helped named entity (NE) recognition achieve amazing performance without handcrafting features. However, existing systems require large amounts of human annotated training data. Efforts have been made to replace human annotations with external knowledge (e.g., NE dictionary, part-of-speech tags), while it is another challenge to obtain such effective resources. In this work, we propose a fully unsupervised NE recognition model which only needs to take informative clues from pre-trained word embeddings. We first apply Gaussian Hidden Markov Model and Deep Autoencoding Gaussian Mixture Model on word embeddings for entity span detection and type prediction, and then further design an instance selector based on reinforcement learning to distinguish positive sentences from noisy sentences and refine these coarse-grained annotations through neural networks. Extensive experiments on CoNLL benchmark datasets demonstrate that our proposed light NE recognition model achieves remarkable performance without using any annotated lexicon or corpus.
Both syntactic and semantic structures are key linguistic contextual clues, in which parsing the latter has been well shown beneficial from parsing the former. However, few works ever made an attempt to let semantic parsing help syntactic parsing. As linguistic representation formalisms, both syntax and semantics may be represented in either span (constituent/phrase) or dependency, on both of which joint learning was also seldom explored. In this paper, we propose a novel joint model of syntactic and semantic parsing on both span and dependency representations, which incorporates syntactic information effectively in the encoder of neural network and benefits from two representation formalisms in a uniform way. The experiments show that semantics and syntax can benefit each other by optimizing joint objectives. Our single model achieves new state-of-the-art or competitive results on both span and dependency semantic parsing on Propbank benchmarks and both dependency and constituent syntactic parsing on Penn Treebank.
Constituent and dependency representation for syntactic structure share a lot of linguistic and computational characteristics, this paper thus makes the first attempt by introducing a new model that is capable of parsing constituent and dependency at the same time, so that lets either of the parsers enhance each other. Especially, we evaluate the effect of different shared network components and empirically verify that dependency parsing may be much more beneficial from constituent parsing structure. The proposed parser achieves new state-of-the-art performance for both parsing tasks, constituent and dependency on PTB and CTB benchmarks.
Semantic role labeling (SRL) is a task to recognize all the predicate-argument pairs of a sentence, which has been in a performance improvement bottleneck after a series of latest works were presented. This paper proposes a novel syntax-agnostic SRL model enhanced by the proposed associated memory network (AMN), which makes use of inter-sentence attention of label-known associated sentences as a kind of memory to further enhance dependency-based SRL. In detail, we use sentences and their labels from train dataset as an associated memory cue to help label the target sentence. Furthermore, we compare several associated sentences selecting strategies and label merging methods in AMN to find and utilize the label of associated sentences while attending them. By leveraging the attentive memory from known training data, Our full model reaches state-of-the-art on CoNLL-2009 benchmark datasets for syntax-agnostic setting, showing a new effective research line of SRL enhancement other than exploiting external resources such as well pre-trained language models.