Research papers and code for "Shuming Ma":
Conditional random fields (CRF) is one of the most famous approaches for structured classification. It is a structured gradient-based method, which has high accuracy but with drawbacks: very slow training, hard to implement in the tasks with complex structures, and no support of decode-based optimization (which is important in many cases). To address these issues, we propose a simple and fast solution, a decode-based probabilistic online learning method, called CRF with decode-based learning (DBL-CRF). The proposed DBL-CRF decodes the output candidates, derives probabilities, and conduct efficient online learning. The method has the similar probabilistic information as CRF, but supports decode-based optimization and does not need gradient computation. We show that this method is with fast training, very simple to implement, with top accuracy, and with theoretical guarantees of convergence. Experiments on well-known tasks show that our method has better accuracy and much faster speed than our strong baseline CRF systems. The code is available at https://github.com/lancopku/Decode-CRF

Click to Read Paper and Get Code
Text summarization and text simplification are two major ways to simplify the text for poor readers, including children, non-native speakers, and the functionally illiterate. Text summarization is to produce a brief summary of the main ideas of the text, while text simplification aims to reduce the linguistic complexity of the text and retain the original meaning. Recently, most approaches for text summarization and text simplification are based on the sequence-to-sequence model, which achieves much success in many text generation tasks. However, although the generated simplified texts are similar to source texts literally, they have low semantic relevance. In this work, our goal is to improve semantic relevance between source texts and simplified texts for text summarization and text simplification. We introduce a Semantic Relevance Based neural model to encourage high semantic similarity between texts and summaries. In our model, the source text is represented by a gated attention encoder, while the summary representation is produced by a decoder. Besides, the similarity score between the representations is maximized during training. Our experiments show that the proposed model outperforms the state-of-the-art systems on two benchmark corpus.

Click to Read Paper and Get Code
To speed up the training process, many existing systems use parallel technology for online learning algorithms. However, most research mainly focus on stochastic gradient descent (SGD) instead of other algorithms. We propose a generic online parallel learning framework for large margin models, and also analyze our framework on popular large margin algorithms, including MIRA and Structured Perceptron. Our framework is lock-free and easy to implement on existing systems. Experiments show that systems with our framework can gain near linear speed up by increasing running threads, and with no loss in accuracy.

Click to Read Paper and Get Code
Dependency parsing is an important NLP task. A popular approach for dependency parsing is structured perceptron. Still, graph-based dependency parsing has the time complexity of $O(n^3)$, and it suffers from slow training. To deal with this problem, we propose a parallel algorithm called parallel perceptron. The parallel algorithm can make full use of a multi-core computer which saves a lot of training time. Based on experiments we observe that dependency parsing with parallel perceptron can achieve 8-fold faster training speed than traditional structured perceptron methods when using 10 threads, and with no loss at all in accuracy.

Click to Read Paper and Get Code
Conditional Random Field (CRF) and recurrent neural models have achieved success in structured prediction. More recently, there is a marriage of CRF and recurrent neural models, so that we can gain from both non-linear dense features and globally normalized CRF objective. These recurrent neural CRF models mainly focus on encode node features in CRF undirected graphs. However, edge features prove important to CRF in structured prediction. In this work, we introduce a new recurrent neural CRF model, which learns non-linear edge features, and thus makes non-linear features encoded completely. We compare our model with different neural models in well-known structured prediction tasks. Experiments show that our model outperforms state-of-the-art methods in NP chunking, shallow parsing, Chinese word segmentation and POS tagging.

Click to Read Paper and Get Code
Article comments can provide supplementary opinions and facts for readers, thereby increase the attraction and engagement of articles. Therefore, automatically commenting is helpful in improving the activeness of the community, such as online forums and news websites. Previous work shows that training an automatic commenting system requires large parallel corpora. Although part of articles are naturally paired with the comments on some websites, most articles and comments are unpaired on the Internet. To fully exploit the unpaired data, we completely remove the need for parallel data and propose a novel unsupervised approach to train an automatic article commenting model, relying on nothing but unpaired articles and comments. Our model is based on a retrieval-based commenting framework, which uses news to retrieve comments based on the similarity of their topics. The topic representation is obtained from a neural variational topic model, which is trained in an unsupervised manner. We evaluate our model on a news comment dataset. Experiments show that our proposed topic-based approach significantly outperforms previous lexicon-based models. The model also profits from paired corpora and achieves state-of-the-art performance under semi-supervised scenarios.

Click to Read Paper and Get Code
With the development of information technology, there is an explosive growth in the number of online comment concerning news, blogs and so on. The massive comments are overloaded, and often contain some misleading and unwelcome information. Therefore, it is necessary to identify high-quality comments and filter out low-quality comments. In this work, we introduce a novel task: high-quality comment identification (HQCI), which aims to automatically assess the quality of online comments. First, we construct a news comment corpus, which consists of news, comments, and the corresponding quality label. Second, we analyze the dataset, and find the quality of comments can be measured in three aspects: informativeness, consistency, and novelty. Finally, we propose a novel multi-target text matching model, which can measure three aspects by referring to the news and surrounding comments. Experimental results show that our method can outperform various baselines by a large margin on the news dataset.

Click to Read Paper and Get Code
In neural abstractive summarization, the conventional sequence-to-sequence (seq2seq) model often suffers from repetition and semantic irrelevance. To tackle the problem, we propose a global encoding framework, which controls the information flow from the encoder to the decoder based on the global information of the source context. It consists of a convolutional gated unit to perform global encoding to improve the representations of the source-side information. Evaluations on the LCSTS and the English Gigaword both demonstrate that our model outperforms the baseline models, and the analysis shows that our model is capable of reducing repetition.

* Accepted by ACL 2018
Click to Read Paper and Get Code
Text summarization and sentiment classification both aim to capture the main ideas of the text but at different levels. Text summarization is to describe the text within a few sentences, while sentiment classification can be regarded as a special type of summarization which "summarizes" the text into a even more abstract fashion, i.e., a sentiment class. Based on this idea, we propose a hierarchical end-to-end model for joint learning of text summarization and sentiment classification, where the sentiment classification label is treated as the further "summarization" of the text summarization output. Hence, the sentiment classification layer is put upon the text summarization layer, and a hierarchical structure is derived. Experimental results on Amazon online reviews datasets show that our model achieves better performance than the strong baseline systems on both abstractive summarization and sentiment classification.

* accepted by IJCAI-18
Click to Read Paper and Get Code
A sentence can be translated into more than one correct sentences. However, most of the existing neural machine translation models only use one of the correct translations as the targets, and the other correct sentences are punished as the incorrect sentences in the training stage. Since most of the correct translations for one sentence share the similar bag-of-words, it is possible to distinguish the correct translations from the incorrect ones by the bag-of-words. In this paper, we propose an approach that uses both the sentences and the bag-of-words as targets in the training stage, in order to encourage the model to generate the potentially correct sentences that are not appeared in the training set. We evaluate our model on a Chinese-English translation dataset, and experiments show our model outperforms the strong baselines by the BLEU score of 4.55.

* accepted by ACL 2018
Click to Read Paper and Get Code
Most of the current abstractive text summarization models are based on the sequence-to-sequence model (Seq2Seq). The source content of social media is long and noisy, so it is difficult for Seq2Seq to learn an accurate semantic representation. Compared with the source content, the annotated summary is short and well written. Moreover, it shares the same meaning as the source content. In this work, we supervise the learning of the representation of the source content with that of the summary. In implementation, we regard a summary autoencoder as an assistant supervisor of Seq2Seq. Following previous work, we evaluate our model on a popular Chinese social media dataset. Experimental results show that our model achieves the state-of-the-art performances on the benchmark dataset.

* accepted by ACL 2018
Click to Read Paper and Get Code
As more and more academic papers are being submitted to conferences and journals, evaluating all these papers by professionals is time-consuming and can cause inequality due to the personal factors of the reviewers. In this paper, in order to assist professionals in evaluating academic papers, we propose a novel task: automatic academic paper rating (AAPR), which automatically determine whether to accept academic papers. We build a new dataset for this task and propose a novel modularized hierarchical convolutional neural network to achieve automatic academic paper rating. Evaluation results show that the proposed model outperforms the baselines by a large margin. The dataset and code are available at \url{https://github.com/lancopku/AAPR}

* Accepted by ACL2018
Click to Read Paper and Get Code
Attention-based sequence-to-sequence model has proved successful in Neural Machine Translation (NMT). However, the attention without consideration of decoding history, which includes the past information in the decoder and the attention mechanism, often causes much repetition. To address this problem, we propose the decoding-history-based Adaptive Control of Attention (ACA) for the NMT model. ACA learns to control the attention by keeping track of the decoding history and the current information with a memory vector, so that the model can take the translated contents and the current information into consideration. Experiments on Chinese-English translation and the English-Vietnamese translation have demonstrated that our model significantly outperforms the strong baselines. The analysis shows that our model is capable of generating translation with less repetition and higher accuracy. The code will be available at https://github.com/lancopku

Click to Read Paper and Get Code
We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-$k$ elements (in terms of magnitude) are kept. As a result, only $k$ rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction ($k$ divided by the vector dimension) in the computational cost. Surprisingly, experimental results demonstrate that we can update only 1--4\% of the weights at each back propagation pass. This does not result in a larger number of training iterations. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. The code is available at https://github.com/jklj077/meProp

* Accepted by The 34th International Conference on Machine Learning (ICML 2017)
Click to Read Paper and Get Code
We propose a method, called Label Embedding Network, which can learn label representation (label embedding) during the training process of deep networks. With the proposed method, the label embedding is adaptively and automatically learned through back propagation. The original one-hot represented loss function is converted into a new loss function with soft distributions, such that the originally unrelated labels have continuous interactions with each other during the training process. As a result, the trained model can achieve substantially higher accuracy and with faster convergence speed. Experimental results based on competitive tasks demonstrate the effectiveness of the proposed method, and the learned label embedding is reasonable and interpretable. The proposed method achieves comparable or even better results than the state-of-the-art systems. The source code is available at \url{https://github.com/lancopku/LabelEmb}.

Click to Read Paper and Get Code
Existing neural models usually predict the tag of the current token independent of the neighboring tags. The popular LSTM-CRF model considers the tag dependencies between every two consecutive tags. However, it is hard for existing neural models to take longer distance dependencies of tags into consideration. The scalability is mainly limited by the complex model structures and the cost of dynamic programming during training. In our work, we first design a new model called "high order LSTM" to predict multiple tags for the current token which contains not only the current tag but also the previous several tags. We call the number of tags in one prediction as "order". Then we propose a new method called Multi-Order BiLSTM (MO-BiLSTM) which combines low order and high order LSTMs together. MO-BiLSTM keeps the scalability to high order models with a pruning technique. We evaluate MO-BiLSTM on all-phrase chunking and NER datasets. Experiment results show that MO-BiLSTM achieves the state-of-the-art result in chunking and highly competitive results in two NER datasets.

* Accepted by COLING 2018
Click to Read Paper and Get Code
We introduce the task of automatic live commenting. Live commenting, which is also called `video barrage', is an emerging feature on online video sites that allows real-time comments from viewers to fly across the screen like bullets or roll at the right side of the screen. The live comments are a mixture of opinions for the video and the chit chats with other comments. Automatic live commenting requires AI agents to comprehend the videos and interact with human viewers who also make the comments, so it is a good testbed of an AI agent's ability of dealing with both dynamic vision and language. In this work, we construct a large-scale live comment dataset with 2,361 videos and 895,929 live comments. Then, we introduce two neural models to generate live comments based on the visual and textual contexts, which achieve better performance than previous neural baselines such as the sequence-to-sequence model. Finally, we provide a retrieval-based evaluation protocol for automatic live commenting where the model is asked to sort a set of candidate comments based on the log-likelihood score, and evaluated on metrics such as mean-reciprocal-rank. Putting it all together, we demonstrate the first `LiveBot'.

Click to Read Paper and Get Code
We propose a novel model for multi-label text classification, which is based on sequence-to-sequence learning. The model generates higher-level semantic unit representations with multi-level dilated convolution as well as a corresponding hybrid attention mechanism that extracts both the information at the word-level and the level of the semantic unit. Our designed dilated convolution effectively reduces dimension and supports an exponential expansion of receptive fields without loss of local information, and the attention-over-attention mechanism is able to capture more summary relevant information from the source context. Results of our experiments show that the proposed model has significant advantages over the baseline models on the dataset RCV1-V2 and Ren-CECps, and our analysis demonstrates that our model is competitive to the deterministic hierarchical models and it is more robust to classifying low-frequency labels.

* To appear in EMNLP 2018
Click to Read Paper and Get Code
Most recent approaches use the sequence-to-sequence model for paraphrase generation. The existing sequence-to-sequence model tends to memorize the words and the patterns in the training dataset instead of learning the meaning of the words. Therefore, the generated sentences are often grammatically correct but semantically improper. In this work, we introduce a novel model based on the encoder-decoder framework, called Word Embedding Attention Network (WEAN). Our proposed model generates the words by querying distributed word representations (i.e. neural word embeddings), hoping to capturing the meaning of the according words. Following previous work, we evaluate our model on two paraphrase-oriented tasks, namely text simplification and short text abstractive summarization. Experimental results show that our model outperforms the sequence-to-sequence baseline by the BLEU score of 6.3 and 5.5 on two English text simplification datasets, and the ROUGE-2 F1 score of 5.7 on a Chinese summarization dataset. Moreover, our model achieves state-of-the-art performances on these three benchmark datasets.

* arXiv admin note: text overlap with arXiv:1710.02318
Click to Read Paper and Get Code
Multi-label classification is an important yet challenging task in natural language processing. It is more complex than single-label classification in that the labels tend to be correlated. Existing methods tend to ignore the correlations between labels. Besides, different parts of the text can contribute differently for predicting different labels, which is not considered by existing models. In this paper, we propose to view the multi-label classification task as a sequence generation problem, and apply a sequence generation model with a novel decoder structure to solve it. Extensive experimental results show that our proposed methods outperform previous work by a substantial margin. Further analysis of experimental results demonstrates that the proposed methods not only capture the correlations between labels, but also select the most informative words automatically when predicting different labels.

* Accepted by COLING 2018
Click to Read Paper and Get Code