Research papers and code for "Qiang Ning":
Identifying temporal relations between events is an essential step towards natural language understanding. However, the temporal relation between two events in a story depends on, and is often dictated by, relations among other events. Consequently, effectively identifying temporal relations between events is a challenging problem even for human annotators. This paper suggests that it is important to take these dependencies into account while learning to identify these relations and proposes a structured learning approach to address this challenge. As a byproduct, this provides a new perspective on handling missing relations, a known issue that hurts existing methods. As we show, the proposed approach results in significant improvements on the two commonly used data sets for this problem.
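
The core observation, that the label of one event pair is constrained by the labels of other pairs in the same story, can be made concrete with a toy transitivity check. This is a minimal sketch under our own assumptions (the event names, label set, and check are illustrative, not the paper's model):

```python
from itertools import product

# Toy illustration: independent pairwise decisions can violate transitivity,
# which is why the paper argues for structured (joint) learning.
labels = {("A", "B"): "before", ("B", "C"): "before", ("A", "C"): "after"}

def violates_transitivity(labels):
    """Return True if before(x, y) and before(y, z) hold but before(x, z) does not."""
    for (x, y), (y2, z) in product(labels, labels):
        if y == y2 and labels[(x, y)] == "before" and labels[(y2, z)] == "before":
            if labels.get((x, z)) not in (None, "before"):
                return True
    return False

print(violates_transitivity(labels))  # True: (A, C) should be "before"
```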

* Long paper appeared in EMNLP'17. 11 pages and 4 figures
Sliced inverse regression (SIR) is a pioneering tool for supervised dimension reduction. It identifies the effective dimension reduction space, the subspace of significant factors with intrinsic lower dimensionality. In this paper, we propose to refine the SIR algorithm through an overlapping slicing scheme. The new algorithm, called overlapping sliced inverse regression (OSIR), is able to estimate the effective dimension reduction space and determine the number of effective factors more accurately. We show that such an overlapping procedure has the potential to identify the information contained in the derivatives of the inverse regression curve, which helps to explain the superiority of OSIR. We also prove that the OSIR algorithm is $\sqrt n$-consistent and verify its effectiveness by simulations and real applications.
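
A minimal NumPy sketch of SIR with an overlapping slicing scheme. The slice width, overlap ratio, and weighting below are our own illustrative choices, not necessarily the paper's exact construction:

```python
import numpy as np

def osir_directions(X, y, n_slices=10, overlap=0.5, n_dirs=1):
    """Sliced inverse regression with overlapping slices (illustrative sketch)."""
    n, p = X.shape
    # Whiten X so that EDR directions can be read off an eigen-decomposition.
    mu = X.mean(axis=0)
    w, V = np.linalg.eigh(np.cov(X, rowvar=False))
    cov_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Z = (X - mu) @ cov_inv_sqrt
    # Overlapping slices along the sorted response.
    order = np.argsort(y)
    width = int(np.ceil(n / n_slices))
    step = max(1, int(width * (1.0 - overlap)))
    starts = list(range(0, n - width + 1, step))
    M = np.zeros((p, p))
    for s in starts:
        m = Z[order[s:s + width]].mean(axis=0)   # slice mean of whitened predictors
        M += np.outer(m, m)
    M /= len(starts)
    _, vecs = np.linalg.eigh(M)
    dirs = cov_inv_sqrt @ vecs[:, ::-1][:, :n_dirs]  # map back to the original scale
    return dirs / np.linalg.norm(dirs, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = X[:, 0] + 0.1 * rng.normal(size=500)          # one effective direction: e_1
print(osir_directions(X, y, n_dirs=1).ravel().round(2))  # approximately +/- [1, 0, ..., 0]
```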

Existing temporal relation (TempRel) annotation schemes often have low inter-annotator agreements (IAA) even between experts, suggesting that the current annotation task needs a better definition. This paper proposes a new multi-axis modeling to better capture the temporal structure of events. In addition, we identify that event end-points are a major source of confusion in annotation, so we also propose to annotate TempRels based on start-points only. A pilot expert annotation using the proposed scheme shows significant improvement in IAA from the conventional 60's to 80's (Cohen's Kappa). This better-defined annotation scheme further enables the use of crowdsourcing to alleviate the labor intensity for each annotator. We hope that this work can foster more interesting studies towards event understanding.

* [Final Version] 14 pages, accepted by ACL'18
For many structured learning tasks, the data annotation process is complex and costly. Existing annotation schemes usually aim at acquiring completely annotated structures, under the common perception that partial structures are of low quality and could hurt the learning process. This paper questions this common perception, motivated by the fact that structures consist of interdependent sets of variables. Thus, given a fixed budget, partly annotating each structure may provide the same level of supervision, while allowing for more structures to be annotated. We provide an information theoretic formulation for this perspective and use it, in the context of three diverse structured learning tasks, to show that learning from partial structures can sometimes outperform learning from complete ones. Our findings may provide important insights into structured data annotation schemes and could support progress in learning protocols for structured tasks.

* Long paper accepted by NAACL'19. 11 pages and 7 figures
Understanding temporal and causal relations between events is a fundamental natural language understanding task. Because a cause must be before its effect in time, temporal and causal relations are closely related, and in many cases one even dictates the other. However, limited attention has been paid to studying these two relations jointly. This paper presents a joint inference framework for them using constrained conditional models (CCMs). Specifically, we formulate the joint problem as an integer linear programming (ILP) problem, enforcing constraints inherent in the nature of time and causality. We show that the joint inference framework results in statistically significant improvement in the extraction of both temporal and causal relations from text.
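
The paper formulates joint inference as an ILP; the brute-force sketch below enumerates joint assignments for a single event pair just to show the "a cause must precede its effect" constraint in action. The label sets and local scores are made up for illustration, not the paper's learned scorers:

```python
from itertools import product

# Toy local scores (these would come from learned classifiers in the paper).
temporal_scores = {"before": 0.40, "after": 0.45, "vague": 0.15}
causal_scores = {"causes": 0.70, "none": 0.30}

best, best_score = None, float("-inf")
for t, c in product(temporal_scores, causal_scores):
    # Constraint from the nature of time and causality:
    # if e1 causes e2, then e1 must be before e2.
    if c == "causes" and t != "before":
        continue
    score = temporal_scores[t] + causal_scores[c]
    if score > best_score:
        best, best_score = (t, c), score

# Greedy local decisions would pick ("after", "causes"), which is contradictory;
# the constrained joint assignment flips the temporal label instead.
print(best)  # ('before', 'causes')
```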

* Long paper appeared in ACL'18. 11 pages and 1 figure
Annotating temporal relations (TempRel) between events described in natural language is known to be labor intensive, partly because the total number of TempRels is quadratic in the number of events. As a result, only a small number of documents are typically annotated, limiting the coverage of various lexical/semantic phenomena. In order to improve existing approaches, one possibility is to make use of the readily available, partially annotated data (P as in partial) that cover more documents. However, missing annotations in P are known to hurt, rather than help, existing systems. This work is a case study in exploring various usages of P for TempRel extraction. Results show that despite missing annotations, P is still a useful supervision signal for this task within a constrained bootstrapping learning framework. The system described in this paper is publicly available.

* [Final Version] short paper accepted by *SEM'18
Extracting temporal relations (before, after, overlapping, etc.) is a key aspect of understanding events described in natural language. We argue that this task would gain from the availability of a resource that provides prior knowledge in the form of the temporal order that events usually follow. This paper develops such a resource -- a probabilistic knowledge base acquired in the news domain -- by extracting temporal relations between events from the New York Times (NYT) articles over a 20-year span (1987--2007). We show that existing temporal extraction systems can be improved via this resource. As a byproduct, we also show that interesting statistics can be retrieved from this resource, which can potentially benefit other time-aware tasks. The proposed system and resource are both publicly available.
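
A minimal sketch of how such a count-based temporal prior could be stored and queried. The event pairs, counts, and smoothing scheme below are made up for illustration and are not drawn from the actual resource:

```python
from collections import Counter, defaultdict

# (event1, event2) -> Counter of observed temporal relations (illustrative counts).
kb = defaultdict(Counter)
observations = [
    ("arrest", "convict", "before"), ("arrest", "convict", "before"),
    ("arrest", "convict", "after"),
    ("explode", "injure", "before"), ("explode", "injure", "before"),
]
for e1, e2, rel in observations:
    kb[(e1, e2)][rel] += 1

def prior(e1, e2, rel, smoothing=1.0, relations=("before", "after")):
    """P(rel | e1, e2) with add-one style smoothing."""
    counts = kb[(e1, e2)]
    total = sum(counts[r] + smoothing for r in relations)
    return (counts[rel] + smoothing) / total

print(round(prior("arrest", "convict", "before"), 3))  # 0.6
```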

* 13 pages, 3 figures, accepted by NAACL'18
Automatic extraction of temporal information in text is an important component of natural language understanding. It involves two basic tasks: (1) Understanding time expressions that are mentioned explicitly in text (e.g., February 27, 1998 or tomorrow), and (2) Understanding temporal information that is conveyed implicitly via relations. In this paper, we introduce CogCompTime, a system that has these two important functionalities. It incorporates the most recent progress, achieves state-of-the-art performance, and is publicly available. We believe that this demo will be useful for multiple time-aware applications and provide valuable insight for future research in temporal understanding.
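
CogCompTime's own Timex component is far more sophisticated; as a rough illustration of task (1) only, the snippet below normalizes one explicit and one relative time expression with the standard library, assuming a known document creation time. Everything here (the DCT, the supported patterns) is a simplifying assumption:

```python
import calendar
from datetime import date, timedelta

DCT = date(1998, 2, 26)  # document creation time (illustrative)

def normalize(expression):
    """Very rough normalization of a few time expressions relative to the DCT."""
    expression = expression.lower().strip()
    if expression == "tomorrow":
        return DCT + timedelta(days=1)
    if expression == "yesterday":
        return DCT - timedelta(days=1)
    # Fall back to an explicit "Month DD, YYYY" pattern.
    month_str, day, year = expression.replace(",", "").split()
    month = list(calendar.month_name).index(month_str.capitalize())
    return date(int(year), month, int(day))

print(normalize("tomorrow"))           # 1998-02-27
print(normalize("February 27, 1998"))  # 1998-02-27
```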

* Demo paper appeared in EMNLP'18. 6 pages and 3 figures
In current clinical practice, electroencephalograms (EEG) are reviewed and analyzed by well-trained neurologists to provide support for therapeutic decisions. Such manual review is labor-intensive and error-prone, so automatic and accurate seizure/nonseizure classification methods are needed. One major problem is that the EEG signals for the seizure state and the nonseizure state exhibit considerable variations. In order to capture essential seizure features, this paper integrates an emerging deep learning model, the independently recurrent neural network (IndRNN), with a dense structure and an attention mechanism to exploit temporal and spatial discriminating features and overcome seizure variabilities. The dense structure ensures maximum information flow between layers, and the attention mechanism captures spatial features. Evaluations are performed in cross-validation experiments over the noisy CHB-MIT data set. The obtained average sensitivity, specificity and precision of 88.80%, 88.60% and 88.69% are better than those of the current state-of-the-art methods. In addition, we explore how the segment length affects the classification performance. Thirteen different segment lengths are assessed, showing that the classification performance varies with the segment length, with a maximal fluctuation margin of more than 4%. Thus, the segment length is an important factor influencing the classification performance.
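
A minimal NumPy sketch of the IndRNN recurrence the model builds on: each hidden unit keeps an independent, elementwise recurrent weight instead of a full recurrent matrix. The dense connections and attention layers of the full model are omitted, and the sizes below are illustrative:

```python
import numpy as np

def indrnn_forward(x_seq, W, u, b):
    """IndRNN recurrence: h_t = relu(W x_t + u * h_{t-1} + b).

    Unlike a vanilla RNN, the recurrent term is an elementwise product with the
    vector u, so each hidden unit depends only on its own past state.
    """
    T = x_seq.shape[0]
    h = np.zeros(W.shape[0])
    outputs = np.empty((T, W.shape[0]))
    for t in range(T):
        h = np.maximum(0.0, W @ x_seq[t] + u * h + b)
        outputs[t] = h
    return outputs

rng = np.random.default_rng(0)
x_seq = rng.normal(size=(50, 18))            # e.g., 50 time steps of 18-channel EEG features
W = rng.normal(scale=0.1, size=(32, 18))
u = rng.uniform(-1.0, 1.0, size=32)          # keeping |u| <= 1 stabilizes long sequences
b = np.zeros(32)
print(indrnn_forward(x_seq, W, u, b).shape)  # (50, 32)
```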

* 12 pages, 8 figures. arXiv admin note: text overlap with arXiv:1903.09326
In current clinical practice, electroencephalograms (EEG) are reviewed and analyzed by trained neurologists to provide support for therapeutic decisions. Manual reviews can be laborious and error-prone, so automatic and accurate seizure/non-seizure classification methods are desirable. A critical challenge is that seizure morphologies exhibit considerable variability. In order to capture essential seizure features, this paper leverages an emerging deep learning model, the independently recurrent neural network (IndRNN), to construct a new approach to seizure/non-seizure classification. This new approach gradually expands the time scales, thereby extracting temporal and spatial features from the local time duration up to the entire record. Evaluations are conducted with cross-subject cross-validation experiments over the noisy CHB-MIT data. Experimental results demonstrate that the proposed approach outperforms the current state-of-the-art methods. In addition, we explore how the segment length affects the classification performance. Thirteen different segment lengths are assessed, showing that the classification performance varies with the segment length, with a maximal fluctuation margin of more than 4%. Thus, the segment length is an important factor influencing the classification performance.

* 10 pages, 2 figures, submitted to AMIA symposium 2019
Retinal vessel segmentation based on deep learning requires a large amount of manually labeled data, which is time-consuming, laborious and requires professional expertise. Worse, abundant fundus images are difficult to acquire. These problems are aggravated by the presence of abnormalities, the varying size and shape of the vessels, non-uniform illumination and anatomical variation. In this paper, we propose a data-efficient semi-supervised learning framework that effectively combines an existing deep learning network with GAN-based adversarial learning and self-training. In view of the difficulty of tuning the hyper-parameters of semi-supervised learning, we propose a hyper-parameter selection method based on the particle swarm optimization (PSO) algorithm. To the best of our knowledge, this work is the first to combine intelligent optimization with semi-supervised learning to achieve the best performance. With adversarial learning, self-training and PSO-based hyper-parameter selection working together, we obtain retinal vessel segmentation performance comparable to, or even better than, representative supervised learning methods while using only one tenth of the labeled data from DRIVE.
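
A minimal particle swarm optimization loop over a two-dimensional hyper-parameter space. The objective below is a synthetic stand-in for the validation performance of the semi-supervised segmentation model, and the swarm settings are our own illustrative choices:

```python
import numpy as np

def validation_score(params):
    """Stand-in for (negative) validation loss; peaks at lr = 1e-3, weight = 0.5."""
    lr, weight = params
    return -((np.log10(lr) + 3.0) ** 2 + (weight - 0.5) ** 2)

rng = np.random.default_rng(0)
low, high = np.array([1e-5, 0.0]), np.array([1e-1, 1.0])   # search box: (lr, loss weight)
n_particles, n_iters = 12, 40
pos = rng.uniform(low, high, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([validation_score(p) for p in pos])
gbest = pbest[pbest_val.argmax()]

for _ in range(n_iters):
    r1, r2 = rng.uniform(size=(2, n_particles, 1))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, low, high)
    vals = np.array([validation_score(p) for p in pos])
    improved = vals > pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmax()]

print(gbest)  # should approach lr ~ 1e-3 and weight ~ 0.5
```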

Text simplification (TS) can be viewed as a monolingual translation task, translating between text variations within a single language. Recent neural TS models draw on insights from neural machine translation to learn lexical simplification and content reduction with an encoder-decoder model. Unlike in neural machine translation, however, we cannot obtain enough pairs of ordinary and simplified sentences for TS, as they are expensive and time-consuming to build. Target-side simplified sentences play an important role in boosting fluency for statistical TS, and we investigate their use for training, with no changes to the network architecture. We propose to pair each simple training sentence with a synthetic ordinary sentence via back-translation, and to treat this synthetic data as additional training data. We train the encoder-decoder model on both synthetic and original sentence pairs, obtaining substantial improvements on the available WikiLarge and WikiSmall data compared with state-of-the-art methods.
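
A schematic of the back-translation augmentation step. The dummy back_translate function below merely stands in for a simplified-to-ordinary translation model trained on the original pairs; the sentences and helper names are hypothetical:

```python
def augment_with_back_translation(parallel_pairs, monolingual_simple, back_translate):
    """Pair each unpaired simple sentence with a synthetic ordinary sentence."""
    synthetic_pairs = [(back_translate(simple), simple) for simple in monolingual_simple]
    # The TS encoder-decoder is then trained on original + synthetic pairs.
    return parallel_pairs + synthetic_pairs

# Dummy stand-in for a trained simple->ordinary back-translation model.
def dummy_back_translate(simple_sentence):
    return "[synthetic ordinary] " + simple_sentence

parallel = [("The physician administered the medication.", "The doctor gave the medicine.")]
simple_only = ["The storm ended quickly."]
print(augment_with_back_translation(parallel, simple_only, dummy_back_translate))
```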

We propose an approach to reduce the bias of ridge regression and the regularization kernel network. When applied to a single data set, the new algorithms have learning performance comparable to the original ones. When applied to incremental learning with block-wise streaming data, the new algorithms are more efficient due to bias reduction. Both theoretical characterizations and simulation studies are used to verify the effectiveness of these new algorithms.
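
The paper's exact bias-correction construction is not reproduced here; the sketch below illustrates one standard way to reduce ridge bias, refitting ridge on the residuals and adding the two fits (a "twicing"-style correction), simply to make the idea concrete:

```python
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def bias_reduced_ridge(X, y, lam):
    """Twicing-style correction: fit ridge, then fit ridge again on the residuals."""
    beta1 = ridge(X, y, lam)
    beta2 = ridge(X, y - X @ beta1, lam)
    return beta1 + beta2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
y = X @ beta_true + 0.1 * rng.normal(size=200)
lam = 50.0
print(np.linalg.norm(ridge(X, y, lam) - beta_true))               # larger error (shrinkage bias)
print(np.linalg.norm(bias_reduced_ridge(X, y, lam) - beta_true))  # smaller error after correction
```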

The convolutional neural network is a very important model in deep learning. Keeping the singular values of each layer's Jacobian bounded around $1$ during training helps avoid the exploding/vanishing gradient problem and improves the generalizability of a neural network. We propose a new penalty function for a convolutional kernel that keeps the singular values of the corresponding transformation matrix bounded around $1$, and we show how to carry out gradient-type methods with it. The penalty is imposed on the transformation matrix corresponding to a kernel, not directly on the kernel itself, which differs from existing work. This provides a new regularization method for the weights of convolutional layers. Other penalty functions on a kernel can be devised following this idea in the future.
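
A PyTorch sketch of the idea: materialize the linear transformation matrix induced by a convolution on a fixed input size, then penalize the deviation of its singular values from 1. The squared-deviation penalty, the single-channel setting, and the 8x8 input size are our own illustrative choices, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def conv_transform_matrix(kernel, size=8):
    """Matrix M with vec(conv(x)) = M @ vec(x) for a single-channel 'same' convolution."""
    n = size * size
    basis = torch.eye(n).reshape(n, 1, size, size)             # each "image" is a basis vector
    out = F.conv2d(basis, kernel, padding=kernel.shape[-1] // 2)
    return out.reshape(n, n).T                                  # column i = conv applied to e_i

def singular_value_penalty(kernel, size=8):
    """Penalize deviation of the transformation matrix's singular values from 1."""
    s = torch.linalg.svdvals(conv_transform_matrix(kernel, size))
    return ((s - 1.0) ** 2).sum()

kernel = torch.randn(1, 1, 3, 3, requires_grad=True)
penalty = singular_value_penalty(kernel)
penalty.backward()                      # gradients flow back to the kernel for gradient-type methods
print(penalty.item(), kernel.grad.shape)
```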

The development of deep neural networks (DNNs) in recent years has ushered in a brand new era of artificial intelligence. DNNs have proved excellent at solving very complex problems, e.g., visual recognition and text understanding, to the extent of competing with or even surpassing humans. Despite the inspiring success of DNNs, thorough theoretical analyses that unravel the mystery behind them are still lacking. The design of DNN structures is dominated by empirical results in terms of network depth, number of neurons and activations. A few remarkable works published recently, in an attempt to interpret DNNs, have offered first glimpses of their internal mechanisms. Nevertheless, research on how DNNs operate is still at an initial stage with plenty of room for refinement. In this paper, we extend prior research on neural networks with piecewise linear activations (PLNNs) concerning bounds on the number of linear regions. We present (i) the exact maximal number of linear regions for single-layer PLNNs; (ii) an upper bound for multi-layer PLNNs; and (iii) a tighter upper bound on the maximal number of linear regions for rectifier networks. The derived bounds also indirectly explain why deep models are more powerful than their shallow counterparts, and how the non-linearity of activation functions impacts the expressiveness of networks.
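
For a single hidden layer of piecewise linear units acting on d-dimensional inputs, the maximal number of linear regions coincides with the classical count of regions cut out by hyperplanes in general position. The snippet below computes that standard hyperplane-arrangement quantity; stating it as the single-layer result is our reading of the abstract, not a quotation of the paper:

```python
from math import comb

def max_regions_single_layer(n_units, input_dim):
    """Regions of n hyperplanes in general position in R^d: sum_{j=0}^{d} C(n, j)."""
    return sum(comb(n_units, j) for j in range(min(n_units, input_dim) + 1))

print(max_regions_single_layer(3, 2))   # 7: three lines in general position split the plane into 7 regions
print(max_regions_single_layer(10, 2))  # 56
```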

* Counting linear regions of neural networks
Multidimensional scaling is an important dimension reduction tool in statistics and machine learning. Yet few theoretical results characterizing its statistical performance exist, not to mention any in high dimensions. By considering a unified framework that includes low, moderate and high dimensions, we study multidimensional scaling in the setting of clustering noisy data. Our results suggest that, in order to achieve consistent estimation of the embedding scheme, the classical multidimensional scaling needs to be modified, especially when the noise level increases. To this end, we propose {\it modified multidimensional scaling} which applies a nonlinear transformation to the sample eigenvalues. The nonlinear transformation depends on the dimensionality, sample size and unknown moment. We show that modified multidimensional scaling followed by various clustering algorithms can achieve exact recovery, i.e., all the cluster labels can be recovered correctly with probability tending to one. Numerical simulations and two real data applications lend strong support to our proposed methodology. As a byproduct, we unify and improve existing results on the $\ell_{\infty}$ bound for eigenvectors under only low bounded moment conditions. This can be of independent interest.
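
Classical multidimensional scaling from a pairwise distance matrix, written out in NumPy. The modified version in the paper replaces the raw sample eigenvalues with a nonlinear transformation of them; here that transformation is only indicated by a placeholder hook, and the clustering data are synthetic:

```python
import numpy as np

def classical_mds(D, k=2, transform=None):
    """Embed n points into R^k from an n x n pairwise distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double centering of squared distances
    vals, vecs = np.linalg.eigh(B)
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
    if transform is not None:                     # hook for "modified" MDS:
        vals = transform(vals)                    # nonlinear adjustment of the sample eigenvalues
    return vecs * np.sqrt(np.maximum(vals, 0.0))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 50)), rng.normal(3, 1, (20, 50))])   # two noisy clusters
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(classical_mds(D, k=2).shape)   # (40, 2); the two clusters separate in the embedding
```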

* 31 pages, 4 figures
Edge features contain important information about graphs. However, current state-of-the-art neural network models designed for graph learning do not consider incorporating edge features, especially multi-dimensional edge features. In this paper, we propose an attention mechanism which combines both node features and edge features. Guided by the edge features, the attention mechanism on a pair of graph nodes depends not only on the node contents but also adjusts automatically to the properties of the edge connecting the two nodes. Moreover, the edge features are adjusted by the attention function and fed to the next layer, which means our edge features are adaptive across network layers. As a result, our proposed adaptive-edge-feature-guided graph attention model can consolidate a rich source of graph information that current state-of-the-art graph learning methods cannot. We apply our proposed model to graph node classification, and experimental results on three citation network datasets and a biological network dataset show that our method outperforms the current state-of-the-art methods, testifying to the discriminative capability of edge features and the effectiveness of our adaptive edge-feature-guided attention model. An additional ablation study further shows that both the edge features and the adaptiveness component contribute to our model.
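
A minimal NumPy sketch of one attention head whose scores depend on both node and multi-dimensional edge features. The scoring form is our own simplification of the idea, and the "adaptive" edge-feature update that feeds into the next layer is shown as a single linear map:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def edge_guided_attention(h, E, adj, Wn, We, a):
    """One layer: node features h (n x d), edge features E (n x n x f), adjacency adj."""
    n = h.shape[0]
    out = np.zeros((n, Wn.shape[1]))
    new_E = E @ We                                   # adapted edge features passed to the next layer
    for i in range(n):
        nbrs = np.flatnonzero(adj[i])
        scores = np.array([a @ np.concatenate([h[i] @ Wn, h[j] @ Wn, new_E[i, j]])
                           for j in nbrs])
        alpha = softmax(scores)                      # attention guided by node AND edge features
        out[i] = alpha @ (h[nbrs] @ Wn)
    return out, new_E

rng = np.random.default_rng(0)
n, d, f, d_out = 5, 4, 3, 6
h, E = rng.normal(size=(n, d)), rng.normal(size=(n, n, f))
adj = np.ones((n, n))                                # fully connected toy graph
Wn, We = rng.normal(size=(d, d_out)), rng.normal(size=(f, f))
a = rng.normal(size=2 * d_out + f)
print(edge_guided_attention(h, E, adj, Wn, We, a)[0].shape)  # (5, 6)
```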

Multi-Task Learning (MTL) is a learning paradigm in machine learning whose aim is to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks. In this paper, we give a survey of MTL. First, we classify different MTL algorithms into several categories, including the feature learning approach, low-rank approach, task clustering approach, task relation learning approach, and decomposition approach, and then discuss the characteristics of each approach. To further improve the performance of learning tasks, MTL can be combined with other learning paradigms including semi-supervised learning, active learning, unsupervised learning, reinforcement learning, multi-view learning and graphical models. When the number of tasks is large or the data dimensionality is high, batch MTL models have difficulty handling this situation, and online, parallel and distributed MTL models, as well as dimensionality reduction and feature hashing, are reviewed to reveal their computational and storage advantages. Many real-world applications use MTL to boost their performance, and we review representative works. Finally, we present theoretical analyses and discuss several future directions for MTL.

Machine Learning as a Service (MLaaS), such as Microsoft Azure or Amazon AWS, offers effective DNN models for completing machine learning tasks for small businesses and individuals who lack data and computing power. However, user privacy is exposed to the MLaaS server, since users need to upload their sensitive data to it. To preserve their privacy, users can encrypt their data before uploading, but this makes it difficult to run a DNN model that is not designed for the ciphertext domain. In this paper, we present a new privacy-preserving deep neural network model, called 2P-DNN, based on the Paillier homomorphic cryptosystem. This model can fulfill machine learning tasks in the ciphertext domain, so with 2P-DNN, MLaaS is able to provide a privacy-preserving machine learning service to users. We build our 2P-DNN model based on LeNet-5 and test it with the encrypted MNIST dataset. The classification accuracy is more than 97%, which is close to the accuracy of LeNet-5 on the plain MNIST dataset and higher than that of other existing privacy-preserving machine learning models.
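
A sketch of the key property such a model relies on, using the python-paillier (phe) package: Paillier ciphertexts can be added together and multiplied by plaintext scalars, so a linear layer with plaintext weights can be evaluated directly on encrypted inputs. This is not the 2P-DNN implementation itself; nonlinear activations and the full LeNet-5 pipeline need extra machinery not shown here, and the weights below are made up:

```python
from phe import paillier  # pip install phe

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

# Client side: encrypt the sensitive input before uploading it.
x = [0.5, -1.2, 3.0]
enc_x = [public_key.encrypt(v) for v in x]

# Server side (MLaaS): evaluate a plaintext-weight linear neuron on ciphertexts only.
w, b = [0.8, 0.1, -0.4], 0.25
enc_out = sum(wi * xi for wi, xi in zip(w, enc_x)) + b

# Client side: only the private-key holder can decrypt the result.
print(round(private_key.decrypt(enc_out), 6))  # 0.8*0.5 + 0.1*(-1.2) - 0.4*3.0 + 0.25 = -0.67
```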

The locations of the fiducial facial landmark points around facial components and facial contour capture the rigid and non-rigid facial deformations due to head movements and facial expressions. They are hence important for various facial analysis tasks. Many facial landmark detection algorithms have been developed to automatically detect those key points over the years, and in this paper, we perform an extensive review of them. We classify the facial landmark detection algorithms into three major categories: holistic methods, Constrained Local Model (CLM) methods, and the regression-based methods. They differ in the ways to utilize the facial appearance and shape information. The holistic methods explicitly build models to represent the global facial appearance and shape information. The CLMs explicitly leverage the global shape model but build the local appearance models. The regression-based methods implicitly capture facial shape and appearance information. For algorithms within each category, we discuss their underlying theories as well as their differences. We also compare their performances on both controlled and in the wild benchmark datasets, under varying facial expressions, head poses, and occlusion. Based on the evaluations, we point out their respective strengths and weaknesses. There is also a separate section to review the latest deep learning-based algorithms. The survey also includes a listing of the benchmark databases and existing software. Finally, we identify future research directions, including combining methods in different categories to leverage their respective strengths to solve landmark detection "in-the-wild".

* International Journal on Computer Vision, 2017