Research papers and code for "Hao Pan":
Learning with a primary objective, such as softmax cross entropy for classification and sequence generation, has been the norm for training deep neural networks for years. Although being a widely-adopted approach, using cross entropy as the primary objective exploits mostly the information from the ground-truth class for maximizing data likelihood, and largely ignores information from the complement (incorrect) classes. We argue that, in addition to the primary objective, training also using a complement objective that leverages information from the complement classes can be effective in improving model performance. This motivates us to study a new training paradigm that maximizes the likelihood of the groundtruth class while neutralizing the probabilities of the complement classes. We conduct extensive experiments on multiple tasks ranging from computer vision to natural language understanding. The experimental results confirm that, compared to the conventional training with just one primary objective, training also with the complement objective further improves the performance of the state-of-the-art models across all tasks. In addition to the accuracy improvement, we also show that models trained with both primary and complement objectives are more robust to single-step adversarial attacks.

* ICLR'19 Camera Ready
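
The two objectives described in the abstract above can be sketched in a few lines of NumPy. The primary objective is the usual cross entropy. The complement objective is taken here to be the entropy of the predicted distribution renormalized over the incorrect classes, which is largest when those classes are equally likely. The function names and the exact normalization are assumptions made for illustration, not the authors' released code.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Primary objective: mean negative log-probability of the true class."""
    idx = np.arange(probs.shape[0])
    return -np.log(probs[idx, labels] + eps).mean()

def complement_entropy(probs, labels, eps=1e-12):
    """Assumed complement objective: mean entropy over the incorrect classes.

    Each row of `probs` is renormalized over the classes other than the
    labeled one, and the entropy of that renormalized distribution is
    averaged over the batch. Higher means flatter (more "neutralized").
    """
    idx = np.arange(probs.shape[0])
    p_true = probs[idx, labels]
    comp = probs / (1.0 - p_true + eps)[:, None]  # renormalize over complement
    comp[idx, labels] = 0.0                       # drop the true class
    return -(comp * np.log(comp + eps)).sum(axis=1).mean()
```

With a true-class probability of 0.7 and three complement classes at 0.1 each, this complement entropy reaches its maximum of log(3); putting the remaining 0.3 on a single wrong class drives it to 0, while the cross entropy is the same in both cases.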
Deep learning with multi-layer neural network (NN) architectures shows superb power in modern machine learning systems. Trained deep neural networks (DNNs) are typically large. The question we address is whether it is possible to simplify the NN during the training process while achieving reasonable performance within an acceptable computational time. We present a novel approach to optimising a deep neural network through regularisation of the network architecture. We propose regularisers that support a simple mechanism for dropping neurons during the network training process. The method supports the construction of a simpler deep neural network whose performance is comparable to that of the unsimplified network. As a proof of concept, we evaluate the proposed method on examples including sparse linear regression, a deep autoencoder and a convolutional neural network. The evaluations demonstrate excellent performance. The code for this work can be found at http://www.github.com/panweihit/DropNeuron

We extend Convolutional Neural Networks (CNNs) on flat and regular domains (e.g. 2D images) to curved surfaces embedded in 3D Euclidean space that are discretized as irregular meshes and widely used to represent geometric data in Computer Vision and Graphics. We define surface convolution on tangent spaces of a surface domain, where the convolution has two desirable properties: 1) the distortion of surface domain signals is locally minimal when they are projected to the tangent space, and 2) the translation equivariance property holds locally, by aligning tangent spaces with the canonical parallel transport that preserves the metric. For computation, we rely on a parallel N-direction frame field on the surface that minimizes field variation and is therefore as compatible as possible with, and approximates, the parallel transport. On the tangent spaces equipped with parallel frames, the computation of surface convolution becomes a standard routine. The frames have rotational symmetry, which we disambiguate by constructing the covering space of the surface induced by the parallel frames and grouping the feature maps into N sets accordingly; convolution is computed on the N branches of the covering space with the respective feature maps while the kernel weights are shared. To handle irregular points of a discrete mesh while sharing kernel weights, we make the convolution semi-discrete, i.e. the convolution kernels are polynomial functions, and their convolution with discrete surface points becomes sampling and weighted summation. Pooling and unpooling operations are computed along a mesh hierarchy built through simplification. The presented surface CNNs allow effective deep learning on meshes. We show that for the tasks of classification, segmentation and non-rigid registration, surface CNNs using only raw input signals outperform previous models that use sophisticated input features.

* 10 pages, 11 figures
In this paper, we consider the machine reading task when the questions are in the form of keywords rather than natural language. In recent years, researchers have achieved significant success on machine reading comprehension tasks such as SQuAD and TriviaQA. These datasets provide a natural language question sentence and a pre-selected passage, and the goal is to answer the question according to the passage. However, when interacting with machines by means of text, people are more likely to raise a query in the form of several keywords than as a complete sentence. Keyword-based query comprehension is a new challenge, because small variations to a question may completely change its semantic information and thus yield different answers. In this paper, we propose a novel neural network system that consists of a Demand Optimization Model, based on passage-attention neural machine translation, and a Reader Model that can find the answer given the optimized question. The Demand Optimization Model optimizes the original query and outputs multiple reconstructed questions; the Reader Model then takes the new questions as input and locates the answers in the passage. To make predictions robust, an evaluation mechanism scores the reconstructed questions so that the final answer strikes a good balance between the quality of the Demand Optimization Model and that of the Reader Model. Experimental results on several datasets show that our framework significantly improves on multiple strong baselines on this challenging task.

This paper investigates a new task named Conversational Question Generation (CQG) which is to generate a question based on a passage and a conversation history (i.e., previous turns of question-answer pairs). CQG is a crucial task for developing intelligent agents that can drive question-answering style conversations or test user understanding of a given passage. Towards that end, we propose a new approach named Reinforced Dynamic Reasoning (ReDR) network, which is based on the general encoder-decoder framework but incorporates a reasoning procedure in a dynamic manner to better understand what has been asked and what to ask next about the passage. To encourage producing meaningful questions, we leverage a popular question answering (QA) model to provide feedback and fine-tune the question generator using a reinforcement learning mechanism. Empirical results on the recently released CoQA dataset demonstrate the effectiveness of our method in comparison with various baselines and model variants. Moreover, to show the applicability of our method, we also apply it to create multi-turn question-answering conversations for passages in SQuAD.

* Accepted in ACL 2019
Bionic design refers to an approach of generative creativity in which a target object (e.g. a floor lamp) is designed to contain features of biological source objects (e.g. flowers), resulting in creative biologically-inspired design. In this work, we attempt to model the process of shape-oriented bionic design as follows: given an input image of a design target object, the model generates images that 1) maintain shape features of the input design target image, 2) contain shape features of images from the specified biological source domain, 3) are plausible and diverse. We propose DesignGAN, a novel unsupervised deep generative approach to realising bionic design. Specifically, we employ a conditional Generative Adversarial Networks architecture with several designated losses (an adversarial loss, a regression loss, a cycle loss and a latent loss) that respectively constrict our model to meet the corresponding aforementioned requirements of bionic design modelling. We perform qualitative and quantitative experiments to evaluate our method, and demonstrate that our proposed approach successfully generates creative images of bionic design.

Inspired by the practical importance of social networks, economic networks, biological networks and so on, studies on large and complex networks have attracted a surge of attention in recent years. Link prediction is a fundamental issue in understanding the mechanisms by which new links are added to networks. We introduce the method of robust principal component analysis (robust PCA) into link prediction and use it to estimate the missing entries of the adjacency matrix. On the one hand, our algorithm is based on the sparsity and low-rank property of the matrix; on the other hand, it also performs very well when the network is dense. This is because a relatively dense real network is still sparse in comparison to the complete graph. Extensive experiments on real networks from disparate fields show that, when the target network is connected and sufficiently dense, whether it is weighted or unweighted, our method is very effective, with prediction accuracy considerably improved compared with many state-of-the-art algorithms.

Machine comprehension (MC) style question answering is a representative problem in natural language processing. Previous methods rarely spend time on improving the encoding layer, especially the embedding of the syntactic information and named entities of the words, which are crucial to the quality of encoding. Moreover, existing attention methods either represent each query word as a vector or use a single vector to represent the whole query sentence; neither can properly weight the key words in the query sentence. In this paper, we introduce a novel neural network architecture called Multi-layer Embedding with Memory Network (MEMEN) for the machine reading task. In the encoding layer, we apply the classic skip-gram model to the syntactic and semantic information of the words to train a new kind of embedding layer. We also propose a memory network of full-orientation matching of the query and passage to capture more pivotal information. Experiments show that our model is competitive in both precision and efficiency on the Stanford Question Answering Dataset (SQuAD) among all published results and achieves state-of-the-art results on the TriviaQA dataset.

Developing a safe and efficient collision avoidance policy for multiple robots is challenging in decentralized scenarios where each robot generates its paths without observing other robots' states and intents. While other distributed multi-robot collision avoidance systems exist, they often require extracting agent-level features to plan a local collision-free action, which can be computationally prohibitive and not robust. More importantly, in practice the performance of these methods is much lower than that of their centralized counterparts. We present a decentralized sensor-level collision avoidance policy for multi-robot systems, which directly maps raw sensor measurements to an agent's steering commands in terms of movement velocity. As a first step toward reducing the performance gap between decentralized and centralized methods, we present a multi-scenario multi-stage training framework to find an optimal policy, which is trained over a large number of robots in rich, complex environments simultaneously using a policy-gradient-based reinforcement learning algorithm. We validate the learned sensor-level collision avoidance policy in a variety of simulated scenarios with thorough performance evaluations and show that the final learned policy is able to find time-efficient, collision-free paths for a large-scale robot system. We also demonstrate that the learned policy generalizes well to new scenarios that do not appear at any point during training, including navigating a heterogeneous group of robots and a large-scale scenario with 100 robots. Videos are available at https://sites.google.com/view/drlmaca

Machine Comprehension (MC) is one of the core problems in natural language processing, requiring both understanding of the natural language and knowledge about the world. Rapid progress has been made since the release of several benchmark datasets, and recently the state-of-the-art models even surpass human performance on the well-known SQuAD evaluation. In this paper, we transfer knowledge learned from machine comprehension to the sequence-to-sequence tasks to deepen the understanding of the text. We propose MacNet: a novel encoder-decoder supplementary architecture to the widely used attention-based sequence-to-sequence models. Experiments on neural machine translation (NMT) and abstractive text summarization show that our proposed framework can significantly improve the performance of the baseline models, and our method for the abstractive text summarization achieves the state-of-the-art results on the Gigaword dataset.

* Accepted In NeurIPS 2018
In critical care, intensivists are required to continuously monitor high-dimensional vital signs and lab measurements to detect and diagnose acute patient conditions. This has always been a challenging task. In this study, we propose a novel self-correcting deep learning prediction approach to address this challenge. We focus on the example of predicting acute kidney injury (AKI). Compared with existing models, our method has a number of distinct features: we utilize the accumulated data of patients in the ICU; we develop a self-correcting mechanism that feeds errors from previous predictions back into the network; and we propose a regularization method that takes into account not only the model's prediction error on the label but also its estimation errors on the input data. This mechanism is applied in both regression and classification tasks. We compared the performance of our proposed method with conventional deep learning models on two real-world clinical datasets and demonstrated that our proposed model consistently outperforms these baseline models. In particular, the proposed model achieved an area under the ROC curve of 0.893 on the MIMIC III dataset and 0.871 on the Philips eICU dataset.

This paper proposes a practical approach to addressing the limitations posed by the use of single active electrodes in applications for sleep stage classification. Electroencephalography (EEG)-based characterizations of sleep stage progression contribute to the diagnosis and monitoring of the many pathologies of sleep. Several prior reports have explored ways of automating the analysis of sleep EEG and of reducing the complexity of the data needed for reliable discrimination of sleep stages, in order to make it possible to perform sleep studies at lower cost in the home (rather than only in specialized clinical facilities). However, these reports have involved recordings from electrodes placed on the cranial vertex or occiput, which can be uncomfortable or difficult for subjects to position. Those that have utilized single EEG channels, which contain less sleep information, have shown poor classification performance. We have taken advantage of a Rectifier Neural Network for feature detection and a Long Short-Term Memory (LSTM) network for sequential data learning to optimize classification performance with single-electrode recordings. After exploring alternative electrode placements, we found a comfortable configuration of a single-channel EEG on the forehead and have shown that it can be integrated with additional electrodes for simultaneous recording of the electrooculogram (EOG). Evaluation of data from 62 people (with 494 hours of sleep) demonstrated better performance of our analytical algorithm for automated sleep classification than existing approaches using vertex or occipital electrode placements. Use of this recording configuration with neural network deconvolution promises to make clinically indicated home sleep studies practical.

* This article has been published in IEEE Transactions on Neural Systems and Rehabilitation Engineering
Combining deep neural networks with reinforcement learning has shown great potential for next-generation intelligent control. However, there are challenges in terms of safety and cost in practical applications. In this paper, we propose the Intervention Aided Reinforcement Learning (IARL) framework, which utilizes human-intervened robot-environment interaction to improve the policy. We use an Unmanned Aerial Vehicle (UAV) as the test platform and build neural networks as our policy to map sensor readings to control signals on the UAV. Our experiment scenarios cover both simulation and reality. We show that our approach substantially reduces human intervention and improves performance in autonomous navigation, while ensuring safety and keeping the training cost acceptable.

* Wang, F., Zhou, B., Chen, K., Fan, T., Zhang, X., Li, J., ... & Pan, J. (2018, October). Intervention Aided Reinforcement Learning for Safe and Practical Policy Optimization in Navigation. In Conference on Robot Learning (pp. 410-421)
Model robustness has been an important issue, since adding small adversarial perturbations to images is sufficient to drive model accuracy down to nearly zero. In this paper, we propose a new training objective, "Guided Complement Entropy" (GCE), that has two desirable effects: (a) neutralizing the predicted probabilities of incorrect classes, and (b) maximizing the predicted probability of the ground-truth class, particularly when (a) is achieved. Training with GCE encourages models to learn latent representations where samples of different classes form distinct clusters, which, we argue, improves model robustness against adversarial perturbations. Furthermore, compared with state-of-the-art models trained with cross-entropy, the same models trained with GCE achieve significant improvements in robustness against white-box adversarial attacks, both with and without adversarial training. When no attack is present, training with GCE also outperforms cross-entropy in terms of model accuracy.
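
One way to read effects (a) and (b) in the abstract above is as a complement-entropy term scaled by a power of the predicted ground-truth probability. The sketch below follows that reading. The function name, the exponent `alpha=0.2`, and the normalization by log(K - 1) are assumptions chosen for illustration; the published loss may differ in detail.

```python
import numpy as np

def guided_complement_entropy(probs, labels, alpha=0.2, eps=1e-12):
    """Illustrative guided complement entropy loss (lower is better).

    The entropy of the distribution over incorrect classes is normalized to
    [0, 1], weighted by p_true ** alpha, and negated, so the loss is smallest
    when the true class is confident AND the wrong classes are flat.
    """
    n, k = probs.shape
    idx = np.arange(n)
    p_true = probs[idx, labels]
    comp = probs / (1.0 - p_true + eps)[:, None]  # renormalize over complement
    comp[idx, labels] = 0.0                       # drop the true class
    entropy = -(comp * np.log(comp + eps)).sum(axis=1)
    normalized = entropy / np.log(k - 1)          # 1.0 when complement is flat
    return -((p_true ** alpha) * normalized).mean()
```

Under these assumptions the loss lies between -1 and 0. The guiding factor means that flattening the incorrect classes is rewarded more strongly once the ground-truth probability is already high.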

Robots that autonomously manipulate objects within warehouses have the potential to shorten the package delivery time and improve the efficiency of the e-commerce industry. In this paper, we present a robotic system that is capable of both picking and placing general objects in warehouse scenarios. Given a target object, the robot autonomously detects it from a shelf or a table and estimates its full 6D pose. With this pose information, the robot picks the object using its gripper, and then places it into a container or at a specified location. We describe our pick-and-place system in detail while highlighting our design principles for the warehouse settings, including the perception method that leverages knowledge about its workspace, three grippers designed to handle a large variety of different objects in terms of shape, weight and material, and grasp planning in cluttered scenarios. We also present extensive experiments to evaluate the performance of our picking system and demonstrate that the robot is competent to accomplish various tasks in warehouse settings, such as picking a target item from a tight space, grasping different objects from the shelf, and performing pick-and-place tasks on the table.

* 10 pages, 10 figures
Segments that span contiguous parts of inputs, such as phonemes in speech, named entities in sentences, and actions in videos, occur frequently in sequence prediction problems. Segmental models, a class of models that explicitly hypothesize segments, have allowed the exploration of rich segment features for sequence prediction. However, segmental models suffer from slow decoding, hampering the use of computationally expensive features. In this thesis, we introduce discriminative segmental cascades, a multi-pass inference framework that allows us to improve accuracy by adding higher-order features and neural segmental features while maintaining efficiency. We also show that instead of including more features to obtain better accuracy, segmental cascades can be used to speed up training and decoding. Segmental models, similarly to conventional speech recognizers, are typically trained in multiple stages. In the first stage, a frame classifier is trained with manual alignments, and then in the second stage, segmental models are trained with manual alignments and the outputs of the frame classifier. However, obtaining manual alignments is time-consuming and expensive. We explore end-to-end training for segmental models with various loss functions, and show how end-to-end training with marginal log loss can eliminate the need for detailed manual alignments. We draw the connections between the marginal log loss and a popular end-to-end training approach called connectionist temporal classification. We present a unifying framework for various end-to-end graph search-based models, such as hidden Markov models, connectionist temporal classification, and segmental models. Finally, we discuss possible extensions of segmental models to large-vocabulary sequence prediction tasks.

* Thesis
Field Programmable Gate Arrays (FPGAs) play an increasingly important role in the data sampling and processing industries due to their highly parallel architecture, low power consumption, and flexibility in custom algorithms. In the artificial intelligence field especially, training and implementing neural networks and machine learning algorithms places heavy demand on energy-efficient hardware implementations and massively parallel computing capacity. Therefore, many global companies have applied FPGAs to AI and machine learning fields such as autonomous driving and Automatic Spoken Language Recognition (Baidu) [1] [2] and Bing search (Microsoft) [3]. Considering the great potential of FPGAs in these fields, we aim to implement a general neural network hardware architecture on the XILINX ZU9CG System On Chip (SOC) platform [4], which offers abundant hardware resources and powerful processing capacity. The general neural network architecture on the FPGA SOC platform can perform the forward and backward algorithms of deep neural networks (DNNs) with high performance and can easily be adjusted according to the type and scale of the neural networks.

The "digital Michelangelo project" was a seminal computer vision project in the early 2000s that pushed the capabilities of acquisition systems and involved multiple people from diverse fields, many of whom are now leaders in industry and academia. Reviewing this project with modern eyes provides us with the opportunity to reflect on several issues, as relevant now as then to the field of computer vision and to research in general, that go beyond the technical aspects of the work. This article was written in the context of a reading group competition at the week-long International Computer Vision Summer School 2017 (ICVSS) in Sicily, Italy. To deepen the participants' understanding of computer vision and to foster a sense of community, various reading groups were tasked with highlighting important lessons that may be learned from the provided literature, going beyond the contents of the paper. This report is the winning entry of this guided discourse (Fig. 1). The authors closely examined the origins, fruits and, most importantly, lessons about research in general that may be distilled from the "digital Michelangelo project". Discussions leading to this report were held within the group as well as with Hao Li, the group mentor.

* 5 pages, 3 figures
In traditional neural networks for image processing, the inputs must all be the same size, such as 224*224*3. But how can we train a neural network model with inputs of different sizes? A common approach is image deformation, which brings a problem of information loss (e.g. image cropping or warping). Sequence models (RNN, LSTM, etc.) can accept inputs of different sizes, such as text and audio. But one disadvantage of sequence models is that earlier information becomes more fragmentary as it is passed along the time steps, which makes the network hard to train, especially on long sequential data. In this paper we propose a new network structure called Attention Incorporate Network (AIN). It solves the problem of differently sized inputs, including images, text and audio, and extracts the key features of the inputs with an attention mechanism, paying attention according to the importance of the features rather than the data size. Experimentally, AIN achieves higher accuracy and better convergence than other network structures of the same size.
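
The general idea of using attention to turn a variable number of input features into a fixed-size summary can be illustrated with plain attention pooling. The sketch below is generic scaled dot-product pooling, not the AIN architecture itself; the function name and the single `query` vector are hypothetical.

```python
import numpy as np

def attention_pool(features, query):
    """Pool a variable-length set of feature vectors into one vector.

    features: array of shape (n, d), where n may differ between inputs.
    query:    array of shape (d,), standing in for a learned parameter.
    Returns an array of shape (d,), independent of n.
    """
    d = features.shape[1]
    scores = features @ query / np.sqrt(d)   # one relevance score per feature
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ features                # importance-weighted average
```

Because the softmax weights sum to one, the output has the same shape whether the input has 5 feature vectors or 500, and features with higher scores contribute more regardless of how many there are.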

This paper uses supervised learning, random search and deep reinforcement learning (DRL) methods to control large signalized intersection networks. The traffic model is Cellular Automaton rule 184, which has been shown to be a parameter-free representation of traffic flow, and is the most efficient implementation of the Kinematic Wave model with triangular fundamental diagram. We are interested in the steady-state performance of the system, both spatially and temporally: we consider a homogeneous grid network inscribed on a torus, which makes the network boundary-free, and drivers choose random routes. As a benchmark we use the longest-queue-first (LQF) greedy algorithm. We find that: (i) a policy trained with supervised learning with only two examples outperforms LQF, (ii) random search is able to generate near-optimal policies, (iii) the prevailing average network occupancy during training is the major determinant of the effectiveness of DRL policies. When trained under free-flow conditions one obtains DRL policies that are optimal for all traffic conditions, but this performance deteriorates as the occupancy during training increases. For occupancies > 75% during training, DRL policies perform very poorly for all traffic conditions, which means that DRL methods cannot learn under highly congested conditions. We conjecture that DRL's inability to learn under congestion might be explained by a property of urban networks found here, whereby even a very bad policy produces an intersection throughput higher than downstream capacity. This means that the actual throughput tends to be independent of the policy. Our findings imply that it is advisable for current DRL methods in the literature to discard any congested data when training, and that doing this will improve their performance under all traffic conditions.

* 15 pages, 10 figures
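
The traffic model named in the abstract above, Cellular Automaton rule 184, is simple enough to sketch directly. The snippet below is an illustration on a single ring road, a one-dimensional stand-in for the boundary-free torus network; the function names are hypothetical, and it does not model intersections, signals, or routing.

```python
def rule184_step(cells):
    """Advance a ring road by one time step under CA rule 184.

    cells is a list of 0/1 values; 1 means the cell holds a vehicle.
    Vehicles move one cell to the right when the cell ahead is empty.
    """
    n = len(cells)
    nxt = [0] * n
    for i in range(n):
        left, center, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        stays = center and right          # vehicle blocked by the one ahead
        arrives = left and not center     # vehicle behind moves into gap
        nxt[i] = 1 if (stays or arrives) else 0
    return nxt

def flow(cells):
    """Number of vehicles that will move in the next step."""
    n = len(cells)
    return sum(1 for i in range(n) if cells[i] and not cells[(i + 1) % n])
```

For example, `rule184_step([1, 1, 0, 1, 0, 0])` gives `[1, 0, 1, 0, 1, 0]`: the first vehicle is blocked while the other two advance, and the number of vehicles is conserved.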