Models, code, and papers for "Chao Wang":
To provide a survey on the existing tasks and models in Machine Reading Comprehension (MRC), this report reviews: 1) the dataset collection and performance evaluation of some representative simple-reasoning and complex-reasoning MRC tasks; 2) the architecture designs, attention mechanisms, and performance-boosting approaches for developing neural-network-based MRC models; 3) some recently proposed transfer learning approaches to incorporating text-style knowledge contained in external corpora into the neural networks of MRC models; 4) some recently proposed knowledge base encoding approaches to incorporating graph-style knowledge contained in external knowledge bases into the neural networks of MRC models. Besides, according to what has been achieved and what are still deficient, this report also proposes some open problems for the future research.
Pectoral muscle identification is often required for breast cancer risk analysis, such as estimating breast density. Traditional methods are overwhelmingly based on manual visual assessment or straight line fitting for the pectoral muscle boundary, which are inefficient and inaccurate since pectoral muscle in mammograms can have curved boundaries. This paper proposes a novel and automatic pectoral muscle identification algorithm for MLO view mammograms. It is suitable for both scanned film and full field digital mammograms. This algorithm is demonstrated using a public domain software ImageJ. A validation of this algorithm has been performed using real-world data and it shows promising result.
The primary objective of a safe navigation algorithm is to guide the object from its current position to the target position while avoiding any collision with the en-route obstacles, and the appropriate obstacle avoidance strategies are the key factors to ensure safe navigation tasks in dynamic environments. In this report, three different obstacle avoidance strategies for safe navigation in dynamic environments have been presented. The biologically-inspired navigation algorithm (BINA) is efficient in terms of avoidance time. The equidistant based navigation algorithm (ENA) is able to achieve navigation task with in uncertain dynamic environments. The navigation algorithm algorithm based on an integrated environment representation (NAIER) allows the object to seek a safe path through obstacles in unknown dynamic environment in a human-like fashion. The performances and features of the proposed navigation algorithms are confirmed by extensive simulation results and experiments with a real non-holonomic mobile robot. The algorithms have been implemented on two real control systems: intelligent wheelchair and robotic hospital bed. The performance of the proposed algorithms with SAM and Flexbed demonstrate their capabilities to achieve navigation tasks in complicated real time scenarios. The proposed algorithms are easy to be implemented in real time and costly efficient. An extra study on networked multi-robots formation building algorithm is presented in this paper. A constructive and easy-to-implement decentralised control is proposed for a formation building of a group of random positioned objects. Furthermore, the problem of formation building with anonymous objects is addressed. This randomised decentralised navigation algorithm achieves the convergence to a desired configuration with probability 1.
Automatic head frontal-view identification is challenging due to appearance variations caused by pose changes, especially without any training samples. In this paper, we present an unsupervised algorithm for identifying frontal view among multiple facial images under various yaw poses (derived from the same person). Our approach is based on Locally Linear Embedding (LLE), with the assumption that with yaw pose being the only variable, the facial images should lie in a smooth and low dimensional manifold. We horizontally flip the facial images and present two K-nearest neighbor protocols for the original images and the flipped images, respectively. In the proposed extended LLE, for any facial image (original or flipped one), we search (1) the Ko nearest neighbors among the original facial images and (2) the Kf nearest neighbors among the flipped facial images to construct the same neighborhood graph. The extended LLE eliminates the differences (because of background, face position and scale in the whole image and some asymmetry of left-right face) between the original facial image and the flipped facial image at the same yaw pose so that the flipped facial images can be used effectively. Our approach does not need any training samples as prior information. The experimental results show that the frontal view of head can be identified reliably around the lowest point of the pose manifold for multiple facial images, especially the cropped facial images (little background and centered face).
We describe a class of systems theory based neural networks called "Network Of Recurrent neural networks" (NOR), which introduces a new structure level to RNN related models. In NOR, RNNs are viewed as the high-level neurons and are used to build the high-level layers. More specifically, we propose several methodologies to design different NOR topologies according to the theory of system evolution. Then we carry experiments on three different tasks to evaluate our implementations. Experimental results show our models outperform simple RNN remarkably under the same number of parameters, and sometimes achieve even better results than GRU and LSTM.
We present a novel approach to reconstruct RGB-D indoor scene with plane primitives. Our approach takes as input a RGB-D sequence and a dense coarse mesh reconstructed by some 3D reconstruction method on the sequence, and generate a lightweight, low-polygonal mesh with clear face textures and sharp features without losing geometry details from the original scene. To achieve this, we firstly partition the input mesh with plane primitives, simplify it into a lightweight mesh next, then optimize plane parameters, camera poses and texture colors to maximize the photometric consistency across frames, and finally optimize mesh geometry to maximize consistency between geometry and planes. Compared to existing planar reconstruction methods which only cover large planar regions in the scene, our method builds the entire scene by adaptive planes without losing geometry details and preserves sharp features in the final mesh. We demonstrate the effectiveness of our approach by applying it onto several RGB-D scans and comparing it to other state-of-the-art reconstruction methods.
We propose a novel approach to reconstruct RGB-D indoor scene based on plane primitives. Our approach takes as input a RGB-D sequence and a dense coarse mesh reconstructed from it, and generates a lightweight, low-polygonal mesh with clear face textures and sharp features without losing geometry details from the original scene. Compared to existing methods which only cover large planar regions in the scene, our method builds the entire scene by adaptive planes without losing geometry details and also preserves sharp features in the mesh. Experiments show that our method is more efficient to generate textured mesh from RGB-D data than state-of-the-arts.
To apply general knowledge to machine reading comprehension (MRC), we propose an innovative MRC approach, which consists of a WordNet-based data enrichment method and an MRC model named as Knowledge Aided Reader (KAR). The data enrichment method uses the semantic relations of WordNet to extract semantic level inter-word connections from each passage-question pair in the MRC dataset, and allows us to control the amount of the extraction results by setting a hyper-parameter. KAR uses the extraction results of the data enrichment method as explicit knowledge to assist the prediction of answer spans. According to the experimental results, the single model of KAR achieves an Exact Match (EM) of $72.4$ and an F1 Score of $81.1$ on the development set of SQuAD, and more importantly, by applying different settings in the data enrichment method to change the amount of the extraction results, there is a $2\%$ variation in the resulting performance of KAR, which implies that the explicit knowledge provided by the data enrichment method plays an effective role in the training of KAR.
As a generative model for building end-to-end dialogue systems, Hierarchical Recurrent Encoder-Decoder (HRED) consists of three layers of Gated Recurrent Unit (GRU), which from bottom to top are separately used as the word-level encoder, the sentence-level encoder, and the decoder. Despite performing well on dialogue corpora, HRED is computationally expensive to train due to its complexity. To improve the training efficiency of HRED, we propose a new model, which is named as Simplified HRED (SHRED), by making each layer of HRED except the top one simpler than its upper layer. On the one hand, we propose Scalar Gated Unit (SGU), which is a simplified variant of GRU, and use it as the sentence-level encoder. On the other hand, we use Fixed-size Ordinally-Forgetting Encoding (FOFE), which has no trainable parameter at all, as the word-level encoder. The experimental results show that compared with HRED under the same word embedding size and the same hidden state size for each layer, SHRED reduces the number of trainable parameters by 25\%--35\%, and the training time by more than 50\%, but still achieves slightly better performance.
This paper addresses the nearest neighbor search problem under inner product similarity and introduces a compact code-based approach. The idea is to approximate a vector using the composition of several elements selected from a source dictionary and to represent this vector by a short code composed of the indices of the selected elements. The inner product between a query vector and a database vector is efficiently estimated from the query vector and the short code of the database vector. We show the superior performance of the proposed group $M$-selection algorithm that selects $M$ elements from $M$ source dictionaries for vector approximation in terms of search accuracy and efficiency for compact codes of the same length via theoretical and empirical analysis. Experimental results on large-scale datasets ($1M$ and $1B$ SIFT features, $1M$ linear models and Netflix) demonstrate the superiority of the proposed approach.
As deep neural networks are increasingly being deployed in practice, their efficiency has become an important issue. While there are compression techniques for reducing the network's size, energy consumption and computational requirement, they only demonstrate empirically that there is no loss of accuracy, but lack formal guarantees of the compressed network, e.g., in the presence of adversarial examples. Existing verification techniques such as Reluplex, ReluVal, and DeepPoly provide formal guarantees, but they are designed for analyzing a single network instead of the relationship between two networks. To fill the gap, we develop a new method for differential verification of two closely related networks. Our method consists of a fast but approximate forward interval analysis pass followed by a backward pass that iteratively refines the approximation until the desired property is verified. We have two main innovations. During the forward pass, we exploit structural and behavioral similarities of the two networks to more accurately bound the difference between the output neurons of the two networks. Then in the backward pass, we leverage the gradient differences to more accurately compute the most beneficial refinement. Our experiments show that, compared to state-of-the-art verification tools, our method can achieve orders-of-magnitude speedup and prove many more properties than existing tools.
While deep learning based end-to-end automatic speech recognition (ASR) systems have greatly simplified modeling pipelines, they suffer from the data sparsity issue. In this work, we propose a self-training method with an end-to-end system for semi-supervised ASR. Starting from a Connectionist Temporal Classification (CTC) system trained on the supervised data, we iteratively generate pseudo-labels on a mini-batch of unsupervised utterances with the current model, and use the pseudo-labels to augment the supervised data for immediate model update. Our method retains the simplicity of end-to-end ASR systems, and can be seen as performing alternating optimization over a well-defined learning objective. We also perform empirical investigations of our method, regarding the effect of data augmentation, decoding beamsize for pseudo-label generation, and freshness of pseudo-labels. On a commonly used semi-supervised ASR setting with the WSJ corpus, our method gives 14.4% relative WER improvement over a carefully-trained base system with data augmentation, reducing the performance gap between the base system and the oracle system by 50%.
Biographical databases contain diverse information about individuals. Person names, birth information, career, friends, family and special achievements are some possible items in the record for an individual. The relationships between individuals, such as kinship and friendship, provide invaluable insights about hidden communities which are not directly recorded in databases. We show that some simple matrix and graph-based operations are effective for inferring relationships among individuals, and illustrate the main ideas with the China Biographical Database (CBDB).
This paper revisits the problem of analyzing multiple ratings given by different judges. Different from previous work that focuses on distilling the true labels from noisy crowdsourcing ratings, we emphasize gaining diagnostic insights into our in-house well-trained judges. We generalize the well-known DawidSkene model (Dawid & Skene, 1979) to a spectrum of probabilistic models under the same "TrueLabel + Confusion" paradigm, and show that our proposed hierarchical Bayesian model, called HybridConfusion, consistently outperforms DawidSkene on both synthetic and real-world data sets.
Practitioners often need to build ASR systems for new use cases in a short amount of time, given limited in-domain data. While recently developed end-to-end methods largely simplify the modeling pipelines, they still suffer from the data sparsity issue. In this work, we explore a few simple-to-implement techniques for building online ASR systems in an end-to-end fashion, with a small amount of transcribed data in the target domain. These techniques include data augmentation in the target domain, domain adaptation using models previously trained on a large source domain, and knowledge distillation on non-transcribed target domain data; they are applicable in real scenarios with different types of resources. Our experiments demonstrate that each technique is independently useful in the low-resource setting, and combining them yields significant improvement of the online ASR performance in the target domain.
Tie strength prediction, sometimes named weight prediction, is vital in exploring the diversity of connectivity pattern emerged in networks. Due to the fundamental significance, it has drawn much attention in the field of network analysis and mining. Some related works appeared in recent years have significantly advanced our understanding of how to predict the strong and weak ties in the social networks. However, most of the proposed approaches are scenario-aware methods heavily depending on some special contexts and even exclusively used in social networks. As a result, they are less applicable to various kinds of networks. In contrast to the prior studies, here we propose a new computational framework called Neighborhood Estimating Weight (NEW) which is purely driven by the basic structure information of the network and has the flexibility for adapting to diverse types of networks. In NEW, we design a novel index, i.e., connection inclination, to generate the representative features of the network, which is capable of capturing the actual distribution of the tie strength. In order to obtain the optimized prediction results, we also propose a parameterized regression model which approximately has a linear time complexity and thus is readily extended to the implementation in large-scale networks. The experimental results on six real-world networks demonstrate that our proposed predictive model outperforms the state of the art methods, which is powerful for predicting the missing tie strengths when only a part of the network's tie strength information is available.
Different rain models and novel network structures have been proposed to remove rain streaks from single rainy images. In this work, we bring attention to the intrinsic priors and multi-scale features of the rainy images, and develop several intrinsic loss functions to train a CNN deraining network. We first study the sparse priors of rainy images, which have been verified to preserve unbroken edges in image decomposition. However, its mathematical formulation usually leads to an intractable solution, we propose quasi-sparsity priors to decrease complexity, so that our network can be trained under the supervision of sparse properties of rainy images. Quasi-sparsity supervises network training in different gradient domain which is still ill-posed to decompose a rainy image into rain layer and background layer. We develop another $L_1$ loss based on the intrinsic low-value property of rain layer to restore image contents together with the commonly-used $L_1$ similarity loss. Multi-scale features are further explored via a multi-scale auxiliary decoding structure to show which kinds of features contribute the most to the deraining task, and the corresponding multi-scale auxiliary loss improves the deraining performance further. In our network, more efficient group convolution and feature sharing are utilized to obtain an one order of magnitude improvement in network running speed. The proposed deraining method performs favorably against state-of-the-art deraining approaches.
Ride-hailing services are growing rapidly and becoming one of the most disruptive technologies in the transportation realm. Accurate prediction of ride-hailing trip demand not only enables cities to better understand people's activity patterns, but also helps ride-hailing companies and drivers make informed decisions to reduce deadheading vehicle miles traveled, traffic congestion, and energy consumption. In this study, a convolutional neural network (CNN)-based deep learning model is proposed for multi-step ride-hailing demand prediction using the trip request data in Chengdu, China, offered by DiDi Chuxing. The CNN model is capable of accurately predicting the ride-hailing pick-up demand at each 1-km by 1-km zone in the city of Chengdu for every 10 minutes. Compared with another deep learning model based on long short-term memory, the CNN model is 30% faster for the training and predicting process. The proposed model can also be easily extended to make multi-step predictions, which would benefit the on-demand shared autonomous vehicles applications and fleet operators in terms of supply-demand rebalancing. The prediction error attenuation analysis shows that the accuracy stays acceptable as the model predicts more steps.
We analyze the global convergence of gradient descent for deep linear residual networks by proposing a new initialization: zero-asymmetric (ZAS) initialization. It is motivated by avoiding stable manifolds of saddle points. We prove that under the ZAS initialization, for an arbitrary target matrix, gradient descent converges to an $\varepsilon$-optimal point in $O(L^3 \log(1/\varepsilon))$ iterations, which scales polynomially with the network depth $L$. Our result and the $\exp(\Omega(L))$ convergence time for the standard initialization (Xavier or near-identity) [Shamir, 2018] together demonstrate the importance of the residual structure and the initialization in the optimization for deep linear neural networks, especially when $L$ is large.