Research papers and code for "Ning Zhang":
Traditional machine learning methods usually minimize a simple loss function to learn a predictive model and then evaluate the model with a complex performance measure. However, minimizing a simple loss function cannot guarantee optimal performance under that measure. In this paper, we study the problem of optimizing the complex performance measure directly to obtain a predictive model. We propose to construct a maximum likelihood model for this problem, and to learn the model parameter, we minimize a complex loss function corresponding to the desired complex performance measure. To optimize the loss function, we approximate the upper bound of the complex loss. We also propose to impose sparsity on the model parameter to obtain a sparse model. An objective is constructed by combining the upper bound of the loss function and the sparsity of the model parameter, and we develop an iterative algorithm to minimize it using the fast iterative shrinkage-thresholding algorithm framework. Experiments on optimizing three different complex performance measures (F-score, receiver operating characteristic curve, and recall-precision curve break-even point) over three real-world applications (aircraft event recognition for civil aviation safety, intrusion detection in wireless mesh networks, and image classification) show the advantages of the proposed method over state-of-the-art methods.
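
The fast iterative shrinkage-thresholding algorithm (FISTA) framework mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the smooth term here is a toy least-squares loss standing in for the paper's upper bound on the complex loss, and the names `soft_threshold`, `fista`, `grad`, `lipschitz`, and `lam` are assumptions introduced for this example.

```python
import math

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied element-wise."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]

def fista(grad, lipschitz, lam, x0, n_iter=100):
    """Minimize f(x) + lam * ||x||_1, where grad(x) is the gradient of a
    smooth f whose gradient has Lipschitz constant `lipschitz`."""
    x = list(x0)
    y = list(x0)
    t = 1.0
    for _ in range(n_iter):
        g = grad(y)
        # Gradient step on the smooth part, taken at the extrapolated point y.
        step = [yi - gi / lipschitz for yi, gi in zip(y, g)]
        # Shrinkage step enforces sparsity.
        x_new = soft_threshold(step, lam / lipschitz)
        # Momentum update that distinguishes FISTA from plain ISTA.
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        momentum = (t - 1.0) / t_new
        y = [xn + momentum * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x

# Toy smooth term (an assumption): f(x) = 0.5 * ||x - b||^2, gradient x - b.
b = [3.0, -0.5, 1.2]
x_hat = fista(lambda y: [yi - bi for yi, bi in zip(y, b)], 1.0, 1.0, [0.0, 0.0, 0.0])
```

For this toy problem the minimizer is the soft-thresholded vector `b`, so the small entry is driven exactly to zero, which is the sparsity effect the abstract relies on.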

Self-navigation, referring to automatically reaching the goal while avoiding collision with obstacles, is a fundamental skill of mobile robots. Currently, Deep Reinforcement Learning (DRL) can enable the robot to navigate in a more complex environment with less computation power compared to conventional methods. However, it is time-consuming and hard to train the robot to learn goal-reaching and obstacle-avoidance skills simultaneously using DRL-based algorithms. In this paper, two Dueling Deep Q Networks (DQN) named Goal Network and Avoidance Network are used to learn the goal-reaching and obstacle-avoidance skills individually. A novel method named danger-aware advantage composition is proposed to fuse the two networks together without any redesigning or retraining. The composed Navigation Network enables the robot to reach a goal located right behind a wall and to navigate unknown complex environments safely and quickly.
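
As background for the dueling architecture named above, the sketch below shows the standard dueling aggregation, which combines a state value and per-action advantages into Q-values. The function names are assumptions made for illustration, and the paper's danger-aware advantage composition rule is not stated in the abstract, so it is not reproduced here.

```python
def dueling_q_values(state_value, advantages):
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]

def greedy_action(q_values):
    """Index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])

# Hypothetical outputs of the value and advantage heads for three actions.
q = dueling_q_values(1.0, [1.0, 2.0, 3.0])
best = greedy_action(q)
```

Subtracting the mean advantage keeps the value and advantage streams identifiable, which is what makes the per-network advantages meaningful quantities to combine.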

* 8 pages, 12 figures
While deeper and wider neural networks are actively pushing the performance limits of various computer vision and machine learning tasks, they often require large sets of labeled data for effective training and suffer from extremely high computational complexity. In this paper, we develop a new framework for training deep neural networks on datasets with limited labeled samples using cross-network knowledge projection, which improves network performance while significantly reducing the overall computational complexity. Specifically, a large pre-trained teacher network is used to observe samples from the training data. A projection matrix is learned to project this teacher-level knowledge and its visual representations from an intermediate layer of the teacher network to an intermediate layer of a thinner and faster student network to guide and regulate its training process. Both the intermediate layers from the teacher network and the injection layers from the student network are adaptively selected during training by evaluating a joint loss function in an iterative manner. This knowledge projection framework allows us to use crucial knowledge learned by large networks to guide the training of thinner student networks, avoiding over-fitting, achieving better network performance, and significantly reducing the complexity. Extensive experimental results on benchmark datasets demonstrate that our proposed knowledge projection approach outperforms existing methods, improving accuracy by up to 4% while reducing network complexity by 4 to 10 times, which is very attractive for practical applications of deep neural networks.

Human pose estimation using deep neural networks aims to map input images with large variations into multiple body keypoints which must satisfy a set of geometric constraints and inter-dependencies imposed by the human body model. This is a very challenging nonlinear manifold learning process in a very high dimensional feature space. We believe that the deep neural network, which is inherently an algebraic computation system, is not the most efficient way to capture highly sophisticated human knowledge, for example the highly coupled geometric characteristics and interdependence between keypoints in human poses. In this work, we propose to explore how external knowledge can be effectively represented and injected into deep neural networks to guide their training process using learned projections that impose proper priors. Specifically, we use the stacked hourglass design and inception-resnet module to construct a fractal network to regress human pose images into heatmaps with no explicit graphical modeling. We encode external knowledge with visual features which are able to characterize the constraints of human body models and evaluate the fitness of intermediate network output. We then inject these external features into the neural network using a projection matrix learned using an auxiliary cost function. The effectiveness of the proposed inception-resnet module and the benefit of guided learning with knowledge projection are evaluated on two widely used benchmarks. Our approach achieves state-of-the-art performance on both datasets.

* 13 pages, 12 figures. arXiv admin note: text overlap with arXiv:1609.01743, arXiv:1702.07432, arXiv:1602.00134 by other authors
We present a deep neural network based method for the retrieval of watermarks from images of 3D printed objects. To deal with the variability of all possible 3D printing and image acquisition settings, we train the network with synthetic data. The main simulator parameters such as texture, illumination and camera position are dynamically randomized in non-realistic ways, forcing the neural network to learn the intrinsic features of the 3D printed watermarks. At the end of the pipeline, the watermark, in the form of a two-dimensional bit array, is retrieved through a series of simple image processing and statistical operations applied on the confidence map generated by the neural network. The results demonstrate that the inclusion of synthetic domain-randomized (DR) data in the training set increases the generalization power of the network, which performs better on images from previously unseen 3D printed objects. We conclude that in our application domain of information retrieval from 3D printed objects, where access to the exact CAD files of the printed objects can be assumed, one can use inexpensive synthetic data to enhance neural network training, reducing the need for the labour-intensive process of creating large amounts of hand-labelled real data or the need to generate photorealistic synthetic data.

Sliced inverse regression (SIR) is a pioneering tool for supervised dimension reduction. It identifies the effective dimension reduction space, the subspace of significant factors with intrinsic lower dimensionality. In this paper, we propose to refine the SIR algorithm through an overlapping slicing scheme. The new algorithm, called overlapping sliced inverse regression (OSIR), is able to estimate the effective dimension reduction space and determine the number of effective factors more accurately. We show that such an overlapping procedure has the potential to identify the information contained in the derivatives of the inverse regression curve, which helps to explain the superiority of OSIR. We also prove that the OSIR algorithm is $\sqrt{n}$-consistent and verify its effectiveness by simulations and real applications.
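
For orientation, the classical (non-overlapping) SIR estimator that OSIR refines can be summarized as below. The notation ($H$ slices $S_h$, slice proportions $p_h$, target dimension $d$) is introduced here for illustration and is not taken from the paper; the overlapping slicing scheme itself is not specified in the abstract and is not shown.

```latex
\begin{align*}
  z_i &= \hat{\Sigma}^{-1/2}\,(x_i - \bar{x}), \quad i = 1, \dots, n
      && \text{(standardize the predictors)} \\
  m_h &= \frac{1}{n_h} \sum_{i:\, y_i \in S_h} z_i, \qquad p_h = \frac{n_h}{n}
      && \text{(slice means over slices } S_1, \dots, S_H \text{ of } y\text{)} \\
  \hat{M} &= \sum_{h=1}^{H} p_h\, m_h m_h^{\top}
      && \text{(weighted covariance of the slice means)} \\
  \hat{\beta}_k &= \hat{\Sigma}^{-1/2}\, \hat{\eta}_k, \quad k = 1, \dots, d
      && \text{(} \hat{\eta}_k \text{: top-} d \text{ eigenvectors of } \hat{M} \text{)}
\end{align*}
```

The span of $\hat{\beta}_1, \dots, \hat{\beta}_d$ estimates the effective dimension reduction space, and the eigenvalues of $\hat{M}$ are what one inspects to choose the number of effective factors.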

Convolutional neural nets (convnets) trained from massive labeled datasets have substantially improved the state-of-the-art in image classification and object detection. However, visual understanding requires establishing correspondence on a finer level than object category. Given their large pooling regions and training from whole-image labels, it is not clear that convnets derive their success from an accurate correspondence model which could be used for precise localization. In this paper, we study the effectiveness of convnet activation features for tasks requiring correspondence. We present evidence that convnet features localize at a much finer scale than their receptive field sizes, that they can be used to perform intraclass alignment as well as conventional hand-engineered features, and that they outperform conventional features in keypoint prediction on objects from PASCAL VOC 2011.

Motivated by certain applications from physics, biochemistry, economics, and computer science, in which the objects under investigation are not accessible because of various limitations, we propose a trial-and-error model to examine algorithmic issues in such situations. Given a search problem with a hidden input, we are asked to find a valid solution; to do so, we can propose candidate solutions (trials) and use observed violations (errors) to prepare future proposals. In accordance with our motivating applications, we consider the fairly broad class of constraint satisfaction problems, and assume that errors are signaled by a verification oracle in the format of the index of a violated constraint (with the content of the constraint still hidden). Our discoveries are summarized as follows. On one hand, despite the seemingly very little information provided by the verification oracle, efficient algorithms do exist for a number of important problems. For the Nash, Core, Stable Matching, and SAT problems, the unknown-input versions are as hard as the corresponding known-input versions, up to a polynomial factor. We further give almost tight bounds on the latter two problems' trial complexities. On the other hand, there are problems whose complexities are substantially increased in the unknown-input model. In particular, no time-efficient algorithms exist (under standard hardness assumptions) for the Graph Isomorphism and Group Isomorphism problems. The tools used to achieve these results include order theory, the strong ellipsoid method, and some non-standard reductions. Our model investigates the value of information, and our results demonstrate that the lack of input information can introduce various levels of extra difficulty. The model exhibits intimate connections with (and we hope can also serve as a useful supplement to) certain existing learning and complexity theories.

We present a factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision. Specifically, we exploit the multi-scale nature of information in sequential data by formulating it explicitly within a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors to different sets of latent variables. The model is evaluated on two speech corpora to demonstrate, qualitatively, its ability to transform speakers or linguistic content by manipulating different sets of latent variables; and quantitatively, its ability to outperform an i-vector baseline for speaker verification and reduce the word error rate by as much as 35% in mismatched train/test scenarios for automatic speech recognition tasks.

* Accepted to NIPS 2017
An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to process vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous success in modeling natural images. In this paper, we apply a convolutional VAE to model the generative process of natural speech. We derive latent space arithmetic operations to disentangle learned latent representations. We demonstrate the capability of our model to modify the phonetic content or the speaker identity for speech segments using the derived operations, without the need for parallel supervisory data.
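
The abstract does not spell out the derived latent-space operations. One common form of latent arithmetic, shown here purely as an assumption, shifts a latent vector by the difference between the mean latents of two attribute groups (for example, two speakers). The helper names `mean_vector` and `shift_attribute` and the toy vectors are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length latent vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def shift_attribute(z, source_latents, target_latents):
    """Move z from the source attribute toward the target attribute:
    z' = z - mean(source) + mean(target)."""
    mu_src = mean_vector(source_latents)
    mu_tgt = mean_vector(target_latents)
    return [zi - s + t for zi, s, t in zip(z, mu_src, mu_tgt)]

# Toy 2-D latents (hypothetical values) for segments from two attribute groups.
source = [[0.0, 0.0], [2.0, 2.0]]
target = [[3.0, 1.0], [5.0, 3.0]]
z_new = shift_attribute([1.5, 0.5], source, target)
```

In a full system, the shifted latent would then be passed through the VAE decoder to synthesize the modified speech segment.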

* Interspeech 2017, pp 1273-1277
* Accepted to Interspeech 2017
Domain mismatch between training and testing can lead to significant degradation in performance in many machine learning scenarios. Unfortunately, this is not a rare situation for automatic speech recognition deployments in real-world applications. Research on robust speech recognition can be regarded as trying to overcome this domain mismatch issue. In this paper, we address the unsupervised domain adaptation problem for robust speech recognition, where both source and target domain speech are presented, but word transcripts are only available for the source domain speech. We present novel augmentation-based methods that transform speech in a way that does not change the transcripts. Specifically, we first train a variational autoencoder on both source and target domain data (without supervision) to learn a latent representation of speech. We then transform nuisance attributes of speech that are irrelevant to recognition by modifying the latent representations, in order to augment labeled training data with additional data whose distribution is more similar to the target domain. The proposed method is evaluated on the CHiME-4 dataset and reduces the absolute word error rate (WER) by as much as 35% compared to the non-adapted baseline.

* Accepted to IEEE ASRU 2017
We apply a general recurrent neural network (RNN) encoder framework to community question answering (cQA) tasks. Our approach does not rely on any linguistic processing, and can be applied to different languages or domains. Further improvements are observed when we extend the RNN encoders with a neural attention mechanism that encourages reasoning over entire sequences. To deal with practical issues such as data sparsity and imbalanced labels, we apply various techniques such as transfer learning and multitask learning. Our experiments on the SemEval-2016 cQA task show 10% improvement on a MAP score compared to an information retrieval-based approach, and achieve comparable performance to a strong handcrafted feature-based method.
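
The MAP (mean average precision) score used for evaluation above can be computed as in the following sketch. It assumes each question comes with a ranked list of binary relevance labels that contains all of its relevant answers; the function names are introduced for this example and are not from the paper or the SemEval scorer.

```python
def average_precision(relevance):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """Mean of the per-question average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two hypothetical questions with ranked candidate answers (1 = relevant).
map_score = mean_average_precision([[1, 0, 1], [0, 1]])
```

Because precision is taken at each relevant answer's rank, moving a relevant answer higher in the list raises the score even when the set of retrieved answers is unchanged.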

Dropout and other feature noising schemes have shown promising results in controlling over-fitting by artificially corrupting the training data. Though extensive theoretical and empirical studies have been performed for generalized linear models, little work has been done for support vector machines (SVMs), one of the most successful approaches for supervised learning. This paper presents dropout training for linear SVMs. To deal with the intractable expectation of the non-smooth hinge loss under corrupting distributions, we develop an iteratively re-weighted least squares (IRLS) algorithm by exploring data augmentation techniques. Our algorithm iteratively minimizes the expectation of a re-weighted least squares problem, where the re-weights have closed-form solutions. Similar ideas are applied to develop a new IRLS algorithm for the expected logistic loss under corrupting distributions. Our algorithms offer insights on the connection and difference between the hinge loss and logistic loss in dropout training. Empirical results on several real datasets demonstrate the effectiveness of dropout training in significantly boosting the classification accuracy of linear SVMs.

In recent years, the success of deep learning has carried over from discriminative models to generative models. In particular, generative adversarial networks (GANs) have facilitated a new level of performance in applications ranging from media manipulation to dataset re-generation. Despite this success, the potential risks of privacy breaches stemming from GANs are less well explored. In this paper, we focus on membership inference attacks against GANs, which have the potential to reveal information about victim models' training data. Specifically, we present the first taxonomy of membership inference attacks, which encompasses not only existing attacks but also our novel ones. We also propose the first generic attack model that can be instantiated in various settings according to the adversary's knowledge about the victim model. We complement our systematic analysis of attack vectors with a comprehensive experimental study that investigates the effectiveness of these attacks w.r.t. model type, training configurations, and attack type across three diverse application scenarios ranging from images and medical data to location data. We show consistent effectiveness in all the setups, which bridges the assumption gap and performance gap in previous studies with a complete spectrum of performance across settings. We conclude by reminding users to think carefully before publicizing any part of their models.

A general information transmission model, under an independent and identically distributed Gaussian codebook and a nearest neighbor decoding rule with processed channel output, is investigated using the performance metric of generalized mutual information. When the encoder and the decoder know the statistical channel model, it is found that the optimal channel output processing function is the conditional expectation operator, thus hinting at a potential role of regression, a classical topic in machine learning, for this model. Without utilizing the statistical channel model, a problem formulation inspired by machine learning principles is established, with suitable performance metrics introduced. A data-driven inference algorithm is proposed to solve the problem, and the effectiveness of the algorithm is validated via numerical experiments. Extensions to more general information transmission models are also discussed.

* To appear at IEEE JSAC. Code available at http://staff.ustc.edu.cn/~wenyizha/jsac19code.zip
We present a novel problem setting in zero-shot learning: zero-shot object recognition and detection in context. Contrary to traditional zero-shot learning methods, which simply infer unseen categories by transferring knowledge from objects belonging to semantically similar seen categories, we aim to understand the identity of novel objects in an image surrounded by known objects using an inter-object relation prior. Specifically, we leverage the visual context and the geometric relationships between all pairs of objects in a single image, and capture the information useful to infer unseen categories. We integrate our context-aware zero-shot learning framework into traditional zero-shot learning techniques seamlessly using a Conditional Random Field (CRF). The proposed algorithm is evaluated on both zero-shot region classification and zero-shot detection tasks. The results on the Visual Genome (VG) dataset show that our model significantly boosts performance with the additional visual context compared to traditional methods.

Accurate segmentation of retinal vessels is a basic step in diabetic retinopathy (DR) detection. Most methods based on deep convolutional neural networks (DCNNs) have small receptive fields, and hence they are unable to capture global context information from larger regions and have difficulty identifying lesions. As a result, the final segmented retinal vessels contain more noise and the classification accuracy is low. Therefore, in this paper, we propose a DCNN structure named D-Net. In the proposed D-Net, dilated convolution is used in the backbone network to obtain a larger receptive field without losing spatial resolution, so as to reduce the loss of feature information and ease the segmentation of tiny, thin vessels. The large receptive field can better distinguish between the lesion area and the blood vessel area. In the proposed Multi-Scale Information Fusion (MSIF) module, parallel convolution layers with different dilation rates are used, so that the model can obtain denser feature information and better capture retinal vessel information at different sizes. In the decoding module, skip connections are used to propagate context information to higher resolution layers, so that low-level information does not have to pass through the entire network structure. Finally, our method was verified on the DRIVE, STARE and CHASE datasets. The experimental results show that our network structure outperforms some state-of-the-art methods, such as N4-fields, U-Net, and DRIU, in terms of accuracy, sensitivity, specificity, and AUCROC. In particular, D-Net outperforms U-Net by 1.04%, 1.23% and 2.79% on the DRIVE, STARE, and CHASE datasets, respectively.
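
The claim that dilation enlarges the receptive field without reducing resolution follows from simple arithmetic, sketched below for stride-1 layers. The function names and the example layer stacks are assumptions for illustration; they are not the D-Net configuration, which the abstract does not give.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs."""
    rf = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer widens the field by (effective kernel - 1).
        rf += effective_kernel_size(kernel_size, dilation) - 1
    return rf

plain = receptive_field([(3, 1), (3, 1), (3, 1)])    # three ordinary 3x3 layers
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])  # same depth, dilations 1, 2, 4
```

With the same number of layers and parameters, the dilated stack sees a window roughly twice as wide, and because every layer has stride 1 the feature map keeps its spatial resolution.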

Multi-person human pose estimation and pose tracking in videos are both quite challenging tasks. Existing methods can be categorized into two groups: top-down and bottom-up approaches. In this paper, following the top-down approach, we aim to build a strong baseline system with three modules: a human candidate detector, a single-person pose estimator and a human pose tracker. First, we choose a generic object detector among state-of-the-art methods to detect human candidates. Then, the cascaded pyramid network is used to estimate the corresponding human pose. Finally, we use a flow-based pose tracker to perform keypoint association across frames, i.e., assigning each human candidate a unique and temporally consistent id, for multi-target pose tracking. We conduct extensive ablative experiments to validate various choices of models and configurations. We take part in two ECCV'18 PoseTrack challenges: pose estimation and pose tracking.

* To appear in ECCVW (2018). Workshop: 2nd PoseTrack Challenge
Segmentation of bitewing radiography images is currently a very challenging task. The focus of this study is to segment such images into caries, enamel, dentin, pulp, crowns, restorations and root canal treatments. The main method for semantic segmentation of bitewing radiography images at this stage is the U-shaped deep convolutional neural network, but its accuracy is low. To improve the accuracy of semantic segmentation of bitewing radiography images, this paper proposes combining a Conditional Generative Adversarial Network (cGAN) with the U-shaped network structure (U-Net). The experimental results show that the accuracy of cGAN combined with U-Net is 69.7%, which is 13.3 percentage points higher than the 56.4% accuracy of the U-shaped deep convolutional neural network alone.

* 12 pages, in Chinese
Finding visual correspondence between local features is key to many computer vision problems. While defining features with larger contextual scales usually implies greater discriminativeness, it can also reduce the spatial accuracy of the features. We propose AutoScaler, a scale-attention network to explicitly optimize this trade-off in visual correspondence tasks. Our network consists of a weight-sharing feature network to compute multi-scale feature maps and an attention network to combine them optimally in the scale space. This allows our network to have adaptive receptive field sizes over different scales of the input. The entire network is trained end-to-end in a siamese framework for visual correspondence tasks. Our method achieves favorable results compared to state-of-the-art methods on challenging optical flow and semantic matching benchmarks, including Sintel, KITTI and CUB-2011. We also show that our method can generalize to improve hand-crafted descriptors (e.g., Daisy) on general visual correspondence tasks. Finally, our attention network can generate visually interpretable scale attention maps.
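
The idea of combining multi-scale features through attention can be illustrated with a per-location weighted sum. The sketch assumes softmax-normalized attention weights over scales, which the abstract does not confirm; `softmax`, `fuse_scales`, and the toy feature values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_scales(features_per_scale, attention_logits):
    """Weighted sum of one location's feature vectors, one vector per scale."""
    weights = softmax(attention_logits)
    dim = len(features_per_scale[0])
    return [
        sum(w * f[i] for w, f in zip(weights, features_per_scale))
        for i in range(dim)
    ]

# Features of a single pixel at two scales (hypothetical values).
fine, coarse = [1.0, 0.0], [3.0, 2.0]
balanced = fuse_scales([fine, coarse], [0.0, 0.0])     # equal attention
mostly_fine = fuse_scales([fine, coarse], [10.0, 0.0]) # attention favors the fine scale
```

Plotting the per-pixel weights as an image gives the kind of scale attention map the abstract describes as visually interpretable.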
