Models, code, and papers for "Ming Lu":
Adversarial machine learning is an emerging field that focuses on studying vulnerabilities of machine learning approaches in adversarial settings and developing techniques accordingly to make learning robust to adversarial manipulations. It plays a vital role in various machine learning applications and has attracted tremendous attention across different communities recently. In this paper, we explore different adversarial scenarios in the context of quantum machine learning. We find that, similar to traditional classifiers based on classical neural networks, quantum learning systems are likewise vulnerable to crafted adversarial examples, independent of whether the input data is classical or quantum. In particular, we find that a quantum classifier that achieves nearly the state-of-the-art accuracy can be conclusively deceived by adversarial examples obtained via adding imperceptible perturbations to the original legitimate samples. This is explicitly demonstrated with quantum adversarial learning in different scenarios, including classifying real-life images (e.g., handwritten digit images in the dataset MNIST), learning phases of matter (such as, ferromagnetic/paramagnetic orders and symmetry protected topological phases), and classifying quantum data. Furthermore, we show that based on the information of the adversarial examples at hand, practical defense strategies can be designed to fight against a number of different attacks. Our results uncover the notable vulnerability of quantum machine learning systems to adversarial perturbations, which not only reveals a novel perspective in bridging machine learning and quantum physics in theory but also provides valuable guidance for practical applications of quantum classifiers based on both near-term and future quantum technologies.
Directly learning features from the point cloud has become an active research direction in 3D understanding. Existing learning-based methods usually construct local regions from the point cloud and extract the corresponding features using shared Multi-Layer Perceptron (MLP) and max pooling. However, most of these processes do not adequately take the spatial distribution of the point cloud into account, limiting the ability to perceive fine-grained patterns. We design a novel Local Spatial Attention (LSA) module to adaptively generate attention maps according to the spatial distribution of local regions. The feature learning process which integrates with these attention maps can effectively capture the local geometric structure. We further propose the Spatial Feature Extractor (SFE), which constructs a branch architecture, to aggregate the spatial information with associated features in each layer of the network better.The experiments show that our network, named LSANet, can achieve on par or better performance than the state-of-the-art methods when evaluating on the challenging benchmark datasets. The source code is available at https://github.com/LinZhuoChen/LSANet.
Dancing to music is an instinctive move by humans. Learning to model the music-to-dance generation process is, however, a challenging problem. It requires significant efforts to measure the correlation between music and dance as one needs to simultaneously consider multiple aspects, such as style and beat of both music and dance. Additionally, dance is inherently multimodal and various following movements of a pose at any moment are equally likely. In this paper, we propose a synthesis-by-analysis learning framework to generate dance from music. In the analysis phase, we decompose a dance into a series of basic dance units, through which the model learns how to move. In the synthesis phase, the model learns how to compose a dance by organizing multiple basic dancing movements seamlessly according to the input music. Experimental qualitative and quantitative results demonstrate that the proposed method can synthesize realistic, diverse,style-consistent, and beat-matching dances from music.
Image extrapolation aims at expanding the narrow field of view of a given image patch. Existing models mainly deal with natural scene images of homogeneous regions and have no control of the content generation process. In this work, we study conditional image extrapolation to synthesize new images guided by the input structured text. The text is represented as a graph to specify the objects and their spatial relation to the unknown regions of the image. Inspired by drawing techniques, we propose a progressive generative model of three stages, i.e., generating a coarse bounding-boxes layout, refining it to a finer segmentation layout, and mapping the layout to a realistic output. Such a multi-stage design is shown to facilitate the training process and generate more controllable results. We validate the effectiveness of the proposed method on the face and human clothing dataset in terms of visual results, quantitative evaluations and flexible controls.
We present a novel and easy-to-implement training framework for visual tracking. Our approach mainly focuses on learning feature embeddings in an end-to-end way, which can generalize well to the trackers based on online discriminatively trained ridge regression model. This goal is efficiently achieved by taking advantage of the following two important theories. 1) Ridge regression problem has closed-form solution and is implicit differentiation under the optimality condition. Therefore, its solver can be embedded as a layer with efficient forward and backward processes in training deep convolutional neural networks. 2) Woodbury identity can be utilized to ensure efficient solution of ridge regression problem when the high-dimensional feature embeddings are employed. Moreover, in order to address the extreme foreground-background class imbalance during training, we modify the origin shrinkage loss and then employ it as the loss function for efficient and effective training. It is worth mentioning that the above core parts of our proposed training framework are easy to be implemented with several lines of code under the current popular deep learning frameworks, thus our approach is easy to be followed. Extensive experiments on six public benchmarks, OTB2015, NFS, TrackingNet, GOT10k, VOT2018, and VOT2019, show that the proposed tracker achieves state-of-the-art performance, while running at over 30 FPS. Code will be made available.
Occlusion edge detection requires both accurate locations and context constraints of the contour. Existing CNN-based pipeline does not utilize adaptive methods to filter the noise introduced by low-level features. To address this dilemma, we propose a novel Context-constrained accurate Contour Extraction Network (CCENet). Spatial details are retained and contour-sensitive context is augmented through two extraction blocks, respectively. Then, an elaborately designed fusion module is available to integrate features, which plays a complementary role to restore details and remove clutter. Weight response of attention mechanism is eventually utilized to enhance occluded contours and suppress noise. The proposed CCENet significantly surpasses state-of-the-art methods on PIOD and BSDS ownership dataset of object edge detection and occlusion orientation detection.
Human parsing is an essential branch of semantic segmentation, which is a fine-grained semantic segmentation task to identify the constituent parts of human. The challenge of human parsing is to extract effective semantic features to resolve deformation and multi-scale variations. In this work, we proposed an end-to-end model called C-DLinkNet based on LinkNet, which contains a new module named Smooth Module to combine the multi-level features in Decoder part. C-DLinkNet is capable of producing competitive parsing performance compared with the state-of-the-art methods with smaller input sizes and no additional information, i.e., achiving mIoU=53.05 on the validation set of LIP dataset.
Recent studies have shown that imbalance ratio is not the only cause of the performance loss of a classifier in imbalanced data classification. In fact, other data factors, such as small disjuncts, noises and overlapping, also play the roles in tandem with imbalance ratio, which makes the problem difficult. Thus far, the empirical studies have demonstrated the relationship between the imbalance ratio and other data factors only. To the best of our knowledge, there is no any measurement about the extent of influence of class imbalance on the classification performance of imbalanced data. Further, it is also unknown for a dataset which data factor is actually the main barrier for classification. In this paper, we focus on Bayes optimal classifier and study the influence of class imbalance from a theoretical perspective. Accordingly, we propose an instance measure called Individual Bayes Imbalance Impact Index ($IBI^3$) and a data measure called Bayes Imbalance Impact Index ($BI^3$). $IBI^3$ and $BI^3$ reflect the extent of influence purely by the factor of imbalance in terms of each minority class sample and the whole dataset, respectively. Therefore, $IBI^3$ can be used as an instance complexity measure of imbalance and $BI^3$ is a criterion to show the degree of how imbalance deteriorates the classification. As a result, we can therefore use $BI^3$ to judge whether it is worth using imbalance recovery methods like sampling or cost-sensitive methods to recover the performance loss of a classifier. The experiments show that $IBI^3$ is highly consistent with the increase of prediction score made by the imbalance recovery methods and $BI^3$ is highly consistent with the improvement of F1 score made by the imbalance recovery methods on both synthetic and real benchmark datasets.
Inferring the laws of interaction between particles and agents in complex dynamical systems from observational data is a fundamental challenge in a wide variety of disciplines. We propose a non-parametric statistical learning approach to estimate the governing laws of distance-based interactions, with no reference or assumption about their analytical form, from data consisting trajectories of interacting agents. We demonstrate the effectiveness of our learning approach both by providing theoretical guarantees, and by testing the approach on a variety of prototypical systems in various disciplines. These systems include homogeneous and heterogeneous agents systems, ranging from particle systems in fundamental physics to agent-based systems modeling opinion dynamics under the social influence, prey-predator dynamics, flocking and swarming, and phototaxis in cell dynamics.
Instance-level human analysis is common in real-life scenarios and has multiple manifestations, such as human part segmentation, dense pose estimation, human-object interactions, etc. Models need to distinguish different human instances in the image panel and learn rich features to represent the details of each instance. In this paper, we present an end-to-end pipeline for solving the instance-level human analysis, named Parsing R-CNN. It processes a set of human instances simultaneously through comprehensive considering the characteristics of region-based approach and the appearance of a human, thus allowing representing the details of instances. Parsing R-CNN is very flexible and efficient, which is applicable to many issues in human instance analysis. Our approach outperforms all state-of-the-art methods on CIHP (Crowd Instance-level Human Parsing), MHP v2.0 (Multi-Human Parsing) and DensePose-COCO datasets. Based on the proposed Parsing R-CNN, we reach the 1st place in the COCO 2018 Challenge DensePose Estimation task. Code and models are public available.
Over the years many ellipse detection algorithms spring up and are studied broadly, while the critical issue of detecting ellipses accurately and efficiently in real-world images remains a challenge. In this paper, an accurate and efficient ellipse detector by arc-support line segments is proposed. The arc-support line segment simplifies the complicated expression of curves in an image while retains the general properties including convexity and polarity, which grounds the successful detection of ellipses. The arc-support groups are formed by iteratively and robustly linking the arc-support line segments that latently belong to a common ellipse at point statistics level. Afterward, two complementary approaches, namely, selecting the group with higher saliency to fit an ellipse, and searching all the valid paired arc-support groups, are utilized to generate the initial ellipse set, both locally and globally. In ellipse fitting step, a superposition principle for the fast ellipse fitting is developed to accelerate the process. Then, the ellipse candidates can be formulated by the hierarchical clustering of 5D parameter space of initial ellipse set. Finally, the salient ellipse candidates are selected as detections subject to the stringent and effective verification. Extensive experiments on three public datasets are implemented and our method achieves the best F-measure scores compared to the state-of-the-art methods.
Occlusion relationship reasoning demands closed contour to express the object, and orientation of each contour pixel to describe the order relationship between objects. Current CNN-based methods neglect two critical issues of the task: (1) simultaneous existence of the relevance and distinction for the two elements, i.e, occlusion edge and occlusion orientation; and (2) inadequate exploration to the orientation features. For the reasons above, we propose the Occlusion-shared and Feature-separated Network (OFNet). On one hand, considering the relevance between edge and orientation, two sub-networks are designed to share the occlusion cue. On the other hand, the whole network is split into two paths to learn the high-level semantic features separately. Moreover, a contextual feature for orientation prediction is extracted, which represents the bilateral cue of the foreground and background areas. The bilateral cue is then fused with the occlusion cue to precisely locate the object regions. Finally, a stripe convolution is designed to further aggregate features from surrounding scenes of the occlusion edge. The proposed OFNet remarkably advances the state-of-the-art approaches on PIOD and BSDS ownership dataset. The source code is available at https://github.com/buptlr/OFNet.
Neuroscientists have enjoyed much success in understanding brain functions by constructing brain connectivity networks using data collected under highly controlled experimental settings. However, these experimental settings bear little resemblance to our real-life experience in day-to-day interactions with the surroundings. To address this issue, neuroscientists have been measuring brain activity under natural viewing experiments in which the subjects are given continuous stimuli, such as watching a movie or listening to a story. The main challenge with this approach is that the measured signal consists of both the stimulus-induced signal, as well as intrinsic-neural and non-neuronal signals. By exploiting the experimental design, we propose to estimate stimulus-locked brain network by treating non-stimulus-induced signals as nuisance parameters. In many neuroscience applications, it is often important to identify brain regions that are connected to many other brain regions during cognitive process. We propose an inferential method to test whether the maximum degree of the estimated network is larger than a pre-specific number. We prove that the type I error can be controlled and that the power increases to one asymptotically. Simulation studies are conducted to assess the performance of our method. Finally, we analyze a functional magnetic resonance imaging dataset obtained under the Sherlock Holmes movie stimuli.
We consider the task of estimating the 3D orientation of an object of known category given an image of the object and a bounding box around it. Recently, CNN-based regression and classification methods have shown significant performance improvements for this task. This paper proposes a new CNN-based approach to monocular orientation estimation that advances the state of the art in four different directions. First, we take into account the Riemannian structure of the orientation space when designing regression losses and nonlinear activation functions. Second, we propose a mixed Riemannian regression and classification framework that better handles the challenging case of nearly symmetric objects. Third, we propose a data augmentation strategy that is specifically designed to capture changes in 3D orientation. Fourth, our approach leads to state-of-the-art results on the PASCAL3D+ dataset.
In this paper, we analyze the spatial information of deep features, and propose two complementary regressions for robust visual tracking. First, we propose a kernelized ridge regression model wherein the kernel value is defined as the weighted sum of similarity scores of all pairs of patches between two samples. We show that this model can be formulated as a neural network and thus can be efficiently solved. Second, we propose a fully convolutional neural network with spatially regularized kernels, through which the filter kernel corresponding to each output channel is forced to focus on a specific region of the target. Distance transform pooling is further exploited to determine the effectiveness of each output channel of the convolution layer. The outputs from the kernelized ridge regression model and the fully convolutional neural network are combined to obtain the ultimate response. Experimental results on two benchmark datasets validate the effectiveness of the proposed method.
For visual tracking, an ideal filter learned by the correlation filter (CF) method should take both discrimination and reliability information. However, existing attempts usually focus on the former one while pay less attention to reliability learning. This may make the learned filter be dominated by the unexpected salient regions on the feature map, thereby resulting in model degradation. To address this issue, we propose a novel CF-based optimization problem to jointly model the discrimination and reliability information. First, we treat the filter as the element-wise product of a base filter and a reliability term. The base filter is aimed to learn the discrimination information between the target and backgrounds, and the reliability term encourages the final filter to focus on more reliable regions. Second, we introduce a local response consistency regular term to emphasize equal contributions of different regions and avoid the tracker being dominated by unreliable regions. The proposed optimization problem can be solved using the alternating direction method and speeded up in the Fourier domain. We conduct extensive experiments on the OTB-2013, OTB-2015 and VOT-2016 datasets to evaluate the proposed tracker. Experimental results show that our tracker performs favorably against other state-of-the-art trackers.
Visual tracking addresses the problem of identifying and localizing an unknown target in a video given the target specified by a bounding box in the first frame. In this paper, we propose a dual network to better utilize features among layers for visual tracking. It is observed that features in higher layers encode semantic context while its counterparts in lower layers are sensitive to discriminative appearance. Thus we exploit the hierarchical features in different layers of a deep model and design a dual structure to obtain better feature representation from various streams, which is rarely investigated in previous work. To highlight geometric contours of the target, we integrate the hierarchical feature maps with an edge detector as the coarse prior maps to further embed local details around the target. To leverage the robustness of our dual network, we train it with random patches measuring the similarities between the network activation and target appearance, which serves as a regularization to enforce the dual network to focus on target object. The proposed dual network is updated online in a unique manner based on the observation that the target being tracked in consecutive frames should share more similar feature representations than those in the surrounding background. It is also found that for a target object, the prior maps can help further enhance performance by passing message into the output maps of the dual network. Therefore, an independent component analysis with reference algorithm (ICA-R) is employed to extract target context using prior maps as guidance. Online tracking is conducted by maximizing the posterior estimate on the final maps with stochastic and periodic update. Quantitative and qualitative evaluations on two large-scale benchmark data sets show that the proposed algorithm performs favourably against the state-of-the-arts.
The Bidirectional LSTM (BLSTM) RNN based speech synthesis system is among the best parametric Text-to-Speech (TTS) systems in terms of the naturalness of generated speech, especially the naturalness in prosody. However, the model complexity and inference cost of BLSTM prevents its usage in many runtime applications. Meanwhile, Deep Feed-forward Sequential Memory Networks (DFSMN) has shown its consistent out-performance over BLSTM in both word error rate (WER) and the runtime computation cost in speech recognition tasks. Since speech synthesis also requires to model long-term dependencies compared to speech recognition, in this paper, we investigate the Deep-FSMN (DFSMN) in speech synthesis. Both objective and subjective experiments show that, compared with BLSTM TTS method, the DFSMN system can generate synthesized speech with comparable speech quality while drastically reduce model complexity and speech generation time.
Data cleaning consumes about 80% of the time spent on data analysis for clinical research projects. This is a much bigger problem in the era of big data and machine learning in the field of medicine where large volumes of data are being generated. We report an initial effort towards automated patient data cleaning using deep learning: the standardization of organ labeling in radiation therapy. Organs are often labeled inconsistently at different institutions (sometimes even within the same institution) and at different time periods, which poses a problem for clinical research, especially for multi-institutional collaborative clinical research where the acquired patient data is not being used effectively. We developed a convolutional neural network (CNN) to automatically identify each organ in the CT image and then label it with the standardized nomenclature presented at AAPM Task Group 263. We tested this model on the CT images of 54 patients with prostate and 100 patients with head and neck cancer who previously received radiation therapy. The model achieved 100% accuracy in detecting organs and assigning standardized labels for the patients tested. This work shows the feasibility of using deep learning in patient data cleaning that enables standardized datasets to be generated for effective intra- and interinstitutional collaborative clinical research.
Networked video applications, e.g., video conferencing, often suffer from poor visual quality due to unexpected network fluctuation and limited bandwidth. In this paper, we have developed a Quality Enhancement Network (QENet) to reduce the video compression artifacts, leveraging the spatial and temporal priors generated by respective multi-scale convolutions spatially and warped temporal predictions in a recurrent fashion temporally. We have integrated this QENet as a standard-alone post-processing subsystem to the High Efficiency Video Coding (HEVC) compliant decoder. Experimental results show that our QENet demonstrates the state-of-the-art performance against default in-loop filters in HEVC and other deep learning based methods with noticeable objective gains in Peak-Signal-to-Noise Ratio (PSNR) and subjective gains visually.