Research papers and code for "Peng Wang":
Recovering the occlusion relationships between objects is a fundamental human visual ability which yields important information about the 3D world. In this paper we propose a deep network architecture, called DOC, which acts on a single image, detects object boundaries and estimates the border ownership (i.e. which side of the boundary is foreground and which is background). We represent occlusion relations by a binary edge map, to indicate the object boundary, and an occlusion orientation variable which is tangential to the boundary and whose direction specifies border ownership by a left-hand rule. We train two related deep convolutional neural networks, called DOC, which exploit local and non-local image cues to estimate this representation and hence recover occlusion relations. In order to train and test DOC we construct a large-scale instance occlusion boundary dataset using PASCAL VOC images, which we call the PASCAL instance occlusion dataset (PIOD). This contains 10,000 images and hence is two orders of magnitude larger than existing occlusion datasets for outdoor images. We test two variants of DOC on PIOD and on the BSDS occlusion dataset and show they outperform state-of-the-art methods. Finally, we perform numerous experiments investigating multiple settings of DOC and transfer between BSDS and PIOD, which provides more insights for further study of occlusion estimation.

* Accepted to ECCV 2016
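
As a concrete illustration of the left-hand rule described above, here is a minimal numpy sketch deriving the foreground-pointing normal from the occlusion orientation variable; the y-up coordinate convention is an assumption, not something the abstract specifies.

```python
import numpy as np

def foreground_normal(theta):
    """Given the occlusion orientation theta (the boundary tangent, in
    radians), return the unit normal pointing into the foreground.

    Under the left-hand rule, the foreground lies to the left of the
    tangent direction, i.e. the tangent rotated counter-clockwise by
    90 degrees (assuming a y-up coordinate convention)."""
    tangent = np.array([np.cos(theta), np.sin(theta)])
    return np.array([-tangent[1], tangent[0]])

# A boundary running toward +x (theta = 0) owns the region above it.
print(foreground_normal(0.0))  # -> [0. 1.]
```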
Probabilistic graphical models are graphical representations of probability distributions. Graphical models have applications in many fields, including biology, the social sciences, linguistics, and neuroscience. In this paper, we propose learning directed acyclic graphs (DAGs) via bootstrap aggregating. The proposed procedure is named DAGBag. Specifically, an ensemble of DAGs is first learned on bootstrap resamples of the data, and an aggregated DAG is then derived by minimizing its overall distance to the entire ensemble. A family of metrics based on the structural Hamming distance is defined on the space of DAGs (over a given node set) and is used for aggregation. In the high-dimensional, low-sample-size setting, the graph learned from a single data set often has an excessive number of false positive edges due to over-fitting of the noise. Aggregation overcomes over-fitting through variance reduction and thus greatly reduces false positives. We also develop an efficient implementation of the hill-climbing search algorithm for DAG learning, which makes the proposed method computationally competitive in the high-dimensional regime. The DAGBag procedure is implemented in the R package dagbag.
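
Because an edge-wise Hamming distance decomposes over edges, minimizing the total distance to the ensemble reduces to per-edge majority voting; the toy sketch below shows that step, omitting the acyclicity constraint that the actual DAGBag aggregation enforces.

```python
import numpy as np

def aggregate_dags(adj_list, threshold=0.5):
    """Aggregate an ensemble of DAG adjacency matrices (0/1, A[i, j] = 1
    meaning edge i -> j) by minimizing the total edge-wise Hamming
    distance to the ensemble. For a per-edge Hamming metric this
    reduces to majority voting: keep an edge iff its selection
    frequency exceeds `threshold`. The real DAGBag aggregation also
    enforces acyclicity, which this toy sketch omits."""
    freq = np.mean(np.stack(adj_list), axis=0)  # edge selection frequencies
    return (freq > threshold).astype(int)

# Three bootstrap DAGs on 3 nodes; only the stable edge 0 -> 1 survives.
dags = [np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]]),
        np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]]),
        np.array([[0, 1, 1], [0, 0, 0], [0, 0, 0]])]
print(aggregate_dags(dags))
```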

Text spotting in natural scene images is of great importance for many image understanding tasks. It includes two sub-tasks: text detection and recognition. In this work, we propose a unified network that simultaneously localizes and recognizes text with a single forward pass, avoiding intermediate processes such as image cropping and feature re-calculation, word separation, and character grouping. In contrast to existing approaches that consider text detection and recognition as two distinct tasks and tackle them one by one, the proposed framework settles these two tasks concurrently. The whole framework can be trained end-to-end and is able to handle text of arbitrary shapes. The convolutional features are calculated only once and shared by both detection and recognition modules. Through multi-task training, the learned features become more discriminative and improve the overall performance. By employing a 2D attention model in word recognition, the irregularity of text can be robustly addressed. It provides the spatial location of each character, which not only helps local feature extraction in word recognition, but also indicates an orientation angle with which to refine text localization. Our proposed method has achieved state-of-the-art performance on several standard text spotting benchmarks, including both regular and irregular ones.

* An extension of the work "Towards End-to-end Text Spotting with Convolutional Recurrent Neural Networks", Proc. Int. Conf. Comp. Vision 2017
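
A minimal numpy sketch of one additive 2D attention step over a convolutional feature map, in the spirit described above; the shapes, parameter names, and additive scoring form are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

def attend_2d(features, query, W_f, W_q, v):
    """One step of additive 2D attention over a conv feature map.

    features: (H, W, C) feature map; query: (D,) decoder state.
    W_f: (C, A), W_q: (D, A), v: (A,) are learned parameters.
    Returns the attention map (H, W) and the attended feature (C,)."""
    scores = np.tanh(features @ W_f + query @ W_q) @ v   # (H, W)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                 # softmax over H*W
    glimpse = (features * alpha[..., None]).sum(axis=(0, 1))
    # The argmax of alpha gives a character's spatial location, which
    # can also be fed back to refine the text bounding box.
    return alpha, glimpse

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 32, 64)); q = rng.normal(size=128)
alpha, g = attend_2d(F, q, rng.normal(size=(64, 32)),
                     rng.normal(size=(128, 32)), rng.normal(size=32))
print(alpha.shape, g.shape)  # (8, 32) (64,)
```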
The widespread adoption of vehicles has made people's lives easier over the last decades. However, the emergence of a large number of vehicles poses the critical but challenging problem of vehicle re-identification (reID). To date, for most vehicle reID algorithms, both the training and testing processes are conducted on the same annotated datasets under supervision. However, even a well-trained model will suffer a severe performance drop due to the domain bias between the training dataset and real-world scenes. To address this problem, this paper proposes a domain adaptation framework for vehicle reID (DAVR), which narrows the cross-domain bias by fully exploiting the labeled data from the source domain to adapt to the target domain. DAVR develops an image-to-image translation network named Dual-branch Adversarial Network (DAN), which translates images from the (well-labeled) source domain into the style of the (unlabeled) target domain without any annotation, while preserving identity information from the source domain. The translated images, which carry more realistic target-domain styles, are then used to train the vehicle reID model through a proposed attention-based feature learning model. Through the proposed framework, the trained reID model adapts better to the various scenes encountered in real-world situations. Comprehensive experimental results demonstrate that the proposed DAVR achieves excellent performance on both the VehicleID and VeRi-776 datasets.

* arXiv admin note: substantial text overlap with arXiv:1903.07868
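
A heavily hedged sketch of the training signal described above: translated source images should fool a target-domain critic while keeping their identity embedding unchanged. All module sizes, loss forms, and the identity weight are illustrative stand-ins, not the paper's DAN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal stand-ins: a conv "generator" G that restyles source images, a
# conv "critic" D for the target domain, and a small network standing in
# for a pretrained reID embedding. Sizes and the weight 10.0 are assumptions.
G = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                  nn.Flatten(), nn.Linear(16 * 31 * 31, 1))
embed = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten())

x_src = torch.randn(4, 3, 64, 64)                # labeled source-domain images
x_fake = G(x_src)                                # restyled toward the target domain
adv = F.binary_cross_entropy_with_logits(        # translated images should fool D
    D(x_fake), torch.ones(4, 1))
ident = F.mse_loss(embed(x_fake), embed(x_src))  # keep the vehicle's identity cues
loss = adv + 10.0 * ident
loss.backward()
print(float(loss))
```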
With the development of information technology, we have witnessed an age of data explosion, which produces a large variety of data filled with redundant information. Dimension reduction is an essential tool that embeds high-dimensional data into a lower-dimensional subspace to avoid redundant information, and it has therefore attracted interest from researchers all over the world. However, when facing features from multiple views, it is difficult for most dimension reduction methods to fully exploit multi-view features and to integrate their compatible and complementary information directly into the construction of a low-dimensional subspace. Furthermore, most multi-view dimension reduction methods cannot handle features from high-dimensional nonlinear spaces. Therefore, how to construct a multi-view dimension reduction method that can deal with multi-view features from high-dimensional nonlinear spaces is of vital importance but challenging. To address this problem, we propose a novel method named Co-regularized Multi-view Sparse Reconstruction Embedding (CMSRE). By exploiting correlations of sparse reconstructions across views, CMSRE learns the local sparse structures of nonlinear manifolds from multiple views and constructs meaningful low-dimensional representations for them. Owing to the proposed co-regularized scheme, correlations of sparse reconstructions from multiple views are preserved by CMSRE as far as possible. Furthermore, sparse representation produces more meaningful correlations between features within each single view, which helps CMSRE achieve better performance. Evaluations on document classification, face recognition and image retrieval demonstrate the effectiveness of the proposed approach to multi-view dimension reduction.
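
The per-view building block can be sketched as follows, under stated assumptions: sparse reconstruction weights are computed for each sample via a Lasso over the remaining samples, and an LLE-style embedding then preserves those reconstructions. The co-regularization term that couples the per-view embeddings is omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_reconstruction_weights(X, alpha=0.05):
    """Code each sample (row of X) as a sparse combination of the other
    samples of one view. CMSRE computes such weights per view and
    co-regularizes the resulting embeddings; this shows only the
    per-view step, with an illustrative sparsity level."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]
        model = Lasso(alpha=alpha, max_iter=5000).fit(X[idx].T, X[i])
        W[i, idx] = model.coef_
    return W

def embedding_from_weights(W, dim=2):
    """LLE-style embedding preserving the sparse reconstructions: bottom
    eigenvectors of M = (I - W)^T (I - W), skipping the first,
    near-trivial one (following standard LLE practice)."""
    n = W.shape[0]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    _, vecs = np.linalg.eigh(M)
    return vecs[:, 1:dim + 1]

X = np.random.default_rng(0).normal(size=(40, 10))
Y = embedding_from_weights(sparse_reconstruction_weights(X))
print(Y.shape)  # (40, 2)
```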

With the development of multimedia technology, a sample can often be described from multiple views that contain compatible and complementary information. Most algorithms cannot take information from multiple views into consideration and fail to achieve desirable performance in many situations. For applications such as image retrieval and face recognition, an appropriate distance metric can better reflect the similarities between samples. Therefore, how to construct a distance metric learning method that can deal with multi-view data has been an important topic during the last decade. In this paper, we propose a novel algorithm named Self-weighted Multiview Metric Learning (SM2L), which addresses this task by maximizing the cross correlations between different views. Furthermore, because the views contribute differently to the learning procedure of SM2L, we adopt a self-weighted learning framework that assigns each view its own weight. Experiments on benchmark datasets verify the effectiveness of the proposed method.
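
Self-weighted schemes typically derive each view's weight from its current loss; a common closed form, sketched below, is w_v = 1/(2*sqrt(L_v)), which arises from optimizing a sum of square-rooted view losses. Whether SM2L uses exactly this form is an assumption here.

```python
import numpy as np

def self_weights(view_losses):
    """One refresh of the usual self-weighted trick: with an objective
    of the form sum_v sqrt(L_v), setting w_v = 1 / (2 * sqrt(L_v)) and
    alternating between solving the weighted problem and refreshing the
    weights lets more reliable views (smaller loss) dominate, with no
    extra hyperparameter. The square-root form is an assumption."""
    w = 1.0 / (2.0 * np.sqrt(np.asarray(view_losses) + 1e-12))
    return w / w.sum()  # normalized for readability

print(self_weights([0.2, 0.8, 1.5]))  # the cleanest view gets the largest weight
```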

Machine learning, especially deep neural networks, has developed rapidly in fields including computer vision, speech recognition and reinforcement learning. Although mini-batch SGD is one of the most popular stochastic optimization methods for training deep networks, it shows a slow convergence rate due to the large noise in its gradient approximation. In this paper, we attempt to remedy this problem by building a more efficient batch selection method based on typicality sampling, which reduces the error of gradient estimation in conventional mini-batch SGD. We analyze the convergence rate of the resulting typical batch SGD algorithm and compare its convergence properties with those of mini-batch SGD. Experimental results demonstrate that our batch selection scheme works well and that more complex mini-batch SGD variants can benefit from the proposed batch selection strategy.

* 10 pages, 4 figures, for journal
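
A loose sketch of what a typicality-based batch selector could look like: most of the batch is drawn from samples whose statistics sit near the population mean (a crude proxy for the high-density region), and the rest is sampled uniformly so the gradient estimate does not collapse onto a subset. The proxy statistic and the 70/30 split are illustrative guesses, not the paper's recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def typicality_batch(stats, batch_size, typical_frac=0.7):
    """Draw a batch dominated by 'typical' samples, proxied here by how
    close a per-sample statistic (e.g. a loss or gradient norm from a
    prior epoch) is to the population mean."""
    s = np.asarray(stats, float)
    typicality = -np.abs(s - s.mean())          # closer to the mean = more typical
    n_typ = int(typical_frac * batch_size)
    pool = np.argsort(typicality)[-(4 * n_typ):]  # most typical samples
    chosen = rng.choice(pool, size=n_typ, replace=False)
    rest = rng.choice(len(s), size=batch_size - n_typ, replace=False)
    # Note: `chosen` and `rest` may rarely overlap; a real implementation
    # would deduplicate before indexing the training set.
    return np.concatenate([chosen, rest])

stats = rng.lognormal(size=1000)  # per-sample statistics from a prior epoch
print(typicality_batch(stats, batch_size=32))
```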
Deep neural networks have recently demonstrated traffic prediction capability using time series data obtained by sensors mounted on road segments. However, capturing spatio-temporal features of traffic data often requires a significant number of parameters to train, increasing the computational burden. In this work we demonstrate that embedding topological information of the road network improves the process of learning traffic features. We use a graph of the vehicular road network with recurrent neural networks (RNNs) to infer the interaction between adjacent road segments as well as the temporal dynamics. The topology of the road network is converted into a spatio-temporal graph to form a structural RNN (SRNN). The proposed approach is validated on traffic speed data from the road network of the city of Santander in Spain. The experiments show that the graph-based method outperforms state-of-the-art methods based on spatio-temporal images while requiring far fewer parameters to train.

* Accepted and revised; to be presented at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) in May 2019
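
A compact PyTorch sketch of the structural-RNN idea: one edge RNN shared by all road connections summarizes each segment's neighbors, and one node RNN shared by all segments combines that summary with the segment's own readings, so the parameter count is independent of the network size. Layer sizes and the mean neighbor aggregation are assumptions.

```python
import torch
import torch.nn as nn

class TrafficSRNN(nn.Module):
    """Sketch of a structural RNN over a road graph; dimensions and the
    mean-aggregation of neighbor speeds are illustrative."""
    def __init__(self, hidden=32):
        super().__init__()
        self.edge_rnn = nn.GRU(2, hidden, batch_first=True)   # (own, neighbor) speeds
        self.node_rnn = nn.GRU(1 + hidden, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, 1)

    def forward(self, speeds, adj):
        # speeds: (N, T, 1) per-segment series; adj: (N, N) road topology.
        deg = adj.sum(1, keepdim=True).clamp(min=1)
        nbr = (adj @ speeds.squeeze(-1)) / deg                 # (N, T) neighbor mean
        e_in = torch.stack([speeds.squeeze(-1), nbr], dim=-1)  # (N, T, 2)
        e_out, _ = self.edge_rnn(e_in)                         # shared edge features
        n_out, _ = self.node_rnn(torch.cat([speeds, e_out], dim=-1))
        return self.readout(n_out[:, -1])                      # next-step speed

adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
x = torch.randn(3, 12, 1)                                      # 3 segments, 12 steps
print(TrafficSRNN()(x, adj).shape)                             # torch.Size([3, 1])
```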
Siamese networks have drawn great attention in visual tracking because of their balanced accuracy and speed. However, the backbone networks used in Siamese trackers are relatively shallow, such as AlexNet [18], which does not fully take advantage of the capability of modern deep neural networks. In this paper, we investigate how to leverage deeper and wider convolutional neural networks to enhance tracking robustness and accuracy. We observe that direct replacement of backbones with existing powerful architectures, such as ResNet [14] and Inception [33], does not bring improvements. The main reasons are that 1) large increases in the receptive field of neurons lead to reduced feature discriminability and localization precision; and 2) the network padding for convolutions induces a positional bias in learning. To address these issues, we propose new residual modules to eliminate the negative impact of padding, and further design new architectures using these modules with controlled receptive field size and network stride. The designed architectures are lightweight and guarantee real-time tracking speed when applied to SiamFC [2] and SiamRPN [20]. Experiments show that solely due to the proposed network architectures, our SiamFC+ and SiamRPN+ obtain up to 9.8%/5.7% (AUC), 23.3%/8.8% (EAO) and 24.4%/25.0% (EAO) relative improvements over the original versions [2, 20] on the OTB-15, VOT-16 and VOT-17 datasets, respectively.
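
The following PyTorch block sketches the padding-removal idea: the convolutions keep padding so the skip connection lines up, then the outermost ring of features, the only ones padding influences, is cropped after the residual addition. It is a sketch of the principle, not the paper's exact cropping-inside residual unit.

```python
import torch
import torch.nn as nn

class CroppedResidual(nn.Module):
    """Residual unit that discards padding-affected border features."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.body(x) + x)
        return out[:, :, 1:-1, 1:-1]   # crop the padding-contaminated ring

x = torch.randn(1, 64, 31, 31)
print(CroppedResidual(64)(x).shape)    # torch.Size([1, 64, 29, 29])
```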

Discriminating targets moving against a cluttered background is a huge challenge, let alone detecting a target as small as one or a few pixels and tracking it in flight. In the fly's visual system, a class of specific neurons, called small target motion detectors (STMDs), has been identified as showing exquisite selectivity for small target motion. Some STMDs also demonstrate directional selectivity, meaning they respond strongly only to their preferred motion direction. Directional selectivity is an important property of these STMD neurons that could contribute to tracking small targets, such as mates, in flight. However, little has been done on systematically modeling these directionally selective STMD neurons. In this paper, we propose a directionally selective STMD-based neural network (DSTMD) for small target detection in a cluttered background. In the proposed neural network, a new correlation mechanism is introduced for direction selectivity by correlating signals relayed from two pixels. Then, a lateral inhibition mechanism is implemented in the spatial field for the size selectivity of STMD neurons. Extensive experiments show that the proposed neural network not only accords with current biological findings, i.e., directional preferences, but also works reliably in detecting small targets against cluttered backgrounds.

* 14 pages, 21 figures
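
The correlation mechanism can be sketched in a few lines: the signal at one pixel is multiplied by a temporally delayed copy of the signal from a neighboring pixel along the preferred direction, so only motion in that direction aligns the two. The Gaussian test signals below are illustrative, and the lateral inhibition stage is omitted.

```python
import numpy as np

def directional_correlation(signal_a, signal_b, delay):
    """Correlate pixel A's signal with a delayed copy of pixel B's.
    Motion from B to A at the matching speed aligns the signals and
    yields a strong response; reversed motion does not."""
    delayed_b = np.concatenate([np.zeros(delay), signal_b[:-delay]])
    return signal_a * delayed_b

t = np.arange(100)
pulse = np.exp(-0.5 * ((t - 40) / 2.0) ** 2)        # target passes pixel B...
pulse_later = np.exp(-0.5 * ((t - 45) / 2.0) ** 2)  # ...then pixel A, 5 frames later
preferred = directional_correlation(pulse_later, pulse, delay=5)
null = directional_correlation(pulse, pulse_later, delay=5)
print(preferred.max(), null.max())  # strong vs. near-zero response
```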
Depth estimation from a single image is a fundamental problem in computer vision. In this paper, we propose a simple yet effective convolutional spatial propagation network (CSPN) to learn the affinity matrix for depth prediction. Specifically, we adopt an efficient linear propagation model, in which the propagation is performed as a recurrent convolutional operation and the affinity among neighboring pixels is learned through a deep convolutional neural network (CNN). We apply the designed CSPN to two depth estimation tasks given a single image: (1) refining the depth output of existing state-of-the-art (SOTA) methods; and (2) converting sparse depth samples to a dense depth map by embedding the depth samples within the propagation procedure. The second task is inspired by the availability of LiDARs, which provide sparse but accurate depth measurements. We evaluate the proposed CSPN on two popular depth estimation benchmarks, NYU v2 and KITTI, and show that our approach improves over prior SOTA methods in not only quality (e.g., 30% further reduction in depth error) but also speed (e.g., 2 to 5 times faster).

* 14 pages, 8 figures, ECCV 2018
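
A numpy sketch of one propagation step under stated assumptions: each pixel becomes an affine combination of its 8 neighbors plus itself, with the learned affinities normalized for stability, and the sparse depth measurements re-embedded after every step. The normalization details differ from the paper's exact kernel.

```python
import numpy as np

def cspn_step(h, affinity):
    """One convolutional spatial propagation step (a sketch of the idea).
    Neighbor weights are scaled so their absolute values sum to 0.9 and
    the remainder goes to the pixel itself, keeping the recursion stable."""
    norm = np.abs(affinity).sum(axis=2, keepdims=True) + 1e-8
    a = affinity / norm * 0.9                 # neighbor |weights| sum to 0.9
    self_w = 1.0 - np.abs(a).sum(axis=2)      # leftover weight for the pixel
    offsets = [(-1,-1),(-1,0),(-1,1),(0,-1),(0,1),(1,-1),(1,0),(1,1)]
    out = self_w * h
    for k, (dy, dx) in enumerate(offsets):
        # np.roll wraps at the border; a real CSPN pads instead.
        out += a[..., k] * np.roll(np.roll(h, dy, axis=0), dx, axis=1)
    return out

rng = np.random.default_rng(0)
h = rng.random((32, 32))                      # coarse depth to refine
aff = rng.normal(size=(32, 32, 8))            # would come from a CNN
mask = rng.random((32, 32)) < 0.05            # sparse LiDAR measurements
sparse = rng.random((32, 32))
for _ in range(12):                           # recurrent propagation
    h = cspn_step(h, aff)
    h[mask] = sparse[mask]                    # re-embed the sparse depths
print(h.shape)
```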
Small target motion detection is critical for insects that search for and track mates or prey, which typically appear as small dim speckles in the visual field. A class of specific neurons, called small target motion detectors (STMDs), has been characterized by exquisite sensitivity to small target motion. Understanding and analyzing the visual pathway of STMD neurons is beneficial for designing artificial visual systems for small target motion detection. Feedback loops have been widely identified in visual neural circuits and play an important role in target detection. However, whether a feedback loop exists in the STMD visual pathway, and whether such a loop could significantly improve the detection performance of STMD neurons, remains unclear. In this paper, we propose a feedback neural network for small target motion detection against naturally cluttered backgrounds. To form a feedback loop, the model output is temporally delayed and relayed to a previous neural layer as the feedback signal. Extensive experiments show a significant improvement of the proposed feedback neural network over existing STMD-based models for small target motion detection.

* 10 pages, 5 figures
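
A minimal sketch of the delayed-feedback idea: the output is delayed, scaled, and fed back against the input of an earlier stage, so sustained background responses settle to a lower level while brief small-target transients pass through largely unattenuated. The subtractive form and the gain are illustrative assumptions, not the paper's model.

```python
import numpy as np

def feedback_filter(inputs, gain=0.4, delay=3):
    """Temporal loop: subtract a delayed, scaled copy of the output
    from the input, then half-wave rectify."""
    outputs = np.zeros_like(inputs, dtype=float)
    for t in range(len(inputs)):
        fb = gain * outputs[t - delay] if t >= delay else 0.0
        outputs[t] = max(inputs[t] - fb, 0.0)
    return outputs

x = np.r_[np.zeros(10), np.ones(30), np.zeros(10)]  # sustained background edge
x[25] += 1.0                                        # brief small-target event
y = feedback_filter(x)
print(y[12], y[24], y[25])  # background decays; the transient still stands out
```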
A class of specialized neurons, called lobula plate tangential cells (LPTCs), has been shown to respond strongly to wide-field motion. The classic model, the elementary motion detector (EMD), and its improved version, the two-quadrant detector (TQD), have been proposed to simulate LPTCs. Although the EMD and TQD can perceive background motion, their outputs are so cluttered that it is difficult to discriminate the actual motion direction of the background. In this paper, we propose a max operation mechanism to model a newly found transmedullary neuron, Tm9, whose physiological properties do not map onto the EMD or TQD. The proposed max operation mechanism is able to improve the detection performance of the TQD in cluttered backgrounds by filtering out irrelevant motion signals. We demonstrate the functionality of this proposed mechanism in wide-field motion perception.

* 6 pages, 11 figures
In this work, we tackle the problem of car license plate detection and recognition in natural scene images. We propose a unified deep neural network that can localize license plates and recognize the letters simultaneously in a single forward pass, and the whole network can be trained end-to-end. In contrast to existing approaches, which treat license plate detection and recognition as two separate tasks and solve them step by step, our method jointly solves both tasks with a single network. It not only avoids intermediate error accumulation but also accelerates processing. For performance evaluation, we test on three datasets that include images captured from various scenes under different conditions. Extensive experiments show the effectiveness and efficiency of our proposed approach.

* 9 pages
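
A PyTorch sketch of the single-network layout implied above: one shared convolutional trunk, computed once per image, feeds both a detection head and a sequence-recognition head. The layer sizes, the 5-channel box encoding, and the 36-character alphabet are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PlateSpotter(nn.Module):
    """Shared trunk with two heads: box regression + objectness for
    detection, and per-column character logits (e.g. for a CTC loss)
    for recognition."""
    def __init__(self, n_chars=36):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.detect = nn.Conv2d(64, 5, 1)          # 4 box offsets + objectness
        self.recognize = nn.Sequential(            # collapse height, keep width
            nn.AdaptiveAvgPool2d((1, None)), nn.Conv2d(64, n_chars + 1, 1))

    def forward(self, x):
        f = self.trunk(x)                          # shared features, computed once
        boxes = self.detect(f)                     # (B, 5, H', W')
        chars = self.recognize(f).squeeze(2)       # (B, n_chars+1, W')
        return boxes, chars

b, c = PlateSpotter()(torch.randn(2, 3, 64, 192))
print(b.shape, c.shape)  # (2, 5, 16, 48) (2, 37, 48)
```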
In this work, we jointly address the problem of text detection and recognition in natural scene images based on convolutional recurrent neural networks. We propose a unified network that simultaneously localizes and recognizes text with a single forward pass, avoiding intermediate processes like image cropping and feature re-calculation, word separation, or character grouping. In contrast to existing approaches that consider text detection and recognition as two distinct tasks and tackle them one by one, the proposed framework settles these two tasks concurrently. The whole framework can be trained end-to-end, requiring only images, the ground-truth bounding boxes and text labels. Through end-to-end training, the learned features can be more informative, which improves the overall performance. The convolutional features are calculated only once and shared by both detection and recognition, which saves processing time. Our proposed method has achieved competitive performance on several benchmark datasets.

* 14 pages
This paper considers the problem of simultaneous 2-D room shape reconstruction and self-localization without the requirement of any pre-established infrastructure. A mobile device equipped with a co-located microphone and loudspeaker, as well as internal motion sensors, is used to emit acoustic pulses and collect echoes reflected by the walls. Using only first-order echoes, room shape recovery and self-localization are feasible when auxiliary information is obtained from the motion sensors. In particular, it is established that, using echoes collected at three measurement locations and the two distances between consecutive measurement points, unique localization and mapping can be achieved provided that the three measurement points are not collinear. Practical algorithms for room shape reconstruction and self-localization in the presence of noise and higher-order echoes are proposed, along with experimental results demonstrating the effectiveness of the proposed approach.

* Submitted to the IEEE Transactions on Audio, Speech and Language Processing
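
The core geometric step can be made concrete. Assuming the three measurement positions are already known (the paper recovers them jointly from the motion-sensor distances), each first-order echo gives the distance from a measurement point to a wall, and differencing the resulting equations yields a small linear system for the wall line:

```python
import numpy as np

def wall_from_echoes(points, dists):
    """Recover one wall from first-order echoes at three non-collinear
    measurement points (a sketch of the geometry, not the paper's full
    algorithm). With a co-located mic/loudspeaker, echo i gives the
    distance r_i from point p_i to the wall. Writing the wall as the
    line n . x = c with unit normal n and all p_i on the same side,
    n . p_i = c - r_i; differencing pairs eliminates c and leaves a
    2x2 linear system for n."""
    p, r = np.asarray(points, float), np.asarray(dists, float)
    A = np.array([p[0] - p[1], p[0] - p[2]])
    b = np.array([r[1] - r[0], r[2] - r[0]])
    n = np.linalg.solve(A, b)
    n /= np.linalg.norm(n)        # consistent measurements give |n| ~ 1
    c = np.mean(p @ n + r)        # refit the offset by least squares
    return n, c

# Ground truth: wall x = 5 (n = (1, 0), c = 5); measurements inside the room.
pts = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0)]
rs = [5.0 - x for x, _ in pts]
print(wall_from_echoes(pts, rs))  # ~ (array([1., 0.]), 5.0)
```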
In recent years, dynamically growing data and an incrementally growing number of classes have posed new challenges for large-scale data classification research. Most traditional methods struggle to balance precision and computational burden as the data and its number of classes increase: some methods have weak precision, while others are time-consuming. In this paper, we propose an incremental learning method, the heterogeneous incremental Nearest Class Mean Random Forest (hi-RF), to handle this issue. It is a heterogeneous method that adaptively either replaces trees or updates the leaves of trees in the random forest when data of new classes arrive, reducing computational time while maintaining comparable performance. Specifically, to preserve accuracy, a proportion of the trees are replaced by new NCM decision trees; to reduce the computational load, the remaining trees only have their leaf probabilities updated. Above all, out-of-bag estimation and out-of-bag boosting are proposed to balance accuracy against computational efficiency. Experiments demonstrate precision comparable to the baseline with much less computational time.

* Accepted by AIIE2016
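
A sketch of the replace-or-update policy under stated assumptions: scikit-learn trees stand in for NCM decision trees, per-leaf class counts are kept alongside the trees rather than inside them, and the out-of-bag machinery that decides which trees to replace is omitted.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def update_leaf_stats(tree, stats, X, y):
    """Cheap update for a kept tree: route new samples to their leaves
    and bump per-leaf class counts, leaving the structure untouched."""
    for leaf, label in zip(tree.apply(X), y):
        stats[(leaf, label)] = stats.get((leaf, label), 0) + 1

def hi_rf_step(forest, stats, X_new, y_new, replace_frac=0.3):
    """One incremental step: retrain a fraction of trees on the new
    data (accuracy); only refresh leaf statistics for the rest (speed)."""
    idx = rng.permutation(len(forest))
    n_replace = int(replace_frac * len(forest))
    for i in idx[:n_replace]:
        forest[i] = DecisionTreeClassifier(max_features="sqrt").fit(X_new, y_new)
        stats[i] = {}
    for i in idx[n_replace:]:
        update_leaf_stats(forest[i], stats[i], X_new, y_new)

X0, y0 = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)
forest = [DecisionTreeClassifier(max_features="sqrt").fit(X0, y0) for _ in range(10)]
stats = [dict() for _ in forest]
X1, y1 = rng.normal(size=(40, 5)) + 2.0, np.full(40, 2)  # a brand-new class arrives
hi_rf_step(forest, stats, X1, y1)
print(sum(len(s) for s in stats))
```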
This paper considers the two-stage capacitated facility location problem (TSCFLP), in which products manufactured in plants are delivered to customers via storage depots. Customer demands are satisfied subject to limited plant production and limited depot storage capacity. The objective is to determine the locations of plants and depots so as to minimize the total cost, including the fixed cost and transportation cost. A hybrid evolutionary algorithm (HEA) with genetic operations and local search is proposed. To avoid the computationally expensive evaluation of the population's fitness, the HEA uses an extreme learning machine to approximate the fitness of most individuals. Moreover, two heuristics based on the characteristics of the problem are incorporated to generate a good initial population. Computational experiments are performed on two sets of test instances from the recent literature, and the performance of the proposed algorithm is evaluated and analyzed. Compared with a state-of-the-art genetic algorithm, the proposed algorithm finds optimal or near-optimal solutions in reasonable computational time.
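
The surrogate itself is easy to sketch: an extreme learning machine is a random hidden layer followed by a ridge-regression readout, trained on (solution, fitness) pairs that were already evaluated exactly. The toy fitness function below is an illustrative stand-in for the costly TSCFLP evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=50, ridge=1e-3):
    """Train an extreme learning machine: fixed random hidden layer,
    then a closed-form ridge-regression readout."""
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

def true_fitness(x):  # toy stand-in for the expensive exact evaluation
    return x.sum(axis=1) + 0.1 * (x[:, 0] * x[:, 1])

pop = rng.integers(0, 2, size=(200, 30)).astype(float)  # binary location vectors
model = elm_fit(pop, true_fitness(pop))
print(np.abs(elm_predict(model, pop) - true_fitness(pop)).mean())
```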

Segregation is a common phenomenon that has considerable effects on material performance. To the authors' knowledge, there is still no automated, objective, quantitative indicator for segregation. To fill this gap, the segregation of particles is analyzed as follows. Edges of the particles are extracted from a digital picture, the whole picture is then split into small rectangles of the same shape, and a statistical index of the edges in each rectangle is calculated. The dispersion among the indexes of the rectangles is then evaluated as the segregation measure. The results coincide with subjective evaluations. Furthermore, the method can be implemented as an automated system, which would facilitate material quality control during the production process.
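
The pipeline above translates almost directly into code. In the sketch below, the Sobel edge extractor and the coefficient of variation used as the dispersion measure are illustrative choices, not necessarily the paper's.

```python
import numpy as np
from scipy import ndimage

def segregation_index(img, grid=(8, 8)):
    """Extract edges, tile the image into equal rectangles, compute an
    edge-density statistic per tile, and report the dispersion of that
    statistic across tiles as the segregation measure."""
    img = img.astype(float)
    edges = np.hypot(ndimage.sobel(img, 0), ndimage.sobel(img, 1))
    H, W = edges.shape
    h, w = H // grid[0], W // grid[1]
    densities = np.array([edges[i*h:(i+1)*h, j*w:(j+1)*w].mean()
                          for i in range(grid[0]) for j in range(grid[1])])
    return densities.std() / (densities.mean() + 1e-9)

img = np.zeros((128, 128))
img[:, :64] = np.random.default_rng(0).random((128, 64)) > 0.5  # particles on one side
print(segregation_index(img))  # a high value indicates strong segregation
```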
