Research papers and code for "Teng Wang":
This article describes the final solution of team monkeytyping, who finished in second place in the YouTube-8M video understanding challenge. The dataset used in this challenge is a large-scale benchmark for multi-label video classification. We extend the work in [1] and propose several improvements for frame sequence modeling. We propose a network structure called Chaining that can better capture the interactions between labels. We also report our approaches to handling multi-scale information and attention pooling. In addition, we find that using the output of a model ensemble as a side target in training can boost single-model performance. We report our experiments in bagging, boosting, cascade, and stacking, and propose a stacking algorithm called attention weighted stacking. Our final submission is an ensemble that consists of 74 sub-models, all of which are listed in the appendix.

* Submitted to the CVPR 2017 Workshop on YouTube-8M Large-Scale Video Understanding
With the rapid development of deep learning, neural networks and deep learning algorithms have been widely used in various fields, e.g., image, video and voice processing. However, neural network models are getting larger and larger, which shows in the computation their parameters require. Although researchers have devoted a wealth of effort to improving computing performance on GPU platforms, dedicated hardware solutions are essential and are emerging to provide advantages over pure software solutions. In this paper, we systematically investigate neural network accelerators based on FPGAs. Specifically, we review, in turn, accelerators designed for specific problems, specific algorithms, algorithm features, and general templates. We also compare the design and implementation of FPGA-based accelerators across different devices and network models, and compare them with CPU and GPU versions. Finally, we discuss the advantages and disadvantages of accelerators on FPGA platforms and explore opportunities for future research.

Deep learning is a main focus of artificial intelligence and has greatly impacted other fields. However, deep learning is often criticized for its lack of interpretability. Autoencoders, especially convolutional autoencoders, are a successful class of unsupervised models in deep learning and are very popular and important. Since these autoencoders need improvements and insights, in this paper we shed light on the nonlinearity of a deep convolutional autoencoder from the perspective of perfect signal recovery. In particular, we propose a new type of convolutional autoencoder, termed Soft-Autoencoder (Soft-AE), in which the activations of encoding layers are implemented with adaptable soft-thresholding units while decoding layers are realized with linear units. Consequently, Soft-AE can be naturally interpreted as a learned cascaded wavelet shrinkage system. Our denoising numerical experiments on CIFAR-10, BSD-300 and Mayo Clinical Challenge Dataset demonstrate that Soft-AE gives a competitive performance relative to its counterparts.
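As a rough illustration of the soft-thresholding units mentioned in this abstract, the sketch below applies the standard soft-thresholding (shrinkage) function to a list of coefficients. It assumes a single fixed threshold `t`; in Soft-AE the thresholds are described as adaptable, and how they are learned is not shown here. The function names are hypothetical.

```python
def soft_threshold(x: float, t: float) -> float:
    """Shrink x toward zero by t; any value with |x| <= t becomes exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def soft_threshold_all(values, t):
    """Apply the same (assumed fixed) threshold to every coefficient."""
    return [soft_threshold(v, t) for v in values]


coeffs = [-3.0, -0.5, 0.0, 0.4, 2.5]
print(soft_threshold_all(coeffs, 1.0))  # small coefficients are zeroed, large ones shrink by 1.0
```

Small coefficients are removed and large ones are kept with reduced magnitude, which is the behaviour that links this activation to wavelet shrinkage denoising.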

Synthesizing high-quality person images with arbitrary poses is challenging. In this paper, we propose a novel Multi-scale Conditional Generative Adversarial Network (MsCGAN), which aims to convert the input conditional person image to a synthetic image of any given target pose, whose appearance and texture are consistent with the input image. MsCGAN is a multi-scale adversarial network consisting of two generators and two discriminators. One generator transforms the conditional person image into a coarse image of the target pose globally, and the other enhances the detailed quality of the synthetic person image through a local reinforcement network. The outputs of the two generators are then merged into a synthetic, discriminant and high-resolution image. In addition, the synthetic image is down-sampled to multiple resolutions as the input to multi-scale discriminator networks. The proposed multi-scale generators and discriminators, which handle different levels of visual features, benefit the synthesis of high-resolution person images with realistic appearance and texture. Experiments are conducted on the Market-1501 and DeepFashion datasets to evaluate the proposed model, and both qualitative and quantitative results demonstrate superior performance of the proposed MsCGAN.

In this paper, we focus on general-purpose Distributed Stream Data Processing Systems (DSDPSs), which process unbounded streams of continuous data at scale, in a distributed manner, in real or near-real time. A fundamental problem in a DSDPS is the scheduling problem with the objective of minimizing average end-to-end tuple processing time. A widely-used solution is to distribute workload evenly over machines in the cluster in a round-robin manner, which is not efficient because it does not consider communication delay. Model-based approaches do not work well either due to the high complexity of the system environment. We aim to develop a novel model-free approach that can learn to control a DSDPS well from its experience rather than from accurate and mathematically solvable system models, just as a human learns a skill (such as cooking, driving, swimming, etc.). Specifically, we, for the first time, propose to leverage emerging Deep Reinforcement Learning (DRL) for enabling model-free control in DSDPSs, and present the design, implementation and evaluation of a novel and highly effective DRL-based control framework, which minimizes average end-to-end tuple processing time by jointly learning the system environment via collecting very limited runtime statistics data and making decisions under the guidance of powerful Deep Neural Networks. To validate and evaluate the proposed framework, we implemented it based on a widely-used DSDPS, Apache Storm, and tested it with three representative applications. Extensive experimental results show that 1) compared to Storm's default scheduler and the state-of-the-art model-based method, the proposed framework reduces average tuple processing time by 33.5% and 14.0% respectively on average; and 2) the proposed framework can quickly reach a good scheduling solution during online learning, which justifies its practicability for online control in DSDPSs.

* 14 pages, this paper has been accepted by VLDB 2018
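To make the criticism of round-robin scheduling in this abstract concrete, here is a minimal sketch of a round-robin placement that ignores communication, compared against a hand-written co-located placement. This is not Storm's scheduler code or API; the executor names, machine names, and the edge-counting cost are all hypothetical stand-ins for communication delay.

```python
def round_robin_assign(executors, machines):
    """Assign executors to machines in turn, ignoring who talks to whom."""
    return {ex: machines[i % len(machines)] for i, ex in enumerate(executors)}


def cross_machine_edges(assignment, edges):
    """Count data-flow edges whose two endpoints sit on different machines."""
    return sum(1 for src, dst in edges if assignment[src] != assignment[dst])


# Hypothetical topology: two independent spout -> bolt pipelines.
executors = ["spout-0", "bolt-0", "spout-1", "bolt-1"]
machines = ["m1", "m2"]
edges = [("spout-0", "bolt-0"), ("spout-1", "bolt-1")]

rr = round_robin_assign(executors, machines)
colocated = {"spout-0": "m1", "bolt-0": "m1", "spout-1": "m2", "bolt-1": "m2"}

print(cross_machine_edges(rr, edges))         # every edge crosses machines
print(cross_machine_edges(colocated, edges))  # no edge crosses machines
```

Both placements balance load evenly (two executors per machine), yet only the second avoids network hops, which is the kind of trade-off a learned scheduler is meant to discover.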
We present a simple and fast geometric method for modeling data by a union of affine subspaces. The method begins by forming a collection of local best-fit affine subspaces, i.e., subspaces approximating the data in local neighborhoods. The correct sizes of the local neighborhoods are determined automatically by the Jones' $\beta_2$ numbers (we prove under certain geometric conditions that our method finds the optimal local neighborhoods). The collection of subspaces is further processed by a greedy selection procedure or a spectral method to generate the final model. We discuss applications to tracking-based motion segmentation and clustering of faces under different illuminating conditions. We give extensive experimental evidence demonstrating the state of the art accuracy and speed of the suggested algorithms on these problems and also on synthetic hybrid linear data as well as the MNIST handwritten digits data; and we demonstrate how to use our algorithms for fast determination of the number of affine subspaces.

* International Journal of Computer Vision Volume 100, Issue 3 (2012), Page 217-240
* This version adds some clarifications and numerical experiments as well as strengthens the previous theorem. For face experiments, we use here the Extended Yale Face Database B (cropped faces unlike previous version). This database points to a failure mode of our algorithms, but we suggest and successfully test a workaround
The hybrid linear modeling problem is to identify a set of d-dimensional affine sets in a D-dimensional Euclidean space. It arises, for example, in object tracking and structure from motion. The hybrid linear model can be considered as the second simplest (behind linear) manifold model of data. In this paper we will present a very simple geometric method for hybrid linear modeling based on selecting a set of local best fit flats that minimize a global l1 error measure. The size of the local neighborhoods is determined automatically by the Jones' l2 beta numbers; it is proven under certain geometric conditions that good local neighborhoods exist and are found by our method. We also demonstrate how to use this algorithm for fast determination of the number of affine subspaces. We give extensive experimental evidence demonstrating the state of the art accuracy and speed of the algorithm on synthetic and real hybrid linear data.

* 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (13-18 June 2010), pp. 1927-1934
* To appear in the proceedings of CVPR 2010
Computed Tomography (CT) imaging is widely used in geological exploration, medical diagnosis and other fields. In practice, however, the resolution of CT images is usually limited by scanning devices and great expense. Super resolution (SR) methods based on deep learning have achieved surprising performance on two-dimensional (2D) images. Unfortunately, there are few effective SR algorithms for three-dimensional (3D) images. In this paper, we propose a novel network, the three-dimensional super resolution convolutional neural network (3DSRCNN), to realize voxel super resolution for CT images. To solve practical problems in the training process, such as slow convergence and insufficient memory, we use an adjustable learning rate, residual learning, gradient clipping, and momentum stochastic gradient descent (SGD) to optimize the training procedure. In addition, we explore empirical guidelines for setting an appropriate number of network layers and for using the residual learning strategy. Moreover, previous learning-based algorithms need to be trained separately for different scale factors, whereas our single model can perform multi-scale SR. Finally, our method performs better in terms of PSNR, SSIM and efficiency than conventional methods.
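For readers unfamiliar with the PSNR metric cited in this abstract, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE), computed over a flattened list of voxel intensities. The function name, the flat-list input, and the default peak value of 1.0 are assumptions made for illustration; the paper's own evaluation code is not shown in the source.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized voxel lists."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all voxels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical volumes
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] intensity scale gives roughly 20 dB.
print(psnr([0.0, 0.0], [0.1, 0.1]))
```

Higher values indicate a reconstruction closer to the reference volume.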

Neural Architecture Search (NAS) has emerged as a promising technique for automatic neural network design. However, existing NAS approaches often utilize manually designed action space, which is not directly related to the performance metric to be optimized (e.g., accuracy). As a result, using manually designed action space to perform NAS often leads to sample-inefficient explorations of architectures and thus can be sub-optimal. In order to improve sample efficiency, this paper proposes Latent Action Neural Architecture Search (LaNAS) that learns the action space to recursively partition the architecture search space into regions, each with concentrated performance metrics (i.e., low variance). During the search phase, as different architecture search action sequences lead to regions of different performance, the search efficiency can be significantly improved by biasing towards the regions with good performance. On the largest NAS dataset NasBench-101, our experimental results demonstrated that LaNAS is 22x, 14.6x and 12.4x more sample-efficient than random search, regularized evolution, and Monte Carlo Tree Search (MCTS) respectively. When applied to the open domain, LaNAS finds an architecture that achieves SoTA 98.0% accuracy on CIFAR-10 and 75.0% top-1 accuracy on ImageNet (mobile setting), after exploring only 6,000 architectures.

Binary image segmentation plays an important role in computer vision and has been widely used in many applications such as image and video editing, object extraction, and photo composition. In this paper, we propose a novel interactive binary image segmentation method based on the Markov Random Field (MRF) framework and the fast bilateral solver (FBS) technique. Specifically, we employ the geodesic distance component to build the unary term. To ensure both computation efficiency and effective responsiveness for interactive segmentation, superpixels are used in computing geodesic distances instead of pixels. Furthermore, we take a bilateral affinity approach for the pairwise term in order to preserve edge information and denoise. Through the alternating direction strategy, the MRF energy minimization problem is divided into two subproblems, which then can be easily solved by steepest gradient descent (SGD) and FBS respectively. Experimental results on the VGG interactive image segmentation dataset show that the proposed algorithm outperforms several state-of-the-art ones, and in particular, it can achieve satisfactory edge-smooth segmentation results even when the foreground and background color appearances are quite indistinctive.
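The geodesic distances over superpixels described in this abstract can be pictured as shortest-path distances on a superpixel adjacency graph. The sketch below uses Dijkstra's algorithm from a set of user-marked seed superpixels. The graph layout, the edge weights (imagined here as appearance differences between neighbouring superpixels), and the final nearest-seed comparison are illustrative assumptions; the paper's actual unary term and the MRF/FBS optimization are not reproduced.

```python
import heapq


def geodesic_distances(graph, seeds):
    """Multi-source Dijkstra: distance from each node to its nearest seed.

    graph maps a superpixel id to a list of (neighbour id, edge weight) pairs.
    """
    dist = {node: float("inf") for node in graph}
    heap = []
    for s in seeds:
        dist[s] = 0.0
        heapq.heappush(heap, (0.0, s))
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical chain of four superpixels with a strong edge between 1 and 2.
graph = {
    0: [(1, 1.0)],
    1: [(0, 1.0), (2, 9.0)],
    2: [(1, 9.0), (3, 1.0)],
    3: [(2, 1.0)],
}
to_fg = geodesic_distances(graph, seeds=[0])  # user marked superpixel 0 as foreground
to_bg = geodesic_distances(graph, seeds=[3])  # user marked superpixel 3 as background

# Superpixels that are geodesically closer to the foreground seed.
print([n for n in graph if to_fg[n] < to_bg[n]])
```

Working on a few hundred superpixels instead of every pixel keeps the graph small, which is what makes this step fast enough for interactive use.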

Data augmentation is usually used by supervised learning approaches for offline writer identification, but such approaches require extra training data and potentially lead to overfitting errors. In this study, a semi-supervised feature learning pipeline was proposed to improve the performance of writer identification by training with extra unlabeled data and the original labeled data simultaneously. Specifically, we proposed a weighted label smoothing regularization (WLSR) method for data augmentation, which assigned the weighted uniform label distribution to the extra unlabeled data. The WLSR method could regularize the convolutional neural network (CNN) baseline to allow more discriminative features to be learned to represent the properties of different writing styles. The experimental results on well-known benchmark datasets (ICDAR2013 and CVL) showed that our proposed semi-supervised feature learning approach could significantly improve the baseline measurement and perform competitively with existing writer identification approaches. Our findings provide new insights into offline writer identification.

* This manuscript is being submitted to Information Science
Unmanned Aerial Vehicles (UAVs) have been implemented for environmental monitoring by using their capabilities of mobile sensing, autonomous navigation, and remote operation. However, in real-world applications, the limitations of on-board resources (e.g., power supply) of UAVs will constrain the coverage of the monitored area and the number of the acquired samples, which will hinder the performance of field estimation and mapping. Therefore, the issue of constrained resources calls for an efficient sampling planner to schedule UAV-based sensing tasks in environmental monitoring. This paper presents a mission planner of coverage sampling and path planning for a UAV-enabled mobile sensor to effectively explore and map an unknown environment that is modeled as a random field. The proposed planner can generate a coverage path with an optimal coverage density for exploratory sampling, and the associated energy cost is subjected to a power supply constraint. The performance of the developed framework is evaluated and compared with the existing state-of-the-art algorithms, using a real-world dataset that is collected from an environmental monitoring program as well as physical field experiments. The experimental results illustrate the reliability and accuracy of the presented coverage sampling planner in a prior survey for environmental exploration and field mapping.

Automated scene analysis has been a topic of great interest in computer vision and cognitive science. Recently, with the growth of crowd phenomena in the real world, crowded scene analysis has attracted much attention. However, the visual occlusions and ambiguities in crowded scenes, as well as the complex behaviors and scene semantics, make the analysis a challenging task. In the past few years, an increasing number of works on crowded scene analysis have been reported, covering different aspects including crowd motion pattern learning, crowd behavior and activity analysis, and anomaly detection in crowds. This paper surveys the state-of-the-art techniques on this topic. We first provide the background knowledge and the available features related to crowded scenes. Then, existing models, popular algorithms, evaluation protocols, as well as system performance are provided corresponding to different aspects of crowded scene analysis. We also outline the available datasets for performance evaluation. Finally, some research problems and promising future directions are presented with discussions.

* 20 pages in IEEE Transactions on Circuits and Systems for Video Technology, 2015
In this paper, we propose an integrated framework for autonomous robotic exploration in indoor environments. Specifically, we present a hybrid map, named Semantic Road Map (SRM), to represent the topological structure of the explored environment and facilitate decision-making in the exploration. The SRM is built incrementally along with the exploration process. It is a graph structure with collision-free nodes and edges that are generated within the sensor coverage. Moreover, each node has a semantic label and the expected information gain at that location. Based on the concise SRM, we present a novel and effective decision-making model to determine the next-best-target (NBT) during the exploration. The model considers the semantic information, the information gain, and the path cost to the target location. We use the nodes of the SRM to represent the candidate targets, which enables the target evaluation to be performed directly on the SRM. With the SRM, both the information gain of a node and the path cost to the node can be obtained efficiently. In addition, we adopt the cross-entropy method to optimize the path to make it more informative. We conduct experimental studies in both simulated and real-world environments, which demonstrate the effectiveness of the proposed method.

In this paper, we study the grasping problem in dense clusters, a challenging task in warehouse logistics scenarios. By introducing a two-step robust suction affordance detection method, we focus on using a vacuum suction pad to clear a box filled with seen and unseen objects. Two CNN-based neural networks are proposed. A Fast Region Estimation Network (FRE-Net) predicts which region contains pickable objects, and a Suction Grasp Point Affordance network (SGPA-Net) determines which point in that region is pickable. To enable these two networks, we design a self-supervised learning pipeline to accumulate data and to train and test our method. In both virtual and real environments, within 1500 picks (~5 hours), we reach a picking accuracy of 95% for known objects and 90% for unseen objects with similar geometry features.

With the rapid development of artificial intelligence (AI), ethical issues surrounding AI have attracted increasing attention. In particular, autonomous vehicles may face moral dilemmas in accident scenarios, such as staying the course resulting in hurting pedestrians or swerving leading to hurting passengers. To investigate such ethical dilemmas, recent studies have adopted preference aggregation, in which each voter expresses her/his preferences over decisions for the possible ethical dilemma scenarios, and a centralized system aggregates these preferences to obtain the winning decision. Although a useful methodology for building ethical AI systems, such an approach can potentially violate the privacy of voters since moral preferences are sensitive information and their disclosure can be exploited by malicious parties. In this paper, we report a first-of-its-kind privacy-preserving crowd-guided AI decision-making approach in ethical dilemmas. We adopt the notion of differential privacy to quantify privacy and consider four granularities of privacy protection by taking voter-/record-level privacy protection and centralized/distributed perturbation into account, resulting in four approaches VLCP, RLCP, VLDP, and RLDP. Moreover, we propose different algorithms to achieve these privacy protection granularities, while retaining the accuracy of the learned moral preference model. Specifically, VLCP and RLCP are implemented with the data aggregator setting a universal privacy parameter and perturbing the averaged moral preference to protect the privacy of voters' data. VLDP and RLDP are implemented in such a way that each voter perturbs her/his local moral preference with a personalized privacy parameter. Extensive experiments on both synthetic and real data demonstrate that the proposed approach can achieve high accuracy of preference aggregation while protecting individual voter's privacy.

* 11 pages
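The difference between centralized and distributed perturbation in this abstract can be sketched with a simple noise-adding example. The abstract does not name the noise mechanism, so the Laplace mechanism used below is an assumption, as are the scalar preferences clipped to [0, 1] and all function names. The voter-level versus record-level distinction is not modelled.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def centrally_perturbed_mean(preferences, epsilon, rng):
    """Centralized perturbation: the aggregator adds noise once, to the average."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    n = len(preferences)
    mean = sum(preferences) / n
    sensitivity = 1.0 / n  # assumes each preference lies in [0, 1]
    return mean + laplace_noise(sensitivity / epsilon, rng)


def locally_perturbed_mean(preferences, epsilons, rng):
    """Distributed perturbation: each voter adds noise with a personal epsilon."""
    if len(preferences) != len(epsilons) or any(e <= 0 for e in epsilons):
        raise ValueError("need one positive epsilon per voter")
    noisy = [p + laplace_noise(1.0 / e, rng) for p, e in zip(preferences, epsilons)]
    return sum(noisy) / len(noisy)


rng = random.Random(0)
prefs = [0.2, 0.4, 0.6, 0.8]  # hypothetical moral-preference scores
print(centrally_perturbed_mean(prefs, epsilon=1.0, rng=rng))
print(locally_perturbed_mean(prefs, epsilons=[0.5, 1.0, 1.0, 2.0], rng=rng))
```

In the centralized variant the aggregator sees raw preferences and adds one small noise term; in the distributed variant the aggregator only ever sees noisy values, at the cost of more total noise in the average.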
Robotic grasp detection is a fundamental capability for intelligent manipulation in unstructured environments. Previous work mainly employed visual and tactile fusion to achieve stable grasps, but the whole process depends heavily on regrasping, which wastes much time on regulation and evaluation. We propose a novel way to improve robotic grasping: by using learned tactile knowledge, a robot can achieve a stable grasp from an image. First, we construct a prior tactile knowledge learning framework with a novel grasp quality metric, which is determined by measuring a grasp's resistance to external perturbations. Second, we propose a multi-phase Bayesian Grasp architecture to generate stable grasp configurations from a single RGB image based on prior tactile knowledge. Results show that this framework can classify the outcome of grasps with an average accuracy of 86% on known objects and 79% on novel objects. The prior tactile knowledge improves the success rate by 55% over traditional vision-based strategies.

* ICRA2019: ViTac Workshop
An automatic classification method is studied to effectively detect and recognize electrocardiogram (ECG) signals. Based on the synchronizing and orthogonal relationships of multiple leads, we propose a Multi-branch Convolution and Residual Network (MBCRNet) with three kinds of feature fusion methods for automatic detection of normal and abnormal ECG signals. Experiments are conducted on the Chinese Cardiovascular Disease Database (CCDD). Through 10-fold cross-validation, we achieve an average accuracy of 87.04% and a sensitivity of 89.93%, which outperform previous methods on the same database. It is also shown that the multi-lead feature fusion network can improve the classification accuracy over a network with only single-lead features.

* 6 pages, 5 figures
In this paper, for multichannel synthetic aperture radars (SAR), we first formulate the problems of Doppler ambiguities in the radial velocity (RV) estimation of a ground moving target in the range-compressed domain, the range-Doppler domain and the image domain, respectively. It is revealed that in these problems a cascaded time-space Doppler ambiguity (CTSDA) may occur, i.e., time domain Doppler ambiguity (TDDA) in each channel arises first and spatial domain Doppler ambiguity (SDDA) among multiple channels arises second. Accordingly, multichannel SAR systems with different parameters are investigated in three different cases with diverse Doppler ambiguity properties, and a multi-frequency SAR is then proposed to obtain the RV estimation by solving the ambiguity problem based on the Chinese remainder theorem (CRT). In the first two cases, the ambiguity problem can be solved by the existing closed-form robust CRT. In the third case, it is found that the problem differs from conventional CRT problems, and we call it a double remaindering problem in this paper. We then propose a sufficient condition under which the double remaindering problem, i.e., the CTSDA, can also be solved by the closed-form robust CRT. When the sufficient condition is not satisfied for a multichannel SAR, a search-based method is proposed. Finally, numerical experiments are provided to demonstrate the effectiveness of the proposed methods.

* 14 double-column pages, 11 figures, 4 tables
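As background for the CRT-based ambiguity resolution in this abstract, the sketch below shows the classic integer Chinese remainder theorem: recovering a value from its remainders under pairwise coprime moduli. It is not the closed-form robust CRT used in the paper, which handles erroneous real-valued remainders, and it does not address the double remaindering problem. The function name is hypothetical.

```python
from math import gcd, prod


def crt(remainders, moduli):
    """Smallest non-negative x with x % m_i == r_i for pairwise coprime moduli."""
    if len(remainders) != len(moduli):
        raise ValueError("need one remainder per modulus")
    for i in range(len(moduli)):
        for j in range(i + 1, len(moduli)):
            if gcd(moduli[i], moduli[j]) != 1:
                raise ValueError("moduli must be pairwise coprime")
    total = prod(moduli)
    x = 0
    for r, m in zip(remainders, moduli):
        partial = total // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    return x % total


# An unknown value observed only through its "folded" remainders.
print(crt([2, 3, 2], [3, 5, 7]))
```

Each remainder plays the role of an ambiguous (wrapped) measurement; combining several moduli extends the range over which the true value can be determined uniquely.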
The emergence of big data enables us to evaluate the various human emotions at places from a statistical perspective by applying affective computing. In this study, a novel framework for extracting human emotions from large-scale georeferenced photos at different places is proposed. After places are constructed by spatially clustering user-generated footprints collected from social media websites, online cognitive services are used to extract human emotions from facial expressions with state-of-the-art computer vision techniques. Two happiness metrics are defined for measuring human emotions at different places. To validate the feasibility of the framework, we take 80 tourist attractions around the world as an example, and a happiness ranking list of places is generated based on human emotions calculated from more than 2 million faces detected in over 6 million photos. Different kinds of geographical contexts are taken into consideration to find the relationship between human emotions and environmental factors. Results show that much of the emotional variation at different places can be explained by a few factors such as openness. The research may offer insights on integrating human emotions to enrich the understanding of sense of place in geography and in place-based GIS.

* Transactions in GIS, Year 2019, Volume 23, Issue 3
* 40 pages; 9 figures