Research papers and code for "Cheng Wang":
In this paper, we propose a recurrent neural network (RNN) with residual attention (RRA) to learn long-range dependencies from sequential data. We propose to add residual connections across timesteps to the RNN, which explicitly enhances the interaction between the current state and hidden states several timesteps apart. This also allows training errors to be back-propagated directly through the residual connections and effectively alleviates the vanishing gradient problem. We further formulate an attention mechanism over the residual connections: an attention gate summarizes the individual contributions of multiple previous hidden states to the computation of the current state. We evaluate RRA on three tasks: the adding problem, pixel-by-pixel MNIST classification and sentiment analysis on the IMDB dataset. Our experiments demonstrate that RRA yields better performance, faster convergence and more stable training than a standard LSTM network. Furthermore, RRA is highly competitive with state-of-the-art methods.

* 9 pages, 7 figures
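
As a concrete illustration of the mechanism sketched above, here is a minimal PyTorch rendering of a recurrent cell with an attention gate over the last few hidden states. The cell type, the window size `k`, and all names are illustrative assumptions, not the authors' implementation.

```python
# Sketch: residual attention over the k most recent hidden states of an RNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAttentionRNN(nn.Module):
    def __init__(self, input_size, hidden_size, k=4):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        self.attn = nn.Linear(hidden_size, 1)  # scores each past hidden state
        self.k = k

    def forward(self, x):  # x: (batch, time, input_size)
        batch, time, _ = x.shape
        h = x.new_zeros(batch, self.cell.hidden_size)
        history, outputs = [], []
        for t in range(time):
            h_new = self.cell(x[:, t], h)
            if history:
                past = torch.stack(history[-self.k:], dim=1)  # (batch, <=k, hidden)
                scores = F.softmax(self.attn(past), dim=1)    # attention gate
                h_new = h_new + (scores * past).sum(dim=1)    # residual connection
            history.append(h_new)
            outputs.append(h_new)
            h = h_new
        return torch.stack(outputs, dim=1)
```
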
Recent years have witnessed the wide application of hashing for large-scale image retrieval. However, most existing hashing methods are based on hand-crafted features, which might not be optimally compatible with the hashing procedure. Recently, deep hashing methods have been proposed that perform simultaneous feature learning and hash-code learning with deep neural networks; they have shown better performance than traditional hashing methods with hand-crafted features. Most of these deep hashing methods are supervised, with supervision given in the form of triplet labels. For the other common application scenario, in which supervision is given as pairwise labels, no method has existed for simultaneous feature learning and hash-code learning. In this paper, we propose a novel deep hashing method, called deep pairwise-supervised hashing (DPSH), to perform simultaneous feature learning and hash-code learning for applications with pairwise labels. Experiments on real datasets show that our DPSH method outperforms other methods and achieves state-of-the-art performance in image retrieval applications.

* IJCAI 2016
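
The pairwise-supervised objective described above can be sketched as a negative log-likelihood over pairwise labels plus a quantization penalty. This is a hedged reading of a DPSH-style loss; the variable names and the weight `eta` are chosen for illustration, not taken from the paper.

```python
# Sketch: pairwise-likelihood hashing loss with a quantization term.
import torch
import torch.nn.functional as F

def pairwise_hashing_loss(u, s, eta=0.1):
    """u: (n, bits) real-valued codes from the network; s: (n, n) 0/1 similarity labels (float)."""
    theta = 0.5 * (u @ u.t())                      # pairwise inner products
    nll = (F.softplus(theta) - s * theta).mean()   # -log-likelihood of pairwise labels
    quant = (torch.sign(u) - u).pow(2).mean()      # pull codes toward {-1, +1}
    return nll + eta * quant
```
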
Building effective recommender systems for domains like fashion is challenging due to the high level of subjectivity and the semantic complexity of the features involved (i.e., fashion styles). Recent work has shown that approaches to 'visual' recommendation (e.g., clothing, art, etc.) can be made more accurate by incorporating visual signals directly into the recommendation objective, using 'off-the-shelf' feature representations derived from deep networks. Here, we seek to extend this contribution by showing that recommendation performance can be significantly improved by learning 'fashion aware' image representations directly, i.e., by training the image representation (from the pixel level) and the recommender system jointly; this contribution is related to recent work using Siamese CNNs, though we are able to show improvements over state-of-the-art recommendation techniques such as BPR and variants that make use of pre-trained visual features. Furthermore, we show that our model can be used generatively, i.e., given a user and a product category, we can generate new images (i.e., clothing items) that are most consistent with their personal taste. This represents a first step towards building systems that go beyond recommending existing items from a product corpus, and which can be used to suggest styles and aid the design of new products.

* 10 pages, 6 figures. Accepted by ICDM'17 as a long paper
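
A minimal sketch of the joint training idea: a small image encoder is learned end-to-end together with a BPR ranking objective. The tiny CNN, the dimensions, and the class name are placeholders, not the paper's architecture.

```python
# Sketch: BPR with image features trained jointly with the recommender.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualBPR(nn.Module):
    def __init__(self, n_users, feat_dim=32):
        super().__init__()
        self.cnn = nn.Sequential(                 # stand-in image encoder
            nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, feat_dim),
        )
        self.user_emb = nn.Embedding(n_users, feat_dim)

    def score(self, users, images):
        return (self.user_emb(users) * self.cnn(images)).sum(-1)

    def bpr_loss(self, users, pos_imgs, neg_imgs):
        # maximize the margin between observed (pos) and unobserved (neg) items
        diff = self.score(users, pos_imgs) - self.score(users, neg_imgs)
        return -F.logsigmoid(diff).mean()
```
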
Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been used collaboratively by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase its capabilities.

Recurrent neural networks are a widely used class of neural architectures. They have, however, two shortcomings. First, it is difficult to understand what exactly they learn. Second, they tend to work poorly on sequences requiring long-term memorization, despite having this capacity in principle. We aim to address both shortcomings with a class of recurrent networks that use a stochastic state transition mechanism between cell applications. This mechanism, which we term state-regularization, makes RNNs transition between a finite set of learnable states. We evaluate state-regularized RNNs on (1) regular languages for the purpose of automata extraction; (2) nonregular languages such as balanced parentheses, palindromes, and the copy task, where external memory is required; and (3) real-world sequence learning tasks for sentiment analysis, visual object recognition, and language modeling. We show that state-regularization (a) simplifies the extraction of finite state automata modeling an RNN's state transition dynamics; (b) forces RNNs to operate more like automata with external memory and less like finite state machines; and (c) improves RNNs' interpretability and explainability.

* 20 pages
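
A hedged sketch of the state-regularization mechanism described above: after each cell application, the hidden state is replaced by a soft mixture over a finite set of learnable centroid states. The temperature and sizes are assumptions, not the paper's settings.

```python
# Sketch: soft assignment of a hidden state to k learnable centroid states.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateRegularizer(nn.Module):
    def __init__(self, hidden_size, n_states=10, temperature=1.0):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_states, hidden_size))
        self.temperature = temperature

    def forward(self, h):                       # h: (batch, hidden_size)
        logits = h @ self.centroids.t() / self.temperature
        weights = F.softmax(logits, dim=-1)     # soft assignment to states
        return weights @ self.centroids        # mixture of learnable states
```
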
A controllable crack propagation (CCP) strategy is suggested. It is well known that cracks often lead to failure by crossing critical domains in engineering structures. The CCP method is therefore proposed to steer a crack along a specified path that stays away from the critical domain. To implement this strategy, two optimization methods are employed. First, a back-propagation neural network (BPNN)-assisted particle swarm optimization (PSO) is suggested, in which the BPNN builds a metamodel that replaces expensive forward evaluations and thereby improves the efficiency of CCP. Second, the standard PSO is used. Because the optimization iterations are time-consuming, an efficient reanalysis-based extended finite element method (X-FEM) is used in place of the complete X-FEM solver to calculate the crack propagation path. Moreover, an adaptive subdomain partition strategy is suggested to improve the fitting accuracy between the actual and specified crack paths. Several typical numerical examples demonstrate that both optimization methods can carry out CCP; the choice between them is a tradeoff between efficiency and accuracy.

* 17 pages, 29 figures, Crack propagation path, Reanalysis solver, Back propagation neural network, Particle swarm optimization, Extended finite element method
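
For reference, here is a generic particle swarm optimization loop of the kind the second method relies on. The objective, bounds, and hyperparameters (inertia `w`, coefficients `c1`/`c2`) are illustrative defaults rather than the paper's settings; in the actual workflow the BPNN metamodel or reanalysis X-FEM solver would stand in for `objective`.

```python
# Sketch: a plain PSO minimizer.
import numpy as np

def pso(objective, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5):
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, (n_particles, dim))       # positions
    v = np.zeros_like(x)                             # velocities
    pbest, pbest_val = x.copy(), np.apply_along_axis(objective, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        vals = np.apply_along_axis(objective, 1, x)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest

# e.g. pso(lambda p: ((p - 0.5) ** 2).sum(), dim=3) approaches [0.5, 0.5, 0.5]
```
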
In recent years, Deep Neural Network (DNN) based methods have achieved remarkable performance in a wide range of tasks and have been among the most powerful and widely used techniques in computer vision. However, DNN-based methods are both computation-intensive and resource-consuming, which hinders their application on embedded systems such as smartphones. To alleviate this problem, we introduce novel Fixed-point Factorized Networks (FFN) for pretrained models, which reduce both the computational complexity and the storage requirements of networks. The resulting networks have weights of only -1, 0 and 1, which largely eliminates the most resource-consuming multiply-accumulate operations (MACs). Extensive experiments on the large-scale ImageNet classification task show that the proposed FFN requires only one-thousandth of the multiply operations while achieving comparable accuracy.
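
To make the {-1, 0, 1} representation concrete, here is a hedged sketch of ternarizing a weight matrix. The simple threshold rule below is a common heuristic from the ternary-network literature, not the paper's fixed-point factorization algorithm.

```python
# Sketch: map real weights to {-1, 0, +1} with a magnitude threshold.
import numpy as np

def ternarize(w):
    delta = 0.7 * np.abs(w).mean()       # heuristic threshold (an assumption)
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    return t

w = np.random.randn(4, 4)
print(ternarize(w))                       # only -1, 0, 1 entries remain
```
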

This paper describes an updated interactive performance system for floor and aerial dance that controls visual and sonic aspects of the presentation via a depth-sensing camera (MS Kinect). To detect, measure and track free movement in space, three-degree-of-freedom (3-DOF) tracking (on the ground and in the air) is performed using IR markers, with a method for multi-target tracking added and described in detail. An improved gesture tracking and recognition system, called Action Graph (AG), is also described. Action Graph is constructed efficiently and incrementally from a single long sequence of movement features and automatically captures repeated sub-segments of the movement from start to finish, with no manual interaction needed; further advanced capabilities are discussed as well. Using this new gesture model, we can unify an entire choreography piece by dynamically tracking and recognizing gestures and sub-portions of the piece. This gives the performer the freedom to improvise based on a set of recorded gestures or portions of the choreography and have the system respond dynamically to the performer within a set of related rehearsed actions, an ability that has not been seen in any other system to date.

* 8 pages. Technical paper
Sequential dynamics are a key feature of many modern recommender systems, which seek to capture the 'context' of users' activities on the basis of actions they have performed recently. To capture such patterns, two approaches have proliferated: Markov Chains (MCs) and Recurrent Neural Networks (RNNs). Markov Chains assume that a user's next action can be predicted on the basis of just their last (or last few) actions, while RNNs in principle allow longer-term semantics to be uncovered. Generally speaking, MC-based methods perform best on extremely sparse datasets, where model parsimony is critical, while RNNs perform better on denser datasets, where higher model complexity is affordable. Our work aims to balance these two goals by proposing a self-attention based sequential model (SASRec) that captures long-term semantics (like an RNN) but, using an attention mechanism, makes its predictions based on relatively few actions (like an MC). At each time step, SASRec seeks to identify which items are 'relevant' in a user's action history and uses them to predict the next item. Extensive empirical studies show that our method outperforms various state-of-the-art sequential models (including MC/CNN/RNN-based approaches) on both sparse and dense datasets. Moreover, the model is an order of magnitude more efficient than comparable CNN/RNN-based models. Visualizations of attention weights also show how our model adaptively handles datasets of varying density and uncovers meaningful patterns in activity sequences.

* Accepted by ICDM'18 as a long paper
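
At the heart of a SASRec-style model is causally masked self-attention: each position attends only to earlier actions in the sequence. The single-head form below is a simplification of the full architecture, with shapes chosen for illustration.

```python
# Sketch: single-head self-attention with a causal (future-hiding) mask.
import torch
import torch.nn.functional as F

def causal_self_attention(x, wq, wk, wv):
    """x: (batch, seq, dim); wq/wk/wv: (dim, dim) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))   # hide future actions
    return F.softmax(scores, dim=-1) @ v
```
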
The StochAstic Recursive grAdient algoritHm (SARAH), originally proposed for convex optimization and also proven effective for general nonconvex optimization, has received great attention due to its simple recursive framework for updating stochastic gradient estimates. The performance of SARAH depends significantly on the choice of the step size sequence. However, SARAH and its variants often employ a manually best-tuned step size, which is time-consuming in practice. Motivated by this gap, we propose a variant of the Barzilai-Borwein (BB) method, referred to as the Random Barzilai-Borwein (RBB) method, to calculate the step size for SARAH in the mini-batch setting, leading to a new method: MB-SARAH-RBB. We prove that MB-SARAH-RBB converges linearly in expectation for strongly convex objective functions. We analyze the complexity of MB-SARAH-RBB and show that it is better than that of the original method. Numerical experiments on standard datasets indicate that MB-SARAH-RBB outperforms or matches state-of-the-art algorithms.
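
For context, here is the classic Barzilai-Borwein step size that the RBB method adapts: the ratio of successive iterate and gradient differences. In a randomized mini-batch variant, `g_prev` and `g_curr` would be gradient estimates on a sampled mini-batch; the scalar rule itself is standard, while the surrounding names are illustrative.

```python
# Sketch: the BB1 step size from consecutive iterates and gradients.
import numpy as np

def bb_step_size(x_prev, x_curr, g_prev, g_curr, eps=1e-12):
    s = x_curr - x_prev                            # iterate difference
    y = g_curr - g_prev                            # gradient difference
    return float(s @ s) / (float(s @ y) + eps)     # BB1 step size
```
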

Brain tumor segmentation from Magnetic Resonance Images (MRIs) is an important task for measuring tumor response to treatment. However, automatic segmentation is very challenging. This paper presents an automatic brain tumor segmentation method based on Normalized Gaussian Bayesian classification and a new 3D Fluid Vector Flow (FVF) algorithm. In our method, a Normalized Gaussian Mixture Model (NGMM) is proposed and used to model the healthy brain tissues. A Gaussian Bayesian classifier is exploited to acquire a Gaussian Bayesian Brain Map (GBBM) from the test brain MR images. The GBBM is further processed to initialize the 3D FVF algorithm, which segments the brain tumor. This work makes two major contributions. First, we present an NGMM to model healthy brains. Second, we extend our 2D FVF algorithm to 3D space and use it for brain tumor segmentation. The proposed method is validated on a publicly available dataset.

* ICIP 2010
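
To make the Bayesian step concrete, here is a toy sketch of scoring voxels against Gaussian models of healthy tissue, so that poorly fitting voxels become tumor candidates. The class means, standard deviations, and threshold are invented toy values, not the paper's NGMM parameters.

```python
# Sketch: flag voxels that fit no healthy-tissue Gaussian well.
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def healthy_likelihood_map(intensities, means=(0.2, 0.5, 0.8), stds=(0.05, 0.05, 0.05)):
    """Max likelihood over healthy-tissue Gaussians; low values suggest tumor."""
    likes = np.stack([gaussian_pdf(intensities, m, s) for m, s in zip(means, stds)])
    return likes.max(axis=0)

img = np.random.rand(64, 64)                     # stand-in normalized MR slice
candidates = healthy_likelihood_map(img) < 0.5   # crude tumor-candidate mask
```
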
Person re-identification is becoming an increasingly important task due to its wide range of applications. In practice, it remains challenging due to variations in person pose, lighting, occlusion, misalignment, background clutter, etc. In this paper, we propose a multi-scale body-part mask guided attention network (MMGA), which jointly learns whole-body and body-part attention to help extract global and local features simultaneously. In MMGA, body-part masks guide the training of the corresponding attention. Experiments show that our proposed method can reduce the negative influence of pose variation, misalignment and background clutter. Our method achieves rank-1/mAP of 95.0%/87.2% on the Market1501 dataset and 89.5%/78.1% on the DukeMTMC-reID dataset, outperforming current state-of-the-art methods.
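
A hedged sketch of mask-guided attention: an attention map predicted from the feature tensor is supervised to match a body-part mask, so features are reweighted toward that part. The layer sizes, loss choice, and names are placeholders, not the MMGA architecture.

```python
# Sketch: attention map supervised by a body-part mask.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats, mask=None):
        a = torch.sigmoid(self.attn(feats))          # (B, 1, H, W) attention
        out = feats * a                              # reweighted features
        guide_loss = F.mse_loss(a, mask) if mask is not None else None
        return out, guide_loss
```
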

In recent years, perceptual-quality driven super-resolution methods have shown satisfactory results. However, super-resolved images often have uncertain texture details and unpleasant artifacts. To ameliorate these problems, we build a novel perceptual loss function composed of a morphological-components adversarial loss, a color adversarial loss and a salient content loss. The adversarial losses constrain the color and morphological-component distributions of super-resolved images, while the salient content loss highlights the perceptual similarity of feature-rich regions. Experiments show that the proposed method achieves significant improvements in terms of perceptual index and visual quality compared with state-of-the-art methods.

* 11 pages, 7 figures, 1 table
Data sparsity and data imbalance are practical and challenging issues in cross-domain recommender systems. This paper addresses these problems by leveraging concepts from representation learning, adversarial learning and transfer learning (particularly, domain adaptation). Although various transfer learning methods have shown promising performance in this context, our proposed method, RecSys-DAN, focuses on alleviating cross-domain and within-domain data sparsity and data imbalance, and learns transferable latent representations for users, items and their interactions. Different from existing approaches, the proposed method transfers latent representations from a source domain to a target domain in an adversarial way. The mapping functions in the target domain are learned by playing a min-max game with an adversarial loss, aiming to generate representations that a discriminator cannot distinguish by domain. Four neural architectural instances of RecSys-DAN are proposed and explored. Empirical results on real-world Amazon data show that, even without using labeled data (i.e., ratings) in the target domain, RecSys-DAN achieves performance competitive with state-of-the-art supervised methods. More importantly, RecSys-DAN is highly flexible in both unimodal and multimodal scenarios, and is thus more robust for cold-start recommendation, which is difficult for previous methods.

* 10 pages, IEEE-TNNLS
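
A hedged sketch of the adversarial min-max objective described above: a discriminator tries to tell source-domain representations from mapped target-domain ones, while the target mapper is trained to fool it. The network sizes and names are placeholders, not the RecSys-DAN architectures.

```python
# Sketch: domain-adversarial min-max game over latent representations.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 16
discriminator = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))
target_mapper = nn.Linear(dim, dim)

def discriminator_loss(src_repr, tgt_repr):
    # discriminator: source -> 1, mapped target -> 0
    real = F.binary_cross_entropy_with_logits(
        discriminator(src_repr), torch.ones(src_repr.size(0), 1))
    fake = F.binary_cross_entropy_with_logits(
        discriminator(target_mapper(tgt_repr).detach()),
        torch.zeros(tgt_repr.size(0), 1))
    return real + fake

def mapper_loss(tgt_repr):
    # target mapper tries to make its outputs indistinguishable from source
    logits = discriminator(target_mapper(tgt_repr))
    return F.binary_cross_entropy_with_logits(logits, torch.ones(tgt_repr.size(0), 1))
```
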
We present recurrent geometry-aware neural networks that integrate visual information across multiple views of a scene into 3D latent feature tensors, while maintaining a one-to-one mapping between 3D physical locations in the world scene and latent feature locations. Object detection, object segmentation, and 3D reconstruction are then carried out directly on the constructed 3D feature memory, as opposed to any of the input 2D images. The proposed models are equipped with differentiable egomotion-aware feature warping and (learned) depth-aware unprojection operations to achieve a geometrically consistent mapping between the features in the input frame and the constructed latent model of the scene. We empirically show that the proposed model generalizes much better than geometry-unaware LSTM/GRU networks, especially in the presence of multiple objects and cross-object occlusions. Combined with active view selection policies, our model learns to select informative viewpoints from which to integrate information, "undoing" cross-object occlusions and seamlessly combining geometry with learning from experience.

* To appear in NIPS 2018
Multimodal learning has shown promising performance in content-based recommendation due to auxiliary user and item information in multiple modalities such as text and images. However, the problem of incomplete and missing modalities is rarely explored, and most existing methods fail to learn a recommendation model when modalities are missing or corrupted. In this paper, we propose LRMM, a novel framework that mitigates not only the problem of missing modalities but also, more generally, the cold-start problem of recommender systems. We propose modality dropout (m-drop) and a multimodal sequential autoencoder (m-auto) to learn multimodal representations for complementing and imputing missing modalities. Extensive experiments on real-world Amazon data show that LRMM achieves state-of-the-art performance on rating prediction tasks. More importantly, LRMM is more robust than previous methods in alleviating data sparsity and the cold-start problem.

* 11 pages, EMNLP 2018
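
A hedged sketch of modality dropout (m-drop) as described above: during training, a randomly chosen modality is zeroed out so the model learns to impute it from the modalities that remain. The two-modality setup and the drop rate are assumptions.

```python
# Sketch: randomly zero one modality per batch during training.
import torch

def modality_dropout(text_feats, image_feats, p=0.3):
    """With probability p, drop one of the two modalities (chosen uniformly)."""
    if torch.rand(1).item() < p:
        if torch.rand(1).item() < 0.5:
            text_feats = torch.zeros_like(text_feats)
        else:
            image_feats = torch.zeros_like(image_feats)
    return text_feats, image_feats
```
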
Depth estimation from a single image is a fundamental problem in computer vision. In this paper, we propose a simple yet effective convolutional spatial propagation network (CSPN) to learn the affinity matrix for depth prediction. Specifically, we adopt an efficient linear propagation model, in which propagation is performed as a recurrent convolutional operation and the affinity among neighboring pixels is learned through a deep convolutional neural network (CNN). We apply the designed CSPN to two depth estimation tasks given a single image: (1) refining the depth output of existing state-of-the-art (SOTA) methods; and (2) converting sparse depth samples to a dense depth map by embedding the depth samples within the propagation procedure. The second task is inspired by the availability of LiDAR, which provides sparse but accurate depth measurements. We evaluate the proposed CSPN on two popular depth estimation benchmarks, NYU v2 and KITTI, and show that our approach improves over prior SOTA methods not only in quality (e.g., 30% greater reduction in depth error) but also in speed (e.g., 2 to 5 times faster).

* 14 pages, 8 figures, ECCV 2018
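
A simplified sketch of one spatial propagation step: each pixel's depth is updated as an affinity-weighted combination of its neighbors, with weights normalized per pixel. The unfold-based implementation and kernel size are illustrative choices; the actual CSPN additionally treats the center pixel specially and preserves the sparse depth samples during propagation.

```python
# Sketch: one affinity-weighted propagation step over a depth map.
import torch
import torch.nn.functional as F

def cspn_step(depth, affinity, k=3):
    """depth: (B, 1, H, W); affinity: (B, k*k, H, W) learned per-pixel weights."""
    w = affinity / (affinity.abs().sum(dim=1, keepdim=True) + 1e-8)  # normalize
    b, _, h, wd = depth.shape
    patches = F.unfold(depth, k, padding=k // 2)       # (B, k*k, H*W) neighbors
    patches = patches.view(b, k * k, h, wd)
    return (w * patches).sum(dim=1, keepdim=True)      # weighted propagation
```
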
In this paper, we propose a novel pixel-wise visual object tracking framework that can track any anonymous object against a noisy background. The framework consists of two submodels: a global attention model and a local segmentation model. The global model generates a region of interest (ROI) in which the object may lie in the new frame, based on past object segmentation maps, while the local model segments the new image within the ROI. Each model uses an LSTM structure to model the temporal dynamics of motion and appearance, respectively. To circumvent the dependency between the two models' training data, we use an iterative update strategy. Once the models are trained, there is no need to refine them to track specific objects, making our method efficient compared to online learning approaches. We demonstrate our real-time pixel-wise object tracking framework on a challenging VOT dataset.

Deep convolutional neural networks (CNNs) have shown appealing performance on various computer vision tasks in recent years. This motivates people to deploy CNNs in real-world applications. However, most state-of-the-art CNNs require large memory and computational resources, which hinders deployment on mobile devices. Recent studies show that low-bit weight representations can greatly reduce storage and memory demands and enable efficient network inference. To achieve this goal, we propose a novel approach named BWNH to train Binary Weight Networks via Hashing. In this paper, we first reveal the strong connection between inner-product preserving hashing and binary weight networks, showing that training binary weight networks can intrinsically be regarded as a hashing problem. Based on this perspective, we propose an alternating optimization method to learn the hash codes instead of directly learning binary weights. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate that our proposed BWNH outperforms the current state of the art by a large margin.
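
For orientation, here is the classic binary-weight baseline that the hashing view refines: approximating a real weight vector by a scaled sign vector, alpha * sign(w), which minimizes the L2 approximation error. BWNH's alternating hash-code optimization is more involved; this shows only the underlying binarization idea.

```python
# Sketch: scaled sign binarization of a weight vector.
import numpy as np

def binarize(w):
    b = np.sign(w)                     # binary codes in {-1, +1}
    alpha = np.abs(w).mean()           # optimal scale for L2 approximation
    return alpha, b

w = np.random.randn(8)
alpha, b = binarize(w)
print(np.linalg.norm(w - alpha * b))   # small reconstruction error
```
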

We compare and contrast the statistical physics and quantum physics inspired approaches for unsupervised generative modeling of classical data. The two approaches represent probabilities of observed data using energy-based models and quantum states, respectively. Classical and quantum information patterns of the target datasets therefore provide principled guidelines for structural design and learning in these two approaches. Taking the restricted Boltzmann machine (RBM) as an example, we analyze the information theoretical bounds of the two approaches. We verify our reasoning by comparing the performance of RBMs of various architectures on the standard MNIST dataset.

* Entropy 2018, 20(8), 583
* 7 pages, 4 figures
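
As a reminder of the energy-based side of the comparison, here is the standard RBM energy function, E(v, h) = -a·v - b·h - v·W·h, in a toy NumPy form; the sizes and parameter values are illustrative only.

```python
# Sketch: the energy of a binary restricted Boltzmann machine.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
a = np.zeros(n_visible)                 # visible biases
b = np.zeros(n_hidden)                  # hidden biases

def energy(v, h):
    return -(a @ v) - (b @ h) - v @ W @ h

v = rng.integers(0, 2, n_visible)       # a binary visible configuration
h = rng.integers(0, 2, n_hidden)        # a binary hidden configuration
print(energy(v, h))
```
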