Deep learning has emerged as a powerful machine learning technique that learns multiple layers of representations or features of the data and produces state-of-the-art prediction results. Along with its success in many other application domains, deep learning has also become widely used for sentiment analysis in recent years. This paper first gives an overview of deep learning and then provides a comprehensive survey of its current applications in sentiment analysis.

* 34 pages, 9 figures, 2 tables
Fully Convolutional Networks (FCNs) have achieved great success in dense prediction tasks, including semantic segmentation. In this paper, we begin by discussing FCN and the architectural limitations that prevent it from serving as a strong segmentation network. We then present our Improved Fully Convolution Network (IFCN). In contrast to FCN, IFCN introduces a context network that progressively expands the receptive fields of feature maps. In addition, dense skip connections are added so that the context network can be effectively optimized. More importantly, these dense skip connections enable IFCN to fuse context from a rich range of scales to make reliable predictions. Empirically, these architectural modifications prove crucial for enhancing segmentation performance. Without any contextual post-processing, IFCN significantly advances the state of the art on the ADE20K (ImageNet scene parsing), Pascal Context, Pascal VOC 2012 and SUN-RGBD segmentation benchmarks.
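The abstract does not spell out the context network; the PyTorch sketch below only illustrates the general idea of progressively enlarging receptive fields and fusing every intermediate feature map through densely connected classifiers. All layer sizes, the use of plain 3x3 convolutions, and the score-summation fusion are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ContextNetwork(nn.Module):
    """Stack of conv blocks that progressively enlarges the receptive field.
    Dense skip connections let every intermediate map contribute to the output."""
    def __init__(self, channels=512, num_blocks=4, num_classes=21):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(num_blocks)
        ])
        # one classifier per skip connection; their score maps are summed (dense fusion)
        self.classifiers = nn.ModuleList([
            nn.Conv2d(channels, num_classes, 1) for _ in range(num_blocks + 1)
        ])

    def forward(self, x):
        feats = [x]
        for block in self.blocks:
            feats.append(block(feats[-1]))          # receptive field grows with each block
        scores = [cls(f) for cls, f in zip(self.classifiers, feats)]
        return torch.stack(scores, dim=0).sum(dim=0)  # fuse context from all scales
```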

We adopt Convolutional Neural Networks (CNNs) as our parametric model to learn discriminative features and classifiers for local patch classification. Based on the occurrence frequency distribution of classes, an ensemble of CNNs (CNN-Ensemble) is learned, in which each CNN component focuses on learning different and complementary visual patterns. The CNN-Ensemble outputs the local class beliefs of pixels. Because visually similar pixels can be indistinguishable in a local context, we leverage global scene semantics to alleviate this local ambiguity. The global scene constraint is imposed mathematically by adding a global energy term to the labeling energy function, and it is estimated in practice within a non-parametric framework. A large-margin-based CNN metric learning method is also proposed for better global belief estimation. Finally, the integration of local and global beliefs gives rise to the class likelihood of pixels, based on which maximum marginal inference is performed to generate the label prediction maps. Even without any post-processing, we achieve state-of-the-art results on the challenging SiftFlow and Barcelona benchmarks.
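As a rough illustration of how a scene-level term can be added to a per-pixel labeling energy, here is a minimal NumPy sketch. The specific form of the global prior, the weighting factor `lam`, and the simple per-pixel argmin inference are assumptions for illustration; the paper estimates the global term non-parametrically and uses maximum marginal inference.

```python
import numpy as np

def fuse_local_and_global(local_probs, global_prior, lam=0.5):
    """local_probs: (H, W, C) per-pixel class beliefs from the CNN ensemble.
    global_prior: (C,) scene-level class likelihood (e.g. estimated from
    retrieved visually similar training images).  Returns the label map that
    minimises a labeling energy with an added global term."""
    eps = 1e-12
    local_energy = -np.log(local_probs + eps)            # per-pixel unary term
    global_energy = -np.log(global_prior + eps)          # shared scene-level term
    total = local_energy + lam * global_energy[None, None, :]
    return total.argmin(axis=-1)                         # per-pixel minimum-energy labels

# toy usage
H, W, C = 4, 4, 3
local = np.random.dirichlet(np.ones(C), size=(H, W))
prior = np.array([0.7, 0.2, 0.1])
labels = fuse_local_and_global(local, prior)
```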

* 13 Pages, 6 figures, IEEE Transactions on Image Processing (T-IP) 2016
In image labeling, local representations for image units are usually generated from their surrounding image patches, so long-range contextual information is not effectively encoded. In this paper, we introduce recurrent neural networks (RNNs) to address this issue. Specifically, directed acyclic graph RNNs (DAG-RNNs) are proposed to process DAG-structured images, enabling the network to model long-range semantic dependencies among image units. Our DAG-RNNs greatly enhance the discriminative power of local representations, which significantly benefits local classification. Meanwhile, we propose a novel class weighting function that attends to rare classes, which markedly boosts the recognition accuracy of infrequent classes. Integrated with convolution and deconvolution layers, our DAG-RNNs achieve new state-of-the-art results on the challenging SiftFlow, CamVid and Barcelona benchmarks.
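The abstract does not give the exact form of the class weighting function; the NumPy sketch below uses the common inverse log-frequency heuristic as a hedged stand-in to show how rare classes can receive larger weights in a weighted cross-entropy loss.

```python
import numpy as np

def rare_class_weights(label_map, num_classes, c=1.02):
    """Inverse log-frequency weights: rare classes get larger weights.
    The paper's exact weighting function may differ; this follows the common
    w_k = 1 / log(c + f_k) heuristic, where f_k is the pixel frequency of class k."""
    counts = np.bincount(label_map.ravel(), minlength=num_classes).astype(float)
    freq = counts / counts.sum()
    return 1.0 / np.log(c + freq)

def weighted_cross_entropy(probs, labels, weights):
    """probs: (N, C) predicted class distributions, labels: (N,) ground-truth class ids."""
    eps = 1e-12
    picked = probs[np.arange(len(labels)), labels]
    return np.mean(weights[labels] * -np.log(picked + eps))
```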

In crowd counting datasets, people appear at different scales, depending on their distance to the camera. To address this issue, we propose a novel multi-branch scale-aware attention network that exploits the hierarchical structure of convolutional neural networks and generates, in a single forward pass, multi-scale density predictions from different layers of the architecture. To aggregate these maps into our final prediction, we present a new soft attention mechanism that learns a set of gating masks. Furthermore, we introduce a scale-aware loss function to regularize the training of the different branches and guide them to specialize on a particular scale. As this new training requires ground-truth annotations for the size of each head, we also propose a simple yet effective technique to estimate it automatically. Finally, we present an ablation study on each of these components and compare our approach against the literature on four crowd counting datasets: UCF-QNRF, ShanghaiTech A & B and UCF_CC_50. Without bells and whistles, our approach achieves state-of-the-art results on all these datasets. We observe a remarkable improvement on UCF-QNRF (25%) and a significant one on the others (around 10%).
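The soft attention aggregation can be summarized in a few lines; the PyTorch sketch below assumes every branch outputs a density map and a gating mask at a common resolution, and that the masks are normalized across branches with a softmax (the tensor shapes and the normalization choice are assumptions).

```python
import torch
import torch.nn.functional as F

def fuse_density_maps(density_maps, gating_logits):
    """density_maps / gating_logits: lists of (B, 1, H, W) tensors, one per branch,
    all assumed upsampled to a common resolution.  A softmax across branches turns
    the gating masks into per-pixel attention weights, and the final prediction is
    the attention-weighted sum of the branch densities."""
    d = torch.cat(density_maps, dim=1)        # (B, K, H, W)
    g = torch.cat(gating_logits, dim=1)       # (B, K, H, W)
    attn = F.softmax(g, dim=1)                # per-pixel soft attention over scales
    return (attn * d).sum(dim=1, keepdim=True)
```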

Existing deep convolutional neural networks (CNNs) have shown great success in image classification. CNNs mainly consist of convolutional and pooling layers, both of which operate on local image areas without considering the dependencies among different image regions. However, such dependencies are very important for generating an explicit image representation. In contrast, recurrent neural networks (RNNs) are well known for their ability to encode contextual information in sequential data, and they require only a limited number of network parameters. General RNNs, however, can hardly be applied directly to non-sequential data. We therefore propose hierarchical RNNs (HRNNs). In HRNNs, each RNN layer focuses on modeling spatial dependencies among image regions from the same scale but different locations, while the cross-scale RNN connections model scale dependencies among regions from the same location but different scales. Specifically, we propose two recurrent neural network models: 1) the hierarchical simple recurrent network (HSRN), which is fast and has a low computational cost; and 2) the hierarchical long short-term memory recurrent network (HLSTM), which performs better than HSRN at the price of a higher computational cost. In this manuscript, we integrate CNNs with HRNNs and develop end-to-end convolutional hierarchical recurrent neural networks (C-HRNNs). C-HRNNs not only make use of the representation power of CNNs, but also efficiently encode spatial and scale dependencies among different image regions. On four of the most challenging object/scene image classification benchmarks, our C-HRNNs achieve state-of-the-art results on Places 205, SUN 397 and MIT Indoor, and competitive results on ILSVRC 2012.

Given a target name, which can be a product aspect or entity, identifying its aspect words and opinion words in a given corpus is a fine-grained task in target-based sentiment analysis (TSA). This task is challenging, especially when we have no labeled data and want to perform it for any given domain. To address it, we propose a general two-stage approach. Stage one extracts/groups the target-related words (called t-words) for a given target. This is relatively easy, as we can apply an existing semantics-based learning technique. Stage two separates the aspect and opinion words from the grouped t-words, which is challenging because we often do not have enough word-level aspect and opinion labels. In this work, we formulate this problem in a PU learning setting and incorporate the idea of lifelong learning to solve it. Experimental results show the effectiveness of our approach.

While existing machine learning models have achieved great success in sentiment classification, they typically do not explicitly capture sentiment-oriented word interaction, which can lead to poor results for fine-grained analysis at the snippet level (a phrase or sentence). Factorization Machines provide a possible approach to learning element-wise interactions for recommender systems, but they are not directly applicable to our task because they cannot model contexts and word sequences. In this work, we develop two Position-aware Factorization Machines which consider word interaction, context and position information. Such information is jointly encoded in a set of sentiment-oriented word interaction (SWI) vectors. Compared to traditional word embeddings, SWI vectors explicitly capture sentiment-oriented word interaction and simplify the parameter learning. Experimental results show that while our models perform comparably to state-of-the-art methods on document-level classification, they benefit snippet/sentence-level sentiment analysis.
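For context, the second-order term of a plain Factorization Machine, which the paper builds on, can be computed with the standard O(nk) identity. The sketch below is generic FM code, not the paper's position-aware variant; the latent vectors `V` merely stand in for the sentiment-oriented word interaction vectors.

```python
import numpy as np

def fm_interaction(x, V):
    """Second-order factorization-machine term sum_{i<j} <v_i, v_j> x_i x_j,
    computed with the usual O(n k) identity.
    x: (n,) feature vector; V: (n, k) latent interaction vectors."""
    xv = x @ V                                  # (k,)
    sq = (x ** 2) @ (V ** 2)                    # (k,)
    return 0.5 * np.sum(xv ** 2 - sq)

def fm_score(x, w0, w, V):
    """Full FM score: global bias + linear term + pairwise interaction term."""
    return w0 + w @ x + fm_interaction(x, V)
```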

In order to encode class correlation and class-specific information in image representations, we propose a new local feature learning approach named Deep Discriminative and Shareable Feature Learning (DDSFL). DDSFL aims to hierarchically learn feature transformation filter banks that transform raw pixel image patches into features. The learned filter banks are expected to: (1) encode common visual patterns shared by a flexible number of categories; (2) encode discriminative information; and (3) hierarchically extract patterns at different visual levels. In particular, in each layer of DDSFL, shareable filters are jointly learned for classes that share similar patterns. Discriminative power is achieved by forcing features from the same category to be close and features from different categories to be far apart. Furthermore, we propose two exemplar selection methods to iteratively select training data for more efficient and effective learning. Experimental results show that DDSFL achieves very promising performance and is highly complementary to the state-of-the-art Caffe features.
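Below is a hedged sketch of the pairwise discriminative constraint described above (same-category features pulled together, different-category features pushed at least a margin apart). The squared hinge form and the margin value are assumptions, and DDSFL's full objective contains additional terms (e.g., for filter shareability) that are not modeled here.

```python
import numpy as np

def discriminative_hinge_loss(f1, f2, same_class, margin=1.0):
    """Pairwise constraint used to make learned features discriminative.
    f1, f2: (d,) feature vectors; same_class: bool."""
    dist = np.linalg.norm(f1 - f2)
    if same_class:
        return dist ** 2                      # pull same-category features together
    return max(0.0, margin - dist) ** 2       # push different categories apart up to the margin
```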

* Pattern Recognition, Elsevier, 2015
For visual tracking, most traditional correlation filter (CF) based methods suffer from feature redundancy and a lack of motion information. In this paper, we design a novel tracking framework called multi-hierarchical independent correlation filters (MHIT). The framework consists of a motion estimation module, hierarchical feature selection, independent online CF learning, and adaptive multi-branch CF fusion. Specifically, the motion estimation module is introduced to capture motion information, which effectively alleviates partial occlusion of the object over time. Multi-hierarchical deep CNN features, which represent different levels of semantic information, are fully exploited to track objects at multiple scales. To better overcome deep feature redundancy, the features from each hierarchy are independently fed into a single branch for online learning of its parameters. Finally, an adaptive weighting scheme is integrated into the framework to fuse these independent multi-branch CFs for more robust visual object tracking. Extensive experiments on the OTB and VOT datasets show that the proposed MHIT tracker significantly improves tracking performance. In particular, it obtains a 20.1% relative performance gain over the top trackers of the VOT2017 challenge and also achieves new state-of-the-art performance on the VOT2018 challenge.
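The abstract does not specify the adaptive weighting scheme; as a hedged illustration of fusing the response maps of independent CF branches, the NumPy sketch below weights each branch by a peak-to-sidelobe-ratio-style confidence score. This is a common heuristic and only an assumption about how such a fusion could look, not the paper's scheme.

```python
import numpy as np

def psr(response):
    """Peak-to-sidelobe-ratio-style confidence for a CF response map
    (here approximated with the global mean and standard deviation)."""
    peak = response.max()
    mu, sigma = response.mean(), response.std() + 1e-12
    return (peak - mu) / sigma

def fuse_responses(responses):
    """Adaptively fuse the response maps of independent per-hierarchy CF branches,
    weighting each branch by its normalised confidence score."""
    weights = np.array([psr(r) for r in responses])
    weights = weights / weights.sum()
    return sum(w * r for w, r in zip(weights, responses))
```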

Matching pedestrians across multiple camera views, known as human re-identification (re-id), is a challenging problem in visual surveillance. In existing works that concentrate on feature extraction, representations are formed locally and independently of other regions. We present a novel siamese Long Short-Term Memory (LSTM) architecture that can process image regions sequentially and enhance the discriminative capability of local feature representations by leveraging contextual information. The feedback connections and internal gating mechanism of the LSTM cells enable our model to memorize spatial dependencies and selectively propagate relevant contextual information through the network. We demonstrate improved performance compared to a baseline without LSTM units and promising results compared to state-of-the-art methods on the Market-1501, CUHK03 and VIPeR datasets. Visualization of the internal mechanism of the LSTM cells shows that our method learns meaningful patterns.
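A minimal PyTorch sketch of the siamese-LSTM idea: local features are treated as a top-to-bottom sequence of horizontal stripes, passed through an LSTM so that each region is contextualized, and the two resulting embeddings are compared by distance. The stripe-based sequencing, the feature dimensions, and the mean aggregation are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SiameseRegionLSTM(nn.Module):
    """Processes local features of an image as a sequence of horizontal stripes
    so that each region's representation is modulated by spatial context."""
    def __init__(self, feat_dim=128, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def embed(self, region_feats):
        # region_feats: (B, num_stripes, feat_dim), ordered top to bottom
        out, _ = self.lstm(region_feats)
        return out.mean(dim=1)                  # aggregate contextualised regions

    def forward(self, regions_a, regions_b):
        ea, eb = self.embed(regions_a), self.embed(regions_b)
        return nn.functional.pairwise_distance(ea, eb)   # small distance = same identity
```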

Recently, segmentation neural networks have improved significantly and demonstrate very promising accuracies on public benchmarks. However, these models are very heavy and generally suffer from low inference speed, which limits their application scenarios in practice. Meanwhile, existing fast segmentation models usually fail to obtain satisfactory segmentation accuracies on public benchmarks. In this paper, we propose a teacher-student learning framework that transfers the knowledge gained by a heavy, better-performing segmentation network (i.e., the teacher) to guide the learning of fast segmentation networks (i.e., the student). Specifically, both zero-order and first-order knowledge depicted in the finely annotated images and unlabeled auxiliary data are transferred to regularize student learning. The proposed method can improve existing fast segmentation models without incurring extra computational overhead, so they can still process images at the same fast speed. Extensive experiments on the Pascal Context, Cityscapes and VOC 2012 datasets demonstrate that the proposed teacher-student learning framework significantly boosts the performance of the student network.
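As a hedged illustration of the teacher-student setup, here is a generic per-pixel knowledge distillation loss in PyTorch: hard-label cross-entropy plus a KL term toward the teacher's soft predictions. This roughly corresponds to transferring per-pixel (zero-order-style) knowledge only; the paper's first-order knowledge transfer and the exact temperature and weighting are not specified in the abstract and are assumptions here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=1.0, alpha=0.5):
    """student_logits / teacher_logits: (B, C, H, W); labels: (B, H, W),
    with ignore_index=255 marking unlabeled pixels."""
    ce = F.cross_entropy(student_logits, labels, ignore_index=255)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kd
```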

* BMVC 2018
* 13 pages, 3 figures, conference
This paper proposes a new method, Multimodal RNNs, for RGB-D scene semantic segmentation. It is optimized to classify image pixels given two input sources: RGB color channels and depth maps. It simultaneously trains two recurrent neural networks (RNNs) that are cross-connected through information transfer layers, which are learned to adaptively extract relevant cross-modality features. Each RNN model learns its representations from its own previous hidden states and from patterns transferred from the other RNN's previous hidden states; thus, both model-specific and cross-modality features are retained. We exploit the structure of quad-directional 2D-RNNs to model short- and long-range contextual information in the 2D input image. We carefully design various baselines to thoroughly examine the proposed model structure. We test our Multimodal RNNs method on popular RGB-D benchmarks and show that it significantly outperforms previous methods and achieves results competitive with other state-of-the-art work.
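The core recurrence can be written as two coupled hidden-state updates, one per modality, each receiving a transferred pattern from the other modality's previous hidden state. The NumPy sketch below uses plain tanh units and illustrative weight names (`W_r`, `U_r`, `T_r`, ...); the actual transfer layers and the quad-directional 2D structure are more elaborate.

```python
import numpy as np

def multimodal_rnn_step(x_rgb, x_d, h_rgb, h_d, params):
    """One recurrent update for the two modality-specific RNNs.  Each hidden state
    depends on its own input, its own previous hidden state, and a transferred
    pattern from the other modality's previous hidden state.
    All weight matrices in `params` are illustrative placeholders."""
    W_r, U_r, T_r, W_d, U_d, T_d = (params[k] for k in
                                    ("W_r", "U_r", "T_r", "W_d", "U_d", "T_d"))
    new_h_rgb = np.tanh(W_r @ x_rgb + U_r @ h_rgb + T_r @ h_d)    # transfer from depth
    new_h_d   = np.tanh(W_d @ x_d   + U_d @ h_d   + T_d @ h_rgb)  # transfer from RGB
    return new_h_rgb, new_h_d
```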

* IEEE Transactions on Multimedia 2017
* 15 pages, 13 figures, IEEE TMM 2017
In this paper, we study the challenging problem of multi-object tracking in a complex scene captured by a single camera. Different from existing tracklet-association-based tracking methods, we propose a novel and efficient way to obtain discriminative appearance-based tracklet affinity models. Our method jointly learns convolutional neural networks (CNNs) and temporally constrained metrics. In our method, a Siamese CNN is first pre-trained on auxiliary data. Then the Siamese CNN and the temporally constrained metrics are jointly learned online to construct the appearance-based tracklet affinity models. The proposed method can jointly learn hierarchical deep features and temporally constrained segment-wise metrics under a unified framework. For reliable association between tracklets, a novel loss function incorporating a temporally constrained multi-task learning mechanism is proposed. By employing the proposed method, tracklet association can be accomplished even in challenging situations. Moreover, a new dataset with 40 fully annotated sequences is created to facilitate tracking evaluation. Experimental results on five public datasets and the new large-scale dataset show that our method outperforms several state-of-the-art approaches in multi-object tracking.

In the last few years, deep learning has led to very good performance on a variety of problems, such as visual recognition, speech recognition and natural language processing. Among the different types of deep neural networks, convolutional neural networks have been the most extensively studied. Leveraging the rapid growth in the amount of annotated data and the great improvements in the power of graphics processing units, research on convolutional neural networks has advanced swiftly and achieved state-of-the-art results on various tasks. In this paper, we provide a broad survey of the recent advances in convolutional neural networks. We detail the improvements to CNNs in different aspects, including layer design, activation functions, loss functions, regularization, optimization and fast computation. In addition, we introduce various applications of convolutional neural networks in computer vision, speech and natural language processing.

* Pattern Recognition, Elsevier
Removing rain streaks from a single image continues to draw attention in outdoor vision systems. In this paper, we present an efficient method to remove rain streaks. First, the locations of rain pixels need to be known as precisely as possible; to this end, we implement a relatively accurate detection of rain streaks by exploiting two of their characteristics. The key component of our method is to represent the intensity of each detected rain pixel using a linear model: $p=\alpha s + \beta$, where $p$ is the observed intensity of a rain pixel and $s$ represents the intensity of the background (i.e., before being affected by rain). To solve for $\alpha$ and $\beta$ at each detected rain pixel, we consider a window centered around it and form an $L_2$-norm cost function over all detected rain pixels within the window, where the rain-removed intensity of each detected rain pixel is estimated from neighboring non-rain pixels. By minimizing this cost function, we determine $\alpha$ and $\beta$ and construct the final rain-removed pixel intensity. Compared with several state-of-the-art works, our proposed method removes rain streaks from a single color image much more efficiently - it offers not only better visual quality but also a speed-up ranging from several times to an order of magnitude.
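For the window-wise linear model $p=\alpha s + \beta$, the minimizer of the $L_2$ cost is an ordinary least-squares fit. The NumPy sketch below assumes the background intensities of the detected rain pixels have already been estimated from non-rain neighbours (that estimation step is simplified away here).

```python
import numpy as np

def estimate_alpha_beta(p_obs, s_est):
    """Least-squares fit of the per-window linear model p = alpha * s + beta.
    p_obs: (n,) observed intensities of the detected rain pixels in the window,
    s_est: (n,) their background intensities estimated from neighbouring
    non-rain pixels.  Minimises sum_i (alpha * s_i + beta - p_i)^2."""
    A = np.stack([s_est, np.ones_like(s_est)], axis=1)   # (n, 2) design matrix
    (alpha, beta), *_ = np.linalg.lstsq(A, p_obs, rcond=None)
    return alpha, beta

def remove_rain_pixel(p, alpha, beta):
    """Invert the linear model to recover the rain-removed intensity s = (p - beta) / alpha."""
    return (p - beta) / (alpha + 1e-12)
```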

* 12 pages, 12 figures
This paper studies a Nyström-type subsampling approach to large-scale kernel learning methods in the misspecified case, where the target function is not assumed to belong to the reproducing kernel Hilbert space generated by the underlying kernel. This case is less well understood, in spite of its practical importance. To model it, the smoothness of the target functions is described in terms of general source conditions. Surprisingly, for almost the whole range of source conditions describing the misspecified case, the corresponding learning rate bounds can be achieved with just one value of the regularization parameter. This observation allows us to formulate mild conditions under which plain Nyström subsampling can be realized with subquadratic cost while maintaining the guaranteed learning rates.
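As background, plain Nyström subsampling for kernel ridge regression restricts the solution to the span of m uniformly sampled training points, reducing the cost from cubic in n to roughly O(n m^2). The sketch below is a generic NumPy implementation of that standard construction (the kernel choice, the regularization scaling, and the small jitter term are illustrative); it is not the paper's analysis.

```python
import numpy as np

def nystrom_krr(X, y, X_test, m, lam, kernel):
    """Kernel ridge regression with plain Nystrom subsampling.
    `kernel(A, B)` returns the Gram matrix K(A, B); `lam` is the regularization."""
    n = X.shape[0]
    idx = np.random.choice(n, size=m, replace=False)      # uniform subsampling of landmarks
    K_nm = kernel(X, X[idx])                               # (n, m)
    K_mm = kernel(X[idx], X[idx])                          # (m, m)
    # normal equations: (K_nm^T K_nm + n*lam*K_mm) a = K_nm^T y
    A = K_nm.T @ K_nm + n * lam * K_mm
    a = np.linalg.solve(A + 1e-10 * np.eye(m), K_nm.T @ y)
    return kernel(X_test, X[idx]) @ a

def rbf(A, B, gamma=1.0):
    """Example RBF kernel for the sketch above."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)
```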

Rain streaks are inevitably captured by outdoor vision systems, which lowers image visual quality and also interferes with various computer vision applications. In this paper, we present a novel rain removal method consisting of two steps: detection of rain streaks and reconstruction of the rain-removed image. Accurate detection of rain streaks largely determines the quality of the overall result. To this end, we first detect rain streaks according to pixel intensities, motivated by the observation that rain streaks often possess higher intensities than neighboring image structures. Mis-detected locations are then refined through morphological processing and principal component analysis (PCA), so that only locations corresponding to real rain streaks are retained. In the second step, we separate image gradients into a background layer and a rain streak layer using an image quasi-sparsity prior, so that a rain image can be decomposed into a background layer and a rain layer. We validate the effectiveness of our method through quantitative and qualitative evaluations, and show that it can robustly remove rain (even relatively bright rain) from images and outperforms some state-of-the-art rain removal algorithms.

* 12 pages, 12 figures
In this paper, we propose a deep learning architecture that produces accurate dense depth for outdoor scenes from a single color image and a sparse depth map. Inspired by indoor depth completion, our network estimates surface normals as an intermediate representation to produce dense depth, and can be trained end-to-end. With a modified encoder-decoder structure, our network effectively fuses the dense color image and the sparse LiDAR depth. To address outdoor-specific challenges, our network predicts a confidence mask to handle mixed LiDAR signals near foreground boundaries caused by occlusion, and combines estimates from the color image and surface normals with learned attention maps to improve depth accuracy, especially in distant areas. Extensive experiments demonstrate that our model improves upon the state-of-the-art performance on the KITTI depth completion benchmark. An ablation study shows the positive impact of each model component on the final performance, and a comprehensive analysis shows that our model generalizes well to inputs with higher sparsity or from indoor scenes.
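A minimal PyTorch sketch of the two fusion ideas mentioned above: masking the sparse LiDAR input with a predicted confidence, and blending the color-based and normal-based depth estimates with a learned per-pixel attention map. The sigmoid gating and the two-branch blend are assumptions about how such learned maps could be applied, not the paper's exact formulation.

```python
import torch

def mask_lidar(sparse_depth, confidence_logits):
    """Down-weight mixed LiDAR returns near foreground boundaries with a
    predicted per-pixel confidence before they are fused by the network."""
    return torch.sigmoid(confidence_logits) * sparse_depth

def fuse_depth_estimates(depth_from_color, depth_from_normals, attention_logits):
    """Combine two dense depth predictions with a learned per-pixel attention map.
    All tensors: (B, 1, H, W).  The sigmoid attention decides, per pixel, how much
    to trust the colour branch versus the surface-normal branch."""
    a = torch.sigmoid(attention_logits)
    return a * depth_from_color + (1.0 - a) * depth_from_normals
```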

* 10 pages
In video compression, most existing deep learning approaches concentrate on the visual quality of a single frame, ignoring the useful priors and the temporal information of adjacent frames. In this paper, we propose a multi-frame guided attention network (MGANet) to enhance the quality of compressed videos. Our network is composed of a temporal encoder that discovers inter-frame relations, a guided encoder-decoder subnet that encodes and enhances the visual patterns of the target frame, and a multi-supervised reconstruction component that aggregates information to predict details. We design a bidirectional residual convolutional LSTM unit to implicitly discover frame variations over time with respect to the target frame. Meanwhile, a guidance map is introduced to direct our network to concentrate more on block boundaries. Our approach takes advantage of both intra-frame prior information and inter-frame information to improve the quality of compressed video. Experimental results show the robustness and superior performance of the proposed method.
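A hedged PyTorch sketch of the bidirectional convolutional-LSTM idea: a minimal ConvLSTM cell (without the residual connections of the paper's unit) run forward and backward over the neighbouring frames, so that each time step sees context from both temporal directions. Channel sizes, kernel size, and the aggregation of the two passes are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: all four gates computed by one convolution."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

def bidirectional_pass(cell, frames, init):
    """frames: list of (B, C, H, W) tensors in temporal order; init: (h0, c0).
    Returns the forward and backward hidden states for each frame."""
    fwd, state = [], init
    for x in frames:
        state = cell(x, state)
        fwd.append(state[0])
    bwd, state = [], init
    for x in reversed(frames):
        state = cell(x, state)
        bwd.append(state[0])
    return fwd, list(reversed(bwd))
```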

* 12 pages, 12 figures