Knowledge of users' emotional states helps improve human-computer interaction. In this work, we present EmoNet, an emotion detector for Chinese daily dialogues based on deep convolutional neural networks. To preserve original linguistic features, such as word order, we did not adopt commonly used methods like segmentation and keyword extraction; instead, we increased the depth of the CNN and let it learn inner linguistic relationships on its own. Our main contribution is a new model and pipeline that can be applied in multilingual environments to sentiment problems. Experimental results show that EmoNet has a great capacity for learning the emotion of dialogues and achieves better results than other state-of-the-art detectors.
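
A minimal sketch (not the authors' code) of a deep character-level CNN of this kind in PyTorch; the vocabulary size, depth and 7-class output are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, vocab_size=6000, embed_dim=128, num_classes=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Stacked convolutions let the network learn character-order
        # patterns directly, without segmentation or keyword extraction.
        self.convs = nn.Sequential(
            nn.Conv1d(embed_dim, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.fc = nn.Linear(256, num_classes)

    def forward(self, char_ids):                   # (batch, seq_len)
        x = self.embed(char_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        x = self.convs(x).squeeze(-1)              # (batch, 256)
        return self.fc(x)                          # emotion logits
```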

* 7 pages, 7 figures
This paper introduces Dex, a reinforcement learning environment toolkit specialized for training and evaluating continual learning methods as well as general reinforcement learning problems. We also present the novel continual learning method of incremental learning, where a challenging environment is solved using optimal weight initialization learned from first solving a similar, easier environment. We show that incremental learning can produce vastly superior results to standard methods, and provide a strong baseline across ten Dex environments. Finally, we develop a saliency method for qualitative analysis of reinforcement learning, which shows the impact incremental learning has on network attention.
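
A minimal sketch of the incremental-learning idea, assuming identical network architectures for the easy and hard environments (the tiny MLP and the omitted training loops are placeholders, not the paper's setup):

```python
import torch.nn as nn

def make_policy():
    # Same architecture for both tasks so weights transfer directly.
    return nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))

easy_policy = make_policy()
# ... train easy_policy on the easier environment (loop omitted) ...

hard_policy = make_policy()
hard_policy.load_state_dict(easy_policy.state_dict())  # weight initialization
# ... continue training hard_policy on the harder environment ...
```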

* NIPS 2017 submission, 10 pages, 26 figures
While previous research in eye fixation prediction typically relies on integrating low-level features (e.g. color, edge) to form a saliency map, it has recently been found that the structural organization of these features into a proto-object representation can play a more significant role. In this work, we present a computational framework based on a deep network to demonstrate that proto-object representations can be learned from low-resolution image patches from fixation regions. We advocate the use of low-resolution inputs for the following reasons: (1) proto-objects are computed in parallel over the entire visual field; (2) people can perceive or recognize objects well even in low resolution; (3) fixations from lower-resolution images can predict fixations on higher-resolution images. In the proposed computational model, we extract multi-scale image patches on fixation regions from eye fixation datasets, resize them to low resolution and feed them into a hierarchical network. With layer-wise unsupervised feature learning, we find that many proto-object-like features responsive to different shapes of object blobs are learned. Visualizations also show that these features are selective to potential objects in the scene, and their responses work well in predicting eye fixations on the images when combined with learned weights.

* This paper has been withdrawn by the author because the submission was incomplete
Over the last decade, it has been demonstrated that many systems in science and engineering can be modeled more accurately by fractional-order than by integer-order derivatives, and many methods have been developed to solve the resulting fractional systems. Thanks to the extra free parameter, the order, fractional-order based methods provide an additional degree of freedom in optimizing performance. Not surprisingly, many fractional-order based methods have been used in the image processing field. Herein, recent studies are reviewed in ten sub-fields: image enhancement, image denoising, image edge detection, image segmentation, image registration, image recognition, image fusion, image encryption, image compression and image restoration. In sum, it is well established that, as a fundamental mathematical tool, the fractional-order derivative shows great success in image processing.
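
For reference, a minimal sketch of one common discretization used throughout this literature, the Grünwald-Letnikov fractional derivative (applying it along image rows/columns gives a fractional-order gradient); the order alpha and window length are illustrative:

```python
import numpy as np

def gl_coefficients(alpha, n):
    # w_k = (-1)^k * C(alpha, k), via w_k = w_{k-1} * (1 - (alpha + 1) / k)
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    return w

def fractional_derivative(signal, alpha=0.5, window=16):
    # Discrete GL derivative: causal sum of weighted past samples (h = 1).
    w = gl_coefficients(alpha, window)
    return np.convolve(signal, w, mode="full")[: len(signal)]
```

For alpha = 1 the coefficients reduce to [1, -1, 0, ...], i.e. the ordinary first difference, which is a quick sanity check of the recurrence.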

* 26 pages, 9 figures, 117 references
Despite the success of deep neural networks (DNNs) in image classification tasks, human-level performance relies on massive training data with high-quality manual annotations, which are expensive and time-consuming to collect. There exist many inexpensive data sources on the web, but they tend to contain inaccurate labels. Training on noisy labeled datasets causes performance degradation because DNNs can easily overfit to the label noise. To overcome this problem, we propose a noise-tolerant training algorithm in which a meta-learning update is performed prior to the conventional gradient update. The proposed meta-learning method simulates actual training by generating synthetic noisy labels, and trains the model such that, after one gradient update using each set of synthetic noisy labels, the model does not overfit to the specific noise. We conduct extensive experiments on the noisy CIFAR-10 dataset and the Clothing1M dataset. The results demonstrate the advantageous performance of the proposed method compared to several state-of-the-art baselines.
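
A simplified, first-order sketch of such a meta-step (not the authors' exact algorithm; the consistency target and the gradient-copy approximation are illustrative assumptions, and the full method differentiates through the inner update):

```python
import copy
import torch
import torch.nn.functional as F

def meta_step(model, x, y, noise_frac=0.3, inner_lr=0.1):
    # 1) Synthesize noisy labels by randomly flipping a fraction of y.
    num_classes = model(x[:1]).shape[1]
    y_noisy = y.clone()
    flip = torch.rand(y.shape) < noise_frac
    y_noisy[flip] = torch.randint(0, num_classes, (int(flip.sum()),))

    # 2) Reference predictions before the noisy update.
    with torch.no_grad():
        p_before = F.softmax(model(x), dim=1)

    # 3) One inner SGD step on the noisy labels, on a copy of the model.
    fast = copy.deepcopy(model)
    inner_loss = F.cross_entropy(fast(x), y_noisy)
    grads = torch.autograd.grad(inner_loss, list(fast.parameters()))
    with torch.no_grad():
        for p, g in zip(fast.parameters(), grads):
            p -= inner_lr * g

    # 4) Meta-loss: after the noisy update the model should still agree
    #    with its earlier predictions, i.e. not overfit the noise.
    meta_loss = F.kl_div(F.log_softmax(fast(x), dim=1), p_before,
                         reduction="batchmean")

    # 5) First-order approximation: copy the meta-gradient back to `model`.
    meta_loss.backward()
    with torch.no_grad():
        for p, q in zip(model.parameters(), fast.parameters()):
            p.grad = q.grad.clone() if p.grad is None else p.grad + q.grad
```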

This paper explores a new natural language processing task, review-driven multi-label music style classification. The task requires a system to identify multiple styles of a piece of music based on its reviews on websites. The biggest challenge lies in the complicated relations among music styles, which have caused many multi-label classification methods to fail. To tackle this problem, we propose a novel deep learning approach that automatically learns and exploits style correlations. The proposed method consists of two parts: a label-graph based neural network, and a soft training mechanism with correlation-based continuous label representations. Experimental results show that our approach achieves large improvements over the baselines on the proposed dataset. In particular, the micro F1 is improved from 53.9 to 64.5, and the one-error is reduced from 30.5 to 22.6. Furthermore, visualized analysis shows that our approach performs well in capturing style correlations.
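
A minimal sketch of soft training with correlation-based continuous labels (assumptions: `corr` is a precomputed style-correlation matrix, `y_hard` is a float multi-hot target, and the mixing weight is illustrative; the paper's exact construction may differ):

```python
import torch
import torch.nn.functional as F

def soft_targets(y_hard, corr, mix=0.5):
    # Propagate mass from annotated styles to correlated styles, then
    # blend with the original hard multi-label target.
    y_corr = y_hard @ corr                                # (batch, num_styles)
    y_corr = y_corr / y_corr.max(dim=1, keepdim=True).values.clamp(min=1e-8)
    return (1 - mix) * y_hard + mix * y_corr

def soft_training_loss(logits, y_hard, corr):
    # BCE against continuous targets instead of brittle 0/1 labels.
    return F.binary_cross_entropy_with_logits(logits, soft_targets(y_hard, corr))
```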

Training deep learning based video classifiers for action recognition requires a large amount of labeled videos. The labeling process is labor-intensive and time-consuming. On the other hand, large amounts of weakly labeled images are uploaded to the Internet by users every day. To harness this rich and highly diverse set of Web images, a scalable approach is to crawl them to train deep learning based classifiers, such as Convolutional Neural Networks (CNNs). However, due to the domain shift problem, the performance of deep classifiers trained on Web images tends to degrade when they are directly deployed on videos. One way to address this problem is to fine-tune the trained models on videos, but a sufficient amount of annotated videos is still required. In this work, we propose a novel approach to transfer knowledge from the image domain to the video domain. The proposed method can adapt to the target domain (i.e. video data) with a limited amount of training data. Our method maps the video frames into a low-dimensional feature space using the class-discriminative spatial attention maps of CNNs. We design a novel Siamese EnergyNet structure that learns energy functions on the attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground truth concept has higher energy. We conduct extensive experiments on two challenging video recognition datasets (i.e. TVHI and UCF101), and demonstrate the efficacy of our proposed method.

* ACM Multimedia, 2017
Image super-resolution aims to recover detailed information from low-resolution images and reconstruct high-resolution images from them. Due to the limited amount of data and information retrievable from low-resolution images, it is difficult to restore clear, artifact-free images while still preserving enough image structure, such as texture. This paper presents a new single-image super-resolution method based on adaptive fractional-order gradient interpolation and reconstruction with a self-optimizing mask. The interpolated image gradient is first constructed via the optimal fractional-order gradient according to image similarity, and a minimum energy function is then employed to reconstruct the final high-resolution image. Fractional-order gradient based interpolation provides an additional degree of freedom that helps optimize reconstruction quality, owing to the extra free parameter, the $\alpha$-order. The proposed method produces rich texture detail while maintaining structural similarity even under large zoom conditions. Experimental results show that the proposed method performs better than current single-image super-resolution techniques.

* 24 pages, 13 figures, to appear in ISA Transactions
The Fisher vector (FV) has become a popular image representation. One notable underlying assumption of the FV framework is that local descriptors are well decorrelated within each cluster, so that the covariance matrix for each Gaussian can be simplified to be diagonal. Though the FV usually relies on Principal Component Analysis (PCA) to decorrelate local features, the PCA is applied to the entire training data and hence only diagonalizes the \textit{universal} covariance matrix, rather than those w.r.t. the local components. As a result, the local decorrelation assumption is usually not supported in practice. To relax this assumption, this paper proposes a completed model of the Fisher vector, termed the Completed Fisher Vector (CFV). The CFV is a more general framework than the FV, since it encodes not only the variances but also the correlations of the whitened local descriptors, and thus leads to improved discriminative power. We take the task of material categorization as an example and experimentally show that: 1) the CFV outperforms the FV under all parameter settings; 2) the CFV is robust to changes in the number of components in the mixture; 3) even with a relatively small visual vocabulary the CFV still works well on two challenging datasets.
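
As a reference point, a minimal sketch of the standard FV encoding under the diagonal-covariance assumption discussed above (the CFV additionally encodes correlations of the whitened descriptors, which this sketch omits; the use of sklearn's GaussianMixture is an illustrative choice):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    # X: (T, D) local descriptors; gmm: fitted diagonal-covariance GMM.
    T = X.shape[0]
    gamma = gmm.predict_proba(X)            # (T, K) soft assignments
    pi, mu = gmm.weights_, gmm.means_       # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)       # (K, D), diagonal model
    parts = []
    for k in range(pi.shape[0]):
        z = (X - mu[k]) / sigma[k]          # whitened descriptors
        g_mu = (gamma[:, k, None] * z).sum(0) / (T * np.sqrt(pi[k]))
        g_sig = (gamma[:, k, None] * (z ** 2 - 1)).sum(0) / (T * np.sqrt(2 * pi[k]))
        parts += [g_mu, g_sig]              # gradients w.r.t. mean and variance
    return np.concatenate(parts)

# gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(train_descs)
```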

* Manuscript submitted to Neurocomputing at the end of April 2015 (!). One year has passed, but we have not yet received any review comments!
\textit{Indirect Immunofluorescence Imaging of Human Epithelial Type 2} (HEp-2) cells is an effective way to identify the presence of Anti-Nuclear Antibody (ANA). Most existing works on HEp-2 cell classification mainly focus on feature extraction, feature encoding and classifier design. Very few efforts have been devoted to studying the importance of pre-processing techniques. In this paper, we analyze the importance of pre-processing, and investigate the role of Gaussian Scale Space (GSS) theory as a pre-processing approach for the HEp-2 cell classification task. We validate the GSS pre-processing under the Local Binary Pattern (LBP) and the Bag-of-Words (BoW) frameworks. Under the BoW framework, the introduced pre-processing approach, using only one Local Orientation Adaptive Descriptor (LOAD), achieved superior performance on the Executable Thematic on Pattern Recognition Techniques for Indirect Immunofluorescence (ET-PRT-IIF) image analysis task. Our system, using only one feature, outperformed the winner of the ICPR 2014 contest, which combined four types of features. Meanwhile, the proposed pre-processing method is not restricted to this work; it can be generalized to many existing works.
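
A minimal sketch of the GSS pre-processing step, assuming a scale stack built with SciPy's gaussian_filter (the sigma values are illustrative); descriptors such as LBP or LOAD are then extracted at every scale:

```python
from scipy.ndimage import gaussian_filter

def gaussian_scale_space(image, sigmas=(0.0, 1.0, 2.0, 4.0)):
    # sigma = 0 keeps the original image as the finest scale.
    return [gaussian_filter(image, s) if s > 0 else image for s in sigmas]
```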

* 9 pages, 6 figures
Social relationships form the basis of the social structure of humans. Developing computational models to understand social relationships from visual data is essential for building intelligent machines that can better interact with humans in a social environment. In this work, we study the problem of visual social relationship recognition in images. We propose a Dual-Glance model for social relationship recognition, where the first glance fixates on the person of interest and the second glance deploys an attention mechanism to exploit contextual cues. To enable this study, we curated a large-scale People in Social Context (PISC) dataset, which comprises 23,311 images and 79,244 person pairs with annotated social relationships. Since visually identifying social relationships bears a certain degree of uncertainty, we further propose an Adaptive Focal Loss that leverages the ambiguous annotations for more effective learning. We conduct extensive experiments to quantitatively and qualitatively demonstrate the efficacy of our proposed method, which yields state-of-the-art performance on social relationship recognition.
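
For context, a minimal sketch of the standard focal loss that the Adaptive Focal Loss builds on (the adaptive variant additionally weights terms by annotation ambiguity, which is not reproduced here; gamma is illustrative):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    # Down-weight easy examples (pt close to 1), focus on hard ones.
    return (-((1 - pt) ** gamma) * log_pt).mean()
```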

* arXiv admin note: text overlap with arXiv:1708.00634
Recent successes in human action recognition with deep learning methods mostly adopt the supervised learning paradigm, which requires a significant amount of manually labeled data to achieve good performance. However, label collection is an expensive and time-consuming process. In this work, we propose an unsupervised learning framework that exploits unlabeled data to learn video representations. Different from previous works on video representation learning, our unsupervised learning task is to predict 3D motion in multiple target views using the video representation from a source view. By learning to extrapolate cross-view motions, the representation can capture view-invariant motion dynamics that are discriminative for the action. In addition, we propose a view-adversarial training method to enhance the learning of view-invariant features. We demonstrate the effectiveness of the learned representations for action recognition on multiple datasets.
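
Adversarial feature learning of this kind is commonly realized with a gradient reversal layer; the sketch below shows that mechanism (a view classifier trains on the representation while reversed gradients push the encoder toward view-invariant features), which may differ from the paper's exact formulation:

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the encoder.
        return -ctx.lam * grad_output, None

def view_adversarial_logits(features, view_classifier, lam=1.0):
    # The classifier learns to predict the view; the encoder learns to fool it.
    return view_classifier(GradReverse.apply(features, lam))
```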

* NIPS 2018
Fine-grained visual recognition is challenging because it relies heavily on the modeling of various semantic parts and on fine-grained feature learning. Bilinear pooling based models have been shown to be effective at fine-grained recognition, yet most previous approaches neglect the fact that inter-layer part feature interaction and fine-grained feature learning are mutually correlated and can reinforce each other. In this paper, we present a novel model to address these issues. First, a cross-layer bilinear pooling approach is proposed to capture the inter-layer part feature relations, which results in superior performance compared with other bilinear pooling based approaches. Second, we propose a novel hierarchical bilinear pooling framework to integrate multiple cross-layer bilinear features and enhance their representation capability. Our formulation is intuitive, efficient and achieves state-of-the-art results on widely used fine-grained recognition datasets.
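
A minimal sketch of cross-layer bilinear pooling in its plain (unfactorized) form, with the common signed-sqrt and L2 normalization steps; the paper's hierarchical framework combines several such cross-layer features, and a factorized projection is typically used in practice to keep dimensions manageable:

```python
import torch
import torch.nn.functional as F

def cross_layer_bilinear(feat_a, feat_b):
    # feat_a: (B, Ca, H, W) and feat_b: (B, Cb, H, W) from two conv layers
    # with matching spatial size.
    B, Ca, H, W = feat_a.shape
    Cb = feat_b.shape[1]
    a = feat_a.reshape(B, Ca, H * W)
    b = feat_b.reshape(B, Cb, H * W)
    z = torch.bmm(a, b.transpose(1, 2)) / (H * W)   # sum-pooled outer products
    z = z.reshape(B, Ca * Cb)
    z = torch.sign(z) * torch.sqrt(z.abs() + 1e-8)  # signed square root
    return F.normalize(z)                           # L2 normalization
```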

* 16 pages, 3 figures
Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focused on generating a single-sentence description for visual content, recent works have studied paragraph generation. In this work, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the stories and the length and complexity of the videos. We propose novel methods to address these challenges. First, we propose a context-aware framework for multimodal embedding learning, where we design a Residual Bidirectional Recurrent Neural Network to leverage contextual information from the past and the future. Second, we propose a Narrator model to discover the underlying storyline. The Narrator is formulated as a reinforcement learning agent which is trained by directly optimizing the textual metric of the generated story. We evaluate our method on the Video Story dataset, a new dataset that we have collected to enable this study. We compare our method with multiple state-of-the-art baselines, and show that it achieves better performance in terms of both quantitative measures and a user study.

Since the earliest civilizations, the social relationships derived from each individual have fundamentally formed the basis of the social structure of our daily life. In the computer vision literature, much progress has been made in scene understanding, such as object detection and scene parsing. Recent research has focused on the relationships between objects based on their functionality and geometric relations. In this work, we aim to study the problem of social relationship recognition in still images. We propose a dual-glance model for social relationship recognition, where the first glance fixates on the individual pair of interest and the second glance deploys an attention mechanism to explore contextual cues. We have also collected a new large-scale People in Social Context (PISC) dataset, which comprises 22,670 images and 76,568 annotated samples from 9 types of social relationships. We provide benchmark results on the PISC dataset, and qualitatively demonstrate the efficacy of the proposed model.

* IEEE International Conference on Computer Vision (ICCV), 2017
The objective of this paper is the effective transfer of Convolutional Neural Network (CNN) features to image search and classification. We systematically study three aspects of CNN transfer. 1) We demonstrate the advantage of using images with a properly large size as input to the CNN, instead of the conventionally resized ones. 2) We benchmark the performance of different CNN layers improved by average/max pooling on the feature maps. Our observation suggests that the Conv5 feature yields very competitive accuracy with this pooling step. 3) We find that a simple combination of pooled features extracted across various CNN layers is effective in collecting evidence from both low- and high-level descriptors. Following these good practices, we improve the state of the art on a number of benchmarks by a large margin.
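
A minimal sketch of the recommended practice, assuming a torchvision VGG-16 whose last conv block stands in for "Conv5": feed a larger-than-usual image to the convolutional layers only, then average-pool the feature map into a global descriptor:

```python
import torch
import torchvision.models as models

vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()  # conv layers only

def conv5_descriptor(image):                 # image: (1, 3, H, W); H, W can be large
    with torch.no_grad():
        fmap = vgg(image)                    # (1, 512, H/32, W/32) feature map
        return fmap.mean(dim=(2, 3))         # average pooling -> (1, 512)
```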

* 9 pages. It will be submitted to an appropriate journal
We focus on the challenging task of real-time semantic segmentation in this paper. It has many practical applications, yet presents the fundamental difficulty of reducing a large portion of the computation required for pixel-wise label inference. We propose an image cascade network (ICNet) that incorporates multi-resolution branches under proper label guidance to address this challenge. We provide an in-depth analysis of our framework and introduce the cascade feature fusion unit to quickly achieve high-quality segmentation. Our system yields real-time inference on a single GPU card, with decent-quality results on challenging datasets like Cityscapes, CamVid and COCO-Stuff.
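
A minimal sketch of a cascade feature fusion (CFF) unit in the spirit of ICNet, where a low-resolution feature map is upsampled and fused with a higher-resolution one; the kernel sizes, dilation and channel counts here are assumptions and may differ from the paper:

```python
import torch.nn as nn
import torch.nn.functional as F

class CascadeFeatureFusion(nn.Module):
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.conv_low = nn.Sequential(   # dilated conv on the upsampled branch
            nn.Conv2d(low_ch, out_ch, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(out_ch))
        self.conv_high = nn.Sequential(  # 1x1 projection on the high-res branch
            nn.Conv2d(high_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x_low, x_high):
        x_low = F.interpolate(x_low, size=x_high.shape[2:],
                              mode="bilinear", align_corners=False)
        return F.relu(self.conv_low(x_low) + self.conv_high(x_high))
```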

* ECCV 2018
Visual grounding aims to localize an object in an image referred to by a textual query phrase. Various visual grounding approaches have been proposed, and the problem can be modularized into a general framework: proposal generation, multi-modal feature representation, and proposal ranking. Of these three modules, most existing approaches focus on the latter two, and the importance of proposal generation is generally neglected. In this paper, we rethink the question of what properties make a good proposal generator. We introduce diversity and discrimination simultaneously when generating proposals, and in doing so propose the Diversified and Discriminative Proposal Networks (DDPN) model. Based on the proposals generated by DDPN, we propose a high-performance baseline model for visual grounding and evaluate it on four benchmark datasets. Experimental results demonstrate that our model delivers significant improvements on all the tested datasets (e.g., 18.8\% improvement on ReferItGame and 8.2\% improvement on Flickr30k Entities over the existing state of the art, respectively).

* Accepted in IJCAI 2018
Reversible data hiding (RDH) is desirable in applications where both the hidden message and the cover medium need to be recovered without loss. Among many RDH approaches is prediction-error expansion (PEE), which contains two steps: i) prediction of a target pixel value, and ii) embedding according to the value of the prediction-error. In general, higher prediction performance leads to larger embedding capacity and/or lower signal distortion. Leveraging recent advances in graph signal processing (GSP), we pose pixel prediction as a graph-signal restoration problem, where the appropriate edge weights of the underlying graph are computed using a similar patch searched in a semi-local neighborhood. Specifically, for each candidate patch, we first examine the eigenvalues of its structure tensor to estimate its local smoothness. If sufficiently smooth, we pose a maximum a posteriori (MAP) problem using either a quadratic Laplacian regularizer or a graph total variation (GTV) term as the signal prior. While the MAP problem using the first prior has a closed-form solution, we design an efficient algorithm for the second prior using the alternating direction method of multipliers (ADMM) with nested proximal gradient descent. Experimental results show that with the better-quality GSP-based prediction, at low capacity the visual quality of the embedded image exceeds that of state-of-the-art methods noticeably.
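
A minimal sketch of the PEE step for a single pixel with a generic predictor (the paper's contribution, the GSP-based predictor, simply plugs into `predicted`; overflow handling and the expandable-range bookkeeping are simplified, and the threshold is illustrative):

```python
def pee_embed(pixel, predicted, bit, threshold=2):
    e = pixel - predicted
    if abs(e) < threshold:
        e = 2 * e + bit                    # expand the error and hide one bit
    else:
        e = e + threshold if e >= 0 else e - threshold  # shift, no payload
    return predicted + e

def pee_extract(marked_pixel, predicted, threshold=2):
    e = marked_pixel - predicted
    if abs(e) < 2 * threshold:             # expanded error carries the bit
        return e % 2, predicted + (e >> 1)  # (bit, losslessly restored pixel)
    shift = threshold if e >= 0 else -threshold
    return None, marked_pixel - shift      # shifted pixel carried no payload
```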

Scene parsing is challenging due to unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information through different-region-based context aggregation with our pyramid pooling module, together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective for producing good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction tasks. The proposed approach achieves state-of-the-art performance on various datasets: it came first in the ImageNet Scene Parsing Challenge 2016, the PASCAL VOC 2012 benchmark and the Cityscapes benchmark. A single PSPNet yields a new record of 85.4% mIoU on PASCAL VOC 2012 and 80.2% on Cityscapes.
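
A minimal sketch of a pyramid pooling module in this spirit: the feature map is pooled at several bin sizes, and each pooled map is projected, upsampled and concatenated with the original features (the bin sizes 1, 2, 3 and 6 follow the paper; the channel arithmetic is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1, bias=False),
                          nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[2:]
        pyramids = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                  align_corners=False) for stage in self.stages]
        return torch.cat([x] + pyramids, dim=1)  # global context + local features
```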

* CVPR 2017