Knowledge Base (KB) completion, which aims to determine missing relation between entities, has raised increasing attention in recent years. Most existing methods either focus on the positional relationship between entity pair and single relation (1-hop path) in semantic space or concentrate on the joint probability of Random Walks on multi-hop paths among entities. However, they do not fully consider the intrinsic relationships of all the links among entities. By observing that the single relation and multi-hop paths between the same entity pair generally contain shared/similar semantic information, this paper proposes a novel method to capture the shared features between them as the basis for inferring missing relations. To capture the shared features jointly, we develop Hierarchical Attention Networks (HANs) to automatically encode the inputs into low-dimensional vectors, and exploit two partial parameter-shared components, one for feature source discrimination and the other for determining missing relations. By joint Adversarial Training (AT) the entire model, our method minimizes the classification error of missing relations, and ensures the source of shared features are difficult to discriminate in the meantime. The AT mechanism encourages our model to extract features that are both discriminative for missing relation prediction and shareable between single relation and multi-hop paths. We extensively evaluate our method on several large-scale KBs for relation completion. Experimental results show that our method consistently outperforms the baseline approaches. In addition, the hierarchical attention mechanism and the feature extractor in our model can be well interpreted and utilized in the related downstream tasks.

Click to Read Paper
Appropriate comments of code snippets provide insight for code functionality, which are helpful for program comprehension. However, due to the great cost of authoring with the comments, many code projects do not contain adequate comments. Automatic comment generation techniques have been proposed to generate comments from pieces of code in order to alleviate the human efforts in annotating the code. Most existing approaches attempt to exploit certain correlations (usually manually given) between code and generated comments, which could be easily violated if the coding patterns change and hence the performance of comment generation declines. In this paper, we first build C2CGit, a large dataset from open projects in GitHub, which is more than 20$\times$ larger than existing datasets. Then we propose a new attention module called Code Attention to translate code to comments, which is able to utilize the domain features of code snippets, such as symbols and identifiers. We make ablation studies to determine effects of different parts in Code Attention. Experimental results demonstrate that the proposed module has better performance over existing approaches in both BLEU and METEOR.

Click to Read Paper
Many real-world problems can be represented as graph-based learning problems. In this paper, we propose a novel framework for learning spatial and attentional convolution neural networks on arbitrary graphs. Different from previous convolutional neural networks on graphs, we first design a motif-matching guided subgraph normalization method to capture neighborhood information. Then we implement self-attentional layers to learn different importances from different subgraphs to solve graph classification problems. Analogous to image-based attentional convolution networks that operate on locally connected and weighted regions of the input, we also extend graph normalization from one-dimensional node sequence to two-dimensional node grid by leveraging motif-matching, and design self-attentional layers without requiring any kinds of cost depending on prior knowledge of the graph structure. Our results on both bioinformatics and social network datasets show that we can significantly improve graph classification benchmarks over traditional graph kernel and existing deep models.

Click to Read Paper
Understanding the diffusion in social network is an important task. However, this task is challenging since (1) the network structure is usually hidden with only observations of events like "post" or "repost" associated with each node, and (2) the interactions between nodes encompass multiple distinct patterns which in turn affect the diffusion patterns. For instance, social interactions seldom develop on a single channel, and multiple relationships can bind pairs of people due to their various common interests. Most previous work considers only one of these two challenges which is apparently unrealistic. In this paper, we study the problem of \emph{inferring multiplex network} in social networks. We propose the Multiplex Diffusion Model (MDM) which incorporates the multivariate marked Hawkes process and topic model to infer the multiplex structure of social network. A MCMC based algorithm is developed to infer the latent multiplex structure and to estimate the node-related parameters. We evaluate our model based on both synthetic and real-world datasets. The results show that our model is more effective in terms of uncovering the multiplex network structure.

Click to Read Paper
The models developed to date for knowledge base embedding are all based on the assumption that the relations contained in knowledge bases are binary. For the training and testing of these embedding models, multi-fold (or n-ary) relational data are converted to triples (e.g., in FB15K dataset) and interpreted as instances of binary relations. This paper presents a canonical representation of knowledge bases containing multi-fold relations. We show that the existing embedding models on the popular FB15K datasets correspond to a sub-optimal modelling framework, resulting in a loss of structural information. We advocate a novel modelling framework, which models multi-fold relations directly using this canonical representation. Using this framework, the existing TransH model is generalized to a new model, m-TransH. We demonstrate experimentally that m-TransH outperforms TransH by a large margin, thereby establishing a new state of the art.

* 8 pages, to appear in IJCAI 2016
Click to Read Paper
Deep learning researches on the transformation problems for image and text have raised great attention. However, present methods for music feature transfer using neural networks are far from practical application. In this paper, we initiate a novel system for transferring the texture of music, and release it as an open source project. Its core algorithm is composed of a converter which represents sounds as texture spectra, a corresponding reconstructor and a feed-forward transfer network. We evaluate this system from multiple perspectives, and experimental results reveal that it achieves convincing results in both sound effects and computational performance.

* 12 pages
Click to Read Paper
Vision-based automatic counting of people has widespread applications in intelligent transportation systems, security, and logistics. However, there is currently no large-scale public dataset for benchmarking approaches on this problem. This work fills this gap by introducing the first real-world RGB-D People Counting DataSet (PCDS) containing over 4,500 videos recorded at the entrance doors of buses in normal and cluttered conditions. It also proposes an efficient method for counting people in real-world cluttered scenes related to public transportations using depth videos. The proposed method computes a point cloud from the depth video frame and re-projects it onto the ground plane to normalize the depth information. The resulting depth image is analyzed for identifying potential human heads. The human head proposals are meticulously refined using a 3D human model. The proposals in each frame of the continuous video stream are tracked to trace their trajectories. The trajectories are again refined to ascertain reliable counting. People are eventually counted by accumulating the head trajectories leaving the scene. To enable effective head and trajectory identification, we also propose two different compound features. A thorough evaluation on PCDS demonstrates that our technique is able to count people in cluttered scenes with high accuracy at 45 fps on a 1.7 GHz processor, and hence it can be deployed for effective real-time people counting for intelligent transportation systems.

* Submitted to a journal
Click to Read Paper
In this paper, a new network-transmission-based (NTB) algorithm is proposed for human activity recognition in videos. The proposed NTB algorithm models the entire scene as an error-free network. In this network, each node corresponds to a patch of the scene and each edge represents the activity correlation between the corresponding patches. Based on this network, we further model people in the scene as packages while human activities can be modeled as the process of package transmission in the network. By analyzing these specific "package transmission" processes, various activities can be effectively detected. The implementation of our NTB algorithm into abnormal activity detection and group activity recognition are described in detail in the paper. Experimental results demonstrate the effectiveness of our proposed algorithm.

* IEEE Trans. Circuits and Systems for Video Technology, vol. 24, no. 5, pp. 826-841, 2014
* This manuscript is the accepted version for TCSVT (IEEE Transactions on Circuits and Systems for Video Technology)
Click to Read Paper
Kernel methods are powerful tools to capture nonlinear patterns behind data. They implicitly learn high (even infinite) dimensional nonlinear features in the Reproducing Kernel Hilbert Space (RKHS) while making the computation tractable by leveraging the kernel trick. Classic kernel methods learn a single layer of nonlinear features, whose representational power may be limited. Motivated by recent success of deep neural networks (DNNs) that learn multi-layer hierarchical representations, we propose a Stacked Kernel Network (SKN) that learns a hierarchy of RKHS-based nonlinear features. SKN interleaves several layers of nonlinear transformations (from a linear space to a RKHS) and linear transformations (from a RKHS to a linear space). Similar to DNNs, a SKN is composed of multiple layers of hidden units, but each parameterized by a RKHS function rather than a finite-dimensional vector. We propose three ways to represent the RKHS functions in SKN: (1)nonparametric representation, (2)parametric representation and (3)random Fourier feature representation. Furthermore, we expand SKN into CNN architecture called Stacked Kernel Convolutional Network (SKCN). SKCN learning a hierarchy of RKHS-based nonlinear features by convolutional operation with each filter also parameterized by a RKHS function rather than a finite-dimensional matrix in CNN, which is suitable for image inputs. Experiments on various datasets demonstrate the effectiveness of SKN and SKCN, which outperform the competitive methods.

Click to Read Paper
Reusable model design becomes desirable with the rapid expansion of machine learning applications. In this paper, we focus on the reusability of pre-trained deep convolutional models. Specifically, different from treating pre-trained models as feature extractors, we reveal more treasures beneath convolutional layers, i.e., the convolutional activations could act as a detector for the common object in the image co-localization problem. We propose a simple but effective method, named Deep Descriptor Transforming (DDT), for evaluating the correlations of descriptors and then obtaining the category-consistent regions, which can accurately locate the common object in a set of images. Empirical studies validate the effectiveness of the proposed DDT method. On benchmark image co-localization datasets, DDT consistently outperforms existing state-of-the-art methods by a large margin. Moreover, DDT also demonstrates good generalization ability for unseen categories and robustness for dealing with noisy data.

* Accepted by IJCAI 2017
Click to Read Paper
Channel pruning is an important family of methods to speedup deep model's inference. Previous filter pruning algorithms regard channel pruning and model fine-tuning as two independent steps. This paper argues that combining them in a single end-to-end trainable system will lead to better results. We propose an efficient channel selection layer, namely AutoPruner, to find less important filters automatically in a joint training manner. AutoPruner takes previous activation responses as input and generates a true binary index code for pruning. Hence, all the filters corresponding to zero index values can be removed safely after training. We empirically demonstrate that the gradient information of this channel selection layer is also helpful for the whole model training. Compared with previous state-of-the-art pruning algorithms, AutoPruner achieves significantly better performance. Furthermore, ablation experiments show that the proposed novel mini-batch pooling and binarization operations are vital for the success of filter pruning.

* Submitted to NIPS 2018
Click to Read Paper
Although traditionally binary visual representations are mainly designed to reduce computational and storage costs in the image retrieval research, this paper argues that binary visual representations can be applied to large scale recognition and detection problems in addition to hashing in retrieval. Furthermore, the binary nature may make it generalize better than its real-valued counterparts. Existing binary hashing methods are either two-stage or hinging on loss term regularization or saturated functions, hence converge slowly and only emit soft binary values. This paper proposes Approximately Binary Clamping (ABC), which is non-saturating, end-to-end trainable, with fast convergence and can output true binary visual representations. ABC achieves comparable accuracy in ImageNet classification as its real-valued counterpart, and even generalizes better in object detection. On benchmark image retrieval datasets, ABC also outperforms existing hashing methods.

* 16 pages, 3 figures
Click to Read Paper
This paper aims to simultaneously accelerate and compress off-the-shelf CNN models via filter pruning strategy. The importance of each filter is evaluated by the proposed entropy-based method first. Then several unimportant filters are discarded to get a smaller CNN model. Finally, fine-tuning is adopted to recover its generalization ability which is damaged during filter pruning. Our method can reduce the size of intermediate activations, which would dominate most memory footprint during model training stage but is less concerned in previous compression methods. Experiments on the ILSVRC-12 benchmark demonstrate the effectiveness of our method. Compared with previous filter importance evaluation criteria, our entropy-based method obtains better performance. We achieve 3.3x speed-up and 16.64x compression on VGG-16, 1.54x acceleration and 1.47x compression on ResNet-50, both with about 1% top-5 accuracy decrease.

Click to Read Paper
Compared with other semantic segmentation tasks, portrait segmentation requires both higher precision and faster inference speed. However, this problem has not been well studied in previous works. In this paper, we propose a lightweight network architecture, called Boundary-Aware Network (BANet) which selectively extracts detail information in boundary area to make high-quality segmentation output with real-time( >25FPS) speed. In addition, we design a new loss function called refine loss which supervises the network with image level gradient information. Our model is able to produce finer segmentation results which has richer details than annotations.

Click to Read Paper
We present \emph{Deep Image Retargeting} (\emph{DeepIR}), a coarse-to-fine framework for content-aware image retargeting. Our framework first constructs the semantic structure of input image with a deep convolutional neural network. Then a uniform re-sampling that suits for semantic structure preserving is devised to resize feature maps to target aspect ratio at each feature layer. The final retargeting result is generated by coarse-to-fine nearest neighbor field search and step-by-step nearest neighbor field fusion. We empirically demonstrate the effectiveness of our model with both qualitative and quantitative results on widely used RetargetMe dataset.

* 8 pages, 10 figures
Click to Read Paper
With the rapid increase in online photo sharing activities, image obfuscation algorithms become particularly important for protecting the sensitive information in the shared photos. However, existing image obfuscation methods based on hand-crafted principles are challenged by the dramatic development of deep learning techniques. To address this problem, we propose to maximize the distribution discrepancy between the original image domain and the encrypted image domain. Accordingly, we introduce a collaborative training scheme: a discriminator $D$ is trained to discriminate the reconstructed image from the encrypted image, and an encryption model $G_e$ is required to generate these two kinds of images to maximize the recognition rate of $D$, leading to the same training objective for both $D$ and $G_e$. We theoretically prove that such a training scheme maximizes two distributions' discrepancy. Compared with commonly-used image obfuscation methods, our model can produce satisfactory defense against the attack of deep recognition models indicated by significant accuracy decreases on FaceScrub, Casia-WebFace and LFW datasets.

* 8 pages, 6 figures
Click to Read Paper
Restoring face images from distortions is important in face recognition applications and is challenged by multiple scale issues, which is still not well-solved in research area. In this paper, we present a Sequential Gating Ensemble Network (SGEN) for multi-scale face restoration issue. We first employ the principle of ensemble learning into SGEN architecture design to reinforce predictive performance of the network. The SGEN aggregates multi-level base-encoders and base-decoders into the network, which enables the network to contain multiple scales of receptive field. Instead of combining these base-en/decoders directly with non-sequential operations, the SGEN takes base-en/decoders from different levels as sequential data. Specifically, the SGEN learns to sequentially extract high level information from base-encoders in bottom-up manner and restore low level information from base-decoders in top-down manner. Besides, we propose to realize bottom-up and top-down information combination and selection with Sequential Gating Unit (SGU). The SGU sequentially takes two inputs from different levels and decides the output based on one active input. Experiment results demonstrate that our SGEN is more effective at multi-scale human face restoration with more image details and less noise than state-of-the-art image restoration models. By using adversarial training, SGEN also produces more visually preferred results than other models through subjective evaluation.

* 8 pages, 7 figures, Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)
Click to Read Paper
Hashing has proven a valuable tool for large-scale information retrieval. Despite much success, existing hashing methods optimize over simple objectives such as the reconstruction error or graph Laplacian related loss functions, instead of the performance evaluation criteria of interest---multivariate performance measures such as the AUC and NDCG. Here we present a general framework (termed StructHash) that allows one to directly optimize multivariate performance measures. The resulting optimization problem can involve exponentially or infinitely many variables and constraints, which is more challenging than standard structured output learning. To solve the StructHash optimization problem, we use a combination of column generation and cutting-plane techniques. We demonstrate the generality of StructHash by applying it to ranking prediction and image retrieval, and show that it outperforms a few state-of-the-art hashing methods.

* Appearing in Proc. European Conference on Computer Vision 2014
Click to Read Paper
We propose an efficient and unified framework, namely ThiNet, to simultaneously accelerate and compress CNN models in both training and inference stages. We focus on the filter level pruning, i.e., the whole filter would be discarded if it is less important. Our method does not change the original network structure, thus it can be perfectly supported by any off-the-shelf deep learning libraries. We formally establish filter pruning as an optimization problem, and reveal that we need to prune filters based on statistics information computed from its next layer, not the current layer, which differentiates ThiNet from existing methods. Experimental results demonstrate the effectiveness of this strategy, which has advanced the state-of-the-art. We also show the performance of ThiNet on ILSVRC-12 benchmark. ThiNet achieves 3.31$\times$ FLOPs reduction and 16.63$\times$ compression on VGG-16, with only 0.52$\%$ top-5 accuracy drop. Similar experiments with ResNet-50 reveal that even for a compact network, ThiNet can also reduce more than half of the parameters and FLOPs, at the cost of roughly 1$\%$ top-5 accuracy drop. Moreover, the original VGG-16 model can be further pruned into a very small model with only 5.05MB model size, preserving AlexNet level accuracy but showing much stronger generalization ability.

* To appear in ICCV 2017
Click to Read Paper
In computer vision, an entity such as an image or video is often represented as a set of instance vectors, which can be SIFT, motion, or deep learning feature vectors extracted from different parts of that entity. Thus, it is essential to design efficient and effective methods to compare two sets of instance vectors. Existing methods such as FV, VLAD or Super Vectors have achieved excellent results. However, this paper shows that these methods are designed based on a generative perspective, and a discriminative method can be more effective in categorizing images or videos. The proposed D3 (discriminative distribution distance) method effectively compares two sets as two distributions, and proposes a directional total variation distance (DTVD) to measure how separated are they. Furthermore, a robust classifier-based method is proposed to estimate DTVD robustly. The D3 method is evaluated in action and image recognition tasks and has achieved excellent accuracy and speed. D3 also has a synergy with FV. The combination of D3 and FV has advantages over D3, FV, and VLAD.

Click to Read Paper