Models, code, and papers for "Xiangyang Ji":

Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

Jun 14, 2019
Zihan Zhang, Xiangyang Ji

We present an algorithm based on the Optimism in the Face of Uncertainty (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SAHT})$for MDP with $S$ states and $A$ actions, in the case that an upper bound $H$ on the span of $h^{*}$, i.e., $sp(h^{*})$ is known. This result outperforms the best previous regret bounds $\tilde{O}(HS\sqrt{AT})$ [Bartlett and Tewari, 2009] by a factor of $\sqrt{SH}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SAHT})$ [Jaksch et al., 2010] up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of $\tilde{O}(\sqrt{SADT})$ for MDPs with finite diameter $D$ compared to the lower bound of $\Omega(\sqrt{SADT})$ [Jaksch et al., 2010].


  Click for Model/Code and Paper
Action Recognition with Joint Attention on Multi-Level Deep Features

Jul 09, 2016
Jialin Wu, Gu Wang, Wukui Yang, Xiangyang Ji

We propose a novel deep supervised neural network for the task of action recognition in videos, which implicitly takes advantage of visual tracking and shares the robustness of both deep Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). In our method, a multi-branch model is proposed to suppress noise from background jitters. Specifically, we firstly extract multi-level deep features from deep CNNs and feed them into 3d-convolutional network. After that we feed those feature cubes into our novel joint LSTM module to predict labels and to generate attention regularization. We evaluate our model on two challenging datasets: UCF101 and HMDB51. The results show that our model achieves the state-of-art by only using convolutional features.

* 13 pages, submitted to BMVC 

  Click for Model/Code and Paper
Sliding-Window Optimization on an Ambiguity-Clearness Graph for Multi-object Tracking

Nov 28, 2015
Qi Guo, Le Dan, Dong Yin, Xiangyang Ji

Multi-object tracking remains challenging due to frequent occurrence of occlusions and outliers. In order to handle this problem, we propose an Approximation-Shrink Scheme for sequential optimization. This scheme is realized by introducing an Ambiguity-Clearness Graph to avoid conflicts and maintain sequence independent, as well as a sliding window optimization framework to constrain the size of state space and guarantee convergence. Based on this window-wise framework, the states of targets are clustered in a self-organizing manner. Moreover, we show that the traditional online and batch tracking methods can be embraced by the window-wise framework. Experiments indicate that with only a small window, the optimization performance can be much better than online methods and approach to batch methods.


  Click for Model/Code and Paper
Fast and High Quality Highlight Removal from A Single Image

Dec 01, 2015
Dongsheng An, Jinli Suo, Xiangyang Ji, Haoqian Wang, Qionghai Dai

Specular reflection exists widely in photography and causes the recorded color deviating from its true value, so fast and high quality highlight removal from a single nature image is of great importance. In spite of the progress in the past decades in highlight removal, achieving wide applicability to the large diversity of nature scenes is quite challenging. To handle this problem, we propose an analytic solution to highlight removal based on an L2 chromaticity definition and corresponding dichromatic model. Specifically, this paper derives a normalized dichromatic model for the pixels with identical diffuse color: a unit circle equation of projection coefficients in two subspaces that are orthogonal to and parallel with the illumination, respectively. In the former illumination orthogonal subspace, which is specular-free, we can conduct robust clustering with an explicit criterion to determine the cluster number adaptively. In the latter illumination parallel subspace, a property called pure diffuse pixels distribution rule (PDDR) helps map each specular-influenced pixel to its diffuse component. In terms of efficiency, the proposed approach involves few complex calculation, and thus can remove highlight from high resolution images fast. Experiments show that this method is of superior performance in various challenging cases.

* 11 pages, 10 figures, submitted to IEEE TIP 

  Click for Model/Code and Paper
DeepIM: Deep Iterative Matching for 6D Pose Estimation

Mar 14, 2019
Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, Dieter Fox

Estimating the 6D pose of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the observed image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using an untangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.

* updated Tekin et al.'s results 

  Click for Model/Code and Paper
Dynamic Sampling Convolutional Neural Networks

Mar 22, 2018
Jialin Wu, Dai Li, Yu Yang, Chandrajit Bajaj, Xiangyang Ji

We present Dynamic Sampling Convolutional Neural Networks (DSCNN), where the position-specific kernels learn from not only the current position but also multiple sampled neighbour regions. During sampling, residual learning is introduced to ease training and an attention mechanism is applied to fuse features from different samples. And the kernels are further factorized to reduce parameters. The multiple sampling strategy enlarges the effective receptive fields significantly without requiring more parameters. While DSCNNs inherit the advantages of DFN, namely avoiding feature map blurring by position-specific kernels while keeping translation invariance, it also efficiently alleviates the overfitting issue caused by much more parameters than normal CNNs. Our model is efficient and can be trained end-to-end via standard back-propagation. We demonstrate the merits of our DSCNNs on both sparse and dense prediction tasks involving object detection and flow estimation. Our results show that DSCNNs enjoy stronger recognition abilities and achieve 81.7% in VOC2012 detection dataset. Also, DSCNNs obtain much sharper responses in flow estimation on FlyingChairs dataset compared to multiple FlowNet models' baselines.

* Submitted to ECCV 2018 

  Click for Model/Code and Paper
Fully Convolutional Instance-aware Semantic Segmentation

Apr 10, 2017
Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, Yichen Wei

We present the first fully convolutional end-to-end solution for instance-aware semantic segmentation task. It inherits all the merits of FCNs for semantic segmentation and instance mask proposal. It performs instance mask prediction and classification jointly. The underlying convolutional representation is fully shared between the two sub-tasks, as well as between all regions of interest. The proposed network is highly integrated and achieves state-of-the-art performance in both accuracy and efficiency. It wins the COCO 2016 segmentation competition by a large margin. Code would be released at \url{https://github.com/daijifeng001/TA-FCN}.


  Click for Model/Code and Paper
DLocRL: A Deep Learning Pipeline for Fine-Grained Location Recognition and Linking in Tweets

Jan 24, 2019
Canwen Xu, Jing Li, Xiangyang Luo, Jiaxin Pei, Chenliang Li, Donghong Ji

In recent years, with the prevalence of social media and smart devices, people causally reveal their locations such as shops, hotels, and restaurants in their tweets. Recognizing and linking such fine-grained location mentions to well-defined location profiles are beneficial for retrieval and recommendation systems. Prior studies heavily rely on hand-crafted linguistic features. Recently, deep learning has shown its effectiveness in feature representation learning for many NLP tasks. In this paper, we propose DLocRL, a new Deep pipeline for fine-grained Location Recognition and Linking in tweets. DLocRL leverages representation learning, semantic composition, attention and gate mechanisms to exploit semantic context features for location recognition and linking. Furthermore, a novel post-processing strategy, named Geographical Pair Linking, is developed to improve the linking performance. In this way, DLocRL does not require hand-crafted features. The experimental results show the effectiveness of DLocRL on fine-grained location recognition and linking with a real-world Twitter dataset.

* 7 pages, 4 figures, accepted by The Web Conf (WWW) 2019; minor footnote update 

  Click for Model/Code and Paper
C-MIL: Continuation Multiple Instance Learning for Weakly Supervised Object Detection

Apr 11, 2019
Fang Wan, Chang Liu, Wei Ke, Xiangyang Ji, Jianbin Jiao, Qixiang Ye

Weakly supervised object detection (WSOD) is a challenging task when provided with image category supervision but required to simultaneously learn object locations and object detectors. Many WSOD approaches adopt multiple instance learning (MIL) and have non-convex loss functions which are prone to get stuck into local minima (falsely localize object parts) while missing full object extent during training. In this paper, we introduce a continuation optimization method into MIL and thereby creating continuation multiple instance learning (C-MIL), with the intention of alleviating the non-convexity problem in a systematic way. We partition instances into spatially related and class related subsets, and approximate the original loss function with a series of smoothed loss functions defined within the subsets. Optimizing smoothed loss functions prevents the training procedure falling prematurely into local minima and facilitates the discovery of Stable Semantic Extremal Regions (SSERs) which indicate full object extent. On the PASCAL VOC 2007 and 2012 datasets, C-MIL improves the state-of-the-art of weakly supervised object detection and weakly supervised object localization with large margins.

* Accept by CVPR2019 

  Click for Model/Code and Paper
Bi-stream Pose Guided Region Ensemble Network for Fingertip Localization from Stereo Images

Feb 26, 2019
Guijin Wang, Cairong Zhang, Xinghao Chen, Xiangyang Ji, Jing-Hao Xue, Hang Wang

In human-computer interaction, it is important to accurately estimate the hand pose especially fingertips. However, traditional approaches for fingertip localization mainly rely on depth images and thus suffer considerably from the noise and missing values. Instead of depth images, stereo images can also provide 3D information of hands and promote 3D hand pose estimation. There are nevertheless limitations on the dataset size, global viewpoints, hand articulations and hand shapes in the publicly available stereo-based hand pose datasets. To mitigate these limitations and promote further research on hand pose estimation from stereo images, we propose a new large-scale binocular hand pose dataset called THU-Bi-Hand, offering a new perspective for fingertip localization. In the THU-Bi-Hand dataset, there are 447k pairs of stereo images of different hand shapes from 10 subjects with accurate 3D location annotations of the wrist and five fingertips. Captured with minimal restriction on the range of hand motion, the dataset covers large global viewpoint space and hand articulation space. To better present the performance of fingertip localization on THU-Bi-Hand, we propose a novel scheme termed Bi-stream Pose Guided Region Ensemble Network (Bi-Pose-REN). It extracts more representative feature regions around joint points in the feature maps under the guidance of the previously estimated pose. The feature regions are integrated hierarchically according to the topology of hand joints to regress the refined hand pose. Bi-Pose-REN and several existing methods are evaluated on THU-Bi-Hand so that benchmarks are provided for further research. Experimental results show that our new method has achieved the best performance on THU-Bi-Hand.

* Cairong Zhang and Xinghao Chen are equally contributed 

  Click for Model/Code and Paper
A Graphical Social Topology Model for Multi-Object Tracking

Sep 29, 2017
Shan Gao, Xiaogang Chen, Qixiang Ye, Junliang Xing, Arjan Kuijper, Xiangyang Ji

Tracking multiple objects is a challenging task when objects move in groups and occlude each other. Existing methods have investigated the problems of group division and group energy-minimization; however, lacking overall object-group topology modeling limits their ability in handling complex object and group dynamics. Inspired with the social affinity property of moving objects, we propose a Graphical Social Topology (GST) model, which estimates the group dynamics by jointly modeling the group structure and the states of objects using a topological representation. With such topology representation, moving objects are not only assigned to groups, but also dynamically connected with each other, which enables in-group individuals to be correctly associated and the cohesion of each group to be precisely modeled. Using well-designed topology learning modules and topology training, we infer the birth/death and merging/splitting of dynamic groups. With the GST model, the proposed multi-object tracker can naturally facilitate the occlusion problem by treating the occluded object and other in-group members as a whole while leveraging overall state transition. Experiments on both RGB and RGB-D datasets confirm that the proposed multi-object tracker improves the state-of-the-arts especially in crowded scenes.

* there is an input error in experiments, so we should change the results in all results tables 

  Click for Model/Code and Paper
Efficient Divide-And-Conquer Classification Based on Feature-Space Decomposition

Jan 29, 2015
Qi Guo, Bo-Wei Chen, Feng Jiang, Xiangyang Ji, Sun-Yuan Kung

This study presents a divide-and-conquer (DC) approach based on feature space decomposition for classification. When large-scale datasets are present, typical approaches usually employed truncated kernel methods on the feature space or DC approaches on the sample space. However, this did not guarantee separability between classes, owing to overfitting. To overcome such problems, this work proposes a novel DC approach on feature spaces consisting of three steps. Firstly, we divide the feature space into several subspaces using the decomposition method proposed in this paper. Subsequently, these feature subspaces are sent into individual local classifiers for training. Finally, the outcomes of local classifiers are fused together to generate the final classification results. Experiments on large-scale datasets are carried out for performance evaluation. The results show that the error rates of the proposed DC method decreased comparing with the state-of-the-art fast SVM solvers, e.g., reducing error rates by 10.53% and 7.53% on RCV1 and covtype datasets respectively.

* 5 pages 

  Click for Model/Code and Paper
PgNN: Physics-guided Neural Network for Fourier Ptychographic Microscopy

Sep 19, 2019
Yongbing Zhang, Yangzhe Liu, Xiu Li, Shaowei Jiang, Krishna Dixit, Xinfeng Zhang, Xiangyang Ji

Fourier ptychography (FP) is a newly developed computational imaging approach that achieves both high resolution and wide field of view by stitching a series of low-resolution images captured under angle-varied illumination. So far, many supervised data-driven models have been applied to solve inverse imaging problems. These models need massive amounts of data to train, and are limited by the dataset characteristics. In FP problems, generic datasets are always scarce, and the optical aberration varies greatly under different acquisition conditions. To address these dilemmas, we model the forward physical imaging process as an interpretable physics-guided neural network (PgNN), where the reconstructed image in the complex domain is considered as the learnable parameters of the neural network. Since the optimal parameters of the PgNN can be derived by minimizing the difference between the model-generated images and real captured angle-varied images corresponding to the same scene, the proposed PgNN can get rid of the problem of massive training data as in traditional supervised methods. Applying the alternate updating mechanism and the total variation regularization, PgNN can flexibly reconstruct images with improved performance. In addition, the Zernike mode is incorporated to compensate for optical aberrations to enhance the robustness of FP reconstructions. As a demonstration, we show our method can reconstruct images with smooth performance and detailed information in both simulated and experimental datasets. In particular, when validated in an extension of a high-defocus, high-exposure tissue section dataset, PgNN outperforms traditional FP methods with fewer artifacts and distinguishable structures.


  Click for Model/Code and Paper
Multifaceted Analysis of Fine-Tuning in Deep Model for Visual Recognition

Jul 11, 2019
Xiangyang Li, Luis Herranz, Shuqiang Jiang

In recent years, convolutional neural networks (CNNs) have achieved impressive performance for various visual recognition scenarios. CNNs trained on large labeled datasets can not only obtain significant performance on most challenging benchmarks but also provide powerful representations, which can be used to a wide range of other tasks. However, the requirement of massive amounts of data to train deep neural networks is a major drawback of these models, as the data available is usually limited or imbalanced. Fine-tuning (FT) is an effective way to transfer knowledge learned in a source dataset to a target task. In this paper, we introduce and systematically investigate several factors that influence the performance of fine-tuning for visual recognition. These factors include parameters for the retraining procedure (e.g., the initial learning rate of fine-tuning), the distribution of the source and target data (e.g., the number of categories in the source dataset, the distance between the source and target datasets) and so on. We quantitatively and qualitatively analyze these factors, evaluate their influence, and present many empirical observations. The results reveal insights into what fine-tuning changes CNN parameters and provide useful and evidence-backed intuitions about how to implement fine-tuning for computer vision tasks.

* Accepted by ACM Transactions on Data Science 

  Click for Model/Code and Paper
Spatial Mixture Models with Learnable Deep Priors for Perceptual Grouping

Feb 07, 2019
Jinyang Yuan, Bin Li, Xiangyang Xue

Humans perceive the seemingly chaotic world in a structured and compositional way with the prerequisite of being able to segregate conceptual entities from the complex visual scenes. The mechanism of grouping basic visual elements of scenes into conceptual entities is termed as perceptual grouping. In this work, we propose a new type of spatial mixture models with learnable priors for perceptual grouping. Different from existing methods, the proposed method disentangles the representation of an object into `shape' and `appearance' which are modeled separately by the mixture weights and the conditional probability distributions. More specifically, each object in the visual scene is modeled by one mixture component, whose mixture weights and the parameter of the conditional probability distribution are generated by two neural networks, respectively. The mixture weights focus on modeling spatial dependencies (i.e., shape) and the conditional probability distributions deal with intra-object variations (i.e., appearance). In addition, the background is separately modeled as a special component complementary to the foreground objects. Our extensive empirical tests on two perceptual grouping datasets demonstrate that the proposed method outperforms the state-of-the-art methods under most experimental configurations. The learned conceptual entities are generalizable to novel visual scenes and insensitive to the diversity of objects.

* AAAI 2019 

  Click for Model/Code and Paper
Scene recognition with CNNs: objects, scales and dataset bias

Jan 21, 2018
Luis Herranz, Shuqiang Jiang, Xiangyang Li

Since scenes are composed in part of objects, accurate recognition of scenes requires knowledge about both scenes and objects. In this paper we address two related problems: 1) scale induced dataset bias in multi-scale convolutional neural network (CNN) architectures, and 2) how to combine effectively scene-centric and object-centric knowledge (i.e. Places and ImageNet) in CNNs. An earlier attempt, Hybrid-CNN, showed that incorporating ImageNet did not help much. Here we propose an alternative method taking the scale into account, resulting in significant recognition gains. By analyzing the response of ImageNet-CNNs and Places-CNNs at different scales we find that both operate in different scale ranges, so using the same network for all the scales induces dataset bias resulting in limited performance. Thus, adapting the feature extractor to each particular scale (i.e. scale-specific CNNs) is crucial to improve recognition, since the objects in the scenes have their specific range of scales. Experimental results show that the recognition accuracy highly depends on the scale, and that simple yet carefully chosen multi-scale combinations of ImageNet-CNNs and Places-CNNs, can push the state-of-the-art recognition accuracy in SUN397 up to 66.26% (and even 70.17% with deeper architectures, comparable to human performance).

* L. Herranz, S. Jiang, X. Li, "Scene recognition with CNNs: objects, scales and dataset bias", Proc. International Conference on Computer Vision and Pattern Recognition (CVPR16), Las Vegas, Nevada, USA, June 2016 
* CVPR 2016 

  Click for Model/Code and Paper
Power-Law Graph Cuts

Nov 25, 2014
Xiangyang Zhou, Jiaxin Zhang, Brian Kulis

Algorithms based on spectral graph cut objectives such as normalized cuts, ratio cuts and ratio association have become popular in recent years because they are widely applicable and simple to implement via standard eigenvector computations. Despite strong performance for a number of clustering tasks, spectral graph cut algorithms still suffer from several limitations: first, they require the number of clusters to be known in advance, but this information is often unknown a priori; second, they tend to produce clusters with uniform sizes. In some cases, the true clusters exhibit a known size distribution; in image segmentation, for instance, human-segmented images tend to yield segment sizes that follow a power-law distribution. In this paper, we propose a general framework of power-law graph cut algorithms that produce clusters whose sizes are power-law distributed, and also does not fix the number of clusters upfront. To achieve our goals, we treat the Pitman-Yor exchangeable partition probability function (EPPF) as a regularizer to graph cut objectives. Because the resulting objectives cannot be solved by relaxing via eigenvectors, we derive a simple iterative algorithm to locally optimize the objectives. Moreover, we show that our proposed algorithm can be viewed as performing MAP inference on a particular Pitman-Yor mixture model. Our experiments on various data sets show the effectiveness of our algorithms.


  Click for Model/Code and Paper
Learning to Point and Count

Dec 08, 2015
Jie Shao, Dequan Wang, Xiangyang Xue, Zheng Zhang

This paper proposes the problem of point-and-count as a test case to break the what-and-where deadlock. Different from the traditional detection problem, the goal is to discover key salient points as a way to localize and count the number of objects simultaneously. We propose two alternatives, one that counts first and then point, and another that works the other way around. Fundamentally, they pivot around whether we solve "what" or "where" first. We evaluate their performance on dataset that contains multiple instances of the same class, demonstrating the potentials and their synergies. The experiences derive a few important insights that explains why this is a much harder problem than classification, including strong data bias and the inability to deal with object scales robustly in state-of-art convolutional neural networks.


  Click for Model/Code and Paper