The deep Convolutional Neural Network (CNN) is the state-of-the-art solution for large-scale visual recognition. Following basic principles such as increasing depth and constructing highway connections, researchers have manually designed many fixed network structures and verified their effectiveness. In this paper, we discuss the possibility of learning deep network structures automatically. Since the number of possible network structures increases exponentially with the number of layers, we adopt a genetic algorithm to traverse this large search space efficiently. We first propose an encoding method to represent each network structure as a fixed-length binary string, and initialize the genetic algorithm by generating a set of randomized individuals. In each generation, we apply standard genetic operations, e.g., selection, mutation and crossover, to eliminate weak individuals and generate more competitive ones. The competitiveness of each individual is defined as its recognition accuracy, obtained by training the network from scratch and evaluating it on a validation set. We run the genetic process on two small datasets, i.e., MNIST and CIFAR10, demonstrating its ability to evolve and find high-quality structures that have received little attention before. These structures are also transferable to the large-scale ILSVRC2012 dataset.
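
A minimal sketch (in Python, with made-up names and a toy fitness function) of the genetic operations described above, acting on fixed-length binary strings; in the paper, the fitness of an individual is the validation accuracy of the network it encodes, trained from scratch.

```python
import random

# Sketch of selection, one-point crossover, and bit-flip mutation over
# fixed-length binary strings; the fitness function is a placeholder.
def evolve(population, fitness, generations=10, p_mut=0.05):
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        survivors = scored[: len(scored) // 2]              # selection
        children = []
        while len(survivors) + len(children) < len(population):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(a))               # one-point crossover
            child = a[:cut] + b[cut:]
            child = ''.join(bit if random.random() > p_mut  # bit-flip mutation
                            else str(1 - int(bit)) for bit in child)
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

# Toy usage: maximize the number of 1s in a 16-bit string.
pop = [''.join(random.choice('01') for _ in range(16)) for _ in range(20)]
print(evolve(pop, fitness=lambda s: s.count('1')))
```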

* Submitted to CVPR 2017 (10 pages, 5 figures)
Many objects, especially those made by humans, are symmetric, e.g. cars and aeroplanes. This paper addresses the estimation of 3D structures of symmetric objects from multiple images of the same object category, e.g. different cars, seen from various viewpoints. We assume that the deformation between different instances from the same object category is non-rigid and symmetric. In this paper, we extend two leading non-rigid structure from motion (SfM) algorithms to exploit symmetry constraints. We formulate both methods as energy minimization, in which we also recover the missing observations caused by occlusions. In particular, we show that by rotating the coordinate system, the energy can be decoupled into two independent terms that still exploit symmetry, so that matrix factorization can be applied to each of them separately for initialization. Results on the Pascal3D+ dataset show that our methods significantly improve performance over baseline methods.
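
A heavily simplified sketch of the decoupled initialization idea: after aligning coordinates with the symmetry plane, the measurements of symmetric keypoint pairs can be split into a "sum" and a "difference" matrix, each factorized independently with a rank-3 SVD. The array names and shapes below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

# W_left / W_right: hypothetical (2F, P) matrices of 2D tracks for the left
# and right members of P symmetric keypoint pairs over F views.
def decoupled_init(W_left, W_right, rank=3):
    W_sum = 0.5 * (W_left + W_right)   # symmetric component
    W_dif = 0.5 * (W_left - W_right)   # antisymmetric component
    factors = []
    for W in (W_sum, W_dif):
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        M = U[:, :rank] * np.sqrt(s[:rank])            # camera-like factor (2F x 3)
        S = np.sqrt(s[:rank])[:, None] * Vt[:rank]     # structure-like factor (3 x P)
        factors.append((M, S))
    return factors

W_left = np.random.randn(6, 5)    # 3 views (2F = 6 rows), 5 keypoint pairs
W_right = np.random.randn(6, 5)
(M_s, S_s), (M_a, S_a) = decoupled_init(W_left, W_right)
```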

* Accepted to ECCV 2016
Computer graphics not only generates synthetic images and ground truth but also offers the possibility of constructing virtual worlds in which: (i) an agent can perceive, navigate, and take actions guided by AI algorithms, (ii) properties of the worlds can be modified (e.g., material and reflectance), (iii) physical simulations can be performed, and (iv) algorithms can be learnt and evaluated. But creating realistic virtual worlds is not easy. The game industry, however, has spent a lot of effort creating 3D worlds which a player can interact with. Researchers can therefore build on these resources to create virtual worlds, provided we can access and modify the internal data structures of the games. To enable this, we created an open-source plugin, UnrealCV (http://unrealcv.github.io), for the popular game engine Unreal Engine 4 (UE4). We show two applications: (i) a proof-of-concept image dataset, and (ii) linking Caffe with the virtual world to test deep network algorithms.
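
A minimal usage sketch of the UnrealCV Python client; the command strings follow the public UnrealCV documentation, but the exact set of commands depends on the plugin version, and a UE4 binary with the plugin must be running for the script to connect.

```python
from unrealcv import client

client.connect()
if client.isconnected():
    # Capture a rendered image and the ground-truth object mask from camera 0.
    client.request('vget /camera/0/lit lit.png')
    client.request('vget /camera/0/object_mask mask.png')
    # Move the camera to a new location (x y z in the UE4 world frame).
    client.request('vset /camera/0/location 0 0 200')
    client.disconnect()
```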

Recovering the occlusion relationships between objects is a fundamental human visual ability which yields important information about the 3D world. In this paper we propose a deep network architecture, called DOC, which acts on a single image, detects object boundaries and estimates the border ownership (i.e. which side of the boundary is foreground and which is background). We represent occlusion relations by a binary edge map, to indicate the object boundary, and an occlusion orientation variable which is tangential to the boundary and whose direction specifies border ownership by a left-hand rule. We train two related deep convolutional neural networks, called DOC, which exploit local and non-local image cues to estimate this representation and hence recover occlusion relations. In order to train and test DOC we construct a large-scale instance occlusion boundary dataset using PASCAL VOC images, which we call the PASCAL instance occlusion dataset (PIOD). This contains 10,000 images and hence is two orders of magnitude larger than existing occlusion datasets for outdoor images. We test two variants of DOC on PIOD and on the BSDS occlusion dataset and show they outperform state-of-the-art methods. Finally, we perform numerous experiments investigating multiple settings of DOC and transfer between BSDS and PIOD, which provides more insights for further study of occlusion estimation.
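
A tiny sketch of the representation described above: at a boundary pixel the orientation variable is tangent to the edge, and rotating that tangent by 90 degrees gives the normal pointing toward the foreground side (assuming the convention that foreground lies to the left of the orientation).

```python
import numpy as np

def foreground_normal(theta):
    # theta: occlusion orientation (radians), tangent to the boundary.
    tangent = np.array([np.cos(theta), np.sin(theta)])
    left_normal = np.array([-tangent[1], tangent[0]])   # rotate tangent by +90 deg
    return left_normal        # unit vector pointing toward the foreground side

print(foreground_normal(0.0))  # edge along +x -> foreground toward +y
```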

* Accepted to ECCV 2016
This paper presents an approach to parsing humans when there is significant occlusion. We model humans using a graphical model which has a tree structure, building on recent work [32, 6], and exploit the connectivity prior that, even in the presence of occlusion, the visible nodes form a connected subtree of the graphical model. We call each connected subtree a flexible composition of object parts. This involves a novel method for learning occlusion cues. During inference we need to search over a mixture of different flexible models. By exploiting part sharing, we show that this inference can be done extremely efficiently, requiring only twice as many computations as searching for the entire object (i.e., not modeling occlusion). We evaluate our model on the standard benchmark "We Are Family" Stickmen dataset and obtain significant performance improvements over the best alternative algorithms.
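
A small illustration of the connectivity prior: given a tree-structured part model (encoded here as hypothetical child-to-parent edges), the visible parts are a valid flexible composition only if they form a single connected subtree.

```python
# Check whether a set of visible parts forms a connected subtree of the model.
def is_connected_subtree(visible, parent):
    visible = set(visible)
    roots = 0
    for v in visible:
        p = parent.get(v)              # parent of v, or None for the root
        if p is None or p not in visible:
            roots += 1                 # v starts a new visible component
    return roots == 1                  # exactly one component => connected subtree

parent = {1: 0, 2: 0, 3: 1, 4: 1}      # toy 5-part tree rooted at 0
print(is_connected_subtree({0, 1, 3}, parent))   # True
print(is_connected_subtree({2, 3}, parent))      # False (disconnected)
```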

* CVPR 15 Camera Ready
In this paper, we study the problem of semantic part segmentation for animals. This is more challenging than standard object detection, object segmentation and pose estimation tasks because semantic parts of animals often have similar appearance and highly varying shapes. To tackle these challenges, we build a mixture of compositional models to represent the object boundary and the boundaries of semantic parts, and we incorporate edge, appearance, and semantic part cues into the compositional model. Given part-level segmentation annotation, we develop a novel algorithm to learn a mixture of compositional models under various poses and viewpoints for certain animal classes. Furthermore, a linear-complexity algorithm based on dynamic programming is offered for efficient inference of the compositional model. We evaluate our method on horses and cows using a newly annotated dataset on Pascal VOC 2010 with pixelwise part labels. Experimental results demonstrate the effectiveness of our method.

We present a method for estimating articulated human pose from a single static image based on a graphical model with novel pairwise relations that make adaptive use of local image measurements. More precisely, we specify a graphical model for human pose which exploits the fact that local image measurements can be used both to detect parts (or joints) and to predict the spatial relationships between them (Image Dependent Pairwise Relations). These spatial relationships are represented by a mixture model. We use Deep Convolutional Neural Networks (DCNNs) to learn conditional probabilities for the presence of parts and their spatial relationships within image patches. Hence our model combines the representational flexibility of graphical models with the efficiency and statistical power of DCNNs. Our method significantly outperforms state-of-the-art methods on the LSP and FLIC datasets and also performs very well on the Buffy dataset without any training.
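
For intuition, a generic max-sum dynamic program over a tree-structured model, the kind of inference such graphical models rely on; the unary and pairwise score tables below are toy placeholders standing in for the DCNN part detections and image-dependent pairwise terms.

```python
import numpy as np

# children: {node: [child, ...]}; unary[i]: (K,); pairwise[(i, j)]: (K, K),
# where each node takes one of K discrete states (e.g. candidate locations).
def tree_max_sum(children, unary, pairwise, root=0):
    def up(i):
        msg = unary[i].copy()
        for j in children.get(i, []):
            msg += np.max(pairwise[(i, j)] + up(j)[None, :], axis=1)
        return msg
    return np.max(up(root))   # best total score; argmax bookkeeping omitted

children = {0: [1, 2]}
unary = {k: np.array([0.1, 0.9]) for k in range(3)}
pairwise = {(0, 1): np.eye(2), (0, 2): np.eye(2)}
print(tree_max_sum(children, unary, pairwise))   # 0.9 + 2 * (1.0 + 0.9) = 4.7
```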

* NIPS 2014 Camera Ready
We study linear models under heavy-tailed priors from a probabilistic viewpoint. Instead of computing a single sparse most probable (MAP) solution as in standard deterministic approaches, the focus in the Bayesian compressed sensing framework shifts towards capturing the full posterior distribution on the latent variables, which allows quantifying the estimation uncertainty and learning model parameters using maximum likelihood. The exact posterior distribution under the sparse linear model is intractable and we concentrate on variational Bayesian techniques to approximate it. Repeatedly computing Gaussian variances turns out to be a key requisite and constitutes the main computational bottleneck in applying variational techniques to large-scale problems. We leverage the recently proposed Perturb-and-MAP algorithm for drawing exact samples from Gaussian Markov random fields (GMRF). The main technical contribution of our paper is to show that estimating Gaussian variances using a relatively small number of such efficiently drawn random samples is much more effective than alternative general-purpose variance estimation techniques. By reducing the problem of variance estimation to standard optimization primitives, the resulting variational algorithms are fully scalable and parallelizable, allowing Bayesian computations in extremely large-scale problems with the same memory and time complexity requirements as conventional point estimation techniques. We illustrate these ideas with experiments in image deblurring.
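
A dense-matrix sketch of the variance-estimation idea (the paper relies on efficient Perturb-and-MAP sampling of large GMRFs; here a small Cholesky factor stands in for that sampler for clarity): draw samples whose covariance is the inverse of the precision matrix and read off marginal variances from the sample spread, instead of inverting the matrix.

```python
import numpy as np

def sample_variances(A, b, num_samples=200, rng=np.random.default_rng(0)):
    # A: precision matrix, b: potential vector of a Gaussian model.
    L = np.linalg.cholesky(A)                   # A = L L^T
    d = A.shape[0]
    samples = []
    for _ in range(num_samples):
        e = L @ rng.standard_normal(d)          # e ~ N(0, A)
        # x = A^{-1}(b + e) has mean A^{-1} b and covariance A^{-1}
        samples.append(np.linalg.solve(A, b + e))
    return np.var(np.stack(samples), axis=0)    # Monte Carlo marginal variances

A = np.array([[2.0, 0.5], [0.5, 1.0]])          # toy precision matrix
print(sample_variances(A, np.zeros(2)))         # approximates diag(inv(A))
```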

* Proc. IEEE Workshop on Information Theory in Computer Vision and Pattern Recognition (in conjunction with ICCV-11), pp. 1332-1339, Barcelona, Spain, Nov. 2011 (8 pages, 3 figures)
This is an opinion paper about the strengths and weaknesses of Deep Nets. They are at the center of recent progress on Artificial Intelligence and are of growing importance in Cognitive Science and Neuroscience since they enable the development of computational models that can deal with a large range of visually realistic stimuli and visual tasks. They have clear limitations but they also have enormous successes. There is also gradual, though incomplete, understanding of their inner workings. It seems unlikely that Deep Nets in their current form will be the best long-term solution either for building general purpose intelligent machines or for understanding the mind/brain, but it is likely that many aspects of them will remain. At present Deep Nets do very well on specific types of visual tasks and on specific benchmarked datasets. But Deep Nets are much less general purpose, flexible, and adaptive than the human visual system. Moreover, methods like Deep Nets may run into fundamental difficulties when faced with the enormous complexity of natural images. To illustrate our main points while keeping the reference list short, this paper is slightly biased towards work from our own group.

* MIT CBMM Memo No. 088
We propose a method to generate multiple diverse and valid human pose hypotheses in 3D, all consistent with the 2D detection of joints in a monocular RGB image. We use a novel generative model that is uniform (unbiased) over the space of anatomically plausible 3D poses. Our model is compositional (it produces a pose by combining parts), and since it is restricted only by anatomical constraints, it can generalize to every plausible human 3D pose. Removing the model bias intrinsically helps to generate more diverse 3D pose hypotheses. We argue that, given the depth ambiguity and the uncertainty due to occlusion and imperfect 2D joint detection, generating multiple pose hypotheses is more reasonable than generating only a single 3D pose based on the 2D joint detection. We hope that the idea of generating multiple consistent pose hypotheses can give rise to a new line of future work that has not received much attention in the literature. We use the Human3.6M dataset for empirical evaluation.
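
As a toy illustration of the depth ambiguity that motivates multiple hypotheses (not the paper's compositional sampler): under scaled orthographic projection, a limb of known length whose 2D projection is shorter admits exactly two relative depths for its endpoint.

```python
import numpy as np

def limb_depth_hypotheses(parent_2d, child_2d, limb_length, scale=1.0):
    # 2D limb length in the same (scaled) units as the 3D limb length.
    l = np.linalg.norm((np.asarray(child_2d) - np.asarray(parent_2d)) / scale)
    if l > limb_length:            # inconsistent with the assumed limb length
        return [0.0]
    dz = np.sqrt(limb_length**2 - l**2)
    return [+dz, -dz]              # two valid relative depths per limb

print(limb_depth_hypotheses((0, 0), (0.3, 0.0), 0.5))   # ~[0.4, -0.4]
```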

* accepted to ICCV 2017 (PeopleCap)
Many man-made objects have intrinsic symmetries and Manhattan structure. Assuming an orthographic projection model, this paper addresses the estimation of 3D structure and camera projection using symmetry and/or Manhattan structure cues, when the input is a single image or multiple images of objects from the same category, e.g., multiple different cars. Specifically, our analysis of the single-image case shows that the Manhattan structure alone is sufficient to recover the camera projection, after which the 3D structure can be reconstructed uniquely by exploiting symmetry. However, Manhattan structure can be difficult to observe from a single image due to occlusion. We therefore extend to the multiple-image case, which can also exploit symmetry but does not require Manhattan axes. We propose a novel rigid structure from motion method, exploiting symmetry and using multiple images from the same category as input. Experimental results on the Pascal3D+ dataset show that our method significantly outperforms baseline methods.

* Accepted to CVPR 2017
This paper describes serial and parallel compositional models of multiple objects with part sharing. Objects are built by part-subpart compositions and expressed in terms of a hierarchical dictionary of object parts. These parts are represented on lattices of decreasing sizes which yield an executive summary description. We describe inference and learning algorithms for these models. We analyze the complexity of this model in terms of computation time (for serial computers) and numbers of nodes (e.g., "neurons") for parallel computers. In particular, we compute the complexity gains by part sharing and its dependence on how the dictionary scales with the level of the hierarchy. We explore three regimes of scaling behavior where the dictionary size (i) increases exponentially with the level, (ii) is determined by an unsupervised compositional learning algorithm applied to real data, (iii) decreases exponentially with scale. This analysis shows that in some regimes the use of shared parts enables algorithms which can perform inference in time linear in the number of levels for an exponential number of objects. In other regimes part sharing has little advantage for serial computers but can give linear processing on parallel computers.
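
A back-of-envelope count, with made-up numbers, of the kind of complexity gain analyzed in the paper: without sharing, every object re-evaluates its own part hierarchy, whereas with a shared dictionary each part is evaluated once and reused by all objects that contain it.

```python
# All numbers below are assumptions chosen for illustration only.
levels = 4
num_objects = 1000
branching = 4                              # assumed sub-parts per part
dictionary_sizes = [8, 32, 128, 512]       # assumed shared dictionary per level

without_sharing = num_objects * sum(branching**l for l in range(1, levels + 1))
with_sharing = sum(dictionary_sizes)

print(without_sharing, with_sharing)       # 340000 vs. 680 part evaluations
```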

* ICLR 2013
In this paper, we propose a fully convolutional network for 3D human pose estimation from monocular images. We use limb orientations as a new way to represent 3D poses and bind the orientation together with the bounding box of each limb region to better associate images and predictions. The 3D orientations are modeled jointly with 2D keypoint detections. Without additional constraints, this simple method can achieve good results on several large-scale benchmarks. Further experiments show that our method can generalize well to novel scenes and is robust to inaccurate bounding boxes.
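
A minimal sketch of the limb-orientation representation: given a root joint, a kinematic tree, unit limb orientations and limb lengths, the 3D joint positions follow by accumulating parent position plus length times orientation. The toy skeleton below is illustrative, not the paper's exact joint set.

```python
import numpy as np

def joints_from_orientations(root, parents, orientations, lengths):
    # parents: {child: parent}; assumes parents are processed before children.
    joints = {0: np.asarray(root, dtype=float)}
    for child in sorted(parents):
        u = np.asarray(orientations[child], dtype=float)
        joints[child] = joints[parents[child]] + lengths[child] * u / np.linalg.norm(u)
    return joints

parents = {1: 0, 2: 1}                     # tiny 3-joint chain
orients = {1: (0, 0, 1), 2: (1, 0, 0)}     # unit limb orientations
lengths = {1: 0.5, 2: 0.3}                 # limb lengths
print(joints_from_orientations((0, 0, 0), parents, orients, lengths))
```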

* BMVC 2018 - Proceedings of the British Machine Vision Conference 2018. Code available at https://github.com/chenxuluo/OriNet-demo
This paper addresses the problem of semantic part parsing (segmentation) of cars, i.e., assigning every pixel within the car to one of the parts (e.g., body, window, lights, license plates and wheels). We formulate this as a landmark identification problem, where a set of landmarks specifies the boundaries of the parts. A novel mixture of graphical models is proposed, which dynamically couples the landmarks to a hierarchy of segments. When modeling pairwise relations between landmarks, this coupling enables our model to exploit the local image contents in addition to spatial deformation, an aspect that most existing graphical models ignore. In particular, our model enforces appearance consistency between segments within the same part. Parsing the car, including finding the optimal coupling between landmarks and segments in the hierarchy, is performed by dynamic programming. We evaluate our method on a subset of PASCAL VOC 2010 car images and on the car subset of the 3D Object Category dataset (CAR3D). We show good results and, in particular, quantify the effectiveness of using the segment appearance consistency in terms of accuracy of part localization and segmentation.

* 12 pages, CBMM memo
Human-labeled datasets, along with their corresponding evaluation algorithms, play an important role in boundary detection. Here we present a psychophysical experiment that addresses the reliability of such benchmarks. To better evaluate the performance of boundary detection algorithms, we propose a computational framework that removes inappropriate human labels and estimates the intrinsic properties of boundaries.

* NIPS 2012 Workshop on Human Computation for Science and Computational Sustainability
Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process. To address these issues and complement similar efforts in visual question answering, we build CLEVR-Ref+, a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators. In addition to evaluating several state-of-the-art models on CLEVR-Ref+, we also propose IEP-Ref, a module network approach that significantly outperforms other models on our dataset. In particular, we present two interesting and important findings using IEP-Ref: (1) the module trained to transform feature maps into segmentation masks can be attached to any intermediate module to reveal the entire reasoning process step-by-step; (2) even though every training example refers to at least one object, IEP-Ref correctly predicts no-foreground when presented with false-premise referring expressions. To the best of our knowledge, this is the first direct and quantitative proof that neural modules behave as intended.
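
A hypothetical example, in the spirit of the abstract, of a referring expression paired with a modular functional program and a toy interpreter; the actual CLEVR-Ref+ program schema and module vocabulary may differ.

```python
# Toy scene and a program for the expression "the red sphere".
scene = [
    {"id": 0, "color": "red",  "shape": "cube"},
    {"id": 1, "color": "blue", "shape": "sphere"},
    {"id": 2, "color": "red",  "shape": "sphere"},
]
program = [
    {"function": "filter_color", "value": "red"},
    {"function": "filter_shape", "value": "sphere"},
]

def run(program, objects):
    # Apply each filter module in sequence to narrow down the referred set.
    for step in program:
        key = step["function"].split("_")[1]     # "color" or "shape"
        objects = [o for o in objects if o[key] == step["value"]]
    return objects

print(run(program, scene))   # [{'id': 2, 'color': 'red', 'shape': 'sphere'}]
```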

* All data and code concerning CLEVR-Ref+ and IEP-Ref have been released at https://cs.jhu.edu/~cxliu/2019/clevr-ref+
While recent deep neural networks have achieved promising performance on object recognition, they rely implicitly on the visual contents of the whole image. In this paper, we train deep neural networks on the foreground (object) and background (context) regions of images respectively. Compared with human recognition in the same situations, networks trained on the pure background without objects achieve highly reasonable recognition performance that beats humans by a large margin when only context is given. However, humans still outperform networks when only the pure object is available, which indicates that networks and human beings have different mechanisms for understanding an image. Furthermore, we straightforwardly combine multiple trained networks to explore the different visual cues learned by different networks. Experiments show that useful visual hints can be explicitly learned separately and then combined to achieve higher performance, which verifies the advantages of the proposed framework.
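
A minimal sketch of the combination step: fuse the class probabilities of a foreground-trained and a background-trained network by weighted averaging. The inputs below are hypothetical softmax outputs.

```python
import numpy as np

def combine_predictions(fg_probs, bg_probs, w_fg=0.5):
    # Weighted average of the two networks' class probabilities.
    fused = w_fg * np.asarray(fg_probs) + (1.0 - w_fg) * np.asarray(bg_probs)
    return int(np.argmax(fused))

print(combine_predictions([0.7, 0.2, 0.1], [0.2, 0.5, 0.3]))  # predicts class 0
```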

* To Appear in IJCAI 2017
This paper addresses the problem of face recognition when there are only a few, or even a single, labeled examples of the face that we wish to recognize. Moreover, these examples are typically corrupted by nuisance variables, both linear (i.e., additive nuisance variables such as bad lighting or wearing of glasses) and non-linear (i.e., non-additive pixel-wise nuisance variables such as expression changes). The small number of labeled examples means that it is hard to remove these nuisance variables between the training and testing faces to obtain good recognition performance. To address the problem we propose a method called Semi-Supervised Sparse Representation based Classification (S$^3$RC). This is based on recent work on sparsity where faces are represented in terms of two dictionaries: a gallery dictionary consisting of one or more examples of each person, and a variation dictionary representing linear nuisance variables (e.g., different lighting conditions, different glasses). The main idea is that (i) we use the variation dictionary to characterize the linear nuisance variables via the sparsity framework, and then (ii) prototype face images are estimated as a gallery dictionary via a Gaussian Mixture Model (GMM), using mixed labeled and unlabeled samples in a semi-supervised manner, to deal with the non-linear nuisance variations between labeled and unlabeled samples. We have performed experiments with insufficient labeled samples, including the case of only a single labeled sample per person. Our results on the AR, Multi-PIE, CAS-PEAL, and LFW databases demonstrate that the proposed method delivers significantly improved performance over existing methods.
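
A sketch of the sparse-representation classification step that S$^3$RC builds on (not the full pipeline with GMM-based prototype estimation): a test face is coded over a gallery dictionary plus a variation dictionary, and assigned to the class with the smallest class-wise reconstruction residual. All arrays and the regularization value are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(y, D_g, gallery_labels, D_v, alpha=0.01):
    # y: test face (d,); D_g: gallery dictionary (d, n_g), one column per
    # labeled face; D_v: variation dictionary (d, n_v) of linear nuisances.
    D = np.hstack([D_g, D_v])
    coef = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(D, y).coef_
    a, b = coef[:D_g.shape[1]], coef[D_g.shape[1]:]
    residuals = {}
    for c in set(gallery_labels):
        mask = np.array([l == c for l in gallery_labels])
        recon = D_g[:, mask] @ a[mask] + D_v @ b    # class c plus shared variation
        residuals[c] = np.linalg.norm(y - recon)
    return min(residuals, key=residuals.get)
```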

* To appear in IEEE Transactions on Image Processing, 2017
In this paper, we propose a deep part-based model (DeePM) for symbiotic object detection and semantic part localization. For this purpose, we annotate semantic parts for all 20 object categories on the PASCAL VOC 2012 dataset, which provides information on object pose, occlusion, viewpoint and functionality. DeePM is a latent graphical model based on the state-of-the-art R-CNN framework, which learns an explicit representation of the object-part configuration with flexible type sharing (e.g., a side-view horse head can be shared by a fully visible side-view horse and a highly truncated side-view horse with only the head and neck visible). For comparison, we also present an end-to-end Object-Part (OP) R-CNN which learns an implicit feature representation for jointly mapping an image ROI to the object and part bounding boxes. We evaluate the proposed methods for both object and part detection performance on PASCAL VOC 2012, and show that DeePM consistently outperforms OP R-CNN in detecting objects and parts. In addition, it obtains superior performance to Fast and Faster R-CNNs in object detection.

* The final revision for ICLR 2016, in which some color errors in the figures are fixed
In this paper, we address the boundary detection task, motivated by ambiguities in the current definition of edge detection. To this end, we generate a large database consisting of more than 10k images (20x bigger than existing edge detection databases) along with ground truth boundaries between 459 semantic classes, including both foreground objects and different types of background, and call it the PASCAL Boundaries dataset, which will be released to the community. In addition, we propose a novel deep network-based multi-scale semantic boundary detector, named the Multi-scale Deep Semantic Boundary Detector (M-DSBD). We provide baselines using models that were trained on edge detection and show that they transfer reasonably to the task of boundary detection. Finally, we point to various important research problems that this dataset can be used for.
