Research papers and code for "Yijun Li":
Egocentric interaction recognition aims to recognize the camera wearer's interactions with an interactor who faces the camera wearer in egocentric videos. In such a human-human interaction analysis problem, it is crucial to explore the relations between the camera wearer and the interactor. However, most existing works model the interaction as a whole and lack explicit modeling of the relations between the two interacting persons. To exploit these strong relations for egocentric interaction recognition, we introduce a dual relation modeling framework which learns to model the relations between the camera wearer and the interactor based on the individual action representations of the two persons. Specifically, we develop a novel interactive LSTM module, the key component of our framework, to explicitly model the relations between the two interacting persons based on their individual action representations, which are collaboratively learned with an interactor attention module and a global-local motion module. Experimental results on three egocentric interaction datasets show the effectiveness of our method and its advantage over state-of-the-art approaches.
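
As a rough illustration of the idea (not the authors' exact design; the cross-feeding of hidden states, layer sizes, and names below are assumptions), an interactive LSTM can be sketched in PyTorch as two LSTM cells that exchange hidden states at every time step:

    import torch
    import torch.nn as nn

    class InteractiveLSTM(nn.Module):
        """Sketch: two LSTM cells whose hidden states are cross-fed at each step
        to relate the camera wearer's and the interactor's action representations."""
        def __init__(self, feat_dim, hidden_dim):
            super().__init__()
            # each cell sees its own features plus the other person's hidden state
            self.wearer_cell = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
            self.interactor_cell = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)

        def forward(self, wearer_feats, interactor_feats):
            # both inputs: (T, B, feat_dim) per-person action features
            T, B, _ = wearer_feats.shape
            H = self.wearer_cell.hidden_size
            hw, cw = wearer_feats.new_zeros(B, H), wearer_feats.new_zeros(B, H)
            hi, ci = wearer_feats.new_zeros(B, H), wearer_feats.new_zeros(B, H)
            for t in range(T):
                hw, cw = self.wearer_cell(torch.cat([wearer_feats[t], hi], 1), (hw, cw))
                hi, ci = self.interactor_cell(torch.cat([interactor_feats[t], hw], 1), (hi, ci))
            return torch.cat([hw, hi], 1)  # joint relation representation for recognition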

Asynchronous parallel implementations of stochastic gradient (SG) have been broadly used in training deep neural networks and have achieved many successes in practice recently. However, existing theories cannot explain their convergence and speedup properties, mainly due to the nonconvexity of most deep learning formulations and the asynchronous parallel mechanism. To fill this gap and provide theoretical support, this paper studies two asynchronous parallel implementations of SG: one on a computer network and the other on a shared-memory system. We establish an ergodic convergence rate $O(1/\sqrt{K})$ for both algorithms and prove that linear speedup is achievable if the number of workers is bounded by $\sqrt{K}$ ($K$ is the total number of iterations). Our results generalize and improve existing analyses for convex minimization.
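
For intuition, a minimal shared-memory sketch in Python (a Hogwild!-style toy on a least-squares objective; the problem, learning rate, and thread count are illustrative assumptions, not the paper's setup) looks like this:

    import threading
    import numpy as np

    # Toy objective: f(x) = 0.5 * ||A x - b||^2, minimized by asynchronous workers
    A = np.random.default_rng(0).normal(size=(256, 16))
    b = np.random.default_rng(1).normal(size=256)
    x = np.zeros(16)                     # shared parameters, updated without locks
    lr, steps_per_worker, num_workers = 1e-3, 5000, 4

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps_per_worker):
            i = rng.integers(len(b))     # one sample -> stochastic gradient
            grad = (A[i] @ x - b[i]) * A[i]
            x[:] = x - lr * grad         # in-place update; may read stale values

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(num_workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("final objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2)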

* 31 pages
In this paper, we propose an effective face completion algorithm using a deep generative model. Unlike the well-studied background completion problem, face completion is more challenging because it often requires generating semantically new pixels for missing key components (e.g., eyes and mouths) that exhibit large appearance variations. Unlike existing nonparametric algorithms that search for patches to synthesize, our algorithm directly generates contents for missing regions with a neural network. The model is trained with a combination of a reconstruction loss, two adversarial losses and a semantic parsing loss, which together ensure pixel faithfulness and local-global content consistency. With extensive experimental results, we demonstrate qualitatively and quantitatively that our model can handle large areas of missing pixels in arbitrary shapes and generate realistic face completion results.
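
A minimal sketch of how such a combined objective might look (the loss types and weights below are placeholders for illustration, not the paper's exact formulation):

    import torch
    import torch.nn.functional as F

    def completion_loss(generated, target, d_local_out, d_global_out,
                        parsing_pred, parsing_gt, w_adv=1e-3, w_parse=1e-2):
        """Sketch: reconstruction + local/global adversarial + semantic parsing losses."""
        rec = F.l1_loss(generated, target)                 # pixel faithfulness
        adv_local = F.binary_cross_entropy(d_local_out, torch.ones_like(d_local_out))
        adv_global = F.binary_cross_entropy(d_global_out, torch.ones_like(d_global_out))
        parse = F.cross_entropy(parsing_pred, parsing_gt)  # local-global content consistency
        return rec + w_adv * (adv_local + adv_global) + w_parse * parse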

* Accepted by CVPR 2017
Many scientific datasets are of high dimension, and their analysis usually requires visual manipulation that retains the most important structures of the data. The principal curve is a widely used approach for this purpose. However, many existing methods work only for data whose structures are not self-intersecting, which is quite restrictive for real applications. A few methods can overcome this problem, but they either require complicated hand-crafted rules for a specific task, lacking convergence guarantees and the flexibility to adapt to different tasks, or cannot obtain explicit structures of the data. To address these issues, we develop a new regularized principal graph learning framework that captures the local information of the underlying graph structure based on reversed graph embedding. As showcases, we propose models that can learn a spanning tree or a weighted undirected $\ell_1$ graph, and we develop a new learning algorithm that learns a set of principal points and a graph structure from data simultaneously. The new algorithm is simple and has guaranteed convergence. We then extend the proposed framework to handle large-scale data. Experimental results on various synthetic and six real-world datasets show that the proposed method compares favorably with baselines and can uncover the underlying structure correctly.

In recent years, with the development of aerospace technology, we use more and more images captured by satellites to obtain information. However, the large number of useless raw images, the limited data storage resources, and the poor transmission capability of satellites hinder our use of valuable images. It is therefore necessary to deploy an on-orbit semantic segmentation model to filter out useless images before data transmission. In this paper, we present a detailed comparison of recent deep learning models. Considering the computing environment of satellites, we compare the methods in terms of accuracy, parameter count, and resource consumption on the same public dataset, and we analyze the relations among these factors. Based on the experimental results, we further propose a viable on-orbit semantic segmentation strategy. It will be deployed on the TianZhi-2 satellite, which supports deep learning methods and will be launched soon.

* 8 pages, 3 figures, ICNC-FSKD 2019
Joint image filters leverage a guidance image as a prior and transfer its structural details to the target image for suppressing noise or enhancing spatial resolution. Existing methods either rely on various explicit filter constructions or hand-designed objective functions, thereby making it difficult to understand, improve, and accelerate these filters in a coherent framework. In this paper, we propose a learning-based approach for constructing joint filters based on convolutional neural networks. In contrast to existing methods that consider only the guidance image, the proposed algorithm can selectively transfer salient structures that are consistent with both the guidance and target images. We show that a model trained on one type of data, e.g., RGB and depth images, generalizes well to other modalities, e.g., flash/non-flash and RGB/NIR images. We validate the effectiveness of the proposed joint filter through extensive experimental evaluations against state-of-the-art methods.
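
As a hedged sketch of such a learning-based joint filter (branch depths and channel sizes below are illustrative assumptions, not the exact architecture), one convolutional realization is:

    import torch
    import torch.nn as nn

    class JointFilterCNN(nn.Module):
        """Sketch: separate branches encode the target and guidance images;
        a fusion branch predicts a residual correction of the target."""
        def __init__(self):
            super().__init__()
            def branch(in_ch):
                return nn.Sequential(
                    nn.Conv2d(in_ch, 96, 9, padding=4), nn.ReLU(inplace=True),
                    nn.Conv2d(96, 48, 1), nn.ReLU(inplace=True),
                    nn.Conv2d(48, 1, 5, padding=2))
            self.target_branch = branch(1)     # e.g. noisy or low-quality depth
            self.guidance_branch = branch(3)   # e.g. RGB guidance image
            self.fusion = nn.Sequential(
                nn.Conv2d(2, 64, 9, padding=4), nn.ReLU(inplace=True),
                nn.Conv2d(64, 32, 1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 1, 5, padding=2))

        def forward(self, target, guidance):
            feats = torch.cat([self.target_branch(target),
                               self.guidance_branch(guidance)], dim=1)
            return target + self.fusion(feats)  # filtered output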

We propose a high-quality photo-to-pencil translation method with fine-grained control over the drawing style. This is a challenging task due to multiple stroke types (e.g., outline and shading), structural complexity of pencil shading (e.g., hatching), and the lack of aligned training data pairs. To address these challenges, we develop a two-branch model that learns separate filters for generating sketchy outlines and tonal shading from a collection of pencil drawings. We create training data pairs by extracting clean outlines and tonal illustrations from original pencil drawings using image filtering techniques, and we manually label the drawing styles. In addition, our model creates different pencil styles (e.g., line sketchiness and shading style) in a user-controllable manner. Experimental results on different types of pencil drawings show that the proposed algorithm performs favorably against existing methods in terms of quality, diversity and user evaluations.

* Accepted by CVPR 2019
Photorealistic image stylization concerns transferring the style of a reference photo to a content photo with the constraint that the stylized photo should remain photorealistic. While several photorealistic image stylization methods exist, they tend to generate spatially inconsistent stylizations with noticeable artifacts. In this paper, we propose a method to address these issues. The proposed method consists of a stylization step and a smoothing step. While the stylization step transfers the style of the reference photo to the content photo, the smoothing step ensures spatially consistent stylizations. Each of the steps has a closed-form solution and can be computed efficiently. We conduct extensive experimental validations. The results show that the proposed method generates photorealistic stylization outputs that are more preferred by human subjects as compared to those by the competing methods while running much faster. Source code and additional results are available at https://github.com/NVIDIA/FastPhotoStyle .

* Accepted by ECCV 2018
Existing video prediction methods mainly rely on observing multiple historical frames or focus on predicting the next single frame. In this work, we study the problem of generating multiple consecutive future frames from a single still image. We formulate the multi-frame prediction task as a multiple-time-step flow (multi-flow) prediction phase followed by a flow-to-frame synthesis phase. The multi-flow prediction is modeled in a variational probabilistic manner with spatial-temporal relationships learned through 3D convolutions. The flow-to-frame synthesis is modeled as a generative process so that the predicted results lie closer to the manifold of real video sequences. Such a two-phase design prevents the model from directly operating in the high-dimensional pixel space of the frame sequence and is demonstrated to be more effective in predicting better and more diverse results. Extensive experimental results on videos with different types of motion show that the proposed algorithm performs favorably against existing methods in terms of quality, diversity and human perceptual evaluation.

* Accepted by ECCV 2018
Universal style transfer aims to transfer arbitrary visual styles to content images. Existing feed-forward methods, while enjoying inference efficiency, are mainly limited by their inability to generalize to unseen styles or by compromised visual quality. In this paper, we present a simple yet effective method that tackles these limitations without training on any pre-defined styles. The key ingredient of our method is a pair of feature transforms, whitening and coloring, embedded in an image reconstruction network. The whitening and coloring transforms reflect a direct matching of the feature covariance of the content image to that of a given style image, which shares a similar spirit with the optimization of the Gram-matrix-based cost in neural style transfer. We demonstrate the effectiveness of our algorithm by generating high-quality stylized images with comparisons to a number of recent methods. We also analyze our method by visualizing the whitened features and synthesizing textures via simple feature coloring.
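
The core transform can be sketched in a few lines (a minimal PyTorch version operating on a single feature map; the eigen-decomposition details and regularization are implementation assumptions):

    import torch

    def whitening_coloring_transform(content_feat, style_feat, eps=1e-5):
        """Sketch of WCT: whiten the content features, then color them with the
        style feature covariance. Inputs: (C, H, W) maps from the same layer."""
        C, H, W = content_feat.shape
        fc = content_feat.reshape(C, -1)
        fs = style_feat.reshape(C, -1)
        mc, ms = fc.mean(1, keepdim=True), fs.mean(1, keepdim=True)
        fc, fs = fc - mc, fs - ms

        # Whitening: remove correlations of the content features
        cov_c = fc @ fc.t() / (fc.shape[1] - 1) + eps * torch.eye(C)
        ec, vc = torch.linalg.eigh(cov_c)
        whitened = vc @ torch.diag(ec.clamp_min(eps).rsqrt()) @ vc.t() @ fc

        # Coloring: impose the covariance of the style features
        cov_s = fs @ fs.t() / (fs.shape[1] - 1) + eps * torch.eye(C)
        es, vs = torch.linalg.eigh(cov_s)
        colored = vs @ torch.diag(es.clamp_min(eps).sqrt()) @ vs.t() @ whitened

        return (colored + ms).reshape(C, H, W)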

* Accepted by NIPS 2017
Recent progress on deep discriminative and generative modeling has shown promising results on texture synthesis. However, existing feed-forward methods trade generality for efficiency and suffer from several issues, such as lack of generality (i.e., one network per texture), lack of diversity (i.e., always producing visually identical outputs) and suboptimality (i.e., generating less satisfying visual effects). In this work, we focus on solving these issues for improved texture synthesis. We propose a deep generative feed-forward network which enables efficient synthesis of multiple textures within one single network and meaningful interpolation between them. Meanwhile, a suite of important techniques are introduced to achieve better convergence and diversity. With extensive experiments, we demonstrate the effectiveness of the proposed model and techniques for synthesizing a large number of textures and show its application to stylization.

* Accepted by CVPR 2017
Time-of-Flight (ToF) depth sensing cameras can obtain depth maps at a high frame rate. However, their low resolution and sensitivity to noise are always a concern. A popular solution is to upsample the noisy low-resolution depth map under the guidance of the companion high-resolution color image. However, due to the constraints of existing upsampling models, the high-resolution depth map obtained in this way may suffer from either texture copy artifacts or blurred depth discontinuities. In this paper, a novel optimization framework is proposed with a new data term and smoothness term. Comprehensive experiments on both synthetic and real data show that the proposed method effectively tackles both texture copy artifacts and blurred depth discontinuities, and demonstrates sufficient robustness to noise. Moreover, a data-driven scheme is proposed to adaptively estimate the parameter of the upsampling optimization framework. The encouraging performance is maintained even for large upsampling factors, e.g., $8\times$ and $16\times$.
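
To make the data/smoothness split concrete, a generic guided upsampling energy of this family (shown only for illustration; the paper's actual terms are its contribution and differ from this form) is $E(D)=\sum_p \phi_d\big(D_p-\tilde{D}_p\big)+\lambda \sum_p \sum_{q\in\mathcal{N}(p)} w_{pq}(I)\,\phi_s\big(D_p-D_q\big)$, where $\tilde{D}$ is the upsampled ToF measurement, $I$ the guidance color image, $w_{pq}(I)$ a guidance-dependent weight, and $\phi_d$, $\phi_s$ penalty functions; the design of both terms determines whether texture copy artifacts or blurred discontinuities appear.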

Determining the localization of specific proteins in human cells is important for understanding cellular functions and the biological processes underlying diseases. Among imaging techniques, high-throughput fluorescence microscopy is an efficient biotechnology for staining the protein of interest in a cell. In this work, we present a novel classification model, Twin U-Net (TUNet), for processing the Atlas images and classifying the localization of the proteins they contain. Several notable deep learning models, including GoogLeNet and ResNet, are employed for comparison. Results show that our system obtains competitive performance.

Mapping is an essential task for mobile robots, and topological representations often serve as a basis for various applications. In this paper, a novel framework that builds topological maps incrementally is proposed. The algorithm is based on a distance map, and in our framework the topological map grows as more sensor data is appended. To demonstrate the results, we show the output of the distance-map-based method on several popular maps and run the incremental framework on raw sensor data to obtain a growing topological map as the robot explores the environment.

Recently, we have witnessed explosive growth in images with complex information and content. To effectively and precisely retrieve desired images from a large-scale image database with low time cost, we propose a multiple-feature-fusion image retrieval algorithm based on texture features and rough set theory. In contrast to conventional approaches that use only a single feature or criterion, we fuse different features through normalization. Rough set theory helps enhance the robustness of the retrieval system when facing an incomplete data warehouse. To enhance the texture extraction paradigm, we use Gabor wavelet functions, which offer better robustness. In addition, from the perspectives of internal and external normalization, we re-organize the extracted features into a better combination. Numerical experiments verify the general feasibility of our methodology and show improved overall accuracy compared with other state-of-the-art algorithms.
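
As a hedged illustration of Gabor-based texture features (the filter parameters, scales, and pooling scheme here are assumptions for demonstration, not the paper's configuration), one could compute:

    import cv2
    import numpy as np

    def gabor_texture_features(gray_image, wavelengths=(4, 8), orientations=4):
        """Sketch: filter with a small Gabor bank, pool mean/std per filter,
        then normalize the resulting feature vector before fusion."""
        feats = []
        for lambd in wavelengths:
            for k in range(orientations):
                theta = k * np.pi / orientations
                kernel = cv2.getGaborKernel((21, 21), sigma=4.0, theta=theta,
                                            lambd=lambd, gamma=0.5)
                response = cv2.filter2D(gray_image.astype(np.float32), cv2.CV_32F, kernel)
                feats.extend([response.mean(), response.std()])
        feats = np.asarray(feats)
        return (feats - feats.min()) / (np.ptp(feats) + 1e-8)  # internal normalization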

Document classification tasks have primarily been tackled at the word level. Recent research on character-level inputs shows several benefits over word-level approaches, such as natural incorporation of morphemes and better handling of rare words. We propose a neural network architecture that utilizes both convolution and recurrent layers to efficiently encode character inputs. We validate the proposed model on eight large-scale document classification tasks and compare it with character-level convolution-only models. It achieves comparable performance with far fewer parameters.
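
A hedged sketch of such a hybrid encoder (layer sizes and pooling choices are illustrative, not the paper's configuration):

    import torch
    import torch.nn as nn

    class ConvRecurrentClassifier(nn.Module):
        """Sketch: convolutions shorten the character sequence, an LSTM encodes
        the compressed sequence, and a linear layer produces class logits."""
        def __init__(self, vocab_size, num_classes, emb=16, channels=128, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb)
            self.conv = nn.Sequential(
                nn.Conv1d(emb, channels, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool1d(2),
                nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool1d(2))
            self.rnn = nn.LSTM(channels, hidden, batch_first=True)
            self.classify = nn.Linear(hidden, num_classes)

        def forward(self, char_ids):                  # (B, T) integer character ids
            x = self.embed(char_ids).transpose(1, 2)  # (B, emb, T) for Conv1d
            x = self.conv(x).transpose(1, 2)          # (B, T/4, channels)
            _, (h, _) = self.rnn(x)
            return self.classify(h[-1])               # document class logits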

Many practical applications, such as gene expression analysis, multi-task learning, image recognition, signal processing, and medical data analysis, pursue a sparse solution for feature selection and particularly favor nonzeros that are \emph{evenly} distributed across different groups. The exclusive sparsity norm has been widely used to serve this purpose. However, systematic studies of exclusive sparsity norm optimization are still lacking. This paper offers two main contributions from the optimization perspective: 1) We provide several efficient algorithms to solve exclusive sparsity norm minimization with either a smooth loss or the hinge loss (a non-smooth loss). All algorithms achieve the optimal convergence rate $O(1/k^2)$ ($k$ is the iteration number). To the best of our knowledge, this is the first time such a convergence rate has been guaranteed for general exclusive sparsity norm minimization; 2) When group information is unavailable for defining the exclusive sparsity norm, we propose a random grouping scheme to construct groups and prove that if the number of groups is appropriately chosen, the nonzeros (true features) are grouped in the ideal way with high probability. Empirical studies validate the efficiency of the proposed algorithms and the effectiveness of the random grouping scheme on the proposed exclusive SVM formulation.
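
For reference, the exclusive sparsity (exclusive lasso) regularizer over a group partition $\mathcal{G}$ is commonly written as $\Omega(x)=\frac{1}{2}\sum_{g\in\mathcal{G}}\|x_g\|_1^2$, so the problems studied here take the form $\min_x\, \ell(x)+\lambda\,\Omega(x)$ with $\ell$ either a smooth loss or the hinge loss. Because the $\ell_1$ norm acts inside each group before squaring, sparsity is encouraged within groups while every group tends to keep some nonzeros, which is exactly the "evenly distributed" pattern described above.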

We introduce a unique experimental testbed that consists of a fleet of 16 miniature Ackermann-steering vehicles. We are motivated by a lack of available low-cost platforms to support research and education in multi-car navigation and trajectory planning. This article elaborates the design of our miniature robotic car, the Cambridge Minicar, as well as the fleet's control architecture. Our experimental testbed allows us to implement state-of-the-art driver models as well as autonomous control strategies, and test their validity in a real, physical multi-lane setup. Through experiments on our miniature highway, we are able to tangibly demonstrate the benefits of cooperative driving on multi-lane road topographies. Our setup paves the way for indoor large-fleet experimental research.

* Accepted to ICRA 2019
Reliable uncertainty quantification is a first step towards building explainable, transparent, and accountable artificial intelligence systems. Recent progress in Bayesian deep learning has made such quantification realizable. In this paper, we propose novel methods to study the benefits of characterizing model and data uncertainties for natural language processing (NLP) tasks. With empirical experiments on sentiment analysis, named entity recognition, and language modeling using convolutional and recurrent neural network models, we show that explicitly modeling uncertainties is not only necessary to measure output confidence levels, but also useful for enhancing model performance on various NLP tasks.
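
One standard way to obtain such model (epistemic) uncertainty in Bayesian deep learning is Monte Carlo dropout; the sketch below illustrates the general idea and is not the authors' exact procedure:

    import torch

    def mc_dropout_predict(model, inputs, num_samples=20):
        """Sketch: keep dropout active at test time and average several
        stochastic forward passes; the variance across passes reflects
        model uncertainty, the mean is the predictive distribution."""
        model.train()                      # leave dropout layers stochastic
        with torch.no_grad():
            probs = torch.stack([torch.softmax(model(inputs), dim=-1)
                                 for _ in range(num_samples)])
        return probs.mean(dim=0), probs.var(dim=0)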

* To appear at AAAI 2019
For rescue robots, flippers endow the robot with additional ability to traverse various terrains, so autonomous motion becomes more important. In recent work, autonomy is achieved either by planning over several special states or by learning from collected data. We consider whether it is possible to build continuous states without collecting old trail data. In this paper, we first model the possible states as a global planning path under the parameter configuration of the scene. Then, we follow the path to achieve an autonomous run. We plot the morphology of each path point to show the correctness of the path and implement simple path following on a real robot to demonstrate the performance of our algorithm.
