Research papers and code for "Fei Tian":
In the Bag-of-Words (BoW) model based image retrieval task, the precision of visual matching plays a critical role in improving retrieval performance. Conventionally, local cues of a keypoint are employed. However, such a strategy does not consider the contextual evidence of a keypoint, which leads to the prevalence of false matches. To address this problem, this paper defines a "true match" as a pair of keypoints which are similar on three levels, i.e., local, regional, and global. Then, a principled probabilistic framework is established, which is capable of implicitly integrating discriminative cues from all these feature levels. Specifically, the Convolutional Neural Network (CNN) is employed to extract features from regional and global patches, leading to the so-called "Deep Embedding" framework. CNN has been shown to produce excellent performance on a dozen computer vision tasks such as image classification and detection, but little work has been done on BoW based image retrieval. In this paper, we first show that proper pre-processing techniques are necessary for effective usage of CNN features. Then, in the attempt to fit them into our model, a novel indexing structure called "Deep Indexing" is introduced, which dramatically reduces memory usage. Extensive experiments on three benchmark datasets demonstrate that the proposed Deep Embedding method greatly improves retrieval accuracy when CNN features are integrated. We show that our method is efficient in terms of both memory and time cost, and compares favorably with the state-of-the-art methods.

* 10 pages, 13 figures, 7 tables, submitted to ACM Multimedia 2014
Since the proposal of big data analysis and the Graphic Processing Unit (GPU), deep learning technology has received a great deal of attention and has been widely applied in the field of image processing. In this paper, we aim to comprehensively review and summarize the deep learning technologies for image denoising proposed in recent years. Moreover, we systematically analyze the conventional machine learning methods for image denoising. Finally, we point out some research directions for deep learning technologies in image denoising.

In the area of computer vision, deep learning has produced a variety of state-of-the-art models that rely on massive labeled data. However, collecting and annotating images from the real world demands considerable labor and money, and is usually too passive to build datasets with specific characteristics, such as small object areas and high occlusion levels. Under the framework of Parallel Vision, this paper presents a purposeful way to design artificial scenes and automatically generate virtual images with precise annotations. A virtual dataset named ParallelEye is built, which can be used for several computer vision tasks. Then, by training the DPM (Deformable Parts Model) and Faster R-CNN detectors, we show that the performance of models can be significantly improved by combining ParallelEye with publicly available real-world datasets during the training phase. In addition, we investigate the potential of testing the trained models from a specific aspect using intentionally designed virtual datasets, in order to discover the flaws of trained models. From the experimental results, we conclude that our virtual dataset is viable for training and testing object detectors.

* To be published in IEEE/CAA Journal of Automatica Sinica
We propose Sentence Level Recurrent Topic Model (SLRTM), a new topic model that assumes the generation of each word within a sentence to depend on both the topic of the sentence and the whole history of its preceding words in the sentence. Different from conventional topic models that largely ignore the sequential order of words or their topic coherence, SLRTM gives full characterization to them by using a Recurrent Neural Networks (RNN) based framework. Experimental results have shown that SLRTM outperforms several strong baselines on various tasks. Furthermore, SLRTM can automatically generate sentences given a topic (i.e., topics to sentences), which is a key technology for real world applications such as personalized short text conversation.

* The submitted version was completed in Feb. 2016; still being improved
Word embedding, which refers to low-dimensional dense vector representations of natural words, has demonstrated its power in many natural language processing tasks. However, it may suffer from the inaccurate and incomplete information contained in the free text corpus used as training data. To tackle this challenge, quite a few works have leveraged knowledge graphs as an additional information source to improve the quality of word embedding. Although these works have achieved certain success, they have neglected some important facts about knowledge graphs: (i) many relationships in knowledge graphs are many-to-one, one-to-many or even many-to-many, rather than simply one-to-one; (ii) most head entities and tail entities in knowledge graphs come from very different semantic spaces. To address these issues, in this paper, we propose a new algorithm named ProjectNet. ProjectNet models the relationships between head and tail entities after transforming them with different low-rank projection matrices. The low-rank projection allows non one-to-one relationships between entities, while different projection matrices for head and tail entities allow them to originate in different semantic spaces. The experimental results demonstrate that ProjectNet yields more accurate word embedding than previous works, and thus leads to clear improvements in various natural language processing tasks.

Video image datasets play an essential role in the design and evaluation of traffic vision algorithms. Nevertheless, a longstanding inconvenience concerning image datasets is that manually collecting and annotating large-scale diversified datasets from real scenes is time-consuming and prone to error. For this reason, virtual datasets have begun to function as a proxy for real datasets. In this paper, we propose to construct large-scale artificial scenes for traffic vision research and generate a new virtual dataset called "ParallelEye". First, street map data is used to build a 3D scene model of the Zhongguancun Area, Beijing. Then, computer graphics, virtual reality, and rule modeling technologies are utilized to synthesize large-scale, realistic virtual urban traffic scenes, in which the fidelity and geography match the real world well. Furthermore, the Unity3D platform is used to render the artificial scenes and generate accurate ground-truth labels, e.g., semantic/instance segmentation, object bounding box, object tracking, optical flow, and depth. The environmental conditions in artificial scenes can be controlled completely. As a result, we present a viable implementation pipeline for constructing large-scale artificial scenes for traffic vision research. The experimental results demonstrate that this pipeline is able to generate photorealistic virtual datasets with low modeling time and highly accurate labeling.

* To be published in IEEE ITSC 2017
Recent years have witnessed great development in deep learning based video person re-identification (Re-ID). A key factor for video person Re-ID is how to effectively construct discriminative video feature representations that are robust to complicated situations such as occlusions. Recent part-based approaches employ spatial and temporal attention to extract representative local features, but they ignore the correlations between the parts. To leverage the relations of different parts, we propose an innovative adaptive graph representation learning scheme for video person Re-ID, which enables contextual interactions between the relevant regional features. Specifically, we exploit pose alignment connection and feature affinity connection to construct an adaptive structure-aware adjacency graph, which models the intrinsic relations between graph nodes. We perform feature propagation on the adjacency graph to refine the original regional features iteratively, so that the information of neighbor nodes is taken into account for part feature representation. To learn compact and discriminative representations, we further propose a novel temporal resolution-aware regularization, which enforces consistency among different temporal resolutions for the same identities. We conduct extensive evaluations on four benchmarks, i.e., iLIDS-VID, PRID2011, MARS, and DukeMTMC-VideoReID. The experimental results show competitive performance, which demonstrates the effectiveness of our proposed method.

* 10 pages, 7 figures
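
The iterative feature propagation described in the abstract above can be illustrated with a minimal sketch. It assumes a row-normalized adjacency matrix and a simple mixing rule with weight `alpha`; the abstract does not specify the update rule, so both are illustrative assumptions, and the function names are hypothetical.

```python
def row_normalize(adj):
    """Scale each row of a non-negative adjacency matrix to sum to 1."""
    normalized = []
    for row in adj:
        total = sum(row)
        normalized.append([v / total if total > 0 else 0.0 for v in row])
    return normalized


def propagate(features, adj, alpha=0.5, steps=2):
    """Iteratively mix each node's feature with its neighbors' weighted average.

    features: list of per-region feature vectors (one per graph node)
    adj:      non-negative affinity matrix between regions (assumed given)
    alpha:    hypothetical mixing weight between own and neighbor features
    """
    norm_adj = row_normalize(adj)
    current = [list(f) for f in features]
    for _ in range(steps):
        updated = []
        for i, own in enumerate(current):
            # Weighted average of all node features, using row i of the graph.
            neighbor = [
                sum(norm_adj[i][j] * current[j][d] for j in range(len(current)))
                for d in range(len(own))
            ]
            updated.append(
                [(1 - alpha) * o + alpha * n for o, n in zip(own, neighbor)]
            )
        current = updated
    return current
```

Calling `propagate(features, adj)` with one feature vector per body region returns refined vectors of the same shape. In the paper the adjacency is built from pose alignment and feature affinity; here it is simply passed in.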
Dense depth perception is critical for autonomous driving and other robotics applications. However, modern LiDAR sensors only provide sparse depth measurements. It is thus necessary to complete the sparse LiDAR data, where a synchronized guidance RGB image is often used to facilitate this completion. Many neural networks have been designed for this task. However, they often naïvely fuse the LiDAR data and RGB image information by performing feature concatenation or element-wise addition. Inspired by guided image filtering, we design a novel guided network to predict kernel weights from the guidance image. These predicted kernels are then applied to extract the depth image features. In this way, our network generates content-dependent and spatially-variant kernels for multi-modal feature fusion. Dynamically generated spatially-variant kernels could lead to prohibitive GPU memory consumption and computation overhead. We further design a convolution factorization to reduce computation and memory consumption. The GPU memory reduction makes it possible for feature fusion to work in a multi-stage scheme. We conduct comprehensive experiments to verify our method on real-world outdoor, indoor and synthetic datasets. Our method produces strong results. It outperforms state-of-the-art methods on the NYUv2 dataset and ranks 1st on the KITTI depth completion benchmark at the time of submission. It also presents strong generalization capability under different 3D point densities, various lighting and weather conditions, as well as cross-dataset evaluations. The code will be released for reproduction.

* Submitted to the IEEE Transactions on Image Processing (TIP)
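
To make the idea of content-dependent, spatially-variant kernels concrete, here is a minimal sketch. It is not the paper's network: the kernel predictor is replaced by a hand-crafted rule (weights decay with guidance-intensity difference, in the spirit of classical guided or bilateral filtering), and `predict_kernels`, `apply_kernels`, and `sigma` are hypothetical names introduced for illustration.

```python
import math


def predict_kernels(guide, sigma=0.1):
    """Stand-in for the kernel-predicting network (an assumption).

    For every pixel, build a normalized 3x3 kernel whose weights decay
    with the guidance-intensity difference to each in-bounds neighbor.
    """
    h, w = len(guide), len(guide[0])
    kernels = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            weights = {}
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        diff = guide[ny][nx] - guide[y][x]
                        weights[(dy, dx)] = math.exp(-(diff ** 2) / (2 * sigma ** 2))
            total = sum(weights.values())
            kernels[y][x] = {k: v / total for k, v in weights.items()}
    return kernels


def apply_kernels(depth, kernels):
    """Filter the depth map with a different kernel at every pixel."""
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(
                wgt * depth[y + dy][x + dx]
                for (dy, dx), wgt in kernels[y][x].items()
            )
    return out
```

Because each pixel carries its own kernel, storage grows with image size times kernel size, which hints at why the abstract mentions a factorization to keep memory manageable.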
Automatic neural architecture design has shown its potential in discovering powerful neural network architectures. Existing methods, whether based on reinforcement learning or evolutionary algorithms (EA), conduct architecture search in a discrete space, which is highly inefficient. In this paper, we propose a simple and efficient method for automatic neural architecture design based on continuous optimization. We call this new approach neural architecture optimization (NAO). There are three key components in our proposed approach: (1) An encoder embeds/maps neural network architectures into a continuous space. (2) A predictor takes the continuous representation of a network as input and predicts its accuracy. (3) A decoder maps a continuous representation of a network back to its architecture. The performance predictor and the encoder enable us to perform gradient based optimization in the continuous space to find the embedding of a new architecture with potentially better accuracy. Such a better embedding is then decoded to a network by the decoder. Experiments show that the architecture discovered by our method is very competitive for the image classification task on CIFAR-10 and the language modeling task on PTB, outperforming or on par with the best results of previous architecture search methods with a significant reduction of computational resources. Specifically, we obtain a 2.11% test set error rate for the CIFAR-10 image classification task and a test set perplexity of 56.0 for the PTB language modeling task. Furthermore, combined with the recently proposed weight sharing mechanism, we discover powerful architectures on CIFAR-10 (with error rate 3.53%) and on PTB (with test set perplexity 56.6), with very limited computational resources (less than 10 GPU hours) for both tasks.

* NIPS 2018. Code available at: https://github.com/renqianluo/NAO
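
The encoder, predictor, and decoder loop from the abstract above can be sketched with toy stand-ins. This is only an illustration under strong assumptions: the encoder is a one-hot embedding, the predictor is a fixed linear function, and the decoder is a per-layer argmax, whereas the paper learns all three. `NUM_OPS` and every function name here are hypothetical and unrelated to the released repository.

```python
NUM_OPS = 3  # hypothetical number of candidate operations per layer


def encode(arch):
    """Toy encoder: one-hot embed each layer's operation choice."""
    return [[1.0 if op == k else 0.0 for k in range(NUM_OPS)] for op in arch]


def predict(embedding, weights):
    """Toy linear performance predictor (stand-in for a learned model)."""
    return sum(
        e * w
        for row_e, row_w in zip(embedding, weights)
        for e, w in zip(row_e, row_w)
    )


def predictor_gradient(embedding, weights):
    """Gradient of the linear predictor with respect to the embedding."""
    return [list(row) for row in weights]


def decode(embedding):
    """Toy decoder: pick the highest-scoring operation for each layer."""
    return [max(range(NUM_OPS), key=lambda k: row[k]) for row in embedding]


def optimize(arch, weights, step_size=1.0, steps=3):
    """Move the embedding along the predictor gradient, then decode it."""
    emb = encode(arch)
    for _ in range(steps):
        grad = predictor_gradient(emb, weights)
        emb = [
            [e + step_size * g for e, g in zip(row_e, row_g)]
            for row_e, row_g in zip(emb, grad)
        ]
    return decode(emb)
```

The point of the sketch is the control flow: the search happens by gradient steps on a continuous embedding rather than by sampling discrete architectures, and a discrete architecture is only recovered at the end by decoding.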
Recent studies have shown that reinforcement learning (RL) is an effective approach for improving the performance of neural machine translation (NMT) systems. However, due to its instability, successful RL training is challenging, especially in real-world systems where deep models and large datasets are leveraged. In this paper, taking several large-scale translation tasks as testbeds, we conduct a systematic study on how to train better NMT models using reinforcement learning. We provide a comprehensive comparison of several important factors (e.g., baseline reward, reward shaping) in RL training. Furthermore, since it remains unclear whether RL is still beneficial when monolingual data is used, we propose a new method to leverage RL to further boost the performance of NMT systems trained with source/target monolingual data. By integrating all our findings, we obtain competitive results on WMT14 English-German, WMT17 English-Chinese, and WMT17 Chinese-English translation tasks, in particular setting a state-of-the-art performance on the WMT17 Chinese-English translation task.

* EMNLP 2018
Machine learning is essentially the science of playing with data. An adaptive data selection strategy, which dynamically chooses different data at various training stages, can reach a more effective model in a more efficient way. In this paper, we propose a deep reinforcement learning framework, which we call Neural Data Filter (NDF), to explore automatic and adaptive data selection in the training process. In particular, NDF takes advantage of a deep neural network to adaptively select and filter important data instances from a sequential stream of training data, such that the future accumulative reward (e.g., the convergence speed) is maximized. In contrast to previous studies in data selection that are mainly based on heuristic strategies, NDF is quite generic and thus widely applicable to many machine learning tasks. Taking neural network training with stochastic gradient descent (SGD) as an example, comprehensive experiments with respect to various neural network models (e.g., multi-layer perceptron networks, convolutional neural networks and recurrent neural networks) and several applications (e.g., image classification and text understanding) demonstrate that NDF-powered SGD can achieve accuracy comparable to the standard SGD process while using less data and fewer iterations.

* A preliminary version will appear in ICLR 2017, workshop track. https://openreview.net/forum?id=SyJNmVqgg&noteId=SyJNmVqgg
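
The filtering step can be sketched as follows, assuming a logistic policy over hand-picked per-example features (for instance, the example's current loss). The reinforcement learning update that trains the policy is omitted, and all names are hypothetical rather than taken from the paper.

```python
import math
import random


def keep_probability(features, policy_weights, bias=0.0):
    """Logistic policy: probability of keeping one training example."""
    z = bias + sum(f * w for f, w in zip(features, policy_weights))
    return 1.0 / (1.0 + math.exp(-z))


def filter_batch(batch, policy_weights, rng):
    """Keep each (features, example) pair with its policy probability.

    batch: list of (features, example) tuples from the data stream
    rng:   a random.Random instance, passed in for reproducibility
    """
    kept = []
    for features, example in batch:
        if rng.random() < keep_probability(features, policy_weights):
            kept.append(example)
    return kept
```

Only the examples returned by `filter_batch` would be used for the SGD step; a reward such as convergence speed would then be used to update `policy_weights`, which this sketch leaves out.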
The Intelligence Quotient (IQ) Test is a set of standardized questions designed to evaluate human intelligence. Verbal comprehension questions appear very frequently in IQ tests; they measure human verbal ability, including the understanding of words with multiple senses, synonyms and antonyms, and analogies among words. In this work, we explore whether such tests can be solved automatically by artificial intelligence technologies, especially the deep learning technologies that have recently been developed and successfully applied in a number of fields. However, we found that the task was quite challenging, and simply applying existing technologies (e.g., word embedding) could not achieve good performance, mainly due to the multiple senses of words and the complex relations among words. To tackle these challenges, we propose a novel framework consisting of three components. First, we build a classifier to recognize the specific type of a verbal question (e.g., analogy, classification, synonym, or antonym). Second, we obtain distributed representations of words and relations by leveraging a novel word embedding method that considers the multi-sense nature of words and the relational knowledge among words (or their senses) contained in dictionaries. Third, for each type of question, we propose a specific solver based on the obtained distributed word representations and relation representations. Experimental results have shown that the proposed framework can not only outperform existing methods for solving verbal comprehension questions but also exceed the average performance of the Amazon Mechanical Turk workers involved in the study. The results indicate that with appropriate use of deep learning technologies we might be a step closer to human intelligence.

For Internet applications like sponsored search, caution needs to be taken when using machine learning to optimize their mechanisms (e.g., auction), since self-interested agents in these applications may change their behaviors (and thus the data distribution) in response to the mechanisms. To tackle this problem, a framework called game-theoretic machine learning (GTML) was recently proposed, which first learns a Markov behavior model to characterize agents' behaviors, and then learns the optimal mechanism by simulating agents' behavior changes in response to the mechanism. While GTML has demonstrated practical success, its generalization analysis is challenging because the behavior data are non-i.i.d. and dependent on the mechanism. To address this challenge, first, we decompose the generalization error for GTML into the behavior learning error and the mechanism learning error; second, for the behavior learning error, we obtain novel non-asymptotic error bounds for both parametric and non-parametric behavior learning methods; third, for the mechanism learning error, we derive a uniform convergence bound based on a new concept called the nested covering number of the mechanism space and the generalization analysis techniques developed for mixing sequences. To the best of our knowledge, this is the first work on the generalization analysis of GTML, and we believe it has general implications for the theoretical analysis of other complicated machine learning problems.

As a new neural machine translation approach, Non-Autoregressive machine Translation (NAT) has attracted attention recently due to its high efficiency in inference. However, the high efficiency has come at the cost of not capturing the sequential dependency on the target side of translation, which causes NAT to suffer from two kinds of translation errors: 1) repeated translations (due to indistinguishable adjacent decoder hidden states), and 2) incomplete translations (due to incomplete transfer of source side information via the decoder hidden states). In this paper, we propose to address these two problems by improving the quality of decoder hidden representations via two auxiliary regularization terms in the training process of an NAT model. First, to make the hidden states more distinguishable, we regularize the similarity between consecutive hidden states based on the corresponding target tokens. Second, to force the hidden states to contain all the information in the source sentence, we leverage the dual nature of translation tasks (e.g., English to German and German to English) and minimize a backward reconstruction error to ensure that the hidden states of the NAT decoder are able to recover the source side sentence. Extensive experiments conducted on several benchmark datasets show that both regularization strategies are effective and can alleviate the issues of repeated translations and incomplete translations in NAT models. The accuracy of NAT models is therefore improved significantly over the state-of-the-art NAT models with even better efficiency for inference.

* AAAI 2019
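
As a rough illustration of the first regularizer, the sketch below penalizes cosine similarity between consecutive decoder hidden states whenever their target tokens differ. The exact loss in the paper is not given in the abstract, so the `1 + cosine` form and the hard "tokens differ" test are assumptions, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_penalty(hidden_states, target_tokens):
    """Average (1 + cosine) over adjacent positions with different targets.

    The penalty is largest when neighboring hidden states are nearly
    identical even though the decoder should emit different tokens,
    which is the situation associated with repeated translations.
    """
    terms = [
        1.0 + cosine(hidden_states[t], hidden_states[t + 1])
        for t in range(len(target_tokens) - 1)
        if target_tokens[t] != target_tokens[t + 1]
    ]
    return sum(terms) / len(terms) if terms else 0.0
```

In training, a term like this would be added to the translation loss with some weight; the backward reconstruction term from the abstract needs a second translation model and is not sketched here.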
Teaching plays a very important role in our society, by spreading human knowledge and educating our next generations. A good teacher will select appropriate teaching materials, impart suitable methodologies, and set up targeted examinations, according to the learning behaviors of the students. In the field of artificial intelligence, however, the role of teaching has not been fully explored, and most attention is paid to machine learning. In this paper, we argue that equal attention, if not more, should be paid to teaching, and furthermore, an optimization framework (instead of heuristics) should be used to obtain good teaching strategies. We call this approach "learning to teach". In the approach, two intelligent agents interact with each other: a student model (which corresponds to the learner in traditional machine learning algorithms), and a teacher model (which determines the appropriate data, loss function, and hypothesis space to facilitate the training of the student model). The teacher model leverages the feedback from the student model to optimize its own teaching strategies by means of reinforcement learning, so as to achieve teacher-student co-evolution. To demonstrate the practical value of our proposed approach, we take the training of deep neural networks (DNN) as an example, and show that by using the learning to teach techniques, we are able to use much less training data and fewer iterations to achieve almost the same accuracy for different kinds of DNN models (e.g., multi-layer perceptron, convolutional neural networks and recurrent neural networks) under various machine learning tasks (e.g., image classification and text understanding).

* ICLR 2018
Machine learning algorithms have been applied to predict agent behaviors in real-world dynamic systems, such as advertiser behaviors in sponsored search and worker behaviors in crowdsourcing. The behavior data in these systems are generated by live agents: once the systems change due to the adoption of the prediction models learnt from the behavior data, agents will observe and respond to these changes by changing their own behaviors accordingly. As a result, the behavior data will evolve and will not be identically and independently distributed, posing great challenges to the theoretical analysis on the machine learning algorithms for behavior prediction. To tackle this challenge, in this paper, we propose to use Markov Chain in Random Environments (MCRE) to describe the behavior data, and perform generalization analysis of the machine learning algorithms on its basis. Since the one-step transition probability matrix of MCRE depends on both previous states and the random environment, conventional techniques for generalization analysis cannot be directly applied. To address this issue, we propose a novel technique that transforms the original MCRE into a higher-dimensional time-homogeneous Markov chain. The new Markov chain involves more variables but is more regular, and thus easier to deal with. We prove the convergence of the new Markov chain when time approaches infinity. Then we prove a generalization bound for the machine learning algorithms on the behavior data generated by the new Markov chain, which depends on both the Markovian parameters and the covering number of the function class compounded by the loss function for behavior prediction and the behavior prediction model. To the best of our knowledge, this is the first work that performs the generalization analysis on data generated by complex processes in real-world dynamic systems.

Due to the unparallelizable nature of the autoregressive factorization, AutoRegressive Translation (ART) models have to generate tokens sequentially during decoding and thus suffer from high inference latency. Non-AutoRegressive Translation (NART) models were proposed to reduce the inference time, but could only achieve inferior translation accuracy. In this paper, we propose a novel approach that leverages hints from hidden states and word alignments to help the training of NART models. The results show significant improvement over previous NART models on the WMT14 En-De and De-En datasets and are even comparable to a strong LSTM-based ART baseline while being one order of magnitude faster in inference.

* EMNLP-IJCNLP 2019
Neural machine translation usually adopts autoregressive models and suffers from exposure bias as well as the consequent error propagation problem. Many previous works have discussed the relationship between error propagation and the accuracy drop problem (i.e., the left part of the translated sentence is often better than its right part in left-to-right decoding models). In this paper, we conduct a series of analyses to understand this problem in depth and obtain several interesting findings. (1) The role of error propagation in accuracy drop is overstated in the literature, although it indeed contributes to the accuracy drop problem. (2) Characteristics of a language play a more important role in causing the accuracy drop: the left part of the translation result in a right-branching language (e.g., English) is more likely to be more accurate than its right part, while the right part is more accurate for a left-branching language (e.g., Japanese). Our discoveries are confirmed on different model structures including Transformer and RNN, and in other sequence generation tasks such as text summarization.

* EMNLP 2018
Long Short-Term Memory (LSTM) is one of the most widely used recurrent structures in sequence modeling. It aims to use gates to control information flow (e.g., whether to skip some information or not) in the recurrent computations, although its practical implementation based on soft gates only partially achieves this goal. In this paper, we propose a new way of training LSTM, which pushes the output values of the gates towards 0 or 1. By doing so, we can better control the information flow: the gates are mostly open or closed, instead of in a middle state, which makes the results more interpretable. Empirical studies show that (1) although it seems that we restrict the model capacity, there is no performance drop: we achieve better or comparable performance due to better generalization ability; (2) the outputs of gates are not sensitive to their inputs: we can easily compress the LSTM unit in multiple ways, e.g., low-rank approximation and low-precision approximation. The compressed models are even better than the baseline models without compression.

* ICML 2018
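
The abstract does not say how the gates are pushed towards 0 or 1. As a generic illustration only, and not the paper's training method, the sketch below sharpens a sigmoid gate with a temperature and measures how many gate values end up near the extremes; `gate`, `saturation`, `temperature`, and `margin` are hypothetical names.

```python
import math


def gate(pre_activation, temperature=1.0):
    """Sigmoid gate; a temperature below 1 sharpens outputs toward 0 or 1."""
    return 1.0 / (1.0 + math.exp(-pre_activation / temperature))


def saturation(values, margin=0.1):
    """Fraction of gate values within `margin` of 0 or 1."""
    near = [v for v in values if v < margin or v > 1.0 - margin]
    return len(near) / len(values)


if __name__ == "__main__":
    pre = [-2.0, -0.5, 0.5, 2.0]
    soft = [gate(x) for x in pre]
    sharp = [gate(x, temperature=0.2) for x in pre]
    print("soft gates: ", [round(v, 3) for v in soft], saturation(soft))
    print("sharp gates:", [round(v, 3) for v in sharp], saturation(sharp))
```

A saturation score of this kind is one simple way to check the abstract's claim that gates are "mostly open or closed" after training.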
Teaching is critical to human society: it is with teaching that prospective students are educated and human civilization can be inherited and advanced. A good teacher not only provides his/her students with qualified teaching materials (e.g., textbooks), but also sets up appropriate learning objectives (e.g., course projects and exams) considering different situations of a student. When it comes to artificial intelligence, treating machine learning models as students, the loss functions that are optimized act as perfect counterparts of the learning objective set by the teacher. In this work, we explore the possibility of imitating human teaching behaviors by dynamically and automatically outputting appropriate loss functions to train machine learning models. Different from typical learning settings in which the loss function of a machine learning model is predefined and fixed, in our framework, the loss function of a machine learning model (which we call the student) is defined by another machine learning model (which we call the teacher). The ultimate goal of the teacher model is to cultivate the student to have better performance as measured on a development dataset. Towards that end, similar to human teaching, the teacher, a parametric model, dynamically outputs different loss functions that will be used and optimized by its student model at different training stages. We develop an efficient learning method for the teacher model that makes gradient based optimization possible, avoiding less effective solutions such as policy optimization. We name our method "learning to teach with dynamic loss functions" (L2T-DLF for short). Extensive experiments on real world tasks including image classification and neural machine translation demonstrate that our method significantly improves the quality of various student models.

* NIPS 2018