Research papers and code for "Tsuhan Chen":
One common trend in image tagging research is to focus on visually relevant tags, and this tends to ignore the personal and social aspect of tags, especially on photoblogging websites such as Flickr. Previous work has correctly identified that many of the tags that users provide on images are not visually relevant (i.e. representative of the salient content in the image) and they go on to treat such tags as noise, ignoring that the users chose to provide those tags over others that could have been more visually relevant. Another common assumption about user generated tags for images is that the order of these tags provides no useful information for the prediction of tags on future images. This assumption also tends to define usefulness in terms of what is visually relevant to the image. For general tagging or labeling applications that focus on providing visual information about image content, these assumptions are reasonable, but when considering personalized image tagging applications, these assumptions are at best too rigid, ignoring user choice and preferences. We challenge the aforementioned assumptions, and provide a machine learning approach to the problem of personalized image tagging with the following contributions: 1.) We reformulate the personalized image tagging problem as a search/retrieval ranking problem, 2.) We leverage the order of tags, which does not always reflect visual relevance, provided by the user in the past as a cue to their tag preferences, similar to click data, 3.) We propose a technique to augment sparse user tag data (semi-supervision), and 4.) We demonstrate the efficacy of our method on a subset of Flickr images, showing improvement over previous state-of-art methods.

Click to Read Paper and Get Code
Autonomous indoor navigation of Micro Aerial Vehicles (MAVs) possesses many challenges. One main reason is that GPS has limited precision in indoor environments. The additional fact that MAVs are not able to carry heavy weight or power consuming sensors, such as range finders, makes indoor autonomous navigation a challenging task. In this paper, we propose a practical system in which a quadcopter autonomously navigates indoors and finds a specific target, i.e., a book bag, by using a single camera. A deep learning model, Convolutional Neural Network (ConvNet), is used to learn a controller strategy that mimics an expert pilot's choice of action. We show our system's performance through real-time experiments in diverse indoor locations. To understand more about our trained network, we use several visualization techniques.

Click to Read Paper and Get Code
We present a new algorithm for multi-region segmentation of 2D images with objects that may partially occlude each other. Our algorithm is based on the observation hat human performance on this task is based both on prior knowledge about plausible shapes and taking into account the presence of occluding objects whose shape is already known - once an occluded region is identified, the shape prior can be used to guess the shape of the missing part. We capture the former aspect using a deep learning model of shape; for the latter, we simultaneously minimize the energy of all regions and consider only unoccluded pixels for data agreement. Existing algorithms incorporating object shape priors consider every object separately in turn and can't distinguish genuine deviation from the expected shape from parts missing due to occlusion. We show that our method significantly improves on the performance of a representative algorithm, as evaluated on both preprocessed natural and synthetic images. Furthermore, on the synthetic images, we recover the ground truth segmentation with good accuracy.

* Camera ready version accepted at CVPR 2016
Click to Read Paper and Get Code
The existing image captioning approaches typically train a one-stage sentence decoder, which is difficult to generate rich fine-grained descriptions. On the other hand, multi-stage image caption model is hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders each of which operates on the output of the previous stage, producing increasingly refined image descriptions. Our proposed learning approach addresses the difficulty of vanishing gradients during training by providing a learning objective function that enforces intermediate supervisions. Particularly, we optimize our model with a reinforcement learning approach which utilizes the output of each intermediate decoder's test-time inference algorithm as well as the output of its preceding decoder to normalize the rewards, which simultaneously solves the well-known exposure bias problem and the loss-evaluation mismatch problem. We extensively evaluate the proposed approach on MSCOCO and show that our approach can achieve the state-of-the-art performance.

* AAAI-2018, Oral Presentation
Click to Read Paper and Get Code
Language Models based on recurrent neural networks have dominated recent image caption generation tasks. In this paper, we introduce a Language CNN model which is suitable for statistical language modeling tasks and shows competitive performance in image captioning. In contrast to previous models which predict next word based on one previous word and hidden state, our language CNN is fed with all the previous words and can model the long-range dependencies of history words, which are critical for image captioning. The effectiveness of our approach is validated on two datasets MS COCO and Flickr30K. Our extensive experimental results show that our method outperforms the vanilla recurrent neural network based language models and is competitive with the state-of-the-art methods.

* Comments: 10 pages, In proceedings of ICCV 2017
Click to Read Paper and Get Code
Scene understanding includes many related sub-tasks, such as scene categorization, depth estimation, object detection, etc. Each of these sub-tasks is often notoriously hard, and state-of-the-art classifiers already exist for many of them. These classifiers operate on the same raw image and provide correlated outputs. It is desirable to have an algorithm that can capture such correlation without requiring any changes to the inner workings of any classifier. We propose Feedback Enabled Cascaded Classification Models (FE-CCM), that jointly optimizes all the sub-tasks, while requiring only a `black-box' interface to the original classifier for each sub-task. We use a two-layer cascade of classifiers, which are repeated instantiations of the original ones, with the output of the first layer fed into the second layer as input. Our training method involves a feedback step that allows later classifiers to provide earlier classifiers information about which error modes to focus on. We show that our method significantly improves performance in all the sub-tasks in the domain of scene understanding, where we consider depth estimation, scene categorization, event categorization, object detection, geometric labeling and saliency detection. Our method also improves performance in two robotic applications: an object-grasping robot and an object-finding robot.

* 14 pages, 11 figures
Click to Read Paper and Get Code
In the last few years, deep learning has led to very good performance on a variety of problems, such as visual recognition, speech recognition and natural language processing. Among different types of deep neural networks, convolutional neural networks have been most extensively studied. Leveraging on the rapid growth in the amount of the annotated data and the great improvements in the strengths of graphics processor units, the research on convolutional neural networks has been emerged swiftly and achieved state-of-the-art results on various tasks. In this paper, we provide a broad survey of the recent advances in convolutional neural networks. We detailize the improvements of CNN on different aspects, including layer design, activation function, loss function, regularization, optimization and fast computation. Besides, we also introduce various applications of convolutional neural networks in computer vision, speech and natural language processing.

* Pattern Recognition, Elsevier
Click to Read Paper and Get Code