Models, code, and papers for "Jiebo Luo":
Existing image-text matching approaches typically infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image. However, they ignore the connections between the objects that are semantically related. These objects may collectively determine whether the image corresponds to a text or not. To address this problem, we propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically by recurrent neural networks (RNN). In particular, given an input image-text pair, our model reorders the image objects based on the positions of their most related words in the text. In the same way as extracting the hidden features from word embeddings, the model leverages RNN to extract high-level object features from the reordered object inputs. We validate that the high-level object features contain useful joint information of semantically related objects, which benefit the retrieval task. To compute the image-text similarity, we incorporate a Multi-attention Cross Matching Model into DP-RNN. It aggregates the affinity between objects and words with cross-modality guided attention and self-attention. Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset. Extensive experiments demonstrate the effectiveness of our model.
Skin lesion identification is a key step toward dermatological diagnosis. When describing a skin lesion, it is very important to note its body site distribution as many skin diseases commonly affect particular parts of the body. To exploit the correlation between skin lesions and their body site distributions, in this study, we investigate the possibility of improving skin lesion classification using the additional context information provided by body location. Specifically, we build a deep multi-task learning (MTL) framework to jointly optimize skin lesion classification and body location classification (the latter is used as an inductive bias). Our MTL framework uses the state-of-the-art ImageNet pretrained model with specialized loss functions for the two related tasks. Our experiments show that the proposed MTL based method performs more robustly than its standalone (single-task) counterpart.
In this paper, we focus on studying the appearing time of different kinds of cars on the road. This information will enable us to infer the life style of the car owners. The results can further be used to guide marketing towards car owners. Conventionally, this kind of study is carried out by sending out questionnaires, which is limited in scale and diversity. To solve this problem, we propose a fully automatic method to carry out this study. Our study is based on publicly available surveillance camera data. To make the results reliable, we only use the high resolution cameras (i.e. resolution greater than $1280 \times 720$). Images from the public cameras are downloaded every minute. After obtaining 50,000 images, we apply faster R-CNN (region-based convoluntional neural network) to detect the cars in the downloaded images and a fine-tuned VGG16 model is used to recognize the car makes. Based on the recognition results, we present a data-driven analysis on the relationship between car makes and their appearing times, with implications on lifestyles.
Small data challenges have emerged in many learning problems, since the success of deep neural networks often relies on the availability of a huge amount of labeled data that is expensive to collect. To address it, many efforts have been made on training complex models with small data in an unsupervised and semi-supervised fashion. In this paper, we will review the recent progresses on these two major categories of methods. A wide spectrum of small data models will be categorized in a big picture, where we will show how they interplay with each other to motivate explorations of new ideas. We will review the criteria of learning the transformation equivariant, disentangled, self-supervised and semi-supervised representations, which underpin the foundations of recent developments. Many instantiations of unsupervised and semi-supervised generative models have been developed on the basis of these criteria, greatly expanding the territory of existing autoencoders, generative adversarial nets (GANs) and other deep networks by exploring the distribution of unlabeled data for more powerful representations. While we focus on the unsupervised and semi-supervised methods, we will also provide a broader review of other emerging topics, from unsupervised and semi-supervised domain adaptation to the fundamental roles of transformation equivariance and invariance in training a wide spectrum of deep networks. It is impossible for us to write an exclusive encyclopedia to include all related works. Instead, we aim at exploring the main ideas, principles and methods in this area to reveal where we are heading on the journey towards addressing the small data challenges in this big data era.
Automatic image captioning has recently approached human-level performance due to the latest advances in computer vision and natural language understanding. However, most of the current models can only generate plain factual descriptions about the content of a given image. However, for human beings, image caption writing is quite flexible and diverse, where additional language dimensions, such as emotion, humor and language styles, are often incorporated to produce diverse, emotional, or appealing captions. In particular, we are interested in generating sentiment-conveying image descriptions, which has received little attention. The main challenge is how to effectively inject sentiments into the generated captions without altering the semantic matching between the visual content and the generated descriptions. In this work, we propose two different models, which employ different schemes for injecting sentiments into image captions. Compared with the few existing approaches, the proposed models are much simpler and yet more effective. The experimental results show that our model outperform the state-of-the-art models in generating sentimental (i.e., sentiment-bearing) image captions. In addition, we can also easily manipulate the model by assigning different sentiments to the testing image to generate captions with the corresponding sentiments.
With the prevalence of e-commence websites and the ease of online shopping, consumers are embracing huge amounts of various options in products. Undeniably, shopping is one of the most essential activities in our society and studying consumer's shopping behavior is important for the industry as well as sociology and psychology. Indisputable, one of the most popular e-commerce categories is clothing business. There arises the needs for analysis of popular and attractive clothing features which could further boost many emerging applications, such as clothing recommendation and advertising. In this work, we design a novel system that consists of three major components: 1) exploring and organizing a large-scale clothing dataset from a online shopping website, 2) pruning and extracting images of best-selling products in clothing item data and user transaction history, and 3) utilizing a machine learning based approach to discovering fine-grained clothing attributes as the representative and discriminative characteristics of popular clothing style elements. Through the experiments over a large-scale online clothing shopping dataset, we demonstrate the effectiveness of our proposed system, and obtain useful insights on clothing consumption trends and profitable clothing features.
Deep neural networks have been successfully applied to solving the video-based person re-identification problem with impressive results reported. The existing networks for person re-id are designed to extract discriminative features that preserve the identity information. Usually, whole video frames are fed into the neural networks and all the regions in a frame are equally treated. This may be a suboptimal choice because many regions, e.g., background regions in the video, are not related to the person. Furthermore, the person of interest may be occluded by another person or something else. These unrelated regions may hinder person re-identification. In this paper, we introduce a novel gating mechanism to deep neural networks. Our gating mechanism will learn which regions are helpful for person re-identification and let these regions pass the gate. The unrelated background regions or occluding regions are filtered out by the gate. In each frame, the color channels and optical flow channels provide quite different information. To better leverage such information, we generate one gate using the color channels and another gate using the optical flow channels. These two gates are combined to provide a more reliable gate with a novel fusion method. Experimental results on two major datasets demonstrate the performance improvements due to the proposed gating mechanism.
Existing image-text matching approaches typically leverage triplet loss with online hard negatives to train the model. For each image or text anchor in a training mini-batch, the model is trained to distinguish between a positive and the most confusing negative of the anchor mined from the mini-batch (i.e. online hard negative). This strategy improves the model's capacity to discover fine-grained correspondences and non-correspondences between image and text inputs. However, the above training approach has the following drawbacks: (1) the negative selection strategy still provides limited chances for the model to learn from very hard-to-distinguish cases. (2) The trained model has weak generalization capability from the training set to the testing set. (3) The penalty lacks hierarchy and adaptiveness for hard negatives with different ``hardness'' degrees. In this paper, we propose solutions by sampling negatives offline from the whole training set. It provides ``harder'' offline negatives than online hard negatives for the model to distinguish. Based on the offline hard negatives, a quintuplet loss is proposed to improve the model's generalization capability to distinguish positives and negatives. In addition, a novel loss function that combines the knowledge of positives, offline hard negatives and online hard negatives is created. It leverages offline hard negatives as intermediary to adaptively penalize them based on their distance relations to the anchor. We evaluate the proposed training approach on three state-of-the-art image-text models on the MS-COCO and Flickr30K datasets. Significant performance improvements are observed for all the models, demonstrating the effectiveness and generality of the proposed approach.
YouTube, a world-famous video sharing website, maintains a list of the top trending videos on the platform. Due to its huge amount of users, it enables researchers to understand people's preference by analyzing the trending videos. Trending videos vary from country to country. By analyzing such differences and changes, we can tell how users' preferences differ over locations. Previous work focuses on analyzing such culture preferences from videos' metadata, while the culture information hidden within the visual content has not been discovered. In this study, we explore culture preferences among countries using the thumbnails of YouTube trending videos. We first process the thumbnail images of the videos using object detectors. The collected object information is then used for various statistical analysis. In particular, we examine the data from three perspectives: geographical locations, video genres and users' reactions. Experimental results indicate that the users from similar cultures shares interests in watching similar videos on YouTube. Our study demonstrates that discovering the culture preference through the thumbnails can be an effective mechanism for video social media analysis.
We address the problem of video moment localization with natural language, i.e. localizing a video segment described by a natural language sentence. While most prior work focuses on grounding the query as a whole, temporal dependencies and reasoning between events within the text are not fully considered. In this paper, we propose a novel Temporal Compositional Modular Network (TCMN) where a tree attention network first automatically decomposes a sentence into three descriptions with respect to the main event, context event and temporal signal. Two modules are then utilized to measure the visual similarity and location similarity between each segment and the decomposed descriptions. Moreover, since the main event and context event may rely on different modalities (RGB or optical flow), we use late fusion to form an ensemble of four models, where each model is independently trained by one combination of the visual input. Experiments show that our model outperforms the state-of-the-art methods on the TEMPO dataset.
As an intuitive way of expression emotion, the animated Graphical Interchange Format (GIF) images have been widely used on social media. Most previous studies on automated GIF emotion recognition fail to effectively utilize GIF's unique properties, and this potentially limits the recognition performance. In this study, we demonstrate the importance of human related information in GIFs and conduct human-centered GIF emotion recognition with a proposed Keypoint Attended Visual Attention Network (KAVAN). The framework consists of a facial attention module and a hierarchical segment temporal module. The facial attention module exploits the strong relationship between GIF contents and human characters, and extracts frame-level visual feature with a focus on human faces. The Hierarchical Segment LSTM (HS-LSTM) module is then proposed to better learn global GIF representations. Our proposed framework outperforms the state-of-the-art on the MIT GIFGIF dataset. Furthermore, the facial attention module provides reliable facial region mask predictions, which improves the model's interpretability.
In this study, we investigate what a practically useful approach is in order to achieve robust skin disease diagnosis. A direct approach is to target the ground truth diagnosis labels, while an alternative approach instead focuses on determining skin lesion characteristics that are more visually consistent and discernible. We argue that, for computer-aided skin disease diagnosis, it is both more realistic and more useful that lesion type tags should be considered as the target of an automated diagnosis system such that the system can first achieve a high accuracy in describing skin lesions, and in turn facilitate disease diagnosis using lesion characteristics in conjunction with other evidence. To further meet such an objective, we employ convolutional neural networks (CNNs) for both the disease-targeted and lesion-targeted classifications. We have collected a large-scale and diverse dataset of 75,665 skin disease images from six publicly available dermatology atlantes. Then we train and compare both disease-targeted and lesion-targeted classifiers, respectively. For disease-targeted classification, only 27.6% top-1 accuracy and 57.9% top-5 accuracy are achieved with a mean average precision (mAP) of 0.42. In contrast, for lesion-targeted classification, we can achieve a much higher mAP of 0.70.
Automatic vertebrae identification and localization from arbitrary CT images is challenging. Vertebrae usually share similar morphological appearance. Because of pathology and the arbitrary field-of-view of CT scans, one can hardly rely on the existence of some anchor vertebrae or parametric methods to model the appearance and shape. To solve the problem, we argue that one should make use of the short-range contextual information, such as the presence of some nearby organs (if any), to roughly estimate the target vertebrae; due to the unique anatomic structure of the spine column, vertebrae have fixed sequential order which provides the important long-range contextual information to further calibrate the results. We propose a robust and efficient vertebrae identification and localization system that can inherently learn to incorporate both the short-range and long-range contextual information in a supervised manner. To this end, we develop a multi-task 3D fully convolutional neural network (3D FCN) to effectively extract the short-range contextual information around the target vertebrae. For the long-range contextual information, we propose a multi-task bidirectional recurrent neural network (Bi-RNN) to encode the spatial and contextual information among the vertebrae of the visible spine column. We demonstrate the effectiveness of the proposed approach on a challenging dataset and the experimental results show that our approach outperforms the state-of-the-art methods by a significant margin.
Learning to rank has recently emerged as an attractive technique to train deep convolutional neural networks for various computer vision tasks. Pairwise ranking, in particular, has been successful in multi-label image classification, achieving state-of-the-art results on various benchmarks. However, most existing approaches use the hinge loss to train their models, which is non-smooth and thus is difficult to optimize especially with deep networks. Furthermore, they employ simple heuristics, such as top-k or thresholding, to determine which labels to include in the output from a ranked list of labels, which limits their use in the real-world setting. In this work, we propose two techniques to improve pairwise ranking based multi-label image classification: (1) we propose a novel loss function for pairwise ranking, which is smooth everywhere and thus is easier to optimize; and (2) we incorporate a label decision module into the model, estimating the optimal confidence thresholds for each visual concept. We provide theoretical analyses of our loss function in the Bayes consistency and risk minimization framework, and show its benefit over existing pairwise ranking formulations. We demonstrate the effectiveness of our approach on three large-scale datasets, VOC2007, NUS-WIDE and MS-COCO, achieving the best reported results in the literature.
Sentiment analysis is crucial for extracting social signals from social media content. Due to the prevalence of images in social media, image sentiment analysis is receiving increasing attention in recent years. However, most existing systems are black-boxes that do not provide insight on how image content invokes sentiment and emotion in the viewers. Psychological studies have confirmed that salient objects in an image often invoke emotions. In this work, we investigate more fine-grained and more comprehensive interaction between visual saliency and visual sentiment. In particular, we partition images in several primary scene-type dimensions, including: open-closed, natural-manmade, indoor-outdoor, and face-noface. Using state of the art saliency detection algorithm and sentiment classification algorithm, we examine how the sentiment of the salient region(s) in an image relates to the overall sentiment of the image. The experiments on a representative image emotion dataset have shown interesting correlation between saliency and sentiment in different scene types and in turn shed light on the mechanism of visual sentiment evocation.
With the rapid development of economy in China over the past decade, air pollution has become an increasingly serious problem in major cities and caused grave public health concerns in China. Recently, a number of studies have dealt with air quality and air pollution. Among them, some attempt to predict and monitor the air quality from different sources of information, ranging from deployed physical sensors to social media. These methods are either too expensive or unreliable, prompting us to search for a novel and effective way to sense the air quality. In this study, we propose to employ the state of the art in computer vision techniques to analyze photos that can be easily acquired from online social media. Next, we establish the correlation between the haze level computed directly from photos with the official PM 2.5 record of the taken city at the taken time. Our experiments based on both synthetic and real photos have shown the promise of this image-based approach to estimating and monitoring air pollution.
The key challenge in photorealistic style transfer is that an algorithm should faithfully transfer the style of a reference photo to a content photo while the generated image should look like one captured by a camera. Although several photorealistic style transfer algorithms have been proposed, they need to rely on post- and/or pre-processing to make the generated images look photorealistic. If we disable the additional processing, these algorithms would fail to produce plausible photorealistic stylization in terms of detail preservation and photorealism. In this work, we propose an effective solution to these issues. Our method consists of a construction step (C-step) to build a photorealistic stylization network and a pruning step (P-step) for acceleration. In the C-step, we propose a dense auto-encoder named PhotoNet based on a carefully designed pre-analysis. PhotoNet integrates a feature aggregation module (BFA) and instance normalized skip links (INSL). To generate faithful stylization, we introduce multiple style transfer modules in the decoder and INSLs. PhotoNet significantly outperforms existing algorithms in terms of both efficiency and effectiveness. In the P-step, we adopt a neural architecture search method to accelerate PhotoNet. We propose an automatic network pruning framework in the manner of teacher-student learning for photorealistic stylization. The network architecture named PhotoNAS resulted from the search achieves significant acceleration over PhotoNet while keeping the stylization effects almost intact. We conduct extensive experiments on both image and video transfer. The results show that our method can produce favorable results while achieving 20-30 times acceleration in comparison with the existing state-of-the-art approaches. It is worth noting that the proposed algorithm accomplishes better performance without any pre- or post-processing.
Generating radiology reports is time-consuming and requires extensive expertise in practice. Therefore, reliable automatic radiology report generation is highly desired to alleviate the workload. Although deep learning techniques have been successfully applied to image classification and image captioning tasks, radiology report generation remains challenging in regards to understanding and linking complicated medical visual contents with accurate natural language descriptions. In addition, the data scales of open-access datasets that contain paired medical images and reports remain very limited. To cope with these practical challenges, we propose a generative encoder-decoder model and focus on chest x-ray images and reports with the following improvements. First, we pretrain the encoder with a large number of chest x-ray images to accurately recognize 14 common radiographic observations, while taking advantage of the multi-view images by enforcing the cross-view consistency. Second, we synthesize multi-view visual features based on a sentence-level attention mechanism in a late fusion fashion. In addition, in order to enrich the decoder with descriptive semantics and enforce the correctness of the deterministic medical-related contents such as mentions of organs or diagnoses, we extract medical concepts based on the radiology reports in the training data and fine-tune the encoder to extract the most frequent medical concepts from the x-ray images. Such concepts are fused with each decoding step by a word-level attention model. The experimental results conducted on the Indiana University Chest X-Ray dataset demonstrate that the proposed model achieves the state-of-the-art performance compared with other baseline approaches.
Sentiment analysis on large-scale social media data is important to bridge the gaps between social media contents and real world activities including political election prediction, individual and public emotional status monitoring and analysis, and so on. Although textual sentiment analysis has been well studied based on platforms such as Twitter and Instagram, analysis of the role of extensive emoji uses in sentiment analysis remains light. In this paper, we propose a novel scheme for Twitter sentiment analysis with extra attention on emojis. We first learn bi-sense emoji embeddings under positive and negative sentimental tweets individually, and then train a sentiment classifier by attending on these bi-sense emoji embeddings with an attention-based long short-term memory network (LSTM). Our experiments show that the bi-sense embedding is effective for extracting sentiment-aware embeddings of emojis and outperforms the state-of-the-art models. We also visualize the attentions to show that the bi-sense emoji embedding provides better guidance on the attention mechanism to obtain a more robust understanding of the semantics and sentiments.