Research papers and code for "Jin Fang":
Precisely forecasting wind speed is essential for wind power producers and grid operators. However, this task is challenging due to the stochasticity of wind speed. To accurately predict short-term wind speed under uncertainty, this paper proposes a multi-variable stacked LSTM model (MSLSTM). The proposed method utilizes multiple historical meteorological variables, such as wind speed, temperature, humidity, pressure, dew point, and solar radiation, to accurately predict wind speed. The prediction performance is extensively assessed using real data collected in West Texas, USA. The experimental results show that the proposed MSLSTM can effectively capture and learn uncertainties while delivering competitive performance.
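
A minimal sketch of such a multi-variable stacked LSTM, assuming a PyTorch implementation; the six input variables, the 24-step history window, and the layer sizes are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch (assumptions: PyTorch, 6 meteorological input variables,
# 24-step history window; layer sizes are illustrative only).
import torch
import torch.nn as nn

class MSLSTM(nn.Module):
    def __init__(self, n_features=6, hidden_size=64, num_layers=2):
        super().__init__()
        # "Stacked" LSTM: num_layers > 1 stacks LSTM layers on top of each other.
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # predict next-step wind speed

    def forward(self, x):
        # x: (batch, time, n_features) -- wind speed, temperature, humidity, ...
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # use the last time step's hidden state

model = MSLSTM()
dummy = torch.randn(8, 24, 6)          # batch of 8 sequences, 24 steps, 6 variables
print(model(dummy).shape)              # torch.Size([8, 1])
```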

Opioid addiction is a severe public health threat in the U.S., causing massive deaths and many social problems. Accurate relapse prediction is of practical importance for recovering patients, since it enables timely interventions that help patients stay clean. In this paper, we introduce a Generative Adversarial Networks (GAN) model to predict addiction relapses based on sentiment images and social influences. Experimental results on real social media data from Reddit.com demonstrate that the GAN model delivers better performance than comparable alternative techniques. The sentiment images generated by the model show that relapse is closely connected with two emotions, `joy' and `negative'. This work is one of the first attempts to predict relapses using massive social media data and generative adversarial nets. The proposed method, combined with knowledge of social media mining, has the potential to revolutionize the practice of opioid addiction prevention and treatment.

Image feature representation plays an essential role in image recognition and related tasks. The current state-of-the-art feature learning paradigm is supervised learning from labeled data. However, this paradigm requires large-scale category labels, which limits its applicability to domains where labels are hard to obtain. In this paper, we propose a new data-driven feature learning paradigm which does not rely on category labels. Instead, we learn from user behavior data collected on social media. Concretely, we use the image relationships discovered in the latent space of the user behavior data to guide image feature learning. We collect a large-scale image and user behavior dataset from Behance.net. The dataset consists of 1.9 million images and over 300 million view records from 1.9 million users. We validate our feature learning paradigm on this dataset and find that the learned features significantly outperform state-of-the-art image features in learning image similarities. We also show that the learned features perform competitively on various recognition benchmarks.
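
One way to read the behavior-guided paradigm is: factorize the user-image view matrix to obtain a latent vector per image, then supervise the image encoder with the resulting latent-space relationships. The toy NumPy sketch below illustrates only this factorization step with made-up dimensions; it is not the paper's actual pipeline.

```python
# Toy sketch (not the paper's pipeline): factorize a user-image view matrix
# to get per-image latent vectors; latent-space similarity then serves as the
# supervision signal for image feature learning.
import numpy as np

rng = np.random.default_rng(0)
views = (rng.random((1000, 200)) > 0.95).astype(float)  # users x images, binary views

# Latent image vectors from a truncated SVD of the behavior matrix.
U, S, Vt = np.linalg.svd(views, full_matrices=False)
k = 32
image_latent = Vt[:k].T * S[:k]        # (num_images, k)

# Images co-viewed by similar users end up close in latent space; an image
# encoder would be trained so that its features reproduce these similarities
# (e.g., by regressing onto image_latent).
sim = image_latent @ image_latent.T
print(sim.shape)                       # (200, 200) latent-space image similarities
```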

Partial domain adaptation (PDA) extends standard domain adaptation to a more realistic scenario where the target domain only has a subset of the classes in the source domain. The key challenge of PDA is how to select the relevant samples in the shared classes for knowledge transfer. Previous PDA methods tackle this problem by re-weighting the source samples based on the predictions of a classifier or discriminator, thus discarding pixel-level information. In this paper, to utilize both high-level and pixel-level information, we propose a reinforced transfer network (RTNet), which is the first work to apply reinforcement learning to the PDA problem. RTNet simultaneously mitigates negative transfer by adopting a reinforced data selector to filter out outlier source classes, and promotes positive transfer by employing a domain adaptation model to minimize the distribution discrepancy in the shared label space. Extensive experiments indicate that RTNet achieves state-of-the-art performance on several partial domain adaptation benchmarks. Code and datasets will be made available online.

* Submitted to NeurIPS 2019
Automatically generating a natural language description of an image has attracted interest recently, both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which come up with words describing various aspects of an image and then combine them. In this paper, we propose a new algorithm that combines both approaches through a model of semantic attention. Our algorithm learns to selectively attend to semantic concept proposals and fuse them into the hidden states and outputs of recurrent neural networks. The selection and fusion form a feedback loop connecting the top-down and bottom-up computation. We evaluate our algorithm on two public benchmarks: Microsoft COCO and Flickr30K. Experimental results show that our algorithm significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
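
A rough PyTorch sketch of the attention-and-fusion step: score detected concept proposals against the current hidden state, pool them, and fuse the result with the word embedding fed into the LSTM cell. The dimensions and the exact fusion are illustrative assumptions, not the paper's precise formulation.

```python
# Rough sketch of semantic attention over concept proposals (PyTorch;
# dimensions and fusion are illustrative, not the paper's exact equations).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAttentionCell(nn.Module):
    def __init__(self, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.cell = nn.LSTMCell(embed_dim * 2, hidden_dim)
        self.attn = nn.Linear(hidden_dim + embed_dim, 1)

    def forward(self, word_emb, concept_embs, state):
        h, c = state
        # Score each detected concept proposal against the current hidden state.
        h_exp = h.unsqueeze(1).expand(-1, concept_embs.size(1), -1)
        scores = self.attn(torch.cat([h_exp, concept_embs], dim=-1)).squeeze(-1)
        weights = F.softmax(scores, dim=-1)                 # (batch, n_concepts)
        attended = (weights.unsqueeze(-1) * concept_embs).sum(dim=1)
        # Fuse the attended concept vector with the current word embedding.
        h, c = self.cell(torch.cat([word_emb, attended], dim=-1), (h, c))
        return (h, c), weights

cell = SemanticAttentionCell()
state = (torch.zeros(4, 512), torch.zeros(4, 512))
state, w = cell(torch.randn(4, 256), torch.randn(4, 10, 256), state)
print(w.shape)  # torch.Size([4, 10])
```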

* 10 pages, 5 figures, CVPR16
Complex design tasks often require performing diverse actions in a specific order. To (semi-)autonomously accomplish these tasks, applications need to understand and learn a wide range of design procedures, i.e., Creative Procedural-Knowledge (CPK). Prior knowledge base construction and mining have not typically addressed creative fields such as design and the arts. In this paper, we formalize an ontology of CPK using five components: goal, workflow, action, command, and usage, and extract the components' values from online design tutorials. We scraped 19.6K tutorial-related webpages and built a web application for professional designers to identify and summarize CPK components. The annotated dataset consists of 819 unique commands, 47,491 actions, and 2,022 workflows and goals. Based on this dataset, we propose a general CPK extraction pipeline and demonstrate that existing text classification and sequence-to-sequence models are limited in identifying, predicting, and summarizing complex operations described in heterogeneous styles. Through quantitative and qualitative error analysis, we discuss CPK extraction challenges that need to be addressed by future research.

A crucial and time-sensitive task when any disaster occurs is to rescue victims and distribute resources to the right groups and locations. This task is challenging in populated urban areas, due to the huge burst of help requests generated in a very short period. To improve the efficiency of the emergency response in the immediate aftermath of a disaster, we propose a heuristic multi-agent reinforcement learning scheduling algorithm, named ResQ, which can effectively schedule the rapid deployment of volunteers to rescue victims in dynamic settings. The core concept is to quickly identify victims and volunteers from social network data and then schedule rescue parties with an adaptive learning algorithm. This framework performs two key functions: 1) identify trapped victims and rescue volunteers, and 2) optimize the volunteers' rescue strategy in a complex time-sensitive environment. The proposed ResQ algorithm speeds up training through a heuristic function that reduces the state-action space by prioritizing a particular set of actions over others. Experimental results show that the proposed heuristic multi-agent reinforcement learning based scheduling outperforms several state-of-the-art methods in terms of both reward rate and response time.
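
The heuristic pruning idea can be illustrated with ordinary tabular Q-learning: restrict both action selection and the bootstrap target to a heuristically admissible subset of actions, shrinking the effective state-action space. This NumPy toy is purely a sketch of that idea, not the ResQ algorithm; the environment, heuristic, and hyperparameters are made up.

```python
# Toy Q-learning with a heuristic action mask (sketch only, not ResQ).
import numpy as np

n_states, n_actions = 100, 4
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def heuristic_actions(state):
    # Hypothetical heuristic: only two of the four actions are considered
    # useful in a given state (e.g., moves toward a known victim).
    return [state % n_actions, (state + 1) % n_actions]

def step(state, action):
    # Toy environment dynamics and reward; placeholders only.
    return rng.integers(n_states), float(action == state % n_actions)

state = 0
for _ in range(10_000):
    allowed = heuristic_actions(state)
    if rng.random() > 0.1:
        action = max(allowed, key=lambda a: Q[state, a])   # greedy within mask
    else:
        action = rng.choice(allowed)                        # explore within mask
    next_state, reward = step(state, action)
    best_next = max(Q[next_state, a] for a in heuristic_actions(next_state))
    Q[state, action] += 0.1 * (reward + 0.95 * best_next - Q[state, action])
    state = next_state
```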

Social networks can serve as a valuable communication channel for calls for help, offers of assistance, and coordinating rescue activities during disasters. Social networks such as Twitter allow users to continuously update relevant information, which is especially useful during a crisis, where rapidly changing conditions make it crucial to access accurate information promptly. Social media helps those directly affected to inform others of conditions on the ground in real time, and thus enables rescue workers to coordinate their efforts more effectively and better meet the survivors' needs. This paper presents a new sequence-to-sequence framework for forecasting people's needs during disasters using social media and weather data. It consists of two Long Short-Term Memory (LSTM) models, one of which encodes input sequences of weather information while the other acts as a conditional decoder that decodes the encoded vector and forecasts the survivors' needs. Case studies utilizing data collected during Hurricane Sandy in 2012 and Hurricanes Harvey and Irma in 2017 were analyzed, and the results were compared with those obtained using a statistical n-gram language model and an LSTM generative model. Our proposed sequence-to-sequence method forecasts people's needs more successfully than either of the other models. This new approach shows great promise for enhancing disaster management activities such as evacuation planning and commodity flow management.
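
A minimal PyTorch sketch of this two-LSTM arrangement: one LSTM encodes the weather sequence into a context, and a second LSTM decodes need tokens conditioned on that context. Feature counts, vocabulary size, and hidden width are assumptions for illustration, not the authors' settings.

```python
# Sketch of the weather-encoder / conditional-decoder setup (PyTorch;
# feature and vocabulary sizes are made up).
import torch
import torch.nn as nn

class NeedsForecaster(nn.Module):
    def __init__(self, n_weather_feats=5, vocab_size=5000, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_weather_feats, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, weather_seq, need_tokens):
        # Encode the weather sequence into a (hidden, cell) context ...
        _, context = self.encoder(weather_seq)
        # ... and condition the decoder on it to generate the "needs" tokens.
        dec_out, _ = self.decoder(self.embed(need_tokens), context)
        return self.out(dec_out)

model = NeedsForecaster()
logits = model(torch.randn(2, 48, 5), torch.randint(0, 5000, (2, 12)))
print(logits.shape)   # torch.Size([2, 12, 5000])
```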

We propose a fast feed-forward network for arbitrary style transfer, which can generate stylized images for previously unseen content and style image pairs. Besides the traditional content and style representation based on deep features and texture statistics, we use adversarial networks to regularize the generation of stylized images. Our adversarial network learns the intrinsic properties of image styles from large-scale multi-domain artistic images. The adversarial training is challenging because both the input and output of our generator are diverse multi-domain images. We use a conditional generator that stylizes content by shifting the statistics of deep features, and a conditional discriminator based on the coarse category of styles. Moreover, we propose a mask module to spatially decide the stylization level and to stabilize adversarial training by avoiding mode collapse. As a side effect, our trained discriminator can be applied to rank and select representative stylized images. We qualitatively and quantitatively evaluate the proposed method and compare it with recent style transfer methods.
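
"Shifting the statistics of deep features" can be illustrated with the well-known adaptive instance normalization operation: normalize content features, then re-scale them with the style features' channel-wise statistics. This is a sketch of that general idea only, not the paper's conditional generator or discriminator.

```python
# Sketch of stylization by shifting deep-feature statistics (AdaIN-style).
import torch

def shift_statistics(content_feat, style_feat, eps=1e-5):
    # content_feat, style_feat: (batch, channels, H, W) deep feature maps.
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)
    # Normalize the content features, then re-scale with the style statistics.
    return (content_feat - c_mean) / c_std * s_std + s_mean

stylized = shift_statistics(torch.randn(1, 512, 32, 32), torch.randn(1, 512, 32, 32))
print(stylized.shape)  # torch.Size([1, 512, 32, 32])
```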

Visual-semantic embedding models have recently been proposed and shown to be effective for image classification and zero-shot learning, by mapping images into a continuous semantic label space. Although several approaches have been proposed for single-label embedding tasks, handling images with multiple labels (a more general setting) remains an open problem, mainly due to the complex underlying correspondence between an image and its labels. In this work, we present a Multi-Instance visual-semantic Embedding model (MIE) for embedding images associated with either single or multiple labels. Our model discovers and maps semantically meaningful image subregions to their corresponding labels, and we demonstrate the superiority of our method over the state-of-the-art on two tasks: multi-label image annotation and zero-shot learning.

* 9 pages, CVPR 2016 submission
Obtaining magnetic resonance images (MRI) with high resolution and generating quantitative image-based biomarkers for assessing tissue biochemistry is crucial in clinical and research applications. However, acquiring quantitative biomarkers requires a high signal-to-noise ratio (SNR), which is at odds with high resolution in MRI, especially in a single rapid sequence. In this paper, we demonstrate how super-resolution (SR) can be utilized to maintain adequate SNR for accurate quantification of the T2 relaxation time biomarker, while simultaneously generating high-resolution images. We compare the efficacy of resolution enhancement using metrics such as peak SNR and structural similarity. We assess the accuracy of cartilage T2 relaxation times by comparing against a standard reference method. Our evaluation suggests that SR can successfully maintain high resolution and generate accurate biomarkers for accelerating MRI scans and enhancing the value of clinical and research MRI.
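
Peak SNR, one of the resolution-enhancement metrics mentioned above, is straightforward to compute; the snippet below is a plain NumPy version with synthetic stand-in images (structural similarity is typically taken from an image-processing library such as scikit-image rather than reimplemented).

```python
# PSNR between a reference and an estimate, in dB (NumPy only).
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    mse = np.mean((reference.astype(np.float64) - estimate.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((data_range ** 2) / mse)

hr = np.random.rand(256, 256)               # stand-in for a high-resolution slice
sr = hr + 0.01 * np.random.randn(256, 256)  # stand-in for a super-resolved slice
print(f"PSNR: {psnr(hr, sr):.2f} dB")
```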

* Accepted for the Machine Learning for Medical Image Reconstruction Workshop at MICCAI 2018
For text-independent short-utterance speaker recognition (SUSR), performance often degrades dramatically. This paper presents a combination approach to the SUSR task using two phonetic-aware systems: one is a DNN-based i-vector system and the other is our recently proposed subregion-based GMM-UBM system. The former employs phone posteriors to construct an i-vector model in which the shared statistics offer stronger robustness against limited test data, while the latter establishes a phone-dependent GMM-UBM system that represents speaker characteristics in more detail. A score-level fusion is implemented to integrate the respective advantages of the two systems. Experimental results show that for the text-independent SUSR task, both the DNN-based i-vector system and the subregion-based GMM-UBM system outperform their respective baselines, and the score-level system combination delivers further performance improvements.
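
Score-level fusion in its simplest weighted-sum form looks like the sketch below: normalize each system's scores, then mix them with a fusion weight. The normalization scheme and the weight value are illustrative assumptions, not the paper's calibration.

```python
# Simple weighted score-level fusion of two speaker-recognition systems.
import numpy as np

def fuse_scores(ivector_scores, gmm_ubm_scores, alpha=0.5):
    # Normalize each system's scores to zero mean / unit variance, then combine.
    a = (ivector_scores - ivector_scores.mean()) / ivector_scores.std()
    b = (gmm_ubm_scores - gmm_ubm_scores.mean()) / gmm_ubm_scores.std()
    return alpha * a + (1 - alpha) * b

fused = fuse_scores(np.random.randn(1000), np.random.randn(1000), alpha=0.6)
print(fused[:5])
```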

* APSIPA ASC 2016
We combine a generative adversarial network (GAN) with light microscopy to achieve deep-learning super-resolution over a large field of view (FOV). By appropriately adopting prior microscopy data in adversarial training, the neural network can recover a high-resolution, accurate image of a new specimen from a single low-resolution measurement. Its capability is broadly demonstrated by imaging various types of samples, such as a USAF resolution target, human pathological slides, fluorescence-labelled fibroblast cells, and deep tissues in transgenic mouse brain, with both wide-field and light-sheet microscopes. The gigapixel, multi-color reconstruction of these samples verifies a successful GAN-based single-image super-resolution procedure. We also propose an image degrading model to generate low-resolution images for training, making our approach free from complex image registration during training dataset preparation. Once a well-trained network has been created, this deep-learning-based imaging approach is capable of recovering a large-FOV (~95 mm^2), high-resolution (~1.7 {\mu}m) image at high speed (within 1 second), while not necessarily introducing any changes to the setup of existing microscopes.
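
A typical image degrading model of the kind mentioned above blurs, downsamples, and adds noise to a high-resolution image to synthesize its low-resolution training counterpart. The sketch below shows that generic recipe; the blur kernel, downsampling factor, and noise level are illustrative assumptions, not the paper's parameters.

```python
# Generic image degrading model: blur, downsample, add noise (sketch only).
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr_image, factor=4, blur_sigma=2.0, noise_std=0.01):
    blurred = gaussian_filter(hr_image, sigma=blur_sigma)             # optical blur
    lowres = blurred[::factor, ::factor]                               # downsampling
    noisy = lowres + np.random.normal(0.0, noise_std, lowres.shape)    # sensor noise
    return np.clip(noisy, 0.0, 1.0)

hr = np.random.rand(1024, 1024)
lr = degrade(hr)
print(lr.shape)  # (256, 256)
```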

* 21 pages, 9 figures and 1 table. Peng Fei and Di Jin conceived the idea and initiated the investigation. Hao Zhang, Di Jin and Peng Fei prepared the manuscript
Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, the system is required to generate a caption that has a specific style (e.g., humorous, romantic, positive, or negative) while describing the image content accurately. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements into consideration. To this end, we first devise a new variant of LSTM, named style-factual LSTM, as the building block of our model. It uses two groups of matrices to capture factual and stylized knowledge, respectively, and automatically learns the word-level weights of the two groups based on the previous context. In addition, when training the model to capture stylized elements, we propose an adaptive learning approach based on a reference factual model, which provides factual knowledge to the model as it learns from stylized caption labels and adaptively computes how much information to supply at each time step. We evaluate our model on two stylized image captioning datasets, which contain humorous/romantic captions and positive/negative captions, respectively. Experiments show that our proposed model outperforms the state-of-the-art approaches without using extra ground-truth supervision.
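
A highly simplified PyTorch sketch of the two-group idea: two sets of projection weights (factual vs. stylized) whose outputs are mixed by a learned, context-dependent word-level weight. This illustrates only the gating concept, not the paper's style-factual LSTM equations; the dimensions are made up.

```python
# Sketch of mixing a "factual" and a "stylized" weight group with a learned gate.
import torch
import torch.nn as nn

class TwoGroupProjection(nn.Module):
    def __init__(self, in_dim=512, out_dim=512):
        super().__init__()
        self.factual = nn.Linear(in_dim, out_dim)
        self.stylized = nn.Linear(in_dim, out_dim)
        self.gate = nn.Linear(in_dim, 1)  # word-level mixing weight from context

    def forward(self, context):
        g = torch.sigmoid(self.gate(context))          # (batch, 1), in [0, 1]
        return g * self.factual(context) + (1 - g) * self.stylized(context)

proj = TwoGroupProjection()
print(proj(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```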

* 17 pages, 7 figures, ECCV 2018
Context plays an important role in human language understanding, and thus it may also be useful for machines that learn vector representations of language. In this paper, we explore an asymmetric encoder-decoder structure for unsupervised context-based sentence representation learning. We carefully design experiments to show that neither an autoregressive decoder nor an RNN decoder is required. Based on these findings, we design a model that keeps an RNN as the encoder while using a non-autoregressive convolutional decoder. We further combine a suite of effective designs to significantly improve model efficiency while also achieving better performance. Our model is trained on two different large unlabelled corpora, and in both cases the transferability is evaluated on a set of downstream NLP tasks. We empirically show that our model is simple and fast while producing rich sentence representations that excel in downstream tasks.

Automatic generation of facial images has been well studied since the Generative Adversarial Network (GAN) came out. There have been some attempts to apply GAN models to the problem of generating facial images of anime characters, but none of the existing work gives promising results. In this work, we explore the training of GAN models specialized on an anime facial image dataset. We address the issue from both the data and the model aspects, by collecting a cleaner, well-suited dataset and leveraging a proper, empirical application of DRAGAN. With quantitative analysis and case studies we demonstrate that our efforts lead to a stable and high-quality model. Moreover, to assist people with anime character design, we build a website (http://make.girls.moe) with our pre-trained model available online, which makes the model easily accessible to the general public.
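
For reference, the DRAGAN regularizer mentioned above is a gradient penalty applied around perturbed real samples. The PyTorch sketch below follows the standard DRAGAN recipe; the penalty weight and perturbation scale are the commonly used defaults, not necessarily the values used in this work, and `discriminator` stands in for any network mapping images to a scalar score.

```python
# Standard DRAGAN gradient penalty (sketch; hyperparameters are common defaults).
import torch

def dragan_penalty(discriminator, real, lambda_gp=10.0):
    # Perturb real samples within a local neighborhood scaled by their std.
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    perturbed = real + 0.5 * real.std() * torch.rand_like(real)
    interpolated = (alpha * real + (1 - alpha) * perturbed).requires_grad_(True)
    scores = discriminator(interpolated)
    grads = torch.autograd.grad(scores.sum(), interpolated, create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    # Penalize deviation of the gradient norm from 1 around the real data manifold.
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```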

* 16 pages, 15 figures. This paper is presented as a Doujinshi in Comiket 92, summer 2017, with the booth number 05a, East-U, Third Day
The skip-thought model has been proven to be effective at learning sentence representations and capturing sentence semantics. In this paper, we propose a suite of techniques to trim and improve it. First, we validate the hypothesis that, given a current sentence, inferring the previous sentence and inferring the next sentence provide similar supervision power; therefore, only one decoder for predicting the next sentence is preserved in our trimmed skip-thought model. Second, we present a connection layer between the encoder and decoder to help the model generalize better on semantic relatedness tasks. Third, we find that good word embedding initialization is also essential for learning better sentence representations. We train our model unsupervised on a large corpus with contiguous sentences, and then evaluate the trained model on 7 supervised tasks, which include semantic relatedness, paraphrase detection, and text classification benchmarks. We empirically show that our proposed model is a faster, lighter-weight, and equally powerful alternative to the original skip-thought model.
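
The trimmed setup described above can be sketched as a single encoder, a connection layer, and one decoder that predicts only the next sentence. The PyTorch sketch below uses GRUs and made-up sizes as assumptions; it is an illustration of the structure, not the authors' implementation.

```python
# Sketch: one encoder, a connection layer, one next-sentence decoder.
import torch
import torch.nn as nn

class TrimmedSkipThought(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=300, hidden=600):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden, batch_first=True)
        self.connect = nn.Linear(hidden, hidden)    # connection layer
        self.decoder = nn.GRU(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, sent, next_sent):
        _, h = self.encoder(self.embed(sent))
        h = torch.tanh(self.connect(h))              # connect encoder to decoder
        dec_out, _ = self.decoder(self.embed(next_sent), h)
        return self.out(dec_out)                     # logits over next-sentence words

model = TrimmedSkipThought()
logits = model(torch.randint(0, 20000, (2, 20)), torch.randint(0, 20000, (2, 20)))
print(logits.shape)   # torch.Size([2, 20, 20000])
```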

We study the skip-thought model with neighborhood information as weak supervision. More specifically, we propose a skip-thought neighbor model that treats the adjacent sentences as a neighborhood. We train our skip-thought neighbor model on a large corpus with contiguous sentences, and then evaluate the trained model on 7 tasks, which include semantic relatedness, paraphrase detection, and classification benchmarks. Both quantitative comparison and qualitative investigation are conducted. We empirically show that our skip-thought neighbor model performs as well as the skip-thought model on the evaluation tasks. In addition, we found that incorporating an autoencoder path in our model did not help it perform better, while it hurt the performance of the skip-thought model.

This paper focuses on robotic picking tasks in cluttered scenarios. Because of the diversity of object poses, stacking configurations, and complicated backgrounds in bin-picking situations, it is difficult to recognize objects and estimate their poses before grasping them. Instead, this paper combines a ResNet with a U-Net structure, a special framework of convolutional neural networks (CNNs), to predict the picking region without recognition or pose estimation, enabling the robotic picking system to learn picking skills from scratch. We train the network end-to-end with online samples. Finally, several experiments are conducted to demonstrate the performance of our method.
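
A toy PyTorch sketch of combining residual blocks with a U-Net-style skip connection to predict a per-pixel picking-region map; the depth, channel counts, and single skip connection are illustrative assumptions, not the paper's network.

```python
# Toy ResNet-block + U-Net-style network predicting a picking-region map.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.body(x))    # residual connection

class PickingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, padding=1)
        self.enc = nn.Sequential(ResBlock(32), nn.MaxPool2d(2), ResBlock(32))
        self.up = nn.ConvTranspose2d(32, 32, 2, stride=2)
        self.head = nn.Conv2d(64, 1, 1)         # 64 = upsampled + skip channels

    def forward(self, x):
        s = self.stem(x)
        e = self.enc(s)
        u = self.up(e)
        # U-Net-style skip connection: concatenate encoder features before the head.
        return torch.sigmoid(self.head(torch.cat([u, s], dim=1)))

net = PickingNet()
print(net(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 1, 64, 64])
```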

* 6 pages, 7 figures, conference