Research papers and code for "Yi Zheng":
Matching images and sentences demands a fine understanding of both modalities. In this paper, we propose a new system to discriminatively embed images and text into a shared visual-textual space. Most existing works in this field apply a ranking loss to pull positive image/text pairs close and push negative pairs apart. However, directly deploying the ranking loss makes network learning hard, since it must build the inter-modal relationship starting from two heterogeneous features. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on the unsupervised assumption that each image/text group can be viewed as a class, so the network can learn fine-grained differences between image/text groups. Experiments show that the instance loss offers a better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. Besides, existing works usually apply off-the-shelf features, e.g., word2vec and fixed visual features. As a minor contribution, this paper therefore constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to learn directly from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
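
A minimal sketch of the instance-loss idea in a PyTorch-style setup: every image/text group in the training set is treated as its own class, and a classifier shared by the two branches supervises both embeddings. The layer sizes and names (`DualPathEmbedding`, `instance_classifier`) are illustrative assumptions, not the authors' released code.

```python
import torch.nn as nn

class DualPathEmbedding(nn.Module):
    def __init__(self, num_groups, img_dim=2048, txt_dim=300, emb_dim=512):
        super().__init__()
        self.image_fc = nn.Linear(img_dim, emb_dim)   # on top of the image CNN
        self.text_fc = nn.Linear(txt_dim, emb_dim)    # on top of the text CNN
        # a single classifier shared by both modalities: every image/text
        # group in the training set is one "instance class"
        self.instance_classifier = nn.Linear(emb_dim, num_groups)

    def forward(self, img_feat, txt_feat):
        img_emb = self.image_fc(img_feat)
        txt_emb = self.text_fc(txt_feat)
        return self.instance_classifier(img_emb), self.instance_classifier(txt_emb)

ce = nn.CrossEntropyLoss()

def instance_loss(img_logits, txt_logits, group_ids):
    # both modalities must predict the id of the group they belong to
    return ce(img_logits, group_ids) + ce(txt_logits, group_ids)
```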

* 15 pages, 15 figures, 8 tables
In spite of the growing computational power offered by commodity hardware, fast pump scheduling of complex water distribution systems is still a challenge. In this paper, the Artificial Neural Network (ANN) meta-modeling technique is employed with a Genetic Algorithm (GA) to simultaneously optimize the pump operation and the tank levels at the end of the cycle. The generalized GA+ANN algorithm has been tested on a real system in the UK. Compared to the existing operation, the daily cost is reduced by about 10-15%, while the number of pump switches is kept below 4 per day. In addition, tank levels are optimized to ensure periodic behavior, which results in predictable and stable performance over repeated cycles.
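
A hedged sketch of what such a GA+ANN loop can look like, not the paper's exact formulation: an ANN trained offline on simulator runs replaces the slow hydraulic simulator inside the GA fitness evaluation, and the fitness penalizes deviation from a target end-of-cycle tank level and more than 4 switches per day. The surrogate targets, cost model and penalty weights below are dummy placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# surrogate training set: 24-h pump schedules (one on/off bit per hour)
# mapped to [daily energy cost, end-of-cycle tank level]; the targets
# below are dummy stand-ins for offline hydraulic-simulator runs
X_sim = rng.integers(0, 2, size=(500, 24)).astype(float)
y_sim = np.c_[X_sim.sum(1) * 1.7, 2.0 + 0.1 * X_sim.sum(1)]
surrogate = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X_sim, y_sim)

def fitness(pop, target_level=4.0):
    pred = surrogate.predict(pop)                       # ANN replaces the simulator
    cost, level = pred[:, 0], pred[:, 1]
    switches = np.abs(np.diff(pop, axis=1)).sum(1)      # pump on/off changes
    return (cost
            + 50.0 * (level - target_level) ** 2        # periodic tank levels
            + 10.0 * np.maximum(switches - 4, 0.0))     # <= 4 switches/day

pop = rng.integers(0, 2, size=(60, 24)).astype(float)
for _ in range(100):                                    # plain generational GA
    parents = pop[np.argsort(fitness(pop))[:30]]        # keep the best half
    mates = rng.permutation(parents)
    cut = int(rng.integers(1, 24))
    kids = np.c_[parents[:, :cut], mates[:, cut:]]      # one-point crossover
    flip = rng.random(kids.shape) < 0.02                # bit-flip mutation
    kids = np.where(flip, 1.0 - kids, kids)
    pop = np.r_[parents, kids]

best = pop[np.argmin(fitness(pop))]
```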

* Procedia Engineering, 70, pp.103-112, 2014
* 12th International Conference on Computing and Control for the Water Industry (CCWI'2013)
The main contribution of this paper is a simple semi-supervised pipeline that only uses the original training set without collecting extra data. Two questions make this challenging: 1) how to obtain more training data from the training set alone, and 2) how to use the newly generated data. In this work, the generative adversarial network (GAN) is used to generate unlabeled samples. We propose label smoothing regularization for outliers (LSRO). This method assigns a uniform label distribution to the unlabeled images, which regularizes the supervised model and improves the baseline. We verify the proposed method on a practical problem: person re-identification (re-ID). This task aims to retrieve a query person from other cameras. We adopt the deep convolutional generative adversarial network (DCGAN) for sample generation, and a baseline convolutional neural network (CNN) for representation learning. Experiments show that adding the GAN-generated data effectively improves the discriminative ability of learned CNN embeddings. On three large-scale datasets, Market-1501, CUHK03 and DukeMTMC-reID, we obtain +4.37%, +1.6% and +2.46% improvements in rank-1 precision over the baseline CNN, respectively. We additionally apply the proposed method to fine-grained bird recognition and achieve a +0.6% improvement over a strong baseline. The code is available at https://github.com/layumi/Person-reID_GAN.
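
A minimal sketch of the LSRO target, assuming generated samples are flagged in the batch (variable names are illustrative): real images keep their one-hot identity label, while GAN-generated images receive the uniform distribution 1/K over the K training identities.

```python
import torch
import torch.nn.functional as F

def lsro_loss(logits, labels, is_generated, num_classes):
    """labels: identity ids for real images (a -1 placeholder is fine for
    generated ones); is_generated: bool mask marking GAN samples."""
    log_prob = F.log_softmax(logits, dim=1)
    one_hot = F.one_hot(labels.clamp(min=0), num_classes).float()
    uniform = torch.full_like(one_hot, 1.0 / num_classes)  # LSRO target
    target = torch.where(is_generated.unsqueeze(1), uniform, one_hot)
    return -(target * log_prob).sum(dim=1).mean()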

* 9 pages, 6 figures, accepted by ICCV 2017
Person re-identification (person re-ID) is mostly viewed as an image retrieval problem. This task aims to search a query person in a large image pool. In practice, person re-ID usually adopts automatic detectors to obtain cropped pedestrian images. However, this process suffers from two types of detector errors: excessive background and part missing. Both errors deteriorate the quality of pedestrian alignment and may compromise pedestrian matching due to position and scale variance. To address the misalignment problem, we propose that alignment can be learned from an identification procedure. We introduce the pedestrian alignment network (PAN) which allows discriminative embedding learning and pedestrian alignment without extra annotations. Our key observation is that when the convolutional neural network (CNN) learns to discriminate between different identities, the learned feature maps usually exhibit strong activations on the human body rather than the background. The proposed network thus takes advantage of this attention mechanism to adaptively locate and align pedestrians within a bounding box. Visual examples show that pedestrians are better aligned with PAN. Experiments on three large-scale re-ID datasets confirm that PAN improves the discriminative ability of the feature embeddings and yields competitive accuracy with the state-of-the-art methods.
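
A rough sketch of the attention intuition only: the paper learns an affine alignment end-to-end, whereas the hard threshold and re-crop below are a simplification with assumed names, meant to show how body-focused activations imply a tighter box.

```python
import torch
import torch.nn.functional as F

def attention_box(feat, thresh=0.5):
    """feat: (C, H, W) conv activations; returns a box around strong activations."""
    amap = feat.abs().sum(0)                                   # spatial attention map
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-6)
    ys, xs = torch.nonzero(amap > thresh, as_tuple=True)
    if ys.numel() == 0:                                        # nothing above threshold
        return 0, feat.shape[1] - 1, 0, feat.shape[2] - 1
    return ys.min().item(), ys.max().item(), xs.min().item(), xs.max().item()

def realign(image, feat, out_size=(256, 128)):
    """image: (3, H_img, W_img); feat: (C, H, W) computed from that image."""
    y0, y1, x0, x1 = attention_box(feat)
    sy = image.shape[1] / feat.shape[1]                        # feature-to-image scale
    sx = image.shape[2] / feat.shape[2]
    crop = image[:, int(y0 * sy):int((y1 + 1) * sy),
                    int(x0 * sx):int((x1 + 1) * sx)]
    return F.interpolate(crop.unsqueeze(0), size=out_size,
                         mode='bilinear', align_corners=False)[0]
```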

We revisit two popular convolutional neural network (CNN) models in person re-identification (re-ID), i.e., verification and classification models. The two models have their respective advantages and limitations due to their different loss functions. In this paper, we shed light on how to combine the two models to learn more discriminative pedestrian descriptors. Specifically, we propose a new siamese network that simultaneously computes the identification loss and the verification loss. Given a pair of training images, the network predicts the identities of the two images and whether they belong to the same identity. Our network learns a discriminative embedding and a similarity measurement at the same time, thus making full use of the annotations. Albeit simple, the learned embedding improves the state-of-the-art performance on two public person re-ID benchmarks. Further, we show that our architecture can also be applied to image retrieval.
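
A sketch of the joint objective under assumed shapes (`backbone`, `feat_dim` and the 2-way verification head are placeholders): one identification cross-entropy per image, plus a verification cross-entropy on the squared difference of the two embeddings, echoing the square layer described in the paper.

```python
import torch.nn as nn

class IDVerifNet(nn.Module):
    def __init__(self, backbone, feat_dim, num_ids):
        super().__init__()
        self.backbone = backbone                        # shared CNN branch
        self.id_head = nn.Linear(feat_dim, num_ids)     # identification
        self.verif_head = nn.Linear(feat_dim, 2)        # same / different

    def forward(self, img_a, img_b):
        fa, fb = self.backbone(img_a), self.backbone(img_b)
        verif_logits = self.verif_head((fa - fb) ** 2)  # square layer
        return self.id_head(fa), self.id_head(fb), verif_logits

ce = nn.CrossEntropyLoss()

def joint_loss(id_a, id_b, verif_logits, label_a, label_b):
    same = (label_a == label_b).long()                  # 1 = same identity
    return ce(id_a, label_a) + ce(id_b, label_b) + ce(verif_logits, same)
```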

State-of-the-art text spotting systems typically aim to detect isolated words or word-by-word text in images of natural scenes and ignore the semantic coherence within a region of text. However, when interpreted together, seemingly isolated words may be easier to recognize. On this basis, we propose a novel "semantic-based text recognition" (STR) deep learning model that reads text in images with the help of understanding context. STR consists of several modules. We introduce the Text Grouping and Arranging (TGA) algorithm to connect and order isolated text regions. A text-recognition network interprets isolated words. Benefiting from semantic information, a sequence-to-sequence network model efficiently corrects inaccurate and uncertain phrases produced earlier in the STR pipeline. We present experiments on two new distinct datasets that contain scanned catalog images of interior designs and photographs of protesters with hand-written signs, respectively. Our results show that our STR model outperforms a baseline method that uses state-of-the-art single-word recognition techniques on both datasets. STR yields a high accuracy rate of 90% on the catalog images and 71% on the more difficult protest images, suggesting its generality in recognizing text.
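
The abstract does not spell TGA out, so the following is only a toy illustration of a grouping-and-arranging step: cluster word boxes into lines by vertical overlap, then order each line left to right.

```python
def group_and_arrange(boxes):
    """boxes: list of (x0, y0, x1, y1) word boxes; returns ordered lines."""
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):          # top to bottom
        for line in lines:
            x0, y0, x1, y1 = line[-1]
            overlap = min(y1, box[3]) - max(y0, box[1])    # vertical overlap
            if overlap > 0.5 * min(y1 - y0, box[3] - box[1]):
                line.append(box)                           # same text line
                break
        else:
            lines.append([box])                            # start a new line
    return [sorted(line, key=lambda b: b[0]) for line in lines]

print(group_and_arrange([(50, 0, 90, 10), (0, 1, 40, 11), (0, 20, 30, 30)]))
```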

In acoustic scene classification research, an audio segment is usually split into multiple samples, and majority voting is then used to ensemble the per-sample results. In this paper, we propose a punishment voting algorithm based on a super-category construction method for acoustic scene classification. Specifically, we propose a DenseNet-like model as the base classifier. The base classifier is trained on CQT spectrograms generated from the raw audio segments. Taking advantage of the results of the base classifier, we construct super categories using spectral clustering, and further train super classifiers corresponding to the constructed super categories. Finally, the super classifiers are utilized to enhance the majority voting of the base classifier through punishment voting. Experiments show that punishment voting clearly improves performance on both the DCASE2017 Development dataset and the LITIS Rouen dataset.
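
A hedged sketch of the voting scheme; the construction details and the penalty weight are assumptions. Super categories are found by spectral clustering on the base classifier's symmetrized confusion matrix, and the super classifier's prediction down-weights base votes that fall outside its super category.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def build_super_categories(confusion, n_super):
    # symmetrize the confusion matrix into an affinity between classes:
    # frequently confused classes end up in the same super category
    affinity = confusion + confusion.T
    return SpectralClustering(n_clusters=n_super,
                              affinity='precomputed').fit_predict(affinity)

def punishment_vote(sample_preds, super_pred, cls_to_super, penalty=0.5):
    """sample_preds: base-class votes for one segment's samples;
    super_pred: the super classifier's category for the segment."""
    votes = {}
    for c in sample_preds:
        w = 1.0 if cls_to_super[c] == super_pred else penalty  # punish outsiders
        votes[c] = votes.get(c, 0.0) + w
    return max(votes, key=votes.get)
```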

* 11 pages, 5 figures
Colorization methods using deep neural networks have become a recent trend. However, most of them do not allow user inputs, or allow only limited user inputs (only global inputs or only local inputs) to control the output colorized images. A possible reason is that it is difficult to differentiate the influence of different kinds of user inputs during network training. To solve this problem, we present a novel deep colorization method that allows simultaneous global and local inputs to better control the output colorized images. The key step is to design an appropriate loss function that can differentiate the influence of input data, global inputs and local inputs. With this design, our method accepts no inputs, only global inputs, only local inputs, or both global and local inputs, which is not supported by previous deep colorization methods. In addition, we propose a global color theme recommendation system to help users determine global inputs. Experimental results show that our method can better control the colorized images and generates state-of-the-art results.

The superiority of deeply learned pedestrian representations has been reported in very recent literature on person re-identification (re-ID). In this paper, we consider the more pragmatic issue of learning a deep feature with no or only a few labels. We propose a progressive unsupervised learning (PUL) method to transfer pretrained deep representations to unseen domains. Our method is easy to implement and can be viewed as an effective baseline for unsupervised re-ID feature learning. Specifically, PUL iterates between 1) pedestrian clustering and 2) fine-tuning of the convolutional neural network (CNN) to improve the original model trained on an unrelated labeled dataset. Since the clustering results can be very noisy, we add a selection operation between clustering and fine-tuning. At the beginning, when the model is weak, the CNN is fine-tuned on a small amount of reliable examples that are located near cluster centroids in the feature space. As the model becomes stronger in subsequent iterations, more images are adaptively selected as CNN training samples. Progressively, pedestrian clustering and the CNN model improve simultaneously until algorithm convergence. This process is naturally formulated as self-paced learning. We then point out promising directions that may lead to further improvement. Extensive experiments on three large-scale re-ID datasets demonstrate that PUL outputs discriminative features that improve re-ID accuracy.
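
One PUL iteration might look like the sketch below, where `extract_features` and `finetune` are assumed hooks into the CNN (not defined here) and the quantile-based selection is a simplification of the paper's reliability criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def pul_iteration(model, images, k, keep_frac):
    feats = extract_features(model, images)           # assumed hook: (N, D) array
    km = KMeans(n_clusters=k, n_init=10).fit(feats)   # pedestrian clustering
    dist = np.linalg.norm(feats - km.cluster_centers_[km.labels_], axis=1)
    reliable = dist < np.quantile(dist, keep_frac)    # keep samples near centroids
    return finetune(model, images[reliable], km.labels_[reliable])  # assumed hook

# usage sketch (model/images and the two hooks come from your re-ID code);
# the kept fraction grows as the model becomes stronger (self-paced):
#   for keep_frac in (0.5, 0.6, 0.7, 0.8):
#       model = pul_iteration(model, images, k=750, keep_frac=keep_frac)
```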

* Added more results, parameter analysis and comparisons
In the early days, content-based image retrieval (CBIR) was studied with global features. Since 2003, image retrieval based on local descriptors (de facto, SIFT) has been extensively studied for over a decade due to the advantage of SIFT in dealing with image transformations. Recently, image representations based on the convolutional neural network (CNN) have attracted increasing interest in the community and demonstrated impressive performance. Given this time of rapid evolution, this article provides a comprehensive survey of instance retrieval over the last decade. Two broad categories, SIFT-based and CNN-based methods, are presented. For the former, according to the codebook size, we organize the literature into using large/medium-sized/small codebooks. For the latter, we discuss three lines of methods, i.e., using pre-trained or fine-tuned CNN models, and hybrid methods. The first two categories perform a single pass of the image through the network, while the last employs a patch-based feature extraction scheme. This survey presents milestones in modern instance retrieval, reviews a broad selection of previous works in different categories, and provides insights on the connection between SIFT-based and CNN-based methods. After analyzing and comparing the retrieval performance of different categories on several datasets, we discuss promising directions towards generic and specialized instance retrieval.

* Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence
In this paper we develop a new framework that captures the common landscape underlying popular non-convex low-rank matrix problems, including matrix sensing, matrix completion and robust PCA. In particular, we show for all the above problems (including asymmetric cases) that 1) all local minima are also globally optimal, and 2) no high-order saddle points exist. These results explain why simple algorithms such as stochastic gradient descent converge globally and efficiently optimize these non-convex objective functions in practice. Our framework connects and simplifies the existing analyses of optimization landscapes for matrix sensing and symmetric matrix completion. The framework naturally leads to new results for asymmetric matrix completion and robust PCA.
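
For concreteness, one instance covered by such a framework is matrix sensing; the following is a hedged sketch of the factored objective and of the claim (in the asymmetric case a balancing regularizer is needed, whose exact form is the paper's, not reproduced here).

```latex
% Matrix sensing: recover a rank-r matrix M^* from m linear measurements
% b_i = \langle A_i, M^* \rangle via the non-convex factored objective
\[
  f(U, V) \;=\; \sum_{i=1}^{m} \bigl( \langle A_i,\, U V^{\top} \rangle - b_i \bigr)^{2},
  \qquad U \in \mathbb{R}^{n_1 \times r},\ V \in \mathbb{R}^{n_2 \times r}.
\]
% Claim (sketch): every local minimum of a suitably regularized f is
% global, i.e. it satisfies U V^{\top} = M^*, and f has no high-order
% saddle points, so saddles can be escaped via negative curvature.
```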

Adversarial examples in recent works target closed set recognition systems, in which the training and testing classes are identical. In real-world scenarios, however, the testing classes may have limited, if any, overlap with the training classes, a problem named open set recognition. To our knowledge, the community does not have a specific design of adversarial examples targeting this practical setting. Arguably, the new setting compromises traditional closed set attack methods in two aspects. First, closed set attack methods are based on classification and target classification as well, but the open set problem suggests a different task, i.e., retrieval. It is undesirable that the generation mechanism of a closed set attack differs from the aim of open set recognition. Second, given that the query image is usually of an unseen class, predicting its category from the training classes is not reasonable, which leads to an inferior adversarial gradient. In this work, we view open set recognition as a retrieval task and propose a new approach, Opposite-Direction Feature Attack (ODFA), to generate adversarial examples/queries. When using an attacked example as the query, we aim for the true matches to be ranked as low as possible. In addressing the two limitations of closed set attack methods, ODFA works directly on the features used for retrieval. The idea is to push the feature of the adversarial query in the opposite direction of the original feature. Albeit simple, ODFA leads to a larger drop in Recall@K and mAP than the closed set attack methods on two open set recognition datasets, i.e., Market-1501 and CUB-200-2011. We also demonstrate that the attack performance of ODFA is not evidently superior to the state-of-the-art methods under closed set recognition (CIFAR-10), suggesting its specificity to open set problems.
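
A minimal sketch of the opposite-direction idea, assuming an L-infinity budget and a feature extractor `model`; the step sizes and normalization are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def odfa(model, x, steps=10, eps=8 / 255, alpha=2 / 255):
    """x: batch of query images; model: the feature extractor used for retrieval."""
    with torch.no_grad():
        target = -F.normalize(model(x), dim=1)        # the opposite direction
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        feat = F.normalize(model(x_adv), dim=1)
        loss = (feat - target).pow(2).sum()           # pull f(x_adv) toward -f(x)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()       # signed gradient step
            x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)  # L-inf budget
    return x_adv.detach()
```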

* 10 pages, 6 figures, 6 tables
In this paper, we make two contributions to unsupervised domain adaptation (UDA) using the convolutional neural network (CNN). First, our approach transfers knowledge in all the convolutional layers through attention alignment. Most previous methods align high-level representations, e.g., activations of the fully connected (FC) layers. In these methods, however, the convolutional layers which underpin critical low-level domain knowledge cannot be updated directly towards reducing domain discrepancy. Specifically, we assume that the discriminative regions in an image are relatively invariant to image style changes. Based on this assumption, we propose an attention alignment scheme on all the target convolutional layers to uncover the knowledge shared by the source domain. Second, we estimate the posterior label distribution of the unlabeled data for target network training. Previous methods, which iteratively update the pseudo labels by the target network and refine the target network by the updated pseudo labels, are vulnerable to label estimation errors. Instead, our approach uses category distribution to calculate the cross-entropy loss for training, thereby ameliorating the error accumulation of the estimated labels. The two contributions allow our approach to outperform the state-of-the-art methods by +2.6% on the Office-31 dataset.
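
A sketch of an attention-alignment term in the spirit described; the exact map definition and layer weighting are assumptions. Per convolutional layer, the activations are pooled into a spatial attention map and the discrepancy between the two branches is penalized.

```python
import torch.nn.functional as F

def attention_map(feat, p=2):
    """(B, C, H, W) activations -> (B, H*W) normalized spatial attention."""
    amap = feat.abs().pow(p).mean(dim=1).flatten(1)   # pool over channels
    return F.normalize(amap, dim=1)

def attention_alignment_loss(feats_src, feats_tgt):
    """Lists of same-layer conv activations from the two branches."""
    return sum(F.mse_loss(attention_map(fs), attention_map(ft))
               for fs, ft in zip(feats_src, feats_tgt))
```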

* Accepted by ECCV 2018
For a company looking to provide delightful user experiences, it is of paramount importance to take care of any customer issues. This paper proposes COTA, a system to improve the speed and reliability of customer support for end users through automated ticket classification and answer selection for support representatives. Two machine learning and natural language processing techniques are demonstrated: one relying on feature engineering (COTA v1) and the other exploiting raw signals through deep learning architectures (COTA v2). COTA v1 employs a new approach that converts the multi-class classification task into a ranking problem, demonstrating significantly better performance in the case of thousands of classes. For COTA v2, we propose the Encoder-Combiner-Decoder, a novel deep learning architecture that allows for heterogeneous input and output feature types and the injection of prior knowledge through network architecture choices. This paper compares these models and their variants on the tasks of ticket classification and answer selection, showing that COTA v2 outperforms COTA v1, and analyzes their inner workings and shortcomings. Finally, an A/B test conducted in a production setting validates the real-world impact of COTA, which reduces issue resolution time by 10 percent without reducing customer satisfaction.

Person re-identification (re-ID) has become increasingly popular in the community due to its application and research significance. It aims at spotting a person of interest across other cameras. In the early days, hand-crafted algorithms and small-scale evaluation were predominantly reported. Recent years have witnessed the emergence of large-scale datasets and deep learning systems which make use of large data volumes. Considering different tasks, we classify most current re-ID methods into two classes, i.e., image-based and video-based; in both tasks, hand-crafted and deep learning systems are reviewed. Moreover, two new re-ID tasks which are much closer to real-world applications are described and discussed, i.e., end-to-end re-ID and fast re-ID in very large galleries. This paper: 1) introduces the history of person re-ID and its relationship with image classification and instance retrieval; 2) surveys a broad selection of hand-crafted systems and large-scale methods in both image- and video-based re-ID; 3) describes critical future directions in end-to-end re-ID and fast retrieval in large galleries; and 4) finally briefly discusses some important yet under-developed issues.

* 20 pages, 5 tables, 10 images
Cartoons receive increasing attention and have a huge global market, and cartoon person recognition has a wealth of application scenarios. However, there is no large, high-quality dataset for cartoon person recognition, which limits the development of recognition algorithms. In this paper, we propose the first large unconstrained cartoon database, called iCartoonFace. We have made the dataset publicly available to promote cartoon person recognition research (the dataset can be requested by sending an email to zhengyi01@qiyi.com). The dataset contains 68,312 images of 2,639 identities. The identities come from cartoon videos, and the samples are extracted from publicly available images on websites and from online videos of the iQiYi company. All images pass through a careful manual annotation process. We evaluate state-of-the-art image classification and face recognition algorithms on the iCartoonFace dataset as baselines, and propose a dataset fusion method that utilizes face features to improve performance on the cartoon recognition task. Experiments show that the baseline models perform much worse than humans, while the proposed dataset fusion method achieves a 4.74% improvement over the baseline model. In a word, state-of-the-art algorithms for classification and recognition are far from perfect for unconstrained cartoon person recognition.

* There are many grammar problems in the article that need to be fixed, and the description of dataset fusion is not clear enough. Some experimental results need to be added. A new version will be released soon
Deep learning based methods have penetrated many image processing problems and become the dominant solutions to these problems. A natural question is: "Is there any space left for conventional methods on these problems?" In this paper, exposure interpolation is taken as an example to answer this question, and the answer is "Yes". A framework fusing conventional and deep learning methods is introduced to generate a medium-exposure image from two large-exposure-ratio images. Experimental results indicate that the quality of the medium-exposure image is increased significantly by using the deep learning method to refine the image interpolated by the conventional method. The conventional method can, in turn, be adopted to improve the convergence speed of the deep learning method and to reduce the number of training samples required by the deep learning method.

* 10 pages
This paper focuses on regularizing the training of the convolutional neural network (CNN). We propose a new regularization approach named "PatchShuffle" that can be adopted in any classification-oriented CNN model. It is easy to implement: in each mini-batch, images or feature maps are randomly chosen to undergo a transformation such that pixels within each local patch are shuffled. By generating images and feature maps with interior orderless patches, PatchShuffle creates rich local variations, reduces the risk of network overfitting, and can be viewed as a beneficial supplement to various other training regularization techniques, such as weight decay, model ensembles and dropout. Experiments on four representative classification datasets show that PatchShuffle improves the generalization ability of CNNs, especially when data is scarce. Moreover, we empirically show that CNN models trained with PatchShuffle are more robust to noise and local changes in an image.
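
A compact sketch of the transformation on an image tensor; for brevity, one permutation is shared by all patches of the tensor, whereas random per-image selection and per-patch permutations are straightforward extensions.

```python
import torch

def patch_shuffle(x, k=2):
    """x: (B, C, H, W) with H and W divisible by k; shuffles pixels inside
    each non-overlapping k x k patch."""
    b, c, h, w = x.shape
    patches = x.reshape(b, c, h // k, k, w // k, k)
    patches = patches.permute(0, 1, 2, 4, 3, 5).reshape(b, c, -1, k * k)
    idx = torch.randperm(k * k)                 # one shuffle shared by all patches
    patches = patches[..., idx]
    patches = patches.reshape(b, c, h // k, w // k, k, k)
    return patches.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)

x = torch.arange(16.0).reshape(1, 1, 4, 4)
print(patch_shuffle(x, k=2))                    # pixels move only within 2x2 cells
```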
