Models, code, and papers for "Wei Xing":

##### Hierarchical Recurrent Attention Network for Response Generation

Jan 25, 2017
Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, Wei-Ying Ma

We study multi-turn response generation in chatbots where a response is generated according to a conversation context. Existing work has modeled the hierarchy of the context, but does not pay enough attention to the fact that words and utterances in the context are differentially important. As a result, they may lose important information in context and generate irrelevant responses. We propose a hierarchical recurrent attention network (HRAN) to model both aspects in a unified framework. In HRAN, a hierarchical attention mechanism attends to important parts within and among utterances with word level attention and utterance level attention respectively. With the word level attention, hidden vectors of a word level encoder are synthesized as utterance vectors and fed to an utterance level encoder to construct hidden representations of the context. The hidden vectors of the context are then processed by the utterance level attention and formed as context vectors for decoding the response. Empirical studies on both automatic evaluation and human judgment show that HRAN can significantly outperform state-of-the-art models for multi-turn response generation.

##### Topic Aware Neural Response Generation

Sep 19, 2016
Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, Wei-Ying Ma

We consider incorporating topic information into the sequence-to-sequence framework to generate informative and interesting responses for chatbots. To this end, we propose a topic aware sequence-to-sequence (TA-Seq2Seq) model. The model utilizes topics to simulate prior knowledge of human that guides them to form informative and interesting responses in conversation, and leverages the topic information in generation by a joint attention mechanism and a biased generation probability. The joint attention mechanism summarizes the hidden vectors of an input message as context vectors by message attention, synthesizes topic vectors by topic attention from the topic words of the message obtained from a pre-trained LDA model, and let these vectors jointly affect the generation of words in decoding. To increase the possibility of topic words appearing in responses, the model modifies the generation probability of topic words by adding an extra probability item to bias the overall distribution. Empirical study on both automatic evaluation metrics and human annotations shows that TA-Seq2Seq can generate more informative and interesting responses, and significantly outperform the-state-of-the-art response generation models.

##### LightLDA: Big Topic Models on Modest Compute Clusters

When building large-scale machine learning (ML) programs, such as big topic models or deep neural nets, one usually assumes such tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners or academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens -- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude faster than current state-of-the-art Gibbs samplers; 2) a structure-aware model-parallel scheme, which leverages dependencies within the topic model, yielding a sampling strategy that is frugal on machine memory and network communication; 3) a differential data-structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed; and 4) a bounded asynchronous data-parallel scheme, which allows efficient distributed processing of massive data via a parameter server. Our distribution strategy is an instance of the model-and-data-parallel programming model underlying the Petuum framework for general distributed ML, and was implemented on top of the Petuum open-source system. We provide experimental evidence showing how this development puts massive models within reach on a small cluster while still enjoying proportional time cost reductions with increasing cluster size, in comparison with alternative options.

##### Embedding Electronic Health Records for Clinical Information Retrieval

Nov 13, 2018
Xing Wei, Carsten Eickhoff

Neural network representation learning frameworks have recently shown to be highly effective at a wide range of tasks ranging from radiography interpretation via data-driven diagnostics to clinical decision support. This often superior performance comes at the price of dramatically increased training data requirements that cannot be satisfied in every given institution or scenario. As a means of countering such data sparsity effects, distant supervision alleviates the need for scarce in-domain data by relying on a related, resource-rich, task for training. This study presents an end-to-end neural clinical decision support system that recommends relevant literature for individual patients (few available resources) via distant supervision on the well-known MIMIC-III collection (abundant resource). Our experiments show significant improvements in retrieval effectiveness over traditional statistical as well as purely locally supervised retrieval models.

* Published in AMIA Annual Symposium 2018
##### Weakly supervised discriminative feature learning with state information for person identification

Feb 27, 2020
Hong-Xing Yu, Wei-Shi Zheng

Unsupervised learning of identity-discriminative visual feature is appealing in real-world tasks where manual labelling is costly. However, the images of an identity can be visually discrepant when images are taken under different states, e.g. different camera views and poses. This visual discrepancy leads to great difficulty in unsupervised discriminative learning. Fortunately, in real-world tasks we could often know the states without human annotation, e.g. we can easily have the camera view labels in person re-identification and facial pose labels in face recognition. In this work we propose utilizing the state information as weak supervision to address the visual discrepancy caused by different states. We formulate a simple pseudo label model and utilize the state information in an attempt to refine the assigned pseudo labels by the weakly supervised decision boundary rectification and weakly supervised feature drift regularization. We evaluate our model on unsupervised person re-identification and pose-invariant face recognition. Despite the simplicity of our method, it could outperform the state-of-the-art results on Duke-reID, MultiPIE and CFP datasets with a standard ResNet-50 backbone. We also find our model could perform comparably with the standard supervised fine-tuning results on the three datasets. Code is available at https://github.com/KovenYu/state-information

* To appear at CVPR20
##### Scalable Variational Gaussian Process Regression Networks

Mar 25, 2020
Shibo Li, Wei Xing, Mike Kirby, Shandian Zhe

Gaussian process regression networks (GPRN) are powerful Bayesian models for multi-output regression, but their inference is intractable. To address this issue, existing methods use a fully factorized structure (or a mixture of such structures) over all the outputs and latent functions for posterior approximation, which, however, can miss the strong posterior dependencies among the latent variables and hurt the inference quality. In addition, the updates of the variational parameters are inefficient and can be prohibitively expensive for a large number of outputs. To overcome these limitations, we propose a scalable variational inference algorithm for GPRN, which not only captures the abundant posterior dependencies but also is much more efficient for massive outputs. We tensorize the output space and introduce tensor/matrix-normal variational posteriors to capture the posterior correlations and to reduce the parameters. We jointly optimize all the parameters and exploit the inherent Kronecker product structure in the variational model evidence lower bound to accelerate the computation. We demonstrate the advantages of our method in several real-world applications.

##### Cross-Spectrum Dual-Subspace Pairing for RGB-infrared Cross-Modality Person Re-Identification

Feb 29, 2020
Xing Fan, Hao Luo, Chi Zhang, Wei Jiang

Due to its potential wide applications in video surveillance and other computer vision tasks like tracking, person re-identification (ReID) has become popular and been widely investigated. However, conventional person re-identification can only handle RGB color images, which will fail at dark conditions. Thus RGB-infrared ReID (also known as Infrared-Visible ReID or Visible-Thermal ReID) is proposed. Apart from appearance discrepancy in traditional ReID caused by illumination, pose variations and viewpoint changes, modality discrepancy produced by cameras of the different spectrum also exists, which makes RGB-infrared ReID more difficult. To address this problem, we focus on extracting the shared cross-spectrum features of different modalities. In this paper, a novel multi-spectrum image generation method is proposed and the generated samples are utilized to help the network to find discriminative information for re-identifying the same person across modalities. Another challenge of RGB-infrared ReID is that the intra-person (images from the same person) discrepancy is often larger than the inter-person (images from different persons) discrepancy, so a dual-subspace pairing strategy is proposed to alleviate this problem. Combining those two parts together, we also design a one-stream neural network combining the aforementioned methods to extract compact representations of person images, called Cross-spectrum Dual-subspace Pairing (CDP) model. Furthermore, during the training process, we also propose a Dynamic Hard Spectrum Mining method to automatically mine more hard samples from hard spectrum based on the current model state to further boost the performance. Extensive experimental results on two public datasets, SYSU-MM01 with RGB + near-infrared images and RegDB with RGB + far-infrared images, have demonstrated the efficiency and generality of our proposed method.

##### Bayesian Loss for Crowd Count Estimation with Point Supervision

Aug 10, 2019
Zhiheng Ma, Xing Wei, Xiaopeng Hong, Yihong Gong

In crowd counting datasets, each person is annotated by a point, which is usually the center of the head. And the task is to estimate the total count in a crowd scene. Most of the state-of-the-art methods are based on density map estimation, which convert the sparse point annotations into a "ground truth" density map through a Gaussian kernel, and then use it as the learning target to train a density map estimator. However, such a "ground-truth" density map is imperfect due to occlusions, perspective effects, variations in object shapes, etc. On the contrary, we propose \emph{Bayesian loss}, a novel loss function which constructs a density contribution probability model from the point annotations. Instead of constraining the value at every pixel in the density map, the proposed training loss adopts a more reliable supervision on the count expectation at each annotated point. Without bells and whistles, the loss function makes substantial improvements over the baseline loss on all tested datasets. Moreover, our proposed loss function equipped with a standard backbone network, without using any external detectors or multi-scale architectures, plays favourably against the state of the arts. Our method outperforms previous best approaches by a large margin on the latest and largest UCF-QNRF dataset. The source code is available at \url{https://github.com/ZhihengCV/Baysian-Crowd-Counting}.

* Accepted by ICCV 2019 as an oral presentation
##### STNReID : Deep Convolutional Networks with Pairwise Spatial Transformer Networks for Partial Person Re-identification

Mar 17, 2019
Hao Luo, Xing Fan, Chi Zhang, Wei Jiang

Partial person re-identification (ReID) is a challenging task because only partial information of person images is available for matching target persons. Few studies, especially on deep learning, have focused on matching partial person images with holistic person images. This study presents a novel deep partial ReID framework based on pairwise spatial transformer networks (STNReID), which can be trained on existing holistic person datasets. STNReID includes a spatial transformer network (STN) module and a ReID module. The STN module samples an affined image (a semantically corresponding patch) from the holistic image to match the partial image. The ReID module extracts the features of the holistic, partial, and affined images. Competition (or confrontation) is observed between the STN module and the ReID module, and two-stage training is applied to acquire a strong STNReID for partial ReID. Experimental results show that our STNReID obtains 66.7% and 54.6% rank-1 accuracies on partial ReID and partial iLIDS datasets, respectively. These values are at par with those obtained with state-of-the-art methods.

##### Explaining a black-box using Deep Variational Information Bottleneck Approach

Feb 19, 2019
Seojin Bang, Pengtao Xie, Wei Wu, Eric Xing

Briefness and comprehensiveness are necessary in order to give a lot of information concisely in explaining a black-box decision system. However, existing interpretable machine learning methods fail to consider briefness and comprehensiveness simultaneously, which may lead to redundant explanations. We propose a system-agnostic interpretable method that provides a brief but comprehensive explanation by adopting the inspiring information theoretic principle, information bottleneck principle. Using an information theoretic objective, VIBI selects instance-wise key features that are maximally compressed about an input (briefness), and informative about a decision made by a black-box on that input (comprehensive). The selected key features act as an information bottleneck that serves as a concise explanation for each black-box decision. We show that VIBI outperforms other interpretable machine learning methods in terms of both interpretability and fidelity evaluated by human and quantitative metrics.

##### Unsupervised Person Re-identification by Deep Asymmetric Metric Embedding

Jan 29, 2019
Hong-Xing Yu, Ancong Wu, Wei-Shi Zheng

Person re-identification (Re-ID) aims to match identities across non-overlapping camera views. Researchers have proposed many supervised Re-ID models which require quantities of cross-view pairwise labelled data. This limits their scalabilities to many applications where a large amount of data from multiple disjoint camera views is available but unlabelled. Although some unsupervised Re-ID models have been proposed to address the scalability problem, they often suffer from the view-specific bias problem which is caused by dramatic variances across different camera views, e.g., different illumination, viewpoints and occlusion. The dramatic variances induce specific feature distortions in different camera views, which can be very disturbing in finding cross-view discriminative information for Re-ID in the unsupervised scenarios, since no label information is available to help alleviate the bias. We propose to explicitly address this problem by learning an unsupervised asymmetric distance metric based on cross-view clustering. The asymmetric distance metric allows specific feature transformations for each camera view to tackle the specific feature distortions. We then design a novel unsupervised loss function to embed the asymmetric metric into a deep neural network, and therefore develop a novel unsupervised deep framework named the DEep Clustering-based Asymmetric MEtric Learning (DECAMEL). In such a way, DECAMEL jointly learns the feature representation and the unsupervised asymmetric metric. DECAMEL learns a compact cross-view cluster structure of Re-ID data, and thus help alleviate the view-specific bias and facilitate mining the potential cross-view discriminative information for unsupervised Re-ID. Extensive experiments on seven benchmark datasets whose sizes span several orders show the effectiveness of our framework.

* To appear in TPAMI
##### GLStyleNet: Higher Quality Style Transfer Combining Global and Local Pyramid Features

Nov 18, 2018
Zhizhong Wang, Lei Zhao, Wei Xing, Dongming Lu

Recent studies using deep neural networks have shown remarkable success in style transfer especially for artistic and photo-realistic images. However, the approaches using global feature correlations fail to capture small, intricate textures and maintain correct texture scales of the artworks, and the approaches based on local patches are defective on global effect. In this paper, we present a novel feature pyramid fusion neural network, dubbed GLStyleNet, which sufficiently takes into consideration multi-scale and multi-level pyramid features by best aggregating layers across a VGG network, and performs style transfer hierarchically with multiple losses of different scales. Our proposed method retains high-frequency pixel information and low frequency construct information of images from two aspects: loss function constraint and feature fusion. Our approach is not only flexible to adjust the trade-off between content and style, but also controllable between global and local. Compared to state-of-the-art methods, our method can transfer not just large-scale, obvious style cues but also subtle, exquisite ones, and dramatically improves the quality of style transfer. We demonstrate the effectiveness of our approach on portrait style transfer, artistic style transfer, photo-realistic style transfer and Chinese ancient painting style transfer tasks. Experimental results indicate that our unified approach improves image style transfer quality over previous state-of-the-art methods, while also accelerating the whole process in a certain extent. Our code is available at https://github.com/EndyWon/GLStyleNet.

##### SphereReID: Deep Hypersphere Manifold Embedding for Person Re-Identification

Jul 02, 2018
Xing Fan, Wei Jiang, Hao Luo, Mengjuan Fei

Many current successful Person Re-Identification(ReID) methods train a model with the softmax loss function to classify images of different persons and obtain the feature vectors at the same time. However, the underlying feature embedding space is ignored. In this paper, we use a modified softmax function, termed Sphere Softmax, to solve the classification problem and learn a hypersphere manifold embedding simultaneously. A balanced sampling strategy is also introduced. Finally, we propose a convolutional neural network called SphereReID adopting Sphere Softmax and training a single model end-to-end with a new warming-up learning rate schedule on four challenging datasets including Market-1501, DukeMTMC-reID, CHHK-03, and CUHK-SYSU. Experimental results demonstrate that this single model outperforms the state-of-the-art methods on all four datasets without fine-tuning or re-ranking. For example, it achieves 94.4% rank-1 accuracy on Market-1501 and 83.9% rank-1 accuracy on DukeMTMC-reID. The code and trained weights of our model will be released.

* Contribute to Journal of Visual Communication and Image Representation
##### Cross-view Asymmetric Metric Learning for Unsupervised Person Re-identification

Oct 18, 2017
Hong-Xing Yu, Ancong Wu, Wei-Shi Zheng

While metric learning is important for Person re-identification (RE-ID), a significant problem in visual surveillance for cross-view pedestrian matching, existing metric models for RE-ID are mostly based on supervised learning that requires quantities of labeled samples in all pairs of camera views for training. However, this limits their scalabilities to realistic applications, in which a large amount of data over multiple disjoint camera views is available but not labelled. To overcome the problem, we propose unsupervised asymmetric metric learning for unsupervised RE-ID. Our model aims to learn an asymmetric metric, i.e., specific projection for each view, based on asymmetric clustering on cross-view person images. Our model finds a shared space where view-specific bias is alleviated and thus better matching performance can be achieved. Extensive experiments have been conducted on a baseline and five large-scale RE-ID datasets to demonstrate the effectiveness of the proposed model. Through the comparison, we show that our model works much more suitable for unsupervised RE-ID compared to classical unsupervised metric learning models. We also compare with existing unsupervised RE-ID methods, and our model outperforms them with notable margins. Specifically, we report the results on large-scale unlabelled RE-ID dataset, which is important but unfortunately less concerned in literatures.

##### DSRGAN: Explicitly Learning Disentangled Representation of Underlying Structure and Rendering for Image Generation without Tuple Supervision

Sep 30, 2019
Guang-Yuan Hao, Hong-Xing Yu, Wei-Shi Zheng

We focus on explicitly learning disentangled representation for natural image generation, where the underlying spatial structure and the rendering on the structure can be independently controlled respectively, yet using no tuple supervision. The setting is significant since tuple supervision is costly and sometimes even unavailable. However, the task is highly unconstrained and thus ill-posed. To address this problem, we propose to introduce an auxiliary domain which shares a common underlying-structure space with the target domain, and we make a partially shared latent space assumption. The key idea is to encourage the partially shared latent variable to represent the similar underlying spatial structures in both domains, while the two domain-specific latent variables will be unavoidably arranged to present renderings of two domains respectively. This is achieved by designing two parallel generative networks with a common Progressive Rendering Architecture (PRA), which constrains both generative networks' behaviors to model shared underlying structure and to model spatially dependent relation between rendering and underlying structure. Thus, we propose DSRGAN (GANs for Disentangling Underlying Structure and Rendering) to instantiate our method. We also propose a quantitative criterion (the Normalized Disentanglability) to quantify disentanglability. Comparison to the state-of-the-art methods shows that DSRGAN can significantly outperform them in disentanglability.

##### Downhole Track Detection via Multiscale Conditional Generative Adversarial Nets

Apr 17, 2019
Jia Li, Xing Wei, Guoqiang Yang, Xiao Sun, Changliang Li

Frequent mine disasters cause a large number of casualties and property losses. Autonomous driving is a fundamental measure for solving this problem, and track detection is one of the key technologies for computer vision to achieve downhole automatic driving. The track detection result based on the traditional convolutional neural network (CNN) algorithm lacks the detailed and unique description of the object and relies too much on visual postprocessing technology. Therefore, this paper proposes a track detection algorithm based on a multiscale conditional generative adversarial network (CGAN). The generator is decomposed into global and local parts using a multigranularity structure in the generator network. A multiscale shared convolution structure is adopted in the discriminator network to further supervise training the generator. Finally, the Monte Carlo search technique is introduced to search the intermediate state of the generator, and the result is sent to the discriminator for comparison. Compared with the existing work, our model achieved 82.43\% pixel accuracy and an average intersection-over-union (IOU) of 0.6218, and the detection of the track reached 95.01\% accuracy in the downhole roadway scene test set.

* arXiv admin note: text overlap with arXiv:1711.11585 by other authors
##### Reinforcement Learning Based Emotional Editing Constraint Conversation Generation

Apr 17, 2019
Jia Li, Xiao Sun, Xing Wei, Changliang Li, Jianhua Tao

In recent years, the generation of conversation content based on deep neural networks has attracted many researchers. However, traditional neural language models tend to generate general replies, lacking logical and emotional factors. This paper proposes a conversation content generation model that combines reinforcement learning with emotional editing constraints to generate more meaningful and customizable emotional replies. The model divides the replies into three clauses based on pre-generated keywords and uses the emotional editor to further optimize the final reply. The model combines multi-task learning with multiple indicator rewards to comprehensively optimize the quality of replies. Experiments shows that our model can not only improve the fluency of the replies, but also significantly enhance the logical relevance and emotional relevance of the replies.

##### Coarse-to-Fine Salient Object Detection with Low-Rank Matrix Recovery

Oct 02, 2018
Qi Zheng, Shujian Yu, Xinge You, Qinmu Peng, Wei Yuan

Low-Rank Matrix Recovery (LRMR) has recently been applied to saliency detection by decomposing image features into a low-rank component associated with background and a sparse component associated with visual salient regions. Despite its great potential, existing LRMR-based saliency detection methods seldom consider the inter-relationship among elements within these two components, thus are prone to generating scattered or incomplete saliency maps. In this paper, we introduce a novel and efficient LRMR-based saliency detection model under a coarse-to-fine framework to circumvent this limitation. First, we roughly measure the saliency of image regions with a baseline LRMR model that integrates a $\ell_1$-norm sparsity constraint and a Laplacian regularization smooth term. Given samples from the coarse saliency map, we then learn a projection that maps image features to refined saliency values, to significantly sharpen the object boundaries and to preserve the object entirety. We evaluate our framework against existing LRMR based methods on three benchmark datasets. Experimental results validate the superiority of our method as well as the effectiveness of our suggested coarse-to-fine framework, especially for images containing multiple objects.

* 32 pages, 9 figures
##### MIXGAN: Learning Concepts from Different Domains for Mixture Generation

Jul 04, 2018
Guang-Yuan Hao, Hong-Xing Yu, Wei-Shi Zheng

In this work, we present an interesting attempt on mixture generation: absorbing different image concepts (e.g., content and style) from different domains and thus generating a new domain with learned concepts. In particular, we propose a mixture generative adversarial network (MIXGAN). MIXGAN learns concepts of content and style from two domains respectively, and thus can join them for mixture generation in a new domain, i.e., generating images with content from one domain and style from another. MIXGAN overcomes the limitation of current GAN-based models which either generate new images in the same domain as they observed in training stage, or require off-the-shelf content templates for transferring or translation. Extensive experimental results demonstrate the effectiveness of MIXGAN as compared to related state-of-the-art GAN-based models.

* Accepted by IJCAI-ECAI 2018, the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence
##### Orthogonality-Promoting Distance Metric Learning: Convex Relaxation and Theoretical Analysis

Feb 16, 2018
Pengtao Xie, Wei Wu, Yichen Zhu, Eric P. Xing

Distance metric learning (DML), which learns a distance metric from labeled "similar" and "dissimilar" data pairs, is widely utilized. Recently, several works investigate orthogonality-promoting regularization (OPR), which encourages the projection vectors in DML to be close to being orthogonal, to achieve three effects: (1) high balancedness -- achieving comparable performance on both frequent and infrequent classes; (2) high compactness -- using a small number of projection vectors to achieve a "good" metric; (3) good generalizability -- alleviating overfitting to training data. While showing promising results, these approaches suffer three problems. First, they involve solving non-convex optimization problems where achieving the global optimal is NP-hard. Second, it lacks a theoretical understanding why OPR can lead to balancedness. Third, the current generalization error analysis of OPR is not directly on the regularizer. In this paper, we address these three issues by (1) seeking convex relaxations of the original nonconvex problems so that the global optimal is guaranteed to be achievable; (2) providing a formal analysis on OPR's capability of promoting balancedness; (3) providing a theoretical analysis that directly reveals the relationship between OPR and generalization performance. Experiments on various datasets demonstrate that our convex methods are more effective in promoting balancedness, compactness, and generalization, and are computationally more efficient, compared with the nonconvex methods.