Models, code, and papers for "Shiliang Zhang":
Artificial neural network (ANN) is a very useful tool in solving learning problems. Boosting the performances of ANN can be mainly concluded from two aspects: optimizing the architecture of ANN and normalizing the raw data for ANN. In this paper, a novel method which improves the effects of ANN by preprocessing the raw data is proposed. It totally leverages the fact that different features should play different roles. The raw data set is firstly preprocessed by principle component analysis (PCA), and then its principle components are weighted by their corresponding eigenvalues. Several aspects of analysis are carried out to analyze its theory and the applicable occasions. Three classification problems are launched by an active learning algorithm to verify the proposed method. From the empirical results, conclusion comes to the fact that the proposed method can significantly improve the performance of ANN.
In this paper, we propose a novel model for high-dimensional data, called the Hybrid Orthogonal Projection and Estimation (HOPE) model, which combines a linear orthogonal projection and a finite mixture model under a unified generative modeling framework. The HOPE model itself can be learned unsupervised from unlabelled data based on the maximum likelihood estimation as well as discriminatively from labelled data. More interestingly, we have shown the proposed HOPE models are closely related to neural networks (NNs) in a sense that each hidden layer can be reformulated as a HOPE model. As a result, the HOPE framework can be used as a novel tool to probe why and how NNs work, more importantly, to learn NNs in either supervised or unsupervised ways. In this work, we have investigated the HOPE framework to learn NNs for several standard tasks, including image recognition on MNIST and speech recognition on TIMIT. Experimental results have shown that the HOPE framework yields significant performance gains over the current state-of-the-art methods in various types of NN learning problems, including unsupervised feature learning, supervised or semi-supervised learning.
A novel predictor for traffic flow forecasting, namely spatio-temporal Bayesian network predictor, is proposed. Unlike existing methods, our approach incorporates all the spatial and temporal information available in a transportation network to carry our traffic flow forecasting of the current site. The Pearson correlation coefficient is adopted to rank the input variables (traffic flows) for prediction, and the best-first strategy is employed to select a subset as the cause nodes of a Bayesian network. Given the derived cause nodes and the corresponding effect node in the spatio-temporal Bayesian network, a Gaussian Mixture Model is applied to describe the statistical relationship between the input and output. Finally, traffic flow forecasting is performed under the criterion of Minimum Mean Square Error (M.M.S.E.). Experimental results with the urban vehicular flow data of Beijing demonstrate the effectiveness of our presented spatio-temporal Bayesian network predictor.
Exploiting resolution invariant representation is critical for person Re-Identification (ReID) in real applications, where the resolutions of captured person images may vary dramatically. This paper learns person representations robust to resolution variance through jointly training a Foreground-Focus Super-Resolution (FFSR) module and a Resolution-Invariant Feature Extractor (RIFE) by end-to-end CNN learning. FFSR upscales the person foreground using a fully convolutional auto-encoder with skip connections learned with a foreground focus training loss. RIFE adopts two feature extraction streams weighted by a dual-attention block to learn features for low and high resolution images, respectively. These two complementary modules are jointly trained, leading to a strong resolution invariant representation. We evaluate our methods on five datasets containing person images at a large range of resolutions, where our methods show substantial superiority to existing solutions. For instance, we achieve Rank-1 accuracy of 36.4% and 73.3% on CAVIAR and MLR-CUHK03, outperforming the state-of-the art by 2.9% and 2.6%, respectively.
Connectionist Temporal Classification (CTC) based end-to-end speech recognition system usually need to incorporate an external language model by using WFST-based decoding in order to achieve promising results. This is more essential to Mandarin speech recognition since it owns a special phenomenon, namely homophone, which causes a lot of substitution errors. The linguistic information introduced by language model will help to distinguish these substitution errors. In this work, we propose a transformer based spelling correction model to automatically correct errors especially the substitution errors made by CTC-based Mandarin speech recognition system. Specifically, we investigate using the recognition results generated by CTC-based systems as input and the ground-truth transcriptions as output to train a transformer with encoder-decoder architecture, which is much similar to machine translation. Results in a 20,000 hours Mandarin speech recognition task show that the proposed spelling correction model can achieve a CER of 3.41%, which results in 22.9% and 53.2% relative improvement compared to the baseline CTC-based systems decoded with and without language model respectively.
This paper proposes a two-stream convolution network to extract spatial and temporal cues for video based person Re-Identification (ReID). A temporal stream in this network is constructed by inserting several Multi-scale 3D (M3D) convolution layers into a 2D CNN network. The resulting M3D convolution network introduces a fraction of parameters into the 2D CNN, but gains the ability of multi-scale temporal feature learning. With this compact architecture, M3D convolution network is also more efficient and easier to optimize than existing 3D convolution networks. The temporal stream further involves Residual Attention Layers (RAL) to refine the temporal features. By jointly learning spatial-temporal attention masks in a residual manner, RAL identifies the discriminative spatial regions and temporal cues. The other stream in our network is implemented with a 2D CNN for spatial feature extraction. The spatial and temporal features from two streams are finally fused for the video based person ReID. Evaluations on three widely used benchmarks datasets, i.e., MARS, PRID2011, and iLIDS-VID demonstrate the substantial advantages of our method over existing 3D convolution networks and state-of-art methods.
Shift operation is an efficient alternative over depthwise separable convolution. However, it is still bottlenecked by its implementation manner, namely memory movement. To put this direction forward, a new and novel basic component named Sparse Shift Layer (SSL) is introduced in this paper to construct efficient convolutional neural networks. In this family of architectures, the basic block is only composed by 1x1 convolutional layers with only a few shift operations applied to the intermediate feature maps. To make this idea feasible, we introduce shift operation penalty during optimization and further propose a quantization-aware shift learning method to impose the learned displacement more friendly for inference. Extensive ablation studies indicate that only a few shift operations are sufficient to provide spatial information communication. Furthermore, to maximize the role of SSL, we redesign an improved network architecture to Fully Exploit the limited capacity of neural Network (FE-Net). Equipped with SSL, this network can achieve 75.0% top-1 accuracy on ImageNet with only 563M M-Adds. It surpasses other counterparts constructed by depthwise separable convolution and the networks searched by NAS in terms of accuracy and practical speed.
Neuron pruning is an efficient method to compress the network into a slimmer one for reducing the computational cost and storage overhead. Most of state-of-the-art results are obtained in a layer-by-layer optimization mode. It discards the unimportant input neurons and uses the survived ones to reconstruct the output neurons approaching to the original ones in a layer-by-layer manner. However, an unnoticed problem arises that the information loss is accumulated as layer increases since the survived neurons still do not encode the entire information as before. A better alternative is to propagate the entire useful information to reconstruct the pruned layer instead of directly discarding the less important neurons. To this end, we propose a novel Layer Decomposition-Recomposition Framework (LDRF) for neuron pruning, by which each layer's output information is recovered in an embedding space and then propagated to reconstruct the following pruned layers with useful information preserved. We mainly conduct our experiments on ILSVRC-12 benchmark with VGG-16 and ResNet-50. What should be emphasized is that our results before end-to-end fine-tuning are significantly superior owing to the information-preserving property of our proposed framework.With end-to-end fine-tuning, we achieve state-of-the-art results of 5.13x and 3x speed-up with only 0.5% and 0.65% top-5 accuracy drop respectively, which outperform the existing neuron pruning methods.
Although the performance of person Re-Identification (ReID) has been significantly boosted, many challenging issues in real scenarios have not been fully investigated, e.g., the complex scenes and lighting variations, viewpoint and pose changes, and the large number of identities in a camera network. To facilitate the research towards conquering those issues, this paper contributes a new dataset called MSMT17 with many important features, e.g., 1) the raw videos are taken by an 15-camera network deployed in both indoor and outdoor scenes, 2) the videos cover a long period of time and present complex lighting variations, and 3) it contains currently the largest number of annotated identities, i.e., 4,101 identities and 126,441 bounding boxes. We also observe that, domain gap commonly exists between datasets, which essentially causes severe performance drop when training and testing on different datasets. This results in that available training data cannot be effectively leveraged for new testing domains. To relieve the expensive costs of annotating new training samples, we propose a Person Transfer Generative Adversarial Network (PTGAN) to bridge the domain gap. Comprehensive experiments show that the domain gap could be substantially narrowed-down by the PTGAN.
Previous works on vehicle Re-ID mainly focus on extracting global features and learning distance metrics. Because some vehicles commonly share same model and maker, it is hard to distinguish them based on their global appearances. Compared with the global appearance, local regions such as decorations and inspection stickers attached to the windshield, may be more distinctive for vehicle Re-ID. To embed the detailed visual cues in those local regions, we propose a Region-Aware deep Model (RAM). Specifically, in addition to extracting global features, RAM also extracts features from a series of local regions. As each local region conveys more distinctive visual cues, RAM encourages the deep model to learn discriminative features. We also introduce a novel learning algorithm to jointly use vehicle IDs, types/models, and colors to train the RAM. This strategy fuses more cues for training and results in more discriminative global and regional features. We evaluate our methods on two large-scale vehicle Re-ID datasets, i.e., VeRi and VehicleID. Experimental results show our methods achieve promising performance in comparison with recent works.
In this paper, we present an improved feedforward sequential memory networks (FSMN) architecture, namely Deep-FSMN (DFSMN), by introducing skip connections between memory blocks in adjacent layers. These skip connections enable the information flow across different layers and thus alleviate the gradient vanishing problem when building very deep structure. As a result, DFSMN significantly benefits from these skip connections and deep structure. We have compared the performance of DFSMN to BLSTM both with and without lower frame rate (LFR) on several large speech recognition tasks, including English and Mandarin. Experimental results shown that DFSMN can consistently outperform BLSTM with dramatic gain, especially trained with LFR using CD-Phone as modeling units. In the 2000 hours Fisher (FSH) task, the proposed DFSMN can achieve a word error rate of 9.4% by purely using the cross-entropy criterion and decoding with a 3-gram language model, which achieves a 1.5% absolute improvement compared to the BLSTM. In a 20000 hours Mandarin recognition task, the LFR trained DFSMN can achieve more than 20% relative improvement compared to the LFR trained BLSTM. Moreover, we can easily design the lookahead filter order of the memory blocks in DFSMN to control the latency for real-time applications.
Traditional Bag-of-visual Words (BoWs) model is commonly generated with many steps including local feature extraction, codebook generation, and feature quantization, etc. Those steps are relatively independent with each other and are hard to be jointly optimized. Moreover, the dependency on hand-crafted local feature makes BoWs model not effective in conveying high-level semantics. These issues largely hinder the performance of BoWs model in large-scale image applications. To conquer these issues, we propose an End-to-End BoWs (E$^2$BoWs) model based on Deep Convolutional Neural Network (DCNN). Our model takes an image as input, then identifies and separates the semantic objects in it, and finally outputs the visual words with high semantic discriminative power. Specifically, our model firstly generates Semantic Feature Maps (SFMs) corresponding to different object categories through convolutional layers, then introduces Bag-of-Words Layers (BoWL) to generate visual words for each individual feature map. We also introduce a novel learning algorithm to reinforce the sparsity of the generated E$^2$BoWs model, which further ensures the time and memory efficiency. We evaluate the proposed E$^2$BoWs model on several image search datasets including CIFAR-10, CIFAR-100, MIRFLICKR-25K and NUS-WIDE. Experimental results show that our method achieves promising accuracy and efficiency compared with recent deep learning based retrieval works.
We introduce a new structure for memory neural networks, called feedforward sequential memory networks (FSMN), which can learn long-term dependency without using recurrent feedback. The proposed FSMN is a standard feedforward neural networks equipped with learnable sequential memory blocks in the hidden layers. In this work, we have applied FSMN to several language modeling (LM) tasks. Experimental results have shown that the memory blocks in FSMN can learn effective representations of long history. Experiments have shown that FSMN based language models can significantly outperform not only feedforward neural network (FNN) based LMs but also the popular recurrent neural network (RNN) LMs.
Exploiting multi-scale representations is critical to improve edge detection for objects at different scales. To extract edges at dramatically different scales, we propose a Bi-Directional Cascade Network (BDCN) structure, where an individual layer is supervised by labeled edges at its specific scale, rather than directly applying the same supervision to all CNN outputs. Furthermore, to enrich multi-scale representations learned by BDCN, we introduce a Scale Enhancement Module (SEM) which utilizes dilated convolution to generate multi-scale features, instead of using deeper CNNs or explicitly fusing multi-scale edge maps. These new approaches encourage the learning of multi-scale representations in different layers and detect edges that are well delineated by their scales. Learning scale dedicated layers also results in compact network with a fraction of parameters. We evaluate our method on three datasets, i.e., BSDS500, NYUDv2, and Multicue, and achieve ODS Fmeasure of 0.828, 1.3% higher than current state-of-the art on BSDS500. The code has been available at https://github.com/pkuCactus/BDCN.
Learning discriminative representations for unseen person images is critical for person Re-Identification (ReID). Most of current approaches learn deep representations in classification tasks, which essentially minimize the empirical classification risk on the training set. As shown in our experiments, such representations commonly focus on several body parts discriminative to the training set, rather than the entire human body. Inspired by the structural risk minimization principle in SVM, we revise the traditional deep representation learning procedure to minimize both the empirical classification risk and the representation learning risk. The representation learning risk is evaluated by the proposed part loss, which automatically generates several parts for an image, and computes the person classification loss on each part separately. Compared with traditional global classification loss, simultaneously considering multiple part loss enforces the deep network to focus on the entire human body and learn discriminative representations for different parts. Experimental results on three datasets, i.e., Market1501, CUHK03, VIPeR, show that our representation outperforms the existing deep representations.
Fine-Grained Visual Categorization (FGVC) has achieved significant progress recently. However, the number of fine-grained species could be huge and dynamically increasing in real scenarios, making it difficult to recognize unseen objects under the current FGVC framework. This raises an open issue to perform large-scale fine-grained identification without a complete training set. Aiming to conquer this issue, we propose a retrieval task named One-Shot Fine-Grained Instance Retrieval (OSFGIR). "One-Shot" denotes the ability of identifying unseen objects through a fine-grained retrieval task assisted with an incomplete auxiliary training set. This paper first presents the detailed description to OSFGIR task and our collected OSFGIR-378K dataset. Next, we propose the Convolutional and Normalization Networks (CN-Nets) learned on the auxiliary dataset to generate a concise and discriminative representation. Finally, we present a coarse-to-fine retrieval framework consisting of three components, i.e., coarse retrieval, fine-grained retrieval, and query expansion, respectively. The framework progressively retrieves images with similar semantics, and performs fine-grained identification. Experiments show our OSFGIR framework achieves significantly better accuracy and efficiency than existing FGVC and image retrieval methods, thus could be a better solution for large-scale fine-grained object identification.
This paper describes the USTC_NELSLIP systems submitted to the Trilingual Entity Detection and Linking (EDL) track in 2016 TAC Knowledge Base Population (KBP) contests. We have built two systems for entity discovery and mention detection (MD): one uses the conditional RNNLM and the other one uses the attention-based encoder-decoder framework. The entity linking (EL) system consists of two modules: a rule based candidate generation and a neural networks probability ranking model. Moreover, some simple string matching rules are used for NIL clustering. At the end, our best system has achieved an F1 score of 0.624 in the end-to-end typed mention ceaf plus metric.
This paper proposes the Global-Local Temporal Representation (GLTR) to exploit the multi-scale temporal cues in video sequences for video person Re-Identification (ReID). GLTR is constructed by first modeling the short-term temporal cues among adjacent frames, then capturing the long-term relations among inconsecutive frames. Specifically, the short-term temporal cues are modeled by parallel dilated convolutions with different temporal dilation rates to represent the motion and appearance of pedestrian. The long-term relations are captured by a temporal self-attention model to alleviate the occlusions and noises in video sequences. The short and long-term temporal cues are aggregated as the final GLTR by a simple single-stream CNN. GLTR shows substantial superiority to existing features learned with body part cues or metric learning on four widely-used video ReID datasets. For instance, it achieves Rank-1 Accuracy of 87.02% on MARS dataset without re-ranking, better than current state-of-the art.
Person re-identification (ReID) aims to match people across multiple non-overlapping video cameras deployed at different locations. To address this challenging problem, many metric learning approaches have been proposed, among which triplet loss is one of the state-of-the-arts. In this work, we explore the margin between positive and negative pairs of triplets and prove that large margin is beneficial. In particular, we propose a novel multi-stage training strategy which learns incremental triplet margin and improves triplet loss effectively. Multiple levels of feature maps are exploited to make the learned features more discriminative. Besides, we introduce global hard identity searching method to sample hard identities when generating a training batch. Extensive experiments on Market-1501, CUHK03, and DukeMTMCreID show that our approach yields a performance boost and outperforms most existing state-of-the-art methods.
This paper mainly establishes a large-scale Long sequence Video database for person re-IDentification (LVreID). Different from existing datasets, LVreID presents many important new features. (1) long sequences: the average sequence length is 200 frames, which convey more abundant cues like pose and viewpoint changes that can be explored for feature learning. (2) complex lighting, scene, and background variations: it is captured by 15 cameras located in both indoor and outdoor scenes in 12 time slots. (3) currently the largest size: it contains 3,772 identities and about 3 million bounding boxes. Those unique features in LVreID define a more challenging and realistic person ReID task. Spatial Aligned Temporal Pyramid Pooling (SATPP) network is proposed as a baseline algorithm to leverage the rich visual-temporal cues in LVreID for feature learning. SATPP jointly handles the misalignment issues in detected bounding boxes and efficiently aggregates the discriminative cues embedded in sequential video frames. Extensive experiments show feature extracted by SATPP outperforms several widely used video features. Our experiments also prove the ReID accuracy increases substantially along with longer sequence length. This demonstrates the advantage and necessity of using longer video sequences for person ReID.