Models, code, and papers for "Xin Wang":

##### Reject Illegal Inputs with Generative Classifier Derived from Any Discriminative Classifier

Jan 02, 2020
Xin Wang

Generative classifiers have been shown promising to detect illegal inputs including adversarial examples and out-of-distribution samples. Supervised Deep Infomax~(SDIM) is a scalable end-to-end framework to learn generative classifiers. In this paper, we propose a modification of SDIM termed SDIM-\emph{logit}. Instead of training generative classifier from scratch, SDIM-\emph{logit} first takes as input the logits produced any given discriminative classifier, and generate logit representations; then a generative classifier is derived by imposing statistical constraints on logit representations. SDIM-\emph{logit} could inherit the performance of the discriminative classifier without loss. SDIM-\emph{logit} incurs a negligible number of additional parameters, and can be efficiently trained with base classifiers fixed. We perform \emph{classification with rejection}, where test samples whose class conditionals are smaller than pre-chosen thresholds will be rejected without predictions. Experiments on illegal inputs, including adversarial examples, samples with common corruptions, and out-of-distribution~(OOD) samples show that allowed to reject a portion of test samples, SDIM-\emph{logit} significantly improves the performance on the left test sets.

* 7 pages, 3 figures
##### Sparse Learning in reproducing kernel Hilbert space

Jan 03, 2019
Xin He, Junhui Wang

Sparse learning aims to learn the sparse structure of the true target function from the collected data, which plays a crucial role in high dimensional data analysis. This article proposes a unified and universal method for learning sparsity of M-estimators within a rich family of loss functions in a reproducing kernel Hilbert space (RKHS). The family of loss functions interested is very rich, including most commonly used ones in literature. More importantly, the proposed method is motivated by some nice properties in the induced RKHS, and is computationally efficient for large-scale data, and can be further improved through parallel computing. The asymptotic estimation and selection consistencies of the proposed method are established for a general loss function under mild conditions. It works for general loss function, admits general dependence structure, allows for efficient computation, and with theoretical guarantee. The superior performance of our proposed method is also supported by a variety of simulated examples and a real application in the human breast cancer study (GSE20194).

##### A Fast Training Algorithm for Deep Convolutional Fuzzy Systems with Application to Stock Index Prediction

Dec 07, 2018
Li-Xin Wang

A deep convolutional fuzzy system (DCFS) on a high-dimensional input space is a multi-layer connection of many low-dimensional fuzzy systems, where the input variables to the low-dimensional fuzzy systems are selected through a moving window (a convolution operator) across the input spaces of the layers. To design the DCFS based on input-output data pairs, we propose a bottom-up layer-by-layer scheme. Specifically, by viewing each of the first-layer fuzzy systems as a weak estimator of the output based only on a very small portion of the input variables, we can design these fuzzy systems using the WM Method. After the first-layer fuzzy systems are designed, we pass the data through the first layer and replace the inputs in the original data set by the corresponding outputs of the first layer to form a new data set, then we design the second-layer fuzzy systems based on this new data set in the same way as designing the first-layer fuzzy systems. Repeating this process we design the whole DCFS. Since the WM Method requires only one-pass of the data, this training algorithm for the DCFS is very fast. We apply the DCFS model with the training algorithm to predict a synthetic chaotic plus random time-series and the real Hang Seng Index of the Hong Kong stock market.

##### Gaussian-Chain Filters for Heavy-Tailed Noise with Application to Detecting Big Buyers and Big Sellers in Stock Market

May 09, 2014
Li-Xin Wang

We propose a new heavy-tailed distribution --- Gaussian-Chain (GC) distribution, which is inspirited by the hierarchical structures prevailing in social organizations. We determine the mean, variance and kurtosis of the Gaussian-Chain distribution to show its heavy-tailed property, and compute the tail distribution table to give specific numbers showing how heavy is the heavy-tails. To filter out the heavy-tailed noise, we construct two filters --- 2nd and 3rd-order GC filters --- based on the maximum likelihood principle. Simulation results show that the GC filters perform much better than the benchmark least-squares algorithm when the noise is heavy-tail distributed. Using the GC filters, we propose a trading strategy, named Ride-the-Mood, to follow the mood of the market by detecting the actions of the big buyers and the big sellers in the market based on the noisy, heavy-tailed price data. Application of the Ride-the-Mood strategy to five blue-chip Hong Kong stocks over the recent two-year period from April 2, 2012 to March 31, 2014 shows that their returns are higher than the returns of the benchmark Buy-and-Hold strategy and the Hang Seng Index Fund.

##### Application of Word2vec in Phoneme Recognition

Dec 19, 2019
Xin Feng, Lei Wang

In this paper, we present how to hybridize a Word2vec model and an attention-based end-to-end speech recognition model. We build a phoneme recognition system based on Listen, Attend and Spell model. And the phoneme recognition model uses a word2vec model to initialize the embedding matrix for the improvement of the performance, which can increase the distance among the phoneme vectors. At the same time, in order to solve the problem of overfitting in the 61 phoneme recognition model on TIMIT dataset, we propose a new training method. A 61-39 phoneme mapping comparison table is used to inverse map the phonemes of the dataset to generate more 61 phoneme training data. At the end of training, replace the dataset with a standard dataset for corrective training. Our model can achieve the best result under the TIMIT dataset which is 16.5% PER (Phoneme Error Rate).

##### Recognizing License Plates in Real-Time

Jun 11, 2019
Xuewen Yang, Xin Wang

License plate detection and recognition (LPDR) is of growing importance for enabling intelligent transportation and ensuring the security and safety of the cities. However, LPDR faces a big challenge in a practical environment. The license plates can have extremely diverse sizes, fonts and colors, and the plate images are usually of poor quality caused by skewed capturing angles, uneven lighting, occlusion, and blurring. In applications such as surveillance, it often requires fast processing. To enable real-time and accurate license plate recognition, in this work, we propose a set of techniques: 1) a contour reconstruction method along with edge-detection to quickly detect the candidate plates; 2) a simple zero-one-alternation scheme to effectively remove the fake top and bottom borders around plates to facilitate more accurate segmentation of characters on plates; 3) a set of techniques to augment the training data, incorporate SIFT features into the CNN network, and exploit transfer learning to obtain the initial parameters for more effective training; and 4) a two-phase verification procedure to determine the correct plate at low cost, a statistical filtering in the plate detection stage to quickly remove unwanted candidates, and the accurate CR results after the CR process to perform further plate verification without additional processing. We implement a complete LPDR system based on our algorithms. The experimental results demonstrate that our system can accurately recognize license plate in real-time. Additionally, it works robustly under various levels of illumination and noise, and in the presence of car movement. Compared to peer schemes, our system is not only among the most accurate ones but is also the fastest, and can be easily applied to other scenarios.

* License Plate Detection and Recognition, Computer Vision, Supervised Learning
##### Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization

Mar 19, 2019
Hesham Mostafa, Xin Wang

Deep neural networks are typically highly over-parameterized with pruning techniques able to remove a significant fraction of network parameters with little loss in accuracy. Recently, techniques based on dynamic re-allocation of non-zero parameters have emerged for training sparse networks directly without having to train a large dense model beforehand. We present a parameter re-allocation scheme that addresses the limitations of previous methods such as their high computational cost and the fixed number of parameters they allocate to each layer. We investigate the performance of these dynamic re-allocation methods in deep convolutional networks and show that our method outperforms previous static and dynamic parameterization methods, yielding the best accuracy for a given number of training parameters, and performing on par with networks obtained by iteratively pruning a trained dense model. We further investigated the mechanisms underlying the superior performance of the resulting sparse networks. We found that neither the structure, nor the initialization of the sparse networks discovered by our parameter reallocation scheme are sufficient to explain their superior generalization performance. Rather, it is the continuous exploration of different sparse network structures during training that is critical to effective learning. We show that it is more fruitful to explore these structural degrees of freedom than to add extra parameters to the network.

##### Deep Learning for Real-Time Crime Forecasting and its Ternarization

Real-time crime forecasting is important. However, accurate prediction of when and where the next crime will happen is difficult. No known physical model provides a reasonable approximation to such a complex system. Historical crime data are sparse in both space and time and the signal of interests is weak. In this work, we first present a proper representation of crime data. We then adapt the spatial temporal residual network on the well represented data to predict the distribution of crime in Los Angeles at the scale of hours in neighborhood-sized parcels. These experiments as well as comparisons with several existing approaches to prediction demonstrate the superiority of the proposed model in terms of accuracy. Finally, we present a ternarization technique to address the resource consumption issue for its deployment in real world. This work is an extension of our short conference proceeding paper [Wang et al, Arxiv 1707.03340].

* 14 pages, 7 figures
##### Scalable kernel-based variable selection with sparsistency

Feb 27, 2018
Xin He, Junhui Wang, Shaogao Lv

Variable selection is central to high-dimensional data analysis, and various algorithms have been developed. Ideally, a variable selection algorithm shall be flexible, scalable, and with theoretical guarantee, yet most existing algorithms cannot attain these properties at the same time. In this article, a three-step variable selection algorithm is developed, involving kernel-based estimation of the regression function and its gradient functions as well as a hard thresholding. Its key advantage is that it assumes no explicit model assumption, admits general predictor effects, allows for scalable computation, and attains desirable asymptotic sparsistency. The proposed algorithm can be adapted to any reproducing kernel Hilbert space (RKHS) with different kernel functions, and can be extended to interaction selection with slight modification. Its computational cost is only linear in the data dimension, and can be further improved through parallel computing. The sparsistency of the proposed algorithm is established for general RKHS under mild conditions, including linear and Gaussian kernels as special cases. Its effectiveness is also supported by a variety of simulated and real examples.

* 27 pages, 5 figures
##### A multi-task learning model for malware classification with useful file access pattern from API call sequence

Oct 19, 2016
Xin Wang, Siu Ming Yiu

Based on API call sequences, semantic-aware and machine learning (ML) based malware classifiers can be built for malware detection or classification. Previous works concentrate on crafting and extracting various features from malware binaries, disassembled binaries or API calls via static or dynamic analysis and resorting to ML to build classifiers. However, they tend to involve too much feature engineering and fail to provide interpretability. We solve these two problems with the recent advances in deep learning: 1) RNN-based autoencoders (RNN-AEs) can automatically learn low-dimensional representation of a malware from its raw API call sequence. 2) Multiple decoders can be trained under different supervisions to give more information, other than the class or family label of a malware. Inspired by the works of document classification and automatic sentence summarization, each API call sequence can be regarded as a sentence. In this paper, we make the first attempt to build a multi-task malware learning model based on API call sequences. The model consists of two decoders, one for malware classification and one for $\emph{file access pattern}$ (FAP) generation given the API call sequence of a malware. We base our model on the general seq2seq framework. Experiments show that our model can give competitive classification results as well as insightful FAP information.

##### Deep Learning for Learning Graph Representations

Jan 02, 2020
Wenwu Zhu, Xin Wang, Peng Cui

Mining graph data has become a popular research topic in computer science and has been widely studied in both academia and industry given the increasing amount of network data in the recent years. However, the huge amount of network data has posed great challenges for efficient analysis. This motivates the advent of graph representation which maps the graph into a low-dimension vector space, keeping original graph structure and supporting graph inference. The investigation on efficient representation of a graph has profound theoretical significance and important realistic meaning, we therefore introduce some basic ideas in graph representation/network embedding as well as some representative models in this chapter.

* 51 pages, 8 figures
##### Effect of choice of probability distribution, randomness, and search methods for alignment modeling in sequence-to-sequence text-to-speech synthesis using hard alignment

Oct 28, 2019
Yusuke Yasuda, Xin Wang, Junichi Yamagishi

Sequence-to-sequence text-to-speech (TTS) is dominated by soft-attention-based methods. Recently, hard-attention-based methods have been proposed to prevent fatal alignment errors, but their sampling method of discrete alignment is poorly investigated. This research investigates various combinations of sampling methods and probability distributions for alignment transition modeling in a hard-alignment-based sequence-to-sequence TTS method called SSNT-TTS. We clarify the common sampling methods of discrete variables including greedy search, beam search, and random sampling from a Bernoulli distribution in a more general way. Furthermore, we introduce the binary Concrete distribution to model discrete variables more properly. The results of a listening test shows that deterministic search is more preferable than stochastic search, and the binary Concrete distribution is robust with stochastic search for natural alignment transition.

* Submitted to ICASSP 2020
##### Multi-modal Deep Analysis for Multimedia

Oct 11, 2019
Wenwu Zhu, Xin Wang, Hongzhi Li

With the rapid development of Internet and multimedia services in the past decade, a huge amount of user-generated and service provider-generated multimedia data become available. These data are heterogeneous and multi-modal in nature, imposing great challenges for processing and analyzing them. Multi-modal data consist of a mixture of various types of data from different modalities such as texts, images, videos, audios etc. In this article, we present a deep and comprehensive overview for multi-modal analysis in multimedia. We introduce two scientific research problems, data-driven correlational representation and knowledge-guided fusion for multimedia analysis. To address the two scientific problems, we investigate them from the following aspects: 1) multi-modal correlational representation: multi-modal fusion of data across different modalities, and 2) multi-modal data and knowledge fusion: multi-modal fusion of data with domain knowledge. More specifically, on data-driven correlational representation, we highlight three important categories of methods, such as multi-modal deep representation, multi-modal transfer learning, and multi-modal hashing. On knowledge-guided fusion, we discuss the approaches for fusing knowledge with data and four exemplar applications that require various kinds of domain knowledge, including multi-modal visual question answering, multi-modal video summarization, multi-modal visual pattern mining and multi-modal recommendation. Finally, we bring forward our insights and future research directions.

* 25 pages, 39 figures, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
##### Additive Powers-of-Two Quantization: A Non-uniform Discretization for Neural Networks

Sep 28, 2019
Yuhang Li, Xin Dong, Wei Wang

We proposed Additive Powers-of-Two~(APoT) quantization, an efficient non-uniform quantization scheme that attends to the bell-shaped and long-tailed distribution of weights in neural networks. By constraining all quantization levels as a sum of several Powers-of-Two terms, APoT quantization enjoys overwhelming efficiency of computation and a good match with weights' distribution. A simple reparameterization on clipping function is applied to generate better-defined gradient for updating of optimal clipping threshold. Moreover, weight normalization is presented to refine the input distribution of weights to be more stable and consistent. Experimental results show that our proposed method outperforms state-of-the-art methods, and is even competitive with the full-precision models demonstrating the effectiveness of our proposed APoT quantization. For example, our 3-bit quantized ResNet-34 on ImageNet only drops 0.3% Top-1 and 0.2% Top-5 accuracy without bells and whistles, while the computation of our model is approximately 2x less than uniformly quantized neural networks.

* quantization, efficient neural network
##### Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments

Aug 30, 2019
Yusuke Yasuda, Xin Wang, Junichi Yamagishi

End-to-end text-to-speech (TTS) synthesis is a method that directly converts input text to output acoustic features using a single network. A recent advance of end-to-end TTS is due to a key technique called attention mechanisms, and all successful methods proposed so far have been based on soft attention mechanisms. However, although network structures are becoming increasingly complex, end-to-end TTS systems with soft attention mechanisms may still fail to learn and to predict accurate alignment between the input and output. This may be because the soft attention mechanisms are too flexible. Therefore, we propose an approach that has more explicit but natural constraints suitable for speech signals to make alignment learning and prediction of end-to-end TTS systems more robust. The proposed system, with the constrained alignment scheme borrowed from segment-to-segment neural transduction (SSNT), directly calculates the joint probability of acoustic features and alignment given an input text. The alignment is designed to be hard and monotonically increase by considering the speech nature, and it is treated as a latent variable and marginalized during training. During prediction, both the alignment and acoustic features can be generated from the probabilistic distributions. The advantages of our approach are that we can simplify many modules for the soft attention and that we can train the end-to-end TTS model using a single likelihood function. As far as we know, our approach is the first end-to-end TTS without a soft attention mechanism.

* To be appeared at SSW10
##### Self-Supervised Dialogue Learning

Jun 30, 2019
Jiawei Wu, Xin Wang, William Yang Wang

The sequential order of utterances is often meaningful in coherent dialogues, and the order changes of utterances could lead to low-quality and incoherent conversations. We consider the order information as a crucial supervised signal for dialogue learning, which, however, has been neglected by many previous dialogue systems. Therefore, in this paper, we introduce a self-supervised learning task, inconsistent order detection, to explicitly capture the flow of conversation in dialogues. Given a sampled utterance pair triple, the task is to predict whether it is ordered or misordered. Then we propose a sampling-based self-supervised network SSN to perform the prediction with sampled triple references from previous dialogue history. Furthermore, we design a joint learning framework where SSN can guide the dialogue systems towards more coherent and relevant dialogue learning through adversarial training. We demonstrate that the proposed methods can be applied to both open-domain and task-oriented dialogue scenarios, and achieve the new state-of-the-art performance on the OpenSubtitiles and Movie-Ticket Booking datasets.

* 11pages, 2 figures, accepted to ACL 2019
##### Neural source-filter waveform models for statistical parametric speech synthesis

Apr 27, 2019
Xin Wang, Shinji Takaki, Junichi Yamagishi

Neural waveform models such as WaveNet have demonstrated better performance than conventional vocoders for statistical parametric speech synthesis. As an autoregressive (AR) model, WaveNet is limited by a slow sequential waveform generation process. Some new models that use the inverse-autoregressive flow (IAF) can generate a whole waveform in a one-shot manner. However, these IAF-based models require sequential transformation during training, which severely slows down the training speed. Other models such as Parallel WaveNet and ClariNet bring together the benefits of AR and IAF-based models and train an IAF model by transferring the knowledge from a pre-trained AR teacher to an IAF student without any sequential transformation. However, both models require additional training criteria, and their implementation is prohibitively complicated. We propose a framework for neural source-filter (NSF) waveform modeling without AR nor IAF-based approaches. This framework requires only three components for waveform generation: a source module that generates a sine-based signal as excitation, a non-AR dilated-convolution-based filter module that transforms the excitation into a waveform, and a conditional module that pre-processes the acoustic features for the source and filer modules. This framework minimizes spectral-amplitude distances for model training, which can be efficiently implemented by using short-time Fourier transform routines. Under this framework, we designed three NSF models and compared them with WaveNet. It was demonstrated that the NSF models generated waveforms at least 100 times faster than WaveNet, and the quality of the synthetic speech from the best NSF model was better than or equally good as that from WaveNet.

* Submitted to IEEE/ACM TASLP
##### Extract and Edit: An Alternative to Back-Translation for Unsupervised Neural Machine Translation

Apr 04, 2019
Jiawei Wu, Xin Wang, William Yang Wang

The overreliance on large parallel corpora significantly limits the applicability of machine translation systems to the majority of language pairs. Back-translation has been dominantly used in previous approaches for unsupervised neural machine translation, where pseudo sentence pairs are generated to train the models with a reconstruction loss. However, the pseudo sentences are usually of low quality as translation errors accumulate during training. To avoid this fundamental issue, we propose an alternative but more effective approach, extract-edit, to extract and then edit real sentences from the target monolingual corpora. Furthermore, we introduce a comparative translation loss to evaluate the translated target sentences and thus train the unsupervised translation systems. Experiments show that the proposed approach consistently outperforms the previous state-of-the-art unsupervised machine translation systems across two benchmarks (English-French and English-German) and two low-resource language pairs (English-Romanian and English-Russian) by more than 2 (up to 3.63) BLEU points.

* 11 pages, 3 figures. Accepted to NAACL 2019