Models, code, and papers for "Weiran Wang":
We adopt a multi-view approach for analyzing two knowledge transfer settings---learning using privileged information (LUPI) and distillation---in a common framework. Under reasonable assumptions about the complexities of hypothesis spaces, and being optimistic about the expected loss achievable by the student (in distillation) and a transformed teacher predictor (in LUPI), we show that encouraging agreement between the teacher and the student leads to reduced search space. As a result, improved convergence rate can be obtained with regularized empirical risk minimization.
We study the problem of column selection in large-scale kernel canonical correlation analysis (KCCA) using the Nystr\"om approximation, where one approximates two positive semi-definite kernel matrices using "landmark" points from the training set. When building low-rank kernel approximations in KCCA, previous work mostly samples the landmarks uniformly at random from the training set. We propose novel strategies for sampling the landmarks non-uniformly based on a version of statistical leverage scores recently developed for kernel ridge regression. We study the approximation accuracy of the proposed non-uniform sampling strategy, develop an incremental algorithm that explores the path of approximation ranks and facilitates efficient model selection, and derive the kernel stability of out-of-sample mapping for our method. Experimental results on both synthetic and real-world datasets demonstrate the promise of our method.
We provide a simple and efficient algorithm for the projection operator for weighted $\ell_1$-norm regularization subject to a sum constraint, together with an elementary proof. The implementation of the proposed algorithm can be downloaded from the author's homepage.
We study stochastic optimization of nonconvex loss functions, which are typical objectives for training neural networks. We propose stochastic approximation algorithms which optimize a series of regularized, nonlinearized losses on large minibatches of samples, using only first-order gradient information. Our algorithms provably converge to an approximate critical point of the expected objective with faster rates than minibatch stochastic gradient descent, and facilitate better parallelization by allowing larger minibatches.
Kernel canonical correlation analysis (KCCA) is a nonlinear multi-view representation learning technique with broad applicability in statistics and machine learning. Although there is a closed-form solution for the KCCA objective, it involves solving an $N\times N$ eigenvalue system where $N$ is the training set size, making its computational requirements in both memory and time prohibitive for large-scale problems. Various approximation techniques have been developed for KCCA. A commonly used approach is to first transform the original inputs to an $M$-dimensional random feature space so that inner products in the feature space approximate kernel evaluations, and then apply linear CCA to the transformed inputs. In many applications, however, the dimensionality $M$ of the random feature space may need to be very large in order to obtain a sufficiently good approximation; it then becomes challenging to perform the linear CCA step on the resulting very high-dimensional data matrices. We show how to use a stochastic optimization algorithm, recently proposed for linear CCA and its neural-network extension, to further alleviate the computation requirements of approximate KCCA. This approach allows us to run approximate KCCA on a speech dataset with $1.4$ million training samples and a random feature space of dimensionality $M=100000$ on a typical workstation.
We provide a simple and efficient algorithm for computing the Euclidean projection of a point onto the capped simplex---a simplex with an additional uniform bound on each coordinate---together with an elementary proof. Both the MATLAB and C++ implementations of the proposed algorithm can be downloaded at https://eng.ucmerced.edu/people/wwang5.
While deep learning based end-to-end automatic speech recognition (ASR) systems have greatly simplified modeling pipelines, they suffer from the data sparsity issue. In this work, we propose a self-training method with an end-to-end system for semi-supervised ASR. Starting from a Connectionist Temporal Classification (CTC) system trained on the supervised data, we iteratively generate pseudo-labels on a mini-batch of unsupervised utterances with the current model, and use the pseudo-labels to augment the supervised data for immediate model update. Our method retains the simplicity of end-to-end ASR systems, and can be seen as performing alternating optimization over a well-defined learning objective. We also perform empirical investigations of our method, regarding the effect of data augmentation, decoding beamsize for pseudo-label generation, and freshness of pseudo-labels. On a commonly used semi-supervised ASR setting with the WSJ corpus, our method gives 14.4% relative WER improvement over a carefully-trained base system with data augmentation, reducing the performance gap between the base system and the oracle system by 50%.
We present and analyze an approach for distributed stochastic optimization which is statistically optimal and achieves near-linear speedups (up to logarithmic factors). Our approach allows a communication-memory tradeoff, with either logarithmic communication but linear memory, or polynomial communication and a corresponding polynomial reduction in required memory. This communication-memory tradeoff is achieved through minibatch-prox iterations (minibatch passive-aggressive updates), where a subproblem on a minibatch is solved at each iteration. We provide a novel analysis for such a minibatch-prox procedure which achieves the statistical optimal rate regardless of minibatch size and smoothness, thus significantly improving on prior work.
Recent work has begun exploring neural acoustic word embeddings---fixed-dimensional vector representations of arbitrary-length speech segments corresponding to words. Such embeddings are applicable to speech retrieval and recognition tasks, where reasoning about whole words may make it possible to avoid ambiguous sub-word representations. The main idea is to map acoustic sequences to fixed-dimensional vectors such that examples of the same word are mapped to similar vectors, while different-word examples are mapped to very different vectors. In this work we take a multi-view approach to learning acoustic word embeddings, in which we jointly learn to embed acoustic sequences and their corresponding character sequences. We use deep bidirectional LSTM embedding models and multi-view contrastive losses. We study the effect of different loss variants, including fixed-margin and cost-sensitive losses. Our acoustic word embeddings improve over previous approaches for the task of word discrimination. We also present results on other tasks that are enabled by the multi-view approach, including cross-view word discrimination and word similarity.
Practitioners often need to build ASR systems for new use cases in a short amount of time, given limited in-domain data. While recently developed end-to-end methods largely simplify the modeling pipelines, they still suffer from the data sparsity issue. In this work, we explore a few simple-to-implement techniques for building online ASR systems in an end-to-end fashion, with a small amount of transcribed data in the target domain. These techniques include data augmentation in the target domain, domain adaptation using models previously trained on a large source domain, and knowledge distillation on non-transcribed target domain data; they are applicable in real scenarios with different types of resources. Our experiments demonstrate that each technique is independently useful in the low-resource setting, and combining them yields significant improvement of the online ASR performance in the target domain.
Acoustic Scene Classification (ASC) is a challenging task, as a single scene may involve multiple events that contain complex sound patterns. For example, a cooking scene may contain several sound sources including silverware clinking, chopping, frying, etc. What complicates ASC more is that classes of different activities could have overlapping sounds patterns (e.g. both cooking and dishwashing could have silverware clinking sound). In this paper, we propose a multi-head attention network to model the complex temporal input structures for ASC. The proposed network takes the audio's time-frequency representation as input, and it leverages standard VGG plus LSTM layers to extract high-level feature representation. Further more, it applies multiple attention heads to summarize various patterns of sound events into fixed dimensional representation, for the purpose of final scene classification. The whole network is trained in an end-to-end fashion with back-propagation. Experimental results confirm that our model discovers meaningful sound patterns through the attention mechanism, without using explicit supervision in the alignment. We evaluated our proposed model using DCASE 2018 Task 5 dataset, and achieved competitive performance on par with previous winner's results.
Previous work has shown that it is possible to improve speech recognition by learning acoustic features from paired acoustic-articulatory data, for example by using canonical correlation analysis (CCA) or its deep extensions. One limitation of this prior work is that the learned feature models are difficult to port to new datasets or domains, and articulatory data is not available for most speech corpora. In this work we study the problem of acoustic feature learning in the setting where we have access to an external, domain-mismatched dataset of paired speech and articulatory measurements, either with or without labels. We develop methods for acoustic feature learning in these settings, based on deep variational CCA and extensions that use both source and target domain data and labels. Using this approach, we improve phonetic recognition accuracies on both TIMIT and Wall Street Journal and analyze a number of design choices.
We study the problem of acoustic feature learning in the setting where we have access to another (non-acoustic) modality for feature learning but not at test time. We use deep variational canonical correlation analysis (VCCA), a recently proposed deep generative method for multi-view representation learning. We also extend VCCA with improved latent variable priors and with adversarial learning. Compared to other techniques for multi-view feature learning, VCCA's advantages include an intuitive latent variable interpretation and a variational lower bound objective that can be trained end-to-end efficiently. We compare VCCA and its extensions with previous feature learning methods on the University of Wisconsin X-ray Microbeam Database, and show that VCCA-based feature learning improves over previous methods for speaker-independent phonetic recognition.
Canonical correlation analysis (CCA) is a classical representation learning technique for finding correlated variables in multi-view data. Several nonlinear extensions of the original linear CCA have been proposed, including kernel and deep neural network methods. These approaches seek maximally correlated projections among families of functions, which the user specifies (by choosing a kernel or neural network structure), and are computationally demanding. Interestingly, the theory of nonlinear CCA, without functional restrictions, had been studied in the population setting by Lancaster already in the 1950s, but these results have not inspired practical algorithms. We revisit Lancaster's theory to devise a practical algorithm for nonparametric CCA (NCCA). Specifically, we show that the solution can be expressed in terms of the singular value decomposition of a certain operator associated with the joint density of the views. Thus, by estimating the population density from data, NCCA reduces to solving an eigenvalue system, superficially like kernel CCA but, importantly, without requiring the inversion of any kernel matrix. We also derive a partially linear CCA (PLCCA) variant in which one of the views undergoes a linear projection while the other is nonparametric. Using a kernel density estimate based on a small number of nearest neighbors, our NCCA and PLCCA algorithms are memory-efficient, often run much faster, and perform better than kernel CCA and comparable to deep CCA.
Recent studies have been revisiting whole words as the basic modelling unit in speech recognition and query applications, instead of phonetic units. Such whole-word segmental systems rely on a function that maps a variable-length speech segment to a vector in a fixed-dimensional space; the resulting acoustic word embeddings need to allow for accurate discrimination between different word types, directly in the embedding space. We compare several old and new approaches in a word discrimination task. Our best approach uses side information in the form of known word pairs to train a Siamese convolutional neural network (CNN): a pair of tied networks that take two speech segments as input and produce their embeddings, trained with a hinge loss that separates same-word pairs and different-word pairs by some margin. A word classifier CNN performs similarly, but requires much stronger supervision. Both types of CNNs yield large improvements over the best previously published results on the word discrimination task.
In addition to finding meaningful clusters, centroid-based clustering algorithms such as K-means or mean-shift should ideally find centroids that are valid patterns in the input space, representative of data in their cluster. This is challenging with data having a nonconvex or manifold structure, as with images or text. We introduce a new algorithm, Laplacian K-modes, which naturally combines three powerful ideas in clustering: the explicit use of assignment variables (as in K-means); the estimation of cluster centroids which are modes of each cluster's density estimate (as in mean-shift); and the regularizing effect of the graph Laplacian, which encourages similar assignments for nearby points (as in spectral clustering). The optimization algorithm alternates an assignment step, which is a convex quadratic program, and a mean-shift step, which separates for each cluster centroid. The algorithm finds meaningful density estimates for each cluster, even with challenging problems where the clusters have manifold structure, are highly nonconvex or in high dimension. It also provides centroids that are valid patterns, truly representative of their cluster (unlike K-means), and an out-of-sample mapping that predicts soft assignments for a new point.
Dimensionality reduction (DR) is often used as a preprocessing step in classification, but usually one first fixes the DR mapping, possibly using label information, and then learns a classifier (a filter approach). Best performance would be obtained by optimizing the classification error jointly over DR mapping and classifier (a wrapper approach), but this is a difficult nonconvex problem, particularly with nonlinear DR. Using the method of auxiliary coordinates, we give a simple, efficient algorithm to train a combination of nonlinear DR and a classifier, and apply it to a RBF mapping with a linear SVM. This alternates steps where we train the RBF mapping and a linear SVM as usual regression and classification, respectively, with a closed-form step that coordinates both. The resulting nonlinear low-dimensional classifier achieves classification errors competitive with the state-of-the-art but is fast at training and testing, and allows the user to trade off runtime for classification accuracy easily. We then study the role of nonlinear DR in linear classification, and the interplay between the DR mapping, the number of latent dimensions and the number of classes. When trained jointly, the DR mapping takes an extreme role in eliminating variation: it tends to collapse classes in latent space, erasing all manifold structure, and lay out class centroids so they are linearly separable with maximum margin.
We consider the problem of learning soft assignments of $N$ items to $K$ categories given two sources of information: an item-category similarity matrix, which encourages items to be assigned to categories they are similar to (and to not be assigned to categories they are dissimilar to), and an item-item similarity matrix, which encourages similar items to have similar assignments. We propose a simple quadratic programming model that captures this intuition. We give necessary conditions for its solution to be unique, define an out-of-sample mapping, and derive a simple, effective training algorithm based on the alternating direction method of multipliers. The model predicts reasonable assignments from even a few similarity values, and can be seen as a generalization of semisupervised learning. It is particularly useful when items naturally belong to multiple categories, as for example when annotating documents with keywords or pictures with tags, with partially tagged items, or when the categories have complex interrelations (e.g. hierarchical) that are unknown.
We provide an elementary proof of a simple, efficient algorithm for computing the Euclidean projection of a point onto the probability simplex. We also show an application in Laplacian K-modes clustering.
Many clustering algorithms exist that estimate a cluster centroid, such as K-means, K-medoids or mean-shift, but no algorithm seems to exist that clusters data by returning exactly K meaningful modes. We propose a natural definition of a K-modes objective function by combining the notions of density and cluster assignment. The algorithm becomes K-means and K-medoids in the limit of very large and very small scales. Computationally, it is slightly slower than K-means but much faster than mean-shift or K-medoids. Unlike K-means, it is able to find centroids that are valid patterns, truly representative of a cluster, even with nonconvex clusters, and appears robust to outliers and misspecification of the scale and number of clusters.