Research papers and code for "Yu Cao":
This paper presents a novel oversampling technique that addresses highly imbalanced benchmark and electroencephalogram (EEG) data distributions. Presently, conventional machine learning technologies do not adequately address imbalanced data with an anomalous class distribution and underrepresented data. To balance the class distributions, an adaptive subspace self-organizing map (ASSOM) that combines a local mapping scheme and the globally competitive rule is proposed to artificially generate synthetic samples that focus on minority class samples and its application in EEG. The ASSOM is configured with feature-invariant characteristics, including translation, scaling, and rotation, and it retains the independence of the basis vectors in each module. Specifically, basis vectors that are generated via each ASSOM module can avoid generating repeated representative features that only increase the computational load. Several benchmark experimental results demonstrate that the proposed ASSOM method incorporating a supervised learning approach could be superior to other existing oversampling techniques, and two EEG applications present the improvement of classification accuracy using the proposed ASSOM method.

* Submitting to IEEE Transactions on Cybernetics
Click to Read Paper and Get Code
We present a 3D Convolutional Neural Networks (CNNs) based single shot detector for spatial-temporal action detection tasks. Our model includes: (1) two short-term appearance and motion streams, with single RGB and optical flow image input separately, in order to capture the spatial and temporal information for the current frame; (2) two long-term 3D ConvNet based stream, working on sequences of continuous RGB and optical flow images to capture the context from past frames. Our model achieves strong performance for action detection in video and can be easily integrated into any current two-stream action detection methods. We report a frame-mAP of 71.30% on the challenging UCF101-24 actions dataset, achieving the state-of-the-art result of the one-stage methods. To the best of our knowledge, our work is the first system that combined 3D CNN and SSD in action detection tasks.

* 26th IEEE International Conference on Image Processing (ICIP 2019)
Click to Read Paper and Get Code
Today a canonical approach to reduce the computation cost of Deep Neural Networks (DNNs) is to pre-define an over-parameterized model before training to guarantee the learning capacity, and then prune unimportant learning units (filters and neurons) during training to improve model compactness. We argue it is unnecessary to introduce redundancy at the beginning of the training but then reduce redundancy for the ultimate inference model. In this paper, we propose a Continuous Growth and Pruning (CGaP) scheme to minimize the redundancy from the beginning. CGaP starts the training from a small network seed, then expands the model continuously by reinforcing important learning units, and finally prunes the network to obtain a compact and accurate model. As the growth phase favors important learning units, CGaP provides a clear learning purpose to the pruning phase. Experimental results on representative datasets and DNN architectures demonstrate that CGaP outperforms previous pruning-only approaches that deal with pre-defined structures. For VGG-19 on CIFAR-100 and SVHN datasets, CGaP reduces the number of parameters by 78.9% and 85.8%, FLOPs by 53.2% and 74.2%, respectively; For ResNet-110 On CIFAR-10, CGaP reduces 64.0% number of parameters and 63.3% FLOPs.

Click to Read Paper and Get Code
Multi-hop reasoning question answering requires deep comprehension of relationships between various documents and queries. We propose a Bi-directional Attention Entity Graph Convolutional Network (BAG), leveraging relationships between nodes in an entity graph and attention information between a query and the entity graph, to solve this task. Graph convolutional networks are used to obtain a relation-aware representation of nodes for entity graphs built from documents with multi-level features. Bidirectional attention is then applied on graphs and queries to generate a query-aware nodes representation, which will be used for the final prediction. Experimental evaluation shows BAG achieves state-of-the-art accuracy performance on the QAngaroo WIKIHOP dataset.

* 5 pages, 1 figure, accepted short paper on NAACL-HLT 2019
Click to Read Paper and Get Code
In recent years, Deep Learning (DL) has found great success in domains such as multimedia understanding. However, the complex nature of multimedia data makes it difficult to develop DL-based software. The state-of-the art tools, such as Caffe, TensorFlow, Torch7, and CNTK, while are successful in their applicable domains, are programming libraries with fixed user interface, internal representation, and execution environment. This makes it difficult to implement portable and customized DL applications. In this paper, we present DeepDSL, a domain specific language (DSL) embedded in Scala, that compiles deep networks written in DeepDSL to Java source code. Deep DSL provides (1) intuitive constructs to support compact encoding of deep networks; (2) symbolic gradient derivation of the networks; (3) static analysis for memory consumption and error detection; and (4) DSL-level optimization to improve memory and runtime efficiency. DeepDSL programs are compiled into compact, efficient, customizable, and portable Java source code, which operates the CUDA and CUDNN interfaces running on Nvidia GPU via a Java Native Interface (JNI) library. We evaluated DeepDSL with a number of popular DL networks. Our experiments show that the compiled programs have very competitive runtime performance and memory efficiency compared to the existing libraries.

Click to Read Paper and Get Code
With the fast development of various positioning techniques such as Global Position System (GPS), mobile devices and remote sensing, spatio-temporal data has become increasingly available nowadays. Mining valuable knowledge from spatio-temporal data is critically important to many real world applications including human mobility understanding, smart transportation, urban planning, public safety, health care and environmental management. As the number, volume and resolution of spatio-temporal datasets increase rapidly, traditional data mining methods, especially statistics based methods for dealing with such data are becoming overwhelmed. Recently, with the advances of deep learning techniques, deep leaning models such as convolutional neural network (CNN) and recurrent neural network (RNN) have enjoyed considerable success in various machine learning tasks due to their powerful hierarchical feature learning ability in both spatial and temporal domains, and have been widely applied in various spatio-temporal data mining (STDM) tasks such as predictive learning, representation learning, anomaly detection and classification. In this paper, we provide a comprehensive survey on recent progress in applying deep learning techniques for STDM. We first categorize the types of spatio-temporal data and briefly introduce the popular deep learning models that are used in STDM. Then a framework is introduced to show a general pipeline of the utilization of deep learning models for STDM. Next we classify existing literatures based on the types of ST data, the data mining tasks, and the deep learning models, followed by the applications of deep learning for STDM in different domains including transportation, climate science, human mobility, location based social network, crime analysis, and neuroscience. Finally, we conclude the limitations of current research and point out future research directions.

* arXiv admin note: text overlap with arXiv:1711.04710 by other authors
Click to Read Paper and Get Code
Multi-view high-dimensional data become increasingly popular in the big data era. Feature selection is a useful technique for alleviating the curse of dimensionality in multi-view learning. In this paper, we study unsupervised feature selection for multi-view data, as class labels are usually expensive to obtain. Traditional feature selection methods are mostly designed for single-view data and cannot fully exploit the rich information from multi-view data. Existing multi-view feature selection methods are usually based on noisy cluster labels which might not preserve sufficient information from multi-view data. To better utilize multi-view information, we propose a method, CDMA-FS, to select features for each view by performing alignment on a cross diffused matrix. We formulate it as a constrained optimization problem and solve it using Quasi-Newton based method. Experiments results on four real-world datasets show that the proposed method is more effective than the state-of-the-art methods in multi-view setting.

* 8 pages
Click to Read Paper and Get Code
With rapid advances in neuroimaging techniques, the research on brain disorder identification has become an emerging area in the data mining community. Brain disorder data poses many unique challenges for data mining research. For example, the raw data generated by neuroimaging experiments is in tensor representations, with typical characteristics of high dimensionality, structural complexity and nonlinear separability. Furthermore, brain connectivity networks can be constructed from the tensor data, embedding subtle interactions between brain regions. Other clinical measures are usually available reflecting the disease status from different perspectives. It is expected that integrating complementary information in the tensor data and the brain network data, and incorporating other clinical parameters will be potentially transformative for investigating disease mechanisms and for informing therapeutic interventions. Many research efforts have been devoted to this area. They have achieved great success in various applications, such as tensor-based modeling, subgraph pattern mining, multi-view feature analysis. In this paper, we review some recent data mining methods that are used for analyzing brain disorders.

Click to Read Paper and Get Code
Collaborative filtering is an effective recommendation approach in which the preference of a user on an item is predicted based on the preferences of other users with similar interests. A big challenge in using collaborative filtering methods is the data sparsity problem which often arises because each user typically only rates very few items and hence the rating matrix is extremely sparse. In this paper, we address this problem by considering multiple collaborative filtering tasks in different domains simultaneously and exploiting the relationships between domains. We refer to it as a multi-domain collaborative filtering (MCF) problem. To solve the MCF problem, we propose a probabilistic framework which uses probabilistic matrix factorization to model the rating problem in each domain and allows the knowledge to be adaptively transferred across different domains by automatically learning the correlation between domains. We also introduce the link function for different domains to correct their biases. Experiments conducted on several real-world applications demonstrate the effectiveness of our methods when compared with some representative methods.

* Appears in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010)
Click to Read Paper and Get Code
There is an increasing need of continual learning in dynamic systems, such as the self-driving vehicle, the surveillance drone, and the robotic system. Such a system requires learning from the data stream, training the model to preserve previous information and adapt to a new task, and generating a single-headed vector for future inference. Different from previous approaches with dynamic structures, this work focuses on a single network and model segmentation to prevent catastrophic forgetting. Leveraging the redundant capacity of a single network, model parameters for each task are separated into two groups: one important group which is frozen to preserve current knowledge, and secondary group to be saved (not pruned) for a future learning. A fixed-size memory containing a small amount of previously seen data is further adopted to assist the training. Without additional regularization, the simple yet effective approach of PST successfully incorporates multiple tasks and achieves the state-of-the-art accuracy in the single-head evaluation on CIFAR-10 and CIFAR-100 datasets. Moreover, the segmented training significantly improves computation efficiency in continual learning.

Click to Read Paper and Get Code
Deep Neural Networks (DNNs) on hardware is facing excessive computation cost due to the massive number of parameters. A typical training pipeline to mitigate over-parameterization is to pre-define a DNN structure first with redundant learning units (filters and neurons) under the goal of high accuracy, then to prune redundant learning units after training with the purpose of efficient inference. We argue that it is sub-optimal to introduce redundancy into training for the purpose of reducing redundancy later in inference. Moreover, the fixed network structure further results in poor adaption to dynamic tasks, such as lifelong learning. In contrast, structural plasticity plays an indispensable role in mammalian brains to achieve compact and accurate learning. Throughout the lifetime, active connections are continuously created while those no longer important are degenerated. Inspired by such observation, we propose a training scheme, namely Continuous Growth and Pruning (CGaP), where we start the training from a small network seed, then literally execute continuous growth by adding important learning units and finally prune secondary ones for efficient inference. The inference model generated from CGaP is sparse in the structure, largely decreasing the inference power and latency when deployed on hardware platforms. With popular DNN structures on representative datasets, the efficacy of CGaP is benchmarked by both algorithm simulation and architectural modeling on Field-programmable Gate Arrays (FPGA). For example, CGaP decreases the FLOPs, model size, DRAM access energy and inference latency by 63.3%, 64.0%, 11.8% and 40.2%, respectively, for ResNet-110 on CIFAR-10.

Click to Read Paper and Get Code
This paper introduces a deep learning enabled generative sensing framework which integrates low-end sensors with computational intelligence to attain a high recognition accuracy on par with that attained with high-end sensors. The proposed generative sensing framework aims at transforming low-end, low-quality sensor data into higher quality sensor data in terms of achieved classification accuracy. The low-end data can be transformed into higher quality data of the same modality or into data of another modality. Different from existing methods for image generation, the proposed framework is based on discriminative models and targets to maximize the recognition accuracy rather than a similarity measure. This is achieved through the introduction of selective feature regeneration in a deep neural network (DNN). The proposed generative sensing will essentially transform low-quality sensor data into high-quality information for robust perception. Results are presented to illustrate the performance of the proposed framework.

* 5 pages, Submitted to IEEE MIPR 2018
Click to Read Paper and Get Code
Many people use Yelp to find a good restaurant. Nonetheless, with only an overall rating for each restaurant, Yelp offers not enough information for independently judging its various aspects such as environment, service or flavor. In this paper, we introduced a machine learning based method to characterize such aspects for particular types of restaurants. The main approach used in this paper is to use a support vector machine (SVM) model to decipher the sentiment tendency of each review from word frequency. Word scores generated from the SVM models are further processed into a polarity index indicating the significance of each word for special types of restaurant. Customers overall tend to express more sentiment regarding service. As for the distinction between different cuisines, results that match the common sense are obtained: Japanese cuisines are usually fresh, some French cuisines are overpriced while Italian Restaurants are often famous for their pizzas.

* 6 pages
Click to Read Paper and Get Code
Colorization of grayscale images has been a hot topic in computer vision. Previous research mainly focuses on producing a colored image to match the original one. However, since many colors share the same gray value, an input grayscale image could be diversely colored while maintaining its reality. In this paper, we design a novel solution for unsupervised diverse colorization. Specifically, we leverage conditional generative adversarial networks to model the distribution of real-world item colors, in which we develop a fully convolutional generator with multi-layer noise to enhance diversity, with multi-layer condition concatenation to maintain reality, and with stride 1 to keep spatial information. With such a novel network architecture, the model yields highly competitive performance on the open LSUN bedroom dataset. The Turing test of 80 humans further indicates our generated color schemes are highly convincible.

Click to Read Paper and Get Code
For a learning task, data can usually be collected from different sources or be represented from multiple views. For example, laboratory results from different medical examinations are available for disease diagnosis, and each of them can only reflect the health state of a person from a particular aspect/view. Therefore, different views provide complementary information for learning tasks. An effective integration of the multi-view information is expected to facilitate the learning performance. In this paper, we propose a general predictor, named multi-view machines (MVMs), that can effectively include all the possible interactions between features from multiple views. A joint factorization is embedded for the full-order interaction parameters which allows parameter estimation under sparsity. Moreover, MVMs can work in conjunction with different loss functions for a variety of machine learning tasks. A stochastic gradient descent method is presented to learn the MVM model. We further illustrate the advantages of MVMs through comparison with other methods for multi-view classification, including support vector machines (SVMs), support tensor machines (STMs) and factorization machines (FMs).

* WSDM 2016
Click to Read Paper and Get Code
Multi-view face synthesis from a single image is an ill-posed problem and often suffers from serious appearance distortion. Producing photo-realistic and identity preserving multi-view results is still a not well defined synthesis problem. This paper proposes Load Balanced Generative Adversarial Networks (LB-GAN) to precisely rotate the yaw angle of an input face image to any specified angle. LB-GAN decomposes the challenging synthesis problem into two well constrained subtasks that correspond to a face normalizer and a face editor respectively. The normalizer first frontalizes an input image, and then the editor rotates the frontalized image to a desired pose guided by a remote code. In order to generate photo-realistic local details, the normalizer and the editor are trained in a two-stage manner and regulated by a conditional self-cycle loss and an attention based L2 loss. Exhaustive experiments on controlled and uncontrolled environments demonstrate that the proposed method not only improves the visual realism of multi-view synthetic images, but also preserves identity information well.

Click to Read Paper and Get Code