Models, code, and papers for "Yi Zhao":

Granger Mediation Analysis of Multiple Time Series with an Application to fMRI

Sep 15, 2017
Yi Zhao, Xi Luo

It becomes increasingly popular to perform mediation analysis for complex data from sophisticated experimental studies. In this paper, we present Granger Mediation Analysis (GMA), a new framework for causal mediation analysis of multiple time series. This framework is motivated by a functional magnetic resonance imaging (fMRI) experiment where we are interested in estimating the mediation effects between a randomized stimulus time series and brain activity time series from two brain regions. The stable unit treatment assumption for causal mediation analysis is thus unrealistic for this type of time series data. To address this challenge, our framework integrates two types of models: causal mediation analysis across the variables and vector autoregressive models across the temporal observations. We further extend this framework to handle multilevel data to address individual variability and correlated errors between the mediator and the outcome variables. These models not only provide valid causal mediation for time series data but also model the causal dynamics across time. We show that the modeling parameters in our models are identifiable, and we develop computationally efficient methods to maximize the likelihood-based optimization criteria. Simulation studies show that our method reduces the estimation bias and improve statistical power, compared to existing approaches. On a real fMRI data set, our approach not only infers the causal effects of brain pathways but accurately captures the feedback effect of the outcome region on the mediator region.

* 59 pages. Presented at the 2017 ENAR, JSM, and other meetings 

  Click for Model/Code and Paper
The Game Imitation: Deep Supervised Convolutional Networks for Quick Video Game AI

Feb 18, 2017
Zhao Chen, Darvin Yi

We present a vision-only model for gaming AI which uses a late integration deep convolutional network architecture trained in a purely supervised imitation learning context. Although state-of-the-art deep learning models for video game tasks generally rely on more complex methods such as deep-Q learning, we show that a supervised model which requires substantially fewer resources and training time can already perform well at human reaction speeds on the N64 classic game Super Smash Bros. We frame our learning task as a 30-class classification problem, and our CNN model achieves 80% top-1 and 95% top-3 validation accuracy. With slight test-time fine-tuning, our model is also competitive during live simulation with the highest-level AI built into the game. We will further show evidence through network visualizations that the network is successfully leveraging temporal information during inference to aid in decision making. Our work demonstrates that supervised CNN models can provide good performance in challenging policy prediction tasks while being significantly simpler and more lightweight than alternatives.

* 11 pages, 12 figures 

  Click for Model/Code and Paper
Pathway Lasso: Estimate and Select Sparse Mediation Pathways with High Dimensional Mediators

Mar 24, 2016
Yi Zhao, Xi Luo

In many scientific studies, it becomes increasingly important to delineate the causal pathways through a large number of mediators, such as genetic and brain mediators. Structural equation modeling (SEM) is a popular technique to estimate the pathway effects, commonly expressed as products of coefficients. However, it becomes unstable to fit such models with high dimensional mediators, especially for a general setting where all the mediators are causally dependent but the exact causal relationships between them are unknown. This paper proposes a sparse mediation model using a regularized SEM approach, where sparsity here means that a small number of mediators have nonzero mediation effects between a treatment and an outcome. To address the model selection challenge, we innovate by introducing a new penalty called Pathway Lasso. This penalty function is a convex relaxation of the non-convex product function, and it enables a computationally tractable optimization criterion to estimate and select many pathway effects simultaneously. We develop a fast ADMM-type algorithm to compute the model parameters, and we show that the iterative updates can be expressed in closed form. On both simulated data and a real fMRI dataset, the proposed approach yields higher pathway selection accuracy and lower estimation bias than other competing methods.

* 26 pages and 7 figures. Presented at the 2016 ENAR meeting, March 8, 2016, see slides at https://rluo.github.io/slides/MultipleMediator_ENAR_2016.html 

  Click for Model/Code and Paper
Classification of entities via their descriptive sentences

Nov 28, 2017
Chao Zhao, Min Zhao, Yi Guan

Hypernym identification of open-domain entities is crucial for taxonomy construction as well as many higher-level applications. Current methods suffer from either low precision or low recall. To decrease the difficulty of this problem, we adopt a classification-based method. We pre-define a concept taxonomy and classify an entity to one of its leaf concept, based on the name and description information of the entity. A convolutional neural network classifier and a K-means clustering module are adopted for classification. We applied this system to 2.1 million Baidu Baike entities, and 1.1 million of them were successfully identified with a precision of 99.36%.


  Click for Model/Code and Paper
Constructing a Hierarchical User Interest Structure based on User Profiles

Sep 20, 2017
Chao Zhao, Min Zhao, Yi Guan

The interests of individual internet users fall into a hierarchical structure which is useful in regards to building personalized searches and recommendations. Most studies on this subject construct the interest hierarchy of a single person from the document perspective. In this study, we constructed the user interest hierarchy via user profiles. We organized 433,397 user interests, referred to here as "attentions", into a user attention network (UAN) from 200 million user profiles; we then applied the Louvain algorithm to detect hierarchical clusters in these attentions. Finally, a 26-level hierarchy with 34,676 clusters was obtained. We found that these attention clusters were aggregated according to certain topics as opposed to the hyponymy-relation based conceptual ontologies. The topics can be entities or concepts, and the relations were not restrained by hyponymy. The concept relativity encapsulated in the user's interest can be captured by labeling the attention clusters with corresponding concepts.


  Click for Model/Code and Paper
EMR-based medical knowledge representation and inference via Markov random fields and distributed representation learning

Sep 20, 2017
Chao Zhao, Jingchi Jiang, Yi Guan

Objective: Electronic medical records (EMRs) contain an amount of medical knowledge which can be used for clinical decision support (CDS). Our objective is a general system that can extract and represent these knowledge contained in EMRs to support three CDS tasks: test recommendation, initial diagnosis, and treatment plan recommendation, with the given condition of one patient. Methods: We extracted four kinds of medical entities from records and constructed an EMR-based medical knowledge network (EMKN), in which nodes are entities and edges reflect their co-occurrence in a single record. Three bipartite subgraphs (bi-graphs) were extracted from the EMKN to support each task. One part of the bi-graph was the given condition (e.g., symptoms), and the other was the condition to be inferred (e.g., diseases). Each bi-graph was regarded as a Markov random field to support the inference. Three lazy energy functions and one parameter-based energy function were proposed, as well as two knowledge representation learning-based energy functions, which can provide a distributed representation of medical entities. Three measures were utilized for performance evaluation. Results: On the initial diagnosis task, 80.11% of the test records identified at least one correct disease from top 10 candidates. Test and treatment recommendation results were 87.88% and 92.55%, respectively. These results altogether indicate that the proposed system outperformed the baseline methods. The distributed representation of medical entities does reflect similarity relationships in regards to knowledge level. Conclusion: Combining EMKN and MRF is an effective approach for general medical knowledge representation and inference. Different tasks, however, require designing their energy functions individually.


  Click for Model/Code and Paper
Fast Asynchronous Parallel Stochastic Gradient Decent

Aug 24, 2015
Shen-Yi Zhao, Wu-Jun Li

Stochastic gradient descent~(SGD) and its variants have become more and more popular in machine learning due to their efficiency and effectiveness. To handle large-scale problems, researchers have recently proposed several parallel SGD methods for multicore systems. However, existing parallel SGD methods cannot achieve satisfactory performance in real applications. In this paper, we propose a fast asynchronous parallel SGD method, called AsySVRG, by designing an asynchronous strategy to parallelize the recently proposed SGD variant called stochastic variance reduced gradient~(SVRG). Both theoretical and empirical results show that AsySVRG can outperform existing state-of-the-art parallel SGD methods like Hogwild! in terms of convergence rate and computation cost.


  Click for Model/Code and Paper
Unsupervised Learning Layers for Video Analysis

May 24, 2017
Liang Zhao, Yang Wang, Yi Yang, Wei Xu

This paper presents two unsupervised learning layers (UL layers) for label-free video analysis: one for fully connected layers, and the other for convolutional ones. The proposed UL layers can play two roles: they can be the cost function layer for providing global training signal; meanwhile they can be added to any regular neural network layers for providing local training signals and combined with the training signals backpropagated from upper layers for extracting both slow and fast changing features at layers of different depths. Therefore, the UL layers can be used in either pure unsupervised or semi-supervised settings. Both a closed-form solution and an online learning algorithm for two UL layers are provided. Experiments with unlabeled synthetic and real-world videos demonstrated that the neural networks equipped with UL layers and trained with the proposed online learning algorithm can extract shape and motion information from video sequences of moving objects. The experiments demonstrated the potential applications of UL layers and online learning algorithm to head orientation estimation and moving object localization.


  Click for Model/Code and Paper
Transferring neural speech waveform synthesizers to musical instrument sounds generation

Oct 27, 2019
Yi Zhao, Xin Wang, Lauri Juvela, Junichi Yamagishi

Recent neural waveform synthesizers such as WaveNet, WaveGlow, and the neural-source-filter (NSF) model have shown good performance in speech synthesis despite their different methods of waveform generation. The similarity between speech and music audio synthesis techniques suggests interesting avenues to explore in terms of the best way to apply speech synthesizers in the music domain. This work compares three neural synthesizers used for musical instrument sounds generation under three scenarios: training from scratch on music data, zero-shot learning from the speech domain, and fine-tuning-based adaptation from the speech to the music domain. The results of a large-scale perceptual test demonstrated that the performance of three synthesizers improved when they were pre-trained on speech data and fine-tuned on music data, which indicates the usefulness of knowledge from speech data for music audio generation. Among the synthesizers, WaveGlow showed the best potential in zero-shot learning while NSF performed best in the other scenarios and could generate samples that were perceptually close to natural audio.

* Submitted to ICASSP 2020 

  Click for Model/Code and Paper
ADASS: Adaptive Sample Selection for Training Acceleration

Jun 11, 2019
Shen-Yi Zhao, Hao Gao, Wu-Jun Li

Stochastic gradient decent~(SGD) and its variants, including some accelerated variants, have become popular for training in machine learning. However, in all existing SGD and its variants, the sample size in each iteration~(epoch) of training is the same as the size of the full training set. In this paper, we propose a new method, called \underline{ada}ptive \underline{s}ample \underline{s}election~(ADASS), for training acceleration. During different epoches of training, ADASS only need to visit different training subsets which are adaptively selected from the full training set according to the Lipschitz constants of the loss functions on samples. It means that in ADASS the sample size in each epoch of training can be smaller than the size of the full training set, by discarding some samples. ADASS can be seamlessly integrated with existing optimization methods, such as SGD and momentum SGD, for training acceleration. Theoretical results show that the learning accuracy of ADASS is comparable to that of counterparts with full training set. Furthermore, empirical results on both shallow models and deep models also show that ADASS can accelerate the training process of existing methods without sacrificing accuracy.


  Click for Model/Code and Paper
Clustered Reinforcement Learning

Jun 06, 2019
Xiao Ma, Shen-Yi Zhao, Wu-Jun Li

Exploration strategy design is one of the challenging problems in reinforcement learning~(RL), especially when the environment contains a large state space or sparse rewards. During exploration, the agent tries to discover novel areas or high reward~(quality) areas. In most existing methods, the novelty and quality in the neighboring area of the current state are not well utilized to guide the exploration of the agent. To tackle this problem, we propose a novel RL framework, called \underline{c}lustered \underline{r}einforcement \underline{l}earning~(CRL), for efficient exploration in RL. CRL adopts clustering to divide the collected states into several clusters, based on which a bonus reward reflecting both novelty and quality in the neighboring area~(cluster) of the current state is given to the agent. Experiments on a continuous control task and several \emph{Atari 2600} games show that CRL can outperform other state-of-the-art methods to achieve the best performance in most cases.

* 16pages, 3 figures 

  Click for Model/Code and Paper
On the Convergence of Memory-Based Distributed SGD

May 30, 2019
Shen-Yi Zhao, Hao Gao, Wu-Jun Li

Distributed stochastic gradient descent~(DSGD) has been widely used for optimizing large-scale machine learning models, including both convex and non-convex models. With the rapid growth of model size, huge communication cost has been the bottleneck of traditional DSGD. Recently, many communication compression methods have been proposed. Memory-based distributed stochastic gradient descent~(M-DSGD) is one of the efficient methods since each worker communicates a sparse vector in each iteration so that the communication cost is small. Recent works propose the convergence rate of M-DSGD when it adopts vanilla SGD. However, there is still a lack of convergence theory for M-DSGD when it adopts momentum SGD. In this paper, we propose a universal convergence analysis for M-DSGD by introducing \emph{transformation equation}. The transformation equation describes the relation between traditional DSGD and M-DSGD so that we can transform M-DSGD to its corresponding DSGD. Hence we get the convergence rate of M-DSGD with momentum for both convex and non-convex problems. Furthermore, we combine M-DSGD and stagewise learning that the learning rate of M-DSGD in each stage is a constant and is decreased by stage, instead of iteration. Using the transformation equation, we propose the convergence rate of stagewise M-DSGD which bridges the gap between theory and practice.


  Click for Model/Code and Paper
Operation-aware Neural Networks for User Response Prediction

Apr 02, 2019
Yi Yang, Baile Xu, Furao Shen, Jian Zhao

User response prediction makes a crucial contribution to the rapid development of online advertising system and recommendation system. The importance of learning feature interactions has been emphasized by many works. Many deep models are proposed to automatically learn high-order feature interactions. Since most features in advertising system and recommendation system are high-dimensional sparse features, deep models usually learn a low-dimensional distributed representation for each feature in the bottom layer. Besides traditional fully-connected architectures, some new operations, such as convolutional operations and product operations, are proposed to learn feature interactions better. In these models, the representation is shared among different operations. However, the best representation for different operations may be different. In this paper, we propose a new neural model named Operation-aware Neural Networks (ONN) which learns different representations for different operations. Our experimental results on two large-scale real-world ad click/conversion datasets demonstrate that ONN consistently outperforms the state-of-the-art models in both offline-training environment and online-training environment.


  Click for Model/Code and Paper
Quantized Epoch-SGD for Communication-Efficient Distributed Learning

Jan 10, 2019
Shen-Yi Zhao, Hao Gao, Wu-Jun Li

Due to its efficiency and ease to implement, stochastic gradient descent (SGD) has been widely used in machine learning. In particular, SGD is one of the most popular optimization methods for distributed learning. Recently, quantized SGD (QSGD), which adopts quantization to reduce the communication cost in SGD-based distributed learning, has attracted much attention. Although several QSGD methods have been proposed, some of them are heuristic without theoretical guarantee, and others have high quantization variance which makes the convergence become slow. In this paper, we propose a new method, called Quantized Epoch-SGD (QESGD), for communication-efficient distributed learning. QESGD compresses (quantizes) the parameter with variance reduction, so that it can get almost the same performance as that of SGD with less communication cost. QESGD is implemented on the Parameter Server framework, and empirical results on distributed deep learning show that QESGD can outperform other state-of-the-art quantization methods to achieve the best performance.


  Click for Model/Code and Paper
GSPN: Generative Shape Proposal Network for 3D Instance Segmentation in Point Cloud

Dec 08, 2018
Li Yi, Wang Zhao, He Wang, Minhyuk Sung, Leonidas Guibas

We introduce a novel 3D object proposal approach named Generative Shape Proposal Network (GSPN) for instance segmentation in point cloud data. Instead of treating object proposal as a direct bounding box regression problem, we take an analysis-by-synthesis strategy and generate proposals by reconstructing shapes from noisy observations in a scene. We incorporate GSPN into a novel 3D instance segmentation framework named Region-based PointNet (R-PointNet) which allows flexible proposal refinement and instance segmentation generation. We achieve state-of-the-art performance on several 3D instance segmentation tasks. The success of GSPN largely comes from its emphasis on geometric understandings during object proposal, which greatly reducing proposals with low objectness.


  Click for Model/Code and Paper
Self-weighted Multiple Kernel Learning for Graph-based Clustering and Semi-supervised Classification

Jun 20, 2018
Zhao Kang, Xiao Lu, Jinfeng Yi, Zenglin Xu

Multiple kernel learning (MKL) method is generally believed to perform better than single kernel method. However, some empirical studies show that this is not always true: the combination of multiple kernels may even yield an even worse performance than using a single kernel. There are two possible reasons for the failure: (i) most existing MKL methods assume that the optimal kernel is a linear combination of base kernels, which may not hold true; and (ii) some kernel weights are inappropriately assigned due to noises and carelessly designed algorithms. In this paper, we propose a novel MKL framework by following two intuitive assumptions: (i) each kernel is a perturbation of the consensus kernel; and (ii) the kernel that is close to the consensus kernel should be assigned a large weight. Impressively, the proposed method can automatically assign an appropriate weight to each kernel without introducing additional parameters, as existing methods do. The proposed framework is integrated into a unified framework for graph-based clustering and semi-supervised classification. We have conducted experiments on multiple benchmark datasets and our empirical results verify the superiority of the proposed framework.

* Accepted by IJCAI 2018, Code is available 

  Click for Model/Code and Paper
Indexing of CNN Features for Large Scale Image Search

Feb 01, 2018
Ruoyu Liu, Yao Zhao, Shikui Wei, Yi Yang

The convolutional neural network (CNN) features can give a good description of image content, which usually represent images with unique global vectors. Although they are compact compared to local descriptors, they still cannot efficiently deal with large-scale image retrieval due to the cost of the linear incremental computation and storage. To address this issue, we build a simple but effective indexing framework based on inverted table, which significantly decreases both the search time and memory usage. In addition, several strategies are fully investigated under an indexing framework to adapt it to CNN features and compensate for quantization errors. First, we use multiple assignment for the query and database images to increase the probability of relevant images' co-existing in the same Voronoi cells obtained via the clustering algorithm. Then, we introduce embedding codes to further improve precision by removing false matches during a search. We demonstrate that by using hashing schemes to calculate the embedding codes and by changing the ranking rule, indexing framework speeds can be greatly improved. Extensive experiments conducted on several unsupervised and supervised benchmarks support these results and the superiority of the proposed indexing framework. We also provide a fair comparison between the popular CNN features.

* 21 pages, 9 figures, submitted to Multimedia Tools and Applications 

  Click for Model/Code and Paper
De-identification of medical records using conditional random fields and long short-term memory networks

Sep 29, 2017
Zhipeng Jiang, Chao Zhao, Bin He, Yi Guan, Jingchi Jiang

The CEGS N-GRID 2016 Shared Task 1 in Clinical Natural Language Processing focuses on the de-identification of psychiatric evaluation records. This paper describes two participating systems of our team, based on conditional random fields (CRFs) and long short-term memory networks (LSTMs). A pre-processing module was introduced for sentence detection and tokenization before de-identification. For CRFs, manually extracted rich features were utilized to train the model. For LSTMs, a character-level bi-directional LSTM network was applied to represent tokens and classify tags for each token, following which a decoding layer was stacked to decode the most probable protected health information (PHI) terms. The LSTM-based system attained an i2b2 strict micro-F_1 measure of 89.86%, which was higher than that of the CRF-based system.


  Click for Model/Code and Paper
Hierarchical Deep Recurrent Architecture for Video Understanding

Jul 11, 2017
Luming Tang, Boyang Deng, Haiyu Zhao, Shuai Yi

This paper introduces the system we developed for the Youtube-8M Video Understanding Challenge, in which a large-scale benchmark dataset was used for multi-label video classification. The proposed framework contains hierarchical deep architecture, including the frame-level sequence modeling part and the video-level classification part. In the frame-level sequence modelling part, we explore a set of methods including Pooling-LSTM (PLSTM), Hierarchical-LSTM (HLSTM), Random-LSTM (RLSTM) in order to address the problem of large amount of frames in a video. We also introduce two attention pooling methods, single attention pooling (ATT) and multiply attention pooling (Multi-ATT) so that we can pay more attention to the informative frames in a video and ignore the useless frames. In the video-level classification part, two methods are proposed to increase the classification performance, i.e. Hierarchical-Mixture-of-Experts (HMoE) and Classifier Chains (CC). Our final submission is an ensemble consisting of 18 sub-models. In terms of the official evaluation metric Global Average Precision (GAP) at 20, our best submission achieves 0.84346 on the public 50% of test dataset and 0.84333 on the private 50% of test data.

* Accepted as Classification Challenge Track paper in CVPR 2017 Workshop on YouTube-8M Large-Scale Video Understanding 

  Click for Model/Code and Paper
Learning and inference in knowledge-based probabilistic model for medical diagnosis

Mar 28, 2017
Jingchi Jiang, Chao Zhao, Yi Guan, Qiubin Yu

Based on a weighted knowledge graph to represent first-order knowledge and combining it with a probabilistic model, we propose a methodology for the creation of a medical knowledge network (MKN) in medical diagnosis. When a set of symptoms is activated for a specific patient, we can generate a ground medical knowledge network composed of symptom nodes and potential disease nodes. By Incorporating a Boltzmann machine into the potential function of a Markov network, we investigated the joint probability distribution of the MKN. In order to deal with numerical symptoms, a multivariate inference model is presented that uses conditional probability. In addition, the weights for the knowledge graph were efficiently learned from manually annotated Chinese Electronic Medical Records (CEMRs). In our experiments, we found numerically that the optimum choice of the quality of disease node and the expression of symptom variable can improve the effectiveness of medical diagnosis. Our experimental results comparing a Markov logic network and the logistic regression algorithm on an actual CEMR database indicate that our method holds promise and that MKN can facilitate studies of intelligent diagnosis.

* 32 pages, 8 figures 

  Click for Model/Code and Paper