Research papers and code for "Yang Zhang":
Hashtags in online social networks have gained tremendous popularity during the past five years. The resulting large quantity of data has provided a new lens into modern society. Previously, researchers have mainly relied on data collected from Twitter to study either a certain type of hashtag or a certain property of hashtags. In this paper, we perform the first large-scale empirical analysis of hashtags shared on Instagram, the major platform for hashtag sharing. We study hashtags along three dimensions: the temporal-spatial dimension, the semantic dimension, and the social dimension. Extensive experiments on three large-scale datasets with more than 7 million hashtags in total yield a series of interesting observations. First, we show that the temporal patterns of hashtags can be categorized into four different clusters, and that people tend to share fewer hashtags at certain places and more at others. Second, we observe that a non-negligible proportion of hashtags exhibit large semantic displacement. We demonstrate that hashtags that are shared more uniformly among users, as quantified by the proposed hashtag entropy, are less prone to semantic displacement. In the end, we propose a bipartite graph embedding model to summarize users' hashtag profiles, and rely on these profiles to perform friendship prediction. Evaluation results show that our approach achieves effective prediction with an AUC (area under the ROC curve) above 0.8, which demonstrates the strong social signals carried by hashtags.
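The hashtag entropy above has a natural reading as the Shannon entropy of a hashtag's usage distribution over users. Below is a minimal sketch under that assumption; the paper's exact estimator may differ, and the function name and data layout are hypothetical:

```python
import math
from collections import Counter

def hashtag_entropy(user_ids):
    """Shannon entropy of a hashtag's usage distribution over users.

    `user_ids` lists the author of each post carrying the hashtag;
    a uniform spread over many users yields high entropy, while a
    hashtag dominated by a few users yields low entropy.
    """
    counts = Counter(user_ids)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log2(p) for p in probs)

# Toy example: one hashtag shared evenly, one dominated by a single user.
print(hashtag_entropy(["u1", "u2", "u3", "u4"]))  # 2.0 bits
print(hashtag_entropy(["u1", "u1", "u1", "u2"]))  # ~0.81 bits
```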

* WWW 2019

Collecting training data from the physical world is usually time-consuming and even dangerous for fragile robots; thus, recent advances in robot learning advocate the use of simulators as the training platform. Unfortunately, the reality gap between synthetic and real visual data prohibits direct migration of models trained in virtual worlds to the real world. This paper proposes a modular architecture for tackling the virtual-to-real problem. The proposed architecture separates the learning model into a perception module and a control policy module, and uses semantic image segmentation as the meta representation relating these two modules. The perception module translates the perceived RGB image into semantic image segmentation. The control policy module is implemented as a deep reinforcement learning agent that performs actions based on the translated image segmentation. Our architecture is evaluated on an obstacle avoidance task and a target following task. Experimental results show that our architecture significantly outperforms all of the baseline methods in both virtual and real environments, and exhibits a faster learning curve than the baselines. We also present a detailed analysis of a variety of variant configurations, and validate the transferability of our modular architecture.
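The two-module interface can be pictured with a short sketch. This only illustrates the architecture's separation of concerns, not the paper's implementation; all class and method names are hypothetical:

```python
import numpy as np

class PerceptionModule:
    """Maps an RGB frame to a per-pixel semantic segmentation map.

    A trained segmentation network would go here; this stub only fixes
    the interface (hypothetical class, for illustration).
    """
    def segment(self, rgb_frame: np.ndarray) -> np.ndarray:
        # Placeholder: return a single-class segmentation of the same H x W.
        return np.zeros(rgb_frame.shape[:2], dtype=np.int64)

class ControlPolicyModule:
    """Deep RL agent that acts on segmentation maps, not raw pixels."""
    def act(self, segmentation: np.ndarray) -> int:
        # Placeholder: a trained DRL agent would pick the action here.
        return 0

def control_step(perception, policy, rgb_frame):
    # Semantic segmentation is the meta representation between modules,
    # so the policy never sees domain-specific RGB appearance.
    seg = perception.segment(rgb_frame)
    return policy.act(seg)
```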

* 7 pages, accepted by IJCAI-18

The OECD has pointed out that the best way to keep students in school is to intervene as early as possible [1]. Using educational big data and deep learning to predict students' scores provides new resources and perspectives for early intervention. Previous forecasting schemes often require manual feature filtering and a large amount of prior and expert knowledge, whereas deep learning can extract features automatically, without manual intervention, and achieve better predictive performance. In this paper, we propose a graph neural network matrix completion model (Graph-VAE) based on deep learning that automatically extracts features without requiring extensive prior knowledge. Experiments show that our model outperforms traditional approaches on the student score dataset and better captures the correlations and differences between students and courses. Moreover, visualizing the dimensionality-reduced encoding vectors reveals a clustering structure consistent with the real data distribution. In addition, we use gradient-based attribution methods to analyze the key factors that influence performance prediction.
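The gradient-based attribution step can be illustrated with a small PyTorch sketch: rank input features by the gradient magnitude of the predicted score. This is a generic saliency-style attribution under our own assumptions, not necessarily the paper's exact method; the model and shapes are stand-ins:

```python
import torch

def gradient_attribution(model, features):
    """Rank input features by the gradient magnitude of the prediction.

    A simple saliency-style attribution: larger |d(prediction)/d(input)|
    suggests a stronger local influence on the predicted score.
    """
    x = features.clone().detach().requires_grad_(True)
    score = model(x).sum()  # scalar to differentiate
    score.backward()
    return x.grad.abs()

# Toy usage with a stand-in score predictor (hypothetical shapes).
model = torch.nn.Linear(8, 1)
features = torch.randn(4, 8)  # 4 students, 8 input features
print(gradient_attribution(model, features).shape)  # torch.Size([4, 8])
```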

This study provides a systematic review of recent advances in designing the intelligent tutoring robot (ITR), and summarises the status quo of applying artificial intelligence (AI) techniques. We first analyse the environment of the ITR and propose a relationship model describing the interactions of the ITR with the students, the social milieu and the curriculum. Then, we transform the relationship model into a perception-planning-action model to explore which AI techniques are suitable for the ITR. This article provides insights on promoting the human-robot teaching-learning process and AI-assisted educational techniques, illustrating design guidelines and future research perspectives for intelligent tutoring robots.

In this paper, we present a novel upsampling framework to enhance the spatial resolution of depth images. In our framework, the upscaling of a low-resolution depth image is guided by a corresponding intensity image, and we formulate the task as a cost aggregation problem with the guided filter. However, the guided filter does not make full use of the properties of the depth image. Since depth images have quite sparse gradients, this inspires us to regularize the gradients to improve depth upscaling results. Statistics reveal a special property of depth images: a non-negligible fraction of pixels have horizontal or vertical derivatives equal to $\pm 1$. Considering this property, we propose a low gradient regularization method that reduces the penalty for horizontal or vertical derivatives of $\pm 1$. The proposed low gradient regularization is integrated with the guided filter into our depth image upsampling method. Experimental results demonstrate the effectiveness of the proposed approach both qualitatively and quantitatively compared with state-of-the-art methods.
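A minimal NumPy sketch of the low gradient idea: weight the penalty on horizontal and vertical derivatives so that the frequent $\pm 1$ steps are discounted. The exact penalty form in the paper may differ; the weighting below is an illustrative assumption:

```python
import numpy as np

def low_gradient_penalty(depth, weight_small=0.1):
    """Penalty on horizontal/vertical derivatives that discounts +/-1 steps.

    Standard gradient regularizers penalize every nonzero derivative; here
    derivatives with magnitude <= 1 get a reduced weight, so the frequent
    +/-1 steps of depth images are not over-smoothed. This weighting is an
    illustrative choice, not the paper's exact formulation.
    """
    dx = np.diff(depth, axis=1)  # horizontal derivatives
    dy = np.diff(depth, axis=0)  # vertical derivatives
    def penalize(d):
        mag = np.abs(d)
        w = np.where(mag <= 1.0, weight_small, 1.0)  # discount +/-1 steps
        return np.sum(w * mag)
    return penalize(dx) + penalize(dy)
```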

* 28 pages, 7 figures

Imbalanced learning is an important problem for classification models, and imbalanced learning algorithms have enjoyed much popularity in many applications. Typically, they can be partitioned into two types: data-level approaches and algorithm-level approaches. In this paper, the focus is to develop a robust synthetic minority oversampling technique, which falls under the umbrella of data-level approaches. On one hand, we propose a method to generate synthetic samples in a high-dimensional feature space instead of a linear sampling space. On the other hand, in the proposed imbalanced learning framework, a Gaussian mixture model is employed to distinguish outliers from minority-class instances and to filter out the synthetic instances that fall into the majority class. Last and most importantly, an adaptive optimization method is proposed to tune the parameters of the sampling process. By doing so, an effective and efficient imbalanced learning framework is developed.
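A rough sketch of such a data-level pipeline: fit a Gaussian mixture to the minority class, drop low-likelihood instances as probable outliers, then interpolate between remaining pairs to synthesize new samples. This is a simplified stand-in for the proposed framework; thresholds and component counts are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_filtered_oversample(X_min, n_new, n_components=2, quantile=0.05, seed=0):
    """SMOTE-style oversampling with a GMM outlier filter (illustrative).

    Fit a Gaussian mixture to the minority class, drop the lowest-likelihood
    instances as probable outliers, then interpolate between random pairs of
    the remaining instances to create synthetic minority samples.
    """
    rng = np.random.default_rng(seed)
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X_min)
    ll = gmm.score_samples(X_min)                # per-sample log-likelihood
    kept = X_min[ll > np.quantile(ll, quantile)] # filter likely outliers
    i = rng.integers(0, len(kept), size=n_new)
    j = rng.integers(0, len(kept), size=n_new)
    lam = rng.random((n_new, 1))
    return kept[i] + lam * (kept[j] - kept[i])   # convex combinations

X_min = np.random.randn(50, 4)                   # toy minority class
print(gmm_filtered_oversample(X_min, n_new=100).shape)  # (100, 4)
```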

Multi-Task Learning (MTL) is a learning paradigm in machine learning whose aim is to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks. In this paper, we give a survey of MTL. First, we classify different MTL algorithms into several categories, including the feature learning approach, low-rank approach, task clustering approach, task relation learning approach, and decomposition approach, and then discuss the characteristics of each approach. To further improve the performance of learning tasks, MTL can be combined with other learning paradigms, including semi-supervised learning, active learning, unsupervised learning, reinforcement learning, multi-view learning and graphical models. When the number of tasks is large or the data dimensionality is high, batch MTL models have difficulty handling this situation, so online, parallel and distributed MTL models, as well as dimensionality reduction and feature hashing, are reviewed to reveal their computational and storage advantages. Many real-world applications use MTL to boost their performance, and we review representative works. Finally, we present theoretical analyses and discuss several future directions for MTL.
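As a concrete instance of the feature learning approach, hard parameter sharing keeps one shared representation and task-specific heads. A minimal PyTorch sketch (dimensions are arbitrary):

```python
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    """Hard parameter sharing: one shared trunk, one head per task."""
    def __init__(self, in_dim, hidden, task_out_dims):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, d) for d in task_out_dims)

    def forward(self, x):
        h = self.trunk(x)  # representation shared by all tasks
        return [head(h) for head in self.heads]

model = HardSharingMTL(in_dim=16, hidden=32, task_out_dims=[3, 1])
outs = model(torch.randn(8, 16))
print([o.shape for o in outs])  # [torch.Size([8, 3]), torch.Size([8, 1])]
```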

We investigate a lattice-structured LSTM model for Chinese NER, which encodes a sequence of input characters as well as all potential words that match a lexicon. Compared with character-based methods, our model explicitly leverages word and word sequence information. Compared with word-based methods, lattice LSTM does not suffer from segmentation errors. Gated recurrent cells allow our model to choose the most relevant characters and words from a sentence for better NER results. Experiments on various datasets show that lattice LSTM outperforms both word-based and character-based LSTM baselines, achieving the best results.
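The lattice construction step can be sketched simply: enumerate every character subsequence that matches a lexicon word, yielding the word paths that are merged into the character LSTM. A toy sketch below (the gating network itself is omitted; `max_word_len` is an illustrative bound):

```python
def lattice_matches(chars, lexicon, max_word_len=4):
    """Enumerate (start, end, word) spans of the character sequence that
    match a lexicon; these word paths are merged into the character LSTM
    lattice (the gated merging network itself is omitted here).
    """
    spans = []
    for i in range(len(chars)):
        for j in range(i + 2, min(i + max_word_len, len(chars)) + 1):
            word = "".join(chars[i:j])
            if word in lexicon:
                spans.append((i, j, word))
    return spans

# Toy example: "南京市长江大桥" (Nanjing Yangtze River Bridge), whose
# ambiguous segmentation motivates the lattice.
lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
print(lattice_matches(list("南京市长江大桥"), lexicon))
```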

* Accepted at ACL 2018 as Long paper

This paper describes NCRF++, a toolkit for neural sequence labeling. NCRF++ is designed for the quick implementation of different neural sequence labeling models with a CRF inference layer. It provides users with an interface for building custom model structures through a configuration file, with flexible neural feature design and utilization. Built on PyTorch, its core operations are calculated in batch, making the toolkit efficient with GPU acceleration. It also includes implementations of most state-of-the-art neural sequence labeling models, such as LSTM-CRF, facilitating the reproduction of and refinement on those methods.

* ACL 2018, demonstration paper

In this technical report, we aim to mitigate the overfitting problem of natural language models by applying data augmentation methods. Specifically, we apply several types of noise to perturb the input word embeddings, such as Gaussian noise, Bernoulli noise, and adversarial noise, and we also apply several constraints to the different types of noise. With these proposed data augmentation methods, the baseline models gain improvements on several sentence classification tasks.
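A minimal PyTorch sketch of two such perturbations: additive Gaussian noise under an L2-norm constraint, and Bernoulli (dropout-style) masking. The noise scales and the constraint form are illustrative assumptions, not the report's exact settings:

```python
import torch

def perturb_embeddings(emb, kind="gaussian", sigma=0.1, p=0.1, max_norm=1.0):
    """Perturb word embeddings as a data augmentation (illustrative forms).

    gaussian : add N(0, sigma^2) noise, rescaled into an L2 ball of max_norm
    bernoulli: randomly zero out embedding dimensions with probability p
    """
    if kind == "gaussian":
        noise = sigma * torch.randn_like(emb)
        norms = noise.norm(dim=-1, keepdim=True).clamp_min(1e-12)
        noise = noise * (max_norm / norms).clamp(max=1.0)  # norm constraint
        return emb + noise
    if kind == "bernoulli":
        mask = (torch.rand_like(emb) > p).float()
        return emb * mask
    raise ValueError(kind)

emb = torch.randn(2, 5, 300)  # (batch, sequence length, embedding dim)
print(perturb_embeddings(emb).shape)  # torch.Size([2, 5, 300])
```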

Robust PCA is a widely used statistical procedure to recover an underlying low-rank matrix from grossly corrupted observations. This work considers the problem of robust PCA as a nonconvex optimization problem on the manifold of low-rank matrices, and proposes two algorithms (for two versions of retractions) based on manifold optimization. It is shown that, with a properly designed initialization, the proposed algorithms are guaranteed to converge to the underlying low-rank matrix linearly. Compared with previous work based on the Burer-Monteiro decomposition of low-rank matrices, the proposed algorithms theoretically reduce the dependence on the condition number of the underlying low-rank matrix. Simulations and real data examples confirm the competitive performance of our method.
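For intuition only, here is a much-simplified alternating scheme, not the paper's manifold algorithms: hard-threshold the residual to estimate the sparse corruption, then map back to rank $r$ by truncated SVD, which can be read as a crude projection onto the low-rank manifold:

```python
import numpy as np

def robust_pca(Y, rank, thresh, iters=50):
    """Minimal alternating sketch for Y = L* + S* (not the paper's method).

    Alternate a hard-threshold estimate of the sparse corruption S with a
    rank-`rank` truncated SVD of Y - S; the SVD truncation plays the role
    of a projection onto the manifold of rank-`rank` matrices.
    """
    L = np.zeros_like(Y)
    for _ in range(iters):
        R = Y - L
        S = R * (np.abs(R) > thresh)               # hard-threshold outliers
        U, s, Vt = np.linalg.svd(Y - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-r approximation
    return L, S

# Toy check: low-rank matrix plus sparse corruption.
rng = np.random.default_rng(0)
L_true = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))
S_true = 10 * (rng.random((30, 30)) < 0.05) * rng.standard_normal((30, 30))
L_hat, _ = robust_pca(L_true + S_true, rank=2, thresh=3.0)
print(np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true))
```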

Parameters in deep neural networks trained on large-scale databases can generalize across multiple domains, a property referred to as "transferability". Unfortunately, transferability is usually defined as a set of discrete states, and it differs across domains and network architectures. Existing works usually apply parameter sharing or fine-tuning heuristically, and there is no principled approach to learning a parameter transfer strategy. To address this gap, a parameter transfer unit (PTU) is proposed in this paper. The PTU learns a fine-grained nonlinear combination of activations from both the source and the target domain networks, and subsumes hand-crafted discrete transfer states. In the PTU, transferability is controlled by two gates, which are artificial neurons that can be learned from data. The PTU is a general and flexible module which can be used in both CNNs and RNNs. Experiments are conducted with various network architectures and multiple transfer domain pairs. Results demonstrate the effectiveness of the PTU, as it outperforms heuristic parameter sharing and fine-tuning in most settings.
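A minimal PyTorch sketch of the gating idea: two learned sigmoid gates mix a transformed source activation into the target path. The precise PTU parameterization in the paper may differ; the layer shapes here are assumptions:

```python
import torch
import torch.nn as nn

class ParameterTransferUnit(nn.Module):
    """Gated, learnable mixing of source- and target-network activations.

    A sketch of the idea only: two sigmoid gates, learned from data, decide
    per dimension how much transformed source activation flows into the
    target layer, subsuming the discrete share-vs-fine-tune choice.
    """
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # adapt source features
        self.gate_src = nn.Linear(dim, dim)   # gate on the source path
        self.gate_tgt = nn.Linear(dim, dim)   # gate on the target path

    def forward(self, h_src, h_tgt):
        g_s = torch.sigmoid(self.gate_src(h_src))
        g_t = torch.sigmoid(self.gate_tgt(h_tgt))
        return g_t * h_tgt + g_s * torch.tanh(self.transform(h_src))

ptu = ParameterTransferUnit(64)
print(ptu(torch.randn(8, 64), torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```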

This paper quantitatively characterizes the approximation power of deep feed-forward neural networks (FNNs) in terms of the number of neurons, i.e., the product of the network width and depth. It is shown by construction that ReLU FNNs with width $\mywidth$ and depth $9L+12$ can approximate an arbitrary H\"older continuous function of order $\alpha$ with a Lipschitz constant $\nu$ on $[0,1]^d$ with a tight approximation rate $5(8\sqrt{d})^\alpha\nu N^{-2\alpha/d}L^{-2\alpha/d}$ for any given $N,L\in \mathbb{N}^+$. The constructive approximation is a corollary of a more general result for an arbitrary continuous function $f$ in terms of its modulus of continuity $\omega_f(\cdot)$. In particular, the approximation rate of ReLU FNNs with width $\mywidth$ and depth $9L+12$ for a general continuous function $f$ is $5\omega_f(8\sqrt{d} N^{-2/d}L^{-2/d})$. We also extend our analysis to the case when the domain of $f$ is irregular or localized in an $\epsilon$-neighborhood of a $d_{\mathcal{M}}$-dimensional smooth manifold $\mathcal{M}\subseteq [0,1]^d$ with $d_{\mathcal{M}}\ll d$. In particular, in the case of an essentially low-dimensional domain, we show an approximation rate $3\omega_f\big(\tfrac{4\epsilon}{1-\delta}\sqrt{\tfrac{d}{d_\delta}}\big)+5\omega_f\big(\tfrac{16d}{(1-\delta)\sqrt{d_\delta}}N^{-2/d_\delta}L^{-2/d_\delta }\big)$ for ReLU FNNs to approximate $f$ in the $\epsilon$-neighborhood, where $d_\delta=\mathcal{O}\big(d_{\mathcal{M}}\tfrac{\ln (d/\delta)}{\delta^2}\big)$ for any given $\delta\in(0,1)$. Our analysis provides a general guide for selecting the width and depth of ReLU FNNs to approximate continuous functions, especially in parallel computing.
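For reference, the modulus of continuity used in these rates has the standard definition below; a Hölder continuous function of order $\alpha$ with constant $\nu$ satisfies $\omega_f(r)\le \nu r^\alpha$, which turns the general rate into the Hölder rate stated above (standard facts, written out here rather than taken from the paper):

```latex
% Standard modulus of continuity on [0,1]^d:
\[
  \omega_f(r) \;=\; \sup\bigl\{\, |f(x)-f(y)| \;:\; x,y\in[0,1]^d,\ \|x-y\|_2\le r \,\bigr\}.
\]
% Specialization for a H\"older continuous f of order \alpha with constant \nu:
\[
  \omega_f(r)\le \nu r^{\alpha}
  \;\Longrightarrow\;
  5\,\omega_f\bigl(8\sqrt{d}\,N^{-2/d}L^{-2/d}\bigr)
  \;\le\; 5(8\sqrt{d})^{\alpha}\nu\, N^{-2\alpha/d}L^{-2\alpha/d}.
\]
```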

We study the global convergence of policy optimization for finding the Nash equilibria (NE) of zero-sum linear quadratic (LQ) games. To this end, we first investigate the landscape of LQ games, viewing them as a nonconvex-nonconcave saddle-point problem in the policy space. Specifically, we show that despite its nonconvexity and nonconcavity, the zero-sum LQ game has the property that a stationary point of the objective with respect to the feedback control policies constitutes an NE of the game. Building upon this, we develop three projected nested-gradient methods that are guaranteed to converge to the NE of the game. Moreover, we show that all of these algorithms enjoy both global sublinear and local linear convergence rates. Simulation results are then provided to validate the proposed algorithms. To the best of our knowledge, this work appears to be the first to investigate the optimization landscape of LQ games and to provably show the convergence of policy optimization methods to the Nash equilibria. Our work serves as an initial step toward understanding the theoretical aspects of policy-based reinforcement learning algorithms for zero-sum Markov games in general.

Embeddings are a fundamental component of many modern machine learning and natural language processing models. Understanding and visualizing them is essential for gathering insights about the information they capture and the behavior of the models. The state of the art in analyzing embeddings consists of projecting them onto two-dimensional planes without any interpretable semantics associated with the axes of the projection, which makes detailed analyses and comparisons among multiple sets of embeddings challenging. In this work, we propose to use explicit axes defined as algebraic formulae over embeddings to project them into a lower-dimensional but semantically meaningful subspace, as a simple yet effective analysis and visualization methodology. This methodology assigns an interpretable semantics to the measures of variability and the axes of visualizations, allowing for both comparisons among different sets of embeddings and fine-grained inspection of the embedding spaces. We demonstrate the power of the proposed methodology through a series of case studies that make use of visualizations constructed with the underlying methodology, and through a user study. The results show how the methodology is effective at providing more profound insights than classical projection methods and how it is widely applicable to many other use cases.
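The core mechanism is small enough to sketch: define an axis as an algebraic formula over embedding vectors (for example, a difference of two word vectors) and read coordinates off with dot products. The words and dimensions below are toy assumptions:

```python
import numpy as np

# Toy embedding table; in practice these would be trained word embeddings.
emb = {w: np.random.randn(50) for w in ["king", "queen", "man", "woman", "actor"]}

def axis(formula_terms):
    """Build an explicit, interpretable axis as a signed sum of embeddings,
    e.g. [("man", +1), ("woman", -1)] for a gender-like direction."""
    v = sum(sign * emb[w] for w, sign in formula_terms)
    return v / np.linalg.norm(v)

def project(words, axes):
    """Coordinates of each word on the interpretable axes (dot products)."""
    A = np.stack(axes, axis=1)  # (embedding dim, number of axes)
    return {w: emb[w] @ A for w in words}

gender = axis([("man", +1), ("woman", -1)])
royalty = axis([("king", +1), ("man", -1)])
print(project(["queen", "actor"], [gender, royalty]))
```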

In recent years, deep learning has achieved remarkable results in many fields, including computer vision, natural language processing, and speech recognition. Adequate training data is the key to ensuring the effectiveness of deep models; however, obtaining valid data requires a lot of time and labor. Data augmentation (DA) is an effective alternative, generating new labeled data from existing data using label-preserving transformations. Although we can benefit a lot from DA, designing appropriate DA policies requires extensive expert experience and time, and evaluating candidate policies during the search is costly. So we raise a new question in this paper: how can automated data augmentation be achieved at as low a cost as possible? We propose a method named BO-Aug, which automates the process by finding optimal DA policies using Bayesian optimization. Our method finds optimal policies at a relatively low search cost, and the policies searched on a specific dataset are transferable across different neural network architectures and even different datasets. We validate BO-Aug on three widely used image classification datasets: CIFAR-10, CIFAR-100 and SVHN. Experimental results show that the proposed method achieves state-of-the-art or near-state-of-the-art classification accuracy. Code to reproduce our experiments is available at https://github.com/zhangxiaozao/BO-Aug.
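The search loop can be sketched with an off-the-shelf Bayesian optimizer. Below, scikit-optimize's `gp_minimize` (our choice for illustration; the paper's exact optimizer and policy space may differ) searches a toy (operation, probability, magnitude) policy against a stand-in validation objective:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

OPS = ["rotate", "shear", "translate", "invert"]  # toy operation pool

def validation_error(policy):
    """Stand-in objective: in practice, train a model with the augmentation
    policy and return validation error; here a synthetic function."""
    op, prob, mag = policy
    return (prob - 0.5) ** 2 + (mag - 0.3) ** 2 + 0.01 * op

space = [
    Integer(0, len(OPS) - 1),  # which operation to apply
    Real(0.0, 1.0),            # probability of applying it
    Real(0.0, 1.0),            # magnitude of the transformation
]

# A GP surrogate proposes policies, needing far fewer trials than grid search.
result = gp_minimize(validation_error, space, n_calls=25, random_state=0)
print(OPS[result.x[0]], result.x[1], result.x[2])
```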

As an intuitive way of expressing emotion, animated Graphical Interchange Format (GIF) images have been widely used on social media. Most previous studies on automated GIF emotion recognition fail to effectively utilize GIFs' unique properties, which potentially limits recognition performance. In this study, we demonstrate the importance of human-related information in GIFs and conduct human-centered GIF emotion recognition with a proposed Keypoint Attended Visual Attention Network (KAVAN). The framework consists of a facial attention module and a hierarchical segment temporal module. The facial attention module exploits the strong relationship between GIF contents and human characters, and extracts frame-level visual features with a focus on human faces. The Hierarchical Segment LSTM (HS-LSTM) module is then proposed to better learn global GIF representations. Our proposed framework outperforms the state-of-the-art on the MIT GIFGIF dataset. Furthermore, the facial attention module provides reliable facial region mask predictions, which improves the model's interpretability.
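The hierarchical segment idea can be sketched as a two-level LSTM: a low-level LSTM encodes each segment of frames, and a high-level LSTM runs over the segment representations. This is a minimal sketch of that structure, not the exact KAVAN/HS-LSTM architecture; all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class HierarchicalSegmentLSTM(nn.Module):
    """Two-level temporal model over GIF frames (sketch of the HS-LSTM idea)."""
    def __init__(self, feat_dim, hidden, n_segments):
        super().__init__()
        self.n_segments = n_segments
        self.low = nn.LSTM(feat_dim, hidden, batch_first=True)  # within segment
        self.high = nn.LSTM(hidden, hidden, batch_first=True)   # across segments

    def forward(self, frames):  # frames: (batch, n_frames, feat_dim)
        chunks = frames.chunk(self.n_segments, dim=1)
        seg_reprs = [self.low(c)[1][0][-1] for c in chunks]  # last hidden state
        seg_seq = torch.stack(seg_reprs, dim=1)              # (batch, n_seg, hidden)
        _, (h, _) = self.high(seg_seq)
        return h[-1]  # global GIF representation

model = HierarchicalSegmentLSTM(feat_dim=512, hidden=128, n_segments=4)
print(model(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 128])
```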

* Accepted to IEEE International Conference on Multimedia and Expo (ICME) 2019

We do not speak word by word from scratch; our brain quickly structures a pattern like \textsc{sth do sth at someplace} and then fills in the detailed descriptions. To endow existing encoder-decoder image captioners with such human-like reasoning, we propose a novel framework, learning to Collocate Neural Modules (CNM), to generate the `inner pattern' connecting the visual encoder and the language decoder. Unlike the widely used neural module networks in visual Q\&A, where the language (i.e., the question) is fully observable, CNM for captioning is more challenging as the language is being generated and is thus only partially observable. To this end, we make the following technical contributions to CNM training: 1) compact module design: one module for function words and three for visual content words (e.g., noun, adjective, and verb); 2) soft module fusion and multi-step module execution, robustifying the visual reasoning under partial observation; 3) a linguistic loss that keeps the module controller faithful to part-of-speech collocations (e.g., an adjective comes before a noun). Extensive experiments on the challenging MS-COCO image captioning benchmark validate the effectiveness of our CNM image captioner. In particular, CNM achieves a new state-of-the-art 127.9 CIDEr-D on the Karpathy split and a single-model 126.0 c40 on the official server. CNM is also robust to few training samples; e.g., when trained with only one sentence per image, CNM can halve the performance loss compared to a strong baseline.
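Soft module fusion can be sketched compactly: a controller reads the decoder state and emits soft weights over the module outputs, keeping the module choice differentiable while the caption is only partially generated. The module count and layer shapes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SoftModuleFusion(nn.Module):
    """Soft fusion of neural module outputs (sketch of the CNM idea).

    A controller reads the decoder state and emits soft weights over the
    modules (e.g. function word, noun, adjective, verb); the fused output
    is the weighted sum, so the module choice stays differentiable under
    the partially observed, still-being-generated caption.
    """
    def __init__(self, state_dim, out_dim, n_modules=4):
        super().__init__()
        self.modules_ = nn.ModuleList(
            nn.Linear(state_dim, out_dim) for _ in range(n_modules))
        self.controller = nn.Linear(state_dim, n_modules)

    def forward(self, state):
        weights = torch.softmax(self.controller(state), dim=-1)       # (B, M)
        outs = torch.stack([m(state) for m in self.modules_], dim=1)  # (B, M, D)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)

fusion = SoftModuleFusion(state_dim=256, out_dim=512)
print(fusion(torch.randn(3, 256)).shape)  # torch.Size([3, 512])
```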

We study the approximation efficiency of function compositions in nonlinear approximation, especially the case when compositions are implemented using multi-layer feed-forward neural networks (FNNs) with ReLU activation functions. The central questions of interest are what advantages function compositions offer in generating dictionaries, and what the optimal implementation of function compositions via ReLU FNNs is, especially in modern computing architectures. These questions are answered by studying the $N$-term approximation rate, which is the decrease in error versus the number of computational nodes (neurons) in the approximant, together with parallel efficiency, for the first time. First, for an arbitrary function $f$ on $[0,1]$, regardless of its smoothness and even its continuity, if $f$ can be approximated via nonlinear approximation using one-hidden-layer ReLU FNNs with an approximation rate $O(N^{-\eta})$, we quantitatively show that dictionaries with function compositions via deep ReLU FNNs can improve the approximation rate to $O(N^{-2\eta})$. Second, for H{\"o}lder continuous functions of order $\alpha$ with a uniform Lipschitz constant $\omega$ on a $d$-dimensional cube, we show that the $N$-term approximation via ReLU FNNs with two or three function compositions can achieve an approximation rate $O(N^{-2\alpha/d})$. The approximation rate can be improved to $O(L^{-2\alpha/d})$ by composing $L$ times, if $N$ is fixed and sufficiently large; but further compositions cannot achieve the approximation rate $O(N^{-\alpha L/d})$. Finally, considering the computational efficiency per training iteration in parallel computing, FNNs with $O(1)$ hidden layers are an optimal choice for approximating H{\"o}lder continuous functions if computing resources are sufficient.
