Research papers and code for "Huan Liu":
We are surrounded by huge amounts of large-scale, high-dimensional data. It is desirable to reduce the dimensionality of data for many learning tasks due to the curse of dimensionality. Feature selection has shown its effectiveness in many applications by building simpler and more comprehensible models, improving learning performance, and preparing clean, understandable data. Recently, some unique characteristics of big data, such as data velocity and data variety, have presented new challenges to the feature selection problem. In this paper, we envision these challenges of feature selection for big data analytics. In particular, we first give a brief introduction to feature selection and then detail the challenges of feature selection for structured, heterogeneous, and streaming data, as well as its scalability and stability issues. Finally, to facilitate and promote feature selection research, we present an open-source feature selection repository (scikit-feature), which includes most of the currently popular feature selection algorithms.
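To make the idea of feature selection concrete, here is a minimal filter-style example; it uses scikit-learn rather than the scikit-feature repository itself, and the data and parameter choices are purely illustrative.

```python
# Minimal sketch of filter-based feature selection with scikit-learn.
# This illustrates the general idea only; the scikit-feature repository
# mentioned above offers many more algorithms (Fisher score, ReliefF, etc.).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic high-dimensional data: 500 samples, 200 features, few informative.
X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=10, random_state=0)

# Rank features by mutual information with the label and keep the top 10.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                           # (500, 10)
print(np.argsort(selector.scores_)[::-1][:10])   # indices of the top-ranked features
```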

* Special Issue on Big Data, IEEE Intelligent Systems, 2016. arXiv admin note: text overlap with arXiv:1601.07996
With convenient access to observational data, learning individual causal effects from such data has drawn increasing attention in many influential research areas such as economics, healthcare, and education. For example, we may aim to study how a medicine (treatment) would affect the health condition (outcome) of a certain patient. To validate causal inference from observational data, we need to control for the influence of confounders - the variables that causally influence both the treatment and the outcome. Along this line, existing work for learning individual treatment effects overwhelmingly relies on the assumption that there are no hidden confounders. However, in real-world observational data, this assumption is untenable and can be unrealistic. In fact, an important fact ignored by existing work is that observational data can come with network information that can be utilized to infer hidden confounders. For example, in an observational study of the individual treatment effect of a medicine, instead of being randomized, the medicine is assigned to individuals based on a series of factors. Some factors (e.g., socioeconomic status) are hard to measure directly and therefore become hidden confounders of observational datasets. Fortunately, the socioeconomic status of an individual can be reflected by whom she is connected to in social networks. With this fact in mind, we aim to exploit the network structure to recognize patterns of hidden confounders in the task of learning individual treatment effects from observational data. In this work, we propose a novel causal inference framework, the network deconfounder, which learns representations of confounders by unraveling patterns of hidden confounders from the network structure between instances of observational data. Empirically, we perform extensive experiments to validate the effectiveness of the network deconfounder on various datasets.

* 16 pages, 4 figures
Predicting signed links in social networks often faces the problem of signed link data sparsity, i.e., only a small percentage of signed links are given. The problem is exacerbated when the number of negative links is much smaller than that of positive links. Boosting signed link prediction therefore necessitates additional information to compensate for data sparsity. According to psychology theories, one rich source of such information is a user's personality, such as optimism and pessimism, which can help determine her propensity for establishing positive and negative links. In this study, we investigate how personality information can be obtained, and whether personality information can help alleviate the data sparsity problem for signed link prediction. We propose a novel signed link prediction model that enables empirical exploration of user personality via social media data. We evaluate our proposed model on two datasets of real-world signed link networks. The results demonstrate the complementary role of personality information in the signed link prediction problem. Experimental results also indicate the effectiveness of different levels of personality information for the signed link data sparsity problem.

* Companion Proceedings of the 2019 World Wide Web Conference
As opposed to manual feature engineering, which is tedious and difficult to scale, network representation learning has attracted a surge of research interest as it automates the process of feature learning on graphs. The learned low-dimensional node vector representation is generalizable and eases the knowledge discovery process on graphs by enabling various off-the-shelf machine learning tools to be directly applied. Recent research has shown that network embedding approaches from the past decade either explicitly factorize a carefully designed matrix to obtain the low-dimensional node vector representation or are closely related to implicit matrix factorization, with the fundamental assumption that the factorized node connectivity matrix is low-rank. Nonetheless, the global low-rank assumption does not necessarily hold, especially when the factorized matrix encodes complex node interactions, and the resulting single low-rank embedding matrix is insufficient to capture all the observed connectivity patterns. In this regard, we propose a novel multi-level network embedding framework, BoostNE, which can learn multiple network embedding representations of different granularity, from coarse to fine, without imposing the prevalent global low-rank assumption. The proposed BoostNE method is also in line with the successful gradient boosting method in ensemble learning, in that multiple weak embeddings lead to a stronger and more effective one. We assess the effectiveness of the proposed BoostNE framework by comparing it with existing state-of-the-art network embedding methods on various datasets, and the experimental results corroborate the superiority of the proposed BoostNE network embedding framework.
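A rough sketch of the multi-level, boosting-style embedding idea follows; it is a simplification that factorizes successive residuals with truncated SVD and concatenates the per-level embeddings, whereas BoostNE's construction of the connectivity matrix and its factorization may differ.

```python
# Hedged sketch of a multi-level "boosted" embedding: each level factorizes
# the residual that previous levels failed to explain, and the per-level
# embeddings are concatenated into one multi-granularity representation.
import numpy as np

def boosted_embedding(M, levels=4, rank=16):
    """M: (n, n) connectivity matrix; returns an (n, levels * rank) embedding."""
    residual = M.astype(float).copy()
    parts = []
    for _ in range(levels):
        U, s, Vt = np.linalg.svd(residual, full_matrices=False)
        U_r, s_r, Vt_r = U[:, :rank], s[:rank], Vt[:rank]
        parts.append(U_r * s_r)                    # per-level node embedding
        residual = residual - (U_r * s_r) @ Vt_r   # unexplained part for the next level
    return np.concatenate(parts, axis=1)

rng = np.random.default_rng(0)
A = rng.random((100, 100))
A = (A + A.T) / 2            # toy symmetric "connectivity" matrix
Z = boosted_embedding(A, levels=4, rank=8)
print(Z.shape)               # (100, 32)
```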

Classification, which involves finding rules that partition a given data set into disjoint groups, is one class of data mining problems. Approaches proposed so far for mining classification rules for large databases are mainly decision tree based symbolic learning methods. The connectionist approach based on neural networks has been thought to be ill-suited for data mining. One of the major reasons cited is that knowledge generated by neural networks is not explicitly represented in the form of rules suitable for verification or interpretation by humans. This paper examines this issue. With our newly developed algorithms, rules that are similar to, or more concise than, those generated by the symbolic methods can be extracted from neural networks. The data mining process using neural networks, with an emphasis on rule extraction, is described. Experimental results and a comparison with previously published work are presented.

* VLDB1995
Sentiment in social media is increasingly considered an important resource for customer segmentation, market understanding, and tackling other socio-economic issues. However, sentiment in social media is difficult to measure since user-generated content is usually short and informal. Although many traditional sentiment analysis methods have been proposed, identifying slang sentiment words remains untackled. One of the reasons is that slang sentiment words are not available in existing dictionaries or sentiment lexicons. To this end, we propose to build the first sentiment dictionary of slang words to aid sentiment analysis of social media content. Because it is laborious and time-consuming to collect and label the sentiment polarity of a comprehensive list of slang words, we present an approach that leverages web resources to construct an extensive Slang Sentiment word Dictionary (SlangSD) that is easy to maintain and extend. SlangSD is publicly available for research purposes. We empirically show the advantages of using SlangSD, the newly built slang sentiment word dictionary, for sentiment classification, and provide examples demonstrating its ease of use with an existing sentiment system.
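As a minimal illustration of how a slang sentiment dictionary can be plugged into lexicon-based scoring, here is a hedged sketch; the dictionary entries and the polarity scale below are made up for illustration and do not reflect the released SlangSD format.

```python
# Minimal lexicon-based sentiment scoring sketch. The "slang_lexicon" mapping
# below is invented for illustration; SlangSD's released file format and
# polarity scale may differ.
slang_lexicon = {
    "lit": 2,        # strongly positive
    "dope": 1,       # positive
    "meh": -1,       # negative
    "sus": -1,       # negative
}

def sentiment_score(text, lexicon):
    """Average polarity of the lexicon words found in the text (0.0 if none)."""
    tokens = text.lower().split()
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

print(sentiment_score("that show was lit but the ending was meh", slang_lexicon))  # 0.5
```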

* 15 pages, 2 figures
Many problems in machine learning and other fields can be (re)formulated as linearly constrained separable convex programs. In most cases, there are multiple blocks of variables. However, the traditional alternating direction method (ADM) and its linearized version (LADM, obtained by linearizing the quadratic penalty term) are designed for the two-block case and cannot be naively generalized to solve the multi-block case, so there is great demand for extending ADM-based methods to the multi-block case. In this paper, we propose LADM with parallel splitting and adaptive penalty (LADMPSAP) to solve multi-block separable convex programs efficiently. When all the component objective functions have bounded subgradients, we obtain convergence results that are stronger than those of ADM and LADM, e.g., allowing the penalty parameter to be unbounded and proving the sufficient and necessary conditions for global convergence. We further propose a simple optimality measure and reveal the convergence rate of LADMPSAP in an ergodic sense. For programs with extra convex set constraints, with refined parameter estimation we devise a practical version of LADMPSAP for faster convergence. Finally, we generalize LADMPSAP to handle programs with more difficult objective functions by linearizing part of the objective function as well. LADMPSAP is particularly suitable for sparse representation and low-rank recovery problems because its subproblems have closed-form solutions and the sparsity and low-rankness of the iterates can be preserved during the iterations. It is also highly parallelizable and hence well suited to parallel or distributed computing. Numerical experiments testify to the advantages of LADMPSAP in speed and numerical accuracy.
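For readers unfamiliar with the setting, the multi-block program and the parallel linearized updates have roughly the following schematic form (a generic template under my reading of the abstract, not the paper's exact update rules or parameter choices):

```latex
% Schematic multi-block separable convex program and parallel linearized updates.
\begin{aligned}
&\min_{x_1,\dots,x_n}\ \sum_{i=1}^{n} f_i(x_i)
  \quad\text{s.t.}\quad \sum_{i=1}^{n}\mathcal{A}_i(x_i)=b,\\[4pt]
&\hat{\lambda}^{k} = \lambda^{k} + \beta_k\Big(\textstyle\sum_{j}\mathcal{A}_j(x_j^{k}) - b\Big),\\
&x_i^{k+1} = \operatorname*{argmin}_{x_i}\ f_i(x_i)
  + \frac{\sigma_i^{(k)}}{2}\Big\|x_i - x_i^{k}
  + \tfrac{1}{\sigma_i^{(k)}}\,\mathcal{A}_i^{*}\!\big(\hat{\lambda}^{k}\big)\Big\|^{2}
  \qquad\text{(all $i$ in parallel)},\\
&\lambda^{k+1} = \lambda^{k} + \beta_k\Big(\textstyle\sum_{i}\mathcal{A}_i(x_i^{k+1}) - b\Big),
\qquad \beta_{k+1} = \rho\,\beta_k,\ \ \rho \ge 1\ \ \text{(adaptive penalty)}.
\end{aligned}
```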

* Preliminary version published on Asian Conference on Machine Learning 2013
In this work, we address the following matrix recovery problem: suppose we are given a set of data points containing two parts, where one part consists of samples drawn from a union of multiple subspaces and the other part consists of outliers. We do not know which data points are outliers or how many outliers there are; neither the rank nor the number of the subspaces is known. Can we detect the outliers and segment the samples into their right subspaces, efficiently and exactly? We utilize a so-called Low-Rank Representation (LRR) method to solve this problem, and prove that under mild technical conditions, any solution to LRR exactly recovers the row space of the samples and detects the outliers as well. Since the subspace membership is provably determined by the row space, this further implies that LRR can perform exact subspace segmentation and outlier detection in an efficient way.
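For reference, the LRR program referred to here is usually stated as the following convex problem (standard form from the LRR literature; notation may differ slightly from the paper):

```latex
% Standard Low-Rank Representation (LRR) program with column-wise outliers:
% X is the data matrix (columns are samples), Z the low-rank representation,
% and E the outlier/error term penalized by the l_{2,1} norm.
\min_{Z,\,E}\ \|Z\|_{*} + \lambda\,\|E\|_{2,1}
\quad \text{s.t.}\quad X = XZ + E
```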

* Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012
Network representation learning, a fundamental research problem which aims at learning low-dimensional node representations on graph-structured data, has been extensively studied in the research community. By generalizing the power of neural networks to graph-structured data, graph neural networks (GNNs) achieve superior capability in network representation learning. However, the node features of many real-world graphs can be high-dimensional and sparse, rendering the learned node representations from existing GNN architectures less expressive. The main reason is that those models directly use the raw node features as input for message passing and have limited power in capturing sophisticated interactions between features. In this paper, we propose a novel GNN framework for learning node representations that incorporate high-order feature interactions on feature-sparse graphs. Specifically, the proposed message aggregator and feature factorizer extract two channels of embeddings from the feature-sparse graph, characterizing the aggregated node features and the high-order feature interactions, respectively. Furthermore, we develop an attentive fusion network to seamlessly combine the information from the two channels and learn feature interaction-aware node representations. Extensive experiments on various datasets demonstrate the effectiveness of the proposed framework on a variety of graph learning tasks.
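To make the high-order feature interaction channel concrete, the following sketch computes a generic factorization-machine-style second-order interaction embedding for a sparse feature vector; it is an illustrative stand-in for the paper's feature factorizer, not its exact module.

```python
# Generic second-order feature interaction (factorization-machine style) for a
# sparse feature vector x with per-feature latent factors V. Computed in O(d*k)
# via the standard FM identity rather than enumerating all feature pairs.
import numpy as np

def second_order_interactions(x, V):
    """x: (d,) feature vector; V: (d, k) latent factors.
    Returns a (k,) embedding equal to sum_{i<j} x_i x_j * (v_i ⊙ v_j)."""
    xv = x @ V                          # (k,)  sum_i x_i v_i
    x2v2 = (x ** 2) @ (V ** 2)          # (k,)  sum_i x_i^2 v_i^2
    return 0.5 * (xv ** 2 - x2v2)       # pairwise interaction embedding

d, k = 1000, 16
rng = np.random.default_rng(0)
x = np.zeros(d)
x[rng.choice(d, size=20, replace=False)] = 1.0   # sparse binary node features
V = rng.normal(scale=0.1, size=(d, k))
print(second_order_interactions(x, V).shape)     # (16,)
```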

The increasing popularity and diversity of social media sites has encouraged more and more people to participate in multiple online social networks to enjoy their services. Each user may create a user identity to represent his or her unique public figure in every social network. User identity linkage across online social networks is an emerging task that has attracted increasing attention and could potentially impact various domains such as recommendation and link prediction. The majority of existing work focuses on mining network proximity or user profile data for discovering user identity linkages. Recent advancements in graph neural networks (GNNs) offer great potential to advance user identity linkage, since users are connected in social graphs and learning latent factors of users and items is the key. However, predicting user identity linkages based on GNNs faces challenges. For example, user social graphs encode both local structure, such as users' neighborhood signals, and global structure with community properties. To address these challenges simultaneously, in this paper we present a novel graph neural network framework for user identity linkage. In particular, we provide a principled approach to jointly capture local and global information in the user-user social graph and propose a framework that jointly learns user representations for user identity linkage. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed framework.

* 7 pages, 3 figures
Learning expressive low-dimensional representations of ultrahigh-dimensional data, e.g., data with thousands or millions of features, has been a major way to enable learning methods to address the curse of dimensionality. However, existing unsupervised representation learning methods mainly focus on preserving the data regularity information and learn the representations independently of subsequent outlier detection methods, which can result in suboptimal and unstable performance when detecting irregularities (i.e., outliers). This paper introduces a ranking model-based framework, called RAMODO, to address this issue. RAMODO unifies representation learning and outlier detection to learn low-dimensional representations that are tailored for a state-of-the-art outlier detection approach - the random distance-based approach. This customized learning yields better and more stable representations for the targeted outlier detectors. Additionally, RAMODO can leverage a small amount of labeled data as prior knowledge to learn more expressive and application-relevant representations. We instantiate RAMODO as an efficient method called REPEN to demonstrate the performance of RAMODO. Extensive empirical results on eight real-world ultrahigh-dimensional data sets show that REPEN (i) enables a random distance-based detector to obtain significantly better AUC performance and a two orders of magnitude speedup; (ii) performs substantially better and more stably than four state-of-the-art representation learning methods; and (iii) leverages less than 1% labeled data to achieve up to 32% AUC improvement.
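The random distance-based detector that RAMODO/REPEN targets can be illustrated with a simple score: the distance of each point to its nearest neighbor in a small random subsample, averaged over an ensemble of subsamples. This is a generic sketch in the spirit of such detectors, not the paper's exact formulation.

```python
# Sketch of a random distance-based outlier score: distance of each point to its
# nearest neighbor in a small random subsample, averaged over an ensemble of
# subsamples. Higher scores indicate more outlying points.
import numpy as np

def random_distance_scores(X, subsample_size=16, n_ensembles=50, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scores = np.zeros(n)
    for _ in range(n_ensembles):
        idx = rng.choice(n, size=subsample_size, replace=False)
        S = X[idx]                                            # random subsample
        # distance of every point to its nearest neighbor in the subsample
        d = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=2).min(axis=1)
        scores += d
    return scores / n_ensembles

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(200, 10)), rng.normal(loc=6.0, size=(5, 10))])
s = random_distance_scores(X)
print(np.argsort(s)[-5:])   # most outlying indices (very likely the 5 shifted points)
```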

* 10 pages, 4 figures, 3 tables. To appear in the proceedings of KDD18, Long presentation (oral)
Most social media platforms are largely based on text, and users often write posts to describe where they are, what they are seeing, and how they are feeling. Because written text lacks the emotional cues of spoken and face-to-face dialogue, ambiguities are common in written language. This problem is exacerbated by the short, informal nature of many social media posts. To mitigate this issue, a suite of special characters called "emojis," which are small pictograms, is embedded within the text. Many emojis are small depictions of facial expressions designed to help disambiguate the emotional meaning of the text. However, a new ambiguity arises in the way that emojis are rendered: every platform (Windows, Mac, and Android, to name a few) renders emojis according to its own style. In fact, it has been shown that some emojis can be rendered so differently that they look "happy" on some platforms and "sad" on others. In this work, we use real-world data to verify the existence of this problem. We verify that the usage of the same emoji can be significantly different across platforms, with some emojis exhibiting different sentiment polarities on different platforms. We propose a solution to identify the intended emoji based on the platform-specific nature of the emoji used by the author of a social media post. We apply our solution to sentiment analysis, a task that can benefit from the emoji calibration technique we use in this work, and conduct experiments to evaluate the effectiveness of the mapping in this task.

Many real-world relations can be represented by signed networks with positive and negative links; as a result, signed network analysis has attracted increasing attention from multiple disciplines. With the increasing prevalence of social media networks, signed network analysis has evolved from developing and measuring theories to mining tasks. In this article, we present a review of mining signed networks in the context of social media and discuss some promising research directions and new frontiers. We begin by giving basic concepts and unique properties and principles of signed networks. Then we classify and review tasks of signed network mining with representative algorithms. We also give formal definitions for tasks that have not been extensively studied and propose research directions to expand the field of signed network mining.

* 37 pages
Recent studies have revealed the vulnerability of deep neural networks: a small adversarial perturbation that is imperceptible to humans can easily make a well-trained deep neural network misclassify. This makes it unsafe to apply neural networks in security-critical applications. In this paper, we propose a new defense algorithm called Random Self-Ensemble (RSE) that combines two important concepts: randomness and ensemble. To protect a targeted model, RSE adds random noise layers to the neural network to prevent strong gradient-based attacks, and ensembles the predictions over random noises to stabilize performance. We show that our algorithm is equivalent to ensembling an infinite number of noisy models $f_\epsilon$ without any additional memory overhead, and that the proposed training procedure based on noisy stochastic gradient descent ensures the ensemble model has good predictive capability. Our algorithm significantly outperforms previous defense techniques on real data sets. For instance, on CIFAR-10 with a VGG network (which has 92% accuracy without any attack), under the strong C&W attack within a certain distortion tolerance, the accuracy of the unprotected model drops to less than 10% and the best previous defense technique achieves 48% accuracy, while our method still has 86% prediction accuracy under the same level of attack. Finally, our method is simple and easy to integrate into any neural network.
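A minimal PyTorch sketch of the randomness-plus-ensemble idea follows: Gaussian noise layers stay active at both training and test time, and predictions are averaged over several noisy forward passes. The layer placement and noise scales are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of Random Self-Ensemble: additive Gaussian noise layers remain
# active at evaluation time, and the final prediction averages the softmax
# outputs of several noisy forward passes. Noise scales are illustrative only.
import torch
import torch.nn as nn

class NoiseLayer(nn.Module):
    def __init__(self, sigma=0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        return x + self.sigma * torch.randn_like(x)   # noise is NOT disabled at eval time

class NoisyNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            NoiseLayer(0.2),                            # noise before the first conv
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            NoiseLayer(0.1),                            # noise before a later layer
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

@torch.no_grad()
def ensemble_predict(model, x, n_samples=10):
    probs = torch.stack([model(x).softmax(dim=1) for _ in range(n_samples)])
    return probs.mean(dim=0)                            # average over noise draws

model = NoisyNet()
x = torch.randn(4, 3, 32, 32)
print(ensemble_predict(model, x).shape)                 # torch.Size([4, 10])
```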

* ECCV 2018 camera ready
This paper presents a novel framework to generate realistic face video of an anchor who is reading certain news; this task is also known as Virtual Anchor. Given some paragraphs of words, we first utilize a pretrained Word2Vec model to embed each word into a vector; then we utilize a Seq2Seq-based model to translate these word embeddings into action units and head poses of the target anchor; these action units and head poses are concatenated with facial landmarks as well as the previous $n$ synthesized frames, and the concatenation serves as input to a Pix2PixHD-based model that synthesizes realistic facial images for the virtual anchor. The experimental results demonstrate that our framework is feasible for the synthesis of a virtual anchor.

* 4 pages,4 figures
This paper aims to learn a compact representation of a video for the video face recognition task. We make the following contributions. First, we propose a meta attention-based aggregation scheme that adaptively weighs the features of all frames along each feature dimension to form a compact and discriminative representation. It makes the best use of the valuable or discriminative part of each frame to promote face recognition performance, without discarding or down-weighting low-quality frames as usual methods do. Second, we build a feature aggregation network comprised of a feature embedding module and a feature aggregation module. The embedding module is a convolutional neural network that extracts a feature vector from a face image, while the aggregation module consists of two cascaded meta attention blocks that adaptively aggregate the feature vectors into a single fixed-length representation. The network can deal with an arbitrary number of frames and is insensitive to frame order. Third, we validate the performance of the proposed aggregation scheme. Experiments on publicly available datasets, such as the YouTube Faces dataset and the IJB-A dataset, show the effectiveness of our method, which achieves competitive performance on both the verification and identification protocols.
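A simplified sketch of dimension-wise attention aggregation over per-frame features is given below; it is order-invariant and handles any number of frames, but the paper's cascaded meta attention blocks are more elaborate than this single-layer version.

```python
# Simplified attention-based aggregation of per-frame face features into one
# fixed-length video representation. Softmax weights over frames are computed
# per feature dimension, making the result order-invariant and length-agnostic.
import torch
import torch.nn as nn

class AttentiveAggregation(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, dim)    # per-dimension attention scores

    def forward(self, frame_feats):
        """frame_feats: (num_frames, dim) -> (dim,) aggregated representation."""
        scores = self.score(frame_feats)             # (num_frames, dim)
        weights = scores.softmax(dim=0)              # normalize over frames, per dimension
        return (weights * frame_feats).sum(dim=0)    # weighted sum along each dimension

agg = AttentiveAggregation(dim=256)
video = torch.randn(17, 256)                         # 17 frames, 256-D features each
print(agg(video).shape)                              # torch.Size([256])
```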

* 9 pages, 6 figures
Symbolic Aggregate approXimation (SAX) is a classical symbolic approach in many time series data mining applications. However, SAX only reflects the mean value of each segment and misses important information, namely the trend of the value change within the segment. Such a miss may cause wrong classifications in some cases, since the SAX representation cannot distinguish time series with similar average values but different trends. In this paper, we present Trend Feature Symbolic Aggregate approximation (TFSAX) to solve this problem. First, we utilize the Piecewise Aggregate Approximation (PAA) approach to reduce dimensionality and discretize the mean value of each segment by SAX. Second, we extract the trend feature of each segment using a trend distance factor and a trend shape factor. Then, we design multi-resolution symbolic mapping rules to discretize the trend information into symbols. We also propose a modified distance measure that integrates the SAX distance with a weighted trend distance. We show that our distance measure has a tighter lower bound to the Euclidean distance than that of the original SAX. The experimental results on diverse time series data sets demonstrate that our proposed representation significantly outperforms the original SAX representation and an improved SAX representation for classification.
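For reference, the plain PAA-plus-SAX step that TFSAX extends can be sketched as follows; the trend-feature extraction and the modified TFSAX distance are not reproduced here, and the segment and alphabet sizes are arbitrary.

```python
# Plain PAA + SAX discretization that TFSAX extends (trend features and the
# modified distance are not shown). Breakpoints are standard Gaussian quantiles.
import numpy as np
from scipy.stats import norm

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each of n_segments chunks."""
    chunks = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([c.mean() for c in chunks])

def sax(series, n_segments=8, alphabet_size=4):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)                  # z-normalize
    means = paa(x, n_segments)
    # Gaussian breakpoints, e.g. [-0.67, 0, 0.67] for an alphabet of size 4.
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.digitize(means, breakpoints)               # 0 .. alphabet_size-1
    return "".join(chr(ord("a") + s) for s in symbols)

t = np.linspace(0, 4 * np.pi, 128)
print(sax(np.sin(t)))   # 8-symbol SAX word for two sine periods, e.g. "ddaaddaa"
```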

* 9 pages, ACM_IMCOM2019_CFP
Modeling spillover effects from observational data is an important problem in economics, business, and other fields of research; it helps us infer the causality between two seemingly unrelated sets of events. For example, if consumer spending in the United States declines, it has spillover effects on economies that depend on the U.S. as their largest export market. In this paper, we aim to infer the causation that results in spillover effects between pairs of entities (or units); we call this effect the paired spillover. To achieve this, we leverage recent developments in variational inference and deep learning to propose a generative model called the Linked Causal Variational Autoencoder (LCVA). Similar to variational autoencoders (VAE), LCVA incorporates an encoder neural network to learn the latent attributes and a decoder network to reconstruct the inputs. However, unlike VAE, LCVA treats the latent attributes as confounders that are assumed to affect both the treatment and the outcome of units. Specifically, given a pair of units $u$ and $\bar{u}$ and their individual treatments and outcomes, the encoder network of LCVA samples the confounders by conditioning on the observed covariates of $u$, the treatments of both $u$ and $\bar{u}$, and the outcome of $u$. Once inferred, the latent attributes (or confounders) of $u$ capture the spillover effect of $\bar{u}$ on $u$. Using a network of users from the job training dataset (LaLonde, 1986) and a co-purchase dataset from the Amazon e-commerce domain, we show that LCVA is significantly more robust than existing methods in capturing spillover effects.
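A hedged sketch of the encoder step described above is given below: the confounder posterior for a unit conditions on its covariates, the treatments of both paired units, and its own outcome. Layer sizes and sampling details are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an LCVA-style encoder: the confounder posterior for unit u
# conditions on u's covariates, the treatments of both u and its paired unit,
# and u's outcome. Layer sizes and architecture details are illustrative only.
import torch
import torch.nn as nn

class PairedEncoder(nn.Module):
    def __init__(self, x_dim, z_dim=16, hidden=64):
        super().__init__()
        # input: covariates of u, treatment of u, treatment of paired unit, outcome of u
        self.net = nn.Sequential(nn.Linear(x_dim + 3, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, x_u, t_u, t_pair, y_u):
        h = self.net(torch.cat([x_u, t_u, t_pair, y_u], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return z, mu, logvar

enc = PairedEncoder(x_dim=25)
x_u = torch.randn(8, 25)
t_u = torch.randint(0, 2, (8, 1)).float()
t_pair = torch.randint(0, 2, (8, 1)).float()
y_u = torch.randn(8, 1)
z, mu, logvar = enc(x_u, t_u, t_pair, y_u)
print(z.shape)   # torch.Size([8, 16])
```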

* First two authors contributed equally, 4 pages, CIKM'18
The human visual system excels at detecting local blur in visual images, but the underlying mechanism is not well understood. Traditional views of blur, such as reduction in energy at high frequencies and loss of phase coherence at localized features, have fundamental limitations; for example, they cannot reliably discriminate flat regions from blurred ones. Here we propose that high-level semantic information is critical for successfully identifying local blur. Therefore, we resort to deep neural networks, which are proficient at learning high-level features, and propose the first end-to-end local blur mapping algorithm based on a fully convolutional network. By analyzing various architectures with different depths and design philosophies, we empirically show that high-level features of deeper layers play a more important role than low-level features of shallower layers in resolving challenging ambiguities for this task. We test the proposed method on a standard blur detection benchmark and demonstrate that it significantly advances the state-of-the-art (ODS F-score of 0.853). Furthermore, we explore the use of the generated blur maps in three applications: blur region segmentation, blur degree estimation, and blur magnification.

* Tech report
Expert finding is an important task in both industry and academia. It is challenging to rank candidates with appropriate expertise for various queries. In addition, different types of objects interact with one another, which naturally forms heterogeneous information networks. We study the task of expert finding in heterogeneous bibliographical networks based on two aspects: textual content analysis and authority ranking. Regarding the textual content analysis, we propose a new method for query expansion via locally-trained embedding learning with concept hierarchy as guidance, which is particularly tailored for specific queries with narrow semantic meanings. Compared with global embedding learning, locally-trained embedding learning projects the terms into a latent semantic space constrained on relevant topics, therefore it preserves more precise and subtle information for specific queries. Considering the candidate ranking, the heterogeneous information network structure, while being largely ignored in the previous studies of expert finding, provides additional information. Specifically, different types of interactions among objects play different roles. We propose a ranking algorithm to estimate the authority of objects in the network, treating each strongly-typed edge type individually. To demonstrate the effectiveness of the proposed framework, we apply the proposed method to a large-scale bibliographical dataset with over two million entries and one million researcher candidates. The experiment results show that the proposed framework outperforms existing methods for both general and specific queries.
