Research papers and code for "Na Zou":
Deep learning is increasingly used in high-stakes decision-making applications that affect individual lives. However, deep learning models may exhibit algorithmic discrimination against protected groups, with potentially negative impacts on individuals and society. Fairness in deep learning has therefore attracted tremendous attention recently. We provide a comprehensive review of existing techniques for tackling algorithmic fairness problems from a computational perspective. Specifically, we show that interpretability can serve as a useful ingredient that can be augmented into bias detection and mitigation pipelines. We also discuss open research problems and future research directions, aiming to push forward the area of fairness in deep learning and to build genuinely fair, accountable, and transparent deep learning systems.
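To make the computational perspective concrete, here is a minimal sketch of one widely used group-fairness metric, the demographic parity difference, for a binary classifier. The metric is standard in the fairness literature rather than specific to this survey, and the function and variable names (`y_pred`, `group`) are illustrative.

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute difference in positive-prediction rates between two groups.

    y_pred: array of 0/1 predictions; group: array of 0/1 protected-group labels.
    A value near 0 suggests the classifier satisfies demographic parity.
    """
    rate_0 = y_pred[group == 0].mean()  # positive rate for group 0
    rate_1 = y_pred[group == 1].mean()  # positive rate for group 1
    return abs(rate_0 - rate_1)

# Toy usage: a classifier that favors group 1.
y_pred = np.array([1, 0, 0, 1, 1, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_difference(y_pred, group))  # 0.25
```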

An important step in early brain development studies is the automatic segmentation of infant brain magnetic resonance (MR) images into cerebrospinal fluid (CSF), gray matter (GM), and white matter (WM) regions. This task is especially challenging in the isointense stage (approximately 6-8 months of age), when GM and WM exhibit similar levels of intensity in MR images. Deep learning has shown great promise in various image segmentation tasks. However, existing models do not have an efficient and effective way to aggregate global information, and they suffer from information loss during up-sampling operations. In this work, we address these problems by proposing a global aggregation block, which can be flexibly used for global information fusion. We build a novel model based on 3D U-Net to make fast and accurate voxel-wise dense predictions. We perform thorough experiments, and the results indicate that our model significantly outperforms previous best models on 3D multimodality isointense infant brain MR image segmentation.
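The following PyTorch sketch shows one plausible form of a global aggregation block: a single-head self-attention layer in which every output voxel is a weighted combination of all input voxels. The class name and design are illustrative; the block proposed in the paper may differ in its projections, normalization, and placement within the 3D U-Net.

```python
import torch
import torch.nn as nn

class GlobalAggregation(nn.Module):
    """Minimal self-attention block that fuses information across all voxels.

    A sketch of the idea only: each output voxel is a weighted sum of all
    input voxels, so global context is aggregated in a single layer.
    """
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv3d(channels, channels, kernel_size=1)
        self.key   = nn.Conv3d(channels, channels, kernel_size=1)
        self.value = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, D, H, W)
        b, c, d, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # (B, N, C), N = D*H*W
        k = self.key(x).flatten(2)                         # (B, C, N)
        v = self.value(x).flatten(2).transpose(1, 2)       # (B, N, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)     # (B, N, N) global weights
        out = (attn @ v).transpose(1, 2).reshape(b, c, d, h, w)
        return out + x                                     # residual connection

x = torch.randn(1, 8, 4, 4, 4)
print(GlobalAggregation(8)(x).shape)  # torch.Size([1, 8, 4, 4, 4])
```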

* 10 pages, 9 figures, 8 tables
Anomaly detection aims to distinguish observations that are rare and different from the majority. While most existing algorithms assume that instances are i.i.d., in many practical scenarios, links describing instance-to-instance dependencies and interactions are available. Such systems are called attributed networks. Anomaly detection in attributed networks has various applications, such as monitoring suspicious accounts on social media and detecting financial fraud in transaction networks. However, it remains a challenging task, since the definition of anomaly becomes more complicated and the topological structure and nodal attributes are heterogeneous sources of information. In this paper, we propose SpecAE, a spectral convolution and deconvolution based framework that projects the attributed network into a tailored space to detect global and community anomalies. SpecAE leverages Laplacian sharpening to amplify the distances between the representations of anomalies and those of the majority. The learned representations, along with the reconstruction errors, are combined with a density estimation model to perform the detection; they are trained jointly as an end-to-end framework. Experiments on real-world datasets demonstrate the effectiveness of SpecAE.
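The sketch below illustrates the Laplacian sharpening idea in numpy, under the common formulation where graph convolution smooths features by multiplying with the normalized adjacency $\hat{A}$ while sharpening instead applies $2I - \hat{A}$, pushing each node's representation away from its neighbors. SpecAE's exact convolution and deconvolution operators may differ.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalized adjacency with self-loops: D^{-1/2}(A+I)D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X = np.array([[1.0], [0.0], [10.0]])       # node attributes; node 2 stands out

A_norm = normalized_adjacency(A)
smoothed  = A_norm @ X                     # convolution: pulls nodes toward neighbors
sharpened = (2 * np.eye(3) - A_norm) @ X   # deconvolution: pushes nodes away from neighbors
print(smoothed.ravel(), sharpened.ravel())
```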

* 5 pages, in proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM)
When applying the support vector machine (SVM) to high-dimensional classification problems, we often impose a sparse structure in the SVM to eliminate the influences of the irrelevant predictors. The lasso and other variable selection techniques have been successfully used in the SVM to perform automatic variable selection. In some problems, there is a natural hierarchical structure among the variables. Thus, in order to have an interpretable SVM classifier, it is important to respect the heredity principle when enforcing the sparsity in the SVM. Many variable selection methods, however, do not respect the heredity principle. In this paper we enforce both sparsity and the heredity principle in the SVM by using the so-called structured variable selection (SVS) framework originally proposed in Yuan, Joseph and Zou (2007). We minimize the empirical hinge loss under a set of linear inequality constraints and a lasso-type penalty. The solution always obeys the desired heredity principle and enjoys sparsity. The new SVM classifier can be efficiently fitted, because the optimization problem is a linear program. Another contribution of this work is to present a nonparametric extension of the SVS framework, and we propose nonparametric heredity SVMs. Simulated and real data are used to illustrate the merits of the proposed method.
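As a hedged illustration of how such a problem can be posed as a linear program, the cvxpy sketch below fits a sparse SVM with two main effects, one interaction, and a strong-heredity-style constraint bounding the interaction coefficient's magnitude by those of its parents. The penalty weight and the exact constraint set are illustrative and may differ from the SVS formulation in the paper.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
X = np.column_stack([x1, x2, x1 * x2])            # main effects + interaction
y = np.sign(X @ np.array([1.5, -1.0, 0.5]) + 0.1 * rng.standard_normal(n))

# Split each coefficient into positive and negative parts so that both the
# lasso penalty and the heredity constraints are linear (the problem is an LP).
bp = cp.Variable(3, nonneg=True)                  # positive parts
bn = cp.Variable(3, nonneg=True)                  # negative parts
b0 = cp.Variable()
margins = cp.multiply(y, X @ (bp - bn) + b0)
hinge = cp.sum(cp.pos(1 - margins))               # empirical hinge loss
lam = 1.0                                         # illustrative penalty weight
objective = cp.Minimize(hinge + lam * cp.sum(bp + bn))
heredity = [bp[2] + bn[2] <= bp[0] + bn[0],       # |b12| bounded by |b1|
            bp[2] + bn[2] <= bp[1] + bn[1]]       # |b12| bounded by |b2|
cp.Problem(objective, heredity).solve()
print((bp - bn).value)                            # sparse, heredity-respecting fit
```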

* Electronic Journal of Statistics 2008, Vol. 2, 103-117
* Published in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org), DOI: http://dx.doi.org/10.1214/07-EJS125
In classification, the de facto method for aggregating individual losses is the average loss. When the actual metric of interest is the 0-1 loss, it is common to minimize the average surrogate loss for some well-behaved (e.g., convex) surrogate. Recently, several other aggregate losses, such as the maximal loss and the average top-$k$ loss, were proposed as alternative objectives to address shortcomings of the average loss. However, we identify common classification settings, e.g., when the data is imbalanced or has too many easy or ambiguous examples, in which the average, maximal, and average top-$k$ losses all suffer from suboptimal decision boundaries, even on an infinitely large training set. To address this problem, we propose a new classification objective called the close-$k$ aggregate loss, where we adaptively minimize the loss for points close to the decision boundary. We provide theoretical guarantees for the 0-1 accuracy when we optimize the close-$k$ aggregate loss. We also conduct systematic experiments across the PMLB and OpenML benchmark datasets. Close-$k$ achieves significant gains in 0-1 test accuracy (improvements of $\geq 2\%$ at $p<0.05$) on over 25% of the datasets compared to the average, maximal, and average top-$k$ losses. In contrast, the previous aggregate losses outperformed close-$k$ on less than 2% of the datasets.
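The numpy sketch below contrasts a simplified close-$k$ aggregate (the average surrogate loss over the $k$ examples nearest the decision boundary) with the average top-$k$ loss. The paper's close-$k$ objective may weight points more smoothly around the $k$-th closest example; this is only meant to make the aggregation idea concrete.

```python
import numpy as np

def close_k_loss(scores, y, k):
    """Average surrogate loss over the k examples closest to the decision boundary.

    scores: real-valued classifier outputs f(x); y: labels in {-1, +1}.
    """
    losses = np.maximum(0.0, 1.0 - y * scores)    # hinge surrogate
    closest = np.argsort(np.abs(scores))[:k]      # k points nearest |f(x)| = 0
    return losses[closest].mean()

def average_top_k_loss(scores, y, k):
    """Average of the k largest losses (a previously proposed aggregate)."""
    losses = np.maximum(0.0, 1.0 - y * scores)
    return np.sort(losses)[-k:].mean()

scores = np.array([2.5, 0.3, -0.1, -3.0, 0.8])
y      = np.array([1,   1,   -1,   -1,  -1])
print(close_k_loss(scores, y, k=2))        # 0.8: focuses on the boundary region
print(average_top_k_loss(scores, y, k=2))  # 1.35: dominated by the hardest points
```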

Pseudo-Boolean monotone functions are unimodal functions which are trivial to optimize for some hillclimbers, but are challenging for a surprising number of evolutionary algorithms (EAs). A general trend is that EAs are efficient if parameters like the mutation rate are set conservatively, but may need exponential time otherwise. In particular, it was known that the $(1+1)$-EA and the $(1+\lambda)$-EA can optimize every monotone function in pseudolinear time if the mutation rate is $c/n$ for some $c<1$, but they need exponential time for some monotone functions for $c>2.2$. The second part of the statement was also known for the $(\mu+1)$-EA. In this paper we show that the first statement does not apply to the $(\mu+1)$-EA. More precisely, we prove that for every constant $c>0$ there is a constant integer $\mu_0$ such that the $(\mu+1)$-EA with mutation rate $c/n$ and population size $\mu_0\le\mu\le n$ needs superpolynomial time to optimize some monotone functions. Thus, increasing the population size by just a constant has devastating effects on the performance. This is in stark contrast to many other benchmark functions on which increasing the population size either increases the performance significantly, or affects performance mildly. The reason why larger populations are harmful lies in the fact that larger populations may temporarily decrease selective pressure on parts of the population. This allows unfavorable mutations to accumulate in single individuals and their descendants. If the population moves sufficiently fast through the search space, such unfavorable descendants can become ancestors of future generations, and the bad mutations are preserved. Remarkably, this effect only occurs if the population renews itself sufficiently fast, which can only happen far away from the optimum. This is counter-intuitive since usually optimization gets harder as we approach the optimum.
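For readers unfamiliar with the algorithm, here is a minimal Python sketch of the $(\mu+1)$-EA with mutation rate $c/n$, run on OneMax (a trivial monotone function). The adversarial monotone functions constructed in the paper are far more intricate; the sketch only fixes the algorithmic setup being analyzed.

```python
import random

def one_max(x):
    """A simple monotone pseudo-Boolean function (more 1-bits is always better)."""
    return sum(x)

def mu_plus_one_ea(f, n, mu, c, max_iters=100_000):
    """(mu+1)-EA sketch: keep mu individuals, mutate one, discard the worst.

    Each bit flips independently with mutation rate c/n, as in the paper's setup.
    """
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    for it in range(max_iters):
        parent = random.choice(pop)
        child = [b ^ (random.random() < c / n) for b in parent]
        pop.append(child)
        pop.remove(min(pop, key=f))        # (mu+1) selection: drop the worst
        if f(max(pop, key=f)) == n:
            return it                      # iterations until the optimum is found
    return max_iters

print(mu_plus_one_ea(one_max, n=50, mu=5, c=1.0))
```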

A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the condition on the width of the neural network to ensure the global convergence is very stringent, which is often a high-degree polynomial in the training sample size $n$ (e.g., $O(n^{24})$). In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which only requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. The main technical contributions of our analysis include (a) a tighter gradient lower bound that leads to a faster convergence of the algorithm, and (b) a sharper characterization of the trajectory length of the algorithm. By specializing our result to two-layer (i.e., one-hidden-layer) neural networks, it also provides a milder over-parameterization condition than the best-known result in prior work.

* 30 pages, 1 figure, 1 table
Meta learning is a promising solution to few-shot learning problems. However, existing meta learning methods are restricted to scenarios where training and application tasks share the same output structure. To obtain a meta model applicable to tasks with new structures, it is necessary to collect new training data and repeat the time-consuming meta training procedure. This makes existing methods inefficient or even inapplicable for heterogeneous few-shot learning tasks. We thus develop a novel and principled Hierarchical Meta Learning (HML) method. Unlike existing methods that only focus on optimizing the adaptability of a meta model to similar tasks, HML also explicitly optimizes its generalizability across heterogeneous tasks. To this end, HML first factorizes a set of similar training tasks into heterogeneous ones and then trains the meta model over them at two levels, maximizing adaptation and generalization performance respectively. The resulting model can then directly generalize to new tasks. Extensive experiments on few-shot classification and regression problems clearly demonstrate the superiority of HML over fine-tuning and state-of-the-art meta learning approaches in terms of generalization across heterogeneous tasks.

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what an equitable valuation for individual data would be. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on $n$ data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor performance. Data Shapley uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, extensive experiments across biomedical, image, and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage-score methods in providing insight into what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor.
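A minimal Monte Carlo estimator of data Shapley values is sketched below, without the truncation and convergence checks used in the paper's more efficient estimators. The logistic-regression learner, the validation-accuracy metric, and the constants are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def monte_carlo_shapley(X, y, X_val, y_val, n_perms=50, min_size=5):
    """Monte Carlo estimate of data Shapley values (simplified sketch).

    For random permutations of the training points, each point's value is the
    change in validation accuracy when it is added to its predecessors.
    """
    n = len(y)
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = np.random.permutation(n)
        prev_score = 0.5                          # baseline: random guessing
        for i in range(min_size, n):              # need a few points to fit at all
            subset = perm[: i + 1]
            if len(np.unique(y[subset])) < 2:
                continue                          # cannot fit a classifier yet
            score = LogisticRegression().fit(X[subset], y[subset]).score(X_val, y_val)
            values[perm[i]] += score - prev_score # marginal contribution
            prev_score = score
    return values / n_perms

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_val = rng.standard_normal((200, 2)); y_val = (X_val[:, 0] + X_val[:, 1] > 0).astype(int)
y[0] = 1 - y[0]                                   # corrupt one label
vals = monte_carlo_shapley(X, y, X_val, y_val)
print("corrupted point:", vals[0], " average point:", vals.mean())  # low vs. higher value
```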

Purpose: This paper presents an impedance control method with mixed $H_2/H_\infty$ synthesis and relaxed passivity for a cable-driven series elastic actuator to be applied in physical human-robot interaction. Design/methodology/approach: To shape the system's impedance to match a desired dynamic model, the impedance control problem was reformulated as an impedance matching problem. The desired competing performance requirements, as well as constraints from the physical system, can be characterized with weighting functions on the respective signals. Considering the frequency properties of human movements, the passivity constraint for stable human-robot interaction, which is normally required over the entire frequency spectrum and may yield conservative solutions, has been relaxed so that it only restrains the low-frequency band. Impedance control thus becomes a mixed $H_2/H_\infty$ synthesis problem, and a dynamic output feedback controller can be obtained. Findings: The proposed impedance control strategy has been tested for various desired impedances with both simulations and experiments on the cable-driven series elastic actuator platform. The actual interaction torque tracked the desired torque well within the desired norm bounds, and the control input was regulated below the motor velocity limit. The closed-loop system guarantees relaxed passivity at low frequencies. Both simulation and experimental results have validated the feasibility and efficacy of the proposed method. Originality/value: This impedance control strategy with mixed $H_2/H_\infty$ synthesis and relaxed passivity provides a novel, effective, and less conservative method for physical human-robot interaction control.
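To make the impedance-matching target concrete, the sketch below simulates a desired mass-spring-damper interaction model with the python-control package. The parameter values are hypothetical, and the snippet shows only the matching target, not the mixed $H_2/H_\infty$ synthesis itself, which additionally requires the weighting functions and a Riccati/LMI-based solver.

```python
import numpy as np
import control

# Desired interaction dynamics: tau = M*q'' + B*q' + K*q, i.e. the actuator
# should feel like a mass-spring-damper to the human. The (proper) admittance
# form q(s)/tau(s) = 1 / (M s^2 + B s + K) can be simulated directly.
M, B, K = 0.1, 2.0, 20.0                     # hypothetical desired parameters
admittance = control.TransferFunction([1.0], [M, B, K])

T = np.linspace(0, 3, 300)
T, q = control.step_response(admittance, T)  # position response to a unit torque
print("steady-state deflection:", q[-1], "(expected 1/K =", 1 / K, ")")
```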

* Assembly Automation, Vol. 37, Issue: 3, pp.296-303, 2017
* 11 pages, already published in Assembly Automation
Variational autoencoders are powerful algorithms for identifying dominant latent structure in a single dataset. In many applications, however, we are interested in modeling latent structure and variation that are enriched in a target dataset compared to some background (e.g., enriched in patients compared to the general population). Contrastive learning is a principled framework for capturing such enriched variation between the target and background, but state-of-the-art contrastive methods are limited to linear models. In this paper, we introduce the contrastive variational autoencoder (cVAE), which combines the benefits of contrastive learning with the power of deep generative models. The cVAE is designed to identify and enhance salient latent features. It is trained on two related but unpaired datasets, one of which has minimal contribution from the salient latent features. The cVAE explicitly models latent features that are shared between the datasets, as well as those that are enriched in one dataset relative to the other, which allows the algorithm to isolate and enhance the salient latent features. The algorithm is straightforward to implement, has a similar run-time to the standard VAE, and is robust to noise and dataset purity. We conduct experiments across diverse types of data, including gene expression and facial images, showing that the cVAE effectively uncovers latent structure that is salient in a particular analysis.
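Below is a compact PyTorch sketch of the contrastive VAE idea: a shared latent code z is inferred for both datasets, while the salient code s is zeroed out for background samples. The layer sizes and the exact treatment of the background salient variables are assumptions; the paper's cVAE may add further terms and differ in architectural details.

```python
import torch
import torch.nn as nn

class ContrastiveVAE(nn.Module):
    """Sketch of a contrastive VAE: shared latents z for both datasets,
    salient latents s used only for the target dataset (zeroed for background)."""
    def __init__(self, d_in, d_z=8, d_s=2):
        super().__init__()
        self.enc_z = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, 2 * d_z))
        self.enc_s = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, 2 * d_s))
        self.dec = nn.Sequential(nn.Linear(d_z + d_s, 64), nn.ReLU(), nn.Linear(64, d_in))

    @staticmethod
    def sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)                  # reparameterization
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(-1).mean()
        return z, kl

    def loss(self, x, is_target):
        z, kl_z = self.sample(self.enc_z(x))
        s, kl_s = self.sample(self.enc_s(x))
        if not is_target:
            s = torch.zeros_like(s)          # background has no salient variation
        recon = ((self.dec(torch.cat([z, s], -1)) - x) ** 2).sum(-1).mean()
        return recon + kl_z + (kl_s if is_target else 0.0)

model = ContrastiveVAE(d_in=20)
x_target, x_background = torch.randn(32, 20), torch.randn(32, 20)
loss = model.loss(x_target, is_target=True) + model.loss(x_background, is_target=False)
loss.backward()
print(float(loss))
```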

* Submitted to ICML 2019
Measuring similarities between unlabeled time series trajectories is an important problem in domains as diverse as medicine, astronomy, finance, and computer vision. It is often unclear which metric is appropriate because of the complex nature of noise in the trajectories (e.g., different sampling rates or outliers). Domain experts typically hand-craft or manually select a specific metric, such as dynamic time warping (DTW), to apply to their data. In this paper, we propose Autowarp, an end-to-end algorithm that optimizes and learns a good metric given unlabeled trajectories. We define a flexible and differentiable family of warping metrics that encompasses common metrics such as DTW, Euclidean distance, and edit distance. Autowarp then leverages the representation power of sequence autoencoders to optimize for a member of this warping distance family. The output is a metric which is easy to interpret and can be robustly learned from relatively few trajectories. In systematic experiments across different domains, we show that Autowarp often outperforms hand-crafted trajectory similarity metrics.
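To make the notion of a warping metric concrete, here is a plain numpy implementation of classic DTW, one member of the family that Autowarp searches over. Autowarp itself optimizes a differentiable parameterization that interpolates between DTW, Euclidean, and edit distances; this sketch is only the fixed, non-learned baseline.

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic-time-warping distance between two 1-D trajectories."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best alignment ending at (i, j): match, insertion, or deletion
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

t = np.linspace(0, 2 * np.pi, 50)
x = np.sin(t)
y = np.sin(t + 0.5)                      # same shape, shifted in time
print(dtw(x, y), np.linalg.norm(x - y))  # DTW is much smaller than Euclidean
```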

* Accepted at NIPS 2018
Adaptive stochastic gradient descent methods, such as AdaGrad, RMSProp, Adam, and AMSGrad, have been demonstrated to be efficacious in solving non-convex stochastic optimization problems, such as training deep neural networks. However, their convergence rates in the non-convex stochastic setting had remained unexplored until recent breakthrough results on AdaGrad, perturbed AdaGrad, and AMSGrad. In this paper, we propose two new adaptive stochastic gradient methods, called AdaHB and AdaNAG, which integrate a novel weighted coordinate-wise AdaGrad with heavy-ball momentum and Nesterov accelerated gradient momentum, respectively. The $\mathcal{O}(\frac{\log{T}}{\sqrt{T}})$ non-asymptotic convergence rates of AdaHB and AdaNAG in the non-convex stochastic setting are jointly established by leveraging a newly developed unified formulation of these two momentum mechanisms. Moreover, we compare AdaHB and AdaNAG with Adam and RMSProp, which, to a certain extent, explains why Adam and RMSProp can diverge. In particular, when the momentum term vanishes, we obtain the convergence rate of coordinate-wise AdaGrad in the non-convex stochastic setting as a byproduct.
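The numpy sketch below shows the generic shape of such an update: coordinate-wise AdaGrad step sizes combined with heavy-ball momentum. The paper's weighted accumulation of squared gradients is not reproduced here; the plain AdaGrad sum is used as a stand-in, and the hyperparameters are illustrative.

```python
import numpy as np

def adahb_sketch(grad_fn, x0, lr=0.1, beta=0.9, eps=1e-8, steps=200):
    """Coordinate-wise AdaGrad step sizes combined with heavy-ball momentum.

    A generic sketch of the AdaHB idea; the paper's weighted accumulation of
    squared gradients may differ from the plain sum used here.
    """
    x = x0.astype(float).copy()
    G = np.zeros_like(x)                 # accumulated squared gradients
    m = np.zeros_like(x)                 # heavy-ball momentum buffer
    for _ in range(steps):
        g = grad_fn(x)
        G += g ** 2                      # coordinate-wise AdaGrad accumulator
        m = beta * m - lr * g / (np.sqrt(G) + eps)
        x = x + m                        # heavy-ball: reuse the previous update
    return x

# Toy quadratic with minimum at (1, -2).
grad = lambda x: 2 * (x - np.array([1.0, -2.0]))
print(adahb_sketch(grad, np.zeros(2)))
```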

* We generalize AdaGrad to Weighted AdaGrad; discussions of Adam and RMSProp are provided in Section 4
Generative networks have made it possible to generate meaningful signals such as images and texts from simple noise. Recently, generative methods based on GAN and VAE were developed for graphs and graph signals. However, some of these methods are complex as well as difficult to train and fine-tune. This work proposes a graph generation model that uses a recent adaptation of Mallat's scattering transform to graphs. The proposed model is naturally composed of an encoder and a decoder. The encoder is a Gaussianized graph scattering transform. The decoder is a simple fully connected network that is adapted to specific tasks, such as link prediction, signal generation on graphs and full graph and signal generation. The training of our proposed system is efficient since it is only applied to the decoder and the hardware requirement is moderate. Numerical results demonstrate state-of-the-art performance of the proposed system for both link prediction and graph and signal generation. These results are in contrast to experience with Euclidean data, where it is difficult to form a generative scattering network that performs as well as state-of-the-art methods. We believe that this is because of the discrete and simpler nature of graph applications, unlike the more complex and high-frequency nature of Euclidean data, in particular, of some natural images.
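The sketch below captures the division of labor: a fixed, training-free encoder produces Gaussianized embeddings (replaced here by stand-in random codes), and only a small fully connected decoder is trained, after which generation amounts to decoding fresh Gaussian samples. The sizes and the toy adjacency targets are illustrative, not the paper's scattering pipeline.

```python
import torch
import torch.nn as nn

# Stand-in for fixed, Gaussianized graph-scattering embeddings: in the proposed
# model the encoder is a training-free scattering transform plus whitening, so
# only the decoder below has learnable parameters.
n_graphs, d_embed, n_nodes = 64, 16, 10
embeddings = torch.randn(n_graphs, d_embed)                          # whitened codes ~ N(0, I)
adjacency = (torch.rand(n_graphs, n_nodes * n_nodes) > 0.7).float()  # toy targets

decoder = nn.Sequential(nn.Linear(d_embed, 64), nn.ReLU(),
                        nn.Linear(64, n_nodes * n_nodes))
opt = torch.optim.Adam(decoder.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()
for step in range(200):                                   # train the decoder only
    opt.zero_grad()
    loss = bce(decoder(embeddings), adjacency)
    loss.backward()
    opt.step()

# Generation: because the codes are Gaussianized, new graphs come from N(0, I).
new_adj = (torch.sigmoid(decoder(torch.randn(1, d_embed))) > 0.5).float()
print(new_adj.view(n_nodes, n_nodes))
```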

* 14 pages, 5 figures, 3 tables
In Positional-Slotted Object-Applicative (PSOA) RuleML, a predicate application (atom) can have an Object IDentifier (OID) and descriptors that may be positional arguments (tuples) or attribute-value pairs (slots). PSOA RuleML 1.0 specifies for each descriptor whether it is to be interpreted under the perspective of the predicate in whose scope it occurs. This perspectivity dimension refines the space between oidless, positional atoms (relationships) and oidful, slotted atoms (frames): While relationships use only a predicate-scope-sensitive (predicate-dependent) tuple and frames use only predicate-scope-insensitive (predicate-independent) slots, PSOA RuleML 1.0 uses a systematics of orthogonal constructs also permitting atoms with (predicate-)independent tuples and atoms with (predicate-)dependent slots. This supports data and knowledge representation where a slot attribute can have different values depending on the predicate. PSOA thus extends object-oriented multi-membership and multiple inheritance. Based on objectification, PSOA laws are given: Besides unscoping and centralization, the semantic restriction and transformation of describution permits rescoping of one atom's independent descriptors to another atom with the same OID but a different predicate. For inheritance, default descriptors are realized by rules. On top of a metamodel and a Grailog visualization, PSOA's atom systematics for facts, queries, and rules is explained. The presentation and (XML-)serialization syntaxes of PSOA RuleML 1.0 are introduced. Its model-theoretic semantics is formalized by extending the interpretation functions for dependent descriptors. The open PSOATransRun system since Version 1.3 realizes PSOA RuleML 1.0 by a translator to runtime predicates, including for dependent tuples (prdtupterm) and slots (prdsloterm). Our tests show efficiency advantages of dependent and tupled modeling.

* 39 pages, 5 figures, 2 tables; updates for PSOATransRun 1.3.1
Generative Adversarial Networks (GANs) represent an attractive and novel approach to generating realistic data, such as genes, proteins, or drugs, in synthetic biology. Here, we apply GANs to generate synthetic DNA sequences encoding proteins of variable length. We propose a novel feedback-loop architecture, called Feedback GAN (FBGAN), to optimize the synthetic gene sequences for desired properties using an external function analyzer. The proposed architecture also has the advantage that the analyzer need not be differentiable. We apply the feedback-loop mechanism to two examples: 1) generating synthetic genes coding for antimicrobial peptides, and 2) optimizing synthetic genes for the secondary structure of their resulting peptides. A suite of metrics demonstrates that the GAN-generated proteins have desirable biophysical properties. The FBGAN architecture can also be used to optimize GAN-generated datapoints for useful properties in domains beyond genomics.
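A toy sketch of the feedback loop is given below: generated sequences that the analyzer scores highly replace the oldest entries of the "real" data pool that the discriminator sees. In FBGAN the generator is also retrained against the updated pool each epoch; here a static random generator and a GC-content analyzer stand in for the learned components.

```python
import random

def feedback_update(real_pool, generator, analyzer, n_sample=64, threshold=0.6):
    """One FBGAN-style feedback step (sketch): generated sequences that the
    analyzer scores above a threshold replace the oldest 'real' training data,
    gradually shifting the GAN's target distribution toward desired properties.
    """
    candidates = [generator() for _ in range(n_sample)]
    good = [s for s in candidates if analyzer(s) > threshold]
    real_pool[:len(good)] = good            # overwrite the oldest entries
    return real_pool

# Toy instantiation: 'sequences' are strings over {A, C, G, T}; the analyzer
# (a stand-in for the external, possibly non-differentiable function analyzer)
# scores GC content.
generator = lambda: "".join(random.choice("ACGT") for _ in range(12))
analyzer = lambda s: (s.count("G") + s.count("C")) / len(s)
pool = [generator() for _ in range(100)]
for _ in range(10):
    pool = feedback_update(pool, generator, analyzer)
# The pool's GC fraction ends above the 0.5 random baseline.
print(sum(analyzer(s) for s in pool) / len(pool))
```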

We consider the problem of inference in a linear regression model in which the relative ordering of the input features and output labels is not known. Such datasets naturally arise from experiments in which the samples are shuffled or permuted during the protocol. In this work, we propose a framework that treats the unknown permutation as a latent variable. We maximize the likelihood of observations using a stochastic expectation-maximization (EM) approach. We compare this to the dominant approach in the literature, which corresponds to hard EM in our framework. We show on synthetic data that the stochastic EM algorithm we develop has several advantages, including lower parameter error, less sensitivity to the choice of initialization, and significantly better performance on datasets that are only partially shuffled. We conclude by performing two experiments on real datasets that have been partially shuffled, in which we show that the stochastic EM algorithm can recover the weights with modest error.
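The sketch below implements a crude version of this scheme: the E-step samples permutations with Metropolis-style transposition moves (using a unit temperature for simplicity), and the M-step refits least squares against the permuted labels. The paper's sampler and schedule may differ.

```python
import numpy as np

def stochastic_em_shuffled_regression(X, y, n_iters=200, mcmc_steps=100):
    """Stochastic EM sketch for regression with an unknown label permutation.

    E-step: sample a permutation via Metropolis moves on random pairs.
    M-step: least squares against the currently permuted labels.
    """
    n = len(y)
    perm = np.random.permutation(n)
    w = np.linalg.lstsq(X, y[perm], rcond=None)[0]
    for _ in range(n_iters):
        pred = X @ w
        for _ in range(mcmc_steps):          # E-step: random transpositions
            i, j = np.random.randint(n, size=2)
            old = (y[perm[i]] - pred[i]) ** 2 + (y[perm[j]] - pred[j]) ** 2
            new = (y[perm[j]] - pred[i]) ** 2 + (y[perm[i]] - pred[j]) ** 2
            if new < old or np.random.rand() < np.exp(old - new):
                perm[i], perm[j] = perm[j], perm[i]
        w = np.linalg.lstsq(X, y[perm], rcond=None)[0]  # M-step
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3)); w_true = np.array([2.0, -1.0, 0.5])
y = rng.permutation(X @ w_true + 0.05 * rng.standard_normal(100))  # shuffled labels
print(stochastic_em_shuffled_regression(X, y))  # should move toward w_true (not guaranteed)
```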

* 11 pages, 5 figures
We generalize the scattering transform to graphs and consequently construct a convolutional neural network on graphs. We show that under certain conditions, any feature generated by such a network is approximately invariant to permutations and stable to graph manipulations. Numerical results demonstrate competitive performance on relevant datasets.
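A minimal diffusion-style graph scattering sketch is given below, using wavelets built from differences of dyadic powers of a lazy random-walk operator and averaging to obtain permutation-invariant features. The wavelet construction in the paper may differ; this only conveys the cascade-of-moduli structure.

```python
import numpy as np

def graph_scattering(A, x, n_scales=3, n_layers=2):
    """Minimal diffusion-style graph scattering sketch.

    Wavelets are differences of dyadic powers of a lazy diffusion operator T;
    features are averages of cascaded |wavelet filterings| of the signal x.
    """
    d = A.sum(axis=1)
    T = 0.5 * (np.eye(len(d)) + A / d[:, None])      # lazy random-walk operator
    powers = [np.linalg.matrix_power(T, 2 ** j) for j in range(n_scales + 1)]
    wavelets = [powers[j] - powers[j + 1] for j in range(n_scales)]

    features, layer = [x.mean()], [x]
    for _ in range(n_layers):
        layer = [np.abs(W @ u) for u in layer for W in wavelets]
        features += [u.mean() for u in layer]        # invariant (averaged) features
    return np.array(features)

A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
x = np.array([1.0, -1.0, 2.0, 0.0])
print(graph_scattering(A, x).shape)   # (13,): 1 + 3 + 9 features
```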

* 22 pages, 8 figures, 3 tables
Experience replay is a key technique behind many recent advances in deep reinforcement learning. Allowing the agent to learn from earlier memories can speed up learning and break undesirable temporal correlations. Despite its widespread application, very little is understood about the properties of experience replay. How does the amount of memory kept affect learning dynamics? Does it help to prioritize certain experiences? In this paper, we address these questions by formulating a dynamical-systems ODE model of Q-learning with experience replay. We derive analytic solutions of the ODE for a simple setting and show that, even in this very simple setting, the amount of memory kept can substantially affect the agent's performance: too much or too little memory both slow down learning. Moreover, we characterize regimes in which prioritized replay harms the agent's learning. We show that our analytic solutions are in excellent agreement with experiments. Finally, we propose a simple algorithm for adaptively changing the memory buffer size, which achieves consistently good empirical performance.
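As a toy probe of the buffer-size question, the sketch below runs tabular Q-learning with a FIFO replay buffer on a small chain MDP and reports a learned value for several buffer sizes. It is not the ODE model from the paper, just an empirical setup in the same spirit; all constants are illustrative.

```python
import random
from collections import deque

def q_learning_chain(buffer_size, n_states=8, episodes=300, gamma=0.9, alpha=0.1):
    """Tabular Q-learning with a FIFO experience replay buffer on a chain MDP
    (move right to reach a terminal reward)."""
    Q = [[0.0, 0.0] for _ in range(n_states)]        # actions: 0 = left, 1 = right
    buf = deque(maxlen=buffer_size)                  # FIFO memory of transitions
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            if random.random() < 0.2:                # epsilon-greedy exploration
                a = random.randrange(2)
            else:
                a = max((0, 1), key=lambda act: Q[s][act])
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            buf.append((s, a, r, s2))
            for s_, a_, r_, s2_ in random.sample(buf, min(8, len(buf))):  # replay
                target = r_ + gamma * max(Q[s2_]) * (s2_ != n_states - 1)
                Q[s_][a_] += alpha * (target - Q[s_][a_])
            s = s2
    return Q[0][1]                                   # learned value of moving right

for size in (10, 1000, 100_000):
    random.seed(0)
    print(size, round(q_learning_chain(size), 3))    # compare buffer sizes
```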

Predicting the subcellular localization of proteins is an important and challenging problem. Traditional experimental approaches are often expensive and time-consuming. Consequently, a growing number of research efforts employ machine learning approaches to predict the subcellular location of proteins. There are two main challenges among the state-of-the-art prediction methods. First, most of the existing techniques are designed to deal with multi-class rather than multi-label classification, which ignores connections between multiple labels. In reality, the fact that particular proteins have multiple locations carries vital and unique biological significance that deserves special focus and cannot be ignored. Second, techniques for handling imbalanced data in multi-label classification problems are necessary, but have never been employed. To address these two issues, we have developed an ensemble multi-label classifier called HPSLPred, which can be applied to multi-label classification with an imbalanced protein source. For convenience, a user-friendly webserver has been established at http://server.malab.cn/HPSLPred.
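The sketch below sets up the multi-label, imbalance-aware problem with scikit-learn: each protein can carry several location labels, and each binary subproblem is reweighted against class imbalance. This is a baseline illustration, not the HPSLPred ensemble; the features, labels, and base learner are stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multi-label setup: each protein (row of X, e.g. sequence-derived features)
# may carry several location labels at once, and some labels are rare.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))
Y = np.zeros((300, 3), dtype=int)              # 3 subcellular locations
Y[:, 0] = X[:, 0] > 0                          # common label
Y[:, 1] = X[:, 1] > 1.5                        # rare label (imbalanced)
Y[:, 2] = (X[:, 0] > 0) & (X[:, 2] > 0)        # correlated with label 0

# class_weight='balanced' reweights each binary subproblem to counter imbalance.
clf = OneVsRestClassifier(LogisticRegression(class_weight="balanced"))
clf.fit(X, Y)
print(clf.predict(X[:5]))                      # one 0/1 label vector per protein
```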
