Research papers and code for "Zhou Zhou":
Recent years have seen a surge of interest in Probabilistic Logic Programming (PLP) and Statistical Relational Learning (SRL) models that combine logic with probabilities. Structure learning of these systems lies at the intersection of Inductive Logic Programming (ILP) and statistical learning (SL). However, ILP cannot deal with probabilities, and SL cannot model relational hypotheses. The biggest challenge in integrating these two machine learning frameworks is how to estimate the probability of a logic clause from observations of ground logic atoms alone. Many current methods model a joint probability by representing a clause as a graphical model with its literals as vertices. Such models are still too complex and can only be approximated by the pseudo-likelihood. We propose the Inductive Logic Boosting framework, which transforms the relational dataset into a feature-based dataset, induces logic rules by boosting ProbLog Rule Trees, and relaxes the independence constraint of the pseudo-likelihood. Experimental evaluation on benchmark datasets demonstrates that the AUC-PR and AUC-ROC values of the learned rules are higher than those of current state-of-the-art SRL methods.
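
As a rough illustration of the relational-to-feature transformation followed by boosting, here is a toy Python sketch; the relations, the hand-picked first-order features, and the use of scikit-learn's gradient-boosted trees are stand-ins for exposition only, not the paper's ILB procedure or ProbLog Rule Trees.

    # Toy propositionalization: turn ground relational facts into a feature
    # table, then boost decision trees over it (illustrative sketch only).
    from sklearn.ensemble import GradientBoostingClassifier

    # ground atoms of two background relations and a target relation smokes/1
    friends = {("anna", "bob"), ("bob", "carl"), ("carl", "anna")}
    drinks  = {"anna", "carl"}
    smokes  = {"anna": 1, "bob": 0, "carl": 1}          # target labels

    people = sorted(smokes)

    def features(p):
        # each column is the truth value of a simple first-order feature of p
        return [
            int(p in drinks),                                    # drinks(p)
            int(any(p == a for a, b in friends)),                # exists friends(p, _)
            int(any(b in drinks for a, b in friends if a == p)), # friend of a drinker
        ]

    X = [features(p) for p in people]
    y = [smokes[p] for p in people]

    model = GradientBoostingClassifier(n_estimators=10, max_depth=2)
    model.fit(X, y)
    print(dict(zip(people, model.predict(X))))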

* 19 pages, 2 figures
The optimal disturbance rejection control problem is considered for consensus tracking systems affected by external persistent disturbances and noise. Optimal estimates of the system states are obtained by recursive Kalman filtering for multiple autonomous underwater vehicles modeled as a multi-agent system. The feedforward-feedback optimal control law is then derived by solving Riccati equations and matrix equations. A condition for the existence and uniqueness of the feedforward-feedback optimal control law is given, and an algorithm for computing the control law is presented. Finally, simulations show that the approach is effective in the presence of external persistent disturbances and noise.
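
A minimal sketch of the two numerical ingredients mentioned above, with made-up system matrices rather than the paper's AUV model: one Kalman predict/update step and a discrete-time LQR gain obtained from the algebraic Riccati equation via SciPy.

    # Illustrative double-integrator example (not the paper's model).
    import numpy as np
    from scipy.linalg import solve_discrete_are

    dt = 0.1
    A = np.array([[1.0, dt], [0.0, 1.0]])      # state transition
    B = np.array([[0.5 * dt**2], [dt]])        # control input
    C = np.array([[1.0, 0.0]])                 # only position is measured
    Q_proc, R_meas = 1e-3 * np.eye(2), np.array([[1e-2]])

    # LQR feedback gain K from the discrete-time algebraic Riccati equation
    Q_lqr, R_lqr = np.eye(2), np.array([[1.0]])
    P = solve_discrete_are(A, B, Q_lqr, R_lqr)
    K = np.linalg.solve(R_lqr + B.T @ P @ B, B.T @ P @ A)

    def kalman_step(x, Sigma, u, y):
        x_pred = A @ x + B @ u                             # predict
        S_pred = A @ Sigma @ A.T + Q_proc
        G = S_pred @ C.T @ np.linalg.inv(C @ S_pred @ C.T + R_meas)  # gain
        x_new = x_pred + G @ (y - C @ x_pred)              # update with measurement
        return x_new, (np.eye(2) - G @ C) @ S_pred

    x_hat, Sigma = np.zeros((2, 1)), np.eye(2)
    u = -K @ x_hat                                         # feedback on the estimate
    x_hat, Sigma = kalman_step(x_hat, Sigma, u, np.array([[0.3]]))
    print("gain K:", K, "\nupdated estimate:", x_hat.ravel())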

Perception and reasoning are basic human abilities that are seamlessly connected as part of human intelligence. However, in current machine learning systems, the perception and reasoning modules are incompatible. Tasks requiring joint perception and reasoning ability are difficult to accomplish autonomously and still demand human intervention. Inspired by the way language experts decoded Mayan scripts by joining two abilities in an abductive manner, this paper proposes the abductive learning framework. The framework learns perception and reasoning simultaneously with the help of a trial-and-error abductive process. We present the Neural-Logical Machine as an implementation of this novel learning framework. We demonstrate that--using human-like abductive learning--the machine learns from a small set of simple hand-written equations and then generalizes well to complex equations, a feat that is beyond the capability of state-of-the-art neural network models. The abductive learning framework explores a new direction for approaching human-level learning ability.
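
A toy sketch of the trial-and-error abductive loop, with trivial stand-ins (a logistic-regression "perception" module and an XOR consistency check as the "reasoning" module); it only illustrates the loop itself, not the paper's Neural-Logical Machine.

    # Toy abductive learning loop: perceive symbols, check logical consistency,
    # revise the least confident label when an equation is inconsistent, retrain.
    import numpy as np
    from itertools import product
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def make_symbol(bit):                       # a noisy 2-D "image" of bit 0/1
        return np.array([bit, 1 - bit]) + 0.3 * rng.normal(size=2)

    # equations are triples (a, b, c) obeying the hidden rule a XOR b == c
    eqs = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2) for _ in range(20)]
    X = np.array([make_symbol(s) for eq in eqs for s in eq])

    # start from weak pseudo-labels produced by a deliberately noisy heuristic
    labels = (X[:, 0] + 0.8 * rng.normal(size=len(X)) > 0.5).astype(int)

    clf = LogisticRegression()
    for _ in range(5):                          # trial-and-error iterations
        clf.fit(X, labels)
        pred = clf.predict(X).reshape(-1, 3)    # group predictions into equations
        for i, (a, b, c) in enumerate(pred):
            if a ^ b != c:                      # reasoning detects inconsistency
                # abduce a minimal revision: flip the least confident symbol
                probs = clf.predict_proba(X[3 * i:3 * i + 3]).max(axis=1)
                pred[i, int(np.argmin(probs))] ^= 1
        labels = pred.reshape(-1)               # revised labels feed the next round

    truth = np.array([s for eq in eqs for s in eq])
    print("label accuracy vs. true bits:", (labels == truth).mean())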

* Corrected typos
We study the problem of selecting $K$ arms with the highest expected rewards in a stochastic $n$-armed bandit game. This problem has a wide range of applications, e.g., A/B testing, crowdsourcing, and simulation optimization. Our goal is to develop a PAC algorithm, which, with probability at least $1-\delta$, identifies a set of $K$ arms with aggregate regret at most $\epsilon$. The notion of aggregate regret for multiple-arm identification was first introduced in \cite{Zhou:14}, and is defined as the difference of the averaged expected rewards between the selected set of arms and the best $K$ arms. In contrast to \cite{Zhou:14}, which only provides an instance-independent sample complexity, we introduce a new hardness parameter for characterizing the difficulty of any given instance. We further develop two algorithms and establish the corresponding sample complexity in terms of this hardness parameter. The derived sample complexity can be significantly smaller than state-of-the-art results for a large class of instances and matches the instance-independent lower bound up to a $\log(\epsilon^{-1})$ factor in the worst case. We also prove a lower bound showing that the extra $\log(\epsilon^{-1})$ factor is necessary for instance-dependent algorithms using the introduced hardness parameter.
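
For reference, the aggregate regret of a selected set $S$ of $K$ arms, as described above, can be written as (notation illustrative)
\[
R(S) \;=\; \frac{1}{K}\Big(\sum_{i=1}^{K}\mu_{(i)} \;-\; \sum_{i\in S}\mu_i\Big),
\qquad \mu_{(1)} \ge \mu_{(2)} \ge \cdots \ge \mu_{(n)},
\]
where $\mu_{(i)}$ are the expected rewards sorted in decreasing order; the PAC requirement is then $\Pr[R(S) \le \epsilon] \ge 1-\delta$.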

* 30 pages, 5 figures, preliminary version to appear in ICML 2017
We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability of recognizing noise and providing a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points, attaining orders-of-magnitude computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method's relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets. An R package, SPClustering, for the SPC method with iterative subsampling is available at http://www.stat.ucla.edu/~zhou/Software.html.
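
An illustrative sketch of the iterate-between-subsampling-and-assignment idea, with k-means standing in for the SPC solution-path step (which is not reproduced here); the data and parameters are made up.

    # Cluster a small random subsample, then assign every point to the nearest
    # center; repeat for a few rounds (toy stand-in for the SPC step).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    data = np.vstack([rng.normal(loc=c, scale=0.3, size=(2000, 2))
                      for c in ([0, 0], [4, 0], [0, 4])])       # 6000 points

    def subsample_cluster_assign(X, n_sub=300, k=3, rounds=3):
        labels = np.zeros(len(X), dtype=int)
        for _ in range(rounds):
            idx = rng.choice(len(X), size=n_sub, replace=False)  # small subsample
            km = KMeans(n_clusters=k, n_init=10).fit(X[idx])     # cluster it
            # assign every point in the full data to its nearest center
            d = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None], axis=2)
            labels = d.argmin(axis=1)
        return labels, km.cluster_centers_

    labels, centers = subsample_cluster_assign(data)
    print("cluster sizes:", np.bincount(labels), "\ncenters:\n", centers)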

* Statistics and Its Interface, 9: 415-431 (2016)
* 17 pages, 7 figures
Given a food image, can a fine-grained object recognition engine tell "which restaurant, which dish" the food belongs to? Such ultra-fine-grained image recognition is the key to many applications such as search by image, but it is very challenging because it needs to discern subtle differences between classes while dealing with the scarcity of training data. Fortunately, the ultra-fine granularity naturally brings rich relationships among object classes. This paper proposes a novel approach to exploit these rich relationships through bipartite-graph labels (BGL). We show how to model BGL within an overall convolutional neural network so that the resulting system can be optimized through back-propagation. We also show that inference is computationally efficient thanks to the bipartite structure. To facilitate the study, we construct a new food benchmark dataset consisting of 37,885 food images collected from 6 restaurants and 975 menus in total. Experimental results on this new food dataset and three other datasets demonstrate that BGL advances previous work in fine-grained object recognition. An online demo is available at http://www.f-zhou.com/fg_demo/.

Object detection has been a significant topic in computer vision. With the continuous development of Deep Learning, many advanced academic and industrial outcomes are built on localising and classifying the target objects, such as instance segmentation, video tracking and robotic vision. As the core of Deep Learning, Deep Neural Networks (DNNs) and their training are highly integrated with task-driven modelling and have a great effect on detection accuracy. The main focus of improving detection performance has been proposing DNNs with extra layers and novel topological connections to extract the desired features from input data. However, training these models can be computationally expensive and laborious because of the complicated model architectures and enormous numbers of parameters. The dataset is another cause of these training difficulties and of low detection accuracy, owing to insufficient data samples or difficult instances. To address these training difficulties, this thesis presents two different approaches to improving detection performance in a relatively lightweight way. Reflecting the data-driven nature of deep learning, the first approach is "slot-based image augmentation", which enriches the dataset with extra foreground and background combinations. Compared with the commonly used image flipping method, the proposed system achieves a similar mAP improvement with fewer extra images, which decreases training time. The proposed augmentation system offers extra flexibility for adapting to various scenarios, and the performance-driven analysis provides an alternative perspective on conducting image augmentation.

* preprint draft
In the classical Hawkes process, the baseline intensity and triggering kernel are assumed to be a constant and a parametric function respectively, which limits the model's flexibility. To generalize it, we present a fully Bayesian nonparametric model, namely the Gaussian-process-modulated Hawkes process, and propose an EM-variational inference scheme. In this model, a transformation of a Gaussian process is used as the prior on the baseline intensity and the triggering kernel. By introducing a latent branching structure, the inference of the baseline intensity and triggering kernel is decoupled, and the variational inference scheme is embedded into an EM framework naturally. We also provide a series of schemes to accelerate the inference. Results on synthetic and real-data experiments show that the underlying baseline intensity and triggering kernel can be recovered without parametric restrictions, and that our Bayesian nonparametric estimation is superior to other state-of-the-art methods.
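
For orientation, the conditional intensity of the classical Hawkes process that this model generalizes is
\[
\lambda(t) \;=\; \mu(t) \;+\; \sum_{t_i < t} \phi(t - t_i),
\]
where $\mu$ is the baseline intensity and $\phi$ the triggering kernel; in the classical setting $\mu$ is a constant and $\phi$ a parametric function, whereas here both are given transformed Gaussian process priors.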

Graph embedding techniques are studied with interest on public datasets, such as BlogCatalog, with the common practice of maximizing scores on graph reconstruction, link prediction metrics, etc. However, in the financial sector the important metrics are often more business related, for example fraud detection rates. From our privileged position of having a large amount of real-world, non-public P2P-lending social data, we aim to study empirically whether recent advances in graph embedding techniques provide a useful signal for metrics more closely tied to business interests, such as the fraud detection rate.

In this paper, we apply genetic algorithms (GA) to the Electrical Impedance Tomography (EIT) problem. We first cast the EIT problem as an optimization problem and define a target objective function. We then show how the GA, as an alternative search algorithm, can be used to solve the EIT inverse problem. In particular, we explore evolutionary methods such as GAs combined with various regularization operators to solve the EIT inverse computing problem. Keywords: Electrical Impedance Tomography (EIT), GA, Tikhonov operator, Mumford-Shah operator, Particle Swarm Optimization (PSO), Back Propagation (BP).
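
An illustrative sketch of a GA minimizing a Tikhonov-regularized misfit, with a random linear map standing in for the (nonlinear) EIT forward model and the identity as the regularization operator; population size, mutation scale and so on are arbitrary choices, not the paper's settings.

    # Toy GA for a regularized inverse problem: minimize ||F(s) - v||^2 + lam*||s||^2.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m, lam = 16, 40, 1e-2
    F = rng.normal(size=(m, n))                    # toy linear forward operator
    sigma_true = rng.uniform(0.5, 2.0, size=n)     # "true" conductivities
    v = F @ sigma_true                             # noiseless boundary data

    def fitness(s):                                # higher is better
        return -(np.sum((F @ s - v) ** 2) + lam * np.sum(s ** 2))

    pop = rng.uniform(0.5, 2.0, size=(60, n))      # initial population
    for gen in range(200):
        scores = np.array([fitness(s) for s in pop])
        parents = pop[np.argsort(scores)[::-1][:20]]   # selection: keep the fittest
        children = []
        while len(children) < len(pop) - len(parents):
            a, b = parents[rng.integers(20, size=2)]
            cut = rng.integers(1, n)                   # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child += rng.normal(scale=0.02, size=n)    # mutation
            children.append(child)
        pop = np.vstack([parents, children])

    best = pop[np.argmax([fitness(s) for s in pop])]
    print("relative error:",
          np.linalg.norm(best - sigma_true) / np.linalg.norm(sigma_true))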

* Full paper was accepted into the proceedings of the 4th International Conference on Innovation in Computing System & Engineering Technology (ICICSET 2018) in Zurich, Switzerland, August 14-15, 2018
Variance reduction is a simple and effective technique that accelerates convex (or non-convex) stochastic optimization. Among existing variance reduction methods, SVRG and SAGA adopt unbiased gradient estimators and have been the most popular variance reduction methods in recent years. Although various accelerated variants of SVRG (e.g., Katyusha and Acc-Prox-SVRG) have been proposed, a direct acceleration of SAGA has remained unknown. In this paper, we propose a directly accelerated variant of SAGA using a novel Sampled Negative Momentum (SSNM), which achieves the best known oracle complexity for strongly convex problems. Consequently, our work fills the void of a directly accelerated SAGA.
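
For context, a minimal sketch of the standard (unaccelerated) SAGA update on a toy least-squares problem; the SSNM acceleration itself is not reproduced here, and the step size is just a safe heuristic.

    # Vanilla SAGA: unbiased gradient estimate from a stored gradient table.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 10
    A = rng.normal(size=(n, d))
    b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    def grad_i(x, i):                        # gradient of the i-th component
        return (A[i] @ x - b[i]) * A[i]

    x = np.zeros(d)
    table = np.array([grad_i(x, i) for i in range(n)])   # stored gradients
    table_avg = table.mean(axis=0)
    eta = 0.1 / np.max(np.sum(A ** 2, axis=1))           # a conservative step size

    for _ in range(20 * n):
        i = rng.integers(n)
        g_new = grad_i(x, i)
        x -= eta * (g_new - table[i] + table_avg)         # SAGA gradient estimate
        table_avg += (g_new - table[i]) / n               # update running average
        table[i] = g_new

    print("objective:", 0.5 * np.mean((A @ x - b) ** 2))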

* 16 pages, 6 figures
Combining Bayesian nonparametrics and a forward model selection strategy, we construct parsimonious Bayesian deep networks (PBDNs) that infer capacity-regularized network architectures from the data and require neither cross-validation nor fine-tuning when training the model. One of the two essential components of a PBDN is a special infinitely wide single-hidden-layer neural network whose number of active hidden units can be inferred from the data. The other is a greedy layer-wise learning algorithm that uses a forward model selection criterion to determine when to stop adding another hidden layer. We develop both Gibbs sampling and stochastic-gradient-descent-based maximum a posteriori inference for PBDNs, providing state-of-the-art classification accuracy and interpretable data subtypes near the decision boundaries, while maintaining low computational complexity for out-of-sample prediction.

* NIPS 2018
Let $A:[0,1]\rightarrow\mathbb{H}_m$ (the space of Hermitian matrices) be a matrix-valued function which is low rank with entries in the H\"{o}lder class $\Sigma(\beta,L)$. The goal of this paper is to study statistical estimation of $A$ based on the regression model $\mathbb{E}(Y_j|\tau_j,X_j) = \langle A(\tau_j), X_j \rangle,$ where the $\tau_j$ are i.i.d. uniformly distributed in $[0,1]$, the $X_j$ are i.i.d. matrix completion sampling matrices, and the $Y_j$ are independent bounded responses. We propose an innovative nuclear norm penalized local polynomial estimator and establish an upper bound on its point-wise risk measured in Frobenius norm. We then extend this estimator globally and prove an upper bound on its integrated risk measured in $L_2$-norm. We also propose another new estimator based on bias-reducing kernels to study the case when $A$ is not necessarily low rank, and establish an upper bound on its risk measured in $L_{\infty}$-norm. We show that the obtained rates are all optimal, up to some logarithmic factor, in the minimax sense. Finally, we propose an adaptive estimation procedure based on Lepski's method and the penalized data splitting technique, which is computationally efficient and can be easily implemented and parallelized.
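
As one concrete, purely illustrative instance (the locally constant, degree-zero case, not necessarily the paper's exact formulation), a nuclear norm penalized local estimator at a point $t \in [0,1]$ takes the form
\[
\hat{A}(t) \;\in\; \arg\min_{S \in \mathbb{H}_m} \; \sum_{j=1}^{n} K_h(\tau_j - t)\,\big(Y_j - \langle S, X_j\rangle\big)^2 \;+\; \lambda \|S\|_1,
\]
where $K_h$ is a kernel with bandwidth $h$ and $\|\cdot\|_1$ denotes the nuclear norm; the paper's estimator uses a local polynomial of general degree rather than this locally constant fit.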

Clickbait detection in tweets remains an elusive challenge. In this paper, we describe the solution of the Zingel Clickbait Detector for the Clickbait Challenge 2017, which is capable of evaluating each tweet's level of clickbaiting. We first recast the regression problem as a multi-class classification problem, based on the annotation scheme. To perform the classification, we apply a token-level, self-attentive mechanism on the hidden states of bi-directional Gated Recurrent Units (biGRU), which enables the model to generate tweets' task-specific vector representations by attending to important tokens. The self-attentive neural network can be trained end-to-end, without any manual feature engineering. Our detector ranked first in the final evaluation of the Clickbait Challenge 2017.
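
A minimal PyTorch sketch of token-level self-attention over biGRU hidden states; the layer sizes, the single-vector attention scorer and the number of classes are illustrative guesses, not the exact Zingel architecture.

    # Toy self-attentive biGRU classifier over token sequences.
    import torch
    import torch.nn as nn

    class SelfAttentiveBiGRU(nn.Module):
        def __init__(self, vocab, emb=100, hid=64, n_classes=4):
            super().__init__()
            self.emb = nn.Embedding(vocab, emb)
            self.gru = nn.GRU(emb, hid, bidirectional=True, batch_first=True)
            self.att = nn.Linear(2 * hid, 1)           # token-level attention scores
            self.out = nn.Linear(2 * hid, n_classes)

        def forward(self, tokens):                      # tokens: (batch, seq_len)
            h, _ = self.gru(self.emb(tokens))           # (batch, seq_len, 2*hid)
            a = torch.softmax(self.att(torch.tanh(h)), dim=1)  # attention weights
            rep = (a * h).sum(dim=1)                    # attended tweet representation
            return self.out(rep)                        # class logits

    model = SelfAttentiveBiGRU(vocab=5000)
    logits = model(torch.randint(0, 5000, (8, 20)))     # 8 tweets of 20 tokens
    print(logits.shape)                                 # torch.Size([8, 4])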

A common approach to analyze a covariate-sample count matrix, an element of which represents how many times a covariate appears in a sample, is to factorize it under the Poisson likelihood. We show its limitation in capturing the tendency for a covariate present in a sample to both repeat itself and excite related ones. To address this limitation, we construct negative binomial factor analysis (NBFA) to factorize the matrix under the negative binomial likelihood, and relate it to a Dirichlet-multinomial distribution based mixed-membership model. To support countably infinite factors, we propose the hierarchical gamma-negative binomial process. By exploiting newly proved connections between discrete distributions, we construct two blocked Gibbs samplers and a collapsed Gibbs sampler, all of which adaptively truncate their number of factors, and demonstrate that the blocked Gibbs sampler developed under a compound Poisson representation converges fast and has low computational complexity. Example results show that NBFA has a distinct mechanism for adjusting its number of inferred factors according to the sample lengths, and provides clear advantages in parsimonious representation, predictive power, and computational complexity over previously proposed discrete latent variable models, which either completely ignore burstiness, or model only the burstiness of the covariates but not that of the factors.
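
Schematically (notation illustrative, not necessarily the paper's exact parameterization), the contrast with Poisson factorization is in the likelihood placed on each count:
\[
n_{vj} \sim \mathrm{Poisson}\Big(\sum_k \phi_{vk}\theta_{kj}\Big)
\qquad \text{versus} \qquad
n_{vj} \sim \mathrm{NB}\Big(\sum_k \phi_{vk}\theta_{kj},\, p_j\Big),
\]
where the negative binomial's variance exceeds its mean, which is what allows the factorization to capture the tendency of a covariate present in a sample to repeat itself.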

* To appear in Bayesian Analysis
I propose that the purpose our concept of actual causation serves is to minimize various costs in intervention practice. Actual causation has three features: non-redundant sufficiency, continuity and abnormality; these features correspond to the minimization of exploitative cost, exploratory cost and risk cost in intervention practice. Incorporating these three features, a definition of actual causation is given. I test the definition on 66 causal cases from the actual causation literature and show that its verdicts fit intuition better than those of some other causal-modelling-based definitions.

* 37 pages, 2 Appendixes
First-Order Logic (FOL) is widely regarded as one of the most important foundations for knowledge representation. Nevertheless, in this paper, we argue that FOL has several critical issues for this purpose. Instead, we propose an alternative called assertional logic, in which all syntactic objects are categorized as set theoretic constructs including individuals, concepts and operators, and all kinds of knowledge are formalized by equality assertions. We first present a primitive form of assertional logic that uses minimal assumed knowledge and constructs. Then, we show how to extend it by definitions, which are special kinds of knowledge, i.e., assertions. We argue that assertional logic, although simpler, is more expressive and extensible than FOL. As a case study, we show how assertional logic can be used to unify logic and probability, and more building blocks in AI.

* arXiv admin note: text overlap with arXiv:1603.03511
In this extended abstract, we propose Structured Production Systems (SPS), which extend traditional production systems with well-formed syntactic structures. Owing to the richness of these structures, structured production systems significantly enhance the expressive power as well as the flexibility of production systems, for instance to handle uncertainty. We show that different rule application strategies can be reduced to the basic one by utilizing structures. Moreover, many fundamental approaches in computer science, including automata, grammars and logic, can be captured by structured production systems.

To construct flexible nonlinear predictive distributions, the paper introduces a family of softplus-function-based regression models that convolve, stack, or combine both operations by convolving countably infinite stacked gamma distributions, whose scales depend on the covariates. Generalizing logistic regression, which uses a single hyperplane to partition the covariate space into two halves, softplus regressions employ multiple hyperplanes to construct a confined space, related to a single convex polytope defined by the intersection of multiple half-spaces or to a union of multiple convex polytopes, to separate one class from the other. The gamma process is introduced to support the convolution of countably infinite (stacked) covariate-dependent gamma distributions. For Bayesian inference, Gibbs sampling derived via novel data augmentation and marginalization techniques is used to deconvolve and/or demix the highly complex nonlinear predictive distribution. Example results demonstrate that softplus regressions provide flexible nonlinear decision boundaries, achieving classification accuracies comparable to those of kernel support vector machines while requiring significantly less computation for out-of-sample prediction.
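
As a simple picture of the multiple-hyperplane idea (illustrative only, not the paper's exact construction), recall the softplus function and the convex polytope obtained by intersecting $K$ half-spaces:
\[
\zeta(x) = \log\!\big(1 + e^{x}\big),
\qquad
\mathcal{P} \;=\; \bigcap_{k=1}^{K} \{x : w_k^{\top}x + b_k \ge 0\};
\]
a point lies deep inside $\mathcal{P}$ exactly when every margin $w_k^{\top}x + b_k$ is large, and replacing each hard constraint with a softplus term yields a smooth score of how well $x$ satisfies all $K$ constraints simultaneously.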

* 33 pages + 12 page appendix, 15 figures, 10 tables
Predicting ambulance demand accurately at a fine resolution in time and space (e.g., every hour and 1 km$^2$) is critical for staff and fleet management and dynamic deployment. There are several challenges: though the dataset is typically large-scale, demand per time period and locality is almost always zero; moreover, the demand arises from complex urban geography and exhibits complex spatio-temporal patterns, both of which need to be captured and exploited. To address these challenges, we propose three methods based on Gaussian mixture models, kernel density estimation, and kernel warping. These methods provide spatio-temporal predictions for Toronto and Melbourne that are significantly more accurate than current industry practice.
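
An illustrative sketch of the kernel-density-estimation ingredient: fit a KDE to recent call locations and read off a relative intensity on a 1 km$^2$ grid. The coordinates, bandwidth and conversion to expected counts are made up, and the temporal weighting and kernel warping described in the paper are omitted.

    # Toy spatial KDE over historical call locations, evaluated on a grid.
    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(0)
    # pretend these are (x, y) km coordinates of calls from the last few weeks,
    # drawn from two urban "hot spots"
    calls = np.vstack([rng.normal([2, 3], 0.5, size=(300, 2)),
                       rng.normal([7, 6], 0.8, size=(200, 2))])

    kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(calls)

    # evaluate on a 10x10 grid of 1 km^2 cells (cell centres at +0.5)
    xs, ys = np.meshgrid(np.arange(0, 10), np.arange(0, 10))
    grid = np.column_stack([xs.ravel() + 0.5, ys.ravel() + 0.5])
    density = np.exp(kde.score_samples(grid))      # estimated density per unit area
    expected = len(calls) * density * 1.0          # approx. calls per cell (area = 1)

    i = expected.argmax()
    print(f"busiest cell centre: {grid[i]}, expected calls: {expected[i]:.1f}")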

* presented at 2016 ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, New York, NY