Research papers and code for "Ding-Xuan Zhou":
Deep learning has been widely applied and has brought breakthroughs in speech recognition, computer vision, and many other domains. The deep neural network architectures involved and the associated computational issues have been well studied in machine learning. However, a theoretical foundation is still lacking for understanding the approximation or generalization ability of deep learning methods generated by network architectures with convolutional structures, such as deep convolutional neural networks. Here we show that a deep convolutional neural network (CNN) is universal, meaning that it can be used to approximate any continuous function to arbitrary accuracy when the depth of the network is large enough. This answers an open question in learning theory. Our quantitative estimate, given tightly in terms of the number of free parameters to be computed, verifies the efficiency of deep CNNs in dealing with high-dimensional data. Our study also demonstrates the role of convolutions in deep CNNs.
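
As a rough illustration of the kind of architecture this result concerns, the sketch below builds a plain 1D convolutional network in NumPy: a chain of convolutions with bias and ReLU followed by a linear output layer, so the free parameters are only the filter coefficients, biases, and output weights. The filter length, depth, and initialization are illustrative choices, not the construction analysed in the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d_full(w, x):
    """'Full' 1D convolution: output length len(x) + len(w) - 1."""
    return np.convolve(w, x)

def deep_cnn(x, filters, biases, out_weights):
    """Chain of convolution + bias + ReLU layers, followed by a linear output."""
    h = x
    for w, b in zip(filters, biases):
        h = relu(conv1d_full(w, h) + b)   # each layer enlarges the width
    return float(out_weights @ h)

# Toy example: depth-3 CNN on a 16-dimensional input (hypothetical sizes).
rng = np.random.default_rng(0)
d, depth, s = 16, 3, 4                    # input dimension, number of conv layers, filter length - 1
x = rng.standard_normal(d)

filters, biases, width = [], [], d
for _ in range(depth):
    filters.append(rng.standard_normal(s + 1) * 0.1)
    width = width + s                     # a full convolution with a filter of length s+1 adds s
    biases.append(rng.standard_normal(width) * 0.1)

out_weights = rng.standard_normal(width) * 0.1
print(deep_cnn(x, filters, biases, out_weights))
```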

In this paper we consider online mirror descent (OMD) algorithms, a class of scalable online learning algorithms that exploit data geometric structures through mirror maps. Necessary and sufficient conditions are presented in terms of the step size sequence $\{\eta_t\}_{t}$ for the convergence of an OMD algorithm with respect to the expected Bregman distance induced by the mirror map. The condition is $\lim_{t\to\infty}\eta_t=0, \sum_{t=1}^{\infty}\eta_t=\infty$ in the case of positive variances. It reduces to $\sum_{t=1}^{\infty}\eta_t=\infty$ in the case of zero variances, for which linear convergence may be achieved by taking a constant step size sequence. A sufficient condition for almost sure convergence is also given. We establish tight error bounds under mild conditions on the mirror map, the loss function, and the regularizer. Our results are achieved by a novel analysis of the one-step progress of the OMD algorithm using the smoothness and strong convexity of the mirror map and the loss function.
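
For concreteness, here is a minimal NumPy sketch of online mirror descent with the negative-entropy mirror map on the simplex (the exponentiated-gradient update) and the step sizes $\eta_t = 1/\sqrt{t}$, which satisfy the condition $\lim_{t\to\infty}\eta_t=0$, $\sum_t \eta_t=\infty$ stated above. The linear losses, the particular mirror map, and all constants are illustrative choices, not the general setting of the paper.

```python
import numpy as np

def omd_entropic(loss_grads, d, eta):
    """Online mirror descent with the negative-entropy mirror map on the simplex.

    loss_grads: iterable of functions, each returning the loss gradient at the current iterate
    d:          dimension of the simplex
    eta:        function t -> step size eta_t
    """
    w = np.full(d, 1.0 / d)               # uniform starting point
    for t, grad_fn in enumerate(loss_grads, start=1):
        g = grad_fn(w)
        w = w * np.exp(-eta(t) * g)       # mirror step in the dual (exponentiated gradient)
        w = w / w.sum()                   # Bregman projection back onto the simplex
    return w

# Toy stream: noisy linear losses <z_t, w> whose expectation favours coordinate 0.
rng = np.random.default_rng(1)
d, T = 5, 2000
mean = np.linspace(0.1, 1.0, d)           # coordinate 0 has the smallest expected loss
stream = [(lambda w, z=mean + 0.3 * rng.standard_normal(d): z) for _ in range(T)]

w_final = omd_entropic(stream, d, eta=lambda t: 1.0 / np.sqrt(t))
print(np.round(w_final, 3))               # the mass should concentrate on coordinate 0
```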

Regularized empirical risk minimization, including support vector machines, plays an important role in machine learning theory. In this paper, regularized pairwise learning (RPL) methods based on kernels are investigated. One example is regularized minimization of the error entropy loss, which has recently attracted considerable interest from the viewpoint of consistency and learning rates. This paper shows that such RPL methods additionally have good statistical robustness properties, provided the loss function and the kernel are chosen appropriately. We treat two cases of particular interest: (i) a bounded and non-convex loss function and (ii) an unbounded convex loss function satisfying a certain Lipschitz type condition.

* 36 pages, 1 figure
In this paper, we consider unregularized online learning algorithms in a reproducing kernel Hilbert space (RKHS). Firstly, we derive explicit convergence rates of the unregularized online learning algorithms for classification associated with a general gamma-activating loss (see Definition 1 in the paper). Our results extend and refine the results in Ying and Pontil (2008) for the least-square loss and the recent result in Bach and Moulines (2011) for loss functions with a Lipschitz-continuous gradient. Moreover, we establish a very general condition on the step sizes which guarantees the convergence of the last iterate of such algorithms. Secondly, we establish, for the first time, the convergence of the unregularized pairwise learning algorithm with a general loss function and derive explicit rates under the assumption of polynomially decaying step sizes. Concrete examples are used to illustrate our main results. The main techniques are tools from convex analysis, refined inequalities for Gaussian averages, and an induction approach.
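
The sketch below shows one common form such an unregularized online algorithm can take in an RKHS: kernel-based stochastic gradient descent for the least-square loss with polynomially decaying step sizes, where the iterate is kept as a kernel expansion over the examples seen so far. The Gaussian kernel, the least-square loss, and the step-size exponent are illustrative choices; the paper treats classification losses and pairwise learning in greater generality.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def online_kernel_ls(stream, step, sigma=1.0):
    """Unregularized online least squares in an RKHS:
       f_{t+1} = f_t - eta_t * (f_t(x_t) - y_t) * K(x_t, .)"""
    points, coefs = [], []
    for t, (x, y) in enumerate(stream, start=1):
        fx = sum(c * gaussian_kernel(p, x, sigma) for p, c in zip(points, coefs))
        points.append(x)
        coefs.append(-step(t) * (fx - y))  # gradient of the least-square loss at (x_t, y_t)
    def f(x_new):
        return sum(c * gaussian_kernel(p, x_new, sigma) for p, c in zip(points, coefs))
    return f

# Toy regression stream: y = sin(2*pi*x) + noise on [0, 1].
rng = np.random.default_rng(2)
stream = [(np.array([x]), np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal())
          for x in rng.random(500)]
f = online_kernel_ls(stream, step=lambda t: 1.0 / t ** 0.6, sigma=0.2)
print(round(f(np.array([0.25])), 3))       # should be close to sin(pi/2) = 1
```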

We establish minimax optimal rates of convergence for estimation in a high dimensional additive model assuming that it is approximately sparse. Our results reveal an interesting phase transition behavior universal to this class of high dimensional problems. In the {\it sparse regime} when the components are sufficiently smooth or the dimensionality is sufficiently large, the optimal rates are identical to those for high dimensional linear regression, and therefore there is no additional cost to entertain a nonparametric model. Otherwise, in the so-called {\it smooth regime}, the rates coincide with the optimal rates for estimating a univariate function, and therefore they are immune to the "curse of dimensionality".

Pairwise learning usually refers to a learning task whose loss function depends on pairs of examples; notable examples include ranking, metric learning and AUC maximization. In this paper, we study an online algorithm for pairwise learning with a least-square loss function in an unconstrained setting of a reproducing kernel Hilbert space (RKHS), which we refer to as the Online Pairwise lEaRning Algorithm (OPERA). In contrast to existing works \cite{Kar,Wang}, which require that the iterates be restricted to a bounded domain or that the loss function be strongly convex, OPERA is associated with a non-strongly convex objective function and learns the target function in an unconstrained RKHS. Specifically, we establish a general theorem which guarantees the almost sure convergence of the last iterate of OPERA without any assumptions on the underlying distribution. Explicit convergence rates are derived under the condition of polynomially decaying step sizes. We also establish an interesting property of a family of widely-used kernels in the setting of pairwise learning and illustrate the above convergence results using such kernels. Our methodology mainly depends on the characterization of RKHSs via their associated integral operators and on probability inequalities for random variables with values in a Hilbert space.
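
To make the pairwise setting concrete, below is a simplified OPERA-style sketch: when a new example arrives, the iterate (kept as a kernel expansion) is moved along the averaged gradient of the pairwise least-square loss against the previously seen examples. The exact update, step sizes, and kernel used in the paper may differ; this is only a plausible instance of the scheme, with all constants hypothetical.

```python
import numpy as np

def gauss(x, y, sigma=0.5):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def online_pairwise_ls(stream, step, sigma=0.5):
    """Simplified online pairwise least squares in an RKHS.

    On arrival of (x_t, y_t), f is updated with the averaged gradient of
    (f(x_t) - f(x_j) - (y_t - y_j))^2 over previously seen examples (x_j, y_j).
    """
    xs, ys, coefs = [], [], []             # kernel expansion of the current iterate

    def f(x):
        return sum(c * gauss(p, x, sigma) for p, c in zip(xs, coefs))

    for t, (x, y) in enumerate(stream, start=1):
        if t == 1:
            xs.append(x); ys.append(y); coefs.append(0.0)
            continue
        eta, fx = step(t), f(x)
        resid = [fx - f(px) - (y - py) for px, py in zip(xs, ys)]
        for j, r in enumerate(resid):      # step along -eta * mean_j resid_j * (K(x,.) - K(x_j,.))
            coefs[j] += eta * r / len(resid)
        xs.append(x); ys.append(y); coefs.append(-eta * float(np.mean(resid)))
    return f

# Toy stream where y = <w*, x>, so pairwise differences are learnable.
rng = np.random.default_rng(3)
w_star = np.array([1.0, -2.0])
stream = [(x, float(w_star @ x)) for x in rng.standard_normal((300, 2))]
f = online_pairwise_ls(stream, step=lambda t: 0.5 / t ** 0.75)
print(round(f(np.array([1.0, 0.0])) - f(np.array([0.0, 0.0])), 3))  # roughly w_star[0] = 1
```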

Additive models play an important role in semiparametric statistics. This paper gives learning rates for regularized kernel based methods for additive models. These learning rates compare favourably, in particular in high dimensions, to recent results on optimal learning rates for purely nonparametric regularized kernel based quantile regression using the Gaussian radial basis function kernel, provided the assumption of an additive model is valid. Additionally, a concrete example is presented to show that a Gaussian function depending only on one variable lies in a reproducing kernel Hilbert space generated by an additive Gaussian kernel, but does not belong to the reproducing kernel Hilbert space generated by the multivariate Gaussian kernel of the same variance.
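
In one common formulation, the additive Gaussian kernel mentioned here is a sum of univariate Gaussian kernels over the coordinates, as opposed to the usual multivariate Gaussian kernel on the full input. The sketch below contrasts the two; the exact definition and normalization used in the paper may differ.

```python
import numpy as np

def multivariate_gaussian_kernel(x, y, gamma=1.0):
    """Usual Gaussian RBF kernel on the full d-dimensional input."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def additive_gaussian_kernel(x, y, gamma=1.0):
    """Sum of univariate Gaussian kernels, one per coordinate (an additive kernel)."""
    return np.sum(np.exp(-gamma * (x - y) ** 2))

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 0.9, 2.5])
print(multivariate_gaussian_kernel(x, y))  # single exponential of the full squared distance
print(additive_gaussian_kernel(x, y))      # one exponential per coordinate, then summed
```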

* 35 pages, 2 figures
Regularized empirical risk minimization using kernels and their corresponding reproducing kernel Hilbert spaces (RKHSs) plays an important role in machine learning. However, the kernel actually used often depends on one or a few hyperparameters, or is even data dependent in a much more complicated manner. Examples are Gaussian RBF kernels, kernel learning, and hierarchical Gaussian kernels, which were recently proposed for deep learning. Therefore, the kernel actually used is often computed by a grid search or in an iterative manner and can often only be considered an approximation to the "ideal" or "optimal" kernel. This paper gives conditions under which classical kernel based methods, based on a convex Lipschitz loss function and on a bounded and smooth kernel, are stable if the probability measure $P$, the regularization parameter $\lambda$, and the kernel $k$ change slightly in a simultaneous manner. Similar results are also given for pairwise learning. Therefore, the topic of this paper is somewhat more general than in classical robust statistics, where usually only the influence of small perturbations of the probability measure $P$ on the estimated function is considered.

We consider the problem of supervised learning with convex loss functions and propose a new form of iterative regularization based on the subgradient method. Unlike other regularization approaches, iterative regularization involves no constraint or penalization; generalization is achieved by (early) stopping an empirical iteration. We consider a nonparametric setting, in the framework of reproducing kernel Hilbert spaces, and prove finite sample bounds on the excess risk under general regularity conditions. Our study provides a new class of efficient regularized learning algorithms and gives insights into the interplay between statistics and optimization in machine learning.
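
A minimal sketch of the idea, assuming a kernel setting with the hinge loss: run plain (sub)gradient iterations on the unpenalized empirical risk and use the number of iterations as the regularization parameter (early stopping). The kernel, loss, and stopping rule below are illustrative, not the precise algorithm analysed in the paper.

```python
import numpy as np

def gram(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def iterative_regularization(X, y, T, eta=0.5, sigma=1.0):
    """Subgradient iterations on the (unpenalized) empirical hinge risk in an RKHS.

    The iterate is f = sum_i alpha_i K(x_i, .); regularization comes only from
    stopping after T iterations (early stopping), not from a penalty term.
    """
    K, n = gram(X, sigma), len(y)
    alpha = np.zeros(n)
    for _ in range(T):
        margins = y * (K @ alpha)
        sub = -(margins < 1).astype(float) * y      # subgradient of the hinge loss w.r.t. f(x_i)
        alpha -= (eta / n) * sub                    # f <- f - eta * (1/n) * sum_i sub_i K(x_i, .)
    return alpha

# Toy binary classification problem (hypothetical data and stopping time).
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])
alpha = iterative_regularization(X, y, T=50)
pred = np.sign(gram(X) @ alpha)
print(float(np.mean(pred == y)))                    # training accuracy after early stopping
```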

We study distributed learning with the least squares regularization scheme in a reproducing kernel Hilbert space (RKHS). Using a divide-and-conquer approach, the algorithm partitions a data set into disjoint subsets, applies the least squares regularization scheme to each subset to produce an output function, and then takes the average of the individual output functions as the final global estimator or predictor. We show, with error bounds in expectation in both the $L^2$-metric and the RKHS-metric, that the global output function of this distributed learning is a good approximation to the algorithm processing the whole data on a single machine. Our error bounds are sharp and stated in a general setting without any eigenfunction assumption. The analysis is achieved by a novel second order decomposition of operator differences in our integral operator approach. Even for the classical least squares regularization scheme in the RKHS associated with a general kernel, we give the best learning rate in the literature.
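
The divide-and-conquer scheme described above is easy to state in code: split the sample into $m$ disjoint subsets, solve kernel ridge regression (least squares regularization in the RKHS) on each subset, and average the resulting functions. Below is a minimal NumPy sketch with a Gaussian kernel; the kernel and parameter choices are illustrative.

```python
import numpy as np

def gauss_kernel(A, B, sigma=0.2):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, lam, sigma=0.2):
    """Kernel ridge regression on one data subset: alpha = (K + lam*n*I)^{-1} y."""
    n = len(y)
    K = gauss_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return lambda Xnew: gauss_kernel(Xnew, X, sigma) @ alpha

def distributed_krr(X, y, m, lam, sigma=0.2):
    """Divide-and-conquer: fit KRR on m disjoint subsets and average the outputs."""
    estimators = [krr_fit(Xs, ys, lam, sigma)
                  for Xs, ys in zip(np.array_split(X, m), np.array_split(y, m))]
    return lambda Xnew: np.mean([f(Xnew) for f in estimators], axis=0)

# Toy example: noisy sine curve, 4 "machines".
rng = np.random.default_rng(5)
X = rng.random((400, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(400)
f_bar = distributed_krr(X, y, m=4, lam=1e-3)
print(np.round(f_bar(np.array([[0.25], [0.75]])), 2))   # approximately [1, -1]
```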

* 28 pages
Based on a tree architecture, the objective of this paper is to design deep neural networks with two or more hidden layers (called deep nets) for the realization of radial functions, so as to enable rotational invariance for near-optimal function approximation in an arbitrarily high dimensional Euclidean space. It is shown that deep nets have much better performance than shallow nets (with only one hidden layer) in terms of approximation accuracy and learning capabilities. In particular, for learning radial functions, it is shown that a near-optimal rate can be achieved by deep nets but not by shallow nets. Our results illustrate the necessity of depth in neural network design for the realization of rotation-invariant target functions.

* 34 pages, 1 figure
The subject of deep learning has recently attracted users of machine learning from various disciplines, including medical diagnosis and bioinformatics, financial market analysis and online advertisement, speech and handwriting recognition, computer vision and natural language processing, time series forecasting, and search engines. However, the theoretical development of deep learning is still in its infancy. The objective of this paper is to introduce a deep neural network (also called deep-net) approach to localized manifold learning, with each hidden layer endowed with a specific learning task. For the purpose of illustration, we focus only on deep-nets with three hidden layers, with the first layer for dimensionality reduction, the second layer for bias reduction, and the third layer for variance reduction. A feedback component is also designed to eliminate outliers. The main theoretical result in this paper is the order $\mathcal O\left(m^{-2s/(2s+d)}\right)$ of approximation of the regression function with regularity $s$, in terms of the number $m$ of sample points, where the (unknown) manifold dimension $d$ replaces the dimension $D$ of the sampling (Euclidean) space for shallow nets.

* 22 pages
In this paper, we study data-dependent generalization error bounds exhibiting a mild dependency on the number of classes, making them suitable for multi-class learning with a large number of label classes. The bounds generally hold for empirical multi-class risk minimization algorithms using an arbitrary norm as regularizer. Key to our analysis are new structural results for multi-class Gaussian complexities and empirical $\ell_\infty$-norm covering numbers, which exploit the Lipschitz continuity of the loss function with respect to the $\ell_2$- and $\ell_\infty$-norm, respectively. We establish data-dependent error bounds in terms of complexities of a linear function class defined on a finite set induced by training examples, for which we show tight lower and upper bounds. We apply the results to several prominent multi-class learning machines, exhibiting a tighter dependency on the number of classes than the state of the art. For instance, for the multi-class SVM by Crammer and Singer (2002), we obtain a data-dependent bound with a logarithmic dependency which significantly improves the previous square-root dependency. Experimental results are reported to verify the effectiveness of our theoretical findings.
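
For reference, the multi-class hinge loss of Crammer and Singer, one of the learning machines to which the bounds are applied, can be written as a function of the vector of class scores. The sketch below computes a standard form of it for a linear model; the toy weights and features are hypothetical and only fix notation.

```python
import numpy as np

def crammer_singer_loss(scores, y):
    """Multi-class hinge loss of Crammer and Singer for a score vector and true class y:
       max_j ( 1[j != y] + scores[j] ) - scores[y]."""
    margins = scores + (np.arange(len(scores)) != y).astype(float)
    return float(np.max(margins) - scores[y])

# Linear model with one weight vector per class (hypothetical toy numbers).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])              # 3 classes, 2 features
x = np.array([3.0, 0.0])
print(crammer_singer_loss(W @ x, y=0))  # 0.0: class 0 beats every other class by a margin of at least 1
```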

In this paper we study the consistency of an empirical minimum error entropy (MEE) algorithm in a regression setting. We introduce two types of consistency. The error entropy consistency, which requires the error entropy of the learned function to approximate the minimum error entropy, is shown to always hold if the bandwidth parameter tends to 0 at an appropriate rate. The regression consistency, which requires the learned function to approximate the regression function, however, is a more complicated issue. We prove that error entropy consistency implies regression consistency for homoskedastic models where the noise is independent of the input variable. For heteroskedastic models, however, a counterexample is used to show that the two types of consistency do not coincide. A surprising result is that regression consistency always holds, provided that the bandwidth parameter tends to infinity at an appropriate rate. Regression consistency of two classes of special models is shown to hold with a fixed bandwidth parameter, which further illustrates the complexity of the regression consistency of MEE. The Fourier transform plays a crucial role in our analysis.

We consider the minimum error entropy (MEE) criterion and an empirical risk minimization learning algorithm in a regression setting. A learning theory approach is presented for this MEE algorithm, and explicit error bounds are provided in terms of the approximation ability and capacity of the involved hypothesis space when the MEE scaling parameter is large. A novel asymptotic analysis is conducted for the generalization error associated with Renyi's entropy and a Parzen window function, to overcome technical difficulties arising from the essential differences between classical least squares problems and the MEE setting. A semi-norm and the involved symmetrized least squares error are introduced, which are related to some ranking algorithms.
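
As a reference point for the objective being analysed, the empirical MEE criterion with Renyi's quadratic entropy and a Gaussian Parzen window is commonly written as minus the log of the pairwise "information potential" of the residuals; a minimal sketch is given below. The scaling conventions (e.g. where the bandwidth $h$ enters) may differ from those in the paper.

```python
import numpy as np

def empirical_error_entropy(residuals, h):
    """Empirical Renyi quadratic error entropy with a Gaussian Parzen window of bandwidth h:
       -log( (1/m^2) * sum_{i,j} G_h(e_i - e_j) )."""
    e = np.asarray(residuals)
    diffs = e[:, None] - e[None, :]
    window = np.exp(-diffs ** 2 / (2 * h ** 2)) / (np.sqrt(2 * np.pi) * h)
    return -np.log(np.mean(window))

# Tightly concentrated residuals have smaller error entropy than spread-out ones.
rng = np.random.default_rng(6)
print(empirical_error_entropy(0.1 * rng.standard_normal(200), h=1.0))
print(empirical_error_entropy(2.0 * rng.standard_normal(200), h=1.0))
```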

* JMLR 2013