Deep learning has been widely applied and brought breakthroughs in speech recognition, computer vision, and many other domains. The involved deep neural network architectures and computational issues have been well studied in machine learning. But there lacks a theoretical foundation for understanding the approximation or generalization ability of deep learning methods generated by the network architectures such as deep convolutional neural networks having convolutional structures. Here we show that a deep convolutional neural network (CNN) is universal, meaning that it can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough. This answers an open question in learning theory. Our quantitative estimate, given tightly in terms of the number of free parameters to be computed, verifies the efficiency of deep CNNs in dealing with large dimensional data. Our study also demonstrates the role of convolutions in deep CNNs.

**Click to Read Paper****Click to Read Paper**

On the Robustness of Regularized Pairwise Learning Methods Based on Kernels

Oct 12, 2015

Andreas Christmann, Ding-Xuan Zhou

Regularized empirical risk minimization including support vector machines plays an important role in machine learning theory. In this paper regularized pairwise learning (RPL) methods based on kernels will be investigated. One example is regularized minimization of the error entropy loss which has recently attracted quite some interest from the viewpoint of consistency and learning rates. This paper shows that such RPL methods have additionally good statistical robustness properties, if the loss function and the kernel are chosen appropriately. We treat two cases of particular interest: (i) a bounded and non-convex loss function and (ii) an unbounded convex loss function satisfying a certain Lipschitz type condition.
Oct 12, 2015

Andreas Christmann, Ding-Xuan Zhou

* 36 pages, 1 figure

**Click to Read Paper**

Unregularized Online Learning Algorithms with General Loss Functions

Apr 26, 2015

Yiming Ying, Ding-Xuan Zhou

In this paper, we consider unregularized online learning algorithms in a Reproducing Kernel Hilbert Spaces (RKHS). Firstly, we derive explicit convergence rates of the unregularized online learning algorithms for classification associated with a general gamma-activating loss (see Definition 1 in the paper). Our results extend and refine the results in Ying and Pontil (2008) for the least-square loss and the recent result in Bach and Moulines (2011) for the loss function with a Lipschitz-continuous gradient. Moreover, we establish a very general condition on the step sizes which guarantees the convergence of the last iterate of such algorithms. Secondly, we establish, for the first time, the convergence of the unregularized pairwise learning algorithm with a general loss function and derive explicit rates under the assumption of polynomially decaying step sizes. Concrete examples are used to illustrate our main results. The main techniques are tools from convex analysis, refined inequalities of Gaussian averages, and an induction approach.
Apr 26, 2015

Yiming Ying, Ding-Xuan Zhou

**Click to Read Paper**

Minimax Optimal Rates of Estimation in High Dimensional Additive Models: Universal Phase Transition

Mar 10, 2015

Ming Yuan, Ding-Xuan Zhou

Mar 10, 2015

Ming Yuan, Ding-Xuan Zhou

**Click to Read Paper**

Pairwise learning usually refers to a learning task which involves a loss function depending on pairs of examples, among which most notable ones include ranking, metric learning and AUC maximization. In this paper, we study an online algorithm for pairwise learning with a least-square loss function in an unconstrained setting of a reproducing kernel Hilbert space (RKHS), which we refer to as the Online Pairwise lEaRning Algorithm (OPERA). In contrast to existing works \cite{Kar,Wang} which require that the iterates are restricted to a bounded domain or the loss function is strongly-convex, OPERA is associated with a non-strongly convex objective function and learns the target function in an unconstrained RKHS. Specifically, we establish a general theorem which guarantees the almost surely convergence for the last iterate of OPERA without any assumptions on the underlying distribution. Explicit convergence rates are derived under the condition of polynomially decaying step sizes. We also establish an interesting property for a family of widely-used kernels in the setting of pairwise learning and illustrate the above convergence results using such kernels. Our methodology mainly depends on the characterization of RKHSs using its associated integral operators and probability inequalities for random variables with values in a Hilbert space.

**Click to Read Paper**
Learning rates for the risk of kernel based quantile regression estimators in additive models

May 14, 2014

Andreas Christmann, Ding-Xuan Zhou

Additive models play an important role in semiparametric statistics. This paper gives learning rates for regularized kernel based methods for additive models. These learning rates compare favourably in particular in high dimensions to recent results on optimal learning rates for purely nonparametric regularized kernel based quantile regression using the Gaussian radial basis function kernel, provided the assumption of an additive model is valid. Additionally, a concrete example is presented to show that a Gaussian function depending only on one variable lies in a reproducing kernel Hilbert space generated by an additive Gaussian kernel, but does not belong to the reproducing kernel Hilbert space generated by the multivariate Gaussian kernel of the same variance.
May 14, 2014

Andreas Christmann, Ding-Xuan Zhou

* 35 pages, 2 figures

**Click to Read Paper**

Regularized empirical risk minimization using kernels and their corresponding reproducing kernel Hilbert spaces (RKHSs) plays an important role in machine learning. However, the actually used kernel often depends on one or on a few hyperparameters or the kernel is even data dependent in a much more complicated manner. Examples are Gaussian RBF kernels, kernel learning, and hierarchical Gaussian kernels which were recently proposed for deep learning. Therefore, the actually used kernel is often computed by a grid search or in an iterative manner and can often only be considered as an approximation to the "ideal" or "optimal" kernel. The paper gives conditions under which classical kernel based methods based on a convex Lipschitz loss function and on a bounded and smooth kernel are stable, if the probability measure $P$, the regularization parameter $\lambda$, and the kernel $k$ may slightly change in a simultaneous manner. Similar results are also given for pairwise learning. Therefore, the topic of this paper is somewhat more general than in classical robust statistics, where usually only the influence of small perturbations of the probability measure $P$ on the estimated function is considered.

**Click to Read Paper**
Iterative Regularization for Learning with Convex Loss Functions

Apr 01, 2015

Junhong Lin, Lorenzo Rosasco, Ding-Xuan Zhou

We consider the problem of supervised learning with convex loss functions and propose a new form of iterative regularization based on the subgradient method. Unlike other regularization approaches, in iterative regularization no constraint or penalization is considered, and generalization is achieved by (early) stopping an empirical iteration. We consider a nonparametric setting, in the framework of reproducing kernel Hilbert spaces, and prove finite sample bounds on the excess risk under general regularity conditions. Our study provides a new class of efficient regularized learning algorithms and gives insights on the interplay between statistics and optimization in machine learning.
Apr 01, 2015

Junhong Lin, Lorenzo Rosasco, Ding-Xuan Zhou

**Click to Read Paper**

Distributed learning with regularized least squares

Mar 11, 2017

Shao-Bo Lin, Xin Guo, Ding-Xuan Zhou

We study distributed learning with the least squares regularization scheme in a reproducing kernel Hilbert space (RKHS). By a divide-and-conquer approach, the algorithm partitions a data set into disjoint data subsets, applies the least squares regularization scheme to each data subset to produce an output function, and then takes an average of the individual output functions as a final global estimator or predictor. We show with error bounds in expectation in both the $L^2$-metric and RKHS-metric that the global output function of this distributed learning is a good approximation to the algorithm processing the whole data in one single machine. Our error bounds are sharp and stated in a general setting without any eigenfunction assumption. The analysis is achieved by a novel second order decomposition of operator differences in our integral operator approach. Even for the classical least squares regularization scheme in the RKHS associated with a general kernel, we give the best learning rate in the literature.
Mar 11, 2017

Shao-Bo Lin, Xin Guo, Ding-Xuan Zhou

* 28 pages

**Click to Read Paper**

Construction of neural networks for realization of localized deep learning

Mar 09, 2018

Charles K. Chui, Shao-Bo Lin, Ding-Xuan Zhou

The subject of deep learning has recently attracted users of machine learning from various disciplines, including: medical diagnosis and bioinformatics, financial market analysis and online advertisement, speech and handwriting recognition, computer vision and natural language processing, time series forecasting, and search engines. However, theoretical development of deep learning is still at its infancy. The objective of this paper is to introduce a deep neural network (also called deep-net) approach to localized manifold learning, with each hidden layer endowed with a specific learning task. For the purpose of illustrations, we only focus on deep-nets with three hidden layers, with the first layer for dimensionality reduction, the second layer for bias reduction, and the third layer for variance reduction. A feedback component also designed to eliminate outliers. The main theoretical result in this paper is the order $\mathcal O\left(m^{-2s/(2s+d)}\right)$ of approximation of the regression function with regularity $s$, in terms of the number $m$ of sample points, where the (unknown) manifold dimension $d$ replaces the dimension $D$ of the sampling (Euclidean) space for shallow nets.
Mar 09, 2018

Charles K. Chui, Shao-Bo Lin, Ding-Xuan Zhou

* 22pages

**Click to Read Paper**

Data-dependent Generalization Bounds for Multi-class Classification

Dec 29, 2017

Yunwen Lei, Urun Dogan, Ding-Xuan Zhou, Marius Kloft

Dec 29, 2017

Yunwen Lei, Urun Dogan, Ding-Xuan Zhou, Marius Kloft

**Click to Read Paper**

Consistency Analysis of an Empirical Minimum Error Entropy Algorithm

Dec 17, 2014

Jun Fan, Ting Hu, Qiang Wu, Ding-Xuan Zhou

In this paper we study the consistency of an empirical minimum error entropy (MEE) algorithm in a regression setting. We introduce two types of consistency. The error entropy consistency, which requires the error entropy of the learned function to approximate the minimum error entropy, is shown to be always true if the bandwidth parameter tends to 0 at an appropriate rate. The regression consistency, which requires the learned function to approximate the regression function, however, is a complicated issue. We prove that the error entropy consistency implies the regression consistency for homoskedastic models where the noise is independent of the input variable. But for heteroskedastic models, a counterexample is used to show that the two types of consistency do not coincide. A surprising result is that the regression consistency is always true, provided that the bandwidth parameter tends to infinity at an appropriate rate. Regression consistency of two classes of special models is shown to hold with fixed bandwidth parameter, which further illustrates the complexity of regression consistency of MEE. Fourier transform plays crucial roles in our analysis.
Dec 17, 2014

Jun Fan, Ting Hu, Qiang Wu, Ding-Xuan Zhou

**Click to Read Paper**

Learning Theory Approach to Minimum Error Entropy Criterion

Feb 22, 2013

Ting Hu, Jun Fan, Qiang Wu, Ding-Xuan Zhou

We consider the minimum error entropy (MEE) criterion and an empirical risk minimization learning algorithm in a regression setting. A learning theory approach is presented for this MEE algorithm and explicit error bounds are provided in terms of the approximation ability and capacity of the involved hypothesis space when the MEE scaling parameter is large. Novel asymptotic analysis is conducted for the generalization error associated with Renyi's entropy and a Parzen window function, to overcome technical difficulties arisen from the essential differences between the classical least squares problems and the MEE setting. A semi-norm and the involved symmetrized least squares error are introduced, which is related to some ranking algorithms.
Feb 22, 2013

Ting Hu, Jun Fan, Qiang Wu, Ding-Xuan Zhou

* JMLR 2013

**Click to Read Paper**