Models, code, and papers for "Jieping Ye":

We develop an efficient alternating framework for learning a generalized version of Factorization Machine (gFM) on steaming data with provable guarantees. When the instances are sampled from $d$ dimensional random Gaussian vectors and the target second order coefficient matrix in gFM is of rank $k$, our algorithm converges linearly, achieves $O(\epsilon)$ recovery error after retrieving $O(k^{3}d\log(1/\epsilon))$ training instances, consumes $O(kd)$ memory in one-pass of dataset and only requires matrix-vector product operations in each iteration. The key ingredient of our framework is a construction of an estimation sequence endowed with a so-called Conditionally Independent RIP condition (CI-RIP). As special cases of gFM, our framework can be applied to symmetric or asymmetric rank-one matrix sensing problems, such as inductive matrix completion and phase retrieval.

Stochastic gradient algorithms estimate the gradient based on only one or a few samples and enjoy low computational cost per iteration. They have been widely used in large-scale optimization problems. However, stochastic gradient algorithms are usually slow to converge and achieve sub-linear convergence rates, due to the inherent variance in the gradient computation. To accelerate the convergence, some variance-reduced stochastic gradient algorithms, e.g., proximal stochastic variance-reduced gradient (Prox-SVRG) algorithm, have recently been proposed to solve strongly convex problems. Under the strongly convex condition, these variance-reduced stochastic gradient algorithms achieve a linear convergence rate. However, many machine learning problems are convex but not strongly convex. In this paper, we introduce Prox-SVRG and its projected variant called Variance-Reduced Projected Stochastic Gradient (VRPSG) to solve a class of non-strongly convex optimization problems widely used in machine learning. As the main technical contribution of this paper, we show that both VRPSG and Prox-SVRG achieve a linear convergence rate without strong convexity. A key ingredient in our proof is a Semi-Strongly Convex (SSC) inequality which is the first to be rigorously proved for a class of non-strongly convex problems in both constrained and regularized settings. Moreover, the SSC inequality is independent of algorithms and may be applied to analyze other stochastic gradient algorithms besides VRPSG and Prox-SVRG, which may be of independent interest. To the best of our knowledge, this is the first work that establishes the linear convergence rate for the variance-reduced stochastic gradient algorithms on solving both constrained and regularized problems without strong convexity.

Multi-task feature learning (MTFL) is a powerful technique in boosting the predictive performance by learning multiple related classification/regression/clustering tasks simultaneously. However, solving the MTFL problem remains challenging when the feature dimension is extremely large. In this paper, we propose a novel screening rule---that is based on the dual projection onto convex sets (DPC)---to quickly identify the inactive features---that have zero coefficients in the solution vectors across all tasks. One of the appealing features of DPC is that: it is safe in the sense that the detected inactive features are guaranteed to have zero coefficients in the solution vectors across all tasks. Thus, by removing the inactive features from the training phase, we may have substantial savings in the computational cost and memory usage without sacrificing accuracy. To the best of our knowledge, it is the first screening rule that is applicable to sparse models with multiple data matrices. A key challenge in deriving DPC is to solve a nonconvex problem. We show that we can solve for the global optimum efficiently via a properly chosen parametrization of the constraint set. Moreover, DPC has very low computational cost and can be integrated with any existing solvers. We have evaluated the proposed DPC rule on both synthetic and real data sets. The experiments indicate that DPC is very effective in identifying the inactive features---especially for high dimensional data---which leads to a speedup up to several orders of magnitude.

Sparse-Group Lasso (SGL) has been shown to be a powerful regression technique for simultaneously discovering group and within-group sparse patterns by using a combination of the $\ell_1$ and $\ell_2$ norms. However, in large-scale applications, the complexity of the regularizers entails great computational challenges. In this paper, we propose a novel Two-Layer Feature REduction method (TLFre) for SGL via a decomposition of its dual feasible set. The two-layer reduction is able to quickly identify the inactive groups and the inactive features, respectively, which are guaranteed to be absent from the sparse representation and can be removed from the optimization. Existing feature reduction methods are only applicable for sparse models with one sparsity-inducing regularizer. To our best knowledge, TLFre is the first one that is capable of dealing with multiple sparsity-inducing regularizers. Moreover, TLFre has a very low computational cost and can be integrated with any existing solvers. We also develop a screening method---called DPC (DecomPosition of Convex set)---for the nonnegative Lasso problem. Experiments on both synthetic and real data sets show that TLFre and DPC improve the efficiency of SGL and nonnegative Lasso by several orders of magnitude.

We study the problem of estimating multiple predictive functions from a dictionary of basis functions in the nonparametric regression setting. Our estimation scheme assumes that each predictive function can be estimated in the form of a linear combination of the basis functions. By assuming that the coefficient matrix admits a sparse low-rank structure, we formulate the function estimation problem as a convex program regularized by the trace norm and the $\ell_1$-norm simultaneously. We propose to solve the convex program using the accelerated gradient (AG) method and the alternating direction method of multipliers (ADMM) respectively; we also develop efficient algorithms to solve the key components in both AG and ADMM. In addition, we conduct theoretical analysis on the proposed function estimation scheme: we derive a key property of the optimal solution to the convex program; based on an assumption on the basis functions, we establish a performance bound of the proposed function estimation scheme (via the composite regularization). Simulation studies demonstrate the effectiveness and efficiency of the proposed algorithms.

Sparse learning has recently received increasing attention in many areas including machine learning, statistics, and applied mathematics. The mixed-norm regularization based on the L1/Lq norm with q > 1 is attractive in many applications of regression and classification in that it facilitates group sparsity in the model. The resulting optimization problem is, however, challenging to solve due to the structure of the L1/Lq -regularization. Existing work deals with special cases including q = 2,infinity, and they cannot be easily extended to the general case. In this paper, we propose an efficient algorithm based on the accelerated gradient method for solving the L1/Lq -regularized problem, which is applicable for all values of q larger than 1, thus significantly extending existing work. One key building block of the proposed algorithm is the L1/Lq -regularized Euclidean projection (EP1q). Our theoretical analysis reveals the key properties of EP1q and illustrates why EP1q for the general q is significantly more challenging to solve than the special cases. Based on our theoretical analysis, we develop an efficient algorithm for EP1q by solving two zero finding problems. Experimental results demonstrate the efficiency of the proposed algorithm.

The group Lasso is an extension of the Lasso for feature selection on (predefined) non-overlapping groups of features. The non-overlapping group structure limits its applicability in practice. There have been several recent attempts to study a more general formulation, where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. We have performed empirical evaluations using the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results demonstrate the efficiency and effectiveness of the proposed algorithm.

In this paper, we provide a new framework to obtain the generalization bounds of the learning process for domain adaptation, and then apply the derived bounds to analyze the asymptotical convergence of the learning process. Without loss of generality, we consider two kinds of representative domain adaptation: one is with multiple sources and the other is combining source and target data. In particular, we use the integral probability metric to measure the difference between two domains. For either kind of domain adaptation, we develop a related Hoeffding-type deviation inequality and a symmetrization inequality to achieve the corresponding generalization bound based on the uniform entropy number. We also generalized the classical McDiarmid's inequality to a more general setting where independent random variables can take values from different domains. By using this inequality, we then obtain generalization bounds based on the Rademacher complexity. Afterwards, we analyze the asymptotic convergence and the rate of convergence of the learning process for such kind of domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process and the numerical experiments support our theoretical findings as well.

Lasso is a widely used regression technique to find sparse representations. When the dimension of the feature space and the number of samples are extremely large, solving the Lasso problem remains challenging. To improve the efficiency of solving large-scale Lasso problems, El Ghaoui and his colleagues have proposed the SAFE rules which are able to quickly identify the inactive predictors, i.e., predictors that have $0$ components in the solution vector. Then, the inactive predictors or features can be removed from the optimization problem to reduce its scale. By transforming the standard Lasso to its dual form, it can be shown that the inactive predictors include the set of inactive constraints on the optimal dual solution. In this paper, we propose an efficient and effective screening rule via Dual Polytope Projections (DPP), which is mainly based on the uniqueness and nonexpansiveness of the optimal dual solution due to the fact that the feasible set in the dual space is a convex and closed polytope. Moreover, we show that our screening rule can be extended to identify inactive groups in group Lasso. To the best of our knowledge, there is currently no "exact" screening rule for group Lasso. We have evaluated our screening rule using synthetic and real data sets. Results show that our rule is more effective in identifying inactive predictors than existing state-of-the-art screening rules for Lasso.

We consider forward-backward greedy algorithms for solving sparse feature selection problems with general convex smooth functions. A state-of-the-art greedy method, the Forward-Backward greedy algorithm (FoBa-obj) requires to solve a large number of optimization problems, thus it is not scalable for large-size problems. The FoBa-gdt algorithm, which uses the gradient information for feature selection at each forward iteration, significantly improves the efficiency of FoBa-obj. In this paper, we systematically analyze the theoretical properties of both forward-backward greedy algorithms. Our main contributions are: 1) We derive better theoretical bounds than existing analyses regarding FoBa-obj for general smooth convex functions; 2) We show that FoBa-gdt achieves the same theoretical performance as FoBa-obj under the same condition: restricted strong convexity condition. Our new bounds are consistent with the bounds of a special case (least squares) and fills a previously existing theoretical gap for general convex smooth functions; 3) We show that the restricted strong convexity condition is satisfied if the number of independent samples is more than $\bar{k}\log d$ where $\bar{k}$ is the sparsity number and $d$ is the dimension of the variable; 4) We apply FoBa-gdt (with the conditional random field objective) to the sensor selection problem for human indoor activity recognition and our results show that FoBa-gdt outperforms other methods (including the ones based on forward greedy selection and L1-regularization).

The support vector machine (SVM) is a widely used method for classification. Although many efforts have been devoted to develop efficient solvers, it remains challenging to apply SVM to large-scale problems. A nice property of SVM is that the non-support vectors have no effect on the resulting classifier. Motivated by this observation, we present fast and efficient screening rules to discard non-support vectors by analyzing the dual problem of SVM via variational inequalities (DVI). As a result, the number of data instances to be entered into the optimization can be substantially reduced. Some appealing features of our screening method are: (1) DVI is safe in the sense that the vectors discarded by DVI are guaranteed to be non-support vectors; (2) the data set needs to be scanned only once to run the screening, whose computational cost is negligible compared to that of solving the SVM problem; (3) DVI is independent of the solvers and can be integrated with any existing efficient solvers. We also show that the DVI technique can be extended to detect non-support vectors in the least absolute deviations regression (LAD). To the best of our knowledge, there are currently no screening methods for LAD. We have evaluated DVI on both synthetic and real data sets. Experiments indicate that DVI significantly outperforms the existing state-of-the-art screening rules for SVM, and is very effective in discarding non-support vectors for LAD. The speedup gained by DVI rules can be up to two orders of magnitude.

We consider the following signal recovery problem: given a measurement matrix $\Phi\in \mathbb{R}^{n\times p}$ and a noisy observation vector $c\in \mathbb{R}^{n}$ constructed from $c = \Phi\theta^* + \epsilon$ where $\epsilon\in \mathbb{R}^{n}$ is the noise vector whose entries follow i.i.d. centered sub-Gaussian distribution, how to recover the signal $\theta^*$ if $D\theta^*$ is sparse {\rca under a linear transformation} $D\in\mathbb{R}^{m\times p}$? One natural method using convex optimization is to solve the following problem: $$\min_{\theta} {1\over 2}\|\Phi\theta - c\|^2 + \lambda\|D\theta\|_1.$$ This paper provides an upper bound of the estimate error and shows the consistency property of this method by assuming that the design matrix $\Phi$ is a Gaussian random matrix. Specifically, we show 1) in the noiseless case, if the condition number of $D$ is bounded and the measurement number $n\geq \Omega(s\log(p))$ where $s$ is the sparsity number, then the true solution can be recovered with high probability; and 2) in the noisy case, if the condition number of $D$ is bounded and the measurement increases faster than $s\log(p)$, that is, $s\log(p)=o(n)$, the estimate error converges to zero with probability 1 when $p$ and $s$ go to infinity. Our results are consistent with those for the special case $D=\bold{I}_{p\times p}$ (equivalently LASSO) and improve the existing analysis. The condition number of $D$ plays a critical role in our analysis. We consider the condition numbers in two cases including the fused LASSO and the random graph: the condition number in the fused LASSO case is bounded by a constant, while the condition number in the random graph case is bounded with high probability if $m\over p$ (i.e., $#text{edge}\over #text{vertex}$) is larger than a certain constant. Numerical simulations are consistent with our theoretical results.

Sparse learning has recently received increasing attention in many areas including machine learning, statistics, and applied mathematics. The mixed-norm regularization based on the l1q norm with q>1 is attractive in many applications of regression and classification in that it facilitates group sparsity in the model. The resulting optimization problem is, however, challenging to solve due to the inherent structure of the mixed-norm regularization. Existing work deals with special cases with q=1, 2, infinity, and they cannot be easily extended to the general case. In this paper, we propose an efficient algorithm based on the accelerated gradient method for solving the general l1q-regularized problem. One key building block of the proposed algorithm is the l1q-regularized Euclidean projection (EP_1q). Our theoretical analysis reveals the key properties of EP_1q and illustrates why EP_1q for the general q is significantly more challenging to solve than the special cases. Based on our theoretical analysis, we develop an efficient algorithm for EP_1q by solving two zero finding problems. To further improve the efficiency of solving large dimensional mixed-norm regularized problems, we propose a screening method which is able to quickly identify the inactive groups, i.e., groups that have 0 components in the solution. This may lead to substantial reduction in the number of groups to be entered to the optimization. An appealing feature of our screening method is that the data set needs to be scanned only once to run the screening. Compared to that of solving the mixed-norm regularized problems, the computational cost of our screening test is negligible. The key of the proposed screening method is an accurate sensitivity analysis of the dual optimal solution when the regularization parameter varies. Experimental results demonstrate the efficiency of the proposed algorithm.

Sparse feature selection has been demonstrated to be effective in handling high-dimensional data. While promising, most of the existing works use convex methods, which may be suboptimal in terms of the accuracy of feature selection and parameter estimation. In this paper, we expand a nonconvex paradigm to sparse group feature selection, which is motivated by applications that require identifying the underlying group structure and performing feature selection simultaneously. The main contributions of this article are twofold: (1) statistically, we introduce a nonconvex sparse group feature selection model which can reconstruct the oracle estimator. Therefore, consistent feature selection and parameter estimation can be achieved; (2) computationally, we propose an efficient algorithm that is applicable to large-scale problems. Numerical results suggest that the proposed nonconvex method compares favorably against its competitors on synthetic data and real-world applications, thus achieving desired goal of delivering high performance.

Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an $\ell_0$-type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms.

The problem of joint feature selection across a group of related tasks has applications in many areas including biomedical informatics and computer vision. We consider the l2,1-norm regularized regression model for joint feature selection from multiple tasks, which can be derived in the probabilistic framework by assuming a suitable prior from the exponential family. One appealing feature of the l2,1-norm regularization is that it encourages multiple predictors to share similar sparsity patterns. However, the resulting optimization problem is challenging to solve due to the non-smoothness of the l2,1-norm regularization. In this paper, we propose to accelerate the computation by reformulating it as two equivalent smooth convex optimization problems which are then solved via the Nesterov's method-an optimal first-order black-box method for smooth convex optimization. A key building block in solving the reformulations is the Euclidean projection. We show that the Euclidean projection for the first reformulation can be analytically computed, while the Euclidean projection for the second one can be computed in linear time. Empirical evaluations on several data sets verify the efficiency of the proposed algorithms.

Learning a distance function or metric on a given data manifold is of great importance in machine learning and pattern recognition. Many of the previous works first embed the manifold to Euclidean space and then learn the distance function. However, such a scheme might not faithfully preserve the distance function if the original manifold is not Euclidean. Note that the distance function on a manifold can always be well-defined. In this paper, we propose to learn the distance function directly on the manifold without embedding. We first provide a theoretical characterization of the distance function by its gradient field. Based on our theoretical analysis, we propose to first learn the gradient field of the distance function and then learn the distance function itself. Specifically, we set the gradient field of a local distance function as an initial vector field. Then we transport it to the whole manifold via heat flow on vector fields. Finally, the geodesic distance function can be obtained by requiring its gradient field to be close to the normalized vector field. Experimental results on both synthetic and real data demonstrate the effectiveness of our proposed algorithm.

In this paper, we propose a novel framework to analyze the theoretical properties of the learning process for a representative type of domain adaptation, which combines data from multiple sources and one target (or briefly called representative domain adaptation). In particular, we use the integral probability metric to measure the difference between the distributions of two domains and meanwhile compare it with the H-divergence and the discrepancy distance. We develop the Hoeffding-type, the Bennett-type and the McDiarmid-type deviation inequalities for multiple domains respectively, and then present the symmetrization inequality for representative domain adaptation. Next, we use the derived inequalities to obtain the Hoeffding-type and the Bennett-type generalization bounds respectively, both of which are based on the uniform entropy number. Moreover, we present the generalization bounds based on the Rademacher complexity. Finally, we analyze the asymptotic convergence and the rate of convergence of the learning process for representative domain adaptation. We discuss the factors that affect the asymptotic behavior of the learning process and the numerical experiments support our theoretical findings as well. Meanwhile, we give a comparison with the existing results of domain adaptation and the classical results under the same-distribution assumption.

Network embedding is an effective way to solve the network analytics problems such as node classification, link prediction, etc. It represents network elements using low dimensional vectors such that the graph structural information and properties are maximumly preserved. Many prior works focused on embeddings for networks with the same type of edges or vertices, while some works tried to generate embeddings for heterogeneous network using mechanisms like specially designed meta paths. In this paper, we propose two novel algorithms, GHINE (General Heterogeneous Information Network Embedding) and AHINE (Adaptive Heterogeneous Information Network Embedding), to compute distributed representations for elements in heterogeneous networks. Specially, AHINE uses an adaptive deep model to learn network embeddings that maximizes the likelihood of preserving the relationship chains between non-adjacent nodes. We apply our embeddings to a large network of points of interest (POIs) and achieve superior accuracy on some prediction problems on a ride-hailing platform. In addition, we show that AHINE outperforms state-of-the-art methods on a set of learning tasks on public datasets, including node labelling and similarity ranking in bibliographic networks.

Object detection, as of one the most fundamental and challenging problems in computer vision, has received great attention in recent years. Its development in the past two decades can be regarded as an epitome of computer vision history. If we think of today's object detection as a technical aesthetics under the power of deep learning, then turning back the clock 20 years we would witness the wisdom of cold weapon era. This paper extensively reviews 400+ papers of object detection in the light of its technical evolution, spanning over a quarter-century's time (from the 1990s to 2019). A number of topics have been covered in this paper, including the milestone detectors in history, detection datasets, metrics, fundamental building blocks of the detection system, speed up techniques, and the recent state of the art detection methods. This paper also reviews some important detection applications, such as pedestrian detection, face detection, text detection, etc, and makes an in-deep analysis of their challenges as well as technical improvements in recent years.