Models, code, and papers for "Hui Zhang":

The success of deep convolutional neural network (CNN) in computer vision especially image classification problems requests a new information theory for function of image, instead of image itself. In this article, after establishing a deep mathematical connection between image classification problem and quantum spin model, we propose to use entanglement entropy, a generalization of classical Boltzmann-Shannon entropy, as a powerful tool to characterize the information needed for representation of general function of image. We prove that there is a sub-volume-law bound for entanglement entropy of target functions of reasonable image classification problems. Therefore target functions of image classification only occupy a small subspace of the whole Hilbert space. As a result, a neural network with polynomial number of parameters is efficient for representation of such target functions of image. The concept of entanglement entropy can also be useful to characterize the expressive power of different neural networks. For example, we show that to maintain the same expressive power, number of channels $D$ in a convolutional neural network should scale with the number of convolution layers $n_c$ as $D\sim D_0^{\frac{1}{n_c}}$. Therefore, deeper CNN with large $n_c$ is more efficient than shallow ones.

Machine-learning excels in many areas with well-defined goals. However, a clear goal is usually not available in art forms, such as photography. The success of a photograph is measured by its aesthetic value, a very subjective concept. This adds to the challenge for a machine learning approach. We introduce Creatism, a deep-learning system for artistic content creation. In our system, we break down aesthetics into multiple aspects, each can be learned individually from a shared dataset of professional examples. Each aspect corresponds to an image operation that can be optimized efficiently. A novel editing tool, dramatic mask, is introduced as one operation that improves dramatic lighting for a photo. Our training does not require a dataset with before/after image pairs, or any additional labels to indicate different aspects in aesthetics. Using our system, we mimic the workflow of a landscape photographer, from framing for the best composition to carrying out various post-processing operations. The environment for our virtual photographer is simulated by a collection of panorama images from Google Street View. We design a "Turing-test"-like experiment to objectively measure quality of its creations, where professional photographers rate a mixture of photographs from different sources blindly. Experiments show that a portion of our robot's creation can be confused with professional work.

In this paper, we propose a novel model for high-dimensional data, called the Hybrid Orthogonal Projection and Estimation (HOPE) model, which combines a linear orthogonal projection and a finite mixture model under a unified generative modeling framework. The HOPE model itself can be learned unsupervised from unlabelled data based on the maximum likelihood estimation as well as discriminatively from labelled data. More interestingly, we have shown the proposed HOPE models are closely related to neural networks (NNs) in a sense that each hidden layer can be reformulated as a HOPE model. As a result, the HOPE framework can be used as a novel tool to probe why and how NNs work, more importantly, to learn NNs in either supervised or unsupervised ways. In this work, we have investigated the HOPE framework to learn NNs for several standard tasks, including image recognition on MNIST and speech recognition on TIMIT. Experimental results have shown that the HOPE framework yields significant performance gains over the current state-of-the-art methods in various types of NN learning problems, including unsupervised feature learning, supervised or semi-supervised learning.

In this report, a novel variation of Particle Swarm Optimization (PSO) algorithm, called Multiagent Coordination Optimization (MCO), is implemented in a parallel computing way for practical use by introducing MATLAB built-in function "parfor" into MCO. Then we rigorously analyze the global convergence of MCO by means of semistability theory. Besides sharing global optimal solutions with the PSO algorithm, the MCO algorithm integrates cooperative swarm behavior of multiple agents into the update formula by sharing velocity and position information between neighbors to improve its performance. Numerical evaluation of the parallel MCO algorithm is provided in the report by running the proposed algorithm on supercomputers in the High Performance Computing Center at Texas Tech University. In particular, the optimal value and consuming time are compared with PSO and serial MCO by solving several benchmark functions in the literature, respectively. Based on the simulation results, the performance of the parallel MCO is not only superb compared with PSO for solving many nonlinear, noncovex optimization problems, but also is of high efficiency by saving the computational time.

This sequential technical report extends some of the previous results we posted at arXiv:1306.0225.

Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifier and SVMs. The widely used term weighting scheme in text categorization, i.e., tf.idf, is originated from information retrieval (IR) field. The intuition behind idf for text categorization seems less reasonable than IR. In this paper, we introduce inverse category frequency (icf) into term weighting scheme and propose two novel approaches, i.e., tf.icf and icf-based supervised term weighting schemes. The tf.icf adopts icf to substitute idf factor and favors terms occurring in fewer categories, rather than fewer documents. And the icf-based approach combines icf and relevance frequency (rf) to weight terms in a supervised way. Our cross-classifier and cross-corpus experiments have shown that our proposed approaches are superior or comparable to six supervised term weighting schemes and three traditional schemes in terms of macro-F1 and micro-F1.

This paper develops a general theoretical framework to analyze structured sparse recovery problems using the notation of dual certificate. Although certain aspects of the dual certificate idea have already been used in some previous work, due to the lack of a general and coherent theory, the analysis has so far only been carried out in limited scopes for specific problems. In this context the current paper makes two contributions. First, we introduce a general definition of dual certificate, which we then use to develop a unified theory of sparse recovery analysis for convex programming. Second, we present a class of structured sparsity regularization called structured Lasso for which calculations can be readily performed under our theoretical framework. This new theory includes many seemingly loosely related previous work as special cases; it also implies new results that improve existing ones even for standard formulations such as L1 regularization.

Concave regularization methods provide natural procedures for sparse recovery. However, they are difficult to analyze in the high dimensional setting. Only recently a few sparse recovery results have been established for some specific local solutions obtained via specialized numerical procedures. Still, the fundamental relationship between these solutions such as whether they are identical or their relationship to the global minimizer of the underlying nonconvex formulation is unknown. The current paper fills this conceptual gap by presenting a general theoretical framework showing that under appropriate conditions, the global solution of nonconvex regularization leads to desirable recovery performance; moreover, under suitable conditions, the global solution corresponds to the unique sparse local solution, which can be obtained via different numerical procedures. Under this unified framework, we present an overview of existing results and discuss their connections. The unified view of this work leads to a more satisfactory treatment of concave high dimensional sparse estimation procedures, and serves as guideline for developing further numerical procedures for concave regularization.

We develop some theoretical results for a robust similarity measure named "generalized min-max" (GMM). This similarity has direct applications in machine learning as a positive definite kernel and can be efficiently computed via probabilistic hashing. Owing to the discrete nature, the hashed values can also be used for efficient near neighbor search. We prove the theoretical limit of GMM and the consistency result, assuming that the data follow an elliptical distribution, which is a very general family of distributions and includes the multivariate $t$-distribution as a special case. The consistency result holds as long as the data have bounded first moment (an assumption which essentially holds for datasets commonly encountered in practice). Furthermore, we establish the asymptotic normality of GMM. Compared to the "cosine" similarity which is routinely adopted in current practice in statistics and machine learning, the consistency of GMM requires much weaker conditions. Interestingly, when the data follow the $t$-distribution with $\nu$ degrees of freedom, GMM typically provides a better measure of similarity than "cosine" roughly when $\nu<8$ (which is already very close to normal). These theoretical results will help explain the recent success of GMM in learning tasks.

In this paper, we investigate the sample size requirement for a general class of nuclear norm minimization methods for higher order tensor completion. We introduce a class of tensor norms by allowing for different levels of coherence, which allows us to leverage the incoherence of a tensor. In particular, we show that a $k$th order tensor of rank $r$ and dimension $d\times\cdots\times d$ can be recovered perfectly from as few as $O((r^{(k-1)/2}d^{3/2}+r^{k-1}d)(\log(d))^2)$ uniformly sampled entries through an appropriate incoherent nuclear norm minimization. Our results demonstrate some key differences between completing a matrix and a higher order tensor: They not only point to potential room for improvement over the usual nuclear norm minimization but also highlight the importance of explicitly accounting for incoherence, when dealing with higher order tensors.

Sparse learning is an important topic in many areas such as machine learning, statistical estimation, signal processing, etc. Recently, there emerges a growing interest on structured sparse learning. In this paper we focus on the $\ell_q$-analysis optimization problem for structured sparse learning ($0< q \leq 1$). Compared to previous work, we establish weaker conditions for exact recovery in noiseless case and a tighter non-asymptotic upper bound of estimate error in noisy case. We further prove that the nonconvex $\ell_q$-analysis optimization can do recovery with a lower sample complexity and in a wider range of cosparsity than its convex counterpart. In addition, we develop an iteratively reweighted method to solve the optimization problem under the variational framework. Theoretical analysis shows that our method is capable of pursuing a local minima close to the global minima. Also, empirical results of preliminary computational experiments illustrate that our nonconvex method outperforms both its convex counterpart and other state-of-the-art methods.

We study the use of very sparse random projections for compressed sensing (sparse signal recovery) when the signal entries can be either positive or negative. In our setting, the entries of a Gaussian design matrix are randomly sparsified so that only a very small fraction of the entries are nonzero. Our proposed decoding algorithm is simple and efficient in that the major cost is one linear scan of the coordinates. We have developed two estimators: (i) the {\em tie estimator}, and (ii) the {\em absolute minimum estimator}. Using only the tie estimator, we are able to recover a $K$-sparse signal of length $N$ using $1.551 eK \log K/\delta$ measurements (where $\delta\leq 0.05$ is the confidence). Using only the absolute minimum estimator, we can detect the support of the signal using $eK\log N/\delta$ measurements. For a particular coordinate, the absolute minimum estimator requires fewer measurements (i.e., with a constant $e$ instead of $1.551e$). Thus, the two estimators can be combined to form an even more practical decoding framework. Prior studies have shown that existing one-scan (or roughly one-scan) recovery algorithms using sparse matrices would require substantially more (e.g., one order of magnitude) measurements than L1 decoding by linear programming, when the nonzero entries of signals can be either negative or positive. In this paper, following a known experimental setup, we show that, at the same number of measurements, the recovery accuracies of our proposed method are (at least) similar to the standard L1 decoding.

Many problems can be formulated as recovering a low-rank tensor. Although an increasingly common task, tensor recovery remains a challenging problem because of the delicacy associated with the decomposition of higher order tensors. To overcome these difficulties, existing approaches often proceed by unfolding tensors into matrices and then apply techniques for matrix completion. We show here that such matricization fails to exploit the tensor structure and may lead to suboptimal procedure. More specifically, we investigate a convex optimization approach to tensor completion by directly minimizing a tensor nuclear norm and prove that this leads to an improved sample size requirement. To establish our results, we develop a series of algebraic and probabilistic techniques such as characterization of subdifferetial for tensor nuclear norm and concentration inequalities for tensor martingales, which may be of independent interests and could be useful in other tensor related problems.

We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other $\ell_1$ regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the $\ell_1$ and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso.

Many applications concern sparse signals, for example, detecting anomalies from the differences between consecutive images taken by surveillance cameras. This paper focuses on the problem of recovering a K-sparse signal x in N dimensions. In the mainstream framework of compressed sensing (CS), the vector x is recovered from M non-adaptive linear measurements y = xS, where S (of size N x M) is typically a Gaussian (or Gaussian-like) design matrix, through some optimization procedure such as linear programming (LP). In our proposed method, the design matrix S is generated from an $\alpha$-stable distribution with $\alpha\approx 0$. Our decoding algorithm mainly requires one linear scan of the coordinates, followed by a few iterations on a small number of coordinates which are "undetermined" in the previous iteration. Comparisons with two strong baselines, linear programming (LP) and orthogonal matching pursuit (OMP), demonstrate that our algorithm can be significantly faster in decoding speed and more accurate in recovery quality, for the task of exact spare recovery. Our procedure is robust against measurement noise. Even when there are no sufficient measurements, our algorithm can still reliably recover a significant portion of the nonzero coordinates. To provide the intuition for understanding our method, we also analyze the procedure by assuming an idealistic setting. Interestingly, when K=2, the "idealized" algorithm achieves exact recovery with merely 3 measurements, regardless of N. For general K, the required sample size of the "idealized" algorithm is about 5K.

This paper concerns the problem of matrix completion, which is to estimate a matrix from observations in a small subset of indices. We propose a calibrated spectrum elastic net method with a sum of the nuclear and Frobenius penalties and develop an iterative algorithm to solve the convex minimization problem. The iterative algorithm alternates between imputing the missing entries in the incomplete matrix by the current guess and estimating the matrix by a scaled soft-thresholding singular value decomposition of the imputed matrix until the resulting matrix converges. A calibration step follows to correct the bias caused by the Frobenius penalty. Under proper coherence conditions and for suitable penalties levels, we prove that the proposed estimator achieves an error bound of nearly optimal order and in proportion to the noise level. This provides a unified analysis of the noisy and noiseless matrix completion problems. Simulation results are presented to compare our proposal with previous ones.

Scaled sparse linear regression jointly estimates the regression coefficients and noise level in a linear model. It chooses an equilibrium with a sparse regression method by iteratively estimating the noise level via the mean residual square and scaling the penalty in proportion to the estimated noise level. The iterative algorithm costs little beyond the computation of a path or grid of the sparse regression estimator for penalty levels above a proper threshold. For the scaled lasso, the algorithm is a gradient descent in a convex minimization of a penalized joint loss function for the regression coefficients and noise level. Under mild regularity conditions, we prove that the scaled lasso simultaneously yields an estimator for the noise level and an estimated coefficient vector satisfying certain oracle inequalities for prediction, the estimation of the noise level and the regression coefficients. These inequalities provide sufficient conditions for the consistency and asymptotic normality of the noise level estimator, including certain cases where the number of variables is of greater order than the sample size. Parallel results are provided for the least squares estimation after model selection by the scaled lasso. Numerical results demonstrate the superior performance of the proposed methods over an earlier proposal of joint convex minimization.

The $\ell_1$-penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted $\ell_1$-penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted $\ell_1$-penalized estimator in sparse, high-dimensional settings where the number of predictors $p$ can be much larger than the sample size $n$. Adaptive Lasso is considered as a special case. A multistage method is developed to apply an adaptive Lasso recursively. We provide $\ell_q$ oracle inequalities, a general selection consistency theorem, and an upper bound on the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results.

The Long Short-Term Memory (LSTM) neural network based data association algorithm named as DeepDA for multi-target tracking in clutters is proposed to deal with the NP-hard combinatorial optimization problem in this paper. Different from the classical data association methods involving complex models and accurate prior knowledge on clutter density, filter covariance or associated gating etc, data-driven deep learning methods have been extensively researched for this topic. Firstly, data association mathematical problem for multitarget tracking on unknown target number, missed detection and clutter, which is beyond one-to-one mapping between observations and targets is redefined formally. Subsequently, an LSTM network is designed to learn the measurement-to-track association probability from radar noisy measurements and exist tracks. Moreover, an LSTM-based data-driven deep neural network after a supervised training through the BPTT and RMSprop optimization method can get the association probability directly. Experimental results on simulated data show a significant performance on association ratio, target ID switching and time-consuming for tracking multiple targets even they are crossing each other in the complicated clutter environment.

This paper studies schemes to de-bias the Lasso in sparse linear regression where the goal is to estimate and construct confidence intervals for a low-dimensional projection of the unknown coefficient vector in a preconceived direction $a_0$. We assume that the design matrix has iid Gaussian rows with known covariance matrix $\Sigma$. Our analysis reveals that previous propositions to de-bias the Lasso require a modification in order to enjoy asymptotic efficiency in a full range of the level of sparsity. This modification takes the form of a degrees-of-freedom adjustment that accounts for the dimension of the model selected by the Lasso. Let $s_0$ denote the number of nonzero coefficients of the true coefficient vector. The unadjusted de-biasing schemes proposed in previous studies enjoys efficiency if $s_0\lll n^{2/3}$, up to logarithmic factors. However, if $s_0\ggg n^{2/3}$, the unadjusted scheme cannot be efficient in certain directions $a_0$. In the latter regime, it it necessary to modify existing procedures by an adjustment that accounts for the degrees-of-freedom of the Lasso. The proposed degrees-of-freedom adjustment grants asymptotic efficiency for any direction $a_0$. This holds under a Sparse Riecz Condition on the covariance matrix $\Sigma$ and the sample size requirement $s_0/p\to0$ and $s_0\log(p/s_0)/n\to0$. Our analysis also highlights that the degrees-of-freedom adjustment is not necessary when the initial bias of the Lasso in the direction $a_0$ is small, which is granted under more stringent conditions on $\Sigma^{-1}$. This explains why the necessity of degrees-of-freedom adjustment did not appear in some previous studies. The main proof argument involves a Gaussian interpolation path similar to that used to derive Slepian's lemma. It yields a sharp $\ell_\infty$ error bound for the Lasso under Gaussian design which is of independent interest.