Models, code, and papers for "D":

A Selection of Giant Radio Sources from NVSS

May 10, 2016
D. D. Proctor

Results of the application of pattern recognition techniques to the problem of identifying Giant Radio Sources (GRS) from the data in the NVSS catalog are presented and issues affecting the process are explored. Decision-tree pattern recognition software was applied to training set source pairs developed from known NVSS large angular size radio galaxies. The full training set consisted of 51,195 source pairs, 48 of which were known GRS for which each lobe was primarily represented by a single catalog component. The source pairs had a maximum separation of 20 arc minutes and a minimum component area of 1.87 square arc minutes at the 1.4 mJy level. The importance of comparing resulting probability distributions of the training and application sets for cases of unknown class ratio is demonstrated. The probability of correctly ranking a randomly selected (GRS, non-GRS) pair from the best of the tested classifiers was determined to be 97.8 +/- 1.5%. The best classifiers were applied to the over 870,000 candidate pairs from the entire catalog. Images of higher ranked sources were visually screened and a table of over sixteen hundred candidates, including morphological annotation, is presented. These systems include doubles and triples, Wide-Angle Tail (WAT) and Narrow-Angle Tail (NAT), S- or Z-shaped systems, and core-jets and resolved cores. While some resolved lobe systems are recovered with this technique, generally it is expected that such systems would require a different approach.
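The quoted "probability of correctly ranking a randomly selected (GRS, non-GRS) pair" is the quantity usually reported as the area under the ROC curve. A minimal sketch of how such a figure can be computed from classifier scores follows; the function name and the scores are hypothetical and are not taken from the paper.

```python
from itertools import product


def pairwise_ranking_probability(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a correct ranking, which makes the result equal to
    the area under the ROC curve.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each class")
    correct = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            correct += 1.0
        elif p == n:
            correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores (illustration only, not data from the paper).
grs_scores = [0.9, 0.8, 0.4]
non_grs_scores = [0.7, 0.3, 0.2, 0.1]
auc = pairwise_ranking_probability(grs_scores, non_grs_scores)
print(auc)
```

With only 48 known GRS among 51,195 training pairs, a ranking measure of this kind is less sensitive to the class ratio than plain accuracy, which may be why the abstract reports it.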

* 20 pages of text, 6 figures, 22 pages of tables, total 55 pages. The stub for Table 6 is followed by the complete machine readable file. To be published in The Astrophysical Journal Supplement. Revision 1: Corrected typos, references updated/corrected, addition to acknowledgments. Five candidates identified as SNR (thanks to D. A. Green).

  Access Model/Code and Paper
Fine-tuning the Ant Colony System algorithm through Particle Swarm Optimization

Mar 21, 2018
D Gómez-Cabrero, D. N. Ranasinghe

Ant Colony System (ACS) is a distributed (agent-based) algorithm which has been widely studied on the Symmetric Travelling Salesman Problem (TSP). The optimum parameters for this algorithm have to be found by trial and error. We use a Particle Swarm Optimization algorithm (PSO) to optimize the ACS parameters working in a designed subset of TSP instances. The first goal is to run the hybrid PSO-ACS algorithm on a single instance to find the optimum parameters and optimum solutions for that instance. The second goal is to analyze those sets of optimum parameters in relation to instance characteristics. Computational results have shown good-quality solutions for single instances, though with high computational times, and that there may be sets of parameters that work optimally for a majority of instances.
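The idea of letting PSO search a parameter space can be sketched as follows. This is a generic textbook PSO loop, not the authors' implementation; `placeholder_cost` is a hypothetical stand-in for "run ACS with these parameters and return the tour length", and the inertia and acceleration constants are assumed values.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box using a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # personal best positions
    best_val = [objective(p) for p in pos]    # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # global best so far

    for _ in range(iters):
        for i in range(n_particles):
            for k in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][k] = (w * vel[i][k]
                             + c1 * r1 * (best_pos[i][k] - pos[i][k])
                             + c2 * r2 * (g_pos[k] - pos[i][k]))
                lo, hi = bounds[k]
                # Clamp each coordinate to its allowed range.
                pos[i][k] = min(hi, max(lo, pos[i][k] + vel[i][k]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


def placeholder_cost(params):
    # Hypothetical stand-in for an ACS run; the real objective would be
    # the tour length obtained with these parameter values.
    param_a, param_b = params
    return (param_a - 1.0) ** 2 + (param_b - 2.0) ** 2


best_params, best_cost = pso_minimize(placeholder_cost, [(0.0, 5.0), (0.0, 5.0)])
print(best_params, best_cost)
```

Because every objective evaluation in the real setting is a full ACS run, the number of particles and iterations directly drives the high computational times the abstract mentions.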

* 2006 paper. Presented at a conference. Technical report in "Universitat de Valencia"

  Access Model/Code and Paper
Nonlinear tensor product approximation of functions

Sep 04, 2014
D. Bazarkhanov, V. Temlyakov

We are interested in approximation of a multivariate function $f(x_1,\dots,x_d)$ by linear combinations of products $u^1(x_1)\cdots u^d(x_d)$ of univariate functions $u^i(x_i)$, $i=1,\dots,d$. In the case $d=2$ it is a classical problem of bilinear approximation. In the case of approximation in the $L_2$ space the bilinear approximation problem is closely related to the problem of singular value decomposition (also called Schmidt expansion) of the corresponding integral operator with the kernel $f(x_1,x_2)$. There are known results on the rate of decay of errors of best bilinear approximation in $L_p$ under different smoothness assumptions on $f$. The problem of multilinear approximation (nonlinear tensor product approximation) in the case $d\ge 3$ is more difficult and much less studied than the bilinear approximation problem. We will present results on best multilinear approximation in $L_p$ under mixed smoothness assumption on $f$.

  Access Model/Code and Paper
P-model Alternative to the T-model

Jul 08, 2001
Mark D. Roberts

Standard linguistic analysis of syntax uses the T-model. This model requires the ordering: D-structure $>$ S-structure $>$ LF. Between each of these representations there is movement which alters the order of the constituent words; movement is achieved using the principles and parameters of syntactic theory. Psychological serial models do not accommodate the T-model immediately so that here a new model called the P-model is introduced. Here it is argued that the LF representation should be replaced by a variant of Frege's three qualities. In the F-representation the order of elements is not necessarily the same as that in LF and it is suggested that the correct ordering is: F-representation $>$ D-structure $>$ S-structure. Within this framework movement originates as the outcome of emphasis applied to the sentence.

* 28 pages, 73262 bytes, six eps diagrams, 53 references, background to this work is described: 

  Access Model/Code and Paper
Analogy perception applied to seven tests of word comprehension

Jul 22, 2011
Peter D. Turney

It has been argued that analogy is the core of cognition. In AI research, algorithms for analogy are often limited by the need for hand-coded high-level representations as input. An alternative approach is to use high-level perception, in which high-level representations are automatically generated from raw data. Analogy perception is the process of recognizing analogies using high-level perception. We present PairClass, an algorithm for analogy perception that recognizes lexical proportional analogies using representations that are automatically generated from a large corpus of raw textual data. A proportional analogy is an analogy of the form A:B::C:D, meaning "A is to B as C is to D". A lexical proportional analogy is a proportional analogy with words, such as carpenter:wood::mason:stone. PairClass represents the semantic relations between two words using a high-dimensional feature vector, in which the elements are based on frequencies of patterns in the corpus. PairClass recognizes analogies by applying standard supervised machine learning techniques to the feature vectors. We show how seven different tests of word comprehension can be framed as problems of analogy perception and we then apply PairClass to the seven resulting sets of analogy perception problems. We achieve competitive results on all seven tests. This is the first time a uniform approach has handled such a range of tests of word comprehension.

* Journal of Experimental & Theoretical Artificial Intelligence (JETAI), 2011, Volume 23, Issue 3, pages 343-362 
* related work available at 

  Access Model/Code and Paper
Numerical Modeling of Coexistence, Competition and Collapse of Rotating Spiral Waves in Three-Level Excitable Media with Discrete Active Centers and Absorbing Boundaries

Feb 15, 2006
S. D. Makovetskiy

Spatio-temporal dynamics of excitable media with discrete three-level active centers (ACs) and absorbing boundaries is studied numerically by means of a deterministic three-level model (see S. D. Makovetskiy and D. N. Makovetskii, on-line preprint cond-mat/0410460), which is a generalization of the Zykov-Mikhailov model (see Sov. Phys. -- Doklady, 1986, Vol.31, No.1, P.51) for the case of two-channel diffusion of excitations. In particular, we revealed some qualitatively new features of coexistence, competition and collapse of rotating spiral waves (RSWs) in three-level excitable media under conditions of strong influence of the second channel of diffusion. Some of these features are caused by an unusual mechanism of RSW evolution when RSW cores get into the surface layer of an active medium (i.e., the layer of ACs residing at the absorbing boundary). Instead of the well-known scenario of RSW collapse, which takes place after collision of the RSW core with an absorbing boundary, we observed complicated transformations of the core leading to nonlinear "reflection" of the RSW from the boundary or even to the birth of several new RSWs in the surface layer. To our knowledge, such nonlinear "reflections" of RSWs and the resulting die-hard vorticity in excitable media with absorbing boundaries were previously unknown. ACM classes: F.1.1, I.6, J.2; PACS numbers: 05.65.+b, 07.05.Tp, 82.20.Wt

* 12 pages (LaTeX2e file) and 3 figures (separate PNG-files) 

  Access Model/Code and Paper
DeepVesselNet: Vessel Segmentation, Centerline Prediction, and Bifurcation Detection in 3-D Angiographic Volumes

Mar 25, 2018
Giles Tetteh, Velizar Efremov, Nils D. Forkert, Matthias Schneider, Jan Kirschke, Bruno Weber, Claus Zimmer, Marie Piraud, Bjoern H. Menze

We present DeepVesselNet, an architecture tailored to the challenges to be addressed when extracting vessel networks and corresponding features in 3-D angiography using deep learning. We discuss the problems of low execution speed and high memory requirements associated with full 3-D convolutional networks, high class imbalance arising from the low percentage (less than 3%) of vessel voxels, and unavailability of accurately annotated training data, and offer solutions that are the building blocks of DeepVesselNet. First, we formulate 2-D orthogonal cross-hair filters which make use of 3-D context information. Second, we introduce a class balancing cross-entropy score with false positive rate correction to handle the high class imbalance and high false positive rate problems associated with existing loss functions. Finally, we generate a synthetic dataset using a computational angiogenesis model, capable of generating vascular networks under physiological constraints on local network structure and topology, and use these data for transfer learning. DeepVesselNet is optimized for segmenting vessels, predicting centerlines, and localizing bifurcations. We test the performance on a range of angiographic volumes including clinical Time-of-Flight MRA data of the human brain, as well as synchrotron radiation X-ray tomographic microscopy scans of the rat brain. Our experiments show that, by replacing 3-D filters with 2-D orthogonal cross-hair filters in our network, speed is improved by 23% while accuracy is maintained. Our class balancing metric is crucial for training the network, and pre-training with synthetic data helps in early convergence of the training process.

  Access Model/Code and Paper
Almost Optimal Tensor Sketch

Sep 03, 2019
Thomas D. Ahle, Jakob B. T. Knudsen

We construct a matrix $M\in R^{m\times d^c}$ with just $m=O(c\,\lambda\,\varepsilon^{-2}\text{poly}\log1/\varepsilon\delta)$ rows, which preserves the norm $\|Mx\|_2=(1\pm\varepsilon)\|x\|_2$ of all $x$ in any given $\lambda$-dimensional subspace of $R^d$ with probability at least $1-\delta$. This matrix can be applied to tensors $x^{(1)}\otimes\dots\otimes x^{(c)}\in R^{d^c}$ in $O(c\, m \min\{d,m\})$ time -- hence the name "Tensor Sketch". (Here $x\otimes y = \text{asvec}(xy^T) = [x_1y_1, x_1y_2,\dots,x_1y_m,x_2y_1,\dots,x_ny_m]\in R^{nm}$.) This improves upon earlier Tensor Sketch constructions by Pagh and Pham [TOCT 2013, SIGKDD 2013] and Avron et al. [NIPS 2014] which require $m=\Omega(3^c\lambda^2\delta^{-1})$ rows for the same guarantees. The factors of $\lambda$, $\varepsilon^{-2}$ and $\log1/\delta$ can all be shown to be necessary, making our sketch optimal up to log factors. With another construction we get $\lambda$ times more rows $m=\tilde O(c\,\lambda^2\,\varepsilon^{-2}(\log1/\delta)^3)$, but the matrix can be applied to any vector $x^{(1)}\otimes\dots\otimes x^{(c)}\in R^{d^c}$ in just $\tilde O(c\, (d+m))$ time. This matches the application time of Tensor Sketch while still improving the exponential dependencies in $c$ and $\log1/\delta$. Technically, we show two main lemmas: (1) For many Johnson-Lindenstrauss (JL) constructions, if $Q,Q'\in R^{m\times d}$ are independent JL matrices, the element-wise product $Qx \circ Q'y$ equals $M(x\otimes y)$ for some $M\in R^{m\times d^2}$ which is itself a JL matrix. (2) If $M^{(i)}\in R^{m\times md}$ are independent JL matrices, then $M^{(1)}(x \otimes (M^{(2)}y \otimes \dots)) = M(x\otimes y\otimes \dots)$ for some $M\in R^{m\times d^c}$ which is itself a JL matrix. Combining these two results gives an efficient sketch for tensors of any size.
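The algebraic part of lemma (1) can be checked directly: the element-wise product $Qx \circ Q'y$ equals $M(x\otimes y)$ when row $i$ of $M$ is the Kronecker product of row $i$ of $Q$ with row $i$ of $Q'$. The sketch below verifies only this identity; random sign matrices without scaling are an assumed stand-in for JL matrices, and the claim that $M$ is itself a JL matrix is not tested here.

```python
import random


def kron(u, v):
    """Flattened outer product [u_1 v_1, u_1 v_2, ..., u_n v_m]."""
    return [a * b for a in u for b in v]


def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


rng = random.Random(0)
m, d = 4, 3

# Unscaled random sign matrices (an assumption made for exact integer arithmetic).
Q = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(m)]
Q_prime = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(m)]
x = [rng.randint(-5, 5) for _ in range(d)]
y = [rng.randint(-5, 5) for _ in range(d)]

# Left-hand side: sketch x and y separately (m x d work each), then multiply entrywise.
lhs = [a * b for a, b in zip(matvec(Q, x), matvec(Q_prime, y))]

# Right-hand side: build the implied m x d^2 matrix M row by row and apply it to x (x) y.
M = [kron(q_row, qp_row) for q_row, qp_row in zip(Q, Q_prime)]
rhs = matvec(M, kron(x, y))

print(lhs == rhs)
```

The left-hand side never forms the $d^2$-dimensional tensor, which is the source of the speed-up when the input is given in factored form.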

  Access Model/Code and Paper
Clustering Higher Order Data: Finite Mixtures of Multidimensional Arrays

Jul 19, 2019
Peter A. Tait, Paul D. McNicholas

An approach for clustering multi-way data is introduced based on a finite mixture of multidimensional arrays. Attention to the use of multidimensional arrays for clustering has thus far been limited to two-dimensional arrays, i.e., matrices or order-two tensors. Accordingly, this is the first paper to develop an approach for clustering d-dimensional arrays for d>2 or, in other words, for clustering using order-d tensors.

  Access Model/Code and Paper
Dispersion of Mobile Robots in the Global Communication Model

Sep 04, 2019
Ajay D. Kshemkalyani, Anisur Rahaman Molla, Gokarna Sharma

The dispersion problem on graphs asks $k\leq n$ robots placed initially arbitrarily on the nodes of an $n$-node anonymous graph to reposition autonomously to reach a configuration in which each robot is on a distinct node of the graph. This problem is of significant interest due to its relationship to other fundamental robot coordination problems, such as exploration, scattering, and load balancing. In this paper, we consider dispersion in the global communication model, where a robot can communicate with any other robot in the graph (but the graph is unknown to robots). We provide three novel deterministic algorithms, two for arbitrary graphs and one for arbitrary trees, in a synchronous setting where all robots perform their actions in every time step. For arbitrary graphs, our first algorithm is based on a DFS traversal and guarantees $O(\min(m,k\Delta))$ steps runtime using $\Theta(\log (\max(k,\Delta)))$ bits at each robot, where $m$ is the number of edges and $\Delta$ is the maximum degree of the graph. The second algorithm for arbitrary graphs is based on a BFS traversal and guarantees $O( \max(D,k) \Delta (D+\Delta))$ steps runtime using $O(\max(D,\Delta \log k))$ bits at each robot, where $D$ is the diameter of the graph. The algorithm for arbitrary trees is also based on a BFS traversal and guarantees $O(D\max(D,k))$ steps runtime using $O(\max(D,\Delta \log k))$ bits at each robot. Our results are significant improvements compared to the existing results established in the local communication model, where a robot can communicate only with other robots present at the same node. In particular, the DFS-based algorithm is optimal for both memory and time in constant-degree arbitrary graphs. The BFS-based algorithm for arbitrary trees is optimal with respect to runtime when $k\leq O(D)$.

* 13 pages 

  Access Model/Code and Paper
Assessment of Amazon Comprehend Medical: Medication Information Extraction

Feb 02, 2020
Benedict Guzman, MS, Isabel Metzger, MS, Yindalon Aphinyanaphongs, M. D., Ph. D., Himanshu Grover, Ph. D

On November 27, 2018, Amazon Web Services (AWS) released Amazon Comprehend Medical (ACM), a deep learning based system that automatically extracts clinical concepts (which include anatomy, medical conditions, protected health information (PHI), test names, treatment names, medical procedures, and medications) from clinical text notes. Uptake of and trust in any new data product rely on independent validation across benchmark datasets and tools to establish and confirm the expected quality of results. This work focuses on the medication extraction task; in particular, ACM was evaluated using the official test sets from the 2009 i2b2 Medication Extraction Challenge and 2018 n2c2 Track 2: Adverse Drug Events and Medication Extraction in EHRs. Overall, ACM achieved F-scores of 0.768 and 0.828. These scores ranked the lowest when compared to the three best systems in the respective challenges. To further establish the generalizability of its medication extraction performance, a set of random internal clinical text notes from NYU Langone Medical Center was also included in this work. On this corpus, ACM achieved an F-score of 0.753.
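For readers unfamiliar with the metric, the sketch below computes a balanced F-score (F1) from extraction counts. It assumes the challenges report the balanced variant; the counts are hypothetical and are not taken from the paper.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score (F1): the harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (illustration only): 80 correct extractions,
# 20 spurious ones, and 30 missed medications.
score = f_score(true_positives=80, false_positives=20, false_negatives=30)
print(round(score, 3))
```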

  Access Model/Code and Paper
Wearable-based Medication State Detection in Individuals with Parkinson's Disease

Sep 19, 2018
Murtadha D. Hssayeni, Michelle A. Burack, M. D., Joohi Jimenez-Shahed, M. D., Behnaz Ghoraani, Ph. D

One of the most prevalent complaints of individuals with mid-stage and advanced Parkinson's disease (PD) is the fluctuating response to their medication (i.e., ON state with maximum benefit from medication and OFF state with no benefit from medication). In order to address these motor fluctuations, the patients go through periodic clinical examinations in which the treating physician reviews the patients' self-reports about the duration of different medication states and optimizes therapy accordingly. Unfortunately, the patients' self-reports can be unreliable and suffer from recall bias. There is a need for a technology-based system that can provide objective measures of the duration of different medication states, which the treating physician can use to adjust the therapy successfully. In this paper, we developed a medication state detection algorithm to detect medication states using two wearable motion sensors. A series of significant features are extracted from the motion data and used in a classifier that is based on a support vector machine with fuzzy labeling. The developed algorithm is evaluated using a dataset with 19 PD subjects and a total duration of 1,052.24 minutes (17.54 hours). The algorithm resulted in an average classification accuracy of 90.5%, sensitivity of 94.2%, and specificity of 85.4%.

  Access Model/Code and Paper
Statistical Guarantees for Estimating the Centers of a Two-component Gaussian Mixture by EM

Aug 07, 2016
Jason M. Klusowski, W. D. Brinda

Recently, a general method for analyzing the statistical accuracy of the EM algorithm has been developed and applied to some simple latent variable models [Balakrishnan et al. 2016]. In that method, the basin of attraction for valid initialization is required to be a ball around the truth. Using Stein's Lemma, we extend these results in the case of estimating the centers of a two-component Gaussian mixture in $d$ dimensions. In particular, we significantly expand the basin of attraction to be the intersection of a half space and a ball around the origin. If the signal-to-noise ratio is at least a constant multiple of $ \sqrt{d\log d} $, we show that a random initialization strategy is feasible.
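To make the object of study concrete, here is a minimal one-dimensional sketch of the EM iteration for a balanced, symmetric mixture $\frac12 N(\theta,\sigma^2)+\frac12 N(-\theta,\sigma^2)$ with known $\sigma$. The symmetric model, the sample size, and the starting point are assumptions made for illustration; the paper treats the $d$-dimensional case and its guarantees are not reproduced here.

```python
import math
import random


def em_symmetric_mixture(xs, theta0, sigma=1.0, iters=50):
    """EM for the mixture 0.5*N(theta, sigma^2) + 0.5*N(-theta, sigma^2).

    With posterior weight w_i for the +theta component, 2*w_i - 1 equals
    tanh(theta * x_i / sigma^2), so the E- and M-steps collapse into the
    single update theta <- mean(tanh(theta * x_i / sigma^2) * x_i).
    """
    theta = theta0
    for _ in range(iters):
        theta = sum(math.tanh(theta * x / sigma ** 2) * x for x in xs) / len(xs)
    return theta


# Synthetic data (illustration only): true center 2.0, unit noise.
rng = random.Random(0)
true_theta = 2.0
xs = [rng.choice((-1, 1)) * true_theta + rng.gauss(0.0, 1.0) for _ in range(2000)]

theta_hat = em_symmetric_mixture(xs, theta0=0.5)
print(theta_hat)
```

In one dimension any positive start converges toward the positive center; the question the abstract addresses is how large the set of good starting points is in $d$ dimensions.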

  Access Model/Code and Paper
An inexact subsampled proximal Newton-type method for large-scale machine learning

Aug 28, 2017
Xuanqing Liu, Cho-Jui Hsieh, Jason D. Lee, Yuekai Sun

We propose a fast proximal Newton-type algorithm for minimizing regularized finite sums that returns an $\epsilon$-suboptimal point in $\tilde{\mathcal{O}}(d(n + \sqrt{\kappa d})\log(\frac{1}{\epsilon}))$ FLOPS, where $n$ is the number of samples, $d$ is the feature dimension, and $\kappa$ is the condition number. As long as $n > d$, the proposed method is more efficient than state-of-the-art accelerated stochastic first-order methods for non-smooth regularizers, which require $\tilde{\mathcal{O}}(d(n + \sqrt{\kappa n})\log(\frac{1}{\epsilon}))$ FLOPS. The key idea is to form the subsampled Newton subproblem in a way that preserves the finite sum structure of the objective, thereby allowing us to leverage recent developments in stochastic first-order methods to solve the subproblem. Experimental results verify that the proposed algorithm outperforms previous algorithms for $\ell_1$-regularized logistic regression on real datasets.
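The condition $n > d$ follows from comparing the two bounds term by term. Assuming $\kappa > 0$ and $\epsilon < 1$, and ignoring the logarithmic factors hidden by $\tilde{\mathcal{O}}$:

```latex
n > d
\;\Longrightarrow\;
\sqrt{\kappa d} < \sqrt{\kappa n}
\;\Longrightarrow\;
d\left(n + \sqrt{\kappa d}\right)\log\frac{1}{\epsilon}
<
d\left(n + \sqrt{\kappa n}\right)\log\frac{1}{\epsilon}.
```

The gap matters most for ill-conditioned problems, where the $\sqrt{\kappa}$ term dominates $n$.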

  Access Model/Code and Paper
Active Ranking using Pairwise Comparisons

Dec 10, 2011
Kevin G. Jamieson, Robert D. Nowak

This paper examines the problem of ranking a collection of objects using pairwise comparisons (rankings of two objects). In general, the ranking of $n$ objects can be identified by standard sorting methods using $n \log_2 n$ pairwise comparisons. We are interested in natural situations in which relationships among the objects may allow for ranking using far fewer pairwise comparisons. Specifically, we assume that the objects can be embedded into a $d$-dimensional Euclidean space and that the rankings reflect their relative distances from a common reference point in $R^d$. We show that under this assumption the number of possible rankings grows like $n^{2d}$ and demonstrate an algorithm that can identify a randomly selected ranking using just slightly more than $d \log n$ adaptively selected pairwise comparisons, on average. If instead the comparisons are chosen at random, then almost all pairwise comparisons must be made in order to identify any ranking. In addition, we propose a robust, error-tolerant algorithm that only requires that the pairwise comparisons are probably correct. Experimental studies with synthetic and real datasets support the conclusions of our theoretical analysis.
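The baseline mentioned above, ranking by a standard comparison sort, can be sketched as follows. The comparison oracle answers "which object is closer to the reference point?", and the code counts how many times it is queried. This illustrates only the sorting baseline, not the paper's adaptive algorithm; the points and function names are hypothetical.

```python
import math
from functools import cmp_to_key


def rank_by_pairwise_comparisons(points, reference):
    """Rank point indices by distance to `reference` using only pairwise queries.

    Returns the ranking (closest first) and the number of comparisons made.
    """
    comparisons = 0

    def compare(i, j):
        nonlocal comparisons
        comparisons += 1
        di = math.dist(points[i], reference)
        dj = math.dist(points[j], reference)
        return (di > dj) - (di < dj)  # -1, 0, or 1

    order = sorted(range(len(points)), key=cmp_to_key(compare))
    return order, comparisons


# Hypothetical objects embedded in the plane (d = 2), reference at the origin.
points = [(3.0, 0.0), (0.0, 1.0), (2.0, 2.0), (0.0, 0.5)]
order, n_comparisons = rank_by_pairwise_comparisons(points, (0.0, 0.0))
print(order, n_comparisons)
```

The paper's point is that the geometric embedding rules out most of the $n!$ orderings, so far fewer oracle calls are needed than a generic sort makes.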

* 17 pages, an extended version of our NIPS 2011 paper. The new version revises the argument of the robust section and slightly modifies the result there to give it more impact 

  Access Model/Code and Paper
Learning Analogies and Semantic Relations

Jul 24, 2003
Peter D. Turney, Michael L. Littman

We present an algorithm for learning from unlabeled text, based on the Vector Space Model (VSM) of information retrieval, that can solve verbal analogy questions of the kind found in the Scholastic Aptitude Test (SAT). A verbal analogy has the form A:B::C:D, meaning "A is to B as C is to D"; for example, mason:stone::carpenter:wood. SAT analogy questions provide a word pair, A:B, and the problem is to select the most analogous word pair, C:D, from a set of five choices. The VSM algorithm correctly answers 47% of a collection of 374 college-level analogy questions (random guessing would yield 20% correct). We motivate this research by relating it to work in cognitive science and linguistics, and by applying it to a difficult problem in natural language processing, determining semantic relations in noun-modifier pairs. The problem is to classify a noun-modifier pair, such as "laser printer", according to the semantic relation between the noun (printer) and the modifier (laser). We use a supervised nearest-neighbour algorithm that assigns a class to a given noun-modifier pair by finding the most analogous noun-modifier pair in the training data. With 30 classes of semantic relations, on a collection of 600 labeled noun-modifier pairs, the learning algorithm attains an F value of 26.5% (random guessing: 3.3%). With 5 classes of semantic relations, the F value is 43.2% (random: 20%). The performance is state-of-the-art for these challenging problems.
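The selection step for an analogy question can be sketched as follows: each word pair is represented by a vector, and the choice whose vector is most similar to the stem pair's vector is returned. Cosine similarity is the usual measure in the Vector Space Model and is assumed here; the vectors are made-up toy values, whereas the paper derives them from a corpus.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_analogous(stem_vector, choices):
    """Return the name of the choice pair most similar to the stem pair."""
    return max(choices, key=lambda name: cosine(stem_vector, choices[name]))


# Toy feature vectors (hypothetical; not derived from any corpus).
stem = [5, 1, 0, 2]  # stands for mason:stone
choices = {
    "carpenter:wood": [4, 1, 0, 3],
    "teacher:chalk": [0, 3, 4, 1],
    "bird:nest": [1, 0, 5, 0],
}
best = most_analogous(stem, choices)
print(best)
```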

* 28 pages, issued 2003 

  Access Model/Code and Paper
Corpus-based Learning of Analogies and Semantic Relations

Aug 23, 2005
Peter D. Turney, Michael L. Littman

We present an algorithm for learning from unlabeled text, based on the Vector Space Model (VSM) of information retrieval, that can solve verbal analogy questions of the kind found in the SAT college entrance exam. A verbal analogy has the form A:B::C:D, meaning "A is to B as C is to D"; for example, mason:stone::carpenter:wood. SAT analogy questions provide a word pair, A:B, and the problem is to select the most analogous word pair, C:D, from a set of five choices. The VSM algorithm correctly answers 47% of a collection of 374 college-level analogy questions (random guessing would yield 20% correct; the average college-bound senior high school student answers about 57% correctly). We motivate this research by applying it to a difficult problem in natural language processing, determining semantic relations in noun-modifier pairs. The problem is to classify a noun-modifier pair, such as "laser printer", according to the semantic relation between the noun (printer) and the modifier (laser). We use a supervised nearest-neighbour algorithm that assigns a class to a given noun-modifier pair by finding the most analogous noun-modifier pair in the training data. With 30 classes of semantic relations, on a collection of 600 labeled noun-modifier pairs, the learning algorithm attains an F value of 26.5% (random guessing: 3.3%). With 5 classes of semantic relations, the F value is 43.2% (random: 20%). The performance is state-of-the-art for both verbal analogies and noun-modifier relations.

* Machine Learning, (2005), 60(1-3), 251-278 
* related work available at and 

  Access Model/Code and Paper
Analysis of Hydrological and Suspended Sediment Events from Mad River Watershed using Multivariate Time Series Clustering

Nov 28, 2019
Ali Javed, Scott D. Hamshaw, Donna M. Rizzo, Byung Suk Lee

Hydrological storm events are a primary driver for transporting water quality constituents such as turbidity, suspended sediments and nutrients. Analyzing the concentration (C) of these water quality constituents in response to increased streamflow discharge (Q), particularly when monitored at high temporal resolution during a hydrological event, helps to characterize the dynamics and flux of such constituents. A conventional approach to storm event analysis is to reduce the C-Q time series to two-dimensional (2-D) hysteresis loops and analyze these 2-D patterns. While effective and informative to some extent, this hysteresis loop approach has limitations because projecting the C-Q time series onto a 2-D plane obscures detail (e.g., temporal variation) associated with the C-Q relationships. In this paper, we address this issue using a multivariate time series clustering approach. Clustering is applied to sequences of river discharge and suspended sediment data (acquired through turbidity-based monitoring) from six watersheds located in the Lake Champlain Basin in the northeastern United States. While clusters of the hydrological storm events using the multivariate time series approach were found to be correlated to 2-D hysteresis loop classifications and watershed locations, the clusters differed from the 2-D hysteresis classifications. Additionally, using available meteorological data associated with storm events, we examine the characteristics of computational clusters of storm events in the study watersheds and identify the features driving the clustering approach.

  Access Model/Code and Paper
Geared Rotationally Identical and Invariant Convolutional Neural Network Systems

Aug 10, 2018
ShihChung B. Lo, Ph. D., Matthew T. Freedman, M. D., Seong K. Mun, Ph. D., Heang-Ping Chan, Ph. D

Theorems and techniques for forming different types of transformationally invariant processing, and for producing quantitatively identical output based on either transformationally invariant operators or symmetric operations, have recently been introduced by the authors. In this study, we further propose composing a geared rotationally identical CNN system (GRI-CNN) with a small step angle by connecting the networks of the participating processes at the first flatten layer. Using an ordinary CNN structure as a base, the requirements for constructing a GRI-CNN include the use of either a symmetric input vector or kernels with an angle increment that can form a complete cycle as a "gearwheel". Four basic GRI-CNN structures were studied. Each of them produces quantitatively identical output when the rotation angle of the input vector is evenly divisible by the step angle of the gear. Our study showed that when the input vector is rotated by an angle that does not match a step angle, the GRI-CNN still produces a highly consistent result. With an ultra-fine gear-tooth step angle (e.g., 1 degree or 0.1 degree), all four GRI-CNN systems can be constructed to be virtually isotropic.

* 14 pages, 6 figures, 8 tables 

  Access Model/Code and Paper
Learning One-hidden-layer Neural Networks with Landscape Design

Nov 03, 2017
Rong Ge, Jason D. Lee, Tengyu Ma

We consider the problem of learning a one-hidden-layer neural network: we assume the input $x\in \mathbb{R}^d$ is from a Gaussian distribution and the label $y = a^\top \sigma(Bx) + \xi$, where $a$ is a nonnegative vector in $\mathbb{R}^m$ with $m\le d$, $B\in \mathbb{R}^{m\times d}$ is a full-rank weight matrix, and $\xi$ is a noise vector. We first give an analytic formula for the population risk of the standard squared loss and demonstrate that it implicitly attempts to decompose a sequence of low-rank tensors simultaneously. Inspired by the formula, we design a non-convex objective function $G(\cdot)$ whose landscape is guaranteed to have the following properties: 1. All local minima of $G$ are also global minima. 2. All global minima of $G$ correspond to the ground truth parameters. 3. The value and gradient of $G$ can be estimated using samples. With these properties, stochastic gradient descent on $G$ provably converges to the global minimum and learns the ground-truth parameters. We also prove a finite sample complexity result and validate the results by simulations.

  Access Model/Code and Paper