We show that, for any convex differentiable loss function, a deep linear network has no spurious local minima as long as the same is true for the two-layer case. When applied to the quadratic loss, our result immediately implies the powerful result in [Kawaguchi 2016] that there are no spurious local minima in deep linear networks. Further, with the recent work [Zhou and Liang 2018], we can remove all the assumptions in [Kawaguchi 2016]. Our proof is short and elementary. It builds on the recent work of [Laurent and von Brecht 2018] and uses a new rank-one perturbation argument.

Traditional intelligent fault diagnosis methods for rolling bearings work well only under the common assumption that the labeled training data (source domain) and unlabeled testing data (target domain) are drawn from the same distribution. However, in many real-world applications, this assumption does not hold, especially when the working condition varies. In this paper, a new adversarial adaptive 1-D CNN called A2CNN is proposed to address this problem. A2CNN consists of four parts, namely, a source feature extractor, a target feature extractor, a label classifier and a domain discriminator. The layers between the source and target feature extractors are partially untied during the training stage to take both training efficiency and domain adaptation into consideration. Experiments show that A2CNN has strong fault-discriminative and domain-invariant capacity, and can therefore achieve high accuracy under different working conditions. We also visualize the learned features and the networks to explore the reasons behind the high performance of our proposed model.

Image captioning is a challenging problem owing to the complexity of understanding the image content and the diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state-of-the-art approaches follow an encoder-decoder framework, which generates captions using a sequential recurrent prediction model. In this paper, however, we introduce a novel decision-making framework for image captioning. We utilize a "policy network" and a "value network" to collaboratively generate captions. The policy network serves as local guidance by providing the confidence of predicting the next word according to the current state. Additionally, the value network serves as global, lookahead guidance by evaluating all possible extensions of the current state. In essence, it shifts the goal from predicting the correct words towards generating captions similar to the ground truth captions. We train both networks using an actor-critic reinforcement learning model, with a novel reward defined by visual-semantic embedding. Extensive experiments and analyses on the Microsoft COCO dataset show that the proposed framework outperforms state-of-the-art approaches across different evaluation metrics.

Many well-established recommender systems are based on representation learning in Euclidean space. In these models, matching functions such as the Euclidean distance or inner product are typically used for computing similarity scores between user and item embeddings. This paper investigates the notion of learning user and item representations in Hyperbolic space. We argue that Hyperbolic space is more suitable for learning user-item embeddings in the recommendation domain. Unlike Euclidean spaces, Hyperbolic spaces are intrinsically equipped to handle hierarchical structure, owing to their property of exponentially increasing distances away from the origin. We propose HyperBPR (Hyperbolic Bayesian Personalized Ranking), a conceptually simple but highly effective model for the task at hand. HyperBPR not only outperforms its Euclidean counterparts, but also achieves state-of-the-art performance on multiple benchmark datasets, demonstrating the effectiveness of personalized recommendation in Hyperbolic space.
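
To make the idea of ranking by hyperbolic distance concrete, here is a minimal sketch. It assumes the Poincaré ball model of hyperbolic space and a standard BPR pairwise loss with the negative distance as the score; the abstract does not state which model or scoring function HyperBPR uses, so the function names and these choices are illustrative assumptions only.

```python
import math

def _sq_norm(x):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(a * a for a in x)

def poincare_distance(u, v):
    """Distance in the Poincare ball (assumed model); requires |u| < 1 and |v| < 1."""
    diff = _sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - _sq_norm(u)) * (1.0 - _sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

def bpr_loss(user, pos_item, neg_item):
    """Pairwise BPR loss, -log(sigmoid(margin)), with score = -distance (assumption)."""
    margin = poincare_distance(user, neg_item) - poincare_distance(user, pos_item)
    return math.log1p(math.exp(-margin))

# Equal-looking Euclidean radii map to rapidly growing hyperbolic distances.
origin = [0.0, 0.0]
for r in (0.5, 0.9, 0.99):
    print(r, round(poincare_distance(origin, [r, 0.0]), 3))
```

In this model the distance from the origin to a point at Euclidean radius r is 2·artanh(r), which diverges as r approaches 1; this is the "increasing distances away from the origin" property that leaves room for hierarchies near the boundary.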

Compressed sensing (sparse signal recovery) often encounters nonnegative data (e.g., images). Recently we developed the methodology of using (dense) Compressed Counting for recovering nonnegative K-sparse signals. In this paper, we adopt very sparse Compressed Counting for nonnegative signal recovery. Our design matrix is sampled from a maximally-skewed p-stable distribution (0<p<1), and we sparsify the design matrix so that on average a (1-g)-fraction of the entries become zero. The idea is related to very sparse stable random projections (Li et al. 2006 and Li 2007), the prior work for estimating summary statistics of the data. In our theoretical analysis, we show that, when p->0, it suffices to use M = K/(1-exp(-gK)) log N measurements, so that all coordinates can be recovered in one scan of the coordinates. If g = 1 (i.e., dense design), then M = K log N. If g = 1/K or 2/K (i.e., very sparse design), then M = 1.58K log N or M = 1.16K log N. This means the design matrix can indeed be very sparse at only a minor inflation of the sample complexity. Interestingly, as p->1, the required number of measurements is essentially M = 2.7K log N, provided g = 1/K. It turns out that this result is a general worst-case bound.
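
The constants quoted for the p->0 case can be checked directly from the bound, read as M = K/(1-exp(-gK)) log N. The short sketch below evaluates the factor multiplying K log N; the function name and the choice K = 50 are illustrative, and the natural logarithm is an assumption since the abstract does not state the base.

```python
import math

def measurement_factor(g: float, K: int) -> float:
    """Return c such that M = c * K * log N, i.e. c = 1 / (1 - exp(-g * K))."""
    return 1.0 / (1.0 - math.exp(-g * K))

def num_measurements(K: int, N: int, g: float) -> float:
    """M = K / (1 - exp(-g * K)) * log N, using the natural log (assumption)."""
    return measurement_factor(g, K) * K * math.log(N)

K = 50  # illustrative sparsity level
for label, g in [("g = 1", 1.0), ("g = 1/K", 1.0 / K), ("g = 2/K", 2.0 / K)]:
    print(f"{label}: M ~ {measurement_factor(g, K):.2f} K log N")
```

For g = 1/K the factor is 1/(1 - e^{-1}) and for g = 2/K it is 1/(1 - e^{-2}), independent of K, which is where the 1.58 and 1.16 figures come from.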

Multi-label learning deals with classification problems in which each instance can be assigned multiple labels simultaneously. Conventional multi-label learning approaches mainly focus on exploiting label correlations. It is usually assumed, explicitly or implicitly, that the label sets for training instances are fully labeled without any missing labels. However, in many real-world multi-label datasets, the label assignments for training instances can be incomplete: some ground-truth labels can be missed by the labeler. This problem is especially typical when the number of instances is very large and the labeling cost is very high, which makes it almost impossible to obtain a fully labeled training set. In this paper, we study the problem of large-scale multi-label learning with incomplete label assignments. We propose an approach, called MPU, based upon positive and unlabeled stochastic gradient descent and stacked models. Unlike prior works, our method can effectively and efficiently consider missing labels and label correlations simultaneously, and is very scalable, with time complexity linear in the size of the data. Extensive experiments on two real-world multi-label datasets show that our MPU model consistently outperforms other commonly used baselines.

Recently, we have seen a rapid development of Deep Neural Network (DNN) based visual tracking solutions. Some trackers combine DNN-based solutions with Discriminative Correlation Filters (DCF) to extract semantic features and successfully deliver state-of-the-art tracking accuracy. However, these solutions are highly compute-intensive and require long processing times, so real-time performance is not guaranteed. To deliver both high accuracy and reliable real-time performance, we propose a novel tracker called SiamVGG. It combines a Convolutional Neural Network (CNN) backbone and a cross-correlation operator, and takes advantage of the features from exemplary images for more accurate object tracking. The architecture of SiamVGG is customized from VGG-16, with the parameters shared by both exemplary images and desired input video frames. We evaluate the proposed SiamVGG on the OTB-2013/50/100 and VOT 2015/2016/2017 datasets, where it achieves state-of-the-art accuracy while maintaining a decent real-time performance of 50 FPS running on a GTX 1080Ti. Our design achieves a 2% higher Expected Average Overlap (EAO) than ECO and C-COT in the VOT2017 Challenge.

MapReduce and its variants have significantly simplified and accelerated the process of developing parallel programs. However, most MapReduce implementations focus on data-intensive tasks, while many real-world tasks are compute-intensive and their data can fit in memory when distributed across nodes. For these tasks, MapReduce programs can be much slower than hand-optimized ones. We present Blaze, a C++ library that makes it easy to develop high-performance parallel programs for such compute-intensive tasks. At the core of Blaze is a highly optimized in-memory MapReduce function, which has three main improvements over conventional MapReduce implementations: eager reduction, fast serialization, and special treatment for a small fixed key range. We also offer additional conveniences that make developing parallel programs similar to developing serial programs. These improvements make Blaze an easy-to-use cluster computing library that approaches the speed of hand-optimized parallel code. We apply Blaze to some common data mining tasks, including word frequency count, PageRank, k-means, expectation maximization (Gaussian mixture model), and k-nearest neighbors. Blaze outperforms Apache Spark by more than 10 times on average for these tasks, and the speed of Blaze scales almost linearly with the number of nodes. In addition, Blaze uses only the MapReduce function and 3 utility functions in its implementation, while Spark uses almost 30 different parallel primitives in its official implementation.
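
The term "eager reduction" can be illustrated with a small single-process sketch: instead of collecting every emitted value per key and reducing afterwards, each emitted value is folded into a running result immediately. This is a Python illustration of the general idea only; the function names are hypothetical and do not reflect Blaze's actual C++ API or its distributed implementation.

```python
def mapreduce_eager(records, mapper, reducer):
    """Single-process MapReduce that reduces each value as soon as it is emitted."""
    acc = {}

    def emit(key, value):
        # Fold the new value into the running result instead of buffering it.
        acc[key] = reducer(acc[key], value) if key in acc else value

    for record in records:
        mapper(record, emit)
    return acc

def word_count_mapper(line, emit):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        emit(word, 1)

counts = mapreduce_eager(["a b a", "b c"], word_count_mapper, lambda x, y: x + y)
print(counts)
```

With this scheme the intermediate state holds one value per distinct key rather than one value per emitted pair, which is the memory and shuffle saving the technique is aiming for.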

Matrix Product States (MPS), also known as the Tensor Train (TT) decomposition in mathematics, were originally proposed for describing an (especially one-dimensional) quantum system, and have recently found applications in various areas such as compressing high-dimensional data, supervised kernel linear classification, and unsupervised generative modeling. However, when applied to systems that are not defined on one-dimensional lattices, a serious drawback of the MPS is the exponential decay of the correlations, which limits its power in capturing long-range dependences among variables in the system. To alleviate this problem, we propose to introduce long-range interactions, which act as shortcuts, to MPS, resulting in a new model, \textit{Shortcut Matrix Product States} (SMPS). When chosen properly, the shortcuts can significantly decrease the correlation length of the MPS while preserving the computational efficiency. We develop efficient training methods of SMPS for various tasks, establish some of their mathematical properties, and show how to find a good location to add shortcuts. Finally, using extensive numerical experiments, we evaluate its performance in a variety of applications, including function fitting, partition function calculation of the $2$-d Ising model, and unsupervised generative modeling of handwritten digits, to illustrate its advantages over vanilla matrix product states.

* 15 pages, 11 figures
Estimating multiple attributes from a single facial image gives a comprehensive description of the high-level semantics of the face. It is naturally regarded as a multi-task supervised learning problem with a single deep CNN, in which lower layers are shared and higher ones are task-dependent with a multi-branch structure. Within the traditional deep multi-task learning (DMTL) framework, this paper aims to fully exploit the correlations among different attributes by constructing a graph. Each node in the graph represents the feature vector from a particular branch for a given attribute, and each edge can be defined either by prior knowledge or by the similarity between two nodes in the embedding, in a fully data-driven manner. We show that the attention mechanism actually takes effect in the latter case, and utilize the Graph Attention Layer (GAL) to explore the most relevant attribute features and to refine the task-dependent feature by considering other attributes. Experiments show that, by mining the correlations among attributes, our method improves the recognition accuracy on the CelebA and LFWA datasets and achieves competitive performance.

* 9 pages, 5 figures, submitted to AAAI 2019
Electricity consumption forecasting is important to mineral companies for guiding quarterly work, normal power system operation, and management. However, electricity consumption prediction for a mineral company differs from traditional electricity load prediction, since mineral company electricity consumption can be affected by various factors (e.g., ore grade, processing quantity of the crude ore, ball milling fill rate). The problem is non-trivial due to three major challenges for traditional methods: insufficient training data, high computational cost and low prediction accuracy. To tackle these challenges, we first propose a Regressive Convolution Neural Network (RCNN) to predict the electricity consumption. Because RCNN still suffers from high computation overhead, we use RCNN to extract features from the historical data and a Regressive Support Vector Machine (SVR) trained on these features to predict the electricity consumption. The experimental results show that the proposed RCNN-SVR model achieves higher accuracy than using the traditional RNN or SVM alone. The MSE, MAPE, and CV-RMSE of the RCNN-SVR model are 0.8564, 1.975%, and 0.0687% respectively, which illustrates the low prediction error rate of the proposed model.

* Future of Information and Communications Conference (FICC) 2019
Recently, self-normalizing neural networks (SNNs) have been proposed with the intention of avoiding batch or weight normalization. The key step in SNNs is to properly scale the exponential linear unit (referred to as SELU) to inherently incorporate normalization based on central limit theory. SELU is a monotonically increasing function with an approximately constant negative output for large negative input. In this work, we propose a new activation function that breaks the monotonicity property of SELU while still preserving the self-normalizing property. Unlike SELU, the new function introduces a bump-shaped function in the region of negative input by regularizing a linear function with a scaled exponential function, and is referred to as a scaled exponentially-regularized linear unit (SERLU). The bump-shaped function has approximately zero response to large negative input while being able to push the output of SERLU towards zero mean statistically. To effectively combat over-fitting, we develop a so-called shift-dropout for SERLU, which includes standard dropout as a special case. Experimental results on MNIST, CIFAR10 and CIFAR100 show that SERLU-based neural networks provide consistently promising results in comparison to five other activation functions: ELU, SELU, Swish, Leaky ReLU and ReLU.
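
The bump shape described above can be sketched as follows. The functional form used here (a linear unit for nonnegative input, and a linear function multiplied by a scaled exponential for negative input) is one reading of the abstract's description, and the parameter names `lam` and `alpha` as well as the demo values of 1.0 are assumptions for illustration, not the constants derived in the paper.

```python
import math

def serlu(x: float, lam: float, alpha: float) -> float:
    """Assumed SERLU form: lam * x for x >= 0, lam * alpha * x * exp(x) for x < 0."""
    if x >= 0:
        return lam * x
    return lam * alpha * x * math.exp(x)

# Illustrative parameters only; the self-normalizing constants are not given here.
lam, alpha = 1.0, 1.0
for x in (-20.0, -5.0, -1.0, -0.5, 0.0, 2.0):
    print(x, serlu(x, lam, alpha))
```

Because the derivative of x·exp(x) is (1 + x)·exp(x), the negative branch reaches its minimum at x = -1 and returns towards zero for large negative input, which is the non-monotonic bump that distinguishes this form from SELU.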

* 9 pages
Many-objective evolutionary algorithms (MOEAs), especially the decomposition-based MOEAs, have attracted wide attention in recent years. Recent studies show that a well-designed combination of the decomposition method and the domination method can improve the performance, i.e., convergence and diversity, of a MOEA. In this paper, a novel way of combining the decomposition method and the domination method is proposed. More precisely, a set of weight vectors is employed to decompose a given many-objective optimization problem (MaOP), and a hybrid method of the penalty-based boundary intersection function and dominance is proposed to compare local solutions within a subpopulation defined by a weight vector. A MOEA based on the hybrid method is implemented and tested on problems chosen from two famous test suites, i.e., DTLZ and WFG. The experimental results show that our algorithm is very competitive in dealing with MaOPs. Subsequently, our algorithm is extended to solve constrained MaOPs, and the constrained version of our algorithm also shows good performance in terms of convergence and diversity. These results reveal that using dominance locally and combining it with the decomposition method can effectively improve the performance of a MOEA.
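
For readers unfamiliar with the penalty-based boundary intersection (PBI) function mentioned above, here is a minimal sketch of its usual definition from the decomposition-based literature: the distance d1 along the weight vector measures convergence, the perpendicular distance d2 measures deviation from the search direction, and the two are combined as d1 + theta * d2. The default theta = 5.0 and the function signature are assumptions, and the paper's hybrid of PBI with dominance is not reproduced here.

```python
import math

def pbi(f, w, z_star, theta=5.0):
    """Standard PBI scalarization for minimization; smaller is better.

    f: objective vector, w: weight vector, z_star: ideal point,
    theta: penalty on the perpendicular distance (assumed default).
    """
    diff = [fi - zi for fi, zi in zip(f, z_star)]
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    # d1: length of the projection of (f - z*) onto the direction w.
    d1 = abs(sum(d * wi for d, wi in zip(diff, w))) / norm_w
    # d2: distance from f to its projection on the line through z* along w.
    d2 = math.sqrt(sum((d - d1 * wi / norm_w) ** 2 for d, wi in zip(diff, w)))
    return d1 + theta * d2

z = [0.0, 0.0]
w = [1.0, 1.0]
print(pbi([1.0, 1.0], w, z))  # lies on the weight direction, so d2 = 0
print(pbi([2.0, 0.0], w, z))  # same d1, but penalized for being off-direction
```

Two solutions with the same d1 are therefore ranked by how far they stray from the weight vector, which is how the scalarization encourages diversity within each subpopulation.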

* arXiv admin note: substantial text overlap with arXiv:1803.06282, arXiv:1806.10950
Web page saliency prediction is a challenging problem in image transformation and computer vision. In this paper, we propose a new model that uses web page outline information to predict the regions of a web page that attract people's interest. For each web page image, our model generates a saliency map that indicates the regions of interest. A two-stage generative adversarial network is proposed, and image outline information is introduced for better transfer. Experimental results on the FIWI dataset show that our model performs better at saliency prediction.

We consider a new setting of online clustering of contextual cascading bandits, an online learning problem where the underlying cluster structure over users is unknown and needs to be learned from random prefix feedback. More precisely, a learning agent recommends an ordered list of items to a user, who checks the list and stops at the first satisfactory item, if any. We propose an algorithm, CLUB-cascade, for this setting and prove an $n$-step regret bound of order $\tilde{O}(\sqrt{n})$. Previous work corresponds to the degenerate case of only one cluster, and our general regret bound in this special case also significantly improves on theirs. We conduct experiments on both synthetic and real data, and demonstrate the effectiveness of our algorithm and the advantage of incorporating an online clustering method.
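
The "prefix feedback" in the cascade model can be made concrete with a small sketch: the user scans the list in order and stops at the first satisfactory item, so the learner only observes the items up to and including that position. The function name and the satisfaction predicate are hypothetical; this models the feedback only, not the CLUB-cascade algorithm itself.

```python
def cascade_feedback(ranked_items, is_satisfactory):
    """Simulate one round of cascade feedback.

    Returns (click_position, observed_prefix). click_position is None when
    no item satisfies the user, in which case the whole list was examined.
    """
    for position, item in enumerate(ranked_items):
        if is_satisfactory(item):
            # Items after the click are never examined, so they stay unobserved.
            return position, list(ranked_items[: position + 1])
    return None, list(ranked_items)

liked = {"c", "d"}  # hypothetical user preferences
print(cascade_feedback(["a", "b", "c", "d"], lambda item: item in liked))
print(cascade_feedback(["a", "b"], lambda item: item in liked))
```

In the first call item "d" is also attractive but yields no signal, because the user stopped at "c"; the length of the observed prefix is therefore random and depends on the user.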

In this paper, we propose an efficient approximate rank-one update for the covariance matrix adaptation evolution strategy (CMA-ES). It makes use of two evolution paths, each as simple as that of CMA-ES, while avoiding the computational cost of matrix decomposition. We analyze the algorithm's properties and behaviors, and experimentally study its performance. It generally outperforms or performs competitively with the Cholesky CMA-ES.

* 10 pages, 10 figures
Bio-inspired algorithms have received a significant amount of attention in both the academic and engineering communities. In this paper, based on the observation of two major survival rules of a species of woodlice, i.e., Porcellio scaber, we design and propose an algorithm called the porcellio scaber algorithm (PSA) for solving optimization problems, including differentiable and non-differentiable ones as well as cases with local optima. Numerical results based on benchmark problems are presented to validate the efficacy of PSA.

* 3 pages, 4 figures
Transfer learning significantly accelerates the reinforcement learning process by exploiting relevant knowledge from previous experiences. The problem of optimally selecting source policies during the learning process is of great importance yet challenging, and there has been little theoretical analysis of it. In this paper, we develop an optimal online method to select source policies for reinforcement learning. This method formulates online source policy selection as a multi-armed bandit problem and augments Q-learning with policy reuse. We provide theoretical guarantees of the optimal selection process and convergence to the optimal policy. In addition, we conduct experiments on a grid-based robot navigation domain to demonstrate its efficiency and robustness in comparison with the state-of-the-art transfer learning method.
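
To illustrate what treating source policy selection as a multi-armed bandit can look like, the sketch below uses UCB1 with each source policy as an arm and the episode return as the reward. UCB1 is a stand-in chosen for illustration; the abstract does not say which bandit algorithm is used, the class and method names are hypothetical, returns are assumed to be scaled to [0, 1], and the Q-learning and policy-reuse parts are omitted.

```python
import math

class UCB1Selector:
    """UCB1 over a fixed set of source policies (illustrative stand-in)."""

    def __init__(self, n_policies: int):
        self.counts = [0] * n_policies   # episodes run with each policy
        self.means = [0.0] * n_policies  # running mean return per policy

    def select(self) -> int:
        # Try every policy once before using confidence bounds.
        for i, count in enumerate(self.counts):
            if count == 0:
                return i
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i: int, episode_return: float) -> None:
        self.counts[i] += 1
        self.means[i] += (episode_return - self.means[i]) / self.counts[i]

# Toy run with fixed, made-up returns per source policy (assumed in [0, 1]).
toy_returns = [0.2, 0.8, 0.5]
selector = UCB1Selector(len(toy_returns))
for _ in range(500):
    arm = selector.select()
    selector.update(arm, toy_returns[arm])
print(selector.counts)
```

In a full method the selected policy would guide exploration during the episode, and the observed return would feed back into the selector, so that helpful source policies are reused more often while unhelpful ones are still tried occasionally.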

Demand response is designed to motivate electricity customers to modify their loads at critical time periods. Accurate estimation of the impact of demand response signals on customers' consumption is central to any successful program. In practice, learning these responses is nontrivial because operators can only send a limited number of signals. In addition, customer behavior also depends on a large number of exogenous covariates. These two features lead to a high-dimensional inference problem with a limited number of observations. In this paper, we formulate this problem using a multivariate linear model and adopt an experimental design approach to estimate the impact of demand response signals. We show that randomized assignment, which is widely used to estimate the average treatment effect, is not efficient in reducing the variance of the estimator when a large number of covariates is present. In contrast, we present a tractable algorithm that strategically assigns demand response signals to customers. This algorithm achieves the optimal reduction in estimation variance, independent of the number of covariates. The results are validated by simulations on synthetic data.

* A shorter version appeared in Proceedings of the 2016 Allerton Conference
Precise recommendation of followers helps improve the user experience and maintain the prosperity of Twitter and other microblog platforms. In this paper, we design a hybrid microblog recommender system as a solution to the KDD Cup 2012 track 1 task, which requires predicting which users a user might follow in Tencent Microblog. We describe the background of the problem and present the algorithm, which consists of keyword analysis, user taxonomy, (potential) interest extraction and item recommendation. Experimental results show the high performance of our algorithm. Some possible improvements are discussed, which point to further study.

* 7 pages