We propose a novel probabilistic dimensionality reduction framework that can naturally integrate the generative model and the locality information of data. Based on this framework, we present a new model, which is able to learn a smooth skeleton of embedding points in a low-dimensional space from high-dimensional noisy data. The formulation of the new model can be equivalently interpreted as two coupled learning problem, i.e., structure learning and the learning of projection matrix. This interpretation motivates the learning of the embedding points that can directly form an explicit graph structure. We develop a new method to learn the embedding points that form a spanning tree, which is further extended to obtain a discriminative and compact feature representation for clustering problems. Unlike traditional clustering methods, we assume that centers of clusters should be close to each other if they are connected in a learned graph, and other cluster centers should be distant. This can greatly facilitate data visualization and scientific discovery in downstream analysis. Extensive experiments are performed that demonstrate that the proposed framework is able to obtain discriminative feature representations, and correctly recover the intrinsic structures of various real-world datasets.

* 32 pages, 6 figures
Click to Read Paper
Recent years have witnessed wide application of hashing for large-scale image retrieval. However, most existing hashing methods are based on hand-crafted features which might not be optimally compatible with the hashing procedure. Recently, deep hashing methods have been proposed to perform simultaneous feature learning and hash-code learning with deep neural networks, which have shown better performance than traditional hashing methods with hand-crafted features. Most of these deep hashing methods are supervised whose supervised information is given with triplet labels. For another common application scenario with pairwise labels, there have not existed methods for simultaneous feature learning and hash-code learning. In this paper, we propose a novel deep hashing method, called deep pairwise-supervised hashing(DPSH), to perform simultaneous feature learning and hash-code learning for applications with pairwise labels. Experiments on real datasets show that our DPSH method can outperform other methods to achieve the state-of-the-art performance in image retrieval applications.

* IJCAI 2016
Click to Read Paper
Recently popularized graph neural networks achieve the state-of-the-art accuracy on a number of standard benchmark datasets for graph-based semi-supervised learning, improving significantly over existing approaches. These architectures alternate between a propagation layer that aggregates the hidden states of the local neighborhood and a fully-connected layer. Perhaps surprisingly, we show that a linear model, that removes all the intermediate fully-connected layers, is still able to achieve a performance comparable to the state-of-the-art models. This significantly reduces the number of parameters, which is critical for semi-supervised learning where number of labeled examples are small. This in turn allows a room for designing more innovative propagation layers. Based on this insight, we propose a novel graph neural network that removes all the intermediate fully-connected layers, and replaces the propagation layers with attention mechanisms that respect the structure of the graph. The attention mechanism allows us to learn a dynamic and adaptive local summary of the neighborhood to achieve more accurate predictions. In a number of experiments on benchmark citation networks datasets, we demonstrate that our approach outperforms competing methods. By examining the attention weights among neighbors, we show that our model provides some interesting insights on how neighbors influence each other.

Click to Read Paper
The labeling cost of large number of bounding boxes is one of the main challenges for training modern object detectors. To reduce the dependence on expensive bounding box annotations, we propose a new semi-supervised object detection formulation, in which a few seed box level annotations and a large scale of image level annotations are used to train the detector. We adopt a training-mining framework, which is widely used in weakly supervised object detection tasks. However, the mining process inherently introduces various kinds of labelling noises: false negatives, false positives and inaccurate boundaries, which can be harmful for training the standard object detectors (e.g. Faster RCNN). We propose a novel NOise Tolerant Ensemble RCNN (NOTE-RCNN) object detector to handle such noisy labels. Comparing to standard Faster RCNN, it contains three highlights: an ensemble of two classification heads and a distillation head to avoid overfitting on noisy labels and improve the mining precision, masking the negative sample loss in box predictor to avoid the harm of false negative labels, and training box regression head only on seed annotations to eliminate the harm from inaccurate boundaries of mined bounding boxes. We evaluate the methods on ILSVRC 2013 and MSCOCO 2017 dataset; we observe that the detection accuracy consistently improves as we iterate between mining and training steps, and state-of-the-art performance is achieved.

* 8 pages
Click to Read Paper
Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state-of-the-art approaches follow an encoder-decoder framework, which generates captions using a sequential recurrent prediction model. However, in this paper, we introduce a novel decision-making framework for image captioning. We utilize a "policy network" and a "value network" to collaboratively generate captions. The policy network serves as a local guidance by providing the confidence of predicting the next word according to the current state. Additionally, the value network serves as a global and lookahead guidance by evaluating all possible extensions of the current state. In essence, it adjusts the goal of predicting the correct words towards the goal of generating captions similar to the ground truth captions. We train both networks using an actor-critic reinforcement learning model, with a novel reward defined by visual-semantic embedding. Extensive experiments and analyses on the Microsoft COCO dataset show that the proposed framework outperforms state-of-the-art approaches across different evaluation metrics.

Click to Read Paper
Modern cities experience heavy traffic flows and congestions regularly across space and time. Monitoring traffic situations becomes an important challenge for the Traffic Control and Surveillance Systems (TCSS). In advanced TCSS, it is helpful to automatically detect and classify different traffic incidents such as severity of congestion, abnormal driving pattern, abrupt or illegal stop on road, etc. Although most TCSS are equipped with basic incident detection algorithms, they are however crude to be really useful as an automated tool for further classification. In literature, there is a lack of research for Automated Incident Classification (AIC). Therefore, a novel AIC method is proposed in this paper to tackle such challenges. In the proposed method, traffic signals are firstly extracted from captured videos and converted as spatial-temporal (ST) signals. Based on the characteristics of the ST signals, a set of realistic simulation data are generated to construct an extended big traffic database to cover a variety of traffic situations. Next, a Mean-Shift filter is introduced to suppress the effect of noise and extract significant features from the ST signals. The extracted features are then associated with various types of traffic data: one normal type (inliers) and multiple abnormal types (outliers). For the classification, an adaptive boosting classifier is trained to detect outliers in traffic data automatically. Further, a Support Vector Machine (SVM) based method is adopted to train the model for identifying the categories of outliers. In short, this hybrid approach is called an Adaptive Boosting Support Vector Machines (AB-SVM) method. Experimental results show that the proposed AB-SVM method achieves a satisfied result with more than 92% classification accuracy on average.

* 27 pages, 8 figures
Click to Read Paper
Model compression is a critical technique to efficiently deploy neural network models on mobile devices which have limited computation resources and tight power budgets. Conventional model compression techniques rely on hand-crafted heuristics and rule-based policies that require domain experts to explore the large design space trading off among model size, speed, and accuracy, which is usually sub-optimal and time-consuming. In this paper, we propose AutoML for Model Compression (AMC) which leverage reinforcement learning to provide the model compression policy. This learning-based compression policy outperforms conventional rule-based compression policy by having higher compression ratio, better preserving the accuracy and freeing human labor. Under 4x FLOPs reduction, we achieved 2.7% better accuracy than the hand- crafted model compression policy for VGG-16 on ImageNet. We applied this automated, push-the-button compression pipeline to MobileNet and achieved 1.81x speedup of measured inference latency on an Android phone and 1.43x speedup on the Titan XP GPU, with only 0.1% loss of ImageNet Top-1 accuracy.

Click to Read Paper
In this paper, we propose a general model for plane-based clustering. The general model contains many existing plane-based clustering methods, e.g., k-plane clustering (kPC), proximal plane clustering (PPC), twin support vector clustering (TWSVC) and its extensions. Under this general model, one may obtain an appropriate clustering method for specific purpose. The general model is a procedure corresponding to an optimization problem, where the optimization problem minimizes the total loss of the samples. Thereinto, the loss of a sample derives from both within-cluster and between-cluster. In theory, the termination conditions are discussed, and we prove that the general model terminates in a finite number of steps at a local or weak local optimal point. Furthermore, based on this general model, we propose a plane-based clustering method by introducing a new loss function to capture the data distribution precisely. Experimental results on artificial and public available datasets verify the effectiveness of the proposed method.

* 13 pages, 43 figures
Click to Read Paper
Accurate identification and localization of abnormalities from radiology images play an integral part in clinical diagnosis and treatment planning. Building a highly accurate prediction model for these tasks usually requires a large number of images manually annotated with labels and finding sites of abnormalities. In reality, however, such annotated data are expensive to acquire, especially the ones with location annotations. We need methods that can work well with only a small amount of location annotations. To address this challenge, we present a unified approach that simultaneously performs disease identification and localization through the same underlying model for all images. We demonstrate that our approach can effectively leverage both class information as well as limited location annotation, and significantly outperforms the comparative reference baseline in both classification and localization tasks.

* Conference on Computer Vision and Pattern Recognition 2018 (CVPR 2018). V1: CVPR submission; V2: +supplementary; V3: CVPR camera-ready; V4: correction, update reference baseline results according to their latest post; V5: minor correction; V6: Identification results using NIH data splits and various image models
Click to Read Paper
Segmentation of colorectal cancerous regions from Magnetic Resonance (MR) images is a crucial procedure for radiotherapy which conventionally requires accurate delineation of tumour boundaries at an expense of labor, time and reproducibility. To address this important yet challenging task within the framework of performance-leading deep learning methods, regions of interest (RoIs) localization from large whole volume 3D images serves as a preceding operation that brings about multiple benefits in terms of speed, target completeness and reduction of false positives. Distinct from sliding window or discrete localization-segmentation based models, we propose a novel multi-task framework referred to as 3D RoI-aware U-Net (3D RU-Net), for RoI localization and intra-RoI segmentation where the two tasks share one backbone encoder network. With the region proposals from the encoder, we crop multi-level feature maps from the backbone network to form a GPU memory-efficient decoder for detail-preserving intra-RoI segmentation. To effectively train the model, we designed a Dice formulated loss function for the global-to-local multi-task learning procedure. Based on the promising efficiency gains demonstrated by the proposed method, we went on to ensemble multiple models to achieve even higher performance costing minor extra computational expensiveness. Extensive experiments were subsequently conducted on 64 cancerous cases with a four-fold cross-validation, and the results showed significant superiority in terms of accuracy and efficiency over conventional state-of-the art frameworks. In conclusion, the proposed method has a huge potential for extension to other 3D object segmentation tasks from medical images due to its inherent generalizability. The code for the proposed method is publicly available.

Click to Read Paper
A deep convolutional fuzzy system (DCFS) on a high-dimensional input space is a multi-layer connection of many low-dimensional fuzzy systems, where the input variables to the low-dimensional fuzzy systems are selected through a moving window (a convolution operator) across the input spaces of the layers. To design the DCFS based on input-output data pairs, we propose a bottom-up layer-by-layer scheme. Specifically, by viewing each of the first-layer fuzzy systems as a weak estimator of the output based only on a very small portion of the input variables, we can design these fuzzy systems using the WM Method. After the first-layer fuzzy systems are designed, we pass the data through the first layer and replace the inputs in the original data set by the corresponding outputs of the first layer to form a new data set, then we design the second-layer fuzzy systems based on this new data set in the same way as designing the first-layer fuzzy systems. Repeating this process we design the whole DCFS. Since the WM Method requires only one-pass of the data, this training algorithm for the DCFS is very fast. We apply the DCFS model with the training algorithm to predict a synthetic chaotic plus random time-series and the real Hang Seng Index of the Hong Kong stock market.

Click to Read Paper
In multi-label classification, an instance may be associated with a set of labels simultaneously. Recently, the research on multi-label classification has largely shifted its focus to the other end of the spectrum where the number of labels is assumed to be extremely large. The existing works focus on how to design scalable algorithms that offer fast training procedures and have a small memory footprint. However they ignore and even compound another challenge - the label imbalance problem. To address this drawback, we propose a novel Representation-based Multi-label Learning with Sampling (RMLS) approach. To the best of our knowledge, we are the first to tackle the imbalance problem in multi-label classification with many labels. Our experimentations with real-world datasets demonstrate the effectiveness of the proposed approach.

Click to Read Paper
We propose a new heavy-tailed distribution --- Gaussian-Chain (GC) distribution, which is inspirited by the hierarchical structures prevailing in social organizations. We determine the mean, variance and kurtosis of the Gaussian-Chain distribution to show its heavy-tailed property, and compute the tail distribution table to give specific numbers showing how heavy is the heavy-tails. To filter out the heavy-tailed noise, we construct two filters --- 2nd and 3rd-order GC filters --- based on the maximum likelihood principle. Simulation results show that the GC filters perform much better than the benchmark least-squares algorithm when the noise is heavy-tail distributed. Using the GC filters, we propose a trading strategy, named Ride-the-Mood, to follow the mood of the market by detecting the actions of the big buyers and the big sellers in the market based on the noisy, heavy-tailed price data. Application of the Ride-the-Mood strategy to five blue-chip Hong Kong stocks over the recent two-year period from April 2, 2012 to March 31, 2014 shows that their returns are higher than the returns of the benchmark Buy-and-Hold strategy and the Hang Seng Index Fund.

Click to Read Paper
Many tasks in natural language processing, ranging from machine translation to question answering, can be reduced to the problem of matching two sentences or more generally two short texts. We propose a new approach to the problem, called Deep Match Tree (DeepMatch$_{tree}$), under a general setting. The approach consists of two components, 1) a mining algorithm to discover patterns for matching two short-texts, defined in the product space of dependency trees, and 2) a deep neural network for matching short texts using the mined patterns, as well as a learning algorithm to build the network having a sparse structure. We test our algorithm on the problem of matching a tweet and a response in social media, a hard matching problem proposed in [Wang et al., 2013], and show that DeepMatch$_{tree}$ can outperform a number of competitor models including one without using dependency trees and one based on word-embedding, all with large margins

* Accepted by IJCAI-2015 as full paper
Click to Read Paper
Our goal is to predict the location of the next crime in a crime series, based on the identified previous offenses in the series. We build a predictive model called Next Hit Predictor (NHP) that finds the most likely location of the next serial crime via a carefully designed risk model. The risk model follows the paradigm of a self-exciting point process which consists of a background crime risk and triggered risks stimulated by previous offenses in the series. Thus, NHP creates a risk map for a crime series at hand. To train the risk model, we formulate a convex learning objective that considers pairwise rankings of locations and use stochastic gradient descent to learn the optimal parameters. Next Hit Predictor incorporates both spatial-temporal features and geographical characteristics of prior crime locations in the series. Next Hit Predictor has demonstrated promising results on decades' worth of serial crime data collected by the Crime Analysis Unit of the Cambridge Police Department in Massachusetts, USA.

* AI for Social Good Workshop NIPS2018
Click to Read Paper
We consider the problem of decomposition of multiway tensor with binary entries. Such data problems arise frequently in numerous applications such as neuroimaging, recommendation system, topic modeling, and sensor network localization. We propose that the observed binary entries follow a Bernoulli model, develop a rank-constrained likelihood-based estimation procedure, and obtain the theoretical accuracy guarantees. Specifically, we establish the error bound of the tensor estimation, and show that the obtained rate is minimax optimal under the considered model. We demonstrate the efficacy of our approach through both simulations and analyses of multiple real-world datasets on the tasks of tensor completion and clustering.

* 22 pages, 5 figures
Click to Read Paper
Advanced travel information and warning, if provided accurately, can help road users avoid traffic congestion through dynamic route planning and behavior change. It also enables traffic control centres mitigate the impact of congestion by activating Intelligent Transport System (ITS) proactively. Deep learning has become increasingly popular in recent years, following a surge of innovative GPU technology, high-resolution, big datasets and thriving machine learning algorithms. However, there are few examples exploiting this emerging technology to develop applications for traffic prediction. This is largely due to the difficulty in capturing random, seasonal, non-linear, and spatio-temporal correlated nature of traffic data. In this paper, we propose a data-driven modelling approach with a novel hierarchical D-CLSTM-t deep learning model for short-term traffic speed prediction, a framework combined with convolutional neural network (CNN) and long short-term memory (LSTM) models. A deep CNN model is employed to learn the spatio-temporal traffic patterns of the input graphs, which are then fed into a deep LSTM model for sequence learning. To capture traffic seasonal variations, time of the day and day of the week indicators are fused with trained features. The model is trained end-to-end to predict travel speed in 15 to 90 minutes in the future. We compare the model performance against other baseline models including CNN, LGBM, LSTM, and traditional speed-flow curves. Experiment results show that the D-CLSTM-t outperforms other models considerably. Model tests show that speed upstream also responds sensibly to a sudden accident occurring downstream. Our D-CLSTM-t model framework is also highly scalable for future extension such as for network-wide traffic prediction, which can also be improved by including additional features such as weather, long term seasonality and accident information.

* 17 pages, 10 Figures, 4 Tables; To be presented in European Transport Conference in Dublin, Oct 2018
Click to Read Paper
Stochastic configuration networks (SCNs) as a class of randomized learner model have been successfully employed in data analytics due to its universal approximation capability and fast modelling property. The technical essence lies in stochastically configuring hidden nodes (or basis functions) based on a supervisory mechanism rather than data-independent randomization as usually adopted for building randomized neural networks. Given image data modelling tasks, the use of one-dimensional SCNs potentially demolishes the spatial information of images, and may result in undesirable performance. This paper extends the original SCNs to two-dimensional version, termed 2DSCNs, for fast building randomized learners with matrix-inputs. Some theoretical analyses on the goodness of 2DSCNs against SCNs, including the complexity of the random parameter space, and the superiority of generalization, are presented. Empirical results over one regression, four benchmark handwritten digits classification, and two human face recognition datasets demonstrate that the proposed 2DSCNs perform favourably and show good potential for image data analytics.

* This paper has been submitted to IEEE Transactions on Cybernetics, on August 30, 2018
Click to Read Paper
Policy optimization on high-dimensional continuous control tasks exhibits its difficulty caused by the large variance of the policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator which incorporates the Rao-Blackwell theorem (RB) and Control Variates (CV) into a unified framework to reduce the variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure among the action space based on the second-order advantage information. POSA captures the quadratic information explicitly and efficiently by utilizing the wide & deep architecture. Empirical studies show that our proposed approach demonstrates the performance improvements on high-dimensional synthetic settings and OpenAI Gym's MuJoCo continuous control tasks.

* Accepted by 2018 International Joint Conference on Artificial Intelligence
Click to Read Paper
We propose Image-Semantic-Transformation-Reconstruction-Circle(ISTRC) model, a novel and powerful method using facenet's Euclidean latent space to understand the images. As the name suggests, ISTRC construct the circle, able to perfectly reconstruct images. One powerful Euclidean latent space embedded in ISTRC is FaceNet's last layer with the power of distinguishing and understanding images. Our model will reconstruct the images and manipulate Euclidean latent vectors to achieve semantic transformations and semantic images arthimetic calculations. In this paper, we show that ISTRC performs 10 high-level semantic transformations like "Male and female","add smile","open mouth", "deduct beard or add mustache", "bigger/smaller nose", "make older and younger", "bigger lips", "bigger eyes", "bigger/smaller mouths" and "more attractive". It just takes 3 hours(GTX 1080) to train the models of 10 semantic transformations.

* ECCV 2018 submission, 14 pages
Click to Read Paper