We present the KOKO system, which takes declarative information extraction to a new level by incorporating advances in natural language processing into its extraction language. KOKO is novel in that its extraction language simultaneously supports conditions on the surface text and on the structure of the dependency parse trees of sentences, thereby allowing for more refined extractions. KOKO also supports conditions that are forgiving to linguistic variation in how concepts are expressed, and it allows evidence to be aggregated from the entire document in order to filter extractions. To scale up, KOKO exploits a multi-indexing scheme and heuristics for efficient extraction. We evaluate KOKO extensively over publicly available text corpora and show that KOKO's indices take up the smallest amount of space and are notably faster and more effective than a number of prior indexing schemes. Finally, we demonstrate KOKO's scalability on a corpus of 5 million Wikipedia articles.
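KOKO's own extraction language is not reproduced in this abstract, but the core idea of pairing a surface condition with a dependency-parse condition can be sketched with an off-the-shelf parser. The snippet below uses spaCy as a stand-in (not KOKO itself); the "founder" example and the function name are illustrative only.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_founders(text):
    """Keep a sentence only if it passes a surface condition (contains the
    word 'founded'), then pull out spans satisfying a parse-tree condition
    (a nominal subject whose head lemma is 'found')."""
    hits = []
    for sent in nlp(text).sents:
        if "founded" not in sent.text:                  # surface condition
            continue
        for tok in sent:                                # dependency-tree condition
            if tok.dep_ == "nsubj" and tok.head.lemma_ == "found":
                hits.append(" ".join(t.text for t in tok.subtree))
    return hits

print(extract_founders("Larry Page founded Google in 1998."))
```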

We propose a simple algorithm to train stochastic neural networks to draw samples from given target distributions for probabilistic inference. Our method is based on iteratively adjusting the neural network parameters so that the output changes along a Stein variational gradient direction (Liu & Wang, 2016) that maximally decreases the KL divergence with the target distribution. Our method works for any target distribution specified by its unnormalized density function, and can train any black-box architecture that is differentiable in the parameters we want to adapt. We demonstrate our method with a number of applications, including variational autoencoders (VAEs) with expressive encoders that model complex latent-space structures, and hyper-parameter learning for MCMC samplers that allows Bayesian inference to adaptively improve itself as more data is seen.
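The paper amortizes this update into a neural sampler, but the underlying Stein variational gradient direction of Liu & Wang (2016) can be written directly for a set of particles. The sketch below uses an RBF kernel with a fixed bandwidth; the bandwidth, step size, and the toy Gaussian target are illustrative choices, not the paper's settings.

```python
import numpy as np

def svgd_step(x, grad_logp, h=1.0, stepsize=1e-2):
    """One Stein variational gradient update on particles x of shape (n, d);
    grad_logp holds the gradient of the log target density at each particle."""
    diff = x[:, None, :] - x[None, :, :]                       # (n, n, d): x_j - x_i
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))     # RBF kernel values
    grad_k = -diff / h ** 2 * k[:, :, None]                    # grad of k w.r.t. x_j
    # phi(x_i) = mean_j [ k(x_j, x_i) grad_logp(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (k @ grad_logp + grad_k.sum(axis=0)) / x.shape[0]
    return x + stepsize * phi

# toy target: standard normal, so grad log p(x) = -x
x = np.random.randn(100, 1) * 3 + 2
for _ in range(500):
    x = svgd_step(x, grad_logp=-x)
```

In the amortized setting described in the abstract, this phi would instead be propagated back through the network that produced the particles in order to update its parameters.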

* Accepted by UAI 2017
The eigendecomposition of nearest-neighbor (NN) graph Laplacian matrices is the main computational bottleneck in spectral clustering. In this work, we introduce a highly scalable, spectrum-preserving graph sparsification algorithm that builds ultra-sparse NN (u-NN) graphs with guaranteed preservation of the original graph spectrum, such as the first few eigenvectors of the original graph Laplacian. Our approach immediately leads to scalable spectral clustering of large data networks without sacrificing solution quality. The proposed method starts by constructing low-stretch spanning trees (LSSTs) from the original graphs, and then iteratively recovers small portions of "spectrally critical" off-tree edges to the LSSTs by leveraging a spectral off-tree embedding scheme. To determine the suitable amount of off-tree edges to recover, an eigenvalue stability checking scheme is proposed, which robustly preserves the first few Laplacian eigenvectors in the sparsified graph. Additionally, an incremental graph densification scheme is proposed to identify extra edges that are missing from the original NN graphs but can still play important roles in spectral clustering tasks. Our experimental results for a variety of well-known data sets show that the proposed method can dramatically reduce the complexity of NN graphs, leading to significant speedups in spectral clustering.
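For context, the bottleneck the sparsification targets is the eigendecomposition inside standard spectral clustering. A minimal sketch of that pipeline, applied to any (original or sparsified) NN-graph adjacency matrix, is shown below; the unnormalized Laplacian and plain k-means are simplifications, not the paper's exact setup.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_clusters(adjacency, k):
    """Cluster graph nodes: build the Laplacian, take its first k eigenvectors
    (the expensive step), and run k-means on the spectral embedding."""
    adjacency = sp.csr_matrix(adjacency)
    degrees = np.asarray(adjacency.sum(axis=1)).ravel()
    laplacian = sp.diags(degrees) - adjacency
    _, vecs = eigsh(laplacian, k=k, which="SA")   # k smallest eigenpairs
    return KMeans(n_clusters=k, n_init=10).fit_predict(vecs)
```

The sparser the adjacency matrix passed in, the cheaper the `eigsh` call, which is exactly the saving the proposed u-NN graphs aim for.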

Voxels are an important format for representing geometric data and have been widely used for 3D deep learning in shape analysis due to their generalization ability and regular data format. However, fine-grained tasks like part segmentation require detailed structural information, which increases voxel resolution and thus causes other issues such as the exhaustion of computational resources. In this paper, we propose a novel volumetric convolutional neural network that can extract discriminative features encoding detailed information from voxelized 3D data under a limited resolution. To this end, a spatial dense extraction (SDE) module is designed to preserve the spatial resolution during feature extraction, alleviating the loss of detail caused by sub-sampling operations such as max-pooling. An attention feature aggregation (AFA) module is also introduced to adaptively select informative features from different abstraction scales, leading to segmentation with both semantic consistency and high accuracy of details. Experimental results on a large-scale dataset demonstrate the effectiveness of our method for 3D shape part segmentation.
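The SDE and AFA modules are only named in this abstract, so the PyTorch sketch below is one plausible reading rather than the paper's architecture: stride-1 dilated 3-D convolutions keep the voxel resolution, and a learned per-scale gate aggregates features from different abstraction scales.

```python
import torch
import torch.nn as nn

class DilatedVoxelBlock(nn.Module):
    """Stride-1 dilated 3-D conv: grows the receptive field without
    sub-sampling, so the spatial resolution of the voxel grid is preserved."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(x))

class GatedAggregation(nn.Module):
    """Weight feature maps from different scales with a learned softmax gate
    before summing them (a simplified stand-in for AFA)."""
    def __init__(self, n_scales):
        super().__init__()
        self.gates = nn.Parameter(torch.ones(n_scales))

    def forward(self, feats):            # list of (B, C, D, H, W) tensors
        w = torch.softmax(self.gates, dim=0)
        return sum(wi * f for wi, f in zip(w, feats))
```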

* 11 pages, 10 figures
Given two possible treatments, there may exist subgroups that benefit more from one treatment than the other. This problem is relevant to the field of marketing, where treatments may correspond to different ways of selling a product. It is similarly relevant to the field of public policy, where treatments may correspond to specific government programs. And finally, personalized medicine is a field wholly devoted to understanding which subgroups of individuals will benefit from particular medical treatments. We present a computationally fast tree-based method, ABtree, for treatment effect differentiation. Unlike other methods, ABtree specifically produces decision rules for optimal treatment assignment on a per-individual basis. The treatment choices are selected to maximize the overall occurrence of a desired binary outcome, conditional on a set of covariates. In this poster, we present the methodology for tree growth and pruning, and show performance results on simulated as well as real data.
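ABtree's growth and pruning rules are not detailed in the abstract; the snippet below only illustrates the kind of per-node decision rule it produces, assuming a binary treatment and a binary outcome. The fallback for an empty treatment arm is an arbitrary choice made for the sketch.

```python
import numpy as np

def node_treatment_rule(outcomes, treatments):
    """Within one leaf, assign the treatment (A=0 or B=1) whose observed
    success rate is higher; every individual falling in this leaf receives it."""
    rates = []
    for t in (0, 1):
        mask = treatments == t
        rates.append(outcomes[mask].mean() if mask.any() else 0.0)
    return int(rates[1] > rates[0])

# toy usage: treatment B looks better in this synthetic leaf
y = np.array([1, 0, 0, 1, 1, 0])
t = np.array([0, 0, 0, 1, 1, 1])
print(node_treatment_rule(y, t))   # -> 1
```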

In this paper we study the problem of optimally laying out an offshore wind farm to minimize wake-effect impacts. Considering the specific requirements of the offshore wind farm concerned, we propose an adaptive genetic algorithm (AGA) that introduces location swaps in place of the random crossovers of conventional GAs, so that the total number of turbines in the resulting layout is effectively kept at the initially specified value. We test the proposed AGA method on three cases, with free wind speeds of 12 m/s and 20 m/s and a typical offshore wind distribution, respectively. Numerical results verify the effectiveness of our proposed algorithm, which converges much faster than conventional GA algorithms.
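The key operator, replacing crossover with a location swap so the turbine count never drifts from its specified value, can be sketched in a few lines. The grid encoding (a flat 0/1 list of candidate turbine positions) is an assumption for illustration, not necessarily the paper's representation.

```python
import random

def location_swap(layout):
    """Move one turbine from an occupied cell to an empty cell; the number
    of 1s (turbines) in the layout stays unchanged."""
    layout = layout[:]
    occupied = [i for i, v in enumerate(layout) if v == 1]
    empty = [i for i, v in enumerate(layout) if v == 0]
    i, j = random.choice(occupied), random.choice(empty)
    layout[i], layout[j] = 0, 1
    return layout

# toy usage on a 10-cell strip with 3 turbines
layout = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
new_layout = location_swap(layout)
print(new_layout, sum(new_layout))   # turbine count stays 3
```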

In this paper, a unified susceptible-exposed-infected-susceptible-aware (SEIS-A) framework is proposed to combine epidemic spreading with individuals' online self-consultation behavior. An epidemic spreading prediction model is established based on the SEIS-A framework. The prediction process has two phases. In phase I, the time series of disease density is decomposed by the empirical mode decomposition (EMD) method to obtain intrinsic mode functions (IMFs). In phase II, ensemble learning techniques that use the online query data as an additional input are applied to these IMFs. Finally, experiments predicting weekly consultation rates of hand-foot-and-mouth disease (HFMD) in Hong Kong are conducted to validate the effectiveness of the proposed method. The main advantage of this method is that it outperforms other methods on fluctuating, complex data.
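A compressed sketch of the two phases follows, assuming the PyEMD package for the decomposition and a gradient-boosting regressor as the ensemble learner (the abstract does not name a specific learner); the lag-1 feature construction and one-step-ahead setup are simplifications.

```python
import numpy as np
from PyEMD import EMD
from sklearn.ensemble import GradientBoostingRegressor

def emd_ensemble_forecast(series, query_series):
    """Phase I: decompose the disease-density series into IMFs.
    Phase II: fit one learner per IMF on lagged IMF values plus the online
    query signal, then sum the per-IMF one-step-ahead forecasts."""
    series = np.asarray(series, dtype=float)
    query_series = np.asarray(query_series, dtype=float)
    imfs = EMD().emd(series)
    forecast = 0.0
    for imf in imfs:
        X = np.column_stack([imf[:-1], query_series[:-1]])
        y = imf[1:]
        model = GradientBoostingRegressor().fit(X, y)
        forecast += model.predict(
            np.column_stack([imf[-1:], query_series[-1:]]))[0]
    return forecast
```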

* Some issues need to be addressed in this manuscript
The self-attention network, an attention-based feedforward neural network, has recently shown the potential to replace recurrent neural networks (RNNs) in a variety of NLP tasks. However, it is not clear whether the self-attention network can be a good alternative to RNNs in automatic speech recognition (ASR), which processes longer sequences and may have online recognition requirements. In this paper, we present an RNN-free end-to-end model, the self-attention aligner (SAA), which applies self-attention networks to a simplified recurrent neural aligner (RNA) framework. We also propose a chunk-hopping mechanism, which enables the SAA model to encode segmented frame chunks one after another in order to support online recognition. Experiments on two Mandarin ASR datasets show that replacing RNNs with self-attention networks yields an 8.4%-10.2% relative character error rate (CER) reduction. In addition, the chunk-hopping mechanism allows the SAA to incur only a 2.5% relative CER degradation with a 320 ms latency. After joint training with a self-attention network language model, our SAA model obtains further error rate reductions on multiple datasets. In particular, it achieves 24.12% CER on the Mandarin ASR benchmark (HKUST), exceeding the best end-to-end model by over 2% absolute CER.
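The chunk-hopping idea, encoding overlapping chunks of frames one after another so recognition can begin before the whole utterance arrives, can be sketched in a few lines; the chunk and hop sizes below are illustrative, not the 320 ms configuration reported in the abstract.

```python
def chunk_hop(frames, chunk_size=64, hop_size=48):
    """Yield overlapping chunks of acoustic frames; consecutive chunks share
    chunk_size - hop_size frames of context, and the encoder processes them
    one after another for online recognition."""
    for start in range(0, len(frames), hop_size):
        yield frames[start:start + chunk_size]

# toy usage: 200 frames -> chunks starting at 0, 48, 96, 144, 192
chunks = list(chunk_hop(list(range(200))))
print([len(c) for c in chunks])
```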

* To appear at ICASSP 2019
Deep network models, the majority of which are built on neural networks, have proved to be a powerful framework for representing complex data in high-performance machine learning. In recent years, more and more studies have turned to non-neural-network approaches to build diverse deep structures, and the Deep Stacking Network (DSN) is one such approach, using stacked easy-to-learn blocks to build a deep network whose parameter training can be parallelized. In this paper, we propose a novel SVM-based Deep Stacking Network (SVM-DSN), which uses the DSN architecture to organize linear SVM classifiers for deep learning. A BP-like layer tuning scheme is also proposed to ensure holistic and local optimization of the stacked SVMs simultaneously. Our model brings some of the good mathematical properties of SVMs, such as convex optimization, into the DSN framework. From a global view, SVM-DSN can iteratively extract data representations layer by layer like a deep neural network, but with parallelizability; from a local view, each stacked SVM can converge to its optimal solution and obtain its support vectors, which, compared with neural networks, leads to interesting improvements in anti-saturation behavior and interpretability. Experimental results on both image and text data sets demonstrate the excellent performance of SVM-DSN compared with competitive benchmark models.
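The DSN-style stacking of linear SVMs can be sketched with scikit-learn: each block sees the raw features concatenated with the decision values of all earlier blocks. The BP-like layer tuning described in the abstract is omitted here, so this is only the plain stacked construction.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_svm_dsn(X, y, n_blocks=3):
    """Train a stack of linear SVM blocks; block t's input is the original
    feature matrix plus the decision values of blocks 0..t-1."""
    blocks, feats = [], X
    for _ in range(n_blocks):
        svm = LinearSVC().fit(feats, y)
        blocks.append(svm)
        feats = np.column_stack([feats, svm.decision_function(feats)])
    return blocks

def predict_svm_dsn(blocks, X):
    """Rebuild the stacked features for new data, then let the top block predict."""
    feats = X
    for svm in blocks[:-1]:
        feats = np.column_stack([feats, svm.decision_function(feats)])
    return blocks[-1].predict(feats)
```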

In this paper, we present a method that uses AI techniques to solve an application of pure mathematics: finding narrow admissible tuples. The original problem is formulated as a combinatorial optimization problem. In particular, we show how to exploit the local search structure to shape the problem landscape, dramatically reducing the search space and non-trivially eliminating search barriers, and then to realize intelligent search strategies for effectively escaping local minima. Experimental results demonstrate that the proposed method is able to efficiently find best known solutions. This research sheds light on exploiting local problem structure for efficient search in combinatorial landscapes, as an application of AI to a new problem domain.

* International Joint Conference on Artificial Intelligence, New York, 2016, pp.2683-2689
* 7 pages, 2 figures, conference
We present CGO-AS, a generalized ant system (AS) implemented in the framework of Cooperative Group Optimization (CGO), to show the optimization leverage of mixed individual and social learning. The ant colony is a simple yet efficient natural system for understanding the effects of primary intelligence on optimization. However, existing AS algorithms mostly focus on the ants' capability of using social heuristic cues while ignoring their individual learning. CGO can integrate the advantages of a cooperative group and a low-level algorithm-portfolio design, and CGO agents can perform both individual and social search. In CGO-AS, each ant (agent) is augmented with an individual memory and implements a novel search strategy that uses individual and social cues in a controlled proportion. The presented CGO-AS is therefore especially useful for exposing the power of mixed individual and social learning in improving optimization. Optimization performance is tested on instances of the Traveling Salesman Problem (TSP). The results show that a cooperative ant group using both individual and social learning obtains a better performance than systems using either individual or social learning alone. The best performance is achieved when agents use individual memory as their primary information source and simultaneously use social memory as their search guidance. Compared with existing AS systems, CGO-AS retains a faster learning speed toward higher-quality solutions, especially in the later learning cycles. The optimization leverage of CGO-AS is most likely due to its inherent feature of adaptively maintaining population diversity in the agents' individual memories and of accelerating the learning process with the knowledge accumulated in the social memory.

* Applied Soft Computing, 50: 223-234, 2017
In the ECOC framework, the ternary coding strategy is widely deployed in the coding process. It relabels classes with {-1, 0, 1}, where -1/1 assigns the corresponding classes to the negative/positive group, and 0 means the corresponding classes are ignored during training. However, using such hard labels may lose information about the tendency of the class distributions. Instead, we propose a centroid-distance-based soft coding scheme, named CSECOC, to capture this tendency. In our algorithm, Sequential Forward Floating Selection (SFFS) is applied to search for an optimal class assignment by minimizing the ratio of intra-group to inter-group distance, generating an initial hard coding matrix. We then propose a measure, named coverage, that describes the probability of a sample from a class falling into the correct group. The coverage of a class in a group replaces the corresponding hard element, forming a soft coding matrix. Compared with hard elements, such soft elements can reflect the tendency of a class to belong to the positive or negative group. Regressors, rather than classifiers, are used as base learners in this algorithm. To the best of our knowledge, this is the first time a soft coding scheme has been proposed. Results on five UCI datasets show that, compared with some state-of-the-art ECOC algorithms, our algorithm can produce comparable or better classification accuracy with small-scale ensembles.
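The coverage measure itself is defined in the paper; purely as a rough illustration of the centroid-distance "tendency" idea, the snippet below turns a hard +1/-1 entry into a soft value based on how much closer a class centroid lies to the positive group's centroid than to the negative group's. The formula is a stand-in, not CSECOC's definition.

```python
import numpy as np

def soft_entry(class_X, pos_centroid, neg_centroid):
    """Soft coding entry in [-1, 1]: close to +1 if the class centroid lies
    near the positive group's centroid, close to -1 if near the negative one."""
    c = class_X.mean(axis=0)
    d_pos = np.linalg.norm(c - pos_centroid)
    d_neg = np.linalg.norm(c - neg_centroid)
    return (d_neg - d_pos) / (d_neg + d_pos)
```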

We propose to prune a random forest (RF) for resource-constrained prediction. We first construct an RF and then prune it to optimize expected feature cost and accuracy. We pose RF pruning as a novel 0-1 integer program with linear constraints that encourage feature re-use. We establish total unimodularity of the constraint set to prove that the corresponding LP relaxation solves the original integer program. We then exploit connections to combinatorial optimization and develop an efficient primal-dual algorithm that scales to large datasets. In contrast to our bottom-up approach, which benefits from a good RF initialization, conventional methods are top-down, acquiring features based on their utility value, and are generally intractable, requiring heuristics. Empirically, our pruning algorithm outperforms existing state-of-the-art resource-constrained algorithms.

This paper addresses how to more effectively coordinate depth with RGB in order to boost the performance of RGB-D object detection. In particular, we investigate two primary ideas under the CNN model: property derivation and property fusion. First, we propose that depth can be utilized not only as extra information alongside RGB but also to derive additional visual properties that comprehensively describe the objects of interest. We therefore construct a two-stage learning framework consisting of property derivation and fusion, where properties can be derived either from the provided color/depth or from their pairs (e.g., the geometric contour adopted in this paper). Second, we explore how different properties should be fused during feature learning, which, under the CNN model, boils down to deciding at which layer the properties should be combined. Our analysis shows that different semantic properties should be learned separately and combined only before being passed into the final classifier. Such a detection scheme is in fact consistent with the mechanism of the primary visual cortex (V1) in the brain. We experimentally evaluate the proposed method on a challenging dataset and achieve state-of-the-art performance.
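The conclusion that properties should be learned in separate streams and combined only before the final classifier can be sketched as a late-fusion network in PyTorch; the per-stream architecture and single-channel property maps below are illustrative, not the paper's network.

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """One conv stream per property (e.g. RGB, depth, contour); streams are
    concatenated only at the input of the final classifier."""
    def __init__(self, n_props, n_classes, feat_dim=64):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(16, feat_dim), nn.ReLU())
            for _ in range(n_props)])
        self.classifier = nn.Linear(n_props * feat_dim, n_classes)

    def forward(self, xs):               # xs: list of (B, 1, H, W) property maps
        fused = torch.cat([s(x) for s, x in zip(self.streams, xs)], dim=1)
        return self.classifier(fused)
```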

* 9 pages, 3 figures, and 4 tables. Accepted by CVPR2016 Workshops
Recurrent neural networks (RNNs) can be used to solve sequence-to-sequence problems, where both the input and the output have sequential structure. Usually there are implicit relations between the structures, but it is hard for the common RNN model to fully explore the relations between the sequences. In this survey, we introduce attention-based RNN models, which can focus on different parts of the input for each output item in order to explore and exploit the implicit relations between input and output items. The different attention mechanisms are described in detail. We then introduce some computer-vision applications that use attention-based RNN models. The superiority of attention-based RNN models is shown by experimental results. Finally, some future research directions are given.
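As a concrete reference for the mechanism surveyed here, a minimal additive (Bahdanau-style) attention step is sketched below in NumPy; the weight matrices Wd, We and vector v would normally be learned jointly with the RNN rather than passed in.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, Wd, We, v):
    """Score each encoder hidden state against the current decoder state,
    normalize the scores into attention weights, and return the weighted
    context vector used to produce the next output item."""
    encoder_states = np.asarray(encoder_states)
    scores = np.array([v @ np.tanh(Wd @ decoder_state + We @ h)
                       for h in encoder_states])
    weights = softmax(scores)
    context = (weights[:, None] * encoder_states).sum(axis=0)
    return context, weights
```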

We consider the problem of learning decision rules for prediction under a feature budget constraint. In particular, we are interested in pruning an ensemble of decision trees to reduce expected feature cost while maintaining high prediction accuracy for any test example. We propose a novel 0-1 integer program formulation for ensemble pruning. Our pruning formulation is general: it takes any ensemble of decision trees as input. By explicitly accounting for feature sharing across trees together with the accuracy/cost trade-off, our method is able to significantly reduce feature cost by pruning subtrees that introduce more loss in feature cost than gain in prediction accuracy. Theoretically, we prove that a linear programming relaxation produces the exact solution of the original integer program. This allows us to use efficient convex optimization tools to obtain an optimally pruned ensemble for any given budget. Empirically, we see that our pruning algorithm significantly improves the performance of the state-of-the-art ensemble method BudgetRF.

We seek decision rules for prediction-time cost reduction, where complete data is available for training but, at prediction time, each feature can only be acquired at an additional cost. We propose a novel random forest algorithm to minimize prediction error for a user-specified average feature acquisition budget. While random forests yield strong generalization performance, they do not explicitly account for feature costs and furthermore require low correlation among trees, which amplifies costs. Our random forest grows trees with low acquisition cost and high strength based on greedy minimax cost-weighted-impurity splits. Theoretically, we establish near-optimal acquisition cost guarantees for our algorithm. Empirically, we demonstrate superior accuracy-cost curves against state-of-the-art prediction-time algorithms on a number of benchmark datasets.
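The exact minimax cost-weighted-impurity criterion is defined in the paper; the sketch below shows one hedged reading of that kind of split score, penalizing the worse child's Gini impurity by the acquisition cost of the splitting feature, with lam as a trade-off knob introduced here for illustration.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(y_left, y_right, feature_cost, lam=1.0):
    """Lower is better: the worse (max) child impurity plus a penalty
    proportional to the cost of acquiring the splitting feature."""
    return max(gini(y_left), gini(y_right)) + lam * feature_cost
```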

We propose novel methods for the max-cost Discrete Function Evaluation Problem (DFEP) under budget constraints. We are motivated by applications such as clinical diagnosis, where a patient is subjected to a sequence of (possibly expensive) tests before a decision is made. Our goal is to develop strategies for minimizing max-costs. The problem is known to be NP-hard, and greedy methods based on specialized impurity functions have been proposed. We develop a broad class of admissible impurity functions, admitting monomials, classes of polynomials, and hinge-loss functions, that allow for flexible impurity design with provably optimal approximation bounds. This flexibility is important for datasets where max-cost can be overly sensitive to "outliers": outliers bias max-cost toward a few examples that require a large number of tests for classification. We design admissible functions that allow for an accuracy-cost trade-off and result in $O(\log n)$ guarantees on the optimal cost among trees with corresponding classification accuracy levels.

Binary codes have been widely used in vision problems as a compact feature representation that achieves both space and time advantages. Various methods have been proposed to learn data-dependent hash functions that map a feature vector to a binary code. However, considerable information is inevitably lost during the binarization step, which also causes ambiguity when measuring sample similarity with the Hamming distance. Besides, the learned hash functions cannot be changed after training, making them incapable of adapting to new data outside the training set. To address both issues, in this paper we propose a flexible bitwise weight learning framework based on the binary codes obtained by state-of-the-art hashing methods, and we incorporate the learned weights into the weighted Hamming distance computation. We then formulate the framework as a ranking problem and leverage the Ranking SVM model to learn the weights offline. The framework is further extended to an online mode that updates the weights each time new data arrives, making it scalable to large and dynamic data sets. Extensive experimental results demonstrate significant performance gains from using binary codes with bitwise weighting in image retrieval tasks. It is appealing that the online weight learning achieves accuracy comparable to its offline counterpart, which makes our approach practical for realistic applications.
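The weighted Hamming distance at the core of the framework is easy to write down: sum the learned bitwise weights over the bit positions where two codes disagree. The toy below uses random 32-bit codes and uniform weights purely for illustration (uniform weights reduce it to the plain Hamming distance).

```python
import numpy as np

def weighted_hamming(query_code, db_codes, weights):
    """Distance from one query code to every database code: learned bit
    weights summed over the mismatching bit positions."""
    mismatches = (query_code[None, :] != db_codes).astype(float)  # (n, bits)
    return mismatches @ weights                                   # (n,) distances

# toy usage: rank 1000 random 32-bit codes against a random query
rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(1000, 32))
q = rng.integers(0, 2, size=32)
w = np.ones(32)
ranking = np.argsort(weighted_hamming(q, db, w))
```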

This paper describes our solution to the ICML multi-modal learning challenge. The solution constructs three levels of representation in three consecutive stages and chooses the correct tag words with a data-specific strategy. First, we use standard methods to obtain level-1 representations: each image is represented using MPEG-7 and gist descriptors along with additional features released by the contest organizers, and the corresponding word tags are represented by a bag-of-words model with a dictionary of 4000 words. Second, we learn level-2 representations using two stacked RBMs for each modality. Third, we propose a bimodal auto-encoder to learn the similarities/dissimilarities between image-tag pairs as level-3 representations. Finally, during the test phase, based on one observation about the dataset, we devise a data-specific strategy to choose the correct tag words, leading to a leap in overall performance. Our final average accuracy on the private test set is 100%, which ranked first in this challenge.

* 6 pages, 1 figure, Presented at the Workshop on Representation Learning, ICML 2013