Transfer learning has been developed to improve the performance of different but related tasks in machine learning. However, such processes become less efficient as the size of the training data and the number of tasks grow. Moreover, privacy can be violated, as some tasks may contain sensitive and private data that are communicated between nodes and tasks. We propose a consensus-based distributed transfer learning framework in which several tasks aim to find the best linear support vector machine (SVM) classifiers in a distributed network. With the alternating direction method of multipliers (ADMM), tasks can achieve better classification accuracy more efficiently and privately, as each node and each task train on their own data, and only decision variables are transferred between different tasks and nodes. Numerical experiments on the MNIST dataset show that the knowledge transferred from the source tasks can be used to decrease the risks of target tasks that lack training data or have unbalanced training labels. We show that the risks of the target tasks in nodes without data from the source tasks can also be reduced using the information transferred from the nodes that contain the source-task data. We also show that target tasks can enter and leave in real time without rerunning the whole algorithm.
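
As a rough illustration of the kind of scheme described above, here is a minimal consensus-ADMM sketch for a distributed linear SVM. It is not the paper's algorithm: the squared-hinge loss (chosen so the local subproblem is smooth), the penalty `rho`, and the synthetic data are all illustrative assumptions. Note that each node optimizes over its own data and only the decision variables `w_i`, `z`, `u_i` are exchanged.

```python
# Minimal sketch (assumptions noted above): consensus ADMM for a distributed
# linear SVM. Only decision variables are communicated, never the raw data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
nodes, d, n_local, rho, C = 4, 2, 50, 1.0, 1.0

# Synthetic local datasets: two Gaussian blobs per node.
data = []
for _ in range(nodes):
    X = np.vstack([rng.normal(+1, 1, (n_local, d)),
                   rng.normal(-1, 1, (n_local, d))])
    y = np.hstack([np.ones(n_local), -np.ones(n_local)])
    data.append((X, y))

def local_objective(w, X, y, z, u):
    slack = np.maximum(0.0, 1.0 - y * (X @ w))        # squared-hinge loss
    return C * np.sum(slack ** 2) + 0.5 * rho * np.sum((w - z + u) ** 2)

z = np.zeros(d)
U = [np.zeros(d) for _ in range(nodes)]
W = [np.zeros(d) for _ in range(nodes)]

for it in range(30):
    # Local primal updates: each node uses only its own data.
    for i, (X, y) in enumerate(data):
        W[i] = minimize(local_objective, W[i], args=(X, y, z, U[i])).x
    # Consensus update with an L2 regularizer (1/2)||z||^2 on the global classifier.
    z = rho * sum(W[i] + U[i] for i in range(nodes)) / (1.0 + rho * nodes)
    # Dual updates.
    for i in range(nodes):
        U[i] = U[i] + W[i] - z

acc = np.mean([np.mean(np.sign(X @ z) == y) for X, y in data])
print(f"consensus classifier z = {z}, average accuracy = {acc:.3f}")
```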

Distributed Support Vector Machines (DSVM) have been developed to solve large-scale classification problems in networked systems with a large number of sensors and control units. However, such systems become more vulnerable as detection and defense grow increasingly difficult and expensive. This work aims to develop secure and resilient DSVM algorithms for adversarial environments in which an attacker can manipulate the training data to achieve his objective. We establish a game-theoretic framework to capture the conflicting interests between an adversary and a set of distributed data processing units. The Nash equilibrium of the game allows one to predict the outcome of learning algorithms in adversarial environments and to enhance the resilience of machine learning through dynamic distributed learning algorithms. We prove that the convergence of the distributed algorithm is guaranteed without assumptions on the training data or network topologies. Numerical experiments are conducted to corroborate the results. We show that network topology plays an important role in the security of DSVM: networks with fewer nodes and higher average degrees are more secure. Moreover, a balanced network is found to be less vulnerable to attacks.

* arXiv admin note: text overlap with arXiv:1710.04677
With a large number of sensors and control units in networked systems, distributed support vector machines (DSVMs) play a fundamental role in scalable and efficient multi-sensor classification and prediction tasks. However, DSVMs are vulnerable to adversaries who can modify and generate data to deceive the system into misclassification and misprediction. This work aims to design defense strategies for the DSVM learner against a potential adversary. We establish a game-theoretic framework to capture the conflicting interests between the DSVM learner and the attacker. The Nash equilibrium of the game allows one to predict the outcome of learning algorithms in adversarial environments and to enhance the resilience of machine learning through dynamic distributed learning algorithms. We show that the DSVM learner is less vulnerable when he uses a balanced network with fewer nodes and higher degree. We also show that adding more training samples is an efficient defense strategy against an attacker. We present secure and resilient DSVM algorithms with verification and rejection methods, and demonstrate their resilience against an adversary with numerical experiments.

Natural-language-facilitated human-robot cooperation (NLC) refers to using natural language (NL) to facilitate interactive information sharing and task execution between robots and humans under a common goal constraint. Recently, NLC research has received increasing attention. Typical NLC scenarios include robotic daily assistance, robotic health caregiving, intelligent manufacturing, autonomous navigation, and robotic social companionship. However, a thorough review that reveals the latest methodologies for using NL to facilitate human-robot cooperation is missing. In this review, we present a comprehensive summary of methodologies for NLC. NLC research comprises three main focuses: NL instruction understanding, NL-based execution plan generation, and knowledge-world mapping. We provide in-depth analyses of the theoretical methods, applications, and model advantages and disadvantages. Based on our paper review and perspective, we summarize potential research directions for NLC.

* 13 pages, 9 figures
Natural-language-facilitated human-robot cooperation (NLC), in which natural language (NL) is used to share knowledge between a human and a robot for intuitive human-robot cooperation (HRC), has been developing continuously over the recent decade. Currently, NLC is used in several robotic domains such as manufacturing, daily assistance, and health caregiving. It is necessary to summarize current NLC-based robotic systems and discuss future development trends, providing helpful information for future NLC research. In this review, we first analyze the driving forces behind NLC research. With regard to a robot's cognition level during cooperation, we then categorize NLC implementations into four types {NL-based control, NL-based robot training, NL-based task execution, NL-based social companion} for comparison and discussion. Last, based on our perspective and a comprehensive paper review, we discuss future research trends.

* 21 pages, 10 figures
It is critical for advanced manufacturing machines to autonomously execute a task by following an end-user's natural language (NL) instructions. However, NL instructions are usually ambiguous and abstract, so the machines may misunderstand and incorrectly execute the task. To address this NL-based human-machine communication problem and enable machines to appropriately execute tasks by following the end-user's NL instructions, we developed a Machine-Executable-Plan-Generation (exePlan) method. The exePlan method conducts task-centered semantic analysis to extract task-related information from ambiguous NL instructions. In addition, the method specifies machine execution parameters to generate a machine-executable plan by interpreting abstract NL instructions. To evaluate the exePlan method, an industrial robot, Baxter, was instructed by NL to perform three types of industrial tasks {'drill a hole', 'clean a spot', 'install a screw'}. The experimental results show that the exePlan method is effective in generating machine-executable plans from the end-user's NL instructions. Such a method promises to endow a machine with the ability to execute NL-instructed tasks.
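
To make the idea concrete, the following toy sketch (not the exePlan implementation; the task list, parameter names, and defaults are hypothetical) shows how task-centered semantic analysis can extract a task type from an ambiguous instruction and fill unspecified execution parameters with defaults.

```python
# Toy illustration only: map an ambiguous NL instruction to a machine-executable
# plan by extracting the task keyword and filling in default parameters.
import re

TASK_DEFAULTS = {            # hypothetical machine execution parameters
    "drill":   {"tool": "drill_bit_5mm", "depth_mm": 10.0, "speed_rpm": 1200},
    "clean":   {"tool": "brush",         "passes": 3,      "force_n": 2.0},
    "install": {"tool": "screwdriver",   "torque_nm": 1.5},
}

def generate_plan(instruction: str) -> dict:
    text = instruction.lower()
    for task, params in TASK_DEFAULTS.items():
        if task in text:
            plan = {"task": task, **params}
            # Override defaults with any explicit numbers, e.g. "15 mm deep".
            m = re.search(r"(\d+(?:\.\d+)?)\s*mm", text)
            if m and task == "drill":
                plan["depth_mm"] = float(m.group(1))
            return plan
    raise ValueError("no known task found in instruction")

print(generate_plan("Please drill a hole 15 mm deep in the plate"))
# {'task': 'drill', 'tool': 'drill_bit_5mm', 'depth_mm': 15.0, 'speed_rpm': 1200}
```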

* 16 pages, 10 figures, article submitted to Robotics and Computer-Integrated Manufacturing, 2016 Aug
The recent developments of basis pursuit and compressed sensing seek to extract information from as few samples as possible. In such applications, since the number of samples is restricted, one should deploy the sampling points wisely. We are thus motivated to study the optimal distribution of finitely many sampling points. Formulation under the framework of optimal reconstruction yields a minimization problem. In the discrete case, we estimate the distance between the optimal subspace resulting from a general Karhunen-Loève transform and the kernel space, which yields another, computationally favorable algorithm. Numerical experiments are then presented to illustrate the performance of the algorithms in the search for optimal sampling points.

We propose a new sampler that integrates the protocol of parallel tempering with Nos\'e-Hoover (NH) dynamics. The proposed method can efficiently draw representative samples from complex posterior distributions with multiple isolated modes in the presence of noise arising from stochastic gradients. It potentially facilitates deep Bayesian learning on large datasets, where complex multimodal posteriors and mini-batch gradients are encountered.
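
A minimal sketch of this style of sampler follows, assuming a 1-D double-well target and illustrative constants (this is not the paper's exact dynamics): each replica runs Nos\'e-Hoover-style dynamics with a noisy gradient, with the thermostat variable absorbing the gradient noise, and neighboring replicas attempt Metropolis swaps so the cold chain can hop between modes.

```python
# Minimal sketch (illustrative constants, not the paper's sampler): parallel
# tempering over Nose-Hoover-style dynamics with a stochastic gradient.
import numpy as np

rng = np.random.default_rng(1)

def U(x):                       # double-well potential => bimodal target exp(-U)
    return (x ** 2 - 1.0) ** 2

def grad_U(x):                  # noisy gradient simulating a mini-batch estimate
    return 4.0 * x * (x ** 2 - 1.0) + 0.5 * rng.normal()

temps = np.array([1.0, 2.0, 4.0])    # temperature ladder
x = rng.normal(size=3)               # one replica per temperature
p = np.zeros(3)                      # momenta
xi = np.ones(3)                      # thermostat variables
eps, noise = 0.05, 0.1
samples = []

for step in range(20000):
    for k, T in enumerate(temps):
        # Nose-Hoover-style update: xi adaptively damps the momenta.
        p[k] += -eps * grad_U(x[k]) - eps * xi[k] * p[k] \
                + np.sqrt(2 * noise * eps) * rng.normal()
        x[k] += eps * p[k]
        xi[k] += eps * (p[k] ** 2 - T)    # drive kinetic energy toward T
    if step % 50 == 0:                    # attempt a neighbor swap
        k = rng.integers(0, len(temps) - 1)
        log_a = (1 / temps[k] - 1 / temps[k + 1]) * (U(x[k]) - U(x[k + 1]))
        if np.log(rng.uniform()) < log_a:
            x[k], x[k + 1] = x[k + 1], x[k]
    samples.append(x[0])                  # record the cold chain

print("cold-chain mean/std:", np.mean(samples), np.std(samples))
```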

We consider the matrix completion problem with a deterministic pattern of observed entries. In this setting, we aim to answer the question: under what conditions is there (at least locally) a unique solution to the matrix completion problem, i.e., when is the underlying true matrix identifiable? We answer the question from a certain point of view and outline a geometric perspective. We give an algebraically verifiable sufficient condition, which we call the well-posedness condition, for the local uniqueness of minimum-rank matrix completion (MRMC) solutions. We argue that this condition is necessary for the local stability of MRMC solutions, and we show that the condition is generic using the characteristic rank. We also argue that low-rank approximation approaches are more stable than MRMC and further propose a sequential statistical testing procedure to determine the "true" rank from observed entries. Finally, we provide numerical examples aimed at verifying the validity of the presented theory.
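
For intuition, here is a generic numerical tangent-space probe of local identifiability under a fixed observation pattern. It is emphatically not the paper's well-posedness condition, just a standard rank check: the rank-r manifold of n x m matrices has dimension r(n+m-r), and the true matrix is locally identifiable on that manifold only if the Jacobian of (U, V) -> P_Omega(U V^T) attains that rank.

```python
# Generic probe (NOT the paper's condition): check whether the observed-entry
# Jacobian of the factorization map reaches the manifold dimension r(n+m-r).
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 6, 6, 2
U0, V0 = rng.normal(size=(n, r)), rng.normal(size=(m, r))
Omega = rng.uniform(size=(n, m)) < 0.6   # a fixed pattern, drawn once here

obs = np.flatnonzero(Omega)
# Jacobian of (U, V) -> P_Omega(U V^T) at (U0, V0), column by column.
cols = []
for i in range(n):
    for k in range(r):
        dM = np.zeros((n, m)); dM[i, :] = V0[:, k]   # perturb U[i, k]
        cols.append(dM.ravel()[obs])
for j in range(m):
    for k in range(r):
        dM = np.zeros((n, m)); dM[:, j] = U0[:, k]   # perturb V[j, k]
        cols.append(dM.ravel()[obs])
J = np.column_stack(cols)

manifold_dim = r * (n + m - r)   # r(n+m) minus the r^2 gauge freedom
print("Jacobian rank:", np.linalg.matrix_rank(J), "needed:", manifold_dim)
```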

Recommendation systems based on image recognition could prove a vital tool in enhancing the experience of museum audiences. However, for practical systems utilizing wearable cameras, a number of challenges affect the quality of image recognition. In this pilot study, we focus on the recognition of museum collections using a wearable camera in three different museum spaces. We discuss the application of wearable cameras and the practical and technical challenges in devising a robust system that can recognize the artworks viewed by visitors to create a detailed record of their visit. Specifically, to illustrate the impact of different kinds of museum spaces on image recognition, we collect three training datasets of museum exhibits containing a variety of paintings, clocks, and sculptures. Subsequently, we equip selected visitors with wearable cameras to capture the artworks they view as they stroll through the exhibitions. We use Convolutional Neural Networks (CNNs) that are pre-trained on the ImageNet dataset and fine-tuned on each of the training sets for the purpose of artwork identification. In the testing stage, we use the CNNs to identify artworks captured by the visitors with a wearable camera. We analyze the recognition accuracy and provide insight into the applicability of such a system for further engaging audiences with museum exhibitions.
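
As a sketch of the generic recipe mentioned above (the backbone choice, class count, and hyperparameters are assumptions, not the study's exact pipeline), pre-training on ImageNet and fine-tuning on a museum training set amounts to swapping the classifier head and continuing training:

```python
# Minimal fine-tuning sketch: ImageNet-pretrained CNN adapted to one museum
# training set. Details here are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

num_artworks = 120                      # hypothetical number of exhibits
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_artworks)  # new classifier head

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step on a batch of exhibit photos."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch with ImageNet-sized inputs, standing in for wearable-camera frames.
print(train_step(torch.randn(8, 3, 224, 224), torch.randint(0, num_artworks, (8,))))
```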

* Museums and the Web, 2017
Vision-based text entry systems aim to help disabled people achieve text communication using eye movement. Most previous methods have employed an existing eye tracker to predict gaze direction and designed an input method based upon it. However, with these methods the eye-tracking quality is easily affected by various factors, and calibration takes a long time. Our paper presents a novel, efficient gaze-based text input method with the advantages of low cost and robustness. Users type words by looking at an on-screen keyboard and blinking. Rather than estimating gaze angles directly to track the eyes, we introduce a method that divides the human gaze into nine directions. This method can effectively improve the accuracy of making a selection by gaze and blinks. We build a Convolutional Neural Network (CNN) model for 9-direction gaze estimation. On the basis of the 9-direction gaze, we use the nine-key T9 input method that was widely used on candy-bar phones. Bar phones were very popular decades ago and have cultivated strong user habits and language models. To train a robust gaze estimator, we created a large-scale dataset with eye images sourced from 25 people. Our experiments show that the CNN model accurately estimates different people's gaze under various lighting conditions and on different devices. In consideration of disabled people's needs, we removed the complex calibration process. The input method can run in screen mode and in a portable off-screen mode. Moreover, the datasets used in our experiments are made available to the community to allow further experimentation.
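
A toy illustration of the T9 decoding behind such an interface (the word list is hypothetical and the paper's actual language model is not shown): a gaze direction selects a T9 key, a blink confirms it, and the resulting digit sequence is decoded against a dictionary.

```python
# Toy T9 decoder: standard digit-to-letter mapping from candy-bar phones.
T9 = {'a': '2', 'b': '2', 'c': '2', 'd': '3', 'e': '3', 'f': '3',
      'g': '4', 'h': '4', 'i': '4', 'j': '5', 'k': '5', 'l': '5',
      'm': '6', 'n': '6', 'o': '6', 'p': '7', 'q': '7', 'r': '7', 's': '7',
      't': '8', 'u': '8', 'v': '8', 'w': '9', 'x': '9', 'y': '9', 'z': '9'}

WORDS = ["hello", "help", "good", "home", "gone"]   # hypothetical vocabulary

def encode(word: str) -> str:
    return "".join(T9[c] for c in word.lower())

def decode(keys: str) -> list:
    """Return all dictionary words matching a sequence of T9 key presses."""
    return [w for w in WORDS if encode(w) == keys]

print(decode("4663"))   # -> ['good', 'home', 'gone']
```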

The goal of sentence and document modeling is to accurately represent the meaning of sentences and documents for various Natural Language Processing tasks. In this work, we present Dependency Sensitive Convolutional Neural Networks (DSCNN) as a general-purpose classification system for both sentences and documents. DSCNN hierarchically builds textual representations by processing pretrained word embeddings via Long Short-Term Memory networks and subsequently extracting features with convolution operators. Compared with existing recursive neural models with tree structures, DSCNN does not rely on parsers and expensive phrase labeling, and is thus not restricted to sentence-level tasks. Moreover, unlike other CNN-based models that analyze sentences locally by sliding windows, our system captures both the dependency information within each sentence and relationships across sentences in the same document. Experimental results demonstrate that our approach achieves state-of-the-art performance on several tasks, including sentiment analysis, question type classification, and subjectivity classification.
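
A minimal sketch of the LSTM-then-convolution pipeline described above (layer sizes are illustrative assumptions, not the paper's configuration):

```python
# Sketch of the DSCNN idea: LSTM over pretrained word embeddings, then a
# convolution + max-pooling stage extracts features for classification.
import torch
import torch.nn as nn

class DSCNNSketch(nn.Module):
    def __init__(self, vocab=10000, emb=300, hidden=128, filters=100, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)      # load pretrained vectors here
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.conv = nn.Conv1d(hidden, filters, kernel_size=3, padding=1)
        self.fc = nn.Linear(filters, classes)

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))       # (batch, seq_len, hidden)
        f = torch.relu(self.conv(h.transpose(1, 2)))  # (batch, filters, seq_len)
        return self.fc(f.max(dim=2).values)        # max-pool over time

logits = DSCNNSketch()(torch.randint(0, 10000, (4, 20)))
print(logits.shape)   # torch.Size([4, 2])
```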

* NAACL2016
With the development of e-commerce, many products are now sold worldwide, and manufacturers are eager to obtain a better understanding of customer behavior in various regions. To achieve this goal, most previous efforts have relied mainly on questionnaires, which are time-consuming and costly. The tremendous volume of product reviews on e-commerce websites has given rise to a new trend whereby manufacturers attempt to understand user preferences by analyzing online reviews. Following this trend, this paper addresses the problem of studying customer behavior by exploiting recently developed opinion mining techniques. This work is novel for three reasons. First, questionnaire-based investigation is automated by employing algorithms for template-based question generation and opinion-mining-based answer extraction. Using this system, manufacturers can obtain reports of customer behavior featuring a much larger sample size, more direct information, a higher degree of automation, and a lower cost. Second, the study of international customer behavior is made easier by integrating tools for multilingual opinion mining. Third, this is the first time an automatic questionnaire investigation has been conducted to compare customer behavior in China and America, where product reviews are written and read in Chinese and English, respectively. Our study of digital cameras, smartphones, and tablet computers yields three findings. First, Chinese customers follow the Doctrine of the Mean and often use euphemistic expressions, while American customers express their opinions more directly. Second, Chinese customers care more about general feelings, while American customers pay more attention to product details. Third, Chinese customers focus on external features, while American customers care more about the internal features of products.

Automatic emotion recognition (AER) is a challenging task due to the abstract concept and multiple expressions of emotion. Although there is no consensus on its definition, human emotional states can usually be perceived through the auditory and visual systems. Inspired by this cognitive process in human beings, it is natural to simultaneously utilize audio and visual information in AER. However, most traditional fusion approaches build only a linear paradigm, such as feature concatenation and multi-system fusion, which hardly captures the complex association between audio and video. In this paper, we introduce factorized bilinear pooling (FBP) to deeply integrate the features of audio and video. Specifically, the features are selected through an embedded attention mechanism from the respective modalities to obtain the emotion-related regions. The whole pipeline can be completed in a neural network. Validated on the AFEW database of the audio-video sub-challenge in EmotiW2018, the proposed approach achieves an accuracy of 62.48%, outperforming the state-of-the-art result.
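
A minimal sketch of factorized bilinear pooling in the spirit described above (dimensions and normalization details are assumptions): project both modalities into a shared factor space, take an element-wise product, sum-pool in chunks, and normalize.

```python
# FBP-style fusion sketch: low-rank bilinear interaction between audio and
# video features. Dimensions here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FBPSketch(nn.Module):
    def __init__(self, d_audio=128, d_video=256, d_factor=1000, k=5):
        super().__init__()
        self.k = k
        self.proj_a = nn.Linear(d_audio, d_factor * k)
        self.proj_v = nn.Linear(d_video, d_factor * k)

    def forward(self, a, v):
        z = self.proj_a(a) * self.proj_v(v)             # low-rank bilinear term
        z = z.view(z.size(0), -1, self.k).sum(dim=2)    # sum pooling over k
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)  # power normalization
        return F.normalize(z, dim=1)                    # L2 normalization

fused = FBPSketch()(torch.randn(4, 128), torch.randn(4, 256))
print(fused.shape)   # torch.Size([4, 1000])
```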

The demand for applying semantic segmentation models on mobile devices has been increasing rapidly. Current state-of-the-art networks have an enormous number of parameters and are hence unsuitable for mobile devices, while other small-memory-footprint models ignore the inherent characteristics of semantic segmentation. To tackle this problem, we propose the novel Context Guided Network (CGNet), a light-weight network for semantic segmentation on mobile devices. We first propose the Context Guided (CG) block, which learns the joint feature of both the local feature and the surrounding context, and further improves the joint feature with the global context. Based on the CG block, we develop CGNet, which captures contextual information in all stages of the network and is specially tailored to increase segmentation accuracy. CGNet is also elaborately designed to reduce the number of parameters and save memory footprint. Under an equivalent number of parameters, the proposed CGNet significantly outperforms existing segmentation networks. Extensive experiments on the Cityscapes and CamVid datasets verify the effectiveness of the proposed approach. Specifically, without any post-processing, CGNet achieves 64.8% mean IoU on Cityscapes with fewer than 0.5 M parameters, and runs at 50 fps on one NVIDIA Tesla K80 card for 2048 $\times$ 1024 high-resolution images. The source code for the complete system is publicly available.
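
A rough sketch of a CG-style block as described above (channel sizes, grouping, dilation rate, and gating details are assumptions, not CGNet's exact design): a local-feature branch and a dilated "surrounding context" branch are joined, then reweighted by a global-context gate.

```python
# Sketch of a Context Guided block: local branch + dilated surrounding-context
# branch -> joint feature -> global-context channel gating. Details assumed.
import torch
import torch.nn as nn

class CGBlockSketch(nn.Module):
    def __init__(self, channels=32, dilation=2, reduction=8):
        super().__init__()
        half = channels // 2
        self.local = nn.Conv2d(channels, half, 3, padding=1, groups=half)
        self.surround = nn.Conv2d(channels, half, 3, padding=dilation,
                                  dilation=dilation, groups=half)
        self.join = nn.Sequential(nn.BatchNorm2d(channels), nn.PReLU(channels))
        self.glo = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(channels, channels // reduction, 1),
                                 nn.ReLU(),
                                 nn.Conv2d(channels // reduction, channels, 1),
                                 nn.Sigmoid())

    def forward(self, x):
        joint = self.join(torch.cat([self.local(x), self.surround(x)], dim=1))
        return joint * self.glo(joint)   # refine joint feature with global context

out = CGBlockSketch()(torch.randn(1, 32, 64, 64))
print(out.shape)   # torch.Size([1, 32, 64, 64])
```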

* Code: https://github.com/wutianyiRosun/CGNet
Weakly supervised object detection has recently received much attention, since it requires only image-level labels instead of the bounding-box labels consumed in strongly supervised learning. Nevertheless, the savings in labeling expense usually come at the cost of model accuracy. In this paper, we propose a simple but effective weakly supervised collaborative learning framework to resolve this problem, which trains a weakly supervised learner and a strongly supervised learner jointly by enforcing partial feature sharing and prediction consistency. For object detection, taking a WSDDN-like architecture as the weakly supervised detector sub-network and a Faster-RCNN-like architecture as the strongly supervised detector sub-network, we propose an end-to-end Weakly Supervised Collaborative Detection Network. As there is no strong supervision available to train the Faster-RCNN-like sub-network, a new prediction consistency loss is defined to enforce consistency of predictions between the two sub-networks as well as within the Faster-RCNN-like sub-network. At the same time, the two detectors are designed to partially share features to further guarantee model consistency at the perceptual level. Extensive experiments on the PASCAL VOC 2007 and 2012 datasets have demonstrated the effectiveness of the proposed framework.
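
A minimal sketch of a prediction-consistency term (an assumed simplification, not the paper's exact loss): penalize disagreement between the weakly and strongly supervised detectors' per-region class distributions, treating the weak detector's output as a soft target.

```python
# Consistency-loss sketch: KL divergence from the weak detector's soft target
# to the strong detector's predictions.
import torch
import torch.nn.functional as F

def consistency_loss(weak_logits, strong_logits):
    """Weak predictions are detached so they act as a fixed soft target."""
    soft_target = F.softmax(weak_logits, dim=1).detach()
    return F.kl_div(F.log_softmax(strong_logits, dim=1),
                    soft_target, reduction="batchmean")

weak = torch.randn(16, 21)     # e.g. 16 proposals, 20 VOC classes + background
strong = torch.randn(16, 21, requires_grad=True)
print(consistency_loss(weak, strong))
```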

Photorealistic frontal view synthesis from a single face image has a wide range of applications in the field of face recognition. Although data-driven deep learning methods have been proposed to address this problem by seeking solutions from ample face data, the problem remains challenging because it is intrinsically ill-posed. This paper proposes a Two-Pathway Generative Adversarial Network (TP-GAN) for photorealistic frontal view synthesis that simultaneously perceives global structures and local details. Four landmark-located patch networks are proposed to attend to local textures in addition to the commonly used global encoder-decoder network. Beyond the novel architecture, we make this ill-posed problem well constrained by introducing a combination of adversarial loss, symmetry loss and identity-preserving loss. The combined loss function leverages both the frontal face distribution and pre-trained discriminative deep face models to guide an identity-preserving inference of frontal views from profiles. Different from previous deep learning methods that mainly rely on intermediate features for recognition, our method directly leverages the synthesized identity-preserving image for downstream tasks such as face recognition and attribute estimation. Experimental results demonstrate that our method not only presents compelling perceptual results but also outperforms state-of-the-art results on large-pose face recognition.
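
A minimal sketch of such a combined generator objective (the loss weights, discriminator interface, and identity feature extractor are assumptions, not TP-GAN's exact formulation):

```python
# Combined-loss sketch: adversarial + symmetry + identity-preserving terms
# on a synthesized frontal face. Weights and interfaces are illustrative.
import torch
import torch.nn.functional as F

def generator_loss(fake_frontal, d_score_fake, id_feat_fake, id_feat_real,
                   w_sym=0.3, w_ip=0.02):
    # Adversarial term: fool the discriminator (non-saturating form).
    l_adv = F.binary_cross_entropy_with_logits(
        d_score_fake, torch.ones_like(d_score_fake))
    # Symmetry term: a frontal face should be close to its horizontal flip.
    l_sym = F.l1_loss(fake_frontal, torch.flip(fake_frontal, dims=[3]))
    # Identity-preserving term: deep features of the synthesized view should
    # match those of the input identity (from a pretrained face model).
    l_ip = F.mse_loss(id_feat_fake, id_feat_real)
    return l_adv + w_sym * l_sym + w_ip * l_ip

loss = generator_loss(torch.rand(2, 3, 128, 128), torch.randn(2, 1),
                      torch.randn(2, 256), torch.randn(2, 256))
print(loss)
```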

* accepted at ICCV 2017, main paper & supplementary material, 11 pages
In recent years, tremendous strides have been made in face detection thanks to deep learning. However, most published face detectors deteriorate dramatically as faces become smaller. In this paper, we present the Small Faces Attention (SFA) face detector to better detect small-scale faces. First, we propose a new scale-invariant face detection architecture that pays more attention to small faces, including a 4-branch detection architecture and a small-face-sensitive anchor design. Second, a feature-map fusion strategy is applied in SFA, partially combining high-level features into low-level features to further improve the ability to find hard faces. Third, we use a multi-scale training and testing strategy to enhance face detection performance in practice. Comprehensive experiments show that SFA significantly improves face detection performance, especially on small faces. Our real-time SFA face detector can run at 5 FPS on a single GPU while maintaining high performance. Moreover, our final SFA face detector achieves state-of-the-art detection performance on challenging face detection benchmarks, including the WIDER FACE and FDDB datasets, with competitive runtime speed. Both our code and models will be made available to the research community.

* 10 pages, 0 figures, 0 tables, 41 references
Most existing semantic segmentation methods employ atrous convolution to enlarge the receptive field of filters, but neglect partial information. To tackle this issue, we first propose a novel Kronecker convolution, which adopts the Kronecker product to expand the standard convolutional kernel so as to take into account the partial features neglected by atrous convolutions. It can therefore capture partial information and enlarge the receptive field of filters simultaneously without introducing extra parameters. Second, we propose the Tree-structured Feature Aggregation (TFA) module, which follows a recursive rule to expand, forming a hierarchical structure. It can thus naturally learn representations of multi-scale objects and encode hierarchical contextual information in complex scenes. Finally, we design the Tree-structured Kronecker Convolutional Network (TKCN), which employs the Kronecker convolution and the TFA module. Extensive experiments on three datasets, PASCAL VOC 2012, PASCAL-Context and Cityscapes, verify the effectiveness of our proposed approach. We make the code and the trained model publicly available at https://github.com/wutianyiRosun/TKCN.
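
A minimal sketch of the Kronecker-expansion idea (the expansion pattern below is an illustrative assumption, not the paper's exact scheme): Kronecker-multiplying a learnable kernel by a fixed sparse pattern enlarges the receptive field while keeping the original number of free parameters.

```python
# Kronecker-convolution sketch: expand a 3x3 kernel via torch.kron with a
# fixed 2x2 pattern; the 6x6 result still has only 9 learnable weights.
import torch
import torch.nn.functional as F

kernel = torch.randn(1, 1, 3, 3)          # standard learnable 3x3 kernel
pattern = torch.zeros(1, 1, 2, 2)         # fixed (non-learnable) expansion factor
pattern[0, 0, 0, 0] = 1.0                 # keeps weights on a sparse sub-grid

expanded = torch.kron(kernel, pattern)    # shape (1, 1, 6, 6)

x = torch.randn(1, 1, 32, 32)
y = F.conv2d(x, expanded, padding=2)
print(expanded.shape, y.shape)            # (1, 1, 6, 6) and (1, 1, 31, 31)
```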

* Code: https://github.com/wutianyiRosun/TKCN
Volatility is a measure of the price movements of stocks or options that indicates the uncertainty within financial markets. As an indicator of the level of risk or the degree of variation, volatility is important for analyzing financial markets, and it is taken into consideration in various decision-making processes in financial activities. On the other hand, recent advances in deep learning techniques have shown strong capabilities in modelling sequential data, such as speech and natural language. In this paper, we empirically study the applicability of the latest deep structures to the volatility modelling problem, through which we aim to provide empirical guidance for future theoretical analysis of the marriage between deep learning techniques and financial applications. We examine both traditional approaches and deep sequential models on the task of volatility prediction, including the most recent variants of convolutional and recurrent networks, such as the dilated architecture. Accordingly, experiments with real-world stock price datasets are performed on a set of 1314 daily stock series covering 2018 days of transactions. The evaluation and comparison are based on the negative log likelihood (NLL) of real-world stock price time series. The results show that the dilated neural models, including the dilated CNN and dilated RNN, produce the most accurate estimates and predictions, outperforming various widely used deterministic models in the GARCH family and several recently proposed stochastic models. In addition, their high flexibility and rich expressive power are validated in this study.
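
To make the evaluation criterion concrete, here is a minimal sketch of a GARCH(1,1) baseline scored by Gaussian negative log likelihood (parameter values are illustrative, not fitted, and the data below are synthetic stand-ins):

```python
# GARCH(1,1) NLL sketch: recursively predict the conditional variance and
# score returns with the average Gaussian negative log likelihood.
import numpy as np

def garch_nll(returns, omega=1e-6, alpha=0.1, beta=0.85):
    var = np.var(returns)                  # initialize conditional variance
    nll = 0.0
    for r in returns:
        nll += 0.5 * (np.log(2 * np.pi * var) + r ** 2 / var)
        var = omega + alpha * r ** 2 + beta * var   # GARCH(1,1) recursion
    return nll / len(returns)

rng = np.random.default_rng(0)
fake_returns = 0.01 * rng.standard_normal(2018)    # stand-in for a stock series
print("average NLL:", garch_nll(fake_returns))
```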

* NIPS 2018, Workshop on Challenges and Opportunities for AI in Financial Services