Research papers and code for "Rui Zhang":
Transfer learning has been developed to improve performance across different but related tasks in machine learning. However, such processes become less efficient as the size of the training data and the number of tasks grow. Moreover, privacy can be violated, as some tasks may contain sensitive and private data that are communicated between nodes and tasks. We propose a consensus-based distributed transfer learning framework, in which several tasks aim to find the best linear support vector machine (SVM) classifiers over a distributed network. Using the alternating direction method of multipliers (ADMM), tasks achieve better classification accuracy more efficiently and privately, as each node and each task train on their own data and only decision variables are transferred between tasks and nodes. Numerical experiments on the MNIST dataset show that knowledge transferred from the source tasks can reduce the risks of target tasks that lack training data or have unbalanced training labels. We show that the risks of target tasks in nodes without source-task data can also be reduced using information transferred from the nodes that contain source-task data. We also show that target tasks can enter and leave in real time without rerunning the whole algorithm.
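As a concrete illustration of the training pattern described above, here is a minimal numpy sketch of consensus-ADMM for linear SVMs; the subgradient inner loop and all constants are illustrative stand-ins for the paper's exact local solver:

```python
import numpy as np

def local_svm_step(w, X, y, z, u, C=1.0, rho=1.0, lr=0.01, iters=50):
    """Approximate the local ADMM subproblem by subgradient descent on the
    hinge loss plus the proximal term (a stand-in for an exact solver)."""
    for _ in range(iters):
        mask = y * (X @ w) < 1                       # margin violators
        grad = rho * (w - z + u)
        if mask.any():
            grad -= C * (y[mask, None] * X[mask]).sum(axis=0)
        w = w - lr * grad
    return w

def consensus_admm_svm(node_data, dim, rounds=20, rho=1.0):
    """Each node trains on its own (X, y); only the decision variables w, u
    and the consensus vector z are exchanged, never the raw data."""
    N = len(node_data)
    w = [np.zeros(dim) for _ in range(N)]
    u = [np.zeros(dim) for _ in range(N)]
    z = np.zeros(dim)
    for _ in range(rounds):
        for n, (X, y) in enumerate(node_data):       # local, private updates
            w[n] = local_svm_step(w[n], X, y, z, u[n], rho=rho)
        z = np.mean([w[n] + u[n] for n in range(N)], axis=0)  # consensus
        for n in range(N):
            u[n] += w[n] - z                         # scaled dual update
    return z                                         # shared linear classifier
```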

Distributed support vector machines (DSVM) have been developed to solve large-scale classification problems in networked systems with large numbers of sensors and control units. However, such systems become more vulnerable as detection and defense grow increasingly difficult and expensive. This work aims to develop secure and resilient DSVM algorithms for adversarial environments, in which an attacker can manipulate the training data to achieve his objective. We establish a game-theoretic framework to capture the conflicting interests between an adversary and a set of distributed data processing units. The Nash equilibrium of the game allows us to predict the outcome of learning algorithms in adversarial environments and to enhance the resilience of machine learning through dynamic distributed learning algorithms. We prove that the distributed algorithm converges without assumptions on the training data or network topologies. Numerical experiments corroborate the results. We show that network topology plays an important role in the security of DSVM: networks with fewer nodes and higher average degrees are more secure, and a balanced network is found to be less vulnerable to attacks.

* arXiv admin note: text overlap with arXiv:1710.04677
With large numbers of sensors and control units in networked systems, distributed support vector machines (DSVMs) play a fundamental role in scalable and efficient multi-sensor classification and prediction tasks. However, DSVMs are vulnerable to adversaries who can modify and generate data to deceive the system into misclassification and misprediction. This work aims to design defense strategies for the DSVM learner against a potential adversary. We establish a game-theoretic framework to capture the conflicting interests between the DSVM learner and the attacker. The Nash equilibrium of the game allows us to predict the outcome of learning algorithms in adversarial environments and to enhance the resilience of machine learning through dynamic distributed learning algorithms. We show that the DSVM learner is less vulnerable when it uses a balanced network with fewer nodes and higher degree. We also show that adding more training samples is an efficient defense strategy against an attacker. We present secure and resilient DSVM algorithms with verification and rejection methods, and demonstrate their resilience against adversaries with numerical experiments.
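To make the game concrete, here is a toy, single-machine sketch of alternating best responses between a hinge-loss learner and a budget-constrained data-perturbing attacker; it illustrates the equilibrium idea only and is neither the distributed algorithm nor the paper's attacker model:

```python
import numpy as np

def learner_step(w, X, y, lam=0.1, lr=0.05):
    """Learner's best-response step: subgradient descent on the regularized
    hinge loss over the (possibly attacked) training data."""
    mask = y * (X @ w) < 1
    grad = lam * w
    if mask.any():
        grad -= (y[mask, None] * X[mask]).mean(axis=0)
    return w - lr * grad

def attacker_step(X, X_clean, y, w, budget=0.5, lr=0.05):
    """Attacker's best-response step: perturb features to increase the
    learner's hinge loss, then project back onto an l2 budget per sample."""
    mask = y * (X @ w) < 1
    X = X.copy()
    X[mask] -= lr * y[mask, None] * w          # gradient ascent on the loss
    delta = X - X_clean
    norms = np.linalg.norm(delta, axis=1, keepdims=True)
    return X_clean + delta * np.minimum(1.0, budget / np.maximum(norms, 1e-12))

def play_game(X_clean, y, rounds=300):
    """Alternating best responses; the fixed point approximates the Nash
    equilibrium used to predict learning outcomes under attack."""
    w, X = np.zeros(X_clean.shape[1]), X_clean.copy()
    for _ in range(rounds):
        w = learner_step(w, X, y)
        X = attacker_step(X, X_clean, y, w)
    return w, X
```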

Natural-language-facilitated human-robot cooperation (NLC) refers to using natural language (NL) to facilitate interactive information sharing and task execution between robots and humans under a common goal constraint. NLC research has recently received increasing attention. Typical NLC scenarios include robotic daily assistance, robotic health caregiving, intelligent manufacturing, autonomous navigation, and robotic social companionship. However, a thorough review revealing the latest methodologies for using NL to facilitate human-robot cooperation has been missing. This review presents a comprehensive summary of NLC methodologies. NLC research has three main focuses: NL instruction understanding, NL-based execution plan generation, and knowledge-world mapping. In-depth analyses of theoretical methods, applications, and model advantages and disadvantages are provided. Based on our review and perspective, potential research directions for NLC are summarized.

* 13 pages, 9 figures
Natural-language-facilitated human-robot cooperation (NLC), in which natural language (NL) is used to share knowledge between a human and a robot for intuitive human-robot cooperation (HRC), has developed continuously over the recent decade. NLC is currently used in several robotic domains, such as manufacturing, daily assistance, and health caregiving. It is necessary to summarize current NLC-based robotic systems and discuss future trends, providing helpful information for future NLC research. In this review, we first analyze the driving forces behind NLC research. Based on the robot's cognition level during cooperation, we then categorize NLC implementations into four types {NL-based control, NL-based robot training, NL-based task execution, NL-based social companionship} for comparison and discussion. Finally, based on our perspective and comprehensive paper review, we discuss future research trends.

* 21 pages, 10 figures
It is critical for advanced manufacturing machines to autonomously execute tasks by following an end-user's natural language (NL) instructions. However, NL instructions are usually ambiguous and abstract, so machines may misunderstand them and execute tasks incorrectly. To address this NL-based human-machine communication problem and enable machines to execute tasks appropriately by following the end-user's NL instructions, we developed a Machine-Executable-Plan-Generation (exePlan) method. The exePlan method conducts task-centered semantic analysis to extract task-related information from ambiguous NL instructions. In addition, the method specifies machine execution parameters to generate a machine-executable plan by interpreting abstract NL instructions. To evaluate the exePlan method, an industrial robot, Baxter, was instructed via NL to perform three types of industrial tasks {'drill a hole', 'clean a spot', 'install a screw'}. The experimental results showed that the exePlan method was effective in generating machine-executable plans from the end-user's NL instructions. Such a method promises to endow a machine with the ability of NL-instructed task execution.
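As a toy illustration of the two steps named above (task-centered extraction plus execution-parameter specification), the sketch below uses an invented keyword grammar and invented parameter defaults; it is not the exePlan method's actual rule set:

```python
import re

# Hypothetical task verbs with invented default execution parameters.
ACTIONS = {"drill": {"tool": "drill_bit", "depth_mm": 10.0},
           "clean": {"tool": "brush", "passes": 2},
           "install": {"tool": "screwdriver", "torque_nm": 1.2}}

def nl_to_plan(instruction):
    """Map an ambiguous NL instruction to a machine-executable step by
    extracting the task verb/object and filling unspecified execution
    parameters with defaults."""
    for verb, params in ACTIONS.items():
        m = re.search(rf"{verb}\s+(?:a|an|the)?\s*(\w+)", instruction.lower())
        if m:
            return {"action": verb, "object": m.group(1), **params}
    return None

print(nl_to_plan("Please drill a hole here"))
# {'action': 'drill', 'object': 'hole', 'tool': 'drill_bit', 'depth_mm': 10.0}
```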

* 16 pages, 10 figures, article submitted to Robotics and Computer-Integrated Manufacturing, 2016 Aug
Recent developments in basis pursuit and compressed sensing seek to extract information from as few samples as possible. In such applications, since the number of samples is restricted, one should deploy the sampling points wisely. We are thus motivated to study the optimal distribution of finitely many sampling points. Formulating the problem in the framework of optimal reconstruction yields a minimization problem. In the discrete case, we estimate the distance between the optimal subspace resulting from a general Karhunen-Loève transform and the kernel space, obtaining another, computationally favorable algorithm. Numerical experiments illustrate the performance of the algorithms in searching for optimal sampling points.
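As one concrete, standard way to pick sampling points in a kernel space (in the spirit of the computationally favorable algorithm mentioned, though not the paper's exact method), a greedy pivoted-Cholesky selection can be sketched as:

```python
import numpy as np

def greedy_sampling_points(candidates, kernel, m):
    """Greedily select m sampling points from a candidate grid: each step
    picks the point with the largest remaining kernel variance, i.e. a
    pivoted Cholesky factorization of the Gram matrix."""
    K = np.array([[kernel(a, b) for b in candidates] for a in candidates])
    n = len(candidates)
    var = np.diag(K).astype(float)            # remaining variance per point
    L = np.zeros((n, m))
    chosen = []
    for j in range(m):
        i = int(np.argmax(var))               # most informative point left
        chosen.append(i)
        L[:, j] = (K[:, i] - L[:, :j] @ L[i, :j]) / np.sqrt(var[i])
        var -= L[:, j] ** 2
    return [candidates[i] for i in chosen]

pts = greedy_sampling_points(list(np.linspace(0, 1, 101)),
                             lambda s, t: np.exp(-abs(s - t)), m=5)
```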

Building height estimation is important in many applications, such as 3D city reconstruction, urban planning, and navigation. Recently, a building height estimation method using street scene images and 2D maps was proposed. This method is more scalable than traditional methods that use high-resolution optical data, LiDAR data, or RADAR data, which are expensive to obtain. The method needs to detect building rooflines and then compute building height via the pinhole camera model. We observe that this method has limitations in handling complex street scene images in which buildings overlap and rooflines are difficult to locate. We propose CBHE, a building height estimation algorithm that considers both building corners and rooflines. CBHE first obtains building corner and roofline candidates in street scene images based on building footprints from 2D maps and the camera parameters. Then, a deep neural network named BuildingNet classifies and filters the corner and roofline candidates. Based on the valid corners and rooflines from BuildingNet, CBHE computes building height via the pinhole camera model. Experimental results show that BuildingNet yields higher accuracy on building corner and roofline candidate filtering than state-of-the-art open-set classifiers, while CBHE outperforms the baseline algorithm by over 10% in building height estimation accuracy.
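The pinhole-camera step can be made concrete under simplifying assumptions (a level camera at known height, with the building's ground line and roofline detected as image rows); variable names are illustrative:

```python
def building_height(cam_height_m, focal_px, v_base_px, v_roof_px):
    """Estimate building height with a level pinhole camera.

    v_base_px: image-row offset of the building's ground line *below* the
               principal point (positive).
    v_roof_px: image-row offset of the detected roofline *above* the
               principal point (positive).
    With a level camera at height h_c, the building's depth is
    d = f * h_c / v_base, and the roofline rises d * v_roof / f above the
    camera, so H = h_c + d * v_roof / f = h_c * (1 + v_roof / v_base).
    """
    depth = focal_px * cam_height_m / v_base_px
    return cam_height_m + depth * v_roof_px / focal_px

# Example: 1.6 m camera, focal length 1400 px, base 120 px below and
# roofline 900 px above the principal point -> a 13.6 m building.
print(building_height(1.6, 1400, 120, 900))
```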

The recent direction of unpaired image-to-image translation is, on the one hand, very exciting because it alleviates the heavy burden of obtaining label-intensive pixel-to-pixel supervision, but it is, on the other hand, not fully satisfactory due to artifacts and degenerate transformations. In this paper, we take a manifold view of the problem, introducing a smoothness term over the sample graph that attains harmonic functions and enforces consistent mappings during translation. We develop HarmonicGAN to learn bi-directional translations between the source and target domains. With the help of similarity consistency, the inherent self-consistency of samples is maintained. Distance metrics defined on two types of features, histogram and CNN, are exploited. Under an identical problem setting as CycleGAN, with no additional manual input and only a small training-time cost, HarmonicGAN demonstrates significant qualitative and quantitative improvement over the state of the art, as well as improved interpretability. We show experimental results in a number of applications, including medical imaging, object transfiguration, and semantic labeling. We outperform competing methods in all tasks; for a medical imaging task in particular, our method turns CycleGAN from a failure into a success, halving the mean-squared error and generating images that radiologists prefer over those of competing methods in 95% of cases.
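The smoothness term can be sketched as a graph-Laplacian penalty: weight pairs of samples by their similarity under the source features (histogram or CNN descriptors), then penalize translated outputs that break that similarity. A minimal numpy version of the idea, outside any training framework:

```python
import numpy as np

def harmonic_smoothness(src_feats, out_feats):
    """Graph-smoothness penalty in the spirit of the abstract: samples that
    are similar under source-domain features should remain similar after
    translation. src_feats, out_feats: arrays of shape (n, d)."""
    d2 = ((src_feats[:, None, :] - src_feats[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (d2.mean() + 1e-12))      # affinity (graph edge weights)
    out_d2 = ((out_feats[:, None, :] - out_feats[None, :, :]) ** 2).sum(-1)
    return float((w * out_d2).mean())          # ~ sum_ij w_ij ||f_i - f_j||^2
```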

We propose a new sampler that integrates the protocol of parallel tempering with Nosé-Hoover (NH) dynamics. The proposed method can efficiently draw representative samples from complex posterior distributions with multiple isolated modes in the presence of noise arising from stochastic gradients. It potentially facilitates deep Bayesian learning on large datasets, where complex multimodal posteriors and mini-batch gradients are encountered.
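A minimal sketch of the two ingredients, assuming access to the log posterior and a (possibly mini-batch) gradient; the temperature ladder, step size, and thermostat constants are illustrative, not the paper's:

```python
import numpy as np

def nh_parallel_tempering(log_post, grad_log_post, theta0,
                          temps=(1.0, 2.0, 4.0), eps=1e-3,
                          n_steps=5000, swap_every=100, seed=0):
    """Nose-Hoover thermostat chains at several temperatures (tolerating
    noisy gradients) plus parallel-tempering replica swaps."""
    rng = np.random.default_rng(seed)
    K, d = len(temps), theta0.size
    theta = [theta0.copy() for _ in range(K)]
    p = [rng.normal(size=d) for _ in range(K)]
    xi = [1.0] * K
    samples = []
    for t in range(n_steps):
        for k, T in enumerate(temps):
            g = grad_log_post(theta[k])               # may be stochastic
            p[k] += eps * g - eps * xi[k] * p[k] \
                    + np.sqrt(2.0 * T * eps) * rng.normal(size=d)
            theta[k] += eps * p[k]
            xi[k] += eps * (p[k] @ p[k] / d - T)      # thermostat holds the
                                                      # kinetic energy near T
        if t % swap_every == 0 and K > 1:             # replica-exchange move
            k = int(rng.integers(K - 1))
            log_a = (1.0 / temps[k] - 1.0 / temps[k + 1]) * \
                    (log_post(theta[k + 1]) - log_post(theta[k]))
            if np.log(rng.random()) < log_a:
                theta[k], theta[k + 1] = theta[k + 1], theta[k]
        samples.append(theta[0].copy())               # T=1 chain targets the
    return np.array(samples)                          # true posterior
```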

We consider the matrix completion problem with a deterministic pattern of observed entries. In this setting, we aim to answer the question: under what conditions does the matrix completion problem have an (at least locally) unique solution, i.e., when is the underlying true matrix identifiable? We answer the question from a geometric perspective and give an algebraically verifiable sufficient condition, which we call the well-posedness condition, for the local uniqueness of minimum rank matrix completion (MRMC) solutions. We argue that this condition is necessary for the local stability of MRMC solutions, and we show, using the characteristic rank, that the condition is generic. We also argue that low-rank approximation approaches are more stable than MRMC, and we further propose a sequential statistical testing procedure to determine the "true" rank from observed entries. Finally, we provide numerical examples aimed at verifying the validity of the presented theory.
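A numerical proxy for this kind of identifiability check (not the paper's exact algebraic condition): at X = U V^T, test whether the Jacobian of the observed entries with respect to (U, V) has rank equal to the dimension (m + n - r) r of the rank-r manifold:

```python
import numpy as np

def locally_identifiable(U, V, omega):
    """Check a generic local-uniqueness condition at X = U @ V.T with
    observed positions omega (list of (i, j) pairs): X is locally
    identifiable from the pattern iff the Jacobian of (U, V) -> X_omega
    has rank (m + n - r) * r, the dimension of the rank-r manifold."""
    m, r = U.shape
    n = V.shape[0]
    J = np.zeros((len(omega), (m + n) * r))
    for row, (i, j) in enumerate(omega):
        J[row, i * r:(i + 1) * r] = V[j]                   # d(UV^T)_ij / dU_i
        J[row, m * r + j * r:m * r + (j + 1) * r] = U[i]   # d(UV^T)_ij / dV_j
    return np.linalg.matrix_rank(J) == (m + n - r) * r

rng = np.random.default_rng(0)
U, V = rng.normal(size=(5, 2)), rng.normal(size=(6, 2))
full = [(i, j) for i in range(5) for j in range(6)]
print(locally_identifiable(U, V, full))   # True: full observation suffices
```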

Recommendation systems based on image recognition could prove a vital tool in enhancing the experience of museum audiences. However, for practical systems using wearable cameras, a number of challenges affect the quality of image recognition. In this pilot study, we focus on the recognition of museum collections using a wearable camera in three different museum spaces. We discuss the application of wearable cameras and the practical and technical challenges in devising a robust system that can recognize the artworks viewed by visitors to create a detailed record of their visit. Specifically, to illustrate the impact of different kinds of museum spaces on image recognition, we collect three training datasets of museum exhibits containing a variety of paintings, clocks, and sculptures. Subsequently, we equip selected visitors with wearable cameras to capture the artworks they view as they stroll through the exhibitions. We use Convolutional Neural Networks (CNNs) pre-trained on the ImageNet dataset and fine-tuned on each of the training sets for artwork identification. In the testing stage, we use the CNNs to identify artworks captured by the visitors with a wearable camera. We analyze the recognition accuracy and provide insight into the applicability of such a system to further engage audiences with museum exhibitions.
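A minimal sketch of the fine-tuning setup described above (the backbone choice and class count are illustrative; the paper does not fix them here):

```python
import torch
import torch.nn as nn
from torchvision import models

num_artworks = 87                      # illustrative count, not the paper's
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():           # freeze the ImageNet-pretrained body
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_artworks)  # new head trains

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step on a batch of wearable-camera crops."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```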

* Museums and the Web, 2017
Vision-based text entry systems aim to help disabled people achieve text communication using eye movement. Most previous methods employ an existing eye tracker to predict gaze direction and design an input method based on it. However, the eye-tracking quality of these methods is easily affected by various factors, and calibration can take a long time. Our paper presents a novel, efficient gaze-based text input method that has the advantages of low cost and robustness. Users type words by looking at an on-screen keyboard and blinking. Rather than estimating gaze angles directly to track the eyes, we introduce a method that divides human gaze into nine directions. This method effectively improves the accuracy of making selections by gaze and blinks. We build a Convolutional Neural Network (CNN) model for 9-direction gaze estimation. On the basis of the 9-direction gaze, we use the nine-key T9 input method widely used on candy-bar phones. Bar phones were very popular worldwide decades ago and have cultivated strong user habits and language models. To train a robust gaze estimator, we created a large-scale dataset with eye images from 25 people. According to our experimental results, our CNN model accurately estimates different people's gaze under various lighting conditions and on different devices. Considering disabled people's needs, we removed the complex calibration process. The input method can run in screen mode and in portable off-screen mode. Moreover, the datasets used in our experiments are made available to the community to allow further experimentation.
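A toy sketch of the T9-style decoding layer that would sit on top of the 9-direction gaze classifier; the direction names, key layout, and word list are illustrative, not the paper's:

```python
# Eight gaze directions carry the classic T9 letter groups; the ninth
# direction (here "down_right") is reserved for space/confirm, with a
# blink acting as the selection click.
T9_KEYS = {
    "up_left": "abc", "up": "def", "up_right": "ghi",
    "left": "jkl", "center": "mno", "right": "pqrs",
    "down_left": "tuv", "down": "wxyz",
}
LETTER_TO_DIR = {c: d for d, chars in T9_KEYS.items() for c in chars}

def decode(directions, vocabulary):
    """Return vocabulary words whose key sequence matches the gazed
    directions, most frequent first (vocabulary: word -> frequency).
    This is where a language model disambiguates the nine-key input."""
    matches = [w for w in vocabulary
               if len(w) == len(directions)
               and all(LETTER_TO_DIR[c] == d for c, d in zip(w, directions))]
    return sorted(matches, key=vocabulary.get, reverse=True)

vocab = {"dog": 5, "fog": 3, "dot": 1}
print(decode(["up", "center", "up_right"], vocab))   # ['dog', 'fog']
```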

The goal of sentence and document modeling is to accurately represent the meaning of sentences and documents for various Natural Language Processing tasks. In this work, we present Dependency Sensitive Convolutional Neural Networks (DSCNN) as a general-purpose classification system for both sentences and documents. DSCNN hierarchically builds textual representations by processing pretrained word embeddings via Long Short-Term Memory networks and subsequently extracting features with convolution operators. Compared with existing recursive neural models with tree structures, DSCNN does not rely on parsers and expensive phrase labeling and is thus not restricted to sentence-level tasks. Moreover, unlike other CNN-based models that analyze sentences locally with sliding windows, our system captures both the dependency information within each sentence and the relationships across sentences in the same document. Experimental results demonstrate that our approach achieves state-of-the-art performance on several tasks, including sentiment analysis, question type classification, and subjectivity classification.
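A minimal PyTorch reading of the pipeline described above (embeddings -> LSTM for dependency-sensitive context -> convolution and max-over-time pooling -> classifier); hyperparameters are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class DSCNNSketch(nn.Module):
    """Sketch: pretrained embeddings processed by an LSTM, then a 1-D
    convolution with max-over-time pooling feeding a linear classifier."""
    def __init__(self, vocab_size, emb_dim=300, hidden=128,
                 n_filters=100, kernel=3, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # load pretrained here
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.conv = nn.Conv1d(hidden, n_filters, kernel_size=kernel)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        h, _ = self.lstm(self.emb(tokens))     # (batch, seq_len, hidden)
        h = torch.relu(self.conv(h.transpose(1, 2)))   # (batch, filters, L')
        h = h.max(dim=2).values                # max-over-time pooling
        return self.fc(h)

logits = DSCNNSketch(vocab_size=10000)(torch.randint(0, 10000, (4, 20)))
```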

* NAACL2016
With the development of e-commerce, many products are now sold worldwide, and manufacturers are eager to obtain a better understanding of customer behavior in various regions. To achieve this goal, most previous efforts have relied on questionnaires, which are time-consuming and costly. The tremendous volume of product reviews on e-commerce websites has given rise to a new trend, whereby manufacturers attempt to understand user preferences by analyzing online reviews. Following this trend, this paper addresses the problem of studying customer behavior by exploiting recently developed opinion mining techniques. This work is novel for three reasons. First, questionnaire-based investigation is automated by employing algorithms for template-based question generation and opinion-mining-based answer extraction. Using this system, manufacturers can obtain reports on customer behavior featuring a much larger sample size, more direct information, a higher degree of automation, and a lower cost. Second, studying international customer behavior is made easier by integrating tools for multilingual opinion mining. Third, this is the first time an automatic questionnaire investigation has been conducted to compare customer behavior in China and America, where product reviews are written and read in Chinese and English, respectively. Our study of digital cameras, smartphones, and tablet computers yields three findings. First, Chinese customers follow the Doctrine of the Mean and often use euphemistic expressions, while American customers express their opinions more directly. Second, Chinese customers care more about general feelings, while American customers pay more attention to product details. Third, Chinese customers focus on external features, while American customers care more about the internal features of products.

Automatic emotion recognition (AER) is a challenging task due to the abstract nature and multiple expressions of emotion. Although there is no consensus on a definition, human emotional states can usually be apperceived through the auditory and visual systems. Inspired by this cognitive process in human beings, it is natural to simultaneously utilize audio and visual information in AER. However, most traditional fusion approaches build only a linear paradigm, such as feature concatenation and multi-system fusion, which hardly captures the complex associations between audio and video. In this paper, we introduce factorized bilinear pooling (FBP) to deeply integrate audio and video features. Specifically, the features are selected through an embedded attention mechanism in each modality to obtain emotion-related regions. The whole pipeline can be completed in a neural network. Validated on the AFEW database of the audio-video sub-challenge in EmotiW2018, the proposed approach achieves an accuracy of 62.48%, outperforming the state-of-the-art result.
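A minimal numpy sketch of the FBP fusion step, writing the factorized bilinear interaction as a projected elementwise product followed by the usual power and l2 normalization; the shapes are illustrative, and in the real model U, V, P are learned end to end:

```python
import numpy as np

def factorized_bilinear_pooling(x, y, U, V, P):
    """Fuse two modality vectors: a full bilinear map x^T W y is
    factorized into low-rank projections U, V combined elementwise,
    then projected by P, with signed-sqrt and l2 normalization."""
    z = (U.T @ x) * (V.T @ y)            # low-rank bilinear interaction
    z = P.T @ z
    z = np.sign(z) * np.sqrt(np.abs(z))  # power normalization
    return z / (np.linalg.norm(z) + 1e-12)

rng = np.random.default_rng(0)
audio, video = rng.normal(size=256), rng.normal(size=512)
U, V, P = (rng.normal(size=(256, 64)), rng.normal(size=(512, 64)),
           rng.normal(size=(64, 32)))
fused = factorized_bilinear_pooling(audio, video, U, V, P)   # shape (32,)
```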

The demand for applying semantic segmentation models on mobile devices has been increasing rapidly. Current state-of-the-art networks have enormous numbers of parameters and are hence unsuitable for mobile devices, while other small-memory-footprint models ignore the inherent characteristics of semantic segmentation. To tackle this problem, we propose the Context Guided Network (CGNet), a light-weight network for semantic segmentation on mobile devices. We first propose the Context Guided (CG) block, which learns the joint feature of both local features and the surrounding context, and further improves the joint feature with the global context. Based on the CG block, we develop CGNet, which captures contextual information in all stages of the network and is specially tailored to increasing segmentation accuracy. CGNet is also elaborately designed to reduce the number of parameters and save memory. Under an equivalent number of parameters, CGNet significantly outperforms existing segmentation networks. Extensive experiments on the Cityscapes and CamVid datasets verify the effectiveness of the proposed approach. Specifically, without any post-processing, CGNet achieves 64.8% mean IoU on Cityscapes with fewer than 0.5 M parameters, and runs at 50 fps on one NVIDIA Tesla K80 card for 2048 $\times$ 1024 high-resolution images. The source code for the complete system is publicly available.
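One plausible PyTorch reading of the CG block (channel sizes, grouping, and the gating head are illustrative; the released code linked in the note below has the authors' exact design):

```python
import torch
import torch.nn as nn

class CGBlockSketch(nn.Module):
    """Sketch of a Context Guided block: a local branch (standard conv),
    a surrounding-context branch (dilated conv), their concatenation as
    the joint feature, refined by a global-context channel gate."""
    def __init__(self, ch, dilation=2):
        super().__init__()
        self.local = nn.Conv2d(ch, ch // 2, 3, padding=1, groups=ch // 2)
        self.context = nn.Conv2d(ch, ch // 2, 3, padding=dilation,
                                 dilation=dilation, groups=ch // 2)
        self.bn = nn.BatchNorm2d(ch)
        self.glob = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, x):
        joint = torch.cat([self.local(x), self.context(x)], dim=1)
        joint = torch.relu(self.bn(joint))
        return joint * self.glob(joint)    # global context re-weights channels

out = CGBlockSketch(64)(torch.randn(1, 64, 32, 32))   # spatial size preserved
```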

* Code: https://github.com/wutianyiRosun/CGNet
Weakly supervised object detection has recently received much attention, since it requires only image-level labels instead of the bounding-box labels consumed in strongly supervised learning. Nevertheless, the savings in labeling expense usually come at the cost of model accuracy. In this paper, we propose a simple but effective weakly supervised collaborative learning framework to resolve this problem, which jointly trains a weakly supervised learner and a strongly supervised learner by enforcing partial feature sharing and prediction consistency. For object detection, taking a WSDDN-like architecture as the weakly supervised detector sub-network and a Faster-RCNN-like architecture as the strongly supervised detector sub-network, we propose an end-to-end Weakly Supervised Collaborative Detection Network. As no strong supervision is available to train the Faster-RCNN-like sub-network, a new prediction consistency loss is defined to enforce consistency of predictions between the two sub-networks as well as within the Faster-RCNN-like sub-network. At the same time, the two detectors are designed to partially share features to further guarantee model consistency at the perceptual level. Extensive experiments on the PASCAL VOC 2007 and 2012 datasets demonstrate the effectiveness of the proposed framework.
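A sketch of what a prediction-consistency term can look like, here as a symmetric KL divergence between the two sub-networks' image-level class distributions; the paper's exact loss differs:

```python
import torch
import torch.nn.functional as F

def consistency_loss(weak_scores, strong_scores):
    """Encourage the weakly and strongly supervised sub-networks to agree:
    weak_scores, strong_scores are (batch, n_classes) logits, e.g. the
    weak detector's aggregated image-level scores vs. the strong
    detector's predictions. Symmetric KL between the two distributions."""
    p = F.log_softmax(weak_scores, dim=-1)
    q = F.log_softmax(strong_scores, dim=-1)
    return 0.5 * (F.kl_div(p, q.exp(), reduction="batchmean")
                  + F.kl_div(q, p.exp(), reduction="batchmean"))
```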

Photorealistic frontal view synthesis from a single face image has a wide range of applications in face recognition. Although data-driven deep learning methods have been proposed to address this problem by seeking solutions from ample face data, the problem remains challenging because it is intrinsically ill-posed. This paper proposes a Two-Pathway Generative Adversarial Network (TP-GAN) for photorealistic frontal view synthesis that simultaneously perceives global structures and local details. Four landmark-located patch networks are proposed to attend to local textures in addition to the commonly used global encoder-decoder network. Beyond the novel architecture, we make this ill-posed problem well constrained by introducing a combination of adversarial loss, symmetry loss, and identity-preserving loss. The combined loss function leverages both the frontal face distribution and pre-trained discriminative deep face models to guide an identity-preserving inference of frontal views from profiles. Unlike previous deep learning methods that rely mainly on intermediate features for recognition, our method directly leverages the synthesized identity-preserving image for downstream tasks such as face recognition and attribute estimation. Experimental results demonstrate that our method not only presents compelling perceptual results but also outperforms state-of-the-art results on large-pose face recognition.
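The combined objective can be sketched as follows (a pixel reconstruction term is assumed alongside the three losses named in the abstract; the feature inputs stand for embeddings from a pre-trained face recognition network, and all weights are illustrative):

```python
import torch
import torch.nn.functional as F

def combined_loss(frontal_pred, frontal_gt, d_score, feat_pred, feat_gt,
                  w_adv=1e-3, w_sym=0.3, w_id=0.02):
    """Sketch of a TP-GAN-style generator objective: reconstruction +
    adversarial + symmetry + identity-preserving terms."""
    l_pix = F.l1_loss(frontal_pred, frontal_gt)       # assumed pixel term
    l_adv = -torch.log(d_score + 1e-8).mean()         # fool the discriminator
    l_sym = F.l1_loss(frontal_pred,                   # frontal faces should
                      torch.flip(frontal_pred, dims=[-1]))  # be ~symmetric
    l_id = F.l1_loss(feat_pred, feat_gt)              # keep identity features
    return l_pix + w_adv * l_adv + w_sym * l_sym + w_id * l_id
```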

* accepted at ICCV 2017, main paper & supplementary material, 11 pages
Stabilization and trajectory control of a quadrotor carrying a suspended load with a fixed, known mass have been extensively studied in recent years. However, the load mass is not always known beforehand and may vary during practical transportation. This mass uncertainty brings uncertain disturbances to the quadrotor system, degrading the stability and trajectory-tracking performance of existing controllers. To improve quadrotor stability and trajectory-tracking capability in this situation, we fully investigate the impacts of an uncertain load mass on the quadrotor. By comparing the performance of three different controllers -- the proportional-derivative (PD) controller, the sliding mode controller (SMC), and the model predictive controller (MPC) -- we show that stabilization, rather than trajectory-tracking error, is the aspect most affected by load-mass uncertainty. A critical motion mass exists for the quadrotor to maintain the desired transportation performance. Moreover, simulation results verify that a controller with strong robustness against disturbances is a good choice for practical applications.
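A toy 1-D altitude simulation illustrates why an unknown load mass hurts stabilization under a fixed-gain PD controller (the model and gains are invented for illustration and far simpler than the paper's setup):

```python
import numpy as np

def simulate_altitude_pd(m_true, m_nominal, z_ref=1.0, kp=8.0, kd=4.0,
                         dt=0.01, t_end=5.0, g=9.81):
    """A PD altitude loop designed for nominal mass m_nominal flies a
    quadrotor-plus-load of true mass m_true; the steady-state offset
    grows with the mass mismatch."""
    z, vz = 0.0, 0.0
    for _ in np.arange(0.0, t_end, dt):
        thrust = m_nominal * (g + kp * (z_ref - z) - kd * vz)  # feedback +
        az = thrust / m_true - g                               # gravity comp.
        vz += az * dt
        z += vz * dt
    return z

print(simulate_altitude_pd(m_true=1.5, m_nominal=1.0))  # settles below z_ref
```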

* 56 pages, 12 figures, article submitted to ASME Journal of Dynamic Systems Measurement and Control, 2016 April