Research papers and code for "Xiaoyu Zhang":
Understanding physical relations between objects, especially their support relations, is crucial for robotic manipulation. Prior work has reasoned about support relations and the structural stability of simple configurations in RGB-D images. In this paper, we propose a method for extracting more detailed physical knowledge from a set of RGB-D images taken of the same scene from different views, using qualitative reasoning and intuitive physical models. Rather than providing a simple contact-relation graph and approximating stability over convex shapes, our method provides a detailed support-relation analysis based on a volumetric representation. Specifically, it identifies true support relations between objects, e.g., whether an object supports another by touching it on the side, or whether the object above contributes to the stability of the object below. We apply our method to real-world structures captured in warehouse scenarios and show that it works as desired.

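To make the notion of a support-relation analysis concrete, here is a toy Python sketch: a directed graph of "a supports b" relations over hypothetical warehouse objects, with a naive reachability-based stability check. The objects, relations, and the check itself are illustrative assumptions, far simpler than the volumetric reasoning the paper describes.

```python
# Illustrative sketch only: a toy directed support graph, not the paper's
# volumetric method. Object names and the graph structure are made up.
from collections import defaultdict, deque

# edge (a, b) means "a supports b"
supports = [("ground", "pallet"), ("pallet", "box1"),
            ("box1", "box2"), ("pallet", "box3")]

supported_by = defaultdict(list)
for supporter, obj in supports:
    supported_by[obj].append(supporter)

def is_stable(obj, root="ground"):
    """Naive check: an object is stable if a chain of supporters reaches the ground."""
    queue, seen = deque([obj]), set()
    while queue:
        cur = queue.popleft()
        if cur == root:
            return True
        if cur in seen:
            continue
        seen.add(cur)
        queue.extend(supported_by[cur])
    return False

print(is_stable("box2"))  # True: box2 <- box1 <- pallet <- ground
```
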
Trajectory prediction is a critical technique in the navigation of robots and autonomous vehicles. However, complex traffic and dynamic uncertainties pose challenges to the effectiveness and robustness of modeling. We propose a data-driven approach, Socially Aware Kalman Neural Networks (SAKNN), in which an interaction layer and a Kalman layer are embedded in the architecture, yielding a class of architectures with great potential to learn directly from high-variance sensor input and robustly generate low-variance outcomes. The evaluation of our approach on the NGSIM dataset demonstrates that SAKNN achieves state-of-the-art prediction effectiveness over a relatively long horizon and significantly improves the signal-to-noise ratio of the predicted signal.

* Superseded by arXiv:1902.10928
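As a rough illustration of what a Kalman layer contributes, here is a minimal constant-velocity Kalman filter in Python that turns high-variance position measurements into a lower-variance state estimate. All matrices and noise levels are assumptions for the example, not values from the paper.

```python
import numpy as np

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition: [position, velocity]
H = np.array([[1.0, 0.0]])              # we only measure position
Q = 0.01 * np.eye(2)                    # process noise (assumed)
R = np.array([[0.5]])                   # measurement noise (assumed, high variance)

x = np.zeros(2)          # state estimate
P = np.eye(2)            # estimate covariance

def kalman_step(x, P, z):
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update with measurement z
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

for z in [0.12, 0.19, 0.33, 0.41]:      # noisy positions from a sensor
    x, P = kalman_step(x, P, np.array([z]))
print(x)  # smoothed [position, velocity]
```
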
Even though face recognition works very well in frontal views and normal lighting conditions, performance degrades sharply in extreme conditions. Recently, many works have dealt with the pose and illumination problems separately; however, lighting and pose variation are usually encountered at the same time. Accordingly, we propose an end-to-end face recognition method based on convolutional networks that deals with pose and illumination simultaneously, extracting discriminative nonlinear features that are invariant to pose and illumination. Since the global structure of images taken from different views is usually quite diverse, we propose to use 1x1 convolutional kernels to extract local features. Furthermore, a parallel multi-stream multi-layer 1x1 convolution network is developed to extract multi-hierarchy features. In our experiments, we obtained an average face recognition rate of 96.9% on the MultiPIE dataset, which improves the state of the art for face recognition across poses and illumination by 7.5%. For profile-wise positions in particular, the average recognition rate of our proposed network is 97.8%, which increases the state-of-the-art recognition rate by 19%.

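A hedged sketch of the parallel multi-stream 1x1 convolution idea in PyTorch; the stream count, channel sizes, and the fusion step are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiStream1x1(nn.Module):
    def __init__(self, in_ch=64, out_ch=64, streams=3):
        super().__init__()
        # each stream extracts local (per-pixel) features with a 1x1 kernel
        self.streams = nn.ModuleList(
            nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1), nn.ReLU())
            for _ in range(streams))
        self.fuse = nn.Conv2d(out_ch * streams, out_ch, kernel_size=1)

    def forward(self, x):
        # concatenate the multi-hierarchy features, then fuse them
        return self.fuse(torch.cat([s(x) for s in self.streams], dim=1))

feats = MultiStream1x1()(torch.randn(2, 64, 32, 32))
print(feats.shape)  # torch.Size([2, 64, 32, 32])
```
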
This paper presents an overview of the sixth AIBIRDS competition, held at the 26th International Joint Conference on Artificial Intelligence. This competition tasked participants with developing an intelligent agent that can play the physics-based puzzle game Angry Birds. This game uses a sophisticated physics engine that requires agents to reason about and predict the outcome of actions with only limited environmental information. Agents entered into this competition were required to solve a wide assortment of previously unseen levels within a set time limit. The physical reasoning and planning required to solve these levels are very similar to those required by many real-world problems. This year's competition featured some of the best agents developed so far and even included several new AI techniques such as deep reinforcement learning. Within this paper, we describe the framework, rules, submitted agents, and results of this competition. We also provide background information on related work and other video game AI competitions, and discuss potential ideas for future AIBIRDS competitions and agent improvements.

Traffic prediction is a fundamental and vital task in Intelligent Transportation Systems (ITS), but achieving high accuracy at low computational complexity is very challenging due to the spatiotemporal characteristics of traffic flow, especially in metropolitan settings. In this work, a new topological framework, called Linkage Network, is proposed to model road networks and capture the propagation patterns of traffic flow. Based on the Linkage Network model, a novel online predictor, named Graph Recurrent Neural Network (GRNN), is designed to learn the propagation patterns in the graph. It simultaneously predicts traffic flow for all road segments based on information gathered from the whole graph, reducing the computational complexity significantly from O(nm) to O(n+m) while maintaining high accuracy. Moreover, it can also predict the variations of traffic trends. Experiments on real-world data demonstrate that the proposed method outperforms existing prediction methods.

* 8 pages, 7 figures
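A minimal sketch of one graph-recurrent step, in the spirit of the O(n+m) claim above: hidden states are propagated along road-segment edges with a sparse adjacency matrix, so one step costs roughly nodes plus edges. The sizes, the toy graph, and the GRU cell choice are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

n, hidden = 5, 8
edges = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])        # segment i -> segment j
A = torch.sparse_coo_tensor(edges, torch.ones(4), (n, n)).coalesce()

gru = nn.GRUCell(hidden, hidden)
h = torch.zeros(n, hidden)       # hidden state per road segment
x = torch.randn(n, hidden)       # current traffic features per segment

# one propagation step: aggregate upstream neighbors (sparse matmul touches
# only the m edges), then update every node at once
h = gru(x + torch.sparse.mm(A, h), h)
print(h.shape)  # torch.Size([5, 8])
```
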
With the rapid development of the fashion market, customers' demand for fashion recommendation is rising. In this paper, we investigate a practical fashion recommendation problem: which item should be selected to match the given fashion items and form a compatible outfit? The key to this problem is estimating outfit compatibility. Previous works, which focus on the compatibility of two items or represent an outfit as a sequence, fail to make full use of the complex relations among the items in an outfit. To remedy this, we propose to represent an outfit as a graph. In particular, we construct a Fashion Graph in which each node represents a category and each edge represents an interaction between two categories. Accordingly, each outfit can be represented as a subgraph by putting its items into their corresponding category nodes. To infer outfit compatibility from such a graph, we propose Node-wise Graph Neural Networks (NGNN), which better model node interactions and learn better node representations. In NGNN, the node interaction on each edge is different, determined by parameters correlated with the two connected nodes. An attention mechanism is utilized to calculate the outfit compatibility score from the learned node representations. NGNN can model outfit compatibility not only from the visual or textual modality but also from multiple modalities. We conduct experiments on two tasks: (1) fill-in-the-blank, suggesting an item that matches the existing components of an outfit; and (2) compatibility prediction, predicting the compatibility scores of given outfits. Experimental results demonstrate the great superiority of our proposed method over others.

* 11 pages, accepted by the 2019 World Wide Web Conference (WWW-2019)
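A hedged sketch of edge-conditioned message passing plus an attentive readout, in the spirit of NGNN; the exact parameterization in the paper differs, and the per-category factors, sizes, and toy graph below are all my assumptions.

```python
import torch
import torch.nn as nn

class EdgeConditionedGNN(nn.Module):
    def __init__(self, n_categories=4, dim=16):
        super().__init__()
        # each category node owns a factor matrix; an edge (i, j) combines the
        # two endpoints' factors, so every edge mixes messages differently
        self.factor = nn.Parameter(torch.randn(n_categories, dim, dim) * 0.1)
        self.att = nn.Linear(dim, 1)

    def forward(self, h, edges):
        msgs = torch.zeros_like(h)
        for i, j in edges:  # message from j to i, conditioned on both endpoints
            msgs[i] = msgs[i] + self.factor[i] @ (self.factor[j] @ h[j])
        h = torch.tanh(h + msgs)
        # attention over nodes -> a single outfit compatibility score
        w = torch.softmax(self.att(h), dim=0)
        return (w * h).sum(0).mean()

h = torch.randn(4, 16)                      # one item embedding per category node
score = EdgeConditionedGNN()(h, [(0, 1), (1, 2), (2, 3)])
print(float(score))
```
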
In this paper, we present a reverberation removal approach for speaker verification, utilizing dual-label deep neural networks (DNNs). The networks perform feature mapping between the spectral features of reverberant and clean speech. Long short-term memory recurrent neural networks (LSTMs) are trained to map corrupted Mel filterbank (MFB) features to two sets of labels: (i) the clean MFB features, and (ii) either estimated pitch tracks or the fast Fourier transform (FFT) spectrogram of clean speech. The performance of reverberation removal is evaluated by the equal error rates (EERs) of speaker verification experiments.

* 4 pages, 3 figures, submitted to Interspeech 2018
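A minimal sketch of the dual-label idea in PyTorch: one LSTM trunk with two regression heads (clean MFB features and a secondary target), trained with a summed loss. Feature sizes and the loss weighting are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DualLabelLSTM(nn.Module):
    def __init__(self, n_mfb=40, n_aux=1):
        super().__init__()
        self.lstm = nn.LSTM(n_mfb, 128, num_layers=2, batch_first=True)
        self.head_clean = nn.Linear(128, n_mfb)  # label 1: clean MFB features
        self.head_aux = nn.Linear(128, n_aux)    # label 2: e.g. a pitch track

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.head_clean(h), self.head_aux(h)

model = DualLabelLSTM()
x = torch.randn(8, 100, 40)              # reverberant MFB frames
clean_hat, aux_hat = model(x)
loss = nn.functional.mse_loss(clean_hat, torch.randn(8, 100, 40)) \
     + 0.5 * nn.functional.mse_loss(aux_hat, torch.randn(8, 100, 1))
loss.backward()
```
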
In this paper, a novel strategy of Secure Steganography based on Generative Adversarial Networks is proposed to generate suitable and secure covers for steganography. The proposed architecture has one generative network and two discriminative networks. The generative network mainly evaluates the visual quality of the generated images for steganography, and the discriminative networks are utilized to assess their suitability for information hiding. Different from existing work, which adopts Deep Convolutional Generative Adversarial Networks, we utilize another form of generative adversarial network that brings significant improvements in convergence speed, training stability, and image quality. Furthermore, a sophisticated steganalysis network is reconstructed for the discriminative network, so that it can better evaluate the performance of the generated images. Numerous experiments are conducted on publicly available datasets to demonstrate the effectiveness and robustness of the proposed method.

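A skeletal sketch of the one-generator, two-discriminator setup described above; the actual losses and the steganalysis network are far more elaborate, and every module body below is a stand-in.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())         # cover generator
D_visual = nn.Sequential(nn.Linear(784, 1))               # visual-quality critic
D_steg = nn.Sequential(nn.Linear(784, 1))                 # steganalysis critic
bce = nn.BCEWithLogitsLoss()

z = torch.randn(16, 100)
fake = G(z)
real_label = torch.ones(16, 1)

# the generator is trained to fool both critics at once: covers should look
# natural AND resist steganalysis after message embedding
g_loss = bce(D_visual(fake), real_label) + bce(D_steg(fake), real_label)
g_loss.backward()
```
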
Transmission Control Protocol (TCP) congestion control is one of the key techniques for improving network performance. Identification of the TCP congestion control algorithm (TCP identification) can be used to significantly improve network efficiency. Existing TCP identification methods can only be applied to a limited number of TCP congestion control algorithms and focus on wired networks. In this paper, we propose a machine-learning-based passive TCP identification method for wired and wireless networks. After comparing three typical machine learning models, we conclude that a 4-layer Long Short-Term Memory (LSTM) model achieves the best identification accuracy. Our approach achieves better than 98% accuracy in wired and wireless networks and works for newly proposed TCP congestion control algorithms.

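A hedged sketch of a 4-layer LSTM classifier over per-flow time series, matching the model family named above; the input features, sequence length, and set of algorithm classes are assumptions.

```python
import torch
import torch.nn as nn

class TCPIdentifier(nn.Module):
    def __init__(self, n_features=3, n_algorithms=8):
        super().__init__()
        self.lstm = nn.LSTM(n_features, 64, num_layers=4, batch_first=True)
        self.cls = nn.Linear(64, n_algorithms)

    def forward(self, x):
        _, (h, _) = self.lstm(x)
        return self.cls(h[-1])           # classify from the last layer's state

# e.g. per-interval [throughput, RTT, loss] measured passively for each flow
logits = TCPIdentifier()(torch.randn(4, 50, 3))
print(logits.argmax(dim=1))              # predicted congestion-control class
```
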
Image captioning is a challenging problem owing to the complexity of understanding the image content and the diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state-of-the-art approaches follow an encoder-decoder framework, which generates captions using a sequential recurrent prediction model. In this paper, however, we introduce a novel decision-making framework for image captioning. We utilize a "policy network" and a "value network" to collaboratively generate captions. The policy network serves as local guidance by providing the confidence of predicting the next word according to the current state. The value network serves as global, lookahead guidance by evaluating all possible extensions of the current state. In essence, it adjusts the goal of predicting the correct words towards the goal of generating captions similar to the ground-truth captions. We train both networks using an actor-critic reinforcement learning model, with a novel reward defined by visual-semantic embedding. Extensive experiments and analyses on the Microsoft COCO dataset show that the proposed framework outperforms state-of-the-art approaches across different evaluation metrics.

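An illustrative sketch of combining a policy score with a value estimate when picking the next word, which is the decision-making idea described above. Both networks are stand-ins, the "extended states" are faked rather than re-decoded, and the mixing rule and weight are my assumptions, not the paper's inference procedure.

```python
import torch
import torch.nn as nn

vocab, dim = 1000, 256
policy = nn.Linear(dim, vocab)   # local guidance: p(next word | state)
value = nn.Linear(dim, 1)        # global guidance: score of an extended state

state = torch.randn(dim)
log_p = torch.log_softmax(policy(state), dim=0)

# evaluate the lookahead value of each candidate extension (here: a fake
# perturbed state per word; a real decoder would re-encode the sequence)
cand_states = state + 0.01 * torch.randn(vocab, dim)
v = value(cand_states).squeeze(1)

lam = 0.4                                # policy/value trade-off (assumed)
next_word = torch.argmax(lam * log_p + (1 - lam) * v)
print(int(next_word))
```
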
Recently, there has been revived interest in designing automatic programs (e.g., using genetic/evolutionary algorithms) to optimize the structure of Convolutional Neural Networks (CNNs) for a specific task. The challenge in designing such programs lies in balancing the large search space of network structures against high computational costs. Existing works either impose strong restrictions on the search space or use enormous computing resources. In this paper, we study how to design a genetic programming approach for optimizing the structure of a CNN for a given task under limited computational resources, yet without imposing strong restrictions on the search space. To reduce computational costs, we propose two general strategies that we observe to be helpful: (i) aggressively selecting the strongest individuals for survival and reproduction, and killing weaker individuals at a very early age; (ii) increasing the mutation frequency to encourage diversity and faster evolution. The combined strategy, together with additional optimization techniques, allows us to explore a large search space at affordable computational cost. Our results on standard benchmark datasets (MNIST, SVHN, CIFAR-10, CIFAR-100) are competitive with similar approaches at significantly reduced computational cost.

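A toy sketch of the two strategies named above: keep only the strongest few individuals each generation (weak individuals die early) and mutate aggressively. The fitness function here is a cheap stand-in for the expensive "train the candidate CNN, read its accuracy" step.

```python
import random

def fitness(genome):                 # placeholder for "train CNN, read accuracy"
    return -sum((g - 0.5) ** 2 for g in genome)

def mutate(genome, rate=0.5):        # high mutation rate to encourage diversity
    return [g + random.gauss(0, 0.1) if random.random() < rate else g
            for g in genome]

population = [[random.random() for _ in range(8)] for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    survivors = population[:4]       # aggressive selection: only top 20% survive
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(16)]
print(max(fitness(g) for g in population))
```
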
Deep CNNs have achieved great success in text detection. Most existing methods attempt to improve accuracy with sophisticated network designs while paying less attention to speed. In this paper, we propose a general framework for text detection, called Guided CNN, that achieves both goals simultaneously. The proposed model consists of one guidance subnetwork, in which a guidance mask is learned from the input image itself, and one primary text detector, in which every convolution and non-linear operation is conducted only within the guidance mask. On the one hand, the guidance subnetwork coarsely filters out non-text regions, greatly reducing the computational complexity. On the other hand, the primary text detector focuses on distinguishing text from hard non-text regions and regressing text bounding boxes, achieving better detection accuracy. A training strategy, called background-aware block-wise random synthesis, is proposed to further boost performance. We demonstrate that the proposed Guided CNN is not only effective but also efficient, with two state-of-the-art methods, CTPN and EAST, as backbones. On the challenging ICDAR 2013 benchmark, it speeds up CTPN by 2.9 times on average while improving the F-measure by 1.5%; on ICDAR 2015, it speeds up EAST by 2.0 times while improving the F-measure by 1.0%.

* Submitted to British Machine Vision Conference (BMVC), 2018
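A simplified sketch of the guidance-masking idea: a cheap subnetwork predicts a text-likelihood mask, and the primary detector's activations are gated by it. Real savings require actually skipping the masked-out computation; here the mask is only applied multiplicatively for clarity, and both networks are minimal stand-ins.

```python
import torch
import torch.nn as nn

guidance = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())
primary = nn.Conv2d(3, 16, 3, padding=1)

x = torch.randn(1, 3, 64, 64)
mask = (guidance(x) > 0.5).float()       # coarse text / non-text decision
features = primary(x) * mask             # primary detector works inside the mask
print(mask.mean())                        # fraction of the image kept
```
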
Forecasting the motion of surrounding dynamic obstacles (vehicles, bicycles, pedestrians, etc.) benefits on-road motion planning for autonomous vehicles. Complex traffic scenes pose great challenges in modeling the traffic patterns of surrounding dynamic obstacles. In this paper, we propose a multi-layer architecture, Interaction-aware Kalman Neural Networks (IaKNN), which comprises an interaction layer for resolving high-dimensional traffic environmental observations into interaction-aware accelerations, a motion layer for transforming the accelerations into interaction-aware trajectories, and a filter layer for estimating future trajectories with a Kalman filter. Experiments on the NGSIM dataset demonstrate that IaKNN outperforms state-of-the-art methods in trajectory prediction effectiveness.

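A minimal sketch of the motion-layer idea: integrate interaction-aware accelerations into a trajectory with simple kinematics. The horizon, time step, and acceleration values are illustrative, not the paper's.

```python
import numpy as np

dt = 0.2
acc = np.array([[0.5, 0.0], [0.4, 0.1], [0.2, 0.1]])   # (ax, ay) per step
pos, vel = np.zeros(2), np.array([10.0, 0.0])           # initial state

trajectory = []
for a in acc:                    # x' = x + v*dt + 0.5*a*dt^2 ; v' = v + a*dt
    pos = pos + vel * dt + 0.5 * a * dt**2
    vel = vel + a * dt
    trajectory.append(pos.copy())
print(np.array(trajectory))
```
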
In this paper, we propose a novel multi-label learning framework, called Multi-Label Self-Paced Learning (MLSPL), which incorporates the self-paced learning strategy into the multi-label learning regime. In light of the benefits of the easy-to-hard strategy of self-paced learning, MLSPL learns multiple labels jointly by gradually including label learning tasks and instances in model training, from easy to hard. We first introduce a self-paced function as a regularizer in the multi-label learning formulation, so as to simultaneously rank the priorities of label learning tasks and instances in each learning iteration. Since different multi-label learning scenarios often need different self-paced schemes during optimization, we propose a general way to find the desired self-paced functions. Experimental results on three benchmark datasets demonstrate the state-of-the-art performance of our approach.

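For context, here is a minimal sketch of the classic hard self-paced regularizer: given current per-instance losses, only "easy" instances (loss below a threshold) are selected for the next update, and the threshold is annealed so that harder instances are admitted over time. This is the standard SPL scheme, not MLSPL's specific self-paced function.

```python
import numpy as np

losses = np.array([0.1, 0.4, 0.9, 2.5, 5.0])   # per-instance training losses

def spl_weights(losses, lam):
    # closed-form minimizer of sum_i (v_i * l_i - lam * v_i) over v in {0, 1}:
    # select instance i iff its loss is below the current "model age" lam
    return (losses < lam).astype(float)

for lam in [0.5, 1.0, 3.0]:      # lam grows: harder instances are admitted
    print(lam, spl_weights(losses, lam))
```
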
E-commerce sponsored search contributes an important part of the revenue of an e-commerce company. For effectiveness and efficiency, a large-scale sponsored search system commonly adopts a multi-stage architecture; we refer to these stages as ad retrieval, ad pre-ranking, and ad ranking, and collectively refer to ad retrieval and ad pre-ranking as ad matching in this paper. We propose an end-to-end neural matching framework (EENMF) to model two tasks: vector-based ad retrieval and neural-network-based ad pre-ranking. Under this deep matching framework, vector-based ad retrieval harnesses the user's recent behavior sequence to retrieve relevant ad candidates without the constraint of keyword bidding. Simultaneously, the deep model performs global pre-ranking of ad candidates from multiple retrieval paths effectively and efficiently. In addition, the proposed model optimizes a pointwise cross-entropy loss, which is consistent with the objective of the prediction models in the ranking stage. We conduct an extensive evaluation to validate the performance of the proposed framework. On the real traffic of a large-scale e-commerce sponsored search, the proposed approach significantly outperforms the baseline.

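A hedged sketch of vector-based retrieval: embed the user's recent behavior, then take the top-k ads by inner product. The embedding sizes, ad corpus, and k are assumptions; a production system would use an approximate nearest-neighbor index rather than a brute-force scan.

```python
import numpy as np

rng = np.random.default_rng(0)
ad_vectors = rng.normal(size=(10000, 64))    # precomputed ad embeddings
ad_vectors /= np.linalg.norm(ad_vectors, axis=1, keepdims=True)

user_vector = rng.normal(size=64)            # output of the user-behavior tower
user_vector /= np.linalg.norm(user_vector)

scores = ad_vectors @ user_vector            # inner-product relevance
top_k = np.argsort(-scores)[:50]             # ad candidates for pre-ranking
print(top_k[:5], scores[top_k[0]])
```
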
Weight pruning and weight quantization are two important categories of DNN model compression. Prior work on these techniques is mainly based on heuristics. A recent work developed a systematic framework for DNN weight pruning using the advanced optimization technique ADMM (Alternating Direction Method of Multipliers), achieving some of the state-of-the-art weight pruning results. In this work, we first extend this one-shot ADMM-based framework to guarantee solution feasibility and provide a fast convergence rate, and generalize it to weight quantization as well. We further develop a multi-step, progressive DNN weight pruning and quantization framework with dual benefits: (i) achieving further weight pruning/quantization thanks to the special property of ADMM regularization, and (ii) reducing the search space within each step. Extensive experimental results demonstrate superior performance compared with prior work. Some highlights: (i) we achieve 246x, 36x, and 8x weight pruning on the LeNet-5, AlexNet, and ResNet-50 models, respectively, with (almost) zero accuracy loss; (ii) even a significant 61x weight pruning of AlexNet (ImageNet) results in only minor degradation in actual accuracy compared with prior work; (iii) we are among the first to derive notable weight pruning results for ResNet and MobileNet models; (iv) we derive the first lossless, fully binarized (for all layers) LeNet-5 for MNIST and VGG-16 for CIFAR-10; and (v) we derive the first fully binarized (for all layers) ResNet for ImageNet with reasonable accuracy loss.

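A minimal sketch of the projection step at the heart of ADMM weight pruning: the auxiliary variable Z is the Euclidean projection of W + U onto the constraint set "at most k nonzeros", i.e. keep the k largest-magnitude entries. Shapes, k, and the single-step usage are illustrative.

```python
import numpy as np

def project_sparse(M, k):
    # Euclidean projection onto {at most k nonzeros}: keep the k
    # largest-magnitude entries, zero out the rest
    out = np.zeros_like(M)
    idx = np.unravel_index(np.argsort(np.abs(M), axis=None)[-k:], M.shape)
    out[idx] = M[idx]
    return out

W = np.random.randn(8, 8)        # layer weights
U = np.zeros_like(W)             # scaled dual variable
Z = project_sparse(W + U, k=10)  # one ADMM Z-update
U = U + W - Z                    # dual update; the W-update is ordinary SGD
                                 # plus a (rho/2)*||W - Z + U||^2 penalty term
print(np.count_nonzero(Z))       # 10
```
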
In this work, we propose a graph-adaptive pruning (GAP) method for efficient inference of convolutional neural networks (CNNs). In this method, the network is viewed as a computational graph in which the vertices denote computation nodes and the edges represent information flow. Through topology analysis, GAP is capable of adapting to different network structures, especially the cross connections and multi-path data flow widely used in recent convolutional models. Models can be adaptively pruned at the vertex level as well as the edge level without any post-processing, so GAP directly yields practical model compression and inference speed-up. Moreover, it does not need any customized computation library or hardware support. Finetuning is conducted after pruning to restore model performance. In the finetuning step, we adopt a self-taught knowledge distillation (KD) strategy that utilizes information from the original model, through which the performance of the optimized model can be sufficiently improved without introducing any other teacher model. Experimental results show that the proposed GAP achieves promising results in making inference more efficient; e.g., for ResNeXt-29 on CIFAR-10, it yields 13x model compression and 4.3x practical speed-up with marginal loss of accuracy.

* 7 pages, 7 figures
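A minimal sketch of the self-taught distillation used after pruning: the unpruned original model, rather than a separate teacher, provides soft targets for finetuning the pruned model. The temperature and loss weighting below are assumptions.

```python
import torch
import torch.nn.functional as F

def self_taught_kd_loss(student_logits, original_logits, labels, T=4.0, alpha=0.7):
    # soft targets come from the original (unpruned) model's logits
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(original_logits / T, dim=1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = self_taught_kd_loss(torch.randn(8, 10, requires_grad=True),
                           torch.randn(8, 10),
                           torch.randint(0, 10, (8,)))
loss.backward()
```
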
In this paper, a new data-driven information hiding scheme called generative steganography by sampling (GSS) is proposed. The stego image is directly sampled by a powerful generator without an explicit cover. A secret key shared by both parties is used for message embedding and extraction. The Jensen-Shannon divergence is introduced as a new criterion for evaluating the security of generative steganography. Based on these principles, a simple practical generative steganography method is proposed using semantic image inpainting. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated stego images.

* arXiv admin note: substantial text overlap with arXiv:1804.06514 and arXiv:1803.09219
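For reference, a minimal sketch of the Jensen-Shannon divergence named above as the security criterion, applied here to toy discretized distributions; a real evaluation would estimate the cover and stego distributions from image statistics.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

cover_hist = np.array([0.20, 0.30, 0.30, 0.20])   # toy feature histogram
stego_hist = np.array([0.22, 0.28, 0.31, 0.19])
print(js_divergence(cover_hist, stego_hist))       # near 0: statistically close
```
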
Conventional video compression approaches use a predictive coding architecture and encode the corresponding motion information and residual information. In this paper, taking advantage of both the classical architecture of conventional video compression methods and the powerful non-linear representation ability of neural networks, we propose the first end-to-end deep video compression model that jointly optimizes all the components of video compression. Specifically, learning-based optical flow estimation is utilized to obtain the motion information and reconstruct the current frames. Then we employ two auto-encoder-style neural networks to compress the corresponding motion and residual information. All the modules are jointly learned through a single loss function, in which they collaborate with each other by trading off between reducing the number of compression bits and improving the quality of the decoded video. Experimental results show that the proposed approach outperforms the widely used video coding standard H.264 in terms of PSNR and is even on par with the latest standard H.265 in terms of MS-SSIM. Code will be publicly available upon acceptance.

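A hedged sketch of the joint rate-distortion objective that ties the modules together: loss = lambda * distortion + rate, where the rate term stands in for the entropy-coded size of the motion and residual codes. The bit proxy below is a toy, not the paper's learned entropy model, and lambda is an assumed value.

```python
import torch

def rd_loss(frame, recon, motion_code, residual_code, lam=256.0):
    distortion = torch.mean((frame - recon) ** 2)   # MSE, i.e. the PSNR term
    # toy proxy for bits: magnitude of the latent codes (a real system would
    # use a learned entropy model to estimate the actual bit cost)
    rate = motion_code.abs().mean() + residual_code.abs().mean()
    return lam * distortion + rate

loss = rd_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
               torch.randn(1, 128, 8, 8, requires_grad=True),
               torch.randn(1, 128, 8, 8))
loss.backward()
```
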