Models, code, and papers for "Xin Yu":

##### O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis

Dec 05, 2017
Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, Xin Tong

We present O-CNN, an Octree-based Convolutional Neural Network (CNN) for 3D shape analysis. Built upon the octree representation of 3D shapes, our method takes the average normal vectors of a 3D model sampled in the finest leaf octants as input and performs 3D CNN operations on the octants occupied by the 3D shape surface. We design a novel octree data structure to efficiently store the octant information and CNN features into the graphics memory and execute the entire O-CNN training and evaluation on the GPU. O-CNN supports various CNN structures and works for 3D shapes in different representations. By restraining the computations on the octants occupied by 3D surfaces, the memory and computational costs of the O-CNN grow quadratically as the depth of the octree increases, which makes the 3D CNN feasible for high-resolution 3D models. We compare the performance of the O-CNN with other existing 3D CNN solutions and demonstrate the efficiency and efficacy of O-CNN in three shape analysis tasks, including object classification, shape retrieval, and shape segmentation.

##### View-volume Network for Semantic Scene Completion from a Single Depth Image

Jun 14, 2018
Yu-Xiao Guo, Xin Tong

We introduce a View-Volume convolutional neural network (VVNet) for inferring the occupancy and semantic labels of a volumetric 3D scene from a single depth image. The VVNet concatenates a 2D view CNN and a 3D volume CNN with a differentiable projection layer. Given a single RGBD image, our method extracts the detailed geometric features from the input depth image with a 2D view CNN and then projects the features into a 3D volume according to the input depth map via a projection layer. After that, we learn the 3D context information of the scene with a 3D volume CNN for computing the result volumetric occupancy and semantic labels. With combined 2D and 3D representations, the VVNet efficiently reduces the computational cost, enables feature extraction from multi-channel high resolution inputs, and thus significantly improves the result accuracy. We validate our method and demonstrate its efficiency and effectiveness on both synthetic SUNCG and real NYU dataset.

* To appear in IJCAI 2018
##### Learning Strict Identity Mappings in Deep Residual Networks

May 16, 2018
Xin Yu, Zhiding Yu, Srikumar Ramalingam

A family of super deep networks, referred to as residual networks or ResNet, achieved record-beating performance in various visual tasks such as image recognition, object detection, and semantic segmentation. The ability to train very deep networks naturally pushed the researchers to use enormous resources to achieve the best performance. Consequently, in many applications super deep residual networks were employed for just a marginal improvement in performance. In this paper, we propose epsilon-ResNet that allows us to automatically discard redundant layers, which produces responses that are smaller than a threshold epsilon, with a marginal or no loss in performance. The epsilon-ResNet architecture can be achieved using a few additional rectified linear units in the original ResNet. Our method does not use any additional variables nor numerous trials like other hyper-parameter optimization techniques. The layer selection is achieved using a single training process and the evaluation is performed on CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. In some instances, we achieve about 80% reduction in the number of parameters.

##### HRGE-Net: Hierarchical Relational Graph Embedding Network for Multi-view 3D Shape Recognition

Aug 27, 2019
Xin Wei, Ruixuan Yu, Jian Sun

View-based approach that recognizes 3D shape through its projected 2D images achieved state-of-the-art performance for 3D shape recognition. One essential challenge for view-based approach is how to aggregate the multi-view features extracted from 2D images to be a global 3D shape descriptor. In this work, we propose a novel feature aggregation network by fully investigating the relations among views. We construct a relational graph with multi-view images as nodes, and design relational graph embedding by modeling pairwise and neighboring relations among views. By gradually coarsening the graph, we build a hierarchical relational graph embedding network (HRGE-Net) to aggregate the multi-view features to be a global shape descriptor. Extensive experiments show that HRGE-Net achieves stateof-the-art performance for 3D shape classification and retrieval on benchmark datasets.

##### On the Decision Boundary of Deep Neural Networks

Aug 23, 2018
Yu Li, Lizhong Ding, Xin Gao

While deep learning models and techniques have achieved great empirical success, our understanding of the source of success in many aspects remains very limited. In an attempt to bridge the gap, we investigate the decision boundary of a production deep learning architecture with weak assumptions on both the training data and the model. We demonstrate, both theoretically and empirically, that the last weight layer of a neural network converges to a linear SVM trained on the output of the last hidden layer, for both the binary case and the multi-class case with the commonly used cross-entropy loss. Furthermore, we show empirically that training a neural network as a whole, instead of only fine-tuning the last weight layer, may result in better bias constant for the last weight layer, which is important for generalization. In addition to facilitating the understanding of deep learning, our result can be helpful for solving a broad range of practical problems of deep learning, such as catastrophic forgetting and adversarial attacking. The experiment codes are available at https://github.com/lykaust15/NN_decision_boundary

##### Topic2Vec: Learning Distributed Representations of Topics

Jun 28, 2015
Li-Qiang Niu, Xin-Yu Dai

Latent Dirichlet Allocation (LDA) mining thematic structure of documents plays an important role in nature language processing and machine learning areas. However, the probability distribution from LDA only describes the statistical relationship of occurrences in the corpus and usually in practice, probability is not the best choice for feature representations. Recently, embedding methods have been proposed to represent words and documents by learning essential concepts and representations, such as Word2Vec and Doc2Vec. The embedded representations have shown more effectiveness than LDA-style representations in many tasks. In this paper, we propose the Topic2Vec approach which can learn topic representations in the same semantic vector space with words, as an alternative to probability. The experimental results show that Topic2Vec achieves interesting and meaningful results.

* 6 pages, 3 figures
##### Pre-train and Learn: Preserve Global Information for Graph Neural Networks

Oct 27, 2019
Danhao Zhu, Xin-yu Dai, Jiajun Chen

Graph neural networks (GNNs) have shown great power in learning on attributed graphs. However, it is still a challenge for GNNs to utilize information faraway from the source node. Moreover, general GNNs require graph attributes as input, so they cannot be appled to plain graphs. In the paper, we propose new models named G-GNNs (Global information for GNNs) to address the above limitations. First, the global structure and attribute features for each node are obtained via unsupervised pre-training, which preserve the global information associated to the node. Then, using the global features and the raw network attributes, we propose a parallel framework of GNNs to learn different aspects from these features. The proposed learning methods can be applied to both plain graphs and attributed graphs. Extensive experiments have shown that G-GNNs can outperform other state-of-the-art models on three standard evaluation graphs. Specially, our methods establish new benchmark records on Cora (84.31\%) and Pubmed (80.95\%) when learning on attributed graphs.

##### On the approximation ability of evolutionary optimization with application to minimum set cover

Jan 08, 2012
Yang Yu, Xin Yao, Zhi-Hua Zhou

Evolutionary algorithms (EAs) are heuristic algorithms inspired by natural evolution. They are often used to obtain satisficing solutions in practice. In this paper, we investigate a largely underexplored issue: the approximation performance of EAs in terms of how close the solution obtained is to an optimal solution. We study an EA framework named simple EA with isolated population (SEIP) that can be implemented as a single- or multi-objective EA. We analyze the approximation performance of SEIP using the partial ratio, which characterizes the approximation ratio that can be guaranteed. Specifically, we analyze SEIP using a set cover problem that is NP-hard. We find that in a simple configuration, SEIP efficiently achieves an $H_n$-approximation ratio, the asymptotic lower bound, for the unbounded set cover problem. We also find that SEIP efficiently achieves an $(H_k-\frac{k-1}/{8k^9})$-approximation ratio, the currently best-achievable result, for the k-set cover problem. Moreover, for an instance class of the k-set cover problem, we disclose how SEIP, using either one-bit or bit-wise mutation, can overcome the difficulty that limits the greedy algorithm.

##### 6DoF Object Pose Estimation via Differentiable Proxy Voting Loss

Feb 10, 2020
Xin Yu, Zheyu Zhuang, Piotr Koniusz, Hongdong Li

Estimating a 6DOF object pose from a single image is very challenging due to occlusions or textureless appearances. Vector-field based keypoint voting has demonstrated its effectiveness and superiority on tackling those issues. However, direct regression of vector-fields neglects that the distances between pixels and keypoints also affect the deviations of hypotheses dramatically. In other words, small errors in direction vectors may generate severely deviated hypotheses when pixels are far away from a keypoint. In this paper, we aim to reduce such errors by incorporating the distances between pixels and keypoints into our objective. To this end, we develop a simple yet effective differentiable proxy voting loss (DPVL) which mimics the hypothesis selection in the voting procedure. By exploiting our voting loss, we are able to train our network in an end-to-end manner. Experiments on widely used datasets, i.e. LINEMOD and Occlusion LINEMOD, manifest that our DPVL improves pose estimation performance significantly and speeds up the training convergence.

##### Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison

Oct 24, 2019
Dongxu Li, Cristian Rodriguez Opazo, Xin Yu, Hongdong Li

Vision-based sign language recognition aims at helping the hearing-impaired people to communicate with others. However, most existing sign language datasets are limited to a small number of words. Due to the limited vocabulary size, models learned from those datasets cannot be applied in practice. In this paper, we introduce a new large-scale Word-Level American Sign Language (WLASL) video dataset, containing more than 2000 words performed by over 100 signers. This dataset will be made publicly available to the research community. To our knowledge, it is by far the largest public ASL dataset to facilitate word-level sign recognition research. Based on this new large-scale dataset, we are able to experiment with several deep learning methods for word-level sign recognition and evaluate their performances in large scale scenarios. Specifically we implement and compare two different models,i.e., (i) holistic visual appearance-based approach, and (ii) 2D human pose based approach. Both models are valuable baselines that will benefit the community for method benchmarking. Moreover, we also propose a novel pose-based temporal graph convolution networks (Pose-TGCN) that models spatial and temporal dependencies in human pose trajectories simultaneously, which has further boosted the performance of the pose-based method. Our results show that pose-based and appearance-based models achieve comparable performances up to 66% at top-10 accuracy on 2,000 words/glosses, demonstrating the validity and challenges of our dataset. We will make the large-scale dataset, as well as our baseline deep models, freely available online.

* Accepted by WACV2020, First Round
##### Point2SpatialCapsule: Aggregating Features and Spatial Relationships of Local Regions on Point Clouds using Spatial-aware Capsules

Aug 29, 2019
Xin Wen, Zhizhong Han, Xinhai Liu, Yu-Shen Liu

Learning discriminative shape representation directly on point clouds is still challenging in 3D shape analysis and understanding. Recent studies usually involve three steps: first splitting a point cloud into some local regions, then extracting corresponding feature of each local region, and finally aggregating all individual local region features into a global feature as shape representation using simple max pooling. However, such pooling-based feature aggregation methods do not adequately take the spatial relationships between local regions into account, which greatly limits the ability to learn discriminative shape representation. To address this issue, we propose a novel deep learning network, named Point2SpatialCapsule, for aggregating features and spatial relationships of local regions on point clouds, which aims to learn more discriminative shape representation. Compared with traditional max-pooling based feature aggregation networks, Point2SpatialCapsule can explicitly learn not only geometric features of local regions but also spatial relationships among them. It consists of two modules. To resolve the disorder problem of local regions, the first module, named geometric feature aggregation, is designed to aggregate the local region features into the learnable cluster centers, which explicitly encodes the spatial locations from the original 3D space. The second module, named spatial relationship aggregation, is proposed for further aggregating clustered features and the spatial relationships among them in the feature space using the spatial-aware capsules developed in this paper. Compared to the previous capsule network based methods, the feature routing on the spatial-aware capsules can learn more discriminative spatial relationships among local regions for point clouds, which establishes a direct mapping between log priors and the spatial locations through feature clusters.

##### A Context-and-Spatial Aware Network for Multi-Person Pose Estimation

May 14, 2019
Dongdong Yu, Kai Su, Xin Geng, Changhu Wang

Multi-person pose estimation is a fundamental yet challenging task in computer vision. Both rich context information and spatial information are required to precisely locate the keypoints for all persons in an image. In this paper, a novel Context-and-Spatial Aware Network (CSANet), which integrates both a Context Aware Path and Spatial Aware Path, is proposed to obtain effective features involving both context information and spatial information. Specifically, we design a Context Aware Path with structure supervision strategy and spatial pyramid pooling strategy to enhance the context information. Meanwhile, a Spatial Aware Path is proposed to preserve the spatial information, which also shortens the information propagation path from low-level features to high-level features. On top of these two paths, we employ a Heavy Head Path to further combine and enhance the features effectively. Experimentally, our proposed network outperforms state-of-the-art methods on the COCO keypoint benchmark, which verifies the effectiveness of our method and further corroborates the above proposition.

##### SAM-GCNN: A Gated Convolutional Neural Network with Segment-Level Attention Mechanism for Home Activity Monitoring

Oct 03, 2018
Yu-Han Shen, Ke-Xin He, Wei-Qiang Zhang

In this paper, we propose a method for home activity monitoring. We demonstrate our model on dataset of Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Challenge Task 5. This task aims to classify multi-channel audios into one of the provided pre-defined classes. All of these classes are daily activities performed in a home environment. To tackle this task, we propose a gated convolutional neural network with segment-level attention mechanism (SAM-GCNN). The proposed framework is a convolutional model with two auxiliary modules: a gated convolutional neural network and a segment-level attention mechanism. Furthermore, we adopted model ensemble to enhance the capability of generalization of our model. We evaluated our work on the development dataset of DCASE 2018 Task 5 and achieved competitive performance, with a macro-averaged F-1 score increasing from 83.76% to 89.33%, compared with the convolutional baseline system.

* 6 pages
##### A Novel Co-design Peta-scale Heterogeneous Cluster for Deep Learning Training

May 18, 2018
Xin Chen, Hua Zhou, Yuxiang Gao, Yu Zhu

Large scale deep Convolution Neural Networks (CNNs) increasingly demands the computing power. It is key for researchers to own a great powerful computing platform to leverage deep learning (DL) advancing.On the other hand, as the commonly-used accelerator, the commodity GPUs cards of new generations are more and more expensive. Consequently, it is of importance to design an affordable distributed heterogeneous system that provides powerful computational capacity and develop a well-suited software that efficiently utilizes its computational capacity. In this paper, we present our co-design distributed system including a peta-scale GPU cluster, called "Manoa". Based on properties and topology of Manoa, we first propose job server framework and implement it, named "MiMatrix". The central node of MiMatrix, referred to as the job server, undertakes all of controlling, scheduling and monitoring, and I/O tasks without weight data transfer for AllReduce processing in each iteration. Therefore, MiMatrix intrinsically solves the bandwidth bottleneck of central node in parameter server framework that is widely used in distributed DL tasks. Meanwhile, we also propose a new AllReduce algorithm, GPUDirect RDMA-Aware AllReduce~(GDRAA), in which both computation and handshake message are O(1) and the number of synchronization is two in each iteration that is a theoretical minimum number. Owe to the dedicated co-design distributed system, MiMatrix efficiently makes use of the Manoa's computational capacity and bandwidth. We benchmark Manoa Resnet50 and Resenet101 on Imagenet-1K dataset. Some of results have demonstrated state-of-the-art.

* 23 pages, 4 figures, 1 table
##### Face Destylization

Feb 05, 2018
Fatemeh Shiri, Xin Yu, Fatih Porikli, Piotr Koniusz

Numerous style transfer methods which produce artistic styles of portraits have been proposed to date. However, the inverse problem of converting the stylized portraits back into realistic faces is yet to be investigated thoroughly. Reverting an artistic portrait to its original photo-realistic face image has potential to facilitate human perception and identity analysis. In this paper, we propose a novel Face Destylization Neural Network (FDNN) to restore the latent photo-realistic faces from the stylized ones. We develop a Style Removal Network composed of convolutional, fully-connected and deconvolutional layers. The convolutional layers are designed to extract facial components from stylized face images. Consecutively, the fully-connected layer transfers the extracted feature maps of stylized images into the corresponding feature maps of real faces and the deconvolutional layers generate real faces from the transferred feature maps. To enforce the destylized faces to be similar to authentic face images, we employ a discriminative network, which consists of convolutional and fully connected layers. We demonstrate the effectiveness of our network by conducting experiments on an extensive set of synthetic images. Furthermore, we illustrate our network can recover faces from stylized portraits and real paintings for which the stylized data was unavailable during the training phase.

##### Sequential Dual Deep Learning with Shape and Texture Features for Sketch Recognition

Aug 09, 2017
Qi Jia, Meiyu Yu, Xin Fan, Haojie Li

Recognizing freehand sketches with high arbitrariness is greatly challenging. Most existing methods either ignore the geometric characteristics or treat sketches as handwritten characters with fixed structural ordering. Consequently, they can hardly yield high recognition performance even though sophisticated learning techniques are employed. In this paper, we propose a sequential deep learning strategy that combines both shape and texture features. A coded shape descriptor is exploited to characterize the geometry of sketch strokes with high flexibility, while the outputs of constitutional neural networks (CNN) are taken as the abstract texture feature. We develop dual deep networks with memorable gated recurrent units (GRUs), and sequentially feed these two types of features into the dual networks, respectively. These dual networks enable the feature fusion by another gated recurrent unit (GRU), and thus accurately recognize sketches invariant to stroke ordering. The experiments on the TU-Berlin data set show that our method outperforms the average of human and state-of-the-art algorithms even when significant shape and appearance variations occur.

* 8 pages, 8 figures
##### Concept Drift Adaptation by Exploiting Historical Knowledge

Feb 12, 2017
Yu Sun, Ke Tang, Zexuan Zhu, Xin Yao

Incremental learning with concept drift has often been tackled by ensemble methods, where models built in the past can be re-trained to attain new models for the current data. Two design questions need to be addressed in developing ensemble methods for incremental learning with concept drift, i.e., which historical (i.e., previously trained) models should be preserved and how to utilize them. A novel ensemble learning method, namely Diversity and Transfer based Ensemble Learning (DTEL), is proposed in this paper. Given newly arrived data, DTEL uses each preserved historical model as an initial model and further trains it with the new data via transfer learning. Furthermore, DTEL preserves a diverse set of historical models, rather than a set of historical models that are merely accurate in terms of classification accuracy. Empirical studies on 15 synthetic data streams and 4 real-world data streams (all with concept drifts) demonstrate that DTEL can handle concept drift more effectively than 4 other state-of-the-art methods.

* First version
##### On Multiplicative Multitask Feature Learning

Oct 24, 2016
Xin Wang, Jinbo Bi, Shipeng Yu, Jiangwen Sun

We investigate a general framework of multiplicative multitask feature learning which decomposes each task's model parameters into a multiplication of two components. One of the components is used across all tasks and the other component is task-specific. Several previous methods have been proposed as special cases of our framework. We study the theoretical properties of this framework when different regularization conditions are applied to the two decomposed components. We prove that this framework is mathematically equivalent to the widely used multitask feature learning methods that are based on a joint regularization of all model parameters, but with a more general form of regularizers. Further, an analytical formula is derived for the across-task component as related to the task-specific component for all these regularizers, leading to a better understanding of the shrinkage effect. Study of this framework motivates new multitask learning algorithms. We propose two new learning formulations by varying the parameters in the proposed framework. Empirical studies have revealed the relative advantages of the two new formulations by comparing with the state of the art, which provides instructive insights into the feature learning problem with multiple tasks.

* Advances in Neural Information Processing Systems 2014
##### Transferring Cross-domain Knowledge for Video Sign Language Recognition

Mar 17, 2020
Dongxu Li, Xin Yu, Chenchen Xu, Lars Petersson, Hongdong Li

Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation. It requires models to recognize isolated sign words from videos. However, annotating WSLR data needs expert knowledge, thus limiting WSLR dataset acquisition. On the contrary, there are abundant subtitled sign news videos on the internet. Since these videos have no word-level annotation and exhibit a large domain gap from isolated signs, they cannot be directly used for training WSLR models. We observe that despite the existence of a large domain gap, isolated and news signs share the same visual concepts, such as hand gestures and body movements. Motivated by this observation, we propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign to them. To this end, we extract news signs using a base WSLR model, and then design a classifier jointly trained on news and isolated signs to coarsely align these two domain features. In order to learn domain-invariant features within each class and suppress domain-specific features, our method further resorts to an external memory to store the class centroids of the aligned news signs. We then design a temporal attention based on the learnt descriptor to improve recognition performance. Experimental results on standard WSLR datasets show that our method outperforms previous state-of-the-art methods significantly. We also demonstrate the effectiveness of our method on automatically localizing signs from sign news, achieving 28.1 for AP@0.5.

* CVPR2020 (oral) preprint