Models, code, and papers for "Haiyang Xu":

Explicit Shape Encoding for Real-Time Instance Segmentation

Aug 12, 2019
Wenqiang Xu, Haiyang Wang, Fubo Qi, Cewu Lu

In this paper, we propose a novel top-down instance segmentation framework based on explicit shape encoding, named \textbf{ESE-Seg}. It largely reduces the computational cost of instance segmentation by explicitly decoding multiple object shapes with tensor operations, thus performing instance segmentation at almost the same speed as object detection. ESE-Seg is based on a novel shape signature, Inner-center Radius (IR), Chebyshev polynomial fitting, and strong modern object detectors. ESE-Seg with YOLOv3 outperforms Mask R-CNN on Pascal VOC 2012 at mAP$^r$@0.5 while running 7 times faster.
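
The encoding/decoding step can be sketched in a few lines: a contour becomes a fixed-length vector of Chebyshev coefficients of its center-to-boundary radius function, and decoding is pure tensor arithmetic, matching the abstract's claim that mask decoding reduces to tensor operations. A minimal NumPy illustration under our own assumptions (sampling density, polynomial degree), not the authors' released code:

```python
import numpy as np

def encode_contour(contour, center, degree=8):
    """Encode a closed contour (N, 2) as Chebyshev coefficients of its
    inner-center radius signature r(theta)."""
    dx, dy = contour[:, 0] - center[0], contour[:, 1] - center[1]
    theta = np.arctan2(dy, dx)          # angle of each contour point
    radius = np.hypot(dx, dy)           # distance from the inner center
    order = np.argsort(theta)           # sort into a function r(theta)
    # Map angles to [-1, 1], the natural Chebyshev domain.
    t = theta[order] / np.pi
    return np.polynomial.chebyshev.chebfit(t, radius[order], degree)

def decode_contour(coeffs, center, n_points=360):
    """Decode coefficients back into contour points with tensor ops only."""
    theta = np.linspace(-np.pi, np.pi, n_points, endpoint=False)
    radius = np.polynomial.chebyshev.chebval(theta / np.pi, coeffs)
    return np.stack([center[0] + radius * np.cos(theta),
                     center[1] + radius * np.sin(theta)], axis=1)
```

A detection head can then regress one such coefficient vector per box, so producing masks adds little over detection time.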

* To appear in ICCV 2019

Visual Rhythm Prediction with Feature-Aligning Network

Jan 29, 2019
Yutong Xie, Haiyang Wang, Yan Hao, Zihao Xu

In this paper, we propose a data-driven visual rhythm prediction method, which overcomes the deficiency of previous works whose predictions were made primarily by hand-crafted hard rules. In our approach, we first extract features including original frames and their residuals, optical flow, scene change, and body pose. These visual features are then fed into an end-to-end neural network as inputs. We observe slight misalignment between features over the timeline and attribute it to differences in how the features are computed. To solve this problem, the extracted features are aligned by an elaborately designed layer, which can also be applied to other models suffering from mismatched features and boost their performance. The aligned features are then fed into sequence labeling layers implemented with BiLSTM and CRF to predict the onsets. Due to the lack of an existing public training and evaluation set, we experiment on a dataset we constructed from professionally edited Music Videos (MVs), on which the F1 score of our approach reaches 79.6.
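
As a rough sketch of the back end of such a pipeline, the snippet below re-aligns concatenated feature streams with a small learnable temporal filter and tags per-frame onsets with a BiLSTM. The aligning layer here is our guess at one plausible realization (the abstract does not specify the design), and the CRF on top is omitted:

```python
import torch.nn as nn

class FeatureAlign(nn.Module):
    """Learn a small temporal re-alignment per feature channel with a
    depthwise 1-D convolution (our guess at an aligning layer)."""
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.shift = nn.Conv1d(dim, dim, kernel_size,
                               padding=kernel_size // 2, groups=dim)

    def forward(self, x):                       # x: (batch, time, dim)
        return self.shift(x.transpose(1, 2)).transpose(1, 2)

class OnsetTagger(nn.Module):
    """Align concatenated visual features, then tag onsets with a BiLSTM.
    (The paper additionally uses a CRF on top, omitted here.)"""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.align = FeatureAlign(dim)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)    # onset / no-onset per frame

    def forward(self, feats):
        h, _ = self.lstm(self.align(feats))
        return self.head(h)
```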

* 6 pages, 4 figures 

Learning Alignment for Multimodal Emotion Recognition from Speech

Sep 06, 2019
Haiyang Xu, Hui Zhang, Kun Han, Yun Wang, Yiping Peng, Xiangang Li

Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Furthermore, while emotion recognition can benefit from audio-textual multimodal information, it is not trivial to build a system that learns from multiple modalities. One can build models for the two input sources separately and combine them at the decision level, but this method ignores the interaction between speech and text in the temporal domain. In this paper, we propose to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations. The aligned multimodal features are fed into a sequential model for emotion recognition. We evaluate the approach on the IEMOCAP dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.
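
The attention-based alignment can be illustrated compactly: each text word attends over all speech frames, and the attended acoustic summary is concatenated with the word representation before the sequential model. A minimal PyTorch sketch, with dimensions and scaling chosen by us rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def align_text_to_speech(speech, text):
    """Soft-align each text word to speech frames via attention.

    speech: (batch, frames, d)  frame-level acoustic features
    text:   (batch, words,  d)  word embeddings (same dim for simplicity)
    Returns per-word multimodal features for a downstream sequence model.
    """
    scores = torch.bmm(text, speech.transpose(1, 2))    # (batch, words, frames)
    weights = F.softmax(scores / speech.size(-1) ** 0.5, dim=-1)
    attended = torch.bmm(weights, speech)               # (batch, words, d)
    return torch.cat([text, attended], dim=-1)          # fused representation
```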

* InterSpeech 2019 

DRFN: Deep Recurrent Fusion Network for Single-Image Super-Resolution with Large Factors

Aug 23, 2019
Xin Yang, Haiyang Mei, Jiqing Zhang, Ke Xu, Baocai Yin, Qiang Zhang, Xiaopeng Wei

Recently, single-image super-resolution has made great progress owing to the development of deep convolutional neural networks (CNNs). The vast majority of CNN-based models use a pre-defined upsampling operator, such as bicubic interpolation, to upscale input low-resolution images to the desired size and learn a non-linear mapping between the interpolated image and the ground truth high-resolution (HR) image. However, interpolation processing can lead to visual artifacts as details are over-smoothed, particularly when the super-resolution factor is high. In this paper, we propose a Deep Recurrent Fusion Network (DRFN), which utilizes transposed convolution instead of bicubic interpolation for upsampling and integrates different-level features extracted from recurrent residual blocks to reconstruct the final HR images. We adopt a deep recurrence learning strategy and thus obtain a larger receptive field, which is conducive to reconstructing an image more accurately. Furthermore, we show that the multi-level fusion structure is suitable for dealing with image super-resolution problems. Extensive benchmark evaluations demonstrate that the proposed DRFN performs better than most current deep learning methods in terms of accuracy and visual effects, especially at large scaling factors, while using fewer parameters.
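
The upsampling choice is easy to show in isolation: a learned transposed-convolution stack replaces the fixed bicubic pre-interpolation, so the network itself decides how detail is synthesized. A minimal PyTorch sketch with illustrative channel widths, not the actual DRFN configuration:

```python
import torch.nn as nn

# A x4 upsampler built from learned transposed convolutions, in contrast
# to a fixed bicubic pre-upsampling step. Each ConvTranspose2d with
# kernel 4, stride 2, padding 1 exactly doubles the spatial size.
upsample_x4 = nn.Sequential(
    nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1),  # x2
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1),  # x4
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 3, kernel_size=3, padding=1),  # project back to RGB
)
```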

* IEEE Transactions on Multimedia, Volume 21, Issue 2, Feb. 2019, pp. 328-337

Where Is My Mirror?

Oct 03, 2019
Xin Yang, Haiyang Mei, Ke Xu, Xiaopeng Wei, Baocai Yin, Rynson W. H. Lau

Mirrors are everywhere in our daily lives. Existing computer vision systems do not consider mirrors, and hence may get confused by the reflected content inside a mirror, resulting in severe performance degradation. However, separating the real content outside a mirror from the reflected content inside it is non-trivial. The key challenge is that mirrors typically reflect content similar to their surroundings, making it very difficult to differentiate the two. In this paper, we present a novel method to segment mirrors from an input image. To the best of our knowledge, this is the first work to address the mirror segmentation problem with a computational approach. We make the following contributions. First, we construct a large-scale mirror dataset that contains mirror images with corresponding manually annotated masks. This dataset covers a variety of daily life scenes and will be made publicly available for future research. Second, we propose a novel network, called MirrorNet, for mirror segmentation, by modeling both the semantic and low-level color/texture discontinuities between the contents inside and outside of the mirrors. Third, we conduct extensive experiments to evaluate the proposed method and show that it outperforms carefully chosen baselines from state-of-the-art detection and segmentation methods.
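
One way to picture the discontinuity modeling is a block that subtracts a wider-context view of the features from a local view, so responses peak where content disagrees with its surroundings (e.g., at mirror boundaries). This is purely our own sketch of the idea, not the published MirrorNet module:

```python
import torch.nn as nn

class ContrastBlock(nn.Module):
    """Contrast local features against their surroundings: a rough sketch
    of contextual-contrast reasoning, not MirrorNet's exact design."""
    def __init__(self, ch, dilation=4):
        super().__init__()
        self.local = nn.Conv2d(ch, ch, 3, padding=1)
        self.context = nn.Conv2d(ch, ch, 3, padding=dilation,
                                 dilation=dilation)
        self.bn = nn.BatchNorm2d(ch)

    def forward(self, x):
        # Large responses mark locations whose wider context disagrees
        # with them, a cue for mirror regions and their boundaries.
        return self.bn(self.local(x) - self.context(x)).relu()
```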

* Accepted by ICCV 2019. Project homepage: https://mhaiyang.github.io/ICCV2019_MirrorNet/index.html 

DELTA: A DEep learning based Language Technology plAtform

Aug 02, 2019
Kun Han, Junwen Chen, Hui Zhang, Haiyang Xu, Yiping Peng, Yun Wang, Ning Ding, Hui Deng, Yonghu Gao, Tingwei Guo, Yi Zhang, Yahao He, Baochang Ma, Yulong Zhou, Kangli Zhang, Chao Liu, Ying Lyu, Chenxi Wang, Cheng Gong, Yunbo Wang, Wei Zou, Hui Song, Xiangang Li

In this paper we present DELTA, a deep learning based language technology platform. DELTA is an end-to-end platform designed to solve industry-level natural language and speech processing problems. It integrates the most popular neural network models for training as well as comprehensive deployment tools for production. DELTA aims to provide easy and fast experiences for using, deploying, and developing natural language processing and speech models, for both academic and industrial use cases. We demonstrate DELTA's reliable performance on several natural language processing and speech tasks, including text classification, named entity recognition, natural language inference, speech recognition, and speaker verification. DELTA has been used to develop several state-of-the-art algorithms for publications and to deliver real production services to millions of users.

* White paper for an open source library: https://github.com/didi/delta. 13 pages, 3 figures 

Volume Preserving Image Segmentation with Entropic Regularization Optimal Transport and Its Applications in Deep Learning

Sep 22, 2019
Haifeng Li, Jun Liu, Li Cui, Haiyang Huang, Xue-cheng Tai

Image segmentation with a volume constraint is an important prior for many real applications. In this work, we present a novel volume-preserving image segmentation algorithm based on the framework of entropic regularized optimal transport theory. The classical Total Variation (TV) regularizer and volume preservation are integrated into a regularized optimal transport model, and the volume and classification constraints can be regarded as two measure-preserving constraints in the optimal transport problem. By studying the dual problem, we develop a simple and efficient dual algorithm for our model. Moreover, unlike many variational image segmentation algorithms, the proposed algorithm can be directly unrolled into a new Volume Preserving and TV regularized softmax (VPTV-softmax) layer for semantic segmentation in popular deep convolutional neural networks (DCNNs). The experimental results show that our proposed model is very competitive and can improve the performance of many semantic segmentation networks such as the popular U-Net.
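
Entropic regularized optimal transport with fixed marginals is typically solved by Sinkhorn-style alternating scalings, which also hints at why the volume constraint yields a softmax-like layer. The toy sketch below assigns n pixels to k classes so that the class masses match prescribed volumes; the cost (negative logits), the regularization strength `eps`, and the iteration count are illustrative assumptions, and the TV term is omitted:

```python
import numpy as np

def volume_preserving_softmax(scores, volumes, eps=0.1, iters=200):
    """Sinkhorn-style soft assignment of n pixels to k classes such that
    class masses match `volumes` (a toy sketch, not the paper's algorithm).

    scores:  (n, k) classifier logits; volumes: (k,) summing to n.
    Returns an (n, k) plan whose rows sum to 1 and columns to `volumes`.
    """
    # Gibbs kernel from the logits; subtract the row max for stability
    # (the shift is absorbed by the row scaling u).
    K = np.exp((scores - scores.max(axis=1, keepdims=True)) / eps)
    u = np.ones(len(scores))                # row scaling (unit pixel mass)
    for _ in range(iters):
        v = volumes / (K.T @ u)             # match the class-volume marginal
        u = 1.0 / (K @ v)                   # match the unit pixel marginal
    return (K * u[:, None]) * v[None, :]
```

With `eps` large the plan approaches a plain softmax over the logits, which is consistent with viewing the result as a volume-constrained softmax layer.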


Improving the Robustness of Speech Translation

Nov 02, 2018
Xiang Li, Haiyang Xue, Wei Chen, Yang Liu, Yang Feng, Qun Liu

Although neural machine translation (NMT) has achieved impressive progress recently, it is usually trained on clean parallel data and hence does not work well when the input sentence is produced by an automatic speech recognition (ASR) system, due to the numerous errors in the source. To solve this problem, we propose a simple but effective method to improve the robustness of NMT in the case of speech translation. We simulate the noise present in realistic ASR output and inject it into the clean parallel data, so that NMT works under similar word distributions during training and testing. Besides, we also incorporate the Chinese Pinyin feature, which is easy to obtain in speech translation, to further improve translation performance. Experimental results show that our method achieves more stable performance and outperforms the baseline by an average of 3.12 BLEU on multiple noisy test sets, while also achieving a generalization improvement on the WMT'17 Chinese-English test set.
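
The data-side idea can be sketched as a corruption function applied to clean source sentences before training. The paper simulates realistic ASR noise; the toy version below only injects random acoustically confusable substitutions and deletions, with the confusion table and rates as placeholders of our own:

```python
import random

def inject_asr_noise(tokens, homophones, sub_rate=0.1, del_rate=0.02):
    """Corrupt a clean source sentence to mimic ASR errors (toy version).

    tokens:     list of source words
    homophones: dict mapping a word to acoustically confusable words
    """
    noisy = []
    for tok in tokens:
        if random.random() < del_rate:
            continue                                      # deletion error
        if tok in homophones and random.random() < sub_rate:
            noisy.append(random.choice(homophones[tok]))  # substitution
        else:
            noisy.append(tok)
    return noisy
```

Training on pairs of `inject_asr_noise(source)` and the original target keeps the word distributions seen in training close to those at test time.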


Open DNN Box by Power Side-Channel Attack

Jul 21, 2019
Yun Xiang, Zhuangzhi Chen, Zuohui Chen, Zebin Fang, Haiyang Hao, Jinyin Chen, Yi Liu, Zhefu Wu, Qi Xuan, Xiaoniu Yang

Deep neural networks are becoming popular and important assets of many AI companies. However, recent studies indicate that they are also vulnerable to adversarial attacks. Adversarial attacks can be either white-box or black-box: white-box attacks assume full knowledge of the model, while black-box attacks assume none. In general, revealing more internal information enables much more powerful and efficient attacks. However, in most real-world applications, the internal information of embedded AI devices is unavailable, i.e., they are black-box. Therefore, in this work, we propose a side-channel-information-based technique to reveal the internal information of black-box models. Specifically, we make the following contributions: (1) we are the first to use side-channel information to reveal the internal network architecture in embedded devices; (2) we are the first to construct models for internal parameter estimation; and (3) we validate our methods on real-world devices and applications. The experimental results show that our method achieves 96.50\% accuracy on average. Such results suggest that we should pay close attention to the security of many AI applications, and develop corresponding defensive strategies in the future.
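
Conceptually, architecture identification from power traces is a supervised classification problem over measured side-channel signals. The sketch below, with simple summary-statistic features and a random forest, is our illustration of that framing, not the paper's feature extraction or model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def trace_features(trace):
    """Summarize one 1-D power trace; real attacks use richer features."""
    return [trace.mean(), trace.std(), trace.max(),
            len(trace), np.abs(np.diff(trace)).mean()]

def fit_architecture_classifier(traces, labels):
    """traces: power measurements captured while the device runs inference;
    labels: which network architecture produced each trace."""
    X = np.array([trace_features(t) for t in traces])
    clf = RandomForestClassifier(n_estimators=200).fit(X, labels)
    return clf   # clf.predict(...) guesses the black-box architecture
```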

