Models, code, and papers for "Yiping Peng":

Learning Alignment for Multimodal Emotion Recognition from Speech

Sep 06, 2019
Haiyang Xu, Hui Zhang, Kun Han, Yun Wang, Yiping Peng, Xiangang Li

Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Although emotion recognition benefits from audio-textual multimodal information, it is not trivial to build a system that learns from multiple modalities. One can build models for the two input sources separately and combine them at the decision level, but this method ignores the interaction between speech and text in the temporal domain. In this paper, we propose to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations. The aligned multimodal features are fed into a sequential model for emotion recognition. We evaluate the approach on the IEMOCAP dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.

* InterSpeech 2019 
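
The attention-based alignment described in the abstract above can be illustrated with a small sketch. The abstract does not specify the scoring function or feature dimensions, so the scaled dot-product score, the function name `align_speech_to_text`, and the assumption that word and frame vectors share one dimension are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def align_speech_to_text(word_vecs, frame_vecs):
    """For each word, attend over all speech frames.

    Returns one vector per word: the word vector concatenated with the
    attention-weighted average of the frame vectors (assumed fusion).
    """
    dim = len(frame_vecs[0])
    scale = math.sqrt(dim)  # scaled dot-product score (an assumption)
    aligned = []
    for word in word_vecs:
        weights = softmax([dot(word, frame) / scale for frame in frame_vecs])
        summary = [
            sum(w * frame[d] for w, frame in zip(weights, frame_vecs))
            for d in range(dim)
        ]
        aligned.append(list(word) + summary)
    return aligned


# Two words, three speech frames, feature dimension 2 (toy values).
words = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
features = align_speech_to_text(words, frames)
```

Each row of `features` is a word-level multimodal vector; in the setting the abstract describes, such a sequence would then be passed to a sequential model for classification.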

ThunderNet: Towards Real-time Generic Object Detection

Mar 28, 2019
Zheng Qin, Zeming Li, Zhaoning Zhang, Yiping Bao, Gang Yu, Yuxing Peng, Jian Sun

Real-time generic object detection on mobile platforms is a crucial but challenging computer vision task. However, previous CNN-based detectors suffer from enormous computational cost, which prevents real-time inference in computation-constrained scenarios. In this paper, we investigate the effectiveness of two-stage detectors in real-time generic detection and propose a lightweight two-stage detector named ThunderNet. In the backbone part, we analyze the drawbacks of previous lightweight backbones and present a lightweight backbone designed for object detection. In the detection part, we exploit an extremely efficient RPN and detection head design. To generate more discriminative feature representations, we design two efficient architecture blocks, the Context Enhancement Module and the Spatial Attention Module. Finally, we investigate the balance between the input resolution, the backbone, and the detection head. Compared with lightweight one-stage detectors, ThunderNet achieves superior performance with only 40% of the computational cost on the PASCAL VOC and COCO benchmarks. Without bells and whistles, our model runs at 24.1 fps on an ARM-based device. To the best of our knowledge, this is the first real-time detector reported on ARM platforms. Code will be released for paper reproduction.

* 10 pages, 6 figures 

DELTA: A DEep learning based Language Technology plAtform

Aug 02, 2019
Kun Han, Junwen Chen, Hui Zhang, Haiyang Xu, Yiping Peng, Yun Wang, Ning Ding, Hui Deng, Yonghu Gao, Tingwei Guo, Yi Zhang, Yahao He, Baochang Ma, Yulong Zhou, Kangli Zhang, Chao Liu, Ying Lyu, Chenxi Wang, Cheng Gong, Yunbo Wang, Wei Zou, Hui Song, Xiangang Li

In this paper, we present DELTA, a deep learning based language technology platform. DELTA is an end-to-end platform designed to solve industry-level natural language and speech processing problems. It integrates the most popular neural network models for training as well as comprehensive deployment tools for production. DELTA aims to provide an easy and fast experience for using, deploying, and developing natural language processing and speech models for both academic and industry use cases. We demonstrate reliable performance with DELTA on several natural language processing and speech tasks, including text classification, named entity recognition, natural language inference, speech recognition, and speaker verification. DELTA has been used for developing several state-of-the-art algorithms for publications and delivering real production to serve millions of users.

* White paper for an open source library: 13 pages, 3 figures 

PointNLM: Point Nonlocal-Means for vegetation segmentation based on middle echo point clouds

Jun 20, 2019
Jonathan Li, Rongren Wu, Yiping Chen, Qing Zhu, Zhipeng Luo, Cheng Wang

Middle-echo, which covers one or a few corresponding points, is a specific type of 3D point cloud acquired by a multi-echo laser scanner. In this paper, we propose a novel approach for automatic segmentation of trees that leverages middle-echo information from LiDAR point clouds. First, a convolution classification method identifies the points reflected by middle echoes among all point clouds, distinguishing them from the first and last echoes. Hence, the crown positions of the trees are quickly detected from the huge number of point clouds. Second, to accurately extract trees from all point clouds, we propose a 3D deep learning network, PointNLM, to semantically segment tree crowns. PointNLM captures the long-range relationships between points via a non-local branch and extracts high-level features via max-pooling applied to unordered points. The whole framework is evaluated using the Semantic 3D reduced-test set, where the IoU of tree point cloud segmentation reached 0.864. In addition, the semantic segmentation network was tested using the Paris-Lille-3D dataset, where its average IoU outperformed several other popular methods. The experimental results indicate that the proposed algorithm provides an excellent solution for vegetation segmentation from LiDAR point clouds.
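
The two ingredients named in the abstract above, a non-local branch and max-pooling over unordered points, can be sketched in a few lines. The abstract gives no layer details, so the dot-product similarity, the residual connection, and the function names `nonlocal_features` and `global_max_pool` are assumptions for illustration rather than the PointNLM architecture itself.

```python
import math


def nonlocal_features(points):
    """Non-local operation: each point is updated with a similarity-weighted
    average of all points, added back to the original (residual, assumed)."""
    updated = []
    for p in points:
        # Dot-product similarity between p and every point (an assumption).
        scores = [sum(a * b for a, b in zip(p, q)) for q in points]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        context = [
            sum(e / total * q[d] for e, q in zip(exps, points))
            for d in range(len(p))
        ]
        updated.append([a + c for a, c in zip(p, context)])
    return updated


def global_max_pool(features):
    """Per-dimension maximum over all points; independent of point order."""
    return [max(column) for column in zip(*features)]


# Three toy points with two features each.
cloud = [[0.0, 1.0], [2.0, 0.5], [1.0, 3.0]]
pooled = global_max_pool(nonlocal_features(cloud))
pooled_reversed = global_max_pool(nonlocal_features(list(reversed(cloud))))
```

Up to floating-point rounding, `pooled` and `pooled_reversed` agree, which is the property that makes max-pooling suitable for unordered point sets.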
