Models, code, and papers for "Xia Liang":

MIDI-Sandwich2: RNN-based Hierarchical Multi-modal Fusion Generation VAE networks for multi-track symbolic music generation

Sep 08, 2019
Xia Liang, Junmin Wu, Jing Cao

Currently, almost all the multi-track music generation models use the Convolutional Neural Network (CNN) to build the generative model, while the Recurrent Neural Network (RNN) based models can not be applied in this task. In view of the above problem, this paper proposes a RNN-based Hierarchical Multi-modal Fusion Generation Variational Autoencoder (VAE) network, MIDI-Sandwich2, for multi-track symbolic music generation. Inspired by VQ-VAE2, MIDI-Sandwich2 expands the dimension of the original hierarchical model by using multiple independent Binary Variational Autoencoder (BVAE) models without sharing weights to process the information of each track. Then, with multi-modal fusion technology, the upper layer named Multi-modal Fusion Generation VAE (MFG-VAE) combines the latent space vectors generated by the respective tracks, and uses the decoder to perform the ascending dimension reconstruction to simulate the inverse operation of multi-modal fusion, multi-modal generation, so as to realize the RNN-based multi-track symbolic music generation. For the multi-track format pianoroll, we also improve the output binarization method of MuseGAN, which solves the problem that the refinement step of the original scheme is difficult to differentiate and the gradient is hard to descent, making the generated song more expressive. The model is validated on the Lakh Pianoroll Dataset (LPD) multi-track dataset. Compared to the MuseGAN, MIDI-Sandwich2 can not only generate harmonious multi-track music, the generation quality is also close to the state of the art level. At the same time, by using the VAE to restore songs, the semi-generated songs reproduced by the MIDI-Sandwich2 are more beautiful than the pure autogeneration music generated by MuseGAN. Both the code and the audition audio samples are open source on

  Click for Model/Code and Paper
MIDI-Sandwich: Multi-model Multi-task Hierarchical Conditional VAE-GAN networks for Symbolic Single-track Music Generation

Jul 04, 2019
Xia Liang, Junmin Wu, Yan Yin

Most existing neural network models for music generation explore how to generate music bars, then directly splice the music bars into a song. However, these methods do not explore the relationship between the bars, and the connected song as a whole has no musical form structure and sense of musical direction. To address this issue, we propose a Multi-model Multi-task Hierarchical Conditional VAE-GAN (Variational Autoencoder-Generative adversarial networks) networks, named MIDI-Sandwich, which combines musical knowledge, such as musical form, tonic, and melodic motion. The MIDI-Sandwich has two submodels: Hierarchical Conditional Variational Autoencoder (HCVAE) and Hierarchical Conditional Generative Adversarial Network (HCGAN). The HCVAE uses hierarchical structure. The underlying layer of HCVAE uses Local Conditional Variational Autoencoder (L-CVAE) to generate a music bar which is pre-specified by the First and Last Notes (FLN). The upper layer of HCVAE uses Global Variational Autoencoder(G-VAE) to analyze the latent vector sequence generated by the L-CVAE encoder, to explore the musical relationship between the bars, and to produce the song pieced together by multiple music bars generated by the L-CVAE decoder, which makes the song both have musical structure and sense of direction. At the same time, the HCVAE shares a part of itself with the HCGAN to further improve the performance of the generated music. The MIDI-Sandwich is validated on the Nottingham dataset and is able to generate a single-track melody sequence (17x8 beats), which is superior to the length of most of the generated models (8 to 32 beats). Meanwhile, by referring to the experimental methods of many classical kinds of literature, the quality evaluation of the generated music is performed. The above experiments prove the validity of the model.

* cast KSEM2019 on May 3, 2019 (weak rejected) 

  Click for Model/Code and Paper
QSAR Classification Modeling for Bioactivity of Molecular Structure via SPL-Logsum

Apr 28, 2018
Liang-Yong Xia, Qing-Yong Wang

Quantitative structure-activity relationship (QSAR) modelling is effective 'bridge' to search the reliable relationship related bioactivity to molecular structure. A QSAR classification model contains a lager number of redundant, noisy and irrelevant descriptors. To address this problem, various of methods have been proposed for descriptor selection. Generally, they can be grouped into three categories: filters, wrappers, and embedded methods. Regularization method is an important embedded technology, which can be used for continuous shrinkage and automatic descriptors selection. In recent years, the interest of researchers in the application of regularization techniques is increasing in descriptors selection , such as, logistic regression(LR) with $L_1$ penalty. In this paper, we proposed a novel descriptor selection method based on self-paced learning(SPL) with Logsum penalized LR for predicting the bioactivity of molecular structure. SPL inspired by the learning process of humans and animals that gradually learns from easy samples(smaller losses) to hard samples(bigger losses) samples into training and Logsum regularization has capacity to select few meaningful and significant molecular descriptors, respectively. Experimental results on simulation and three public QSAR datasets show that our proposed SPL-Logsum method outperforms other commonly used sparse methods in terms of classification performance and model interpretation.

  Click for Model/Code and Paper
Sensing Urban Land-Use Patterns By Integrating Google Tensorflow And Scene-Classification Models

Aug 04, 2017
Yao Yao, Haolin Liang, Xia Li, Jinbao Zhang, Jialv He

With the rapid progress of China's urbanization, research on the automatic detection of land-use patterns in Chinese cities is of substantial importance. Deep learning is an effective method to extract image features. To take advantage of the deep-learning method in detecting urban land-use patterns, we applied a transfer-learning-based remote-sensing image approach to extract and classify features. Using the Google Tensorflow framework, a powerful convolution neural network (CNN) library was created. First, the transferred model was previously trained on ImageNet, one of the largest object-image data sets, to fully develop the model's ability to generate feature vectors of standard remote-sensing land-cover data sets (UC Merced and WHU-SIRI). Then, a random-forest-based classifier was constructed and trained on these generated vectors to classify the actual urban land-use pattern on the scale of traffic analysis zones (TAZs). To avoid the multi-scale effect of remote-sensing imagery, a large random patch (LRP) method was used. The proposed method could efficiently obtain acceptable accuracy (OA = 0.794, Kappa = 0.737) for the study area. In addition, the results show that the proposed method can effectively overcome the multi-scale effect that occurs in urban land-use classification at the irregular land-parcel level. The proposed method can help planners monitor dynamic urban land use and evaluate the impact of urban-planning schemes.

* 8 pages, 8 figures, 2 tables 

  Click for Model/Code and Paper
A Multi-Scale Mapping Approach Based on a Deep Learning CNN Model for Reconstructing High-Resolution Urban DEMs

Jul 19, 2019
Ling Jiang, Yang Hu, Xilin Xia, Qiuhua Liang, Andrea Soltoggio

The shortage of high-resolution urban digital elevation model (DEM) datasets has been a challenge for modelling urban flood and managing its risk. A solution is to develop effective approaches to reconstruct high-resolution DEMs from their low-resolution equivalents that are more widely available. However, the current high-resolution DEM reconstruction approaches mainly focus on natural topography. Few attempts have been made for urban topography which is typically an integration of complex man-made and natural features. This study proposes a novel multi-scale mapping approach based on convolutional neural network (CNN) to deal with the complex characteristics of urban topography and reconstruct high-resolution urban DEMs. The proposed multi-scale CNN model is firstly trained using urban DEMs that contain topographic features at different resolutions, and then used to reconstruct the urban DEM at a specified (high) resolution from a low-resolution equivalent. A two-level accuracy assessment approach is also designed to evaluate the performance of the proposed urban DEM reconstruction method, in terms of numerical accuracy and morphological accuracy. The proposed DEM reconstruction approach is applied to a 121 km2 urbanized area in London, UK. Compared with other commonly used methods, the current CNN based approach produces superior results, providing a cost-effective innovative method to acquire high-resolution DEMs in other data-scarce environments.

  Click for Model/Code and Paper
Scale-Invariant Structure Saliency Selection for Fast Image Fusion

Oct 30, 2018
Yixiong Liang, Yuan Mao, Jiazhi Xia, Yao Xiang, Jianfeng Liu

In this paper, we present a fast yet effective method for pixel-level scale-invariant image fusion in spatial domain based on the scale-space theory. Specifically, we propose a scale-invariant structure saliency selection scheme based on the difference-of-Gaussian (DoG) pyramid of images to build the weights or activity map. Due to the scale-invariant structure saliency selection, our method can keep both details of small size objects and the integrity information of large size objects in images. In addition, our method is very efficient since there are no complex operation involved and easy to be implemented and therefore can be used for fast high resolution images fusion. Experimental results demonstrate the proposed method yields competitive or even better results comparing to state-of-the-art image fusion methods both in terms of visual quality and objective evaluation metrics. Furthermore, the proposed method is very fast and can be used to fuse the high resolution images in real-time. Code is available at

  Click for Model/Code and Paper
Zoom Better to See Clearer: Human and Object Parsing with Hierarchical Auto-Zoom Net

Mar 28, 2016
Fangting Xia, Peng Wang, Liang-Chieh Chen, Alan L. Yuille

Parsing articulated objects, e.g. humans and animals, into semantic parts (e.g. body, head and arms, etc.) from natural images is a challenging and fundamental problem for computer vision. A big difficulty is the large variability of scale and location for objects and their corresponding parts. Even limited mistakes in estimating scale and location will degrade the parsing output and cause errors in boundary details. To tackle these difficulties, we propose a "Hierarchical Auto-Zoom Net" (HAZN) for object part parsing which adapts to the local scales of objects and parts. HAZN is a sequence of two "Auto-Zoom Net" (AZNs), each employing fully convolutional networks that perform two tasks: (1) predict the locations and scales of object instances (the first AZN) or their parts (the second AZN); (2) estimate the part scores for predicted object instance or part regions. Our model can adaptively "zoom" (resize) predicted image regions into their proper scales to refine the parsing. We conduct extensive experiments over the PASCAL part datasets on humans, horses, and cows. For humans, our approach significantly outperforms the state-of-the-arts by 5% mIOU and is especially better at segmenting small instances and small parts. We obtain similar improvements for parsing cows and horses over alternative methods. In summary, our strategy of first zooming into objects and then zooming into parts is very effective. It also enables us to process different regions of the image at different scales adaptively so that, for example, we do not need to waste computational resources scaling the entire image.

* A shortened version has been submitted to ECCV 2016 

  Click for Model/Code and Paper
An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal

Jan 09, 2020
Sifeng Xia, Kunchangtai Liang, Wenhan Yang, Ling-Yu Duan, Jiaying Liu

In this paper, we study a new problem arising from the emerging MPEG standardization effort Video Coding for Machine (VCM), which aims to bridge the gap between visual feature compression and classical video coding. VCM is committed to address the requirement of compact signal representation for both machine and human vision in a more or less scalable way. To this end, we make endeavors in leveraging the strength of predictive and generative models to support advanced compression techniques for both machine and human vision tasks simultaneously, in which visual features serve as a bridge to connect signal-level and task-level compact representations in a scalable manner. Specifically, we employ a conditional deep generation network to reconstruct video frames with the guidance of learned motion pattern. By learning to extract sparse motion pattern via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames via a generative model, relying on the appearance of the coded key frames. Meanwhile, the sparse motion pattern is compact and highly effective for high-level vision tasks, e.g. action recognition. Experimental results demonstrate that our method yields much better reconstruction quality compared with the traditional video codecs (0.0063 gain in SSIM), as well as state-of-the-art action recognition performance over highly compressed videos (9.4% gain in recognition accuracy), which showcases a promising paradigm of coding signal for both human and machine vision.

  Click for Model/Code and Paper
Primitive-based 3D Building Modeling, Sensor Simulation, and Estimation

Jan 16, 2019
Xia Li, Yen-Liang Lin, James Miller, Alex Cheon, Walt Dixon

As we begin to consider modeling large, realistic 3D building scenes, it becomes necessary to consider a more compact representation over the polygonal mesh model. Due to the large amounts of annotated training data, which is costly to obtain, we leverage synthetic data to train our system for the satellite image domain. By utilizing the synthetic data, we formulate the building decomposition as an application of instance segmentation and primitive fitting to decompose a building into a set of primitive shapes. Experimental results on WorldView-3 satellite image dataset demonstrate the effectiveness of our 3D building modeling approach.

  Click for Model/Code and Paper
Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning

Aug 10, 2018
Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, Dawei Yin

Recommender systems play a crucial role in mitigating the problem of information overload by suggesting users' personalized items or services. The vast majority of traditional recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during the interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users' feedback. Users' feedback can be positive and negative and both types of feedback have great potentials to boost recommendations. However, the number of negative feedback is much larger than that of positive one; thus incorporating them simultaneously is challenging since positive feedback could be buried by negative one. In this paper, we develop a novel approach to incorporate them into the proposed deep recommender system (DEERS) framework. The experimental results based on real-world e-commerce data demonstrate the effectiveness of the proposed framework. Further experiments have been conducted to understand the importance of both positive and negative feedback in recommendations.

* arXiv admin note: substantial text overlap with arXiv:1801.00209 

  Click for Model/Code and Paper
Efficient online learning for large-scale peptide identification

May 08, 2018
Xijun Liang, Zhonghang Xia, Yongxiang Wang, Ling Jian, Xinnan Niu, Andrew Link

Motivation: Post-database searching is a key procedure in peptide dentification with tandem mass spectrometry (MS/MS) strategies for refining peptide-spectrum matches (PSMs) generated by database search engines. Although many statistical and machine learning-based methods have been developed to improve the accuracy of peptide identification, the challenge remains on large-scale datasets and datasets with an extremely large proportion of false positives (hard datasets). A more efficient learning strategy is required for improving the performance of peptide identification on challenging datasets. Results: In this work, we present an online learning method to conquer the challenges remained for exiting peptide identification algorithms. We propose a cost-sensitive learning model by using different loss functions for decoy and target PSMs respectively. A larger penalty for wrongly selecting decoy PSMs than that for target PSMs, and thus the new model can reduce its false discovery rate on hard datasets. Also, we design an online learning algorithm, OLCS-Ranker, to solve the proposed learning model. Rather than taking all training data samples all at once, OLCS-Ranker iteratively feeds in only one training sample into the learning model at each round. As a result, the memory requirement is significantly reduced for large-scale problems. Experimental studies show that OLCS-Ranker outperforms benchmark methods, such as CRanker and Batch-CS-Ranker, in terms of accuracy and stability. Furthermore, OLCS-Ranker is 15--85 times faster than CRanker method on large datasets. Availability and implementation: OLCS-Ranker software is available at no charge for non-commercial use at

* 16 pages, 3 figures 

  Click for Model/Code and Paper
Judging Chemical Reaction Practicality From Positive Sample only Learning

Apr 22, 2019
Shu Jiang, Zhuosheng Zhang, Hai Zhao, Jiangtong Li, Yang Yang, Bao-Liang Lu, Ning Xia

Chemical reaction practicality is the core task among all symbol intelligence based chemical information processing, for example, it provides indispensable clue for further automatic synthesis route inference. Considering that chemical reactions have been represented in a language form, we propose a new solution to generally judge the practicality of organic reaction without considering complex quantum physical modeling or chemistry knowledge. While tackling the practicality judgment as a machine learning task from positive and negative (chemical reaction) samples, all existing studies have to carefully handle the serious insufficiency issue on the negative samples. We propose an auto-construction method to well solve the extensively existed long-term difficulty. Experimental results show our model can effectively predict the practicality of chemical reactions, which achieves a high accuracy of 99.76\% on real large-scale chemical lab reaction practicality judgment.

  Click for Model/Code and Paper
Scalable Global Alignment Graph Kernel Using Random Features: From Node Embedding to Graph Embedding

Nov 25, 2019
Lingfei Wu, Ian En-Hsu Yen, Zhen Zhang, Kun Xu, Liang Zhao, Xi Peng, Yinglong Xia, Charu Aggarwal

Graph kernels are widely used for measuring the similarity between graphs. Many existing graph kernels, which focus on local patterns within graphs rather than their global properties, suffer from significant structure information loss when representing graphs. Some recent global graph kernels, which utilizes the alignment of geometric node embeddings of graphs, yield state-of-the-art performance. However, these graph kernels are not necessarily positive-definite. More importantly, computing the graph kernel matrix will have at least quadratic {time} complexity in terms of the number and the size of the graphs. In this paper, we propose a new family of global alignment graph kernels, which take into account the global properties of graphs by using geometric node embeddings and an associated node transportation based on earth mover's distance. Compared to existing global kernels, the proposed kernel is positive-definite. Our graph kernel is obtained by defining a distribution over \emph{random graphs}, which can naturally yield random feature approximations. The random feature approximations lead to our graph embeddings, which is named as "random graph embeddings" (RGE). In particular, RGE is shown to achieve \emph{(quasi-)linear scalability} with respect to the number and the size of the graphs. The experimental results on nine benchmark datasets demonstrate that RGE outperforms or matches twelve state-of-the-art graph classification algorithms.

* KDD'19, Oral Paper, Data and Code link available in the paper 

  Click for Model/Code and Paper
Attention-Guided Lightweight Network for Real-Time Segmentation of Robotic Surgical Instruments

Oct 24, 2019
Zhen-Liang Ni, Gui-Bin Bian, Zeng-Guang Hou, Xiao-Hu Zhou, Xiao-Liang Xie, Zhen Li

Real-time segmentation of surgical instruments plays a crucial role in robot-assisted surgery. However, real-time segmentation of surgical instruments using current deep learning models is still a challenging task due to the high computational costs and slow inference speed. In this paper, an attention-guided lightweight network (LWANet), is proposed to segment surgical instruments in real-time. LWANet adopts the encoder-decoder architecture, where the encoder is the lightweight network MobileNetV2 and the decoder consists of depth-wise separable convolution, attention fusion block, and transposed convolution. Depth-wise separable convolution is used as the basic unit to construct the decoder, which can reduce the model size and computational costs. Attention fusion block captures global context and encodes semantic dependencies between channels to emphasize target regions, contributing to locating the surgical instrument. Transposed convolution is performed to upsample the feature map for acquiring refined edges. LWANet can segment surgical instruments in real-time, taking few computational costs. Based on 960*544 inputs, its inference speed can reach 39 fps with only 3.39 GFLOPs. Also, it has a small model size and the number of parameters is only 2.06 M. The proposed network is evaluated on two datasets. It achieves state-of-the-art performance 94.10% mean IOU on Cata7 and obtains a new record on EndoVis 2017 with 4.10% increase on mean mIOU.

  Click for Model/Code and Paper
RASNet: Segmentation for Tracking Surgical Instruments in Surgical Videos Using Refined Attention Segmentation Network

May 21, 2019
Zhen-Liang Ni, Gui-Bin Bian, Xiao-Liang Xie, Zeng-Guang Hou, Xiao-Hu Zhou, Yan-Jie Zhou

Segmentation for tracking surgical instruments plays an important role in robot-assisted surgery. Segmentation of surgical instruments contributes to capturing accurate spatial information for tracking. In this paper, a novel network, Refined Attention Segmentation Network, is proposed to simultaneously segment surgical instruments and identify their categories. The U-shape network which is popular in segmentation is used. Different from previous work, an attention module is adopted to help the network focus on key regions, which can improve the segmentation accuracy. To solve the class imbalance problem, the weighted sum of the cross entropy loss and the logarithm of the Jaccard index is used as loss function. Furthermore, transfer learning is adopted in our network. The encoder is pre-trained on ImageNet. The dataset from the MICCAI EndoVis Challenge 2017 is used to evaluate our network. Based on this dataset, our network achieves state-of-the-art performance 94.65% mean Dice and 90.33% mean IOU.

* This paper has been accepted by 2019 41st Annual International Conference of the IEEE Engineering in Medicine &Biology Society (EMBC) 

  Click for Model/Code and Paper
BARNet: Bilinear Attention Network with Adaptive Receptive Field for Surgical Instrument Segmentation

Jan 20, 2020
Zhen-Liang Ni, Gui-Bin Bian, Guan-An Wang, Xiao-Hu Zhou, Zeng-Guang Hou, Xiao-Liang Xie, Zhen Li, Yu-Han Wang

Surgical instrument segmentation is extremely important for computer-assisted surgery. Different from common object segmentation, it is more challenging due to the large illumination and scale variation caused by the special surgical scenes. In this paper, we propose a novel bilinear attention network with adaptive receptive field to solve these two challenges. For the illumination variation, the bilinear attention module can capture second-order statistics to encode global contexts and semantic dependencies between local pixels. With them, semantic features in challenging areas can be inferred from their neighbors and the distinction of various semantics can be boosted. For the scale variation, our adaptive receptive field module aggregates multi-scale features and automatically fuses them with different weights. Specifically, it encodes the semantic relationship between channels to emphasize feature maps with appropriate scales, changing the receptive field of subsequent convolutions. The proposed network achieves the best performance 97.47% mean IOU on Cata7 and comes first place on EndoVis 2017 by 10.10% IOU overtaking second-ranking method.

  Click for Model/Code and Paper
RAUNet: Residual Attention U-Net for Semantic Segmentation of Cataract Surgical Instruments

Oct 02, 2019
Zhen-Liang Ni, Gui-Bin Bian, Xiao-Hu Zhou, Zeng-Guang Hou, Xiao-Liang Xie, Chen Wang, Yan-Jie Zhou, Rui-Qi Li, Zhen Li

Semantic segmentation of surgical instruments plays a crucial role in robot-assisted surgery. However, accurate segmentation of cataract surgical instruments is still a challenge due to specular reflection and class imbalance issues. In this paper, an attention-guided network is proposed to segment the cataract surgical instrument. A new attention module is designed to learn discriminative features and address the specular reflection issue. It captures global context and encodes semantic dependencies to emphasize key semantic features, boosting the feature representation. This attention module has very few parameters, which helps to save memory. Thus, it can be flexibly plugged into other networks. Besides, a hybrid loss is introduced to train our network for addressing the class imbalance issue, which merges cross entropy and logarithms of Dice loss. A new dataset named Cata7 is constructed to evaluate our network. To the best of our knowledge, this is the first cataract surgical instrument dataset for semantic segmentation. Based on this dataset, RAUNet achieves state-of-the-art performance 97.71% mean Dice and 95.62% mean IOU.

* Accepted by the 26th International Conference on Neural Information Processing (ICONIP2019). arXiv admin note: cs.CV => eess.IV cs.CV 

  Click for Model/Code and Paper