Research papers and code for "Hao Li":
Traditional intelligent fault diagnosis of rolling bearings work well only under a common assumption that the labeled training data (source domain) and unlabeled testing data (target domain) are drawn from the same distribution. However, in many real-world applications, this assumption does not hold, especially when the working condition varies. In this paper, a new adversarial adaptive 1-D CNN called A2CNN is proposed to address this problem. A2CNN consists of four parts, namely, a source feature extractor, a target feature extractor, a label classifier and a domain discriminator. The layers between the source and target feature extractor are partially untied during the training stage to take both training efficiency and domain adaptation into consideration. Experiments show that A2CNN has strong fault-discriminative and domain-invariant capacity, and therefore can achieve high accuracy under different working conditions. We also visualize the learned features and the networks to explore the reasons behind the high performance of our proposed model.

Click to Read Paper and Get Code
Biologically inspired model (BIM) for image recognition is a robust computational architecture, which has attracted widespread attention. BIM can be described as a four-layer structure based on the mechanisms of the visual cortex. Although the performance of BIM for image recognition is robust, it takes the randomly selected ways for the patch selection, which is sightless, and results in heavy computing burden. To address this issue, we propose a novel patch selection method with oriented Gaussian-Hermite moment (PSGHM), and we enhanced the BIM based on the proposed PSGHM, named as PBIM. In contrast to the conventional BIM which adopts the random method to select patches within the feature representation layers processed by multi-scale Gabor filter banks, the proposed PBIM takes the PSGHM way to extract a small number of representation features while offering promising distinctiveness. To show the effectiveness of the proposed PBIM, experimental studies on object categorization are conducted on the CalTech05, TU Darmstadt (TUD), and GRAZ01 databases. Experimental results demonstrate that the performance of PBIM is a significant improvement on that of the conventional BIM.

Click to Read Paper and Get Code
Multi-person pose estimation is fundamental to many computer vision tasks and has made significant progress in recent years. However, few previous methods explored the problem of pose estimation in crowded scenes while it remains challenging and inevitable in many scenarios. Moreover, current benchmarks cannot provide an appropriate evaluation for such cases. In this paper, we propose a novel and efficient method to tackle the problem of pose estimation in the crowd and a new dataset to better evaluate algorithms. Our model consists of two key components: joint-candidate single person pose estimation (SPPE) and global maximum joints association. With multi-peak prediction for each joint and global association using graph model, our method is robust to inevitable interference in crowded scenes and very efficient in inference. The proposed method surpasses the state-of-the-art methods on CrowdPose dataset by 4.8 mAP and results on MSCOCO dataset demonstrate the generalization ability of our method. Source code and dataset will be made publicly available.

Click to Read Paper and Get Code
This paper presents design principles for comfort-centered wearable robots and their application in a lightweight and backdrivable knee exoskeleton. The mitigation of discomfort is treated as mechanical design and control issues and three solutions are proposed in this paper: 1) a new wearable structure optimizes the strap attachment configuration and suit layout to ameliorate excessive shear forces of conventional wearable structure design; 2) rolling knee joint and double-hinge mechanisms reduce the misalignment in the sagittal and frontal plane, without increasing the mechanical complexity and inertia, respectively; 3) a low impedance mechanical transmission reduces the reflected inertia and damping of the actuator to human, thus the exoskeleton is highly-backdrivable. Kinematic simulations demonstrate that misalignment between the robot joint and knee joint can be reduced by 74% at maximum knee flexion. In experiments, the exoskeleton in the unpowered mode exhibits 1.03 Nm root mean square (RMS) low resistive torque. The torque control experiments demonstrate 0.31 Nm RMS torque tracking error in three human subjects.

* 8 pages, 16figures, Journal
Click to Read Paper and Get Code
The "digital Michelangelo project" was a seminal computer vision project in the early 2000's that pushed the capabilities of acquisition systems and involved multiple people from diverse fields, many of whom are now leaders in industry and academia. Reviewing this project with modern eyes provides us with the opportunity to reflect on several issues, relevant now as then to the field of computer vision and research in general, that go beyond the technical aspects of the work. This article was written in the context of a reading group competition at the week-long International Computer Vision Summer School 2017 (ICVSS) on Sicily, Italy. To deepen the participants understanding of computer vision and to foster a sense of community, various reading groups were tasked to highlight important lessons which may be learned from provided literature, going beyond the contents of the paper. This report is the winning entry of this guided discourse (Fig. 1). The authors closely examined the origins, fruits and most importantly lessons about research in general which may be distilled from the "digital Michelangelo project". Discussions leading to this report were held within the group as well as with Hao Li, the group mentor.

* 5 pages. 3 figures
Click to Read Paper and Get Code
Segmentation of colorectal cancerous regions from Magnetic Resonance (MR) images is a crucial procedure for radiotherapy which conventionally requires accurate delineation of tumour boundaries at an expense of labor, time and reproducibility. To address this important yet challenging task within the framework of performance-leading deep learning methods, regions of interest (RoIs) localization from large whole volume 3D images serves as a preceding operation that brings about multiple benefits in terms of speed, target completeness and reduction of false positives. Distinct from sliding window or discrete localization-segmentation based models, we propose a novel multi-task framework referred to as 3D RoI-aware U-Net (3D RU-Net), for RoI localization and intra-RoI segmentation where the two tasks share one backbone encoder network. With the region proposals from the encoder, we crop multi-level feature maps from the backbone network to form a GPU memory-efficient decoder for detail-preserving intra-RoI segmentation. To effectively train the model, we designed a Dice formulated loss function for the global-to-local multi-task learning procedure. Based on the promising efficiency gains demonstrated by the proposed method, we went on to ensemble multiple models to achieve even higher performance costing minor extra computational expensiveness. Extensive experiments were subsequently conducted on 64 cancerous cases with a four-fold cross-validation, and the results showed significant superiority in terms of accuracy and efficiency over conventional state-of-the art frameworks. In conclusion, the proposed method has a huge potential for extension to other 3D object segmentation tasks from medical images due to its inherent generalizability. The code for the proposed method is publicly available.

Click to Read Paper and Get Code
Targeted sentiment analysis is the task of jointly predicting target entities and their associated sentiment information. Existing research efforts mostly regard this joint task as a sequence labeling problem, building models that can capture explicit structures in the output space. However, the importance of capturing implicit global structural information that resides in the input space is largely unexplored. In this work, we argue that both types of information (implicit and explicit structural information) are crucial for building a successful targeted sentiment analysis model. Our experimental results show that properly capturing both information is able to lead to better performance than competitive existing approaches. We also conduct extensive experiments to investigate our model's effectiveness and robustness.

Click to Read Paper and Get Code
Existing methods for AI-generated artworks still struggle with generating high-quality stylized content, where high-level semantics are preserved, or separating fine-grained styles from various artists. We propose a novel Generative Adversarial Disentanglement Network which can fully decompose complex anime illustrations into style and content. Training such model is challenging, since given a style, various content data may exist but not the other way round. In particular, we disentangle two complementary factors of variations, where one of the factors is labelled. Our approach is divided into two stages, one that encodes an input image into a style independent content, and one based on a dual-conditional generator. We demonstrate the ability to generate high-fidelity anime portraits with a fixed content and a large variety of styles from over a thousand artists, and vice versa, using a single end-to-end network and with applications in style transfer. We show this unique capability as well as superior output to the current state-of-the-art.

* Submitted to NeurIPS 2019
Click to Read Paper and Get Code
Deep learning is formulated as a discrete-time optimal control problem. This allows one to characterize necessary conditions for optimality and develop training algorithms that do not rely on gradients with respect to the trainable parameters. In particular, we introduce the discrete-time method of successive approximations (MSA), which is based on the Pontryagin's maximum principle, for training neural networks. A rigorous error estimate for the discrete MSA is obtained, which sheds light on its dynamics and the means to stabilize the algorithm. The developed methods are applied to train, in a rather principled way, neural networks with weights that are constrained to take values in a discrete set. We obtain competitive performance and interestingly, very sparse weights in the case of ternary networks, which may be useful in model deployment in low-memory devices.

Click to Read Paper and Get Code
Deep learning-based style transfer between images has recently become a popular area of research. A common way of encoding "style" is through a feature representation based on the Gram matrix of features extracted by some pre-trained neural network or some other form of feature statistics. Such a definition is based on an arbitrary human decision and may not best capture what a style really is. In trying to gain a better understanding of "style", we propose a metric learning-based method to explicitly encode the style of an artwork. In particular, our definition of style captures the differences between artists, as shown by classification performances, and such that the style representation can be interpreted, manipulated and visualized through style-conditioned image generation through a Generative Adversarial Network. We employ this method to explore the style space of anime portrait illustrations.

Click to Read Paper and Get Code
Generative adversarial networks (GANs) are highly effective unsupervised learning frameworks that can generate very sharp data, even for data such as images with complex, highly multimodal distributions. However GANs are known to be very hard to train, suffering from problems such as mode collapse and disturbing visual artifacts. Batch normalization (BN) techniques have been introduced to address the training. Though BN accelerates the training in the beginning, our experiments show that the use of BN can be unstable and negatively impact the quality of the trained model. The evaluation of BN and numerous other recent schemes for improving GAN training is hindered by the lack of an effective objective quality measure for GAN models. To address these issues, we first introduce a weight normalization (WN) approach for GAN training that significantly improves the stability, efficiency and the quality of the generated samples. To allow a methodical evaluation, we introduce squared Euclidean reconstruction error on a test set as a new objective measure, to assess training performance in terms of speed, stability, and quality of generated samples. Our experiments with a standard DCGAN architecture on commonly used datasets (CelebA, LSUN bedroom, and CIFAR-10) indicate that training using WN is generally superior to BN for GANs, achieving 10% lower mean squared loss for reconstruction and significantly better qualitative results than BN. We further demonstrate the stability of WN on a 21-layer ResNet trained with the CelebA data set. The code for this paper is available at https://github.com/stormraiser/gan-weightnorm-resnet

* v3 rejected by NIPS 2017, updated and re-submitted to CVPR 2018. v4: add experiments with ResNet and like to new code
Click to Read Paper and Get Code
Measuring the performance of solar energy and heat transfer systems requires a lot of time, economic cost and manpower. Meanwhile, directly predicting their performance is challenging due to the complicated internal structures. Fortunately, a knowledge-based machine learning method can provide a promising prediction and optimization strategy for the performance of energy systems. In this Chapter, the authors will show how they utilize the machine learning models trained from a large experimental database to perform precise prediction and optimization on a solar water heater (SWH) system. A new energy system optimization strategy based on a high-throughput screening (HTS) process is proposed. This Chapter consists of: i) Comparative studies on varieties of machine learning models (artificial neural networks (ANNs), support vector machine (SVM) and extreme learning machine (ELM)) to predict the performances of SWHs; ii) Development of an ANN-based software to assist the quick prediction and iii) Introduction of a computational HTS method to design a high-performance SWH system.

* 20 pages
Click to Read Paper and Get Code
Most video surveillance systems use both RGB and infrared cameras, making it a vital technique to re-identify a person cross the RGB and infrared modalities. This task can be challenging due to both the cross-modality variations caused by heterogeneous images in RGB and infrared, and the intra-modality variations caused by the heterogeneous human poses, camera views, light brightness, etc. To meet these challenges a novel feature learning framework, HPILN, is proposed. In the framework existing single-modality re-identification models are modified to fit for the cross-modality scenario, following which specifically designed hard pentaplet loss and identity loss are used to improve the performance of the modified cross-modality re-identification models. Based on the benchmark of the SYSU-MM01 dataset, extensive experiments have been conducted, which show that the proposed method outperforms all existing methods in terms of Cumulative Match Characteristic curve (CMC) and Mean Average Precision (MAP).

Click to Read Paper and Get Code
We introduce the concept of unconstrained real-time 3D facial performance capture through explicit semantic segmentation in the RGB input. To ensure robustness, cutting edge supervised learning approaches rely on large training datasets of face images captured in the wild. While impressive tracking quality has been demonstrated for faces that are largely visible, any occlusion due to hair, accessories, or hand-to-face gestures would result in significant visual artifacts and loss of tracking accuracy. The modeling of occlusions has been mostly avoided due to its immense space of appearance variability. To address this curse of high dimensionality, we perform tracking in unconstrained images assuming non-face regions can be fully masked out. Along with recent breakthroughs in deep learning, we demonstrate that pixel-level facial segmentation is possible in real-time by repurposing convolutional neural networks designed originally for general semantic segmentation. We develop an efficient architecture based on a two-stream deconvolution network with complementary characteristics, and introduce carefully designed training samples and data augmentation strategies for improved segmentation accuracy and robustness. We adopt a state-of-the-art regression-based facial tracking framework with segmented face images as training, and demonstrate accurate and uninterrupted facial performance capture in the presence of extreme occlusion and even side views. Furthermore, the resulting segmentation can be directly used to composite partial 3D face models on the input images and enable seamless facial manipulation tasks, such as virtual make-up or face replacement.

Click to Read Paper and Get Code
As wireless networks evolve towards high mobility and providing better support for connected vehicles, a number of new challenges arise due to the resulting high dynamics in vehicular environments and thus motive rethinking of traditional wireless design methodologies. Future intelligent vehicles, which are at the heart of high mobility networks, are increasingly equipped with multiple advanced onboard sensors and keep generating large volumes of data. Machine learning, as an effective approach to artificial intelligence, can provide a rich set of tools to exploit such data for the benefit of the networks. In this article, we first identify the distinctive characteristics of high mobility vehicular networks and motivate the use of machine learning to address the resulting challenges. After a brief introduction of the major concepts of machine learning, we discuss its applications to learn the dynamics of vehicular networks and make informed decisions to optimize network performance. In particular, we discuss in greater detail the application of reinforcement learning in managing network resources as an alternative to the prevalent optimization approach. Finally, some open issues worth further investigation are highlighted.

* Accepted by IEEE Internet of Things Journal
Click to Read Paper and Get Code
Repeated game has long been the touchstone model for agents' long-run relationships. Previous results suggest that it is particularly difficult for a repeated game player to exert an autocratic control on the payoffs since they are jointly determined by all participants. This work discovers that the scale of a player's capability to unilaterally influence the payoffs may have been much underestimated. Under the conventional iterated prisoner's dilemma, we develop a general framework for controlling the feasible region where the players' payoff pairs lie. A control strategy player is able to confine the payoff pairs in her objective region, as long as this region has feasible linear boundaries. With this framework, many well-known existing strategies can be categorized and various new strategies with nice properties can be further identified. We show that the control strategies perform well either in a tournament or against a human-like opponent.

* Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. Main track. Pages 296-302. 2018
Click to Read Paper and Get Code
Supervised speech separation uses supervised learning algorithms to learn a mapping from an input noisy signal to an output target. With the fast development of deep learning, supervised separation has become the most important direction in speech separation area in recent years. For the supervised algorithm, training target has a significant impact on the performance. Ideal ratio mask is a commonly used training target, which can improve the speech intelligibility and quality of the separated speech. However, it does not take into account the correlation between noise and clean speech. In this paper, we use the optimal ratio mask as the training target of the deep neural network (DNN) for speech separation. The experiments are carried out under various noise environments and signal to noise ratio (SNR) conditions. The results show that the optimal ratio mask outperforms other training targets in general.

Click to Read Paper and Get Code
A new kind of geometric invariants is proposed in this paper, which is called affine weighted moment invariant (AWMI). By combination of local affine differential invariants and a framework of global integral, they can more effectively extract features of images and help to increase the number of low-order invariants and to decrease the calculating cost. The experimental results show that AWMIs have good stability and distinguishability and achieve better results in image retrieval than traditional moment invariants. An extension to 3D is straightforward.

Click to Read Paper and Get Code
We propose the general construction formula of shape-color primitives by using partial differentials of each color channel in this paper. By using all kinds of shape-color primitives, shape-color differential moment invariants can be constructed very easily, which are invariant to the shape affine and color affine transforms. 50 instances of SCDMIs are obtained finally. In experiments, several commonly used color descriptors and SCDMIs are used in image classification and retrieval of color images, respectively. By comparing the experimental results, we find that SCDMIs get better results.

* 13 pages, 4 figures
Click to Read Paper and Get Code
We present a novel Affine-Gradient based Local Binary Pattern (AGLBP) descriptor for texture classification. It is very hard to describe complicated texture using single type information, such as Local Binary Pattern (LBP), which just utilizes the sign information of the difference between the pixel and its local neighbors. Our descriptor has three characteristics: 1) In order to make full use of the information contained in the texture, the Affine-Gradient, which is different from Euclidean-Gradient and invariant to affine transformation is incorporated into AGLBP. 2) An improved method is proposed for rotation invariance, which depends on the reference direction calculating respect to local neighbors. 3) Feature selection method, considering both the statistical frequency and the intraclass variance of the training dataset, is also applied to reduce the dimensionality of descriptors. Experiments on three standard texture datasets, Outex12, Outex10 and KTH-TIPS2, are conducted to evaluate the performance of AGLBP. The results show that our proposed descriptor gets better performance comparing to some state-of-the-art rotation texture descriptors in texture classification.

* 11 pages,4 pages
Click to Read Paper and Get Code