Models, code, and papers for "Hao Li":
Traditional intelligent fault diagnosis methods for rolling bearings work well only under the common assumption that the labeled training data (source domain) and unlabeled testing data (target domain) are drawn from the same distribution. However, in many real-world applications, this assumption does not hold, especially when the working condition varies. In this paper, a new adversarial adaptive 1-D CNN called A2CNN is proposed to address this problem. A2CNN consists of four parts, namely, a source feature extractor, a target feature extractor, a label classifier and a domain discriminator. The layers between the source and target feature extractors are partially untied during the training stage to take both training efficiency and domain adaptation into consideration. Experiments show that A2CNN has strong fault-discriminative and domain-invariant capacity, and therefore can achieve high accuracy under different working conditions. We also visualize the learned features and the networks to explore the reasons behind the high performance of our proposed model.
The biologically inspired model (BIM) for image recognition is a robust computational architecture that has attracted widespread attention. BIM can be described as a four-layer structure based on the mechanisms of the visual cortex. Although the performance of BIM for image recognition is robust, it selects patches at random, which is blind to image content and results in a heavy computing burden. To address this issue, we propose a novel patch selection method with oriented Gaussian-Hermite moments (PSGHM), and we enhance the BIM with the proposed PSGHM, naming the result PBIM. In contrast to the conventional BIM, which randomly selects patches within the feature representation layers processed by multi-scale Gabor filter banks, the proposed PBIM uses PSGHM to extract a small number of representation features while offering promising distinctiveness. To show the effectiveness of the proposed PBIM, experimental studies on object categorization are conducted on the CalTech05, TU Darmstadt (TUD), and GRAZ01 databases. Experimental results demonstrate that PBIM significantly improves on the performance of the conventional BIM.
Multi-person pose estimation is fundamental to many computer vision tasks and has made significant progress in recent years. However, few previous methods have explored the problem of pose estimation in crowded scenes, even though it remains challenging and unavoidable in many scenarios. Moreover, current benchmarks cannot provide an appropriate evaluation for such cases. In this paper, we propose a novel and efficient method to tackle the problem of pose estimation in the crowd and a new dataset to better evaluate algorithms. Our model consists of two key components: joint-candidate single person pose estimation (SPPE) and global maximum joints association. With multi-peak prediction for each joint and global association using a graph model, our method is robust to the inevitable interference in crowded scenes and very efficient in inference. The proposed method surpasses the state-of-the-art methods on the CrowdPose dataset by 4.8 mAP, and results on the MSCOCO dataset demonstrate the generalization ability of our method. Source code and dataset will be made publicly available.
This paper presents design principles for comfort-centered wearable robots and their application in a lightweight and backdrivable knee exoskeleton. The mitigation of discomfort is treated as a mechanical design and control issue, and three solutions are proposed in this paper: 1) a new wearable structure optimizes the strap attachment configuration and suit layout to ameliorate the excessive shear forces of conventional wearable structure designs; 2) rolling knee joint and double-hinge mechanisms reduce the misalignment in the sagittal and frontal planes, without increasing the mechanical complexity and inertia, respectively; 3) a low-impedance mechanical transmission reduces the reflected inertia and damping of the actuator to the human, making the exoskeleton highly backdrivable. Kinematic simulations demonstrate that misalignment between the robot joint and knee joint can be reduced by 74% at maximum knee flexion. In experiments, the exoskeleton in the unpowered mode exhibits a low resistive torque of 1.03 Nm root mean square (RMS). The torque control experiments demonstrate a torque tracking error of 0.31 Nm RMS in three human subjects.
The "digital Michelangelo project" was a seminal computer vision project in the early 2000s that pushed the capabilities of acquisition systems and involved multiple people from diverse fields, many of whom are now leaders in industry and academia. Reviewing this project with modern eyes provides us with the opportunity to reflect on several issues, relevant now as then to the field of computer vision and research in general, that go beyond the technical aspects of the work. This article was written in the context of a reading group competition at the week-long International Computer Vision Summer School 2017 (ICVSS) in Sicily, Italy. To deepen the participants' understanding of computer vision and to foster a sense of community, various reading groups were tasked to highlight important lessons which may be learned from provided literature, going beyond the contents of the paper. This report is the winning entry of this guided discourse (Fig. 1). The authors closely examined the origins, fruits and, most importantly, lessons about research in general which may be distilled from the "digital Michelangelo project". Discussions leading to this report were held within the group as well as with Hao Li, the group mentor.
Segmentation of colorectal cancerous regions from Magnetic Resonance (MR) images is a crucial procedure for radiotherapy, which conventionally requires accurate delineation of tumour boundaries at the expense of labor, time and reproducibility. To address this important yet challenging task within the framework of performance-leading deep learning methods, region of interest (RoI) localization from large whole-volume 3D images serves as a preceding operation that brings multiple benefits in terms of speed, target completeness and reduction of false positives. Distinct from sliding window or discrete localization-segmentation based models, we propose a novel multi-task framework referred to as 3D RoI-aware U-Net (3D RU-Net), for RoI localization and intra-RoI segmentation, where the two tasks share one backbone encoder network. With the region proposals from the encoder, we crop multi-level feature maps from the backbone network to form a GPU memory-efficient decoder for detail-preserving intra-RoI segmentation. To effectively train the model, we designed a Dice-formulated loss function for the global-to-local multi-task learning procedure. Based on the promising efficiency gains demonstrated by the proposed method, we went on to ensemble multiple models to achieve even higher performance at minor extra computational cost. Extensive experiments were subsequently conducted on 64 cancerous cases with four-fold cross-validation, and the results showed significant superiority in terms of accuracy and efficiency over conventional state-of-the-art frameworks. In conclusion, the proposed method has great potential for extension to other 3D object segmentation tasks from medical images due to its inherent generalizability. The code for the proposed method is publicly available.
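The abstract above mentions a Dice-formulated loss without giving its form. As a minimal sketch, assuming the standard soft Dice coefficient 2|A∩B| / (|A| + |B|) over flattened voxel probabilities, the loss could look like the following. The function names and the smoothing term `eps` are illustrative assumptions, and the paper's global-to-local multi-task weighting is not reproduced here.

```python
def dice_coefficient(pred, target, eps=1e-6):
    """Soft Dice overlap between two equally sized flat lists with values in [0, 1]."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    # Soft intersection: sum of element-wise products.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # eps (an assumed smoothing term) avoids division by zero on empty masks.
    return (2.0 * intersection + eps) / (total + eps)


def dice_loss(pred, target, eps=1e-6):
    """Loss that is 0 for perfect overlap and approaches 1 for no overlap."""
    return 1.0 - dice_coefficient(pred, target, eps)


if __name__ == "__main__":
    predicted = [1.0, 1.0, 0.0, 0.0]   # hypothetical flattened prediction
    reference = [1.0, 0.0, 0.0, 0.0]   # hypothetical flattened ground truth
    print(dice_coefficient(predicted, reference))
    print(dice_loss(predicted, reference))
```

Because the coefficient is normalized by the total foreground mass, it is less sensitive to the foreground/background imbalance that is typical of small tumour regions in large volumes.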
As the first step in automated natural language processing, representing words and sentences is of central importance and has attracted significant research attention. Different approaches, from the early one-hot and bag-of-words representations to more recent distributional dense and sparse representations, have been proposed. Despite the successful results that have been achieved, such vectors tend to consist of uninterpretable components and pose nontrivial challenges in both memory and computational requirements in practical applications. In this paper, we design a novel representation model that projects dense word vectors into a higher dimensional space and favors a highly sparse and binary representation of word vectors with potentially interpretable components, while trying to maintain pairwise inner products between the original vectors as much as possible. Computationally, our model is relaxed as a symmetric non-negative matrix factorization problem which admits a fast yet effective solution. In a series of empirical evaluations, the proposed model exhibited consistent improvement and high potential in practical applications.
The goal of this work is to present a systematic solution for RGB-D salient object detection, which addresses the following three aspects with a unified framework: modal-specific representation learning, complementary cue selection and cross-modal complement fusion. To learn discriminative modal-specific features, we propose a hierarchical cross-modal distillation scheme, in which the well-learned source modality provides supervisory signals to facilitate the learning process for the new modality. To better extract the complementary cues, we formulate a residual function to incorporate complements from the paired modality adaptively. Furthermore, a top-down fusion structure is constructed for sufficient cross-modal interactions and cross-level transmissions. The experimental results demonstrate the effectiveness of the proposed cross-modal distillation scheme in zero-shot saliency detection and pre-training on a new modality, as well as the advantages in selecting and fusing cross-modal/cross-level complements.
Targeted sentiment analysis is the task of jointly predicting target entities and their associated sentiment information. Existing research efforts mostly regard this joint task as a sequence labeling problem, building models that can capture explicit structures in the output space. However, the importance of capturing implicit global structural information that resides in the input space is largely unexplored. In this work, we argue that both types of information (implicit and explicit structural information) are crucial for building a successful targeted sentiment analysis model. Our experimental results show that properly capturing both types of information leads to better performance than competitive existing approaches. We also conduct extensive experiments to investigate our model's effectiveness and robustness.
Existing methods for AI-generated artworks still struggle with generating high-quality stylized content, where high-level semantics are preserved, or separating fine-grained styles from various artists. We propose a novel Generative Adversarial Disentanglement Network which can fully decompose complex anime illustrations into style and content. Training such a model is challenging, since given a style, various content data may exist but not the other way round. In particular, we disentangle two complementary factors of variation, where one of the factors is labelled. Our approach is divided into two stages, one that encodes an input image into a style-independent content, and one based on a dual-conditional generator. We demonstrate the ability to generate high-fidelity anime portraits with a fixed content and a large variety of styles from over a thousand artists, and vice versa, using a single end-to-end network and with applications in style transfer. We show this unique capability as well as superior output compared to the current state of the art.
Deep learning is formulated as a discrete-time optimal control problem. This allows one to characterize necessary conditions for optimality and develop training algorithms that do not rely on gradients with respect to the trainable parameters. In particular, we introduce the discrete-time method of successive approximations (MSA), which is based on Pontryagin's maximum principle, for training neural networks. A rigorous error estimate for the discrete MSA is obtained, which sheds light on its dynamics and the means to stabilize the algorithm. The developed methods are applied to train, in a rather principled way, neural networks with weights that are constrained to take values in a discrete set. We obtain competitive performance and, interestingly, very sparse weights in the case of ternary networks, which may be useful for model deployment on low-memory devices.
Deep learning-based style transfer between images has recently become a popular area of research. A common way of encoding "style" is through a feature representation based on the Gram matrix of features extracted by some pre-trained neural network or some other form of feature statistics. Such a definition is based on an arbitrary human decision and may not best capture what a style really is. In trying to gain a better understanding of "style", we propose a metric learning-based method to explicitly encode the style of an artwork. In particular, our definition of style captures the differences between artists, as shown by classification performance, and the resulting style representation can be interpreted, manipulated and visualized through style-conditioned image generation with a Generative Adversarial Network. We employ this method to explore the style space of anime portrait illustrations.
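For readers unfamiliar with the Gram-matrix style encoding that this abstract argues against, a minimal sketch follows. It assumes a feature map given as a list of channels, each a flat list of activations over spatial positions, and it normalizes by the number of positions (a common but not universal choice). The function name and input layout are illustrative assumptions, and this is the baseline encoding, not the metric learning method the paper proposes.

```python
def gram_matrix(features):
    """Channel-by-channel correlation of a feature map.

    features: list of C channels, each a list of N activations
    (one per spatial position). Returns a C x C nested list.
    """
    n_positions = len(features[0])
    if any(len(channel) != n_positions for channel in features):
        raise ValueError("all channels must cover the same positions")
    # Entry (i, j) is the inner product of channels i and j,
    # averaged over spatial positions (assumed normalization).
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) / n_positions for ch_j in features]
        for ch_i in features
    ]


if __name__ == "__main__":
    # Hypothetical 2-channel feature map with 2 spatial positions.
    feature_map = [[1.0, 2.0], [3.0, 4.0]]
    for row in gram_matrix(feature_map):
        print(row)
```

The matrix discards where activations occur and keeps only which channels fire together, which is why it is often treated as a stand-in for texture or style.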
Generative adversarial networks (GANs) are highly effective unsupervised learning frameworks that can generate very sharp data, even for data such as images with complex, highly multimodal distributions. However, GANs are known to be very hard to train, suffering from problems such as mode collapse and disturbing visual artifacts. Batch normalization (BN) techniques have been introduced to address these training difficulties. Though BN accelerates training in the beginning, our experiments show that the use of BN can be unstable and negatively impact the quality of the trained model. The evaluation of BN and numerous other recent schemes for improving GAN training is hindered by the lack of an effective objective quality measure for GAN models. To address these issues, we first introduce a weight normalization (WN) approach for GAN training that significantly improves the stability, efficiency and the quality of the generated samples. To allow a methodical evaluation, we introduce squared Euclidean reconstruction error on a test set as a new objective measure, to assess training performance in terms of speed, stability, and quality of generated samples. Our experiments with a standard DCGAN architecture on commonly used datasets (CelebA, LSUN bedroom, and CIFAR-10) indicate that training using WN is generally superior to BN for GANs, achieving 10% lower mean squared loss for reconstruction and significantly better qualitative results than BN. We further demonstrate the stability of WN on a 21-layer ResNet trained with the CelebA data set. The code for this paper is available at https://github.com/stormraiser/gan-weightnorm-resnet
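The two ingredients named in this abstract can be illustrated with a short sketch. It assumes the widely used weight normalization reparameterization w = g · v / ||v|| for a single weight vector, and a plain sum of squared differences as the reconstruction error. The exact WN variant and evaluation protocol used in the paper may differ; the linked repository is the authoritative source, and the function names here are illustrative.

```python
import math


def weight_norm(v, g):
    """Reparameterize a weight vector as w = g * v / ||v||.

    v: direction parameters (list of floats), g: scalar scale.
    The norm of the returned vector is |g| regardless of the scale of v.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction vector v must be non-zero")
    return [g * x / norm for x in v]


def squared_euclidean_error(sample, reconstruction):
    """Squared Euclidean distance between a test sample and its reconstruction."""
    if len(sample) != len(reconstruction):
        raise ValueError("sample and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(sample, reconstruction))


if __name__ == "__main__":
    w = weight_norm([3.0, 4.0], 2.0)
    print(w)                                   # direction of v, length 2
    print(squared_euclidean_error([1.0, 2.0], [0.0, 4.0]))
```

Decoupling the length of the weight vector from its direction is what removes the dependence on batch statistics that BN introduces.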
Measuring the performance of solar energy and heat transfer systems requires a lot of time, economic cost and manpower. Meanwhile, directly predicting their performance is challenging due to the complicated internal structures. Fortunately, a knowledge-based machine learning method can provide a promising prediction and optimization strategy for the performance of energy systems. In this Chapter, the authors will show how they utilize the machine learning models trained from a large experimental database to perform precise prediction and optimization on a solar water heater (SWH) system. A new energy system optimization strategy based on a high-throughput screening (HTS) process is proposed. This Chapter consists of: i) Comparative studies on varieties of machine learning models (artificial neural networks (ANNs), support vector machine (SVM) and extreme learning machine (ELM)) to predict the performances of SWHs; ii) Development of an ANN-based software to assist the quick prediction and iii) Introduction of a computational HTS method to design a high-performance SWH system.
Most video surveillance systems use both RGB and infrared cameras, making it vital to be able to re-identify a person across the RGB and infrared modalities. This task can be challenging due to both the cross-modality variations caused by heterogeneous images in RGB and infrared, and the intra-modality variations caused by heterogeneous human poses, camera views, light brightness, etc. To meet these challenges, a novel feature learning framework, HPILN, is proposed. In the framework, existing single-modality re-identification models are modified to fit the cross-modality scenario, after which a specifically designed hard pentaplet loss and an identity loss are used to improve the performance of the modified cross-modality re-identification models. Extensive experiments on the SYSU-MM01 benchmark dataset show that the proposed method outperforms all existing methods in terms of the Cumulative Match Characteristic curve (CMC) and Mean Average Precision (MAP).
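The two metrics cited in this abstract can be sketched as follows, assuming each query comes with a gallery list already sorted from most to least similar. The function names and the input layout are illustrative assumptions, and dataset-specific protocol details (such as which gallery cameras are excluded on SYSU-MM01) are not modeled.

```python
def cmc_hit(ranked_ids, query_id, k):
    """1.0 if a correct identity appears in the top-k gallery results, else 0.0."""
    return 1.0 if query_id in ranked_ids[:k] else 0.0


def average_precision(ranked_ids, query_id):
    """Mean of the precision values measured at each correct match."""
    hits = 0
    precisions = []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    # A query with no correct match in the gallery contributes 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def evaluate(queries, k):
    """queries: list of (query_id, ranked_gallery_ids). Returns (CMC@k, mAP)."""
    n = len(queries)
    cmc = sum(cmc_hit(ranked, qid, k) for qid, ranked in queries) / n
    m_ap = sum(average_precision(ranked, qid) for qid, ranked in queries) / n
    return cmc, m_ap


if __name__ == "__main__":
    # Hypothetical ranked gallery identities for two queries.
    toy_queries = [
        ("a", ["a", "b", "a"]),
        ("b", ["c", "b", "a"]),
    ]
    print(evaluate(toy_queries, k=1))
    print(evaluate(toy_queries, k=2))
```

CMC only asks whether the first correct match appears early, while mAP rewards ranking all correct matches highly, so the two are usually reported together.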
We introduce the concept of unconstrained real-time 3D facial performance capture through explicit semantic segmentation in the RGB input. To ensure robustness, cutting edge supervised learning approaches rely on large training datasets of face images captured in the wild. While impressive tracking quality has been demonstrated for faces that are largely visible, any occlusion due to hair, accessories, or hand-to-face gestures would result in significant visual artifacts and loss of tracking accuracy. The modeling of occlusions has been mostly avoided due to its immense space of appearance variability. To address this curse of high dimensionality, we perform tracking in unconstrained images assuming non-face regions can be fully masked out. Along with recent breakthroughs in deep learning, we demonstrate that pixel-level facial segmentation is possible in real-time by repurposing convolutional neural networks designed originally for general semantic segmentation. We develop an efficient architecture based on a two-stream deconvolution network with complementary characteristics, and introduce carefully designed training samples and data augmentation strategies for improved segmentation accuracy and robustness. We adopt a state-of-the-art regression-based facial tracking framework with segmented face images as training, and demonstrate accurate and uninterrupted facial performance capture in the presence of extreme occlusion and even side views. Furthermore, the resulting segmentation can be directly used to composite partial 3D face models on the input images and enable seamless facial manipulation tasks, such as virtual make-up or face replacement.
With the tremendous success of deep learning in visual tasks, the representations extracted from intermediate layers of learned models, that is, deep features, have attracted much attention from researchers. Previous analysis shows that those features include appropriate semantic information. By training the deep models on a large-scale benchmark data set (e.g., ImageNet), the features can work well on other tasks. In this work, we investigate this phenomenon and demonstrate that deep features can fail because they are learned by minimizing empirical risk. When the distribution of data is different from that of the benchmark data set, the performance of deep features can degrade. Hence, we propose a hierarchically robust optimization to learn more generic features. Considering the example-level and concept-level robustness simultaneously, we formulate the problem as a distributionally robust optimization problem with Wasserstein ambiguity set constraints. An efficient algorithm compatible with the conventional training pipeline is proposed. Experiments on benchmark data sets confirm our claim and demonstrate the effectiveness of the robust deep representations.
As wireless networks evolve towards high mobility and better support for connected vehicles, a number of new challenges arise due to the resulting high dynamics in vehicular environments, motivating a rethinking of traditional wireless design methodologies. Future intelligent vehicles, which are at the heart of high mobility networks, are increasingly equipped with multiple advanced onboard sensors and keep generating large volumes of data. Machine learning, as an effective approach to artificial intelligence, can provide a rich set of tools to exploit such data for the benefit of the networks. In this article, we first identify the distinctive characteristics of high mobility vehicular networks and motivate the use of machine learning to address the resulting challenges. After a brief introduction of the major concepts of machine learning, we discuss its applications to learn the dynamics of vehicular networks and make informed decisions to optimize network performance. In particular, we discuss in greater detail the application of reinforcement learning in managing network resources as an alternative to the prevalent optimization approach. Finally, some open issues worth further investigation are highlighted.
The repeated game has long been the touchstone model for agents' long-run relationships. Previous results suggest that it is particularly difficult for a repeated game player to exert autocratic control over the payoffs, since they are jointly determined by all participants. This work discovers that the scale of a player's capability to unilaterally influence the payoffs may have been greatly underestimated. Under the conventional iterated prisoner's dilemma, we develop a general framework for controlling the feasible region where the players' payoff pairs lie. A control strategy player is able to confine the payoff pairs to her objective region, as long as this region has feasible linear boundaries. With this framework, many well-known existing strategies can be categorized and various new strategies with nice properties can be further identified. We show that the control strategies perform well both in a tournament and against a human-like opponent.
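To make the notion of a "payoff pair" concrete, the sketch below plays an iterated prisoner's dilemma between two deterministic strategies and reports each player's average payoff per round. It assumes the conventional payoff values (reward 3, sucker 0, temptation 5, punishment 1), which the abstract does not state, and it uses two textbook strategies chosen for illustration. It does not implement the control framework proposed in the paper.

```python
# Assumed conventional payoffs: (my move, opponent move) -> (my payoff, opponent payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own_history, opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]


def always_defect(own_history, opponent_history):
    """Defect in every round."""
    return "D"


def play(strategy_x, strategy_y, rounds):
    """Return the average per-round payoff pair (x, y) over a fixed number of rounds."""
    history_x, history_y = [], []
    total_x = total_y = 0
    for _ in range(rounds):
        move_x = strategy_x(history_x, history_y)
        move_y = strategy_y(history_y, history_x)
        payoff_x, payoff_y = PAYOFF[(move_x, move_y)]
        total_x += payoff_x
        total_y += payoff_y
        history_x.append(move_x)
        history_y.append(move_y)
    return total_x / rounds, total_y / rounds


if __name__ == "__main__":
    print(play(tit_for_tat, always_defect, 10))
    print(play(tit_for_tat, tit_for_tat, 10))
```

Plotting such pairs for many opponent strategies traces out the region of achievable payoffs, which is the object the abstract says a control strategy can confine.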
Supervised speech separation uses supervised learning algorithms to learn a mapping from an input noisy signal to an output target. With the fast development of deep learning, supervised separation has become the most important direction in the speech separation area in recent years. For a supervised algorithm, the training target has a significant impact on performance. The ideal ratio mask is a commonly used training target, which can improve the speech intelligibility and quality of the separated speech. However, it does not take into account the correlation between noise and clean speech. In this paper, we use the optimal ratio mask as the training target of the deep neural network (DNN) for speech separation. The experiments are carried out under various noise environments and signal-to-noise ratio (SNR) conditions. The results show that the optimal ratio mask outperforms other training targets in general.
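The difference between the two training targets can be sketched for a single time-frequency bin. The code assumes the power-ratio form of the ideal ratio mask, |S|² / (|S|² + |N|²) (a square-root variant is also common), and a commonly cited form of the optimal ratio mask that adds the speech-noise cross term Re(S·N*) to numerator and denominator. The paper's exact definitions may differ, and the function names are illustrative.

```python
def ideal_ratio_mask(speech, noise):
    """Power-ratio IRM for one time-frequency bin (assumed form, ignores correlation)."""
    speech_power = abs(speech) ** 2
    noise_power = abs(noise) ** 2
    return speech_power / (speech_power + noise_power)


def optimal_ratio_mask(speech, noise):
    """ORM for one bin: includes the cross term Re(S * conj(N)) (assumed form)."""
    speech_power = abs(speech) ** 2
    noise_power = abs(noise) ** 2
    cross = (speech * noise.conjugate()).real
    # The denominator equals |S + N|^2, the power of the noisy mixture.
    mixture_power = speech_power + noise_power + 2.0 * cross
    if mixture_power == 0.0:
        raise ValueError("mixture power is zero in this bin")
    return (speech_power + cross) / mixture_power


if __name__ == "__main__":
    # Hypothetical complex spectral values for one bin.
    uncorrelated = (1 + 0j, 0 + 1j)   # speech and noise 90 degrees apart
    correlated = (2 + 0j, 1 + 0j)     # speech and noise in phase
    for s, n in (uncorrelated, correlated):
        print(ideal_ratio_mask(s, n), optimal_ratio_mask(s, n))
```

When speech and noise are uncorrelated in a bin the cross term vanishes and the two masks coincide; they diverge exactly in the bins where the correlation the abstract refers to is non-zero.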