Models, code, and papers for "Ying Huang":
Optimal biomarker combinations for treatment-selection can be derived by minimizing total burden to the population caused by the targeted disease and its treatment. However, when multiple biomarkers are present, including all in the model can be expensive and hurt model performance. To remedy this, we consider feature selection in optimization by minimizing an extended total burden that additionally incorporates biomarker measurement costs. Formulating it as a 0-norm penalized weighted classification, we develop various procedures for estimating linear and nonlinear combinations. Through simulations and a real data example, we demonstrate the importance of incorporating feature-selection and marker cost when deriving treatment-selection rules.
Face anti-spoofing is crucial for the security of face recognition systems, as it protects them from presentation attacks. Previous works have shown the effectiveness of using depth and temporal supervision for this task. However, depth supervision is often considered only in a single frame, and temporal supervision is explored by utilizing certain signals which are not robust to scene changes. In this work, motivated by two-stream ConvNets, we propose a novel two-stream FreqSaptialTemporalNet for face anti-spoofing which simultaneously takes advantage of frequency, spatial and temporal information. Compared with existing methods which mine spoofing cues in multi-frame RGB images, we use multi-frame spectrum images as one input stream for the discriminative deep neural network, encouraging the primary difference between live and fake video to be automatically unearthed. Extensive experiments show promising improvements using the proposed architecture. Meanwhile, we propose a concise method to obtain a large amount of spoofing training data by utilizing a frequency augmentation pipeline, which contributes detailed visualization of the differences between live and fake images and alleviates the data insufficiency issue when training large networks.
In multi-person pose estimation, discriminating left/right joint types is a persistently hard problem because of their similar appearance. Traditionally, this problem is addressed by stacking multiple refinement modules to increase the network's receptive field and capture more global context, which also adds a great amount of computation. In this paper, we propose a Multi-level Network (MLN) that learns to aggregate features from lower-level (left/right information), upper-level (localization information), joint-limb level (complementary information) and global-level (context) information for discrimination of joint type. Through feature reuse and its intra-relation, MLN attains performance comparable to conventional methods while maintaining a runtime speed of 42.2 FPS.
A new method is developed for the problem of designing a complex decentralized control system that retains centralized control performance. The systematic procedure emphasizes quickly finding decentralized subcontrollers that match the closed-loop performance and robustness characteristics of the centralized controller. Its distinguishing features are that GA is used to optimize the design of the centralized H-infinity controller K(s) and the decentralized engine subcontroller KT(s), and that only one interface variable needs to satisfy the decentralized control system requirement according to the proposed selection principle. The optimization design is motivated by implementation issues, where it is desirable to reduce the time spent on trial and error and to accurately find the best decentralized subcontrollers. The method is applied to decentralized control system design for a short takeoff and landing fighter. Comparing the simulation results of the decentralized control system with those of the centralized control system validates that the decentralized control attains the performance and robustness of centralized control.
This paper studies face recognition (FR) and normalization in surveillance imagery. Surveillance FR is a challenging problem that has great value in law enforcement. Despite recent progress in conventional FR, less effort has been devoted to surveillance FR. To bridge this gap, we propose a Feature Adaptation Network (FAN) to jointly perform surveillance FR and normalization. Our face normalization mainly acts on image resolution and is closely related to face super-resolution. However, previous face super-resolution methods require paired training data with pixel-to-pixel correspondence, which is typically unavailable between real low- and high-resolution faces. Our FAN can leverage both paired and unpaired data, as we disentangle the features into identity and non-identity components and adapt the distribution of the identity features, which breaks the limit of current face super-resolution methods. We further propose a random scale augmentation scheme to learn resolution-robust identity features, with advantages over previous fixed scale augmentation. Extensive experiments on the LFW, WIDER FACE, QMUL-SurvFace and SCface datasets demonstrate the superiority of our proposed method over the state of the art on surveillance face recognition and normalization.
Graph Attention Networks (GATs) are the state-of-the-art neural architecture for representation learning with graphs. GATs learn attention functions that assign weights to nodes so that different nodes have different influences in the feature aggregation steps. In practice, however, induced attention functions are prone to over-fitting due to the increasing number of parameters and the lack of direct supervision on attention weights. GATs also suffer from over-smoothing at the decision boundary of nodes. Here we propose a framework to address their weaknesses via margin-based constraints on attention during training. We first theoretically demonstrate the over-smoothing behavior of GATs and then develop an approach using constraint on the attention weights according to the class boundary and feature aggregation pattern. Furthermore, to alleviate the over-fitting problem, we propose additional constraints on the graph structure. Extensive experiments and ablation studies on common benchmark datasets demonstrate the effectiveness of our method, which leads to significant improvements over the previous state-of-the-art graph attention methods on all datasets.
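The attention-weighted feature aggregation that the abstract above refers to can be sketched as follows. This is a minimal illustration of the standard GAT attention step only, not the margin-based constraints proposed in the paper. It assumes the node features are already linearly projected, and the attention vector `a`, the toy graph, and all function names are hypothetical.

```python
import math

def leaky_relu(x, slope=0.2):
    """LeakyReLU nonlinearity applied to a raw attention score."""
    return x if x > 0 else slope * x

def attention_weights(h, neighbors, a, i):
    """Softmax-normalized attention of node i over its neighbors.

    h: list of feature vectors (assumed already projected).
    neighbors: dict mapping a node index to its neighbor indices.
    a: attention vector of length 2 * feature_dim (hypothetical values).
    """
    scores = []
    for j in neighbors[i]:
        concat = h[i] + h[j]  # concatenate the two feature vectors
        scores.append(leaky_relu(sum(w * x for w, x in zip(a, concat))))
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(h, neighbors, a, i):
    """Attention-weighted sum of the neighbor features of node i."""
    alphas = attention_weights(h, neighbors, a, i)
    out = [0.0] * len(h[i])
    for alpha, j in zip(alphas, neighbors[i]):
        for k in range(len(out)):
            out[k] += alpha * h[j][k]
    return out

# Toy graph with self-loops; all values are illustrative assumptions.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [0, 1, 2], 1: [0, 1], 2: [0, 2]}
a = [0.5, -0.5, 1.0, 0.25]
alphas = attention_weights(h, neighbors, a, 0)
z0 = aggregate(h, neighbors, a, 0)
```

Because the weights come from a softmax, they sum to one, so each aggregated feature is a convex combination of neighbor features. Constraints on attention, such as those the abstract describes, would act on these weights during training.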
Automated machine learning aims to automate the whole process of machine learning, including model configuration. In this paper, we focus on automated hyperparameter optimization (HPO) based on sequential model-based optimization (SMBO). Though conventional SMBO algorithms work well when abundant HPO trials are available, they are far from satisfactory in practical applications where a trial on a huge dataset may be so costly that an optimal hyperparameter configuration is expected to be returned in as few trials as possible. Observing that human experts draw on their expertise in a machine learning model by trying configurations that once performed well on other datasets, we are inspired to speed up HPO by transferring knowledge from historical HPO trials on other datasets. We propose an efficient end-to-end HPO algorithm named Transfer Neural Processes (TNP), which achieves transfer learning by incorporating trials on other datasets, initializing the model with well-generalized parameters, and learning an initial set of hyperparameters to evaluate. Experiments on extensive OpenML datasets and three computer vision datasets show that the proposed model can achieve state-of-the-art performance in at least one order of magnitude fewer trials.
Timely, accurate and automatic detection of pavement cracks is necessary for making cost-effective decisions concerning road maintenance. Conventional crack detection algorithms focus on the design of single or multiple crack features and classifiers. However, complicated topological structures, varying degrees of damage and oil stains make the design of crack features difficult. In addition, the contextual information around a crack is not investigated extensively in the design process. Accordingly, these design features have limited discriminative adaptability and cannot fuse effectively with the classifiers. To solve these problems, this paper proposes a deep learning network for pavement crack detection. Using the Encoder-Decoder structure, crack characteristics with multiple contexts are automatically learned, and end-to-end crack detection is achieved. Specifically, we first propose the Multi-Dilation (MD) module, which can synthesize the crack features of multiple context sizes via dilated convolution with multiple rates. The crack MD features obtained in this module can describe cracks of different widths and topologies. Next, we propose the SE-Upsampling (SEU) module, which uses the Squeeze-and-Excitation learning operation to optimize the MD features. Finally, the above two modules are integrated to develop the fast crack detection network, namely, FPCNet. This network continuously optimizes the MD features step-by-step to realize fast pixel-level crack detection. Experiments are conducted on challenging public CFD datasets and G45 crack datasets involving various crack types under different shooting conditions. The distinct performance and speed improvements over all the datasets demonstrate that the proposed method outperforms other state-of-the-art crack detection methods.
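The idea of gathering context at several sizes via dilated convolution with multiple rates can be illustrated with a small sketch. This is a simplified 1-D, single-channel version written for illustration, not the FPCNet Multi-Dilation module itself, which operates on 2-D feature maps. The rates `(1, 2, 4)`, the kernel, and the function names are assumptions.

```python
def dilated_conv1d(signal, kernel, rate):
    """Same-size 1-D dilated cross-correlation with zero padding.

    Kernel taps are spaced `rate` samples apart, so a larger rate
    covers a wider context with the same number of weights.
    """
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for t, w in enumerate(kernel):
            idx = i + (t - center) * rate
            if 0 <= idx < len(signal):  # positions outside count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out

def multi_dilation(signal, kernel, rates=(1, 2, 4)):
    """Apply the same kernel at several dilation rates (assumed rates)."""
    return [dilated_conv1d(signal, kernel, r) for r in rates]

# A single impulse shows how each rate spreads its response.
signal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
kernel = [1.0, 1.0, 1.0]
features = multi_dilation(signal, kernel)
```

With rate 1 the response covers three adjacent samples; with rates 2 and 4 the same three taps reach samples two and four positions away. This is how a module of this kind can respond to structures of different widths.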
In order to learn quickly with few samples, meta-learning utilizes prior knowledge learned from previous tasks. However, a critical challenge in meta-learning is task uncertainty and heterogeneity, which cannot be handled by globally sharing knowledge among tasks. In this paper, based on gradient-based meta-learning, we propose a hierarchically structured meta-learning (HSML) algorithm that explicitly tailors the transferable knowledge to different clusters of tasks. Inspired by the way human beings organize knowledge, we resort to a hierarchical task clustering structure to cluster tasks. As a result, the proposed approach not only addresses the challenge by customizing knowledge to different clusters of tasks, but also preserves knowledge generalization among a cluster of similar tasks. In addition, to handle changing task relationships, we extend the hierarchical structure to a continual learning environment. The experimental results show that our approach can achieve state-of-the-art performance in both toy-regression and few-shot image classification problems.
A series of methods have been proposed to reconstruct an image from compressively sensed random measurements, but most of them have high time complexity and are inappropriate for patch-based compressed sensing capture because of the serious blocky artifacts in their restoration results. In this paper, we present a non-iterative image reconstruction method for patch-based compressively sensed random measurements. Our method features two cascaded networks based on residual convolutional neural networks to learn end-to-end full image restoration, and is capable of reconstructing image patches and removing the blocky effect at low time cost. Experimental results on synthetic and real data show that our method outperforms state-of-the-art compressive sensing (CS) reconstruction methods with patch-based CS measurement. To demonstrate the effectiveness of our method in a more general setting, we apply the de-blocking process in our method to JPEG compression artifact removal and achieve outstanding performance as well.
In this paper, we propose a two-timescale delay-optimal dynamic clustering and power allocation design for downlink network MIMO systems. The dynamic clustering control is adaptive to the global queue state information (GQSI) only and computed at the base station controller (BSC) over a longer time scale. On the other hand, the power allocations of all the BSs in one cluster are adaptive to both intra-cluster channel state information (CCSI) and intra-cluster queue state information (CQSI), and computed at the cluster manager (CM) over a shorter time scale. We show that the two-timescale delay-optimal control can be formulated as an infinite-horizon average cost Constrained Partially Observed Markov Decision Process (CPOMDP). By exploiting the special problem structure, we shall derive an equivalent Bellman equation in terms of Pattern Selection Q-factor to solve the CPOMDP. To address the distributive requirement and the issue of exponential memory requirement and computational complexity, we approximate the Pattern Selection Q-factor by the sum of Per-cluster Potential functions and propose a novel distributive online learning algorithm to estimate the Per-cluster Potential functions (at each CM) as well as the Lagrange multipliers (LM) (at each BS). We show that the proposed distributive online learning algorithm converges almost surely (with probability 1). By exploiting the birth-death structure of the queue dynamics, we further decompose the Per-cluster Potential function into sum of Per-cluster Per-user Potential functions and formulate the instantaneous power allocation as a Per-stage QSI-aware Interference Game played among all the CMs. We also propose a QSI-aware Simultaneous Iterative Water-filling Algorithm (QSIWFA) and show that it can achieve the Nash Equilibrium (NE).
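For readers unfamiliar with water-filling, the classic single-user form can be sketched as below. This is the textbook algorithm, not the QSI-aware Simultaneous Iterative Water-filling Algorithm (QSIWFA) from the abstract, whose queue-aware weighting is not specified here. The channel gains, the power budget, and the function name are illustrative assumptions.

```python
def water_filling(gains, total_power, tol=1e-12):
    """Classic water-filling power allocation over parallel channels.

    Each channel i receives p_i = max(0, mu - 1/g_i), where the water
    level mu is found by bisection so that the powers sum to the budget.
    """
    inv = [1.0 / g for g in gains]  # inverse gains act as floor heights
    lo, hi = min(inv), max(inv) + total_power
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        used = sum(max(0.0, mu - n) for n in inv)
        if used > total_power:
            hi = mu  # water level too high
        else:
            lo = mu  # water level too low or exact
    mu = 0.5 * (lo + hi)
    return [max(0.0, mu - n) for n in inv]

# Three channels with assumed gains; the weakest one gets no power.
powers = water_filling([2.0, 1.0, 0.1], total_power=1.0)
```

In this toy case the water level settles at 1.25, so the two stronger channels receive 0.75 and 0.25 units of power and the weakest channel is switched off. An iterative multi-cluster scheme would repeat an allocation of this kind per cluster while treating the other clusters' interference as fixed.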
Recently, deep learning based facial expression recognition (FER) methods have attracted considerable attention and they usually require large-scale labelled training data. Nonetheless, the publicly available facial expression databases typically contain a small amount of labelled data. In this paper, to overcome the above issue, we propose a novel joint deep learning of facial expression synthesis and recognition method for effective FER. More specifically, the proposed method involves a two-stage learning procedure. Firstly, a facial expression synthesis generative adversarial network (FESGAN) is pre-trained to generate facial images with different facial expressions. To increase the diversity of the training images, FESGAN is elaborately designed to generate images with new identities from a prior distribution. Secondly, an expression recognition network is jointly learned with the pre-trained FESGAN in a unified framework. In particular, the classification loss computed from the recognition network is used to simultaneously optimize the performance of both the recognition network and the generator of FESGAN. Moreover, in order to alleviate the problem of data bias between the real images and the synthetic images, we propose an intra-class loss with a novel real data-guided back-propagation (RDBP) algorithm to reduce the intra-class variations of images from the same class, which can significantly improve the final performance. Extensive experimental results on public facial expression databases demonstrate the superiority of the proposed method compared with several state-of-the-art FER methods.
Supervised machine learning methods usually require a large set of labeled examples for model training. However, in many real applications, unlabeled data are plentiful but labeled data are limited, and the acquisition of labels is costly. Active learning (AL) reduces the labeling cost by iteratively selecting the most valuable data to query their labels from the annotator. This article introduces ALiPy, a Python toolbox for active learning. ALiPy provides a module-based implementation of the active learning framework, which allows users to conveniently evaluate, compare and analyze the performance of active learning methods. In the toolbox, multiple options are available for each component of the learning framework, including data processing, active selection, label query, results visualization, etc. In addition to the implementations of more than 20 state-of-the-art active learning algorithms, ALiPy also allows users to easily configure and implement their own approaches under different active learning settings, such as AL for multi-label data, AL with noisy annotators, AL with different costs and so on. The toolbox is well-documented and open-source on GitHub, and can be easily installed through PyPI.
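The iterative select-and-query loop described above can be sketched in a few lines. This is a generic pool-based uncertainty sampling example written in plain Python. It deliberately does not use the ALiPy API, and the 1-D threshold model, the oracle, and all names are assumptions made for illustration.

```python
def fit_threshold(labeled):
    """Fit a 1-D threshold classifier: midpoint between class extremes."""
    neg = [x for x, y in labeled if y == 0]
    pos = [x for x, y in labeled if y == 1]
    return 0.5 * (max(neg) + min(pos))

def query_most_uncertain(pool, threshold):
    """Pick the unlabeled point closest to the decision boundary."""
    return min(pool, key=lambda x: abs(x - threshold))

def active_learning_loop(pool, oracle, labeled, budget):
    """Repeat: fit the model, select the most uncertain point, query it."""
    pool, labeled, queried = list(pool), list(labeled), []
    for _ in range(budget):
        if not pool:
            break
        x = query_most_uncertain(pool, fit_threshold(labeled))
        pool.remove(x)
        labeled.append((x, oracle(x)))  # the annotator supplies the label
        queried.append(x)
    return fit_threshold(labeled), queried

# Hypothetical oracle with a true boundary at 0.7 and a small pool.
oracle = lambda x: int(x >= 0.7)
pool = [i / 16 for i in range(1, 16)]
seed = [(0.0, 0), (1.0, 1)]
threshold, queried = active_learning_loop(pool, oracle, seed, budget=4)
```

With four queries the learned threshold moves from 0.5 to 0.71875, close to the oracle's boundary at 0.7. Labeling four points chosen at random from the pool would usually leave a wider gap around the boundary.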
Context enhancement is critical for night vision (NV) applications, especially for the dark night situation without any artificial lights. In this paper, we present the infrared-to-visual (IR2VI) algorithm, a novel unsupervised thermal-to-visible image translation framework based on generative adversarial networks (GANs). IR2VI is able to learn the intrinsic characteristics from VI images and integrate them into IR images. Since the existing unsupervised GAN-based image translation approaches face several challenges, such as incorrect mapping and lack of fine details, we propose a structure connection module and a region-of-interest (ROI) focal loss method to address the current limitations. Experimental results show the superiority of the IR2VI algorithm over baseline methods.
Current instance segmentation methods can be categorized into segmentation-based methods that segment first and then cluster, and proposal-based methods that detect first and then predict masks for each instance proposal using repooling. In this work, we propose a one-stage method, named EmbedMask, that unifies both approaches by taking advantage of their respective strengths. Like proposal-based methods, EmbedMask builds on top of detection models, making it strong in detection capability. Meanwhile, EmbedMask applies extra embedding modules to generate embeddings for pixels and proposals, where pixel embeddings are guided by proposal embeddings if they belong to the same instance. Through this embedding coupling process, pixels are assigned to the mask of the proposal if their embeddings are similar. The pixel-level clustering enables EmbedMask to generate high-resolution masks without missing details from repooling, and the existence of proposal embedding simplifies and strengthens the clustering procedure to achieve high speed with higher performance than segmentation-based methods. Without any bells and whistles, EmbedMask achieves performance comparable to Mask R-CNN, which is the representative two-stage method, and can produce more detailed masks at a higher speed. Code is available at github.com/yinghdb/EmbedMask.
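The assignment of pixels to a proposal by embedding similarity can be sketched as follows. This is a minimal illustration that assumes a fixed Euclidean distance threshold. The abstract does not specify the similarity measure EmbedMask actually uses, so `max_distance`, the toy embeddings, and the function name are hypothetical.

```python
import math

def assign_mask(pixel_embeddings, proposal_embedding, max_distance=0.5):
    """Binary mask: 1 where a pixel embedding lies near the proposal's.

    pixel_embeddings: 2-D grid (list of rows) of embedding tuples.
    max_distance: assumed fixed threshold on Euclidean distance.
    """
    return [
        [1 if math.dist(e, proposal_embedding) <= max_distance else 0
         for e in row]
        for row in pixel_embeddings
    ]

# A 2 x 3 grid of toy 2-D pixel embeddings and two proposal embeddings.
pixels = [
    [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)],
    [(0.0, 0.2), (0.9, 1.1), (1.0, 0.9)],
]
mask_a = assign_mask(pixels, (0.0, 0.0))
mask_b = assign_mask(pixels, (1.0, 1.0))
```

Each proposal yields its own mask at full pixel resolution, which reflects the abstract's point that clustering pixels directly avoids the loss of detail caused by repooling.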
Human pose estimation has made significant advances in recent years. However, the existing datasets are limited in their coverage of pose variety. In this paper, we introduce a novel benchmark, FollowMeUp Sports, that makes an important advance in terms of specific postures, self-occlusion and class balance, a contribution that we feel is required for future development in human body models. This comprehensive dataset was collected using an established taxonomy of over 200 standard workout activities with three different shot angles. The collected videos cover a wider variety of specific workout activities than previous datasets, including push-ups, squats and body movement near the ground with severe self-occlusion or occlusion by sports equipment and outfits. Given these rich images, we perform a detailed analysis of the leading human pose estimation approaches, gaining insights into the successes and failures of these methods.
Incorporating external knowledge into a neural dialogue model is critically important for dialogue systems to behave like real humans. Memory networks are a promising way to handle this problem. However, existing memory networks do not perform well when leveraging heterogeneous information from different sources. In this paper, we propose a novel and versatile external memory network called Heterogeneous Memory Networks (HMNs) to simultaneously utilize user utterances, dialogue history and background knowledge tuples. In our method, historical sequential dialogues are encoded and stored into the context-aware memory enhanced by a gating mechanism, while grounding knowledge tuples are encoded and stored into the context-free memory. During decoding, the decoder augmented with HMNs recurrently selects each word in one response utterance from these two memories and a general vocabulary. Experimental results on multiple real-world datasets show that HMNs significantly outperform the state-of-the-art data-driven task-oriented dialogue models in most domains.
Removing undesirable reflections from a single image captured through a glass window is of practical importance to visual computing systems. Although state-of-the-art methods can obtain decent results in certain situations, performance declines significantly when tackling more general real-world cases. These failures stem from the intrinsic difficulty of single image reflection removal -- the fundamental ill-posedness of the problem, and the insufficiency of densely-labeled training data needed for resolving this ambiguity within learning-based neural network pipelines. In this paper, we address these issues by exploiting targeted network enhancements and the novel use of misaligned data. For the former, we augment a baseline network architecture by embedding context encoding modules that are capable of leveraging high-level contextual clues to reduce indeterminacy within areas containing strong reflections. For the latter, we introduce an alignment-invariant loss function that facilitates exploiting misaligned real-world training data that is much easier to collect. Experimental results collectively show that our method outperforms the state-of-the-art with aligned data, and that significant improvements are possible when using additional misaligned data.
Existing camera-projector calibration methods typically warp feature points from a camera image to a projector image using estimated homographies, and often suffer from errors in camera parameters and noise due to imperfect planarity of the calibration target. In this paper we propose a simple yet robust solution that explicitly deals with these challenges. Following the structured light (SL) camera-projector calibration framework, a carefully designed correspondence algorithm is built on top of the De Bruijn patterns. This correspondence is then used for initial camera-projector calibration. Then, to gain more robustness against noise, especially noise from an imperfectly planar calibration board, a bundle adjustment algorithm is developed to jointly optimize the estimated camera and projector models. Aside from the robustness, our solution requires only one shot of the SL pattern for each calibration board pose, which is much more convenient than multi-shot solutions in practice. Validation is conducted on both synthetic and real datasets, and our method shows clear advantages over existing methods in all experiments.