Models, code, and papers for "Yao Xiang":
This paper introduces a novel deep learning based method, named bridge neural network (BNN) to dig the potential relationship between two given data sources task by task. The proposed approach employs two convolutional neural networks that project the two data sources into a feature space to learn the desired common representation required by the specific task. The training objective with artificial negative samples is introduced with the ability of mini-batch training and it's asymptotically equivalent to maximizing the total correlation of the two data sources, which is verified by the theoretical analysis. The experiments on the tasks, including pair matching, canonical correlation analysis, transfer learning, and reconstruction demonstrate the state-of-the-art performance of BNN, which may provide new insights into the aspect of common representation learning.
Semi-supervised learning is sought for leveraging the unlabelled data when labelled data is difficult or expensive to acquire. Deep generative models (e.g., Variational Autoencoder (VAE)) and semisupervised Generative Adversarial Networks (GANs) have recently shown promising performance in semi-supervised classification for the excellent discriminative representing ability. However, the latent code learned by the traditional VAE is not exclusive (repeatable) for a specific input sample, which prevents it from excellent classification performance. In particular, the learned latent representation depends on a non-exclusive component which is stochastically sampled from the prior distribution. Moreover, the semi-supervised GAN models generate data from pre-defined distribution (e.g., Gaussian noises) which is independent of the input data distribution and may obstruct the convergence and is difficult to control the distribution of the generated data. To address the aforementioned issues, we propose a novel Adversarial Variational Embedding (AVAE) framework for robust and effective semi-supervised learning to leverage both the advantage of GAN as a high quality generative model and VAE as a posterior distribution learner. The proposed approach first produces an exclusive latent code by the model which we call VAE++, and meanwhile, provides a meaningful prior distribution for the generator of GAN. The proposed approach is evaluated over four different real-world applications and we show that our method outperforms the state-of-the-art models, which confirms that the combination of VAE++ and GAN can provide significant improvements in semisupervised classification.
One of the fundamental problems in crowdsourcing is the trade-off between the number of the workers needed for high-accuracy aggregation and the budget to pay. For saving budget, it is important to ensure high quality of the crowd-sourced labels, hence the total cost on label collection will be reduced. Since the self-confidence of the workers often has a close relationship with their abilities, a possible way for quality control is to request the workers to return the labels only when they feel confident, by means of providing unsure option to them. On the other hand, allowing workers to choose unsure option also leads to the potential danger of budget waste. In this work, we propose the analysis towards understanding when providing the unsure option indeed leads to significant cost reduction, as well as how the confidence threshold is set. We also propose an online mechanism, which is alternative for threshold selection when the estimation of the crowd ability distribution is difficult.
Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.
Multiple-instance learning (MIL) has served as an important tool for a wide range of vision applications, for instance, image classification, object detection, and visual tracking. In this paper, we propose a novel method to solve the classical MIL problem, named relaxed multiple-instance SVM (RMI-SVM). We treat the positiveness of instance as a continuous variable, use Noisy-OR model to enforce the MIL constraints, and jointly optimize the bag label and instance label in a unified framework. The optimization problem can be efficiently solved using stochastic gradient decent. The extensive experiments demonstrate that RMI-SVM consistently achieves superior performance on various benchmarks for MIL. Moreover, we simply applied RMI-SVM to a challenging vision task, common object discovery. The state-of-the-art results of object discovery on Pascal VOC datasets further confirm the advantages of the proposed method.
Inspired by biological vision systems, the over-complete local features with huge cardinality are increasingly used for face recognition during the last decades. Accordingly, feature selection has become more and more important and plays a critical role for face data description and recognition. In this paper, we propose a trainable feature selection algorithm based on the regularized frame for face recognition. By enforcing a sparsity penalty term on the minimum squared error (MSE) criterion, we cast the feature selection problem into a combinatorial sparse approximation problem, which can be solved by greedy methods or convex relaxation methods. Moreover, based on the same frame, we propose a sparse Ho-Kashyap (HK) procedure to obtain simultaneously the optimal sparse solution and the corresponding margin vector of the MSE criterion. The proposed methods are used for selecting the most informative Gabor features of face images for recognition and the experimental results on benchmark face databases demonstrate the effectiveness of the proposed methods.
Several approaches to image stitching use different constraints to estimate the motion model between image pairs. These constraints can be roughly divided into two categories: geometric constraints and photometric constraints. In this paper, geometric and photometric constraints are combined to improve the alignment quality, which is based on the observation that these two kinds of constraints are complementary. On the one hand, geometric constraints (e.g., point and line correspondences) are usually spatially biased and are insufficient in some extreme scenes, while photometric constraints are always evenly and densely distributed. On the other hand, photometric constraints are sensitive to displacements and are not suitable for images with large parallaxes, while geometric constraints are usually imposed by feature matching and are more robust to handle parallaxes. The proposed method therefore combines them together in an efficient mesh-based image warping framework. It achieves better alignment quality than methods only with geometric constraints, and can handle larger parallax than photometric-constraint-based method. Experimental results on various images illustrate that the proposed method outperforms representative state-of-the-art image stitching methods reported in the literature.
Identifying user's identity is a key problem in many data mining applications, such as product recommendation, customized content delivery and criminal identification. Given a set of accounts from the same or different social network platforms, user identification attempts to identify all accounts belonging to the same person. A commonly used solution is to build the relationship among different accounts by exploring their collective patterns, e.g., user profile, writing style, similar comments. However, this kind of method doesn't work well in many practical scenarios, since the information posted explicitly by users may be false due to various reasons. In this paper, we re-inspect the user identification problem from a novel perspective, i.e., identifying user's identity by matching his/her cameras. The underlying assumption is that multiple accounts belonging to the same person contain the same or similar camera fingerprint information. The proposed framework, called User Camera Identification (UCI), is based on camera fingerprints, which takes fully into account the problems of multiple cameras and reposting behaviors.
We study the problem of how to build a deep learning representation for 3D shape. Deep learning has shown to be very effective in variety of visual applications, such as image classification and object detection. However, it has not been successfully applied to 3D shape recognition. This is because 3D shape has complex structure in 3D space and there are limited number of 3D shapes for feature learning. To address these problems, we project 3D shapes into 2D space and use autoencoder for feature learning on the 2D images. High accuracy 3D shape retrieval performance is obtained by aggregating the features learned on 2D images. In addition, we show the proposed deep learning feature is complementary to conventional local image descriptors. By combing the global deep learning representation and the local descriptor representation, our method can obtain the state-of-the-art performance on 3D shape retrieval benchmarks.
Objective: Brain is a fantastic organ that helps creature adapting to the environment. Network is the most essential structure of brain, but the capability of a simple network is still not very clear. In this study, we try to expound some brain functions only by the network property. Methods: Every network can be equivalent to a simplified network, which is expressed by an equation set. The dynamic of the equation set can be described by some basic equations, which is based on the mathematical derivation. Results (1) In a closed network, the stability is based on the excitatory/inhibitory synapse proportion. Spike probabilities in the assembly can meet the solution of a nonlinear equation set. (2) Network activity can spontaneously evolve into a certain distribution under different stimulation, which is closely related to decision making. (3) Short memory can be formed by coupling of network assemblies. Conclusion: The essential property of a network may contribute to some important brain functions.
Deep learning algorithms have achieved excellent performance lately in a wide range of fields (e.g., computer version). However, a severe challenge faced by deep learning is the high dependency on hyper-parameters. The algorithm results may fluctuate dramatically under the different configuration of hyper-parameters. Addressing the above issue, this paper presents an efficient Orthogonal Array Tuning Method (OATM) for deep learning hyper-parameter tuning. We describe the OATM approach in five detailed steps and elaborate on it using two widely used deep neural network structures (Recurrent Neural Networks and Convolutional Neural Networks). The proposed method is compared to the state-of-the-art hyper-parameter tuning methods including manually (e.g., grid search and random search) and automatically (e.g., Bayesian Optimization) ones. The experiment results state that OATM can significantly save the tuning time compared to the state-of-the-art methods while preserving the satisfying performance.
Scene text recognition has been an important, active research topic in computer vision for years. Previous approaches mainly consider text as 1D signals and cast scene text recognition as a sequence prediction problem, by feat of CTC or attention based encoder-decoder framework, which is originally designed for speech recognition. However, different from speech voices, which are 1D signals, text instances are essentially distributed in 2D image spaces. To adhere to and make use of the 2D nature of text for higher recognition accuracy, we extend the vanilla CTC model to a second dimension, thus creating 2D-CTC. 2D-CTC can adaptively concentrate on most relevant features while excluding the impact from clutters and noises in the background; It can also naturally handle text instances with various forms (horizontal, oriented and curved) while giving more interpretable intermediate predictions. The experiments on standard benchmarks for scene text recognition, such as IIIT-5K, ICDAR 2015, SVP-Perspective, and CUTE80, demonstrate that the proposed 2D-CTC model outperforms state-of-the-art methods on the text of both regular and irregular shapes. Moreover, 2D-CTC exhibits its superiority over prior art on training and testing speed. Our implementation and models of 2D-CTC will be made publicly available soon later.
Brain-Computer Interface (BCI) bridges the human's neural world and the outer physical world by decoding individuals' brain signals into commands recognizable by computer devices. Deep learning has lifted the performance of brain-computer interface systems significantly in recent years. In this article, we systematically investigate brain signal types for BCI and related deep learning concepts for brain signal analysis. We then present a comprehensive survey of deep learning techniques used for BCI, by summarizing over 230 contributions most published in the past five years. Finally, we discuss the applied areas, opening challenges, and future directions for deep learning-based BCI.
In this paper, we present a fast yet effective method for pixel-level scale-invariant image fusion in spatial domain based on the scale-space theory. Specifically, we propose a scale-invariant structure saliency selection scheme based on the difference-of-Gaussian (DoG) pyramid of images to build the weights or activity map. Due to the scale-invariant structure saliency selection, our method can keep both details of small size objects and the integrity information of large size objects in images. In addition, our method is very efficient since there are no complex operation involved and easy to be implemented and therefore can be used for fast high resolution images fusion. Experimental results demonstrate the proposed method yields competitive or even better results comparing to state-of-the-art image fusion methods both in terms of visual quality and objective evaluation metrics. Furthermore, the proposed method is very fast and can be used to fuse the high resolution images in real-time. Code is available at https://github.com/yiqingmy/Fusion.
Recently, models based on deep neural networks have dominated the fields of scene text detection and recognition. In this paper, we investigate the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images. An end-to-end trainable neural network model for scene text spotting is proposed. The proposed model, named as Mask TextSpotter, is inspired by the newly published work Mask R-CNN. Different from previous methods that also accomplish text spotting with end-to-end trainable deep neural networks, Mask TextSpotter takes advantage of simple and smooth end-to-end learning procedure, in which precise text detection and recognition are acquired via semantic segmentation. Moreover, it is superior to previous methods in handling text instances of irregular shapes, for example, curved text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the proposed method achieves state-of-the-art results in both scene text detection and end-to-end text recognition tasks.
Previous deep learning based state-of-the-art scene text detection methods can be roughly classified into two categories. The first category treats scene text as a type of general objects and follows general object detection paradigm to localize scene text by regressing the text box locations, but troubled by the arbitrary-orientation and large aspect ratios of scene text. The second one segments text regions directly, but mostly needs complex post processing. In this paper, we present a method that combines the ideas of the two types of methods while avoiding their shortcomings. We propose to detect scene text by localizing corner points of text bounding boxes and segmenting text regions in relative positions. In inference stage, candidate boxes are generated by sampling and grouping corner points, which are further scored by segmentation maps and suppressed by NMS. Compared with previous methods, our method can handle long oriented text naturally and doesn't need complex post processing. The experiments on ICDAR2013, ICDAR2015, MSRA-TD500, MLT and COCO-Text demonstrate that the proposed algorithm achieves better or comparable results in both accuracy and efficiency. Based on VGG16, it achieves an F-measure of 84.3% on ICDAR2015 and 81.5% on MSRA-TD500.
Recognizing text in natural images is a challenging task with many unsolved problems. Different from those in documents, words in natural images often possess irregular shapes, which are caused by perspective distortion, curved character placement, etc. We propose RARE (Robust text recognizer with Automatic REctification), a recognition model that is robust to irregular text. RARE is a specially-designed deep neural network, which consists of a Spatial Transformer Network (STN) and a Sequence Recognition Network (SRN). In testing, an image is firstly rectified via a predicted Thin-Plate-Spline (TPS) transformation, into a more "readable" image for the following SRN, which recognizes text through a sequence recognition approach. We show that the model is able to recognize several types of irregular text, including perspective text and curved text. RARE is end-to-end trainable, requiring only images and associated text labels, making it convenient to train and deploy the model in practical systems. State-of-the-art or highly-competitive performance achieved on several benchmarks well demonstrates the effectiveness of the proposed model.
In this paper, we propose a novel age estimation method based on GLOH feature descriptor and multi-task learning (MTL). The GLOH feature descriptor, one of the state-of-the-art feature descriptor, is used to capture the age-related local and spatial information of face image. As the exacted GLOH features are often redundant, MTL is designed to select the most informative feature bins for age estimation problem, while the corresponding weights are determined by ridge regression. This approach largely reduces the dimensions of feature, which can not only improve performance but also decrease the computational burden. Experiments on the public available FG-NET database show that the proposed method can achieve comparable performance over previous approaches while using much fewer features.
Objective: Epilepsy is a chronic neurological disorder characterized by the occurrence of spontaneous seizures, which affects about one percent of the world's population. Most of the current seizure detection approaches strongly rely on patient history records and thus fail in the patient-independent situation of detecting the new patients. To overcome such limitation, we propose a robust and explainable epileptic seizure detection model that effectively learns from seizure states while eliminates the inter-patient noises. Methods: A complex deep neural network model is proposed to learn the pure seizure-specific representation from the raw non-invasive electroencephalography (EEG) signals through adversarial training. Furthermore, to enhance the explainability, we develop an attention mechanism to automatically learn the importance of each EEG channels in the seizure diagnosis procedure. Results: The proposed approach is evaluated over the Temple University Hospital EEG (TUH EEG) database. The experimental results illustrate that our model outperforms the competitive state-of-the-art baselines with low latency. Moreover, the designed attention mechanism is demonstrated ables to provide fine-grained information for pathological analysis. Conclusion and significance: We propose an effective and efficient patient-independent diagnosis approach of epileptic seizure based on raw EEG signals without manually feature engineering, which is a step toward the development of large-scale deployment for real-life use.