Models, code, and papers for "Yanwei Fu":

Semi-supervised Vocabulary-informed Learning

Apr 24, 2016
Yanwei Fu, Leonid Sigal

Despite significant progress in object categorization, in recent years, a number of important challenges remain, mainly, ability to learn from limited labeled data and ability to recognize object classes within large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited sized class vocabularies and typically requires separation between supervised and unsupervised classes, allowing former to inform the latter but not vice versa. We propose the notion of semi-supervised vocabulary-informed learning to alleviate the above mentioned challenges and address problems of supervised, zero-shot and open set recognition using a unified framework. Specifically, we propose a maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms, ensuring that labeled samples are projected closest to their correct prototypes, in the embedding space, than to others. We show that resulting model shows improvements in supervised, zero-shot, and large open set recognition, with up to 310K class vocabulary on AwA and ImageNet datasets.

* 10 pages, Accepted by CVPR 2016 as an oral presentation 

  Click for Model/Code and Paper
A Jointly Learned Deep Architecture for Facial Attribute Analysis and Face Detection in the Wild

Jul 27, 2017
Keke He, Yanwei Fu, Xiangyang Xue

Facial attribute analysis in the real world scenario is very challenging mainly because of complex face variations. Existing works of analyzing face attributes are mostly based on the cropped and aligned face images. However, this result in the capability of attribute prediction heavily relies on the preprocessing of face detector. To address this problem, we present a novel jointly learned deep architecture for both facial attribute analysis and face detection. Our framework can process the natural images in the wild and our experiments on CelebA and LFWA datasets clearly show that the state-of-the-art performance is obtained.

  Click for Model/Code and Paper
Meta-Reinforced Synthetic Data for One-Shot Fine-Grained Visual Recognition

Nov 17, 2019
Satoshi Tsutsui, Yanwei Fu, David Crandall

One-shot fine-grained visual recognition often suffers from the problem of training data scarcity for new fine-grained classes. To alleviate this problem, an off-the-shelf image generator can be applied to synthesize additional training images, but these synthesized images are often not helpful for actually improving the accuracy of one-shot fine-grained recognition. This paper proposes a meta-learning framework to combine generated images with original images, so that the resulting ``hybrid'' training images can improve one-shot learning. Specifically, the generic image generator is updated by a few training instances of novel classes, and a Meta Image Reinforcing Network (MetaIRNet) is proposed to conduct one-shot fine-grained recognition as well as image reinforcement. The model is trained in an end-to-end manner, and our experiments demonstrate consistent improvement over baselines on one-shot fine-grained image classification benchmarks.

* Accepted by Conference on Neural Information Processing System 2019 

  Click for Model/Code and Paper
Detecting Tiny Moving Vehicles in Satellite Videos

Jul 05, 2018
Wei Ao, Yanwei Fu, Feng Xu

In recent years, the satellite videos have been captured by a moving satellite platform. In contrast to consumer, movie, and common surveillance videos, satellite video can record the snapshot of the city-scale scene. In a broad field-of-view of satellite videos, each moving target would be very tiny and usually composed of several pixels in frames. Even worse, the noise signals also existed in the video frames, since the background of the video frame has the subpixel-level and uneven moving thanks to the motion of satellites. We argue that this is a new type of computer vision task since previous technologies are unable to detect such tiny vehicles efficiently. This paper proposes a novel framework that can identify the small moving vehicles in satellite videos. In particular, we offer a novel detecting algorithm based on the local noise modeling. We differentiate the potential vehicle targets from noise patterns by an exponential probability distribution. Subsequently, a multi-morphological-cue based discrimination strategy is designed to distinguish correct vehicle targets from a few existing noises further. Another significant contribution is to introduce a series of evaluation protocols to measure the performance of tiny moving vehicle detection systematically. We annotate a satellite video manually and use it to test our algorithms under different evaluation criterion. The proposed algorithm is also compared with the state-of-the-art baselines, and demonstrates the advantages of our framework over the benchmarks.

  Click for Model/Code and Paper
Multi-view Metric Learning for Multi-view Video Summarization

Nov 25, 2015
Yanwei Fu, Lingbo Wang, Yanwen Guo

Traditional methods on video summarization are designed to generate summaries for single-view video records; and thus they cannot fully exploit the redundancy in multi-view video records. In this paper, we present a multi-view metric learning framework for multi-view video summarization that combines the advantages of maximum margin clustering with the disagreement minimization criterion. The learning framework thus has the ability to find a metric that best separates the data, and meanwhile to force the learned metric to maintain original intrinsic information between data points, for example geometric information. Facilitated by such a framework, a systematic solution to the multi-view video summarization problem is developed. To the best of our knowledge, it is the first time to address multi-view video summarization from the viewpoint of metric learning. The effectiveness of the proposed method is demonstrated by experiments.

  Click for Model/Code and Paper
Robust Classification by Pre-conditioned LASSO and Transductive Diffusion Component Analysis

Nov 19, 2015
Yanwei Fu, De-An Huang, Leonid Sigal

Modern machine learning-based recognition approaches require large-scale datasets with large number of labelled training images. However, such datasets are inherently difficult and costly to collect and annotate. Hence there is a great and growing interest in automatic dataset collection methods that can leverage the web. % which are collected % in a cheap, efficient and yet unreliable way. Collecting datasets in this way, however, requires robust and efficient ways for detecting and excluding outliers that are common and prevalent. % Outliers are thus a % prominent treat of using these dataset. So far, there have been a limited effort in machine learning community to directly detect outliers for robust classification. Inspired by the recent work on Pre-conditioned LASSO, this paper formulates the outlier detection task using Pre-conditioned LASSO and employs \red{unsupervised} transductive diffusion component analysis to both integrate the topological structure of the data manifold, from labeled and unlabeled instances, and reduce the feature dimensionality. Synthetic experiments as well as results on two real-world classification tasks show that our framework can robustly detect the outliers and improve classification.

  Click for Model/Code and Paper
Pixel2Mesh++: Multi-View 3D Mesh Generation via Deformation

Aug 16, 2019
Chao Wen, Yinda Zhang, Zhuwen Li, Yanwei Fu

We study the problem of shape generation in 3D mesh representation from a few color images with known camera poses. While many previous works learn to hallucinate the shape directly from priors, we resort to further improving the shape quality by leveraging cross-view information with a graph convolutional network. Instead of building a direct mapping function from images to 3D shape, our model learns to predict series of deformations to improve a coarse shape iteratively. Inspired by traditional multiple view geometry methods, our network samples nearby area around the initial mesh's vertex locations and reasons an optimal deformation using perceptual feature statistics built from multiple input images. Extensive experiments show that our model produces accurate 3D shape that are not only visually plausible from the input perspectives, but also well aligned to arbitrary viewpoints. With the help of physically driven architecture, our model also exhibits generalization capability across different semantic categories, number of input images, and quality of mesh initialization.

* Accepted by ICCV 2019 

  Click for Model/Code and Paper
Multi-level Semantic Feature Augmentation for One-shot Learning

Oct 02, 2018
Zitian Chen, Yanwei Fu, Yinda Zhang, Leonid Sigal

The ability to quickly recognize and learn new visual concepts from limited samples enables humans to swiftly adapt to new environments. This ability is enabled by semantic associations of novel concepts with those that have already been learned and stored in memory. Computers can start to ascertain similar abilities by utilizing a semantic concept space. A concept space is a high-dimensional semantic space in which similar abstract concepts appear close and dissimilar ones far apart. In this paper, we propose a novel approach to one-shot learning that builds on this idea. Our approach learns to map a novel sample instance to a concept, relates that concept to the existing ones in the concept space and generates new instances, by interpolating among the concepts, to help learning. Instead of synthesizing new image instance, we propose to directly synthesize instance features by leveraging semantics using a novel auto-encoder network we call dual TriNet. The encoder part of the TriNet learns to map multi-layer visual features of deep CNNs, that is, multi-level concepts, to a semantic vector. In semantic space, we search for related concepts, which are then projected back into the image feature spaces by the decoder portion of the TriNet. Two strategies in the semantic space are explored. Notably, this seemingly simple strategy results in complex augmented feature distributions in the image feature space, leading to substantially better performance.

  Click for Model/Code and Paper
Progressive Deep Neural Networks Acceleration via Soft Filter Pruning

Aug 24, 2018
Yang He, Xuanyi Dong, Guoliang Kang, Yanwei Fu, Yi Yang

This paper proposed a Progressive Soft Filter Pruning method (PSFP) to prune the filters of deep Neural Networks which can thus be accelerated in the inference. Specifically, the proposed PSFP method prunes the network progressively and enables the pruned filters to be updated when training the model after pruning. PSFP has three advantages over previous works: 1) Larger model capacity. Updating previously pruned filters provides our approach with larger optimization space than fixing the filters to zero. Therefore, the network trained by our method has a larger model capacity to learn from the training data. 2) Less dependence on the pre-trained model. Large capacity enables our method to train from scratch and prune the model simultaneously. In contrast, previous filter pruning methods should be conducted on the basis of the pre-trained model to guarantee their performance. Empirically, PSFP from scratch outperforms the previous filter pruning methods. 3) Pruning the neural network progressively makes the selection of low-norm filters much more stable, which has a potential to get a better performance. Moreover, our approach has been demonstrated effective for many advanced CNN architectures. Notably, on ILSCRC-2012, our method reduces more than 42% FLOPs on ResNet-101 with even 0.2% top-5 accuracy improvement, which has advanced the state-of-the-art. On ResNet-50, our progressive pruning method have 1.08% top-1 accuracy improvement over the pruning method without progressive pruning.

* Extended Journal Version of arXiv:1808.06866 

  Click for Model/Code and Paper
Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks

Aug 21, 2018
Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, Yi Yang

This paper proposed a Soft Filter Pruning (SFP) method to accelerate the inference procedure of deep Convolutional Neural Networks (CNNs). Specifically, the proposed SFP enables the pruned filters to be updated when training the model after pruning. SFP has two advantages over previous works: (1) Larger model capacity. Updating previously pruned filters provides our approach with larger optimization space than fixing the filters to zero. Therefore, the network trained by our method has a larger model capacity to learn from the training data. (2) Less dependence on the pre-trained model. Large capacity enables SFP to train from scratch and prune the model simultaneously. In contrast, previous filter pruning methods should be conducted on the basis of the pre-trained model to guarantee their performance. Empirically, SFP from scratch outperforms the previous filter pruning methods. Moreover, our approach has been demonstrated effective for many advanced CNN architectures. Notably, on ILSCRC-2012, SFP reduces more than 42% FLOPs on ResNet-101 with even 0.2% top-5 accuracy improvement, which has advanced the state-of-the-art. Code is publicly available on GitHub:

* Accepted to IJCAI 2018 

  Click for Model/Code and Paper
Semi-Latent GAN: Learning to generate and modify facial images from attributes

Apr 07, 2017
Weidong Yin, Yanwei Fu, Leonid Sigal, Xiangyang Xue

Generating and manipulating human facial images using high-level attributal controls are important and interesting problems. The models proposed in previous work can solve one of these two problems (generation or manipulation), but not both coherently. This paper proposes a novel model that learns how to both generate and modify the facial image from high-level semantic attributes. Our key idea is to formulate a Semi-Latent Facial Attribute Space (SL-FAS) to systematically learn relationship between user-defined and latent attributes, as well as between those attributes and RGB imagery. As part of this newly formulated space, we propose a new model --- SL-GAN which is a specific form of Generative Adversarial Network. Finally, we present an iterative training algorithm for SL-GAN. The experiments on recent CelebA and CASIA-WebFace datasets validate the effectiveness of our proposed framework. We will also make data, pre-trained models and code available.

* 10 pages, submitted to ICCV 2017 

  Click for Model/Code and Paper
Deep Learning for Video Classification and Captioning

Feb 22, 2018
Zuxuan Wu, Ting Yao, Yanwei Fu, Yu-Gang Jiang

Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today's big data. In this paper, we focus on reviewing two lines of research aiming to stimulate the comprehension of videos with deep learning: video classification and video captioning. While video classification concentrates on automatically labeling video clips based on their semantic contents like human actions or complex events, video captioning attempts to generate a complete and natural sentence, enriching the single label as in video classification, to capture the most informative dynamics in videos. In addition, we also provide a review of popular benchmarks and competitions, which are critical for evaluating the technical progress of this vibrant field.

* Book chapter in Frontiers of Multimedia Research 

  Click for Model/Code and Paper
Transductive Multi-view Zero-Shot Learning

Mar 03, 2015
Yanwei Fu, Timothy M. Hospedales, Tao Xiang, Shaogang Gong

Most existing zero-shot learning approaches exploit transfer learning via an intermediate-level semantic representation shared between an annotated auxiliary dataset and a target dataset with different classes and no annotation. A projection from a low-level feature space to the semantic representation space is learned from the auxiliary dataset and is applied without adaptation to the target dataset. In this paper we identify two inherent limitations with these approaches. First, due to having disjoint and potentially unrelated classes, the projection functions learned from the auxiliary dataset/domain are biased when applied directly to the target dataset/domain. We call this problem the projection domain shift problem and propose a novel framework, transductive multi-view embedding, to solve it. The second limitation is the prototype sparsity problem which refers to the fact that for each target class, only a single prototype is available for zero-shot learning given a semantic representation. To overcome this problem, a novel heterogeneous multi-view hypergraph label propagation method is formulated for zero-shot learning in the transductive embedding space. It effectively exploits the complementary information offered by different semantic representations and takes advantage of the manifold structures of multiple representation spaces in a coherent manner. We demonstrate through extensive experiments that the proposed approach (1) rectifies the projection shift between the auxiliary and target domains, (2) exploits the complementarity of multiple semantic representations, (3) significantly outperforms existing methods for both zero-shot and N-shot recognition on three image and video benchmark datasets, and (4) enables novel cross-view annotation tasks.

* accepted by IEEE TPAMI, more info and longer report will be available in : 

  Click for Model/Code and Paper
Learning Large Euclidean Margin for Sketch-based Image Retrieval

Dec 11, 2018
Peng Lu, Gao Huang, Yanwei Fu, Guodong Guo, Hangyu Lin

This paper addresses the problem of Sketch-Based Image Retrieval (SBIR), for which bridge the gap between the data representations of sketch images and photo images is considered as the key. Previous works mostly focus on learning a feature space to minimize intra-class distances for both sketches and photos. In contrast, we propose a novel loss function, named Euclidean Margin Softmax (EMS), that not only minimizes intra-class distances but also maximizes inter-class distances simultaneously. It enables us to learn a feature space with high discriminability, leading to highly accurate retrieval. In addition, this loss function is applied to a conditional network architecture, which could incorporate the prior knowledge of whether a sample is a sketch or a photo. We show that the conditional information can be conveniently incorporated to the recently proposed Squeeze and Excitation (SE) module, lead to a conditional SE (CSE) module. Extensive experiments are conducted on two widely used SBIR benchmark datasets. Our approach, although being very simple, achieved new state-of-the-art on both datasets, surpassing existing methods by a large margin.

* 13 pages, 6 figures 

  Click for Model/Code and Paper
SCSP: Spectral Clustering Filter Pruning with Soft Self-adaption Manners

Jun 14, 2018
Huiyuan Zhuo, Xuelin Qian, Yanwei Fu, Heng Yang, Xiangyang Xue

Deep Convolutional Neural Networks (CNN) has achieved significant success in computer vision field. However, the high computational cost of the deep complex models prevents the deployment on edge devices with limited memory and computational resource. In this paper, we proposed a novel filter pruning for convolutional neural networks compression, namely spectral clustering filter pruning with soft self-adaption manners (SCSP). We first apply spectral clustering on filters layer by layer to explore their intrinsic connections and only count on efficient groups. By self-adaption manners, the pruning operations can be done in few epochs to let the network gradually choose meaningful groups. According to this strategy, we not only achieve model compression while keeping considerable performance, but also find a novel angle to interpret the model compression process.

  Click for Model/Code and Paper
MSplit LBI: Realizing Feature Selection and Dense Estimation Simultaneously in Few-shot and Zero-shot Learning

Jun 12, 2018
Bo Zhao, Xinwei Sun, Yanwei Fu, Yuan Yao, Yizhou Wang

It is one typical and general topic of learning a good embedding model to efficiently learn the representation coefficients between two spaces/subspaces. To solve this task, $L_{1}$ regularization is widely used for the pursuit of feature selection and avoiding overfitting, and yet the sparse estimation of features in $L_{1}$ regularization may cause the underfitting of training data. $L_{2}$ regularization is also frequently used, but it is a biased estimator. In this paper, we propose the idea that the features consist of three orthogonal parts, \emph{namely} sparse strong signals, dense weak signals and random noise, in which both strong and weak signals contribute to the fitting of data. To facilitate such novel decomposition, \emph{MSplit} LBI is for the first time proposed to realize feature selection and dense estimation simultaneously. We provide theoretical and simulational verification that our method exceeds $L_{1}$ and $L_{2}$ regularization, and extensive experimental results show that our method achieves state-of-the-art performance in the few-shot and zero-shot learning.

* Accepted by the 35th International Conference on Machine Learning 

  Click for Model/Code and Paper
A Cascaded Convolutional Neural Network for Single Image Dehazing

Mar 21, 2018
Chongyi Li, Jichang Guo, Fatih Porikli, Huazhu Fu, Yanwei Pang

Images captured under outdoor scenes usually suffer from low contrast and limited visibility due to suspended atmospheric particles, which directly affects the quality of photos. Despite numerous image dehazing methods have been proposed, effective hazy image restoration remains a challenging problem. Existing learning-based methods usually predict the medium transmission by Convolutional Neural Networks (CNNs), but ignore the key global atmospheric light. Different from previous learning-based methods, we propose a flexible cascaded CNN for single hazy image restoration, which considers the medium transmission and global atmospheric light jointly by two task-driven subnetworks. Specifically, the medium transmission estimation subnetwork is inspired by the densely connected CNN while the global atmospheric light estimation subnetwork is a light-weight CNN. Besides, these two subnetworks are cascaded by sharing the common features. Finally, with the estimated model parameters, the haze-free image is obtained by the atmospheric scattering model inversion, which achieves more accurate and effective restoration performance. Qualitatively and quantitatively experimental results on the synthetic and real-world hazy images demonstrate that the proposed method effectively removes haze from such images, and outperforms several state-of-the-art dehazing methods.

* This manuscript is accepted by IEEE ACCESS 

  Click for Model/Code and Paper
Learning from Synthetic Data Using a Stacked Multichannel Autoencoder

Sep 21, 2015
Xi Zhang, Yanwei Fu, Shanshan Jiang, Leonid Sigal, Gady Agam

Learning from synthetic data has many important and practical applications. An example of application is photo-sketch recognition. Using synthetic data is challenging due to the differences in feature distributions between synthetic and real data, a phenomenon we term synthetic gap. In this paper, we investigate and formalize a general framework-Stacked Multichannel Autoencoder (SMCAE) that enables bridging the synthetic gap and learning from synthetic data more efficiently. In particular, we show that our SMCAE can not only transform and use synthetic data on the challenging face-sketch recognition task, but that it can also help simulate real images, which can be used for training classifiers for recognition. Preliminary experiments validate the effectiveness of the framework.

  Click for Model/Code and Paper
Transductive Multi-label Zero-shot Learning

Mar 26, 2015
Yanwei Fu, Yongxin Yang, Tim Hospedales, Tao Xiang, Shaogang Gong

Zero-shot learning has received increasing interest as a means to alleviate the often prohibitive expense of annotating training data for large scale recognition problems. These methods have achieved great success via learning intermediate semantic representations in the form of attributes and more recently, semantic word vectors. However, they have thus far been constrained to the single-label case, in contrast to the growing popularity and importance of more realistic multi-label data. In this paper, for the first time, we investigate and formalise a general framework for multi-label zero-shot learning, addressing the unique challenge therein: how to exploit multi-label correlation at test time with no training data for those classes? In particular, we propose (1) a multi-output deep regression model to project an image into a semantic word space, which explicitly exploits the correlations in the intermediate semantic layer of word vectors; (2) a novel zero-shot learning algorithm for multi-label data that exploits the unique compositionality property of semantic word vector representations; and (3) a transductive learning strategy to enable the regression model learned from seen classes to generalise well to unseen classes. Our zero-shot learning experiments on a number of standard multi-label datasets demonstrate that our method outperforms a variety of baselines.

* 12 pages, 6 figures, Accepted to BMVC 2014 (oral) 

  Click for Model/Code and Paper