Models, code, and papers for "Nicu Sebe":

Whitening and Coloring transform for GANs

Jun 01, 2018
Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe

Batch Normalization (BN) is a common technique used in both discriminative and generative networks to speed up training. In addition, the learnable parameters of BN are commonly used in conditional Generative Adversarial Networks to represent class-specific information via conditional Batch Normalization (cBN). In this paper we propose to generalize both BN and cBN using a Whitening and Coloring based batch normalization. We apply our method to conditional and unconditional image generation tasks and show that replacing the BN feature standardization and scaling with our feature whitening and coloring improves both the final qualitative results and the training speed. We test our approach on different datasets and show a consistent improvement that is orthogonal to the GAN framework used. Our supervised CIFAR-10 results are higher than those of all previous works on this dataset.
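
The core operation can be sketched in a few lines of NumPy: ZCA-whiten a batch of features, then apply a learned coloring. This is only a minimal illustration on a 2-D batch; the paper's layer operates on convolutional feature maps, and the coloring parameters (`W_color` and `bias` below are hypothetical names) would be learned by backpropagation rather than passed in.

```python
import numpy as np

def whiten_and_color(x, W_color, bias, eps=1e-5):
    """ZCA-whiten a batch of features, then apply a learned coloring.

    x: (batch, features) activations.
    W_color, bias: coloring parameters (learnable in a real layer).
    """
    # Center the batch.
    mu = x.mean(axis=0, keepdims=True)
    xc = x - mu
    # Batch covariance and its inverse square root (ZCA whitening matrix).
    cov = xc.T @ xc / x.shape[0] + eps * np.eye(x.shape[1])
    vals, vecs = np.linalg.eigh(cov)
    W_zca = vecs @ np.diag(vals ** -0.5) @ vecs.T
    white = xc @ W_zca             # whitened: ~identity covariance
    return white @ W_color + bias  # colored: learned covariance and mean

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 4))  # correlated features
y = whiten_and_color(x, np.eye(4), np.zeros(4))
# With an identity coloring, the output batch covariance is close to identity,
# whereas plain BN would only make the per-feature variances 1.
cov_after = y.T @ y / x.shape[0]
```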

Attention-Guided Generative Adversarial Networks for Unsupervised Image-to-Image Translation

Apr 14, 2019
Hao Tang, Dan Xu, Nicu Sebe, Yan Yan

The state-of-the-art approaches in Generative Adversarial Networks (GANs) are able to learn a mapping function from one image domain to another with unpaired image data. However, these methods often produce artifacts and can only transfer low-level information, failing to transfer the high-level semantic parts of images. The main reason is that the generators lack the ability to detect the most discriminative semantic parts of images, which results in low-quality generated images. To address this limitation, in this paper we propose a novel Attention-Guided Generative Adversarial Network (AGGAN), which can detect the most discriminative semantic object and minimize changes to unwanted parts for semantic manipulation problems without using extra data or models. The attention-guided generators in AGGAN produce attention masks via a built-in attention mechanism, and then fuse the input image with the attention mask to obtain a high-quality target image. Moreover, we propose a novel attention-guided discriminator which only considers attended regions. The proposed AGGAN is trained in an end-to-end fashion with an adversarial loss, cycle-consistency loss, pixel loss and attention loss. Both qualitative and quantitative results demonstrate that our approach generates sharper and more accurate images than existing models.
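
The mask-based fusion described above can be illustrated with a small sketch (hypothetical shapes and names; in the real model the generator predicts both the translated content and the attention mask):

```python
import numpy as np

def attention_fuse(source, generated, mask):
    """Blend a generated image with the source via an attention mask.

    mask is in [0, 1]: 1 where the generator should act (the discriminative
    semantic part), 0 where source pixels pass through unchanged.
    Shapes: (H, W, C) images, (H, W, 1) mask.
    """
    return mask * generated + (1.0 - mask) * source

H, W = 4, 4
source = np.zeros((H, W, 3))
generated = np.ones((H, W, 3))
mask = np.zeros((H, W, 1))
mask[1:3, 1:3] = 1.0  # attend only to the 2x2 center region
out = attention_fuse(source, generated, mask)
print(out[..., 0].sum())  # 4.0 -> only the attended region was changed
```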

* 8 pages, 7 figures, Accepted to IJCNN 2019 

Regularized Evolutionary Algorithm for Dynamic Neural Topology Search

May 15, 2019
Cristiano Saltori, Subhankar Roy, Nicu Sebe, Giovanni Iacca

Designing neural networks for object recognition requires considerable architecture engineering. As a remedy, neuro-evolutionary network architecture search, which automatically searches for optimal network architectures using evolutionary algorithms, has recently become very popular. Although very effective, evolutionary algorithms rely heavily on having a large population of individuals (i.e., network architectures) and are therefore memory-expensive. In this work, we propose a Regularized Evolutionary Algorithm with a low memory footprint to evolve a dynamic image classifier. In detail, we introduce novel custom operators that regularize the evolutionary process of a micro-population of 10 individuals. We conduct experiments on three different digits datasets (MNIST, USPS, SVHN) and show that our evolutionary method obtains results competitive with the current state of the art.

Attention-based Fusion for Multi-source Human Image Generation

May 07, 2019
Stéphane Lathuilière, Enver Sangineto, Aliaksandr Siarohin, Nicu Sebe

We present a generalization of the person-image generation task, in which a human image is generated conditioned on a target pose and a set X of source appearance images. In this way, we can exploit multiple, possibly complementary images of the same person which are usually available at training and at testing time. The solution we propose is mainly based on a local attention mechanism which selects relevant information from different source image regions, avoiding the necessity to build specific generators for each specific cardinality of X. The empirical evaluation of our method shows the practical interest of addressing the person-image generation problem in a multi-source setting.

* 10 pages 

Appearance and Pose-Conditioned Human Image Generation using Deformable GANs

Apr 30, 2019
Aliaksandr Siarohin, Stéphane Lathuilière, Enver Sangineto, Nicu Sebe

In this paper, we address the problem of generating person images conditioned on both pose and appearance information. Specifically, given an image xa of a person and a target pose P(xb), extracted from a different image xb, we synthesize a new image of that person in pose P(xb), while preserving the visual details in xa. In order to deal with pixel-to-pixel misalignments caused by the pose differences between P(xa) and P(xb), we introduce deformable skip connections in the generator of our Generative Adversarial Network. Moreover, a nearest-neighbour loss is proposed instead of the common L1 and L2 losses in order to match the details of the generated image with the target image. Quantitative and qualitative results, using common datasets and protocols recently proposed for this task, show that our approach is competitive with respect to the state of the art. Moreover, we conduct an extensive evaluation using off-the-shelf person re-identification (Re-ID) systems trained with person-generation based augmented data, which is one of the most important applications for this task. Our experiments show that our Deformable GANs can significantly boost Re-ID accuracy and are even better than data-augmentation methods specifically trained using Re-ID losses.
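
The nearest-neighbour loss idea can be sketched in a toy form: each generated pixel is compared with the best-matching target pixel in a small neighbourhood, instead of strictly the same location as in L1/L2 losses. This is only a schematic single-channel version (`radius` is a hypothetical parameter; the paper compares patches of convolutional features):

```python
import numpy as np

def nn_loss(gen, target, radius=1):
    """Toy nearest-neighbour loss on (H, W) single-channel images."""
    H, W = gen.shape
    total = 0.0
    for i in range(H):
        for j in range(W):
            best = np.inf
            # Search a (2*radius+1)^2 neighbourhood in the target.
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ti, tj = i + di, j + dj
                    if 0 <= ti < H and 0 <= tj < W:
                        best = min(best, abs(gen[i, j] - target[ti, tj]))
            total += best
    return total / (H * W)

target = np.zeros((5, 5))
target[2, 2] = 1.0
shifted = np.roll(target, 1, axis=1)     # same content, shifted one pixel
print(nn_loss(shifted, target))          # 0.0: tolerant to small misalignment
print(np.abs(shifted - target).mean())   # 0.08: L1 penalizes the shift
```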

* submitted to TPAMI. arXiv admin note: substantial text overlap with arXiv:1801.00055 

Metric-Learning based Deep Hashing Network for Content Based Retrieval of Remote Sensing Images

Apr 02, 2019
Subhankar Roy, Enver Sangineto, Begüm Demir, Nicu Sebe

Hashing methods have recently been found very effective for the retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed. Traditional hashing methods in RS usually exploit hand-crafted features to learn hash functions and obtain binary codes, which can be insufficient to optimally represent the information content of RS images. To overcome this problem, in this paper we introduce a metric-learning based hashing network, which learns: 1) a semantic-based metric space for effective feature representation; and 2) compact binary hash codes for fast archive search. Our network considers an interplay of multiple loss functions that allows it to jointly learn a metric-based semantic space, facilitating similar images to be clustered together in that target space while producing compact final activations that lose negligible information when binarized. Experiments carried out on two benchmark RS archives show that the proposed network significantly improves retrieval performance at the same retrieval time compared to state-of-the-art hashing methods in RS.
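
The retrieval side can be sketched schematically: sign-binarize the network's final activations into hash codes, then rank the archive by Hamming distance to the query code. The toy activations below are made up for illustration; in the paper they would come from the trained network.

```python
import numpy as np

def binarize(activations):
    """Turn real-valued network activations into binary hash codes."""
    return (activations > 0).astype(np.uint8)

def hamming_rank(query_code, archive_codes):
    """Rank archive images by Hamming distance to the query code."""
    dists = (archive_codes != query_code).sum(axis=1)
    return np.argsort(dists, kind="stable"), dists

# Toy 8-bit codes for a 3-image archive; image 1 matches the query best.
archive = np.array([[ 1, -1,  1, -1,  1, -1,  1, -1],
                    [ 1,  1,  1,  1, -1, -1, -1, -1],
                    [-1, -1, -1, -1,  1,  1,  1,  1]], dtype=float)
query = np.array([1, 1, 1, -1, -1, -1, -1, -1], dtype=float)

codes = binarize(archive)
order, dists = hamming_rank(binarize(query), codes)
print(order[0], dists[order[0]])  # 1 1: nearest archive image and its distance
```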

* Submitted to IEEE Geoscience and Remote Sensing Letters 

Refine and Distill: Exploiting Cycle-Inconsistency and Knowledge Distillation for Unsupervised Monocular Depth Estimation

Mar 11, 2019
Andrea Pilzer, Stéphane Lathuilière, Nicu Sebe, Elisa Ricci

Nowadays, the majority of state-of-the-art monocular depth estimation techniques are based on supervised deep learning models. However, collecting RGB images with associated depth maps is a very time-consuming procedure. Therefore, recent works have proposed deep architectures that address the monocular depth prediction task as a reconstruction problem, thus avoiding the need to collect ground-truth depth. Following these works, we propose a novel self-supervised deep model for estimating depth maps. Our framework exploits two main strategies: refinement via cycle-inconsistency and distillation. Specifically, first a \emph{student} network is trained to predict a disparity map so as to recover, from a frame in one camera view, the associated image in the opposite view. Then, a backward cycle network is applied to the generated image to re-synthesize the input image, estimating the opposite disparity. A third network exploits the inconsistency between the original and the reconstructed input frame in order to output a refined depth map. Finally, knowledge distillation is exploited to transfer information from the refinement network to the student. Our extensive experimental evaluation demonstrates the effectiveness of the proposed framework, which outperforms state-of-the-art unsupervised methods on the KITTI benchmark.
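
The cycle-inconsistency signal exploited by the refinement network can be illustrated schematically: the per-pixel discrepancy between the input frame and its cycle reconstruction flags regions where the student's disparity is likely wrong (toy arrays below, not the paper's networks):

```python
import numpy as np

def inconsistency_map(original, cycled):
    """Per-pixel discrepancy between the input frame and its cycle
    reconstruction; large values indicate unreliable disparity that the
    refinement network should correct."""
    return np.abs(original - cycled)

x = np.linspace(0, 1, 16).reshape(4, 4)  # toy input frame
x_cycled = x.copy()
x_cycled[0, 0] += 0.5                    # reconstruction error in one region
err = inconsistency_map(x, x_cycled)
print(err.argmax(), round(err.max(), 2))  # 0 0.5: flags the corrupted pixel
```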

* Accepted at CVPR2019 

Fast and Robust Dynamic Hand Gesture Recognition via Key Frames Extraction and Feature Fusion

Jan 15, 2019
Hao Tang, Hong Liu, Wei Xiao, Nicu Sebe

Gesture recognition is a hot topic in computer vision and pattern recognition, playing a vitally important role in natural human-computer interfaces. Although great progress has been made recently, fast and robust hand gesture recognition remains an open problem, since existing methods have not well balanced performance and efficiency. To bridge this gap, this work combines image entropy and density clustering to extract key frames from hand gesture videos for further feature extraction, which improves the efficiency of recognition. Moreover, a feature fusion strategy is also proposed to further improve the feature representation, which elevates recognition performance. To validate our approach in a "wild" environment, we also introduce two new datasets called HandGesture and Action3D. Experiments consistently demonstrate that our strategy achieves competitive results on the Northwestern University, Cambridge, HandGesture and Action3D hand gesture datasets. Our code and datasets will be released at
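
A minimal sketch of the entropy criterion used for key-frame selection: histogram-based Shannon entropy of a grayscale frame, where content-rich frames (e.g. with a visible hand) tend to score higher than near-empty ones. The bin count is a hypothetical choice, not the paper's setting.

```python
import numpy as np

def image_entropy(img, bins=16):
    """Shannon entropy of a grayscale image's intensity histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]                 # drop empty bins (0 * log 0 := 0)
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
flat = np.full((32, 32), 0.5)        # near-empty frame: a single intensity
busy = rng.uniform(size=(32, 32))    # texture-rich frame
print(image_entropy(flat) < image_entropy(busy))  # True: keep the busy frame
```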

* 11 pages, 3 figures, accepted to NeuroComputing 

Predicting Group Cohesiveness in Images

Jan 05, 2019
Shreya Ghosh, Abhinav Dhall, Nicu Sebe, Tom Gedeon

Cohesiveness of a group is an essential indicator of the emotional state, structure and success of a group of people. We study the factors that influence the perception of group-level cohesion and propose methods for estimating human-perceived cohesion on the Group Cohesiveness Scale (GCS). Image analysis is performed at a group level via a multi-task convolutional neural network. To analyze the contribution of the group members' facial expressions to predicting the GCS, a capsule network is explored. In order to identify the visual cues (attributes) for cohesion, we conducted a user survey. Building on the Group Affect database, we add GCS annotations and propose the `GAF-Cohesion database'. The proposed model performs well on the database and is able to achieve near human-level performance in predicting a group's cohesion score. It is interesting to note that GCS as an attribute, when jointly trained for group-level emotion prediction, helps to increase the performance of the latter task. This suggests that group-level emotion and GCS are correlated.

Enhancing Perceptual Attributes with Bayesian Style Generation

Dec 03, 2018
Aliaksandr Siarohin, Gloria Zen, Nicu Sebe, Elisa Ricci

Deep learning has brought unprecedented progress in computer vision, and significant advances have been made in predicting subjective properties inherent to visual data (e.g., memorability, aesthetic quality, evoked emotions, etc.). Recently, some research works have even proposed deep learning approaches to modify images so as to appropriately alter these properties. Following this research line, this paper introduces a novel deep learning framework for synthesizing images in order to enhance a predefined perceptual attribute. Our approach takes as input a natural image and exploits recent models for deep style transfer and generative adversarial networks to change its style in order to modify a specific high-level attribute. Differently from previous works focusing on enhancing a specific property of a visual content, we propose a general framework and demonstrate its effectiveness in two use cases, i.e. increasing image memorability and generating scary pictures. We evaluate the proposed approach on publicly available benchmarks, demonstrating its advantages over state-of-the-art methods.

* ACCV-2018 

PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing

May 11, 2018
Dan Xu, Wanli Ouyang, Xiaogang Wang, Nicu Sebe

Depth estimation and scene parsing are two particularly important tasks in visual scene understanding. In this paper we tackle the problem of simultaneous depth estimation and scene parsing in a joint CNN. The task can be typically treated as a deep multi-task learning problem [42]. Different from previous methods directly optimizing multiple tasks given the input training data, this paper proposes a novel multi-task guided prediction-and-distillation network (PAD-Net), which first predicts a set of intermediate auxiliary tasks ranging from low level to high level, and then the predictions from these intermediate auxiliary tasks are utilized as multi-modal input via our proposed multi-modal distillation modules for the final tasks. During the joint learning, the intermediate tasks not only act as supervision for learning more robust deep representations but also provide rich multi-modal information for improving the final tasks. Extensive experiments are conducted on two challenging datasets (i.e. NYUD-v2 and Cityscapes) for both the depth estimation and scene parsing tasks, demonstrating the effectiveness of the proposed approach.

* Accepted at CVPR 2018 

Deformable GANs for Pose-based Human Image Generation

Apr 06, 2018
Aliaksandr Siarohin, Enver Sangineto, Stephane Lathuiliere, Nicu Sebe

In this paper we address the problem of generating person images conditioned on a given pose. Specifically, given an image of a person and a target pose, we synthesize a new image of that person in the novel pose. In order to deal with pixel-to-pixel misalignments caused by the pose differences, we introduce deformable skip connections in the generator of our Generative Adversarial Network. Moreover, a nearest-neighbour loss is proposed instead of the common L1 and L2 losses in order to match the details of the generated image with the target image. We test our approach using photos of persons in different poses and we compare our method with previous work in this area showing state-of-the-art results in two benchmarks. Our method can be applied to the wider field of deformable object generation, provided that the pose of the articulated object can be extracted using a keypoint detector.

* CVPR 2018 version 

Self Paced Deep Learning for Weakly Supervised Object Detection

Feb 21, 2018
Enver Sangineto, Moin Nabi, Dubravko Culibrk, Nicu Sebe

In a weakly-supervised scenario, object detectors need to be trained using image-level annotation alone. Since bounding-box-level ground truth is not available, most of the solutions proposed so far are based on an iterative, Multiple Instance Learning framework in which the current classifier is used to select the highest-confidence boxes in each image, which are treated as pseudo-ground truth in the next training iteration. However, the errors of an immature classifier can make the process drift, usually introducing many false positives into the training dataset. To alleviate this problem, we propose in this paper a training protocol based on the self-paced learning paradigm. The main idea is to iteratively select a subset of images and boxes that are the most reliable, and use them for training. While in the past few years similar strategies have been adopted for SVMs and other classifiers, we are the first to show that a self-paced approach can be used with deep-network-based classifiers in an end-to-end training pipeline. The method we propose is built on the fully-supervised Fast-RCNN architecture and can be applied to similar architectures which represent the input image as a bag of boxes. We show state-of-the-art results on Pascal VOC 2007, Pascal VOC 2010 and ILSVRC 2013. On ILSVRC 2013, our results based on a low-capacity AlexNet network outperform even those weakly-supervised approaches which are based on much higher-capacity networks.

* To appear at IEEE Transactions on PAMI 

Training Adversarial Discriminators for Cross-channel Abnormal Event Detection in Crowds

Jun 23, 2017
Mahdyar Ravanbakhsh, Enver Sangineto, Moin Nabi, Nicu Sebe

Abnormal crowd behaviour detection attracts considerable interest due to its importance in video surveillance scenarios. However, the ambiguity and the lack of sufficient "abnormal" ground truth data make end-to-end training of large deep networks hard in this domain. In this paper we propose to use Generative Adversarial Nets (GANs), which are trained to generate only the "normal" distribution of the data. During the adversarial GAN training, a discriminator "D" is used as a supervisor for the generator network "G" and vice versa. At testing time we use "D" to solve our discriminative task (abnormality detection), where "D" has been trained without the need for manually-annotated abnormal data. Moreover, in order to prevent "G" from learning a trivial identity function, we use a cross-channel approach, forcing "G" to transform raw-pixel data into motion information and vice versa. The quantitative results on standard benchmarks show that our method outperforms previous state-of-the-art methods in both the frame-level and the pixel-level evaluation.

Structured Coupled Generative Adversarial Networks for Unsupervised Monocular Depth Estimation

Aug 15, 2019
Mihai Marian Puscas, Dan Xu, Andrea Pilzer, Nicu Sebe

Inspired by the success of adversarial learning, we propose a new end-to-end unsupervised deep learning framework for monocular depth estimation consisting of two Generative Adversarial Networks (GAN), deeply coupled with a structured Conditional Random Field (CRF) model. The two GANs aim at generating distinct and complementary disparity maps and at improving the generation quality via exploiting the adversarial learning strategy. The deep CRF coupling model is proposed to fuse the generative and discriminative outputs from the dual GAN nets. As such, the model implicitly constructs mutual constraints on the two network branches and between the generator and discriminator. This facilitates the optimization of the whole network for better disparity generation. Extensive experiments on the KITTI, Cityscapes, and Make3D datasets clearly demonstrate the effectiveness of the proposed approach and show superior performance compared to state of the art methods. The code and models are available at 3dv---coupled-crf-disparity.

* Accepted at 3DV 2019 as ORAL 

Dual Generator Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

Jan 14, 2019
Hao Tang, Dan Xu, Wei Wang, Yan Yan, Nicu Sebe

State-of-the-art methods for image-to-image translation with Generative Adversarial Networks (GANs) can learn a mapping from one domain to another using unpaired image data. However, these methods require training one specific model for every pair of image domains, which limits their scalability when dealing with more than two image domains. In addition, the training stage of these methods has the common problem of mode collapse, which degrades the quality of the generated images. To tackle these issues, we propose a Dual Generator Generative Adversarial Network (G$^2$GAN), a robust and scalable approach that performs unpaired image-to-image translation for multiple domains using only dual generators within a single model. Moreover, we explore different optimization losses for better training of G$^2$GAN, and thus achieve unpaired image-to-image translation with higher consistency and better stability. Extensive experiments on six publicly available datasets with different scenarios, i.e., architectural buildings, seasons, landscapes and human faces, demonstrate that the proposed G$^2$GAN achieves superior model capacity and better generation performance compared with existing image-to-image translation GAN models.
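
The cycle-consistency constraint between the dual generators can be sketched with toy stand-in generators (simple invertible maps below; the real G and F are deep networks conditioned on the target domain):

```python
import numpy as np

# Hypothetical stand-ins for the two generators: G translates an image to a
# target domain, F translates it back.
def G(x, domain_shift):   # forward translation
    return x + domain_shift

def F(y, domain_shift):   # backward translation
    return y - domain_shift

def cycle_consistency_loss(x, domain_shift):
    """L1 distance between an image and its round trip through both
    generators; driving this to zero keeps translations content-preserving."""
    return np.abs(F(G(x, domain_shift), domain_shift) - x).mean()

x = np.arange(8.0)  # toy "image"
print(cycle_consistency_loss(x, 2.0))  # 0.0 for a perfectly invertible pair
```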

* 16 pages, 7 figures, accepted to ACCV 2018 

GestureGAN for Hand Gesture-to-Gesture Translation in the Wild

Aug 14, 2018
Hao Tang, Wei Wang, Dan Xu, Yan Yan, Nicu Sebe

Hand gesture-to-gesture translation in the wild is a challenging task since hand gestures can have arbitrary poses, sizes, locations and self-occlusions. Therefore, this task requires a high-level understanding of the mapping between the input source gesture and the output target gesture. To tackle this problem, we propose a novel hand Gesture Generative Adversarial Network (GestureGAN). GestureGAN consists of a single generator $G$ and a discriminator $D$, which takes as input a conditional hand image and a target hand skeleton image. GestureGAN utilizes the hand skeleton information explicitly, and learns the gesture-to-gesture mapping through two novel losses, the color loss and the cycle-consistency loss. The proposed color loss handles the issue of "channel pollution" when back-propagating the gradients. In addition, we present the Fr\'echet ResNet Distance (FRD) to evaluate the quality of generated images. Extensive experiments on two widely used benchmark datasets demonstrate that the proposed GestureGAN achieves state-of-the-art performance on the unconstrained hand gesture-to-gesture translation task. Meanwhile, the generated images are of high quality and are photo-realistic, allowing them to be used as data augmentation to improve the performance of a hand gesture classifier. Our model and code are available at

* 9 pages, 7 figures, ACM MM 2018 oral 

Learning Deep Representations of Appearance and Motion for Anomalous Event Detection

Oct 06, 2015
Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, Nicu Sebe

We present a novel unsupervised deep learning framework for anomalous event detection in complex video scenes. While most existing works merely use hand-crafted appearance and motion features, we propose Appearance and Motion DeepNet (AMDN) which utilizes deep neural networks to automatically learn feature representations. To exploit the complementary information of both appearance and motion patterns, we introduce a novel double fusion framework, combining both the benefits of traditional early fusion and late fusion strategies. Specifically, stacked denoising autoencoders are proposed to separately learn both appearance and motion features as well as a joint representation (early fusion). Based on the learned representations, multiple one-class SVM models are used to predict the anomaly scores of each input, which are then integrated with a late fusion strategy for final anomaly detection. We evaluate the proposed method on two publicly available video surveillance datasets, showing competitive performance with respect to state of the art approaches.
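
The late-fusion step can be sketched as a weighted combination of per-input anomaly scores from the three one-class models (appearance, motion, joint representation). The uniform weights and toy scores below are hypothetical choices for illustration; in the paper the scores come from one-class SVMs fitted on the learned features.

```python
import numpy as np

def late_fusion(appearance_scores, motion_scores, joint_scores,
                weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine per-input anomaly scores from the three one-class models
    by a weighted sum (higher fused score = more anomalous)."""
    w_a, w_m, w_j = weights
    return w_a * appearance_scores + w_m * motion_scores + w_j * joint_scores

# Toy scores for 4 inputs; the last input is anomalous under every model.
app   = np.array([0.1, 0.2, 0.1, 0.9])
mot   = np.array([0.2, 0.1, 0.3, 0.8])
joint = np.array([0.1, 0.1, 0.2, 0.7])
fused = late_fusion(app, mot, joint)
print(fused.argmax())  # 3: the fusion agrees the last input is anomalous
```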

* Oral paper in BMVC 2015 

Animating Arbitrary Objects via Deep Motion Transfer

Dec 24, 2018
Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, Nicu Sebe

This paper introduces a novel deep learning framework for image animation. Given an input image with a target object and a driving video sequence depicting a moving object, our framework generates a video in which the target object is animated according to the driving sequence. This is achieved through a deep architecture that decouples appearance and motion information. Our framework consists of three main modules: (i) a Keypoint Detector, trained in an unsupervised way to extract object keypoints; (ii) a Dense Motion prediction network for generating dense heatmaps from sparse keypoints, in order to better encode motion information; and (iii) a Motion Transfer Network, which uses the motion heatmaps and appearance information extracted from the input image to synthesize the output frames. We demonstrate the effectiveness of our method on several benchmark datasets, spanning a wide variety of object appearances, and show that our approach outperforms state-of-the-art image animation and video generation methods.

Monocular Depth Estimation using Multi-Scale Continuous CRFs as Sequential Deep Networks

Mar 01, 2018
Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, Nicu Sebe

Depth cues have proved very useful in various computer vision and robotic tasks. This paper addresses the problem of monocular depth estimation from a single still image. Inspired by the effectiveness of recent works on multi-scale convolutional neural networks (CNNs), we propose a deep model which fuses complementary information derived from multiple CNN side outputs. Different from previous methods using concatenation or weighted average schemes, the integration is obtained by means of continuous Conditional Random Fields (CRFs). In particular, we propose two different variations, one based on a cascade of multiple CRFs, the other on a unified graphical model. By designing a novel CNN implementation of mean-field updates for continuous CRFs, we show that both proposed models can be regarded as sequential deep networks and that training can be performed end-to-end. Through an extensive experimental evaluation, we demonstrate the effectiveness of the proposed approach and establish new state-of-the-art results for the monocular depth estimation task on three publicly available datasets, i.e. NYUD-V2, Make3D and KITTI.

* arXiv admin note: substantial text overlap with arXiv:1704.02157 
