Models, code, and papers for "Simone Calderara":

Generative Adversarial Models for People Attribute Recognition in Surveillance

Jul 07, 2017
Matteo Fabbri, Simone Calderara, Rita Cucchiara

In this paper we propose a deep architecture for detecting people attributes (e.g. gender, race, clothing ...) in surveillance contexts. Our proposal explicitly deals with the poor resolution and occlusion issues that often occur in surveillance footage by enhancing the images by means of Deep Convolutional Generative Adversarial Networks (DCGAN). Experiments show that by combining our Generative Reconstruction and Deep Attribute Classification Network we can effectively extract attributes even when resolution is poor and in the presence of strong occlusions covering up to 80% of the whole person figure.

* Accepted as oral presentation at AVSS 2017 

Learning to Divide and Conquer for Online Multi-Target Tracking

Sep 14, 2015
Francesco Solera, Simone Calderara, Rita Cucchiara

Online Multiple Target Tracking (MTT) is often addressed within the tracking-by-detection paradigm. Detections are first extracted independently in each frame, and object trajectories are then built by maximizing specifically designed coherence functions. Nevertheless, ambiguities arise in the presence of occlusions or detection errors. In this paper we claim that the ambiguities in tracking could be solved by a selective use of the features, by working with more reliable features if possible and exploiting a deeper representation of the target only if necessary. To this end, we propose an online divide and conquer tracker for static camera scenes, which partitions the assignment problem into local subproblems and solves them by selectively choosing and combining the best features. The complete framework is cast as a structural learning task that unifies these phases and learns tracker parameters from examples. Experiments on two different datasets highlight a significant improvement in tracking performance (MOTA +10%) over the state of the art.
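
As an illustration of the assignment problem mentioned in the abstract, here is a minimal sketch that matches existing tracks to new detections by minimizing total Euclidean distance. It is not the authors' tracker: the function name `associate`, the use of position as the only feature, and the brute-force search are all assumptions made for this example, and the search is only practical for a handful of targets.

```python
from itertools import permutations
from math import hypot


def associate(tracks, detections):
    """Return (match, cost) for the cheapest track -> detection assignment.

    Hypothetical helper: tracks and detections are lists of (x, y) points,
    and we assume len(tracks) <= len(detections). Every possible assignment
    is enumerated, so the cost grows factorially with the number of targets.
    """
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(detections)), len(tracks)):
        # Total distance if track t is assigned to detection perm[t].
        cost = sum(
            hypot(tracks[t][0] - detections[d][0],
                  tracks[t][1] - detections[d][1])
            for t, d in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return dict(enumerate(best_perm)), best_cost


# Two tracks from the previous frame and two detections in the current one.
tracks = [(0.0, 0.0), (10.0, 0.0)]
detections = [(9.0, 1.0), (1.0, 0.0)]
match, cost = associate(tracks, detections)
print(match, round(cost, 3))
```

Here track 0 is paired with detection 1 and track 1 with detection 0. Splitting the scene into local subproblems, as the abstract describes, would keep each such search small.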


Socially Constrained Structural Learning for Groups Detection in Crowd

Aug 06, 2015
Francesco Solera, Simone Calderara, Rita Cucchiara

Modern crowd theories agree that collective behavior is the result of the underlying interactions among small groups of individuals. In this work, we propose a novel algorithm for detecting social groups in crowds by means of a Correlation Clustering procedure on people trajectories. The affinity between crowd members is learned through an online formulation of the Structural SVM framework and a set of specifically designed features characterizing both their physical and social identity, inspired by Proxemic theory, Granger causality, DTW and Heat-maps. To adhere to sociological observations, we introduce a loss function (G-MITRE) able to deal with the complexity of evaluating group detection performances. We show our algorithm achieves state-of-the-art results when relying on both ground truth trajectories and tracklets previously extracted by available detector/tracker systems.


Semi-parametric Object Synthesis

Jul 24, 2019
Andrea Palazzi, Luca Bergamini, Simone Calderara, Rita Cucchiara

We present a new semi-parametric approach to synthesize novel views of an object from a single monocular image. First, we exploit man-made object symmetry and piece-wise planarity to integrate rich a-priori visual information into the novel viewpoint synthesis process. An Image Completion Network (ICN) then leverages 2.5D sketches rendered from a 3D CAD as guidance to generate a realistic image. In contrast to concurrent works, we do not rely solely on synthetic data but leverage instead existing datasets for 3D object detection to operate in a real-world scenario. Differently from competitors, our semi-parametric framework allows the handling of a wide range of 3D transformations. Thorough experimental analysis against state-of-the-art baselines shows the efficacy of our method both from a quantitative and a perceptive point of view. Code and supplementary material are available at: https://github.com/ndrplz/semiparametric


A Deep Learning based approach to VM behavior identification in cloud systems

Mar 05, 2019
Matteo Stefanini, Riccardo Lancellotti, Lorenzo Baraldi, Simone Calderara

Cloud computing data centers are growing in size and complexity to the point where monitoring and management of the infrastructure become a challenge due to scalability issues. A possible approach to cope with the size of such data centers is to identify VMs exhibiting similar behavior. Existing literature has demonstrated that clustering together VMs that show similar behavior may improve the scalability of both monitoring and management of a data center. However, available techniques suffer from a trade-off between accuracy and the time needed to achieve this result. In this paper we propose a different approach where, instead of unsupervised clustering, we rely on classifiers based on deep learning techniques to assign a newly deployed VM to a cluster of already-known VMs. The two proposed classifiers, namely DeepConv and DeepFFT, use a convolutional neural network, and the latter also exploits the Fast Fourier Transform to classify the VMs. Our proposal is validated using a set of traces describing the behavior of VMs from a real cloud data center. The experiments compare our proposal with state-of-the-art solutions available in the literature, demonstrating that our proposal achieves better performance. Furthermore, we show that our solution is significantly faster than the alternatives, as it can produce a perfect classification even with just a few samples of data, making our proposal viable also for classifying on-demand VMs that are characterized by a short life span.
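
To make the frequency-domain idea concrete, the sketch below turns a resource-usage trace into a magnitude spectrum that a classifier could take as input. This is only an illustration, not the DeepFFT model: the trace is synthetic, the name `dft_magnitudes` is hypothetical, and a naive O(n^2) discrete Fourier transform stands in for a real FFT routine to keep the example dependency-free.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Return normalized DFT magnitudes for bins 0 .. n/2 of a real signal.

    Naive transform used for clarity; a production pipeline would call an
    FFT implementation instead.
    """
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        mags.append(abs(acc) / n)
    return mags


# Synthetic CPU-utilisation trace (an assumption for this example):
# a constant 50% load plus a periodic component with 4 cycles per window.
trace = [0.5 + 0.3 * math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
spectrum = dft_magnitudes(trace)

# Skip bin 0 (the mean load) when looking for the dominant periodicity.
dominant = max(range(1, len(spectrum)), key=spectrum.__getitem__)
print(dominant)
```

The dominant bin recovers the 4-cycle periodicity regardless of where in the cycle the trace starts, which is one reason a spectrum can be a convenient representation for short traces.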

* Accepted at CLOSER2019 

Classifying Signals on Irregular Domains via Convolutional Cluster Pooling

Feb 13, 2019
Angelo Porrello, Davide Abati, Simone Calderara, Rita Cucchiara

We present a novel and hierarchical approach for supervised classification of signals spanning over a fixed graph, reflecting shared properties of the dataset. To this end, we introduce a Convolutional Cluster Pooling layer exploiting a multi-scale clustering in order to highlight, at different resolutions, locally connected regions on the input graph. Our proposal generalises well-established neural models such as Convolutional Neural Networks (CNNs) on irregular and complex domains, by means of the exploitation of the weight sharing property in a graph-oriented architecture. In this work, such property is based on the centrality of each vertex within its soft-assigned cluster. Extensive experiments on NTU RGB+D, CIFAR-10 and 20NEWS demonstrate the effectiveness of the proposed technique in capturing both local and global patterns in graph-structured data out of different domains.

* 12 pages, 6 figures. To appear in the Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89 

AND: Autoregressive Novelty Detectors

Jul 04, 2018
Davide Abati, Angelo Porrello, Simone Calderara, Rita Cucchiara

We propose an unsupervised model for novelty detection. The subject is treated as a density estimation problem, in which a deep neural network is employed to learn a parametric function that maximizes probabilities of training samples. This is achieved by equipping an autoencoder with a novel module, responsible for the maximization of compressed codes' likelihood by means of autoregression. We illustrate design choices and proper layers to perform autoregressive density estimation when dealing with both image and video inputs. Despite a very general formulation, our model shows promising results in diverse one-class novelty detection and video anomaly detection benchmarks.
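
The autoregressive view of density estimation factorizes the likelihood of a code z as the product of p(z_i | z_1, ..., z_{i-1}), so a sample is scored by summing conditional log-probabilities. The toy sketch below applies that chain rule to short binary codes, with each conditional depending only on the previous element and estimated by smoothed counting. It is an assumption-laden stand-in: the paper's module is a learned deep network over autoencoder codes, and the names `fit_conditionals` and `neg_log_likelihood` are hypothetical.

```python
import math
from collections import defaultdict


def fit_conditionals(codes, alpha=1.0):
    """Count-based estimate of p(z_i | z_{i-1}) for binary codes.

    Keys are (position, previous bit); values are [count of 0, count of 1],
    initialised with `alpha` pseudo-counts (Laplace smoothing).
    """
    counts = defaultdict(lambda: [alpha, alpha])
    for z in codes:
        prev = None  # the first element has no predecessor
        for i, bit in enumerate(z):
            counts[(i, prev)][bit] += 1
            prev = bit
    return counts


def neg_log_likelihood(z, counts):
    """Novelty score: -sum_i log p(z_i | z_{i-1}); higher means more novel."""
    nll, prev = 0.0, None
    for i, bit in enumerate(z):
        c = counts[(i, prev)]
        nll -= math.log(c[bit] / (c[0] + c[1]))
        prev = bit
    return nll


# Training codes are either all zeros or all ones.
train = [(0, 0, 0), (1, 1, 1)] * 10
model = fit_conditionals(train)
typical = neg_log_likelihood((0, 0, 0), model)
novel = neg_log_likelihood((0, 1, 0), model)
print(typical < novel)
```

A code that breaks the pattern seen in training receives a higher negative log-likelihood, which is the quantity one would threshold to flag novelty.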


Can Adversarial Networks Hallucinate Occluded People With a Plausible Aspect?

Jan 23, 2019
Federico Fulgeri, Matteo Fabbri, Stefano Alletto, Simone Calderara, Rita Cucchiara

When you see a person in a crowd, occluded by other persons, you miss visual information that can be used to recognize, re-identify or simply classify him or her. You can imagine their appearance given your experience, nothing more. Similarly, AI solutions can try to hallucinate missing information with specific deep learning architectures, suitably trained on people with and without occlusions. The goal of this work is to generate a complete image of a person, given an occluded version as input, that should be a) without occlusion, b) similar at pixel level to a completely visible person shape, and c) capable of preserving the visual attributes (e.g. male/female) of the original one. For this purpose, we propose a new approach that integrates state-of-the-art neural network architectures, namely U-nets and GANs, as well as discriminative attribute classification nets, into an architecture specifically designed to de-occlude people shapes. The network is trained to optimize a loss function that takes into account the aforementioned objectives. We also propose two datasets for testing our solution: the first one, occluded RAP, was created automatically by occluding real shapes of the RAP dataset (which also collects attributes of people's appearance); the second is a large synthetic dataset, AiC, generated in computer graphics with data extracted from the GTA video game, that contains 3D data of occluded objects by construction. The results outperform previous proposals. This result could be an initial step towards further research on recognizing people and their behavior in an open crowded world.

* Under review at CVIU 

Face-from-Depth for Head Pose Estimation on Depth Images

Aug 30, 2018
Guido Borghi, Matteo Fabbri, Roberto Vezzani, Simone Calderara, Rita Cucchiara

Depth cameras make it possible to set up reliable solutions for people monitoring and behavior understanding, especially when unstable or poor illumination conditions make common RGB sensors unusable. Therefore, we propose a complete framework for the estimation of the head and shoulder pose based on depth images only. A head detection and localization module is also included, in order to develop a complete end-to-end system. The core element of the framework is a Convolutional Neural Network, called POSEidon+, that receives as input three types of images and provides the 3D angles of the pose as output. Moreover, a Face-from-Depth component based on a Deterministic Conditional GAN model is able to hallucinate a face from the corresponding depth image. We empirically demonstrate that this positively impacts the system performance. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Experimental results show that our method outperforms several recent state-of-the-art works based on both intensity and depth input data, running in real time at more than 30 frames per second.

* Submitted to IEEE Transactions on PAMI, updated version (second round). arXiv admin note: substantial text overlap with arXiv:1611.10195 

Predicting the Driver's Focus of Attention: the DR(eye)VE Project

Jun 06, 2018
Andrea Palazzi, Davide Abati, Simone Calderara, Francesco Solera, Rita Cucchiara

In this work we aim to predict the driver's focus of attention. The goal is to estimate what a person would pay attention to while driving, and which part of the scene around the vehicle is more critical for the task. To this end we propose a new computer vision model based on a multi-branch deep architecture that integrates three sources of information: raw video, motion and scene semantics. We also introduce DR(eye)VE, the largest dataset of driving scenes for which eye-tracking annotations are available. This dataset features more than 500,000 registered frames, matching ego-centric views (from glasses worn by drivers) and car-centric views (from a roof-mounted camera), further enriched by measurements from other sensors. Results highlight that several attention patterns are shared across drivers and can be reproduced to some extent. The indication of which elements in the scene are likely to capture the driver's attention may benefit several applications in the context of human-vehicle interaction and driver attention analysis.

* IEEE Transactions on Pattern Analysis and Machine Intelligence 

TransFlow: Unsupervised Motion Flow by Joint Geometric and Pixel-level Estimation

Oct 30, 2017
Stefano Alletto, Davide Abati, Simone Calderara, Rita Cucchiara, Luca Rigazio

We address unsupervised optical flow estimation for ego-centric motion. We argue that optical flow can be cast as a geometrical warping between two successive video frames and devise a deep architecture to estimate such a transformation in two stages. First, a dense pixel-level flow is computed with a geometric prior imposing strong spatial constraints. Such a prior is typical of driving scenes, where the point of view is coherent with the vehicle motion. We show how such a global transformation can be approximated with a homography and how spatial transformer layers can be employed to compute the flow field implied by such a transformation. The second stage then refines the prediction by feeding it to a second, deeper network. A final reconstruction loss compares the warping of frame X(t) with the subsequent frame X(t+1) and guides both estimates. The model, which we named TransFlow, performs favorably compared to other unsupervised algorithms, and shows better generalization compared to supervised methods with a 3x reduction in error on unseen data.
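
The flow field implied by a homography can be written down directly: each pixel (x, y) is mapped to (x', y') by the 3x3 matrix in homogeneous coordinates, and the flow at that pixel is (x' - x, y' - y). The sketch below computes this on a small grid in plain Python. It is only an illustration of that geometric relation, not the TransFlow network or its spatial transformer layers; the function name `homography_flow` and the example matrices are assumptions.

```python
def homography_flow(H, width, height):
    """Return flow[y][x] = (dx, dy) induced by a 3x3 homography H.

    H is a row-major nested list. Each pixel is lifted to homogeneous
    coordinates (x, y, 1), transformed, and divided by the third component.
    """
    flow = []
    for y in range(height):
        row = []
        for x in range(width):
            w = H[2][0] * x + H[2][1] * y + H[2][2]
            xp = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
            yp = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
            row.append((xp - x, yp - y))
        flow.append(row)
    return flow


# A pure translation gives the same displacement at every pixel.
shift = [[1, 0, 2], [0, 1, -1], [0, 0, 1]]
shift_flow = homography_flow(shift, width=4, height=3)

# A 2x zoom about the origin gives displacements that grow with position.
zoom = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]
zoom_flow = homography_flow(zoom, width=4, height=3)
print(shift_flow[0][0], zoom_flow[2][3])
```

The zoom case loosely resembles forward motion in a driving scene, where displacement grows with distance from the point the camera is heading towards.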

* We have found a bug in the flow evaluation code that compromises the experimental evaluation, so the results provided in the paper are no longer correct. We are currently working on a new experimental campaign, but we estimate that results will be available in a few weeks and will drastically change the paper, hence the withdrawal request 

Learning to Map Vehicles into Bird's Eye View

Jun 26, 2017
Andrea Palazzi, Guido Borghi, Davide Abati, Simone Calderara, Rita Cucchiara

Awareness of the road scene is an essential component for both autonomous vehicles and Advanced Driver Assistance Systems and is gaining importance for both academia and car companies. This paper presents a way to learn a semantic-aware transformation which maps detections from a dashboard camera view onto a broader bird's eye occupancy map of the scene. To this end, a huge synthetic dataset featuring 1M pairs of frames, taken from both car dashboard and bird's eye view, has been collected and automatically annotated. A deep network is then trained to warp detections from the first to the second view. We demonstrate the effectiveness of our model against several baselines and observe that it is able to generalize to real-world data despite having been trained solely on synthetic data.

* Accepted to International Conference on Image Analysis and Processing (ICIAP) 2017 

Learning Where to Attend Like a Human Driver

May 09, 2017
Andrea Palazzi, Francesco Solera, Simone Calderara, Stefano Alletto, Rita Cucchiara

Despite the advent of autonomous cars, it's likely, at least in the near future, that human attention will still maintain a central role as a guarantee in terms of legal responsibility during the driving task. In this paper we study the dynamics of the driver's gaze and use it as a proxy to understand related attentional mechanisms. First, we build our analysis upon two questions: where is the driver looking, and at what? Second, we model the driver's gaze by training a coarse-to-fine convolutional network on short sequences extracted from the DR(eye)VE dataset. Experimental comparison against different baselines reveals that the driver's gaze can indeed be learnt to some extent, despite i) being highly subjective and ii) having only one driver's gaze available for each sequence due to the irreproducibility of the scene. Finally, we advocate for a new assisted driving paradigm which suggests to the driver, with no intervention, where she should focus her attention.

* To appear in IEEE Intelligent Vehicles Symposium 2017 

Domain Translation with Conditional GANs: from Depth to RGB Face-to-Face

Jan 23, 2019
Matteo Fabbri, Guido Borghi, Fabio Lanzi, Roberto Vezzani, Simone Calderara, Rita Cucchiara

Can faces acquired by low-cost depth sensors be useful to catch some characteristic details of the face? Typically the answer is no. However, new deep architectures can generate RGB images from data acquired in a different modality, such as depth data. In this paper, we propose a new Deterministic Conditional GAN, trained on annotated RGB-D face datasets, effective for a face-to-face translation from depth to RGB. Although the network cannot reconstruct the exact somatic features for unknown individual faces, it is capable of reconstructing plausible faces; their appearance is accurate enough to be used in many pattern recognition tasks. In fact, we test the network's capability to hallucinate with some Perceptual Probes, such as face aspect classification or landmark detection. Depth faces can be used instead of the corresponding RGB images, which are often not available due to difficult luminance conditions. Experimental results are very promising and are far better than those of previously proposed approaches: this domain translation can constitute a new way to exploit depth data in future applications.

* Accepted at ICPR 2018 

Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World

Sep 18, 2018
Matteo Fabbri, Fabio Lanzi, Simone Calderara, Andrea Palazzi, Roberto Vezzani, Rita Cucchiara

Multi-People Tracking in an open-world setting requires a special effort in precise detection. Moreover, temporal continuity in the detection phase gains more importance when scene cluttering introduces the challenging problem of occluded targets. For this purpose, we propose a deep network architecture that jointly extracts people body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts, by hallucinating plausible solutions for non-visible joints. We propose a new end-to-end architecture composed of four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time linker feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations, we created a Computer Graphics dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is, to date, the largest dataset (about 500,000 frames, almost 10 million body poses) of human body parts for people tracking in urban scenarios. Our architecture trained on virtual data also exhibits good generalization capabilities on public real tracking benchmarks, when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-id modules.

* Accepted at ECCV 2018 

Multi-views Embedding for Cattle Re-identification

Feb 13, 2019
Luca Bergamini, Angelo Porrello, Andrea Capobianco Dondona, Ercole Del Negro, Mauro Mattioli, Nicola D'Alterio, Simone Calderara

The people re-identification task has seen enormous improvements in recent years, mainly due to better image feature extraction from deep Convolutional Neural Networks (CNNs) and the availability of large datasets. However, little research has been conducted on animal identification and re-identification, even though this knowledge may be useful in a rich variety of scenarios. Here, we tackle cattle re-identification using deep CNNs and show how this task is poorly related to the human one, presenting unique challenges that make it far from being solved. We present various baselines, based either on deep architectures or on standard machine learning algorithms, and compare them with our solution. Finally, a rich ablation study is conducted to further investigate the peculiarities of this task.

* 8 pages, 3 figures. Accepted in the 14th International Conference on SIGNAL IMAGE TECHNOLOGY & INTERNET BASED SYSTEMS (SITIS-2018) 
