Models, code, and papers for "Simone Bianco":
This paper introduces a new method for face verification across large age gaps and also a dataset containing variations of age in the wild, the Large Age-Gap (LAG) dataset, with images ranging from child/young to adult/old. The proposed method exploits a deep convolutional neural network (DCNN) pre-trained for the face recognition task on a large dataset and then fine-tuned for the large age-gap face verification task. Fine-tuning is performed in a Siamese architecture using a contrastive loss function. A feature injection layer is introduced to boost verification accuracy, showing the ability of the DCNN to learn a similarity metric leveraging external features. Experimental results on the LAG dataset show that our method outperforms the state-of-the-art face verification solutions considered.
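The contrastive loss mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used margin-based formulation on the Euclidean distance between two embeddings; the exact loss, margin value, and embedding size used by the authors are not given here, so `contrastive_loss`, `margin`, and the toy vectors are illustrative assumptions only.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_identity, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    Assumption: the common formulation is used, not necessarily the
    paper's exact one. Matching pairs are pulled together; non-matching
    pairs are pushed apart until they are at least `margin` away.
    """
    d = euclidean_distance(emb_a, emb_b)
    if same_identity:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Toy usage with hypothetical 2-D embeddings from the two Siamese branches.
loss_match = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_identity=True)
loss_far_mismatch = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_identity=False)
```

In this sketch a non-matching pair that is already farther apart than the margin contributes no loss, which is what lets the network focus on hard pairs.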
In this paper we consider how to automatically create pleasing photo collages by placing a set of images on a limited canvas area. The task is formulated as an optimization problem. Unlike existing state-of-the-art approaches, we exploit subjective experiments to model and learn pleasantness from user preferences. To this end, we design an experimental framework for the identification of the criteria that need to be taken into account to generate a pleasing photo collage. Five different thematic photo datasets are used to create collages using state-of-the-art criteria. A first subjective experiment, in which several subjects evaluated the collages, emphasizes that different criteria are involved in the subjective definition of pleasantness. We then identify new global and local criteria and design algorithms to quantify them. The relative importance of these criteria is automatically learned by exploiting the user preferences, and new collages are generated. To validate our framework, we performed several psycho-visual experiments involving different users. The results show that the proposed framework makes it possible to learn a novel computational model which effectively encodes an inter-user definition of pleasantness. The learned definition of pleasantness generalizes well to new photo datasets of different themes and sizes not used in the learning. Moreover, compared with two state-of-the-art approaches, the collages created using our framework are preferred by the majority of the users.
In this paper we propose a deep residual autoencoder exploiting Residual-in-Residual Dense Blocks (RRDB) to remove artifacts in JPEG-compressed images independently of the Quality Factor (QF) used. The proposed approach leverages both the learning capacity of deep residual networks and prior knowledge of the JPEG compression pipeline. The proposed model operates in the YCbCr color space and performs JPEG artifact restoration in two phases using two different autoencoders: the first one restores the luma channel exploiting 2D convolutions; the second one, using the restored luma channel as a guide, restores the chroma channels exploiting 3D convolutions. Extensive experimental results on three widely used benchmark datasets (i.e. LIVE1, BDS500, and CLASSIC-5) show that our model is able to outperform the state of the art with respect to all the evaluation metrics considered (i.e. PSNR, PSNR-B, and SSIM). This result is remarkable since the approaches in the state of the art use a different set of weights for each compression quality, while the proposed model uses the same weights for all of them, making it applicable to images in the wild where the QF used for compression is unknown. Furthermore, the proposed model shows greater robustness than state-of-the-art methods when applied to compression qualities not seen during training.
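As a small illustration of the color space the abstract above refers to, the sketch below converts RGB pixels to YCbCr and separates luma from chroma, which is the split the two restoration phases operate on. It assumes the full-range conversion used by JPEG/JFIF; the function names and the list-of-tuples pixel layout are hypothetical, and the autoencoders themselves are not reproduced.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range RGB -> YCbCr conversion as used in JPEG/JFIF (inputs in 0..255)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def split_luma_chroma(pixels):
    """Split a list of (r, g, b) pixels into a luma list and a chroma list.

    Hypothetical helper: phase one would restore `luma`, and phase two
    would restore `chroma` using the restored luma as a guide.
    """
    luma, chroma = [], []
    for r, g, b in pixels:
        y, cb, cr = rgb_to_ycbcr(r, g, b)
        luma.append(y)
        chroma.append((cb, cr))
    return luma, chroma


# Toy usage: a mid-gray pixel and a white pixel carry no chroma offset.
luma, chroma = split_luma_chroma([(128, 128, 128), (255, 255, 255)])
```

Neutral pixels map to chroma values of about 128 in both channels, so all of their information sits in the luma channel.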
Recent research has widely explored the problem of aesthetics assessment of images with generic content. However, few approaches have been specifically designed to predict the aesthetic quality of images containing human faces, which make up a massive portion of photos on the web. This paper introduces a method for aesthetic quality assessment of images with faces. We exploit three different Convolutional Neural Networks to encode information regarding perceptual quality, global image aesthetics, and facial attributes; then, a model is trained to combine these features to explicitly predict the aesthetics of images containing faces. Experimental results show that our approach outperforms existing methods for both binary, i.e. low/high, and continuous aesthetic score prediction on four different databases in the state of the art.
Given the existence of many change detection algorithms, each with its own peculiarities and strengths, we propose a combination strategy, which we term IUTIS (In Unity There Is Strength), based on a Genetic Programming framework. This combination strategy is aimed at leveraging the strengths of the algorithms and compensating for their weaknesses. In this paper we show our findings in applying the proposed strategy in two different scenarios. The first scenario is purely performance-based. In the second scenario, performance and efficiency must be balanced. Results demonstrate that, starting from simple algorithms, we can achieve results comparable to those of more complex state-of-the-art change detection algorithms, while keeping the computational complexity affordable for real-time applications.
In this paper we present a method for the estimation of the color of the illuminant in RAW images. The method includes a Convolutional Neural Network that has been specially designed to produce multiple local estimates. A multiple illuminant detector determines whether or not the local outputs of the network must be aggregated into a single estimate. We evaluated our method on standard datasets with single and multiple illuminants, obtaining lower estimation errors with respect to those obtained by other general purpose methods in the state of the art.
In this paper we propose a strategy for semi-supervised image classification that leverages unsupervised representation learning and co-training. The strategy, called CURL (Co-trained Unsupervised Representation Learning), iteratively builds two classifiers on two different views of the data. The two views correspond to different representations learned from both labeled and unlabeled data and differ in the fusion scheme used to combine the image features. To assess the performance of our proposal, we conducted several experiments on widely used data sets for scene and object recognition. We considered three scenarios (inductive, transductive and self-taught learning) that differ in the strategy followed to exploit the unlabeled data. As image features we considered a combination of GIST, PHOG, and LBP as well as features extracted from a Convolutional Neural Network. Moreover, two embodiments of CURL are investigated: one using Ensemble Projection as unsupervised representation learning coupled with Logistic Regression, and one based on LapSVM. The results show that CURL clearly outperforms other supervised and semi-supervised learning methods in the state of the art.
In this work we describe a Convolutional Neural Network (CNN) to accurately predict the scene illumination. Taking image patches as input, the CNN works in the spatial domain without using hand-crafted features that are employed by most previous methods. The network consists of one convolutional layer with max pooling, one fully connected layer and three output nodes. Within the network structure, feature learning and regression are integrated into one optimization process, which leads to a more effective model for estimating scene illumination. This approach achieves state-of-the-art performance on a standard dataset of RAW images. Preliminary experiments on images with spatially varying illumination demonstrate the stability of the local illuminant estimation ability of our CNN.
In this work we propose a new deep multibranch neural network to solve the tasks of artist, style, and genre categorization in a multitask formulation. In order to gather clues from low-level texture details and, at the same time, exploit the coarse layout of the painting, the branches of the proposed networks are fed with crops at different resolutions. We propose and compare two different crop strategies: the first one is a random-crop strategy that makes it possible to manage the tradeoff between accuracy and speed; the second one is a smart extractor based on Spatial Transformer Networks trained to extract the most representative subregions. Furthermore, inspired by the results obtained in other domains, we experiment with the joint use of hand-crafted features directly computed on the input images along with neural ones. Experiments are performed on a new dataset originally sourced from wikiart.org and hosted by Kaggle, and made suitable for artist, style and genre multitask learning. The proposed dataset, named MultitaskPainting100k, is composed of 100K paintings, 1508 artists, 125 styles and 41 genres. Our best method, tested on the MultitaskPainting100k dataset, achieves accuracy levels of 56.5%, 57.2%, and 63.6% on the tasks of artist, style and genre prediction respectively.
This work presents an in-depth analysis of the majority of the deep neural networks (DNNs) proposed in the state of the art for image recognition. For each DNN multiple performance indices are observed, such as recognition accuracy, model complexity, computational complexity, memory usage, and inference time. The behavior of such performance indices and some combinations of them are analyzed and discussed. To measure the indices we experiment with the use of DNNs on two different computer architectures, a workstation equipped with an NVIDIA Titan X Pascal and an embedded system based on an NVIDIA Jetson TX1 board. This experimentation allows a direct comparison between DNNs running on machines with very different computational capacity. This study is useful for researchers to have a complete view of what solutions have been explored so far and which research directions are worth exploring in the future, and for practitioners to select the DNN architecture(s) that better fit the resource constraints of practical deployments and applications. To complete this work, all the DNNs, as well as the software used for the analysis, are available online.
In this paper we propose a method for logo recognition using deep learning. Our recognition pipeline is composed of a logo region proposal followed by a Convolutional Neural Network (CNN) specifically trained for logo classification, even if the logos are not precisely localized. Experiments are carried out on the FlickrLogos-32 database, and we evaluate the effect on recognition performance of synthetic versus real data augmentation, and image pre-processing. Moreover, we systematically investigate the benefits of different training choices such as class-balancing, sample-weighting and explicit modeling of the background class (i.e. no-logo regions). Experimental results confirm the feasibility of the proposed method, which outperforms the methods in the state of the art.
In this work we investigate the use of deep learning for distortion-generic blind image quality assessment. We report on different design choices, ranging from the use of features extracted from pre-trained Convolutional Neural Networks (CNNs) as a generic image description, to the use of features extracted from a CNN fine-tuned for the image quality task. Our best proposal, named DeepBIQ, estimates the image quality by average pooling the scores predicted on multiple sub-regions of the original image. The score of each sub-region is computed using a Support Vector Regression (SVR) machine taking as input features extracted using a CNN fine-tuned for category-based image quality assessment. Experimental results on the LIVE In the Wild Image Quality Challenge Database and on the LIVE Image Quality Assessment Database show that DeepBIQ outperforms the state-of-the-art methods compared, having a Linear Correlation Coefficient (LCC) with human subjective scores of almost 0.91 and 0.98 respectively. Furthermore, in most of the cases, the quality score predictions of DeepBIQ are closer to the average observer than those of a generic human observer.
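The average-pooling step described in the abstract above can be sketched as follows. This is a minimal illustration, not the DeepBIQ implementation: the image is a plain 2-D list, `score_region` is a hypothetical stand-in for the fine-tuned CNN features plus SVR predictor, and the crop size and stride are arbitrary assumptions.

```python
from statistics import mean


def iter_crops(image, crop_size, stride):
    """Yield square sub-regions of a 2-D list-of-lists image."""
    height, width = len(image), len(image[0])
    for top in range(0, height - crop_size + 1, stride):
        for left in range(0, width - crop_size + 1, stride):
            yield [row[left:left + crop_size] for row in image[top:top + crop_size]]


def pooled_quality(image, score_region, crop_size, stride):
    """Average the per-region scores into a single image-level score.

    `score_region` is a hypothetical callable standing in for the
    CNN feature extractor followed by the SVR described in the abstract.
    """
    scores = [score_region(crop) for crop in iter_crops(image, crop_size, stride)]
    if not scores:
        raise ValueError("image is smaller than the crop size")
    return mean(scores)


# Toy usage: the stand-in scorer simply returns the mean pixel value of a crop.
toy_image = [
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [8, 8, 12, 12],
    [8, 8, 12, 12],
]
toy_scorer = lambda crop: sum(map(sum, crop)) / 4
image_score = pooled_quality(toy_image, toy_scorer, crop_size=2, stride=2)
```

Pooling over several sub-regions means a single poorly predicted crop has limited influence on the final score.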
In this work we present a method to improve the pruning step of the current state-of-the-art methodology to compress neural networks. The novelty of the proposed pruning technique is in its differentiability, which allows pruning to be performed during the backpropagation phase of the network training. This enables end-to-end learning and strongly reduces the training time. The technique is based on a family of differentiable pruning functions and a new regularizer specifically designed to enforce pruning. The experimental results show that the joint optimization of both the thresholds and the network weights makes it possible to reach a higher compression rate, reducing the number of weights of the pruned network by a further 14% to 33% with respect to the current state of the art. Furthermore, we believe that this is the first study where the generalization capabilities in transfer learning tasks of the features extracted by a pruned network are analyzed. To achieve this goal, we show that the representations learned using the proposed pruning methodology maintain the same effectiveness and generality as those learned by the corresponding non-compressed network on a set of different recognition tasks.
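To make the idea of a differentiable pruning function in the abstract above more concrete, here is a minimal sketch of one possible smooth gate. It is an assumption for illustration only: the abstract does not specify the authors' family of functions or their regularizer, and `soft_prune`, `threshold`, and `sharpness` are hypothetical names. A sigmoid gate on the weight magnitude replaces the hard "zero if below threshold" rule, so gradients can flow to both the weight and the threshold.

```python
import math


def soft_prune(weight, threshold, sharpness=50.0):
    """Smoothly suppress a weight whose magnitude is below `threshold`.

    Illustrative assumption, not the paper's function: the gate is a
    sigmoid of (|w| - threshold), written via tanh for numerical
    stability. Larger `sharpness` approximates hard thresholding.
    """
    z = sharpness * (abs(weight) - threshold)
    gate = 0.5 * (1.0 + math.tanh(0.5 * z))  # equals sigmoid(z)
    return weight * gate


# Toy usage: a small weight is driven toward zero, a large one is kept.
pruned_small = soft_prune(0.01, threshold=0.1)
kept_large = soft_prune(1.0, threshold=0.1)
```

Because the gate depends smoothly on the threshold, the threshold can be treated as a trainable parameter alongside the weights, which is the kind of joint optimization the abstract describes.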
In a previous work, it was shown that there is a curious problem with the benchmark ColorChecker dataset for illuminant estimation. To wit, this dataset has at least 3 different sets of ground-truths. Typically, for a single algorithm a single ground-truth is used. But then different algorithms, whose performance is measured with respect to different ground-truths, are compared against each other and then ranked. This makes no sense. We show in this paper that there are also errors in how each ground-truth set was calculated. As a result, all performance rankings based on the ColorChecker dataset - and there are scores of these - are inaccurate. In this paper, we re-generate a new 'recommended' ground-truth set based on the calculation methodology described by Shi and Funt. We then review the performance evaluation of a range of illuminant estimation algorithms. Compared with the legacy ground-truths, we find that the difference in how algorithms perform can be large, with many local rankings of algorithms being reversed. Finally, we draw the reader's attention to our new 'open' data repository which, we hope, will allow the ColorChecker set to be rehabilitated and once again to become a useful benchmark for illuminant estimation algorithms.
We propose a toolkit to generate structured synthetic documents emulating the actual document production process. Synthetic documents can be used to train systems to perform document analysis tasks. In our case we address the record counting task on handwritten structured collections containing a limited number of examples. Using the DocEmul toolkit we can generate a larger dataset to train a deep architecture to predict the number of records for each page. The toolkit is able to generate synthetic collections and also perform data augmentation to create a larger trainable dataset. It includes one method to extract the page background from real pages which can be used as a substrate where records can be written on the basis of variable structures and using cursive fonts. Moreover, it is possible to extend the synthetic collection by adding random noise, page rotations, and other visual variations. We performed some experiments on two different handwritten collections using the toolkit to generate synthetic data to train a Convolutional Neural Network able to count the number of records in the real collections.
In this paper, we investigate the use of Convolutional Neural Networks for counting the number of records in historical handwritten documents. With this work we demonstrate that training the networks only with synthetic images allows us to perform a near-perfect evaluation of the number of records printed on historical documents. The experiments have been performed on a benchmark dataset composed of marriage records, and the results outperform previous results on this dataset.
This paper investigates the processing of Frequency Modulated-Continuous Wave (FM-CW) radar signals for vehicle classification. In recent years deep learning has gained interest in several scientific fields, and signal processing is no exception. In this work we address the recognition of the vehicle category using a Convolutional Neural Network (CNN) applied to the range-Doppler signature. The developed system first transforms the 1-dimensional signal into a 3-dimensional signal that is subsequently used as input to the CNN. When using the trained model to predict the vehicle category we obtain good performance.
The people re-identification task has seen enormous improvements in recent years, mainly due to the development of better image feature extraction from deep Convolutional Neural Networks (CNNs) and the availability of large datasets. However, little research has been conducted on animal identification and re-identification, even though this knowledge may be useful in a rich variety of different scenarios. Here, we tackle cattle re-identification exploiting deep CNNs and show how this task is poorly related to the human one, presenting unique challenges that make it far from being solved. We present various baselines, based either on deep architectures or on standard machine learning algorithms, and compare them with our solution. Finally, a rich ablation study has been conducted to further investigate the unique peculiarities of this task.