The objective of this work is to infer the 3D shape of an object from a single image. We use sculptures as our training and test bed, as these have great variety in shape and appearance. To achieve this we build on the success of multiple view geometry (MVG), which is able to accurately provide correspondences between images of 3D objects under varying viewpoint and illumination conditions, and make the following contributions: first, we introduce a new loss function that can harness image-to-image correspondences to provide a supervisory signal to train a deep network to infer a depth map. The network is trained end-to-end by differentiating through the camera. Second, we develop a processing pipeline to automatically generate a large-scale multi-view set of correspondences for training the network. Finally, we demonstrate that we can indeed obtain a depth map of a novel object from a single image for a variety of sculptures with varying shape and texture, and that the network generalises at test time to new domains (e.g. synthetic images).
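The reprojection loss is not spelt out here, but a minimal sketch of a correspondence loss of this kind, which differentiates through a pinhole camera into the predicted depth map, might look as follows. The inputs `pts1`, `pts2` (matched pixel coordinates), `K`, `R`, `t` (intrinsics and relative pose) are assumed names for illustration, not taken from the paper.

```python
import torch

def reprojection_loss(depth, pts1, pts2, K, R, t):
    """Hypothetical sketch: lift view-1 pixels to 3D using the predicted depth,
    project them into view 2 with a pinhole camera, and penalise the distance
    to the known corresponding points.  Gradients flow back into `depth`."""
    ones = torch.ones(pts1.shape[0], 1)
    pix_h = torch.cat([pts1, ones], dim=1)                 # (N, 3) homogeneous (x, y, 1)
    d = depth[pts1[:, 1].long(), pts1[:, 0].long()]        # depth sampled at the pixels
    X = (torch.inverse(K) @ pix_h.t()) * d                 # (3, N) points in camera 1
    X2 = R @ X + t.view(3, 1)                              # transform into camera 2
    proj = K @ X2
    proj = proj[:2] / proj[2:3].clamp(min=1e-6)            # perspective division
    return ((proj.t() - pts2) ** 2).sum(dim=1).mean()
```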

* ECCV workshop on Geometry meets Deep Learning
In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal. We achieve both these objectives by training from unlabelled video using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video. To this end, we design new network architectures that can be trained for cross-modal retrieval and localizing the sound source in an image, by using the AVC task. We make the following contributions: (i) show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) explore various architectures for the AVC task, including those for the visual stream that ingest a single image, or multiple images, or a single image and multi-frame optical flow; (iii) show that the semantic object that sounds within an image can be localized (using only the sound, no motion or flow information); and (iv) give a cautionary tale on how to avoid undesirable shortcuts in the data preparation.
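As a rough illustration of the AVC set-up (two sub-networks embedding the two modalities into a shared space, plus a correspondence classifier on top), a minimal sketch is given below; the encoder layers are small placeholders and not the architectures proposed in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVCNet(nn.Module):
    """Hypothetical sketch: image and audio encoders map into a common space,
    and a binary classifier on the embedding distance decides whether the
    frame and the audio clip come from the same video."""
    def __init__(self, dim=128):
        super().__init__()
        self.vision = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.audio = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.classifier = nn.Linear(1, 2)   # distance -> correspond / mismatch logits

    def forward(self, frame, spectrogram):
        v = F.normalize(self.vision(frame), dim=1)
        a = F.normalize(self.audio(spectrogram), dim=1)
        dist = (v - a).pow(2).sum(dim=1, keepdim=True).sqrt()
        return self.classifier(dist), v, a  # logits, plus embeddings for retrieval
```

Training uses cross-entropy over matched and mismatched frame/audio pairs sampled from unlabelled video; the embeddings v and a can then be used directly for within-mode and between-mode retrieval.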

* Appears in: European Conference on Computer Vision (ECCV) 2018
The objective of this work is set-based face recognition, i.e. to decide if two sets of images of a face are of the same person or not. Conventionally, the set-wise feature descriptor is computed as an average of the descriptors from individual face images within the set. In this paper, we design a neural network architecture that learns to aggregate based on both "visual" quality (resolution, illumination) and "content" quality (relative importance for discriminative classification). To this end, we propose a Multicolumn Network (MN) that takes a set of images (the number in the set can vary) as input, and learns to compute a fixed-size feature descriptor for the entire set. To encourage high-quality representations, each individual input image is first weighted by its "visual" quality, determined by a self-quality assessment module, and then dynamically recalibrated based on its "content" quality relative to the other images within the set. Both of these qualities are learnt implicitly during training for set-wise classification. Compared with the previous state-of-the-art architectures trained on the same dataset (VGGFace2), our Multicolumn Networks show an improvement of 2-6% on the IARPA IJB face recognition benchmarks, and exceed the state of the art for all methods on these benchmarks.
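A minimal sketch of quality-weighted set aggregation in the spirit described above is given below; the two linear scoring heads and the sigmoid/softmax weighting are illustrative assumptions, not the exact Multicolumn Network design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SetAggregator(nn.Module):
    """Sketch: per-image "visual" quality scores and set-relative "content"
    weights combine a variable number of face descriptors into one
    fixed-size set descriptor."""
    def __init__(self, dim=256):
        super().__init__()
        self.visual_quality = nn.Linear(dim, 1)    # self-quality assessment (assumed form)
        self.content_quality = nn.Linear(dim, 1)   # relative importance within the set

    def forward(self, feats):                      # feats: (num_images, dim), one set
        vq = torch.sigmoid(self.visual_quality(feats))           # (N, 1), in [0, 1]
        cq = F.softmax(self.content_quality(feats), dim=0)       # (N, 1), sums to 1 over the set
        w = vq * cq
        return (w * feats).sum(dim=0) / w.sum().clamp(min=1e-6)  # (dim,) set descriptor
```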

* To appear in BMVC2018
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis of how current architectures fare on the task of action classification on this dataset, and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state of the art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF-101.
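The inflation step itself is simple to write down. The sketch below copies a pre-trained 2D filter along the temporal axis and rescales it so that responses are preserved on a temporally constant video; stride and padding handling are simplified relative to the full I3D recipe.

```python
import torch
import torch.nn as nn

def inflate_conv(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Expand a 2D convolution into 3D by repeating its weights over time."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_dim, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(time_dim // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    # Repeat the 2D kernel over the new time axis and divide by its length,
    # so a constant-in-time input gives the same response as the 2D filter.
    w = conv2d.weight.data.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.data.copy_(w)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d
```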

* Removed references to the mini-Kinetics dataset, which was never made publicly available, and repeated all experiments on the full Kinetics dataset
The goal of this paper is the automatic identification of characters in TV and feature film material. In contrast to standard approaches to this task, which rely on the weak supervision afforded by transcripts and subtitles, we propose a new method requiring only a cast list. This list is used to obtain images of actors from freely available sources on the web, providing a form of partial supervision for this task. In using images of actors to recognize characters, we make the following three contributions: (i) we demonstrate that an automated semi-supervised learning approach is able to adapt from the actor's face to the character's face, including the face context of the hair; (ii) by building voice models for every character, we provide a bridge between frontal faces (for which there is plenty of actor-level supervision) and profile faces (for which there is very little or none); and (iii) by combining face context and speaker identification, we are able to identify characters with partially occluded faces and extreme facial poses. Results are presented on the TV series 'Sherlock' and the feature film 'Casablanca'. We achieve the state of the art on the Casablanca benchmark, surpassing previous methods that have used the stronger supervision available from transcripts.

The objective of this paper is 3D shape understanding from single and multiple images. To this end, we introduce a new deep-learning architecture and loss function, SilNet, that can handle multiple views in an order-agnostic manner. The architecture is fully convolutional, and for training we use a proxy task of silhouette prediction, rather than directly learning a mapping from 2D images to 3D shape as has been the target in most recent work. We demonstrate that with the SilNet architecture there is generalisation over the number of views -- for example, SilNet trained on 2 views can be used with 3 or 4 views at test-time; and performance improves with more views. We introduce two new synthetic datasets: a blobby object dataset useful for pre-training, and a challenging and realistic sculpture dataset; and demonstrate on these datasets that SilNet has indeed learnt 3D shape. Finally, we show that SilNet exceeds the state of the art on the ShapeNet benchmark dataset, and use SilNet to generate novel views of the sculpture dataset.
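A minimal sketch of the order-agnostic handling of views is shown below: per-view features are combined with a symmetric pooling operation, so any number of views can be supplied at test time. The encoder, decoder and choice of max-pooling are illustrative assumptions rather than the actual SilNet layers.

```python
import torch
import torch.nn as nn

class OrderAgnosticFusion(nn.Module):
    """Sketch: encode each view independently, fuse with a symmetric (order-
    agnostic) max over views, then predict a silhouette logit map."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(feat_dim, 1, 3, padding=1)

    def forward(self, views):                 # views: (batch, n_views, 3, H, W)
        b, n, c, h, w = views.shape
        f = self.encoder(views.reshape(b * n, c, h, w)).reshape(b, n, -1, h, w)
        fused = f.max(dim=1).values           # symmetric over the view dimension
        return self.decoder(fused)            # silhouette prediction
```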

* BMVC 2017; Best Poster
We investigate methods for combining multiple self-supervised tasks--i.e., supervised tasks where data can be collected without manual labeling--in order to train a single visual representation. First, we provide an apples-to-apples comparison of four different self-supervised tasks using the very deep ResNet-101 architecture. We then combine tasks to jointly train a network. We also explore lasso regularization to encourage the network to factorize the information in its representation, and methods for "harmonizing" network inputs in order to learn a more unified representation. We evaluate all methods on ImageNet classification, PASCAL VOC detection, and NYU depth prediction. Our results show that deeper networks work better, and that combining tasks--even via a naive multi-head architecture--always improves performance. Our best joint network nearly matches the PASCAL performance of a model pre-trained on ImageNet classification, and matches the ImageNet network on NYU depth prediction.
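As a toy sketch of the naive multi-head set-up with a lasso term, the model below attaches one lightweight head per pretext task to a shared trunk and exposes an L1 penalty on per-task combination weights over trunk blocks; the trunk and heads are tiny placeholders rather than ResNet-101, and the exact factorisation used in the paper may differ.

```python
import torch
import torch.nn as nn

class MultiHeadSelfSup(nn.Module):
    """Sketch: shared trunk, one head per self-supervised task, and a learnable
    task-by-block mixing matrix whose L1 norm acts as a lasso regulariser."""
    def __init__(self, num_tasks=4, width=64):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3 if i == 0 else width, width, 3, padding=1), nn.ReLU())
            for i in range(3)])
        self.mix = nn.Parameter(torch.ones(num_tasks, 3) / 3)   # task x block weights
        self.heads = nn.ModuleList([nn.Conv2d(width, 10, 1) for _ in range(num_tasks)])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        stacked = torch.stack(feats, dim=1)                     # (B, 3, width, H, W)
        outputs = [head((self.mix[t].reshape(1, 3, 1, 1, 1) * stacked).sum(dim=1))
                   for t, head in enumerate(self.heads)]
        return outputs, self.mix.abs().sum()                    # per-task outputs + lasso term
```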

* Published at ICCV 2017
We propose a novel ConvNet model for predicting 2D human body poses in an image. The model regresses a heatmap representation for each body keypoint, and is able to learn and represent both the part appearances and the context of the part configuration. We make the following three contributions: (i) an architecture combining a feed-forward module with a recurrent module, where the recurrent module can be run iteratively to improve the performance, (ii) the model can be trained end-to-end and from scratch, with auxiliary losses incorporated to improve performance, (iii) we investigate whether keypoint visibility can also be predicted. The model is evaluated on two benchmark datasets. The result is a simple architecture that achieves performance on par with the state of the art, but without the complexity of a graphical model stage (or layers).
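A minimal sketch of the feed-forward plus recurrent design is given below: an initial module predicts keypoint heatmaps, and a weight-shared refinement module is applied iteratively to the image features and the current heatmaps, with a loss attachable to every intermediate output. Layer sizes and the concatenation scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class IterativeHeatmapNet(nn.Module):
    """Sketch: feed-forward heatmap prediction followed by iterative
    refinement with a single recurrent (weight-shared) module."""
    def __init__(self, num_keypoints=16, width=64):
        super().__init__()
        self.feedforward = nn.Sequential(nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
                                         nn.Conv2d(width, num_keypoints, 1))
        self.features = nn.Sequential(nn.Conv2d(3, width, 3, padding=1), nn.ReLU())
        self.recurrent = nn.Conv2d(width + num_keypoints, num_keypoints, 3, padding=1)

    def forward(self, image, iterations=3):
        heatmaps = [self.feedforward(image)]     # auxiliary loss can be attached here
        feats = self.features(image)
        for _ in range(iterations):              # the same weights are reused each step
            h = self.recurrent(torch.cat([feats, heatmaps[-1]], dim=1))
            heatmaps.append(h)                   # and to every intermediate prediction
        return heatmaps
```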

* FG 2017, More Info and Demo: http://www.robots.ox.ac.uk/~vgg/software/keypoint_detection/
We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set a new state of the art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.

* Appears in: IEEE International Conference on Computer Vision (ICCV) 2017
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
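The configuration idea (stacks of 3x3 convolutions separated by 2x2 max-pooling) can be written down directly. The sketch below builds the convolutional part of the 16-layer configuration; the three fully connected classifier layers are omitted.

```python
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs):
    """A stack of 3x3 convolutions (stride 1, padding 1) followed by 2x2 max-pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_channels if i == 0 else out_channels, out_channels,
                             kernel_size=3, padding=1), nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# The 13 convolutional layers of the 16-layer configuration:
features = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2),
                         vgg_block(128, 256, 3), vgg_block(256, 512, 3),
                         vgg_block(512, 512, 3))
```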

We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
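A minimal sketch of the two-stream layout is shown below: one ConvNet on a single RGB frame, one on a stack of dense optical-flow fields (two channels per flow frame), fused late by averaging class scores. The tiny per-stream networks are placeholders, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStream(nn.Module):
    """Sketch: spatial stream (appearance) + temporal stream (motion),
    with late fusion of softmax class scores."""
    def __init__(self, num_classes=101, flow_frames=10):
        super().__init__()
        def small_convnet(in_ch):
            return nn.Sequential(nn.Conv2d(in_ch, 64, 7, stride=2, padding=3), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(64, num_classes))
        self.spatial = small_convnet(3)
        self.temporal = small_convnet(2 * flow_frames)   # stacked x/y flow fields

    def forward(self, rgb, flow_stack):
        s = F.softmax(self.spatial(rgb), dim=1)
        t = F.softmax(self.temporal(flow_stack), dim=1)
        return (s + t) / 2                                # late fusion by averaging
```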

The goal of this work is to recognise and localise short temporal signals in image time series, where strong supervision is not available for training. To this end we propose an image encoding that concisely represents human motion in a video sequence in a form that is suitable for learning with a ConvNet. The encoding reduces the pose information from an image to a single column, dramatically diminishing the input requirements for the network, but retaining the essential information for recognition. The encoding is applied to the task of recognizing and localizing signed gestures in British Sign Language (BSL) videos. We demonstrate that using the proposed encoding, signs as short as 10 frames in duration can be learnt from clips lasting hundreds of frames, using only weak (clip-level) supervision and with considerable label noise.
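One plausible reading of the single-column encoding, stated here as an assumption rather than the paper's exact definition, is to normalise each frame's 2D joints and flatten them into one column, concatenating columns over time into a small 2D array for the ConvNet.

```python
import numpy as np

def encode_pose_sequence(keypoints):
    """Hypothetical sketch: `keypoints` is (num_frames, num_joints, 2) in pixels;
    returns a (2 * num_joints, num_frames) array, one column per frame."""
    kp = np.asarray(keypoints, dtype=np.float32)
    kp -= kp.mean(axis=1, keepdims=True)                        # zero-mean joints per frame
    kp /= np.abs(kp).max(axis=(1, 2), keepdims=True) + 1e-6     # unit scale per frame
    return kp.reshape(kp.shape[0], -1).T                        # stack columns over time
```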

With the propensity for deep learning models to learn unintended signals from datasets, there is always the possibility that a network can 'cheat' in order to solve a task. In the instance of datasets for visual kinship verification, one such unintended signal could be that the faces are cropped from the same photograph, since faces from the same photograph are more likely to be from the same family. In this paper we investigate the influence of this artefactual data inference in published datasets for kinship verification. To this end, we obtain a large dataset, and train a CNN classifier to determine whether two faces are from the same photograph or not. Using this classifier alone as a naive classifier of kinship, we demonstrate near-state-of-the-art results on five public benchmark datasets for kinship verification, achieving over 90% accuracy on one of them. Thus, we conclude that faces derived from the same photograph are a strong inadvertent signal in all the datasets we examined, and it is likely that the fraction of kinship explained by existing kinship models is small.

* Accepted to ACCV 2018
Nearly all existing counting methods are designed for a specific object class. Our work, however, aims to create a counting model able to count any class of object. To achieve this goal, we formulate counting as a matching problem, enabling us to exploit the image self-similarity property that naturally exists in object counting problems. We make the following three contributions: first, a Generic Matching Network (GMN) architecture that can potentially count any object in a class-agnostic manner; second, by reformulating the counting problem as one of matching objects, we can take advantage of the abundance of video data labeled for tracking, which contains natural repetitions suitable for training a counting model. Such data enables us to train the GMN. Third, to customize the GMN to different user requirements, an adapter module is used to specialize the model with minimal effort, i.e. using a few labeled examples, and adapting only a small fraction of the trained parameters. This is a form of few-shot learning, which is practical for domains where labels are limited due to requiring expert knowledge (e.g. microbiology). We demonstrate the flexibility of our method on a diverse set of existing counting benchmarks: specifically cells, cars, and human crowds. The model achieves competitive performance on cell and crowd counting datasets, and surpasses the state-of-the-art on the car dataset using only three training images. When training on the entire dataset, the proposed method outperforms all previous methods by a large margin.
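A rough sketch of counting-as-matching is given below: an exemplar patch and the image are embedded with a shared encoder, the pooled exemplar embedding is correlated with the image feature map, and a small head regresses a density map whose sum is the count. The encoder, pooling and head are illustrative placeholders, and the adapter module used for few-shot specialisation is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenericMatcher(nn.Module):
    """Sketch: count by matching an exemplar against the image features."""
    def __init__(self, width=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, width, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(1, 1, 3, padding=1)

    def forward(self, image, exemplar):
        f_img = self.encoder(image)                            # (B, C, H, W)
        f_ex = self.encoder(exemplar).mean(dim=(2, 3))         # (B, C) pooled exemplar
        sim = (f_img * f_ex[:, :, None, None]).sum(dim=1, keepdim=True)  # similarity map
        density = F.relu(self.head(sim))                       # (B, 1, H, W)
        return density, density.sum(dim=(1, 2, 3))             # predicted count per image
```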

* Asian Conference on Computer Vision (ACCV), 2018
The objective of this paper is to learn a compact representation of image sets for template-based face recognition. We make the following contributions: first, we propose a network architecture which aggregates and embeds the face descriptors produced by deep convolutional neural networks into a compact fixed-length representation. This compact representation requires minimal memory storage and enables efficient similarity computation. Second, we propose a novel GhostVLAD layer that includes "ghost clusters" that do not contribute to the aggregation. We show that a quality weighting on the input faces emerges automatically such that informative images contribute more than those with low quality, and that the ghost clusters enhance the network's ability to deal with poor quality images. Third, we explore how input feature dimension, number of clusters and different training techniques affect the recognition performance. Given this analysis, we train a network that far exceeds the state-of-the-art on the IJB-B face recognition dataset. This is currently one of the most challenging public benchmarks, and we surpass the state-of-the-art on both the identification and verification protocols.
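The ghost-cluster mechanism is easy to sketch on top of a NetVLAD-style layer: descriptors are softly assigned to K real plus G ghost clusters, and the residuals aggregated into the ghosts are simply discarded. Cluster counts and the form of the assignment layer below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostVLAD(nn.Module):
    """Sketch: VLAD aggregation with extra ghost clusters that absorb
    low-quality descriptors and are dropped from the output."""
    def __init__(self, dim=128, clusters=8, ghosts=1):
        super().__init__()
        self.clusters = clusters
        self.centroids = nn.Parameter(torch.randn(clusters + ghosts, dim))
        self.assign = nn.Linear(dim, clusters + ghosts)

    def forward(self, x):                                  # x: (N, dim) descriptors of one set
        a = F.softmax(self.assign(x), dim=1)               # (N, K+G) soft assignments
        residuals = x[:, None, :] - self.centroids[None]   # (N, K+G, dim)
        vlad = (a[:, :, None] * residuals).sum(dim=0)      # (K+G, dim)
        vlad = vlad[:self.clusters]                        # drop the ghost clusters
        vlad = F.normalize(vlad, dim=1)                    # intra-normalisation
        return F.normalize(vlad.flatten(), dim=0)          # (K * dim,) set representation
```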

* Accepted by ACCV 2018
Neural networks achieve the state-of-the-art in image classification tasks. However, they can encode spurious variations or biases that may be present in the training data. For example, training an age predictor on a dataset that is not balanced for gender can lead to gender-biased predictions (e.g. wrongly predicting that males are older if only elderly males are in the training set). We present two distinct contributions: 1) An algorithm that can remove multiple sources of variation from the feature representation of a network. We demonstrate that this algorithm can be used to remove biases from the feature representation, and thereby improve classification accuracies, when training networks on extremely biased datasets. 2) An ancestral origin database of 14,000 images of individuals from East Asia, the Indian subcontinent, sub-Saharan Africa, and Western Europe. We demonstrate on this dataset, for a number of facial attribute classification tasks, that we are able to remove racial biases from the network feature representation.

* Will appear in Workshop on Bias Estimation in Face Analytics, ECCV 2018
This work presents a method for visual text recognition without using any paired supervisory data. We formulate the text recognition task as one of aligning the conditional distribution of strings predicted from given text images with lexically valid strings sampled from target corpora. This enables fully automated, unsupervised learning from just line-level text images and unpaired text-string samples, obviating the need for large aligned datasets. We present a detailed analysis of various aspects of the proposed method, namely: (1) the impact of the length of training sequences on convergence, (2) the relation between character frequencies and the order in which they are learnt, and (3) the generalisation ability of our recognition network to inputs of arbitrary lengths. Finally, we demonstrate excellent text recognition accuracy on both synthetically generated text images and scanned images of real printed books, using no labelled training examples.

We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we show that the network can leverage information from multiple source frames by predicting confidence/attention masks for each frame; second, we demonstrate that using a curriculum learning regime improves the learned embedding; finally, we demonstrate that the network learns a meaningful face embedding that encodes information about head pose, facial landmarks and facial expression, i.e. facial attributes, without having been supervised with any labelled data. Our embedding is comparable or superior to state-of-the-art self-supervised methods on these tasks and approaches the performance of supervised methods.
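A heavily simplified sketch of the multi-source aspect is given below: each source frame is encoded to an embedding and a confidence score, and the embeddings are combined by their softmaxed confidences. This is one reading of the abstract rather than the FAb-Net architecture itself, and the encoder is a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSourceEmbedder(nn.Module):
    """Sketch: combine several source-frame embeddings using predicted confidences."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.confidence = nn.Linear(dim, 1)

    def forward(self, frames):                         # frames: (num_sources, 3, H, W)
        emb = self.encoder(frames)                     # (num_sources, dim)
        w = F.softmax(self.confidence(emb), dim=0)     # one confidence weight per frame
        return (w * emb).sum(dim=0)                    # single combined embedding
```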

* To appear in BMVC 2018. Supplementary material can be found at http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.html
The objective of this work is set-based verification, e.g. to decide if two sets of images of a face are of the same person or not. The traditional approach to this problem is to learn to generate a feature vector per image, aggregate them into one vector to represent the set, and then compute the cosine similarity between sets. Instead, we design a neural network architecture that can directly learn set-wise verification. Our contributions are: (i) We propose a Deep Comparator Network (DCN) that can ingest a pair of sets (each may contain a variable number of images) as inputs, and compute a similarity between the pair--this involves attending to multiple discriminative local regions (landmarks), and comparing local descriptors between pairs of faces; (ii) To encourage high-quality representations for each set, internal competition is introduced for recalibration based on the landmark score; (iii) Inspired by image retrieval, a novel hard sample mining regime is proposed to control the sampling process, such that the DCN is complementary to the standard image classification models. Evaluations on the IARPA Janus face recognition benchmarks show that the comparator networks outperform the previous state-of-the-art results by a large margin.

* To appear in ECCV 2018
The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio). This model can then be used for lightweight, sophisticated video and image editing. We make the following three contributions. First, we introduce a network, X2Face, that can control a source face (specified by one or more frames) using another face in a driving frame to produce a generated frame with the identity of the source frame but the pose and expression of the face in the driving frame. Second, we propose a method for training the network fully self-supervised using a large collection of video data. Third, we show that the generation process can be driven by other modalities, such as audio or pose codes, without any further training of the network. The generation results for driving a face with another face are compared to state-of-the-art self-supervised/supervised methods. We show that our approach is more robust than other methods, as it makes fewer assumptions about the input data. We also show examples of using our framework for video face editing.

* To appear in ECCV 2018. Accompanying video: http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/x2face.html