Models, code, and papers for "Andrew Zisserman":

Sim2real transfer learning for 3D pose estimation: motion to the rescue

Jul 04, 2019
Carl Doersch, Andrew Zisserman

Simulation is an anonymous, low-bias source of data where annotation can often be done automatically; however, for some tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human pose estimation is a particularly interesting example of this sim2real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person's motion, notably as optical flow and the motion of 2D keypoints. Therefore, our results suggest that motion can be a simple way to bridge a sim2real gap when video is available. We evaluate on the 3D Poses in the Wild dataset, the most challenging modern standard of 3D pose estimation, where we show full 3D mesh recovery that is on par with state-of-the-art methods trained on real 3D sequences, despite training only on synthetic humans from the SURREAL dataset.
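
One of the motion cues named above, the motion of 2D keypoints, can be illustrated with frame-to-frame displacements. This is a minimal sketch, not the authors' pre-processing: the function name `keypoint_motion` and the data layout (a list of frames, each a list of `(x, y)` tuples in a fixed joint order) are assumptions made for the example.

```python
def keypoint_motion(frames):
    """Return per-joint displacements between consecutive frames.

    frames: list of frames; each frame is a list of (x, y) keypoints.
    The layout is hypothetical; every frame must list joints in the same order.
    """
    motion = []
    for prev, curr in zip(frames, frames[1:]):
        # Displacement of each joint from the previous frame to the current one.
        motion.append(
            [(cx - px, cy - py) for (px, py), (cx, cy) in zip(prev, curr)]
        )
    return motion


# Two joints tracked over three frames.
frames = [
    [(10.0, 20.0), (30.0, 40.0)],
    [(12.0, 20.0), (30.0, 43.0)],
    [(15.0, 21.0), (29.0, 47.0)],
]
motion = keypoint_motion(frames)
```

A sequence of N frames yields N - 1 displacement sets, which carry no appearance information and so do not depend on how realistic the rendered pixels look.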

The VIA Annotation Software for Images, Audio and Video

May 31, 2019
Abhishek Dutta, Andrew Zisserman

In this paper, we introduce a simple and standalone manual annotation tool for images, audio and video: the VGG Image Annotator (VIA). This is a lightweight, standalone and offline software package that does not require any installation or setup and runs solely in a web browser. Due to its lightness and flexibility, the VIA software has quickly become an essential and invaluable research support tool in many academic disciplines. Furthermore, it has also been very popular in several industrial sectors which have invested in adapting this open source software to their requirements. Since its public release in 2017, the VIA software has been used more than 600,000 times and has nurtured a large and thriving open source community.

* VIA now supports manual annotation of Images, Audio and Video. This article describes the VIA suite of open source applications that can be downloaded from 

Object Discovery with a Copy-Pasting GAN

May 27, 2019
Relja Arandjelović, Andrew Zisserman

We tackle the problem of object discovery, where objects are segmented for a given input image, and the system is trained without using any direct supervision whatsoever. A novel copy-pasting GAN framework is proposed, where the generator learns to discover an object in one image by compositing it into another image such that the discriminator cannot tell that the resulting image is fake. After carefully addressing subtle issues, such as preventing the generator from "cheating", this game results in the generator learning to select objects, as copy-pasting objects is most likely to fool the discriminator. The system is shown to work well on four very different datasets, including large object appearance variations in challenging cluttered backgrounds.
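
The compositing step described above can be sketched as a per-pixel blend of a source and a destination image under a mask. This is only an illustration under stated assumptions: in the paper the mask is produced by the generator, whereas here it is supplied by hand, and the function name `composite` and the nested-list image layout are hypothetical.

```python
def composite(source, destination, mask):
    """Paste masked source pixels onto the destination image.

    All three arguments are same-sized 2D lists of floats. Mask values lie
    in [0, 1], where 1 means "copy the source pixel" and 0 means "keep the
    destination pixel".
    """
    return [
        [m * s + (1.0 - m) * d for s, d, m in zip(src_row, dst_row, mask_row)]
        for src_row, dst_row, mask_row in zip(source, destination, mask)
    ]


source = [[1.0, 1.0], [1.0, 1.0]]
destination = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1.0, 0.0], [0.5, 0.0]]  # hand-written stand-in for a predicted mask
result = composite(source, destination, mask)
```

An all-zero or all-one mask returns one of the inputs unchanged, so a generator using this blend could produce a realistic image without selecting any object, which is one way to picture the "cheating" that the abstract says must be prevented.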

A Geometric Approach to Obtain a Bird's Eye View from an Image

May 06, 2019
Ammar Abbas, Andrew Zisserman

The objective of this paper is to rectify any monocular image by computing a homography matrix that transforms it to a bird's eye (overhead) view. We make the following contributions: (i) we show that the homography matrix can be parameterised with only four parameters that specify the horizon line and the vertical vanishing point, or only two if the field of view or focal length is known; (ii) we introduce a novel representation for the geometry of a line or point (which can be at infinity) that is suitable for regression with a convolutional neural network (CNN); (iii) we introduce a large synthetic image dataset with ground truth for the orthogonal vanishing points, which can be used for training a CNN to predict these geometric entities; and finally (iv) we achieve state-of-the-art results on horizon detection, with 74.52% AUC on the Horizon Lines in the Wild dataset. Our method is fast and robust, and can be used to remove perspective distortion from videos in real time.
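
To see why the horizon line matters for rectification, consider the standard projective transform that sends a given image line to infinity. The sketch below is this textbook affine-rectification step only, not the paper's four-parameter homography: a full overhead view also needs the vertical vanishing point. The function names and the example horizon are assumptions.

```python
def horizon_to_infinity(horizon):
    """Build a 3x3 projective transform that maps `horizon` to infinity.

    horizon: (a, b, c) describing the line a*x + b*y + c = 0 in pixel
    coordinates. c must be non-zero so that the matrix is invertible.
    """
    a, b, c = horizon
    if c == 0:
        raise ValueError("horizon must not pass through the image origin")
    return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [a, b, c]]


def apply_homography(H, point):
    """Map a pixel (x, y) to homogeneous coordinates (x', y', w')."""
    x, y = point
    return tuple(row[0] * x + row[1] * y + row[2] for row in H)


# Hypothetical horizon at image row y = 100, i.e. 0*x + 1*y - 100 = 0.
H = horizon_to_infinity((0.0, 1.0, -100.0))
on_horizon = apply_homography(H, (50.0, 100.0))
below_horizon = apply_homography(H, (50.0, 300.0))
```

Points on the horizon receive a zero third coordinate, so they land at infinity, while points below it stay finite after dividing by that coordinate.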

The VGG Image Annotator (VIA)

Apr 24, 2019
Abhishek Dutta, Andrew Zisserman

Manual image annotation, such as defining and labelling regions of interest, is a fundamental processing stage of many research projects and industrial applications. In this paper, we introduce a simple and standalone manual image annotation tool: the VGG Image Annotator (VIA). This is a lightweight, standalone and offline software package that does not require any installation or setup and runs solely in a web browser. Due to its lightness and flexibility, the VIA software has quickly become an essential and invaluable research support tool in many academic disciplines. Furthermore, it has also been immensely popular in several industrial sectors which have invested in adapting this open source software to their requirements. Since its public release in 2017, the VIA software has been used more than 500,000 times and has nurtured a large and thriving open source community.

* Describes the VGG Image Annotator version 2.x.y which can be downloaded from 

3D Surface Reconstruction by Pointillism

Oct 04, 2018
Olivia Wiles, Andrew Zisserman

The objective of this work is to infer the 3D shape of an object from a single image. We use sculptures as our training and test bed, as these have great variety in shape and appearance. To achieve this we build on the success of multiple view geometry (MVG) which is able to accurately provide correspondences between images of 3D objects under varying viewpoint and illumination conditions, and make the following contributions: first, we introduce a new loss function that can harness image-to-image correspondences to provide a supervisory signal to train a deep network to infer a depth map. The network is trained end-to-end by differentiating through the camera. Second, we develop a processing pipeline to automatically generate a large scale multi-view set of correspondences for training the network. Finally, we demonstrate that we can indeed obtain a depth map of a novel object from a single image for a variety of sculptures with varying shape/texture, and that the network generalises at test time to new domains (e.g. synthetic images).

* ECCV workshop on Geometry meets Deep Learning 

Objects that Sound

Jul 25, 2018
Relja Arandjelović, Andrew Zisserman

In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal. We achieve both these objectives by training from unlabelled video using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video. To this end, we design new network architectures that can be trained for cross-modal retrieval and localizing the sound source in an image, by using the AVC task. We make the following contributions: (i) show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) explore various architectures for the AVC task, including those for the visual stream that ingest a single image, or multiple images, or a single image and multi-frame optical flow; (iii) show that the semantic object that sounds within an image can be localized (using only the sound, no motion or flow information); and (iv) give a cautionary tale on how to avoid undesirable shortcuts in the data preparation.

* Appears in: European Conference on Computer Vision (ECCV) 2018 

Multicolumn Networks for Face Recognition

Jul 24, 2018
Weidi Xie, Andrew Zisserman

The objective of this work is set-based face recognition, i.e. to decide if two sets of images of a face are of the same person or not. Conventionally, the set-wise feature descriptor is computed as an average of the descriptors from individual face images within the set. In this paper, we design a neural network architecture that learns to aggregate based on both "visual" quality (resolution, illumination), and "content" quality (relative importance for discriminative classification). To this end, we propose a Multicolumn Network (MN) that takes a set of images (the number in the set can vary) as input, and learns to compute a fixed-size feature descriptor for the entire set. To encourage high-quality representations, each individual input image is first weighted by its "visual" quality, determined by a self-quality assessment module, followed by a dynamic recalibration based on "content" qualities relative to the other images within the set. Both of these qualities are learnt implicitly during training for set-wise classification. Compared with the previous state-of-the-art architectures trained with the same dataset (VGGFace2), our Multicolumn Networks show an improvement of between 2% and 6% on the IARPA IJB face recognition benchmarks, and exceed the state of the art for all methods on these benchmarks.
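
The difference between the conventional average and a quality-weighted aggregate can be sketched as follows. This is not the Multicolumn Network itself: the quality scores are given by hand rather than learnt, and the softmax weighting, function names and toy descriptors are all assumptions chosen for illustration.

```python
import math


def mean_descriptor(descriptors):
    """Conventional set descriptor: plain average over the set."""
    n = len(descriptors)
    return [sum(column) / n for column in zip(*descriptors)]


def weighted_descriptor(descriptors, quality_scores):
    """Quality-weighted set descriptor.

    quality_scores: one real number per image (hypothetical stand-in for a
    learnt quality estimate). Scores are turned into weights with a softmax,
    so the output size does not depend on the number of images in the set.
    """
    peak = max(quality_scores)
    exps = [math.exp(q - peak) for q in quality_scores]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(w * value for w, value in zip(weights, column))
        for column in zip(*descriptors)
    ]


descriptors = [[1.0, 0.0], [0.0, 1.0]]
equal = weighted_descriptor(descriptors, [0.0, 0.0])
skewed = weighted_descriptor(descriptors, [10.0, -10.0])
```

With equal scores the weighted form reduces to the plain average, and with very unequal scores the aggregate is dominated by the image judged to be of higher quality.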

* To appear in BMVC2018 

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Feb 12, 2018
Joao Carreira, Andrew Zisserman

The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF-101.
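
The idea of inflating a 2D filter into 3D while reusing its parameters can be sketched as below. The sketch assumes that inflation repeats the 2D weights along a new time axis and divides by the temporal extent, so that a clip made of one repeated frame gives the same response as the original image filter. The function names and toy values are hypothetical.

```python
def inflate_kernel(kernel_2d, time_steps):
    """Repeat a 2D kernel `time_steps` times along time, scaled by 1/time_steps."""
    return [
        [[w / time_steps for w in row] for row in kernel_2d]
        for _ in range(time_steps)
    ]


def response_2d(kernel, patch):
    """Filter response of one kernel at one image location."""
    return sum(
        w * p
        for k_row, p_row in zip(kernel, patch)
        for w, p in zip(k_row, p_row)
    )


def response_3d(kernel_3d, clip):
    """Filter response of an inflated kernel on a stack of patches."""
    return sum(response_2d(k, frame) for k, frame in zip(kernel_3d, clip))


kernel = [[1.0, 2.0], [3.0, 4.0]]
patch = [[0.5, 1.0], [1.5, 2.0]]
inflated = inflate_kernel(kernel, 4)
static_clip = [patch] * 4  # a "video" made of one repeated frame
image_response = response_2d(kernel, patch)
video_response = response_3d(inflated, static_clip)
```

Because the two responses agree on static clips, an inflated network starts from the behaviour of the pretrained image model and only has to learn the temporal structure on top of it.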

* Removed references to mini-kinetics dataset that was never made publicly available and repeated all experiments on the full Kinetics dataset 

From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script

Jan 31, 2018
Arsha Nagrani, Andrew Zisserman

The goal of this paper is the automatic identification of characters in TV and feature film material. In contrast to standard approaches to this task, which rely on the weak supervision afforded by transcripts and subtitles, we propose a new method requiring only a cast list. This list is used to obtain images of actors from freely available sources on the web, providing a form of partial supervision for this task. In using images of actors to recognize characters, we make the following three contributions: (i) we demonstrate that an automated semi-supervised learning approach is able to adapt from the actor's face to the character's face, including the face context of the hair; (ii) by building voice models for every character, we provide a bridge between frontal faces (for which there is plenty of actor-level supervision) and profile faces (for which there is very little or none); and (iii) by combining face context and speaker identification, we are able to identify characters with partially occluded faces and extreme facial poses. Results are presented on the TV series 'Sherlock' and the feature film 'Casablanca'. We achieve the state of the art on the Casablanca benchmark, surpassing previous methods that have used the stronger supervision available from transcripts.

SilNet: Single- and Multi-View Reconstruction by Learning from Silhouettes

Nov 21, 2017
Olivia Wiles, Andrew Zisserman

The objective of this paper is 3D shape understanding from single and multiple images. To this end, we introduce a new deep-learning architecture and loss function, SilNet, that can handle multiple views in an order-agnostic manner. The architecture is fully convolutional, and for training we use a proxy task of silhouette prediction, rather than directly learning a mapping from 2D images to 3D shape as has been the target in most recent work. We demonstrate that with the SilNet architecture there is generalisation over the number of views -- for example, SilNet trained on 2 views can be used with 3 or 4 views at test-time; and performance improves with more views. We introduce two new synthetic datasets: a blobby object dataset useful for pre-training, and a challenging and realistic sculpture dataset; and demonstrate on these datasets that SilNet has indeed learnt 3D shape. Finally, we show that SilNet exceeds the state of the art on the ShapeNet benchmark dataset, and use SilNet to generate novel views of the sculpture dataset.
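
Handling views "in an order-agnostic manner" and generalising over the number of views can both be obtained with a symmetric pooling operator over per-view features. The sketch below uses an element-wise maximum as one such operator. Whether it matches SilNet's combination step is an assumption, and the feature vectors are toy values.

```python
def pool_views(view_features):
    """Combine per-view feature vectors with an element-wise maximum.

    view_features: list of equal-length feature vectors, one per view.
    The result has the same length whatever the number or order of views.
    """
    return [max(values) for values in zip(*view_features)]


view_a = [0.1, 0.9, 0.3]
view_b = [0.4, 0.2, 0.8]
view_c = [0.5, 0.1, 0.2]

two_views = pool_views([view_a, view_b])
three_views = pool_views([view_a, view_b, view_c])
reordered = pool_views([view_c, view_a, view_b])
```

The pooled vector has a fixed size, so a decoder trained on two views can be fed the pooled features of three or four views without any change to its input shape.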

* BMVC 2017; Best Poster 

Multi-task Self-Supervised Visual Learning

Aug 25, 2017
Carl Doersch, Andrew Zisserman

We investigate methods for combining multiple self-supervised tasks--i.e., supervised tasks where data can be collected without manual labeling--in order to train a single visual representation. First, we provide an apples-to-apples comparison of four different self-supervised tasks using the very deep ResNet-101 architecture. We then combine tasks to jointly train a network. We also explore lasso regularization to encourage the network to factorize the information in its representation, and methods for "harmonizing" network inputs in order to learn a more unified representation. We evaluate all methods on ImageNet classification, PASCAL VOC detection, and NYU depth prediction. Our results show that deeper networks work better, and that combining tasks--even via a naive multi-head architecture--always improves performance. Our best joint network nearly matches the PASCAL performance of a model pre-trained on ImageNet classification, and matches the ImageNet network on NYU depth prediction.

* Published at ICCV 2017 

Recurrent Human Pose Estimation

Aug 05, 2017
Vasileios Belagiannis, Andrew Zisserman

We propose a novel ConvNet model for predicting 2D human body poses in an image. The model regresses a heatmap representation for each body keypoint, and is able to learn and represent both the part appearances and the context of the part configuration. We make the following three contributions: (i) an architecture combining a feed forward module with a recurrent module, where the recurrent module can be run iteratively to improve the performance, (ii) the model can be trained end-to-end and from scratch, with auxiliary losses incorporated to improve performance, (iii) we investigate whether keypoint visibility can also be predicted. The model is evaluated on two benchmark datasets. The result is a simple architecture that achieves performance on par with the state of the art, but without the complexity of a graphical model stage (or layers).
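
A heatmap representation stores, for each body keypoint, a score at every image location, and a 2D position is commonly read out as the location of the peak. The following is a minimal sketch of that read-out only. The function name `decode_keypoint` and the toy heatmap are assumptions, and the paper's own post-processing may differ.

```python
def decode_keypoint(heatmap):
    """Return (x, y, score) of the highest-scoring heatmap location.

    heatmap: 2D list of floats indexed as heatmap[y][x].
    Ties are resolved in favour of the first peak in row-major order.
    """
    best = (0, 0, heatmap[0][0])
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best


# Toy 3x4 heatmap for one keypoint, peaking at column 2 of row 1.
heatmap = [
    [0.0, 0.1, 0.1, 0.0],
    [0.1, 0.3, 0.9, 0.2],
    [0.0, 0.1, 0.2, 0.1],
]
x, y, score = decode_keypoint(heatmap)
```

The peak score can also serve as a rough confidence value, for example when deciding whether a keypoint should be treated as visible.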

* FG 2017, More Info and Demo: 

Look, Listen and Learn

Aug 01, 2017
Relja Arandjelović, Andrew Zisserman

We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.
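
The correspondence task needs no manual labels because training pairs can be generated from the videos themselves: a frame and an audio snippet taken from the same video form a positive pair, and a frame paired with audio from a different video forms a negative pair. The sketch below reduces this to video identifiers. The function name, the pair format and the one-negative-per-positive ratio are assumptions, and the time alignment within a video is not modelled.

```python
import random


def sample_avc_pairs(video_ids, rng):
    """Return (frame_video, audio_video, label) triples.

    For every video, emit one corresponding pair (label 1) and one
    mismatched pair (label 0) whose audio comes from another video.
    Requires at least two distinct video ids.
    """
    pairs = []
    for vid in video_ids:
        pairs.append((vid, vid, 1))
        other = rng.choice([v for v in video_ids if v != vid])
        pairs.append((vid, other, 0))
    return pairs


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pairs = sample_avc_pairs(["clip_a", "clip_b", "clip_c"], rng)
```

Each label is determined entirely by whether the two identifiers match, which is what makes the supervision free.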

* Appears in: IEEE International Conference on Computer Vision (ICCV) 2017 

Very Deep Convolutional Networks for Large-Scale Image Recognition

Apr 10, 2015
Karen Simonyan, Andrew Zisserman

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

Two-Stream Convolutional Networks for Action Recognition in Videos

Nov 12, 2014
Karen Simonyan, Andrew Zisserman

We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
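
A two-stream design produces one set of class scores from the spatial network and one from the temporal network, and these must be combined into a single prediction. The sketch below averages the two softmax outputs, which is one simple fusion choice assumed here for illustration. The class scores are made up and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(spatial_scores, temporal_scores):
    """Average the class posteriors of the spatial and temporal streams."""
    spatial = softmax(spatial_scores)
    temporal = softmax(temporal_scores)
    return [(s + t) / 2 for s, t in zip(spatial, temporal)]


spatial_scores = [2.0, 1.0, 0.1]   # appearance stream prefers class 0
temporal_scores = [0.5, 3.0, 0.2]  # motion stream strongly prefers class 1
fused = fuse_streams(spatial_scores, temporal_scores)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the more confident motion stream decides the outcome, which shows how complementary appearance and motion evidence can change the final label.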

Signs in time: Encoding human motion as a temporal image

Aug 06, 2016
Joon Son Chung, Andrew Zisserman

The goal of this work is to recognise and localise short temporal signals in image time series, where strong supervision is not available for training. To this end we propose an image encoding that concisely represents human motion in a video sequence in a form that is suitable for learning with a ConvNet. The encoding reduces the pose information from an image to a single column, dramatically diminishing the input requirements for the network, but retaining the essential information for recognition. The encoding is applied to the task of recognizing and localizing signed gestures in British Sign Language (BSL) videos. We demonstrate that using the proposed encoding, signs as short as 10 frames duration can be learnt from clips lasting hundreds of frames using only weak (clip level) supervision and with considerable label noise.
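
The idea of reducing each frame's pose to a single column can be sketched by flattening the keypoint coordinates of a frame into one column and placing the columns side by side in time order. This is an illustrative layout only: the actual encoding may use different pose quantities or normalisation, and the function name and toy coordinates are assumptions.

```python
def encode_temporal_image(pose_sequence):
    """Encode a pose sequence as a 2D array with one column per frame.

    pose_sequence: list of frames; each frame is a list of (x, y) keypoints
    in a fixed joint order (hypothetical layout).
    Returns a list of rows: 2 * joints rows and one column per frame.
    """
    # One flat column of coordinates per frame: x0, y0, x1, y1, ...
    columns = [
        [coord for point in frame for coord in point]
        for frame in pose_sequence
    ]
    # Transpose so that time runs along the horizontal axis.
    return [list(row) for row in zip(*columns)]


pose_sequence = [
    [(1, 2), (3, 4)],
    [(5, 6), (7, 8)],
    [(9, 10), (11, 12)],
]
image = encode_temporal_image(pose_sequence)
```

A clip of several hundred frames then becomes an image only a few hundred pixels wide, which a standard 2D ConvNet can process in one pass.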

Video Representation Learning by Dense Predictive Coding

Sep 27, 2019
Tengda Han, Weidi Xie, Andrew Zisserman

The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions: first, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos, which learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations; second, we propose a curriculum training scheme to predict further into the future with progressively less temporal context, which encourages the model to encode only slowly varying spatio-temporal signals, therefore leading to semantic representations; third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then finetuning the representation on a downstream task, i.e. action recognition. With a single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 (75.7% top-1 accuracy) and HMDB51 (35.7% top-1 accuracy), outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.

Geometry-Aware Video Object Detection for Static Cameras

Sep 06, 2019
Dan Xu, Weidi Xie, Andrew Zisserman

In this paper we propose a geometry-aware model for video object detection. Specifically, we consider the setting where cameras can be well approximated as static, e.g. in video surveillance scenarios, and scene pseudo depth maps can therefore be inferred easily from the object scale on the image plane. We make the following contributions: First, we extend the recent anchor-free detector (CornerNet [17]) to video object detection. In order to exploit the spatio-temporal information while maintaining high efficiency, the proposed model accepts video clips as input, and only makes predictions for the starting and the ending frames, i.e. heatmaps of object bounding box corners and the corresponding embeddings for grouping. Second, to tackle the challenge of scale variation in object detection, scene geometry information, e.g. derived depth maps, is explicitly incorporated into deep networks for multi-scale feature selection and for the network prediction. Third, we validate the proposed architectures on an autonomous driving dataset generated from the Carla simulator [5], and on a real dataset for human detection (DukeMTMC dataset [28]). Compared with existing competitive single-stage or two-stage detectors, the proposed geometry-aware spatio-temporal network achieves significantly better results.

* Accepted at BMVC 2019 as ORAL 
