Models, code, and papers for "William T. Freeman":

Physical Primitive Decomposition

Sep 13, 2018
Zhijian Liu, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu

Objects are made of parts, each with distinct geometry, physics, functionality, and affordances. Developing such a distributed, physical, and interpretable representation of objects will help intelligent agents better explore and interact with the world. In this paper, we study physical primitive decomposition---understanding an object through its components, each with physical and geometric attributes. As annotated data for object parts and physics are rare, we propose a novel formulation that learns physical primitives by explaining both an object's appearance and its behaviors in physical events. Our model performs well on block towers and tools in both synthetic and real scenarios; we also demonstrate that visual and physical observations often provide complementary signals. We further present ablation and behavioral studies to better understand our model and contrast it with human performance.

* ECCV 2018. Project page: http://ppd.csail.mit.edu/ 
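
As a rough illustration of the training signal described in the abstract, the sketch below combines an appearance term (the primitives should reconstruct the object's shape) with a physics term (simulating the decomposed object should match its observed behavior). All tensor shapes, the `simulate` callable, and the loss weights are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: primitives must explain both appearance (shape reconstruction)
# and behavior in a physical event (an observed trajectory).
import torch
import torch.nn.functional as F

def primitive_decomposition_loss(pred_shapes, pred_physics, target_voxels,
                                 simulate, observed_trajectory,
                                 w_appearance=1.0, w_physics=1.0):
    """pred_shapes:  (P, D, D, D) occupancy logits, one grid per primitive.
    pred_physics:    (P, K) physical attributes (e.g. density, friction).
    simulate:        differentiable (or surrogate) physics rollout (assumed)."""
    # Appearance term: the union of the primitives should reconstruct the object.
    union = pred_shapes.sigmoid().max(dim=0).values
    appearance = F.binary_cross_entropy(union, target_voxels)
    # Physics term: simulating the decomposed object should match the observation.
    predicted_trajectory = simulate(pred_shapes.sigmoid(), pred_physics)
    physics = F.mse_loss(predicted_trajectory, observed_trajectory)
    return w_appearance * appearance + w_physics * physics
```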

Visual Dynamics: Stochastic Future Generation via Layered Cross Convolutional Networks

Jul 25, 2018
Tianfan Xue, Jiajun Wu, Katherine L. Bouman, William T. Freeman

We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods that have tackled this problem in a deterministic or non-parametric way, we propose to model future frames in a probabilistic manner. Our probabilistic model makes it possible for us to sample and synthesize many possible future frames from a single input image. To synthesize realistic movement of objects, we propose a novel network structure, namely a Cross Convolutional Network; this network encodes image and motion information as feature maps and convolutional kernels, respectively. In experiments, our model performs well on synthetic data, such as 2D shapes and animated game sprites, and on real-world video frames. We present analyses of the learned network representations, showing that the network implicitly learns a compact encoding of object appearance and motion. We also demonstrate a few of its applications, including visual analogy-making and video extrapolation.

* IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018 
* Journal preprint of arXiv:1607.02586 (IEEE TPAMI, in press). The first two authors contributed equally to this work 
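
The key operation named in the abstract is the cross convolution: kernels decoded from a sampled motion code are convolved with feature maps of the input image. Below is a minimal sketch of one plausible way to implement it with a grouped convolution; the encoder/decoder sizes are assumptions.

```python
# Per-sample, per-channel convolution of image features with predicted kernels.
import torch
import torch.nn.functional as F

def cross_convolve(feature_maps, kernels):
    """feature_maps: (B, C, H, W) image-encoder output.
    kernels:         (B, C, k, k) motion-encoder output (k odd)."""
    B, C, H, W = feature_maps.shape
    k = kernels.shape[-1]
    # Fold the batch into the channel dimension so each sample/channel pair is
    # convolved with its own kernel (a grouped, depthwise convolution).
    x = feature_maps.reshape(1, B * C, H, W)
    w = kernels.reshape(B * C, 1, k, k)
    out = F.conv2d(x, w, padding=k // 2, groups=B * C)
    return out.reshape(B, C, H, W)

# Example: a sampled motion code z is decoded into kernels and applied to the
# feature maps of a single input frame (sizes are illustrative).
feats = torch.randn(2, 32, 64, 64)              # from the image encoder (assumed)
z = torch.randn(2, 16)                          # sampled motion code (assumed)
kernel_decoder = torch.nn.Linear(16, 32 * 5 * 5)
kernels = kernel_decoder(z).reshape(2, 32, 5, 5)
motion_features = cross_convolve(feats, kernels)
```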

Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks

Jul 09, 2016
Tianfan Xue, Jiajun Wu, Katherine L. Bouman, William T. Freeman

We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods, which have tackled this problem in a deterministic or non-parametric way, we propose a novel approach that models future frames in a probabilistic manner. Our probabilistic model makes it possible for us to sample and synthesize many possible future frames from a single input image. Future frame synthesis is challenging, as it involves low- and high-level image and motion understanding. We propose a novel network structure, namely a Cross Convolutional Network to aid in synthesizing future frames; this network structure encodes image and motion information as feature maps and convolutional kernels, respectively. In experiments, our model performs well on synthetic data, such as 2D shapes and animated game sprites, as well as on real-world videos. We also show that our model can be applied to tasks such as visual analogy-making, and present an analysis of the learned network representations.

* The first two authors contributed equally to this work 

Smart, Sparse Contours to Represent and Edit Images

Apr 09, 2018
Tali Dekel, Chuang Gan, Dilip Krishnan, Ce Liu, William T. Freeman

We study the problem of reconstructing an image from information stored at contour locations. We show that high-quality reconstructions with high fidelity to the source image can be obtained from sparse input, e.g., comprising less than $6\%$ of image pixels. This is a significant improvement over existing contour-based reconstruction methods that require much denser input to capture subtle texture information and to ensure image quality. Our model, based on generative adversarial networks, synthesizes texture and details in regions where no input information is provided. The semantic knowledge encoded in our model and the sparsity of the input allow us to use contours as an intuitive interface for semantically aware image manipulation: local edits in the contour domain translate to long-range and coherent changes in pixel space. We can perform complex structural changes such as changing facial expression by simple edits of contours. Our experiments demonstrate that humans as well as a face recognition system mostly cannot distinguish between our reconstructions and the source images.

* Accepted to CVPR'18; Project page: contour2im.github.io 

Video Enhancement with Task-Oriented Flow

Nov 24, 2017
Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, William T. Freeman

Many video processing algorithms rely on optical flow to register different frames within a sequence. However, a precise estimation of optical flow is often neither tractable nor optimal for a particular task. In this paper, we propose task-oriented flow (TOFlow), a flow representation tailored for specific video processing tasks. We design a neural network with a motion estimation component and a video processing component. These two parts can be jointly trained in a self-supervised manner to facilitate learning of the proposed TOFlow. We demonstrate that TOFlow outperforms traditional optical flow on three different video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution. We also introduce Vimeo-90K, a large-scale, high-quality video dataset for video processing to better evaluate the proposed algorithm.

* Project page: http://toflow.csail.mit.edu 
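
A hedged sketch of the joint training idea: a flow estimator warps a neighboring frame toward the reference, a task network processes the warped result, and only the downstream task loss drives both components, so the flow is free to become task-oriented. The tiny networks and warping details below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Warp `frame` (B,3,H,W) towards the reference using `flow` (B,2,H,W)."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                        # sample locations
    # Normalize to [-1, 1] for grid_sample (x, y order).
    norm_x = 2 * coords[:, 0] / (W - 1) - 1
    norm_y = 2 * coords[:, 1] / (H - 1) - 1
    return F.grid_sample(frame, torch.stack((norm_x, norm_y), dim=-1),
                         align_corners=True)

flow_net = nn.Conv2d(6, 2, 3, padding=1)     # stand-in flow estimator
task_net = nn.Conv2d(6, 3, 3, padding=1)     # stand-in video-processing network

def toflow_step(ref, neighbor, target):
    flow = flow_net(torch.cat([ref, neighbor], dim=1))
    warped = backward_warp(neighbor, flow)
    output = task_net(torch.cat([ref, warped], dim=1))
    return F.l1_loss(output, target)          # only the task loss drives the flow
```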

A Comparative Evaluation of Approximate Probabilistic Simulation and Deep Neural Networks as Accounts of Human Physical Scene Understanding

Oct 04, 2016
Renqiao Zhang, Jiajun Wu, Chengkai Zhang, William T. Freeman, Joshua B. Tenenbaum

Humans demonstrate remarkable abilities to predict physical events in complex scenes. Two classes of models for physical scene understanding have recently been proposed: "Intuitive Physics Engines", or IPEs, which posit that people make predictions by running approximate probabilistic simulations in causal mental models similar in nature to video-game physics engines, and memory-based models, which make judgments based on analogies to stored experiences of previously encountered scenes and physical outcomes. Versions of the latter have recently been instantiated in convolutional neural network (CNN) architectures. Here we report four experiments that, to our knowledge, are the first rigorous comparisons of simulation-based and CNN-based models, where both approaches are concretely instantiated in algorithms that can run on raw image inputs and produce as outputs physical judgments such as whether a stack of blocks will fall. Both approaches can achieve super-human accuracy levels and can quantitatively predict human judgments to a similar degree, but only the simulation-based models generalize to novel situations in ways that people do, and are qualitatively consistent with systematic perceptual illusions and judgment asymmetries that people show.

* Accepted to CogSci 2016 as an oral presentation 

Best-Buddies Similarity - Robust Template Matching using Mutual Nearest Neighbors

Sep 06, 2016
Shaul Oron, Tali Dekel, Tianfan Xue, William T. Freeman, Shai Avidan

We propose a novel method for template matching in unconstrained environments. Its essence is the Best-Buddies Similarity (BBS), a useful, robust, and parameter-free similarity measure between two sets of points. BBS is based on counting the number of Best-Buddies Pairs (BBPs)--pairs of points in source and target sets, where each point is the nearest neighbor of the other. BBS has several key features that make it robust against complex geometric deformations and high levels of outliers, such as those arising from background clutter and occlusions. We study these properties, provide a statistical analysis that justifies them, and demonstrate the consistent success of BBS on a challenging real-world dataset while using different types of features.
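
Because BBS is defined directly in terms of mutual nearest neighbors, it is easy to sketch; the snippet below counts Best-Buddies Pairs between two point sets (the normalization by the smaller set size is one common choice and an assumption here).

```python
import numpy as np

def best_buddies_similarity(P, Q):
    """P: (n, d) source points, Q: (m, d) target points (e.g. patch features)."""
    # Pairwise squared distances between the two sets.
    d = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)      # (n, m)
    nn_p_to_q = d.argmin(axis=1)          # nearest neighbor in Q for each p
    nn_q_to_p = d.argmin(axis=0)          # nearest neighbor in P for each q
    # A pair (i, j) is a Best-Buddies Pair iff each point is the other's NN.
    mutual = nn_q_to_p[nn_p_to_q] == np.arange(len(P))
    return mutual.sum() / min(len(P), len(Q))               # normalized count
```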


Exploiting compositionality to explore a large space of model structures

Oct 16, 2012
Roger Grosse, Ruslan R Salakhutdinov, William T. Freeman, Joshua B. Tenenbaum

The recent proliferation of richly structured probabilistic models raises the question of how to automatically determine an appropriate model for a dataset. We investigate this question for a space of matrix decomposition models which can express a variety of widely used models from unsupervised learning. To enable model selection, we organize these models into a context-free grammar which generates a wide variety of structures through the compositional application of a few simple rules. We use our grammar to generically and efficiently infer latent components and estimate predictive likelihood for nearly 2500 structures using a small toolbox of reusable algorithms. Using a greedy search over our grammar, we automatically choose the decomposition structure from raw data by evaluating only a small fraction of all models. The proposed method typically finds the correct structure for synthetic data and backs off gracefully to simpler models under heavy noise. It learns sensible structures for datasets as diverse as image patches, motion capture, 20 Questions, and U.S. Senate votes, all using exactly the same code.

* Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012) 
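
The search procedure described above can be sketched as a greedy loop over grammar expansions: rewrite one symbol at a time and keep the expansion only if it improves held-out predictive likelihood. The toy productions and scoring interface below are illustrative, not the paper's full grammar.

```python
PRODUCTIONS = {                       # illustrative rules, not the full grammar
    "G": ["GG + G", "MG + G", "GM + G"],
    "M": ["MG + G"],
}

def expansions(structure):
    """Yield all structures obtained by rewriting one nonterminal once."""
    for i, symbol in enumerate(structure):
        for rhs in PRODUCTIONS.get(symbol, []):
            yield structure[:i] + "(" + rhs + ")" + structure[i + 1:]

def greedy_search(score, start="G", depth=3):
    """`score(structure)` returns predictive likelihood on held-out data (assumed)."""
    best, best_score = start, score(start)
    for _ in range(depth):
        candidates = [(score(s), s) for s in expansions(best)]
        if not candidates or max(candidates)[0] <= best_score:
            break                     # back off gracefully to the simpler model
        best_score, best = max(candidates)
    return best, best_score
```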

Learning to Reconstruct Shapes from Unseen Classes

Dec 28, 2018
Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Joshua B. Tenenbaum, William T. Freeman, Jiajun Wu

From a single image, humans are able to perceive the full 3D shape of an object by exploiting learned shape priors from everyday life. Contemporary single-image 3D reconstruction algorithms aim to solve this task in a similar fashion, but often end up with priors that are highly biased by training classes. Here we present an algorithm, Generalizable Reconstruction (GenRe), designed to capture more generic, class-agnostic shape priors. We achieve this with an inference network and training procedure that combine 2.5D representations of visible surfaces (depth and silhouette), spherical shape representations of both visible and non-visible surfaces, and 3D voxel-based representations, in a principled manner that exploits the causal structure of how 3D shapes give rise to 2D images. Experiments demonstrate that GenRe performs well on single-view shape reconstruction, and generalizes to diverse novel objects from categories not seen during training.

* NeurIPS 2018 (Oral). The first two authors contributed equally to this paper. Project page: http://genre.csail.mit.edu/ 

Learning Shape Priors for Single-View 3D Completion and Reconstruction

Sep 13, 2018
Jiajun Wu, Chengkai Zhang, Xiuming Zhang, Zhoutong Zhang, William T. Freeman, Joshua B. Tenenbaum

The problem of single-view 3D shape completion or reconstruction is challenging, because among the many possible shapes that explain an observation, most are implausible and do not correspond to natural objects. Recent research in the field has tackled this problem by exploiting the expressiveness of deep convolutional networks. In fact, there is another level of ambiguity that is often overlooked: among plausible shapes, there are still multiple shapes that fit the 2D image equally well; i.e., the ground truth shape is non-deterministic given a single-view input. Existing fully supervised approaches fail to address this issue, and often produce blurry mean shapes with smooth surfaces but no fine details. In this paper, we propose ShapeHD, pushing the limit of single-view shape completion and reconstruction by integrating deep generative models with adversarially learned shape priors. The learned priors serve as a regularizer, penalizing the model only if its output is unrealistic, not if it deviates from the ground truth. Our design thus overcomes both of the aforementioned levels of ambiguity. Experiments demonstrate that ShapeHD outperforms the state of the art by a large margin in both shape completion and shape reconstruction on multiple real datasets.

* ECCV 2018. The first two authors contributed equally to this work. Project page: http://shapehd.csail.mit.edu/ 
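
The regularizer described above can be sketched as an extra penalty from a pretrained, adversarially learned shape discriminator, added to the usual reconstruction loss; the prediction is penalized for looking unrealistic, not for deviating from a particular ground truth. Names and weights below are assumptions.

```python
import torch
import torch.nn.functional as F

def shapehd_loss(pred_voxels, gt_voxels, shape_discriminator, w_prior=0.5):
    """pred_voxels, gt_voxels: (B, 1, D, D, D) occupancies in [0, 1];
    shape_discriminator: frozen, adversarially pretrained realism scorer."""
    recon = F.binary_cross_entropy(pred_voxels, gt_voxels)
    # The learned prior acts as a regularizer: push predictions towards the
    # manifold of realistic shapes rather than a single ground-truth shape.
    realism = shape_discriminator(pred_voxels)       # higher = more realistic
    prior = F.softplus(-realism).mean()              # penalize unrealistic outputs
    return recon + w_prior * prior
```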

Unsupervised Training for 3D Morphable Model Regression

Jun 15, 2018
Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, William T. Freeman

We present a method for training a regression network from image pixels to 3D morphable model coordinates using only unlabeled photographs. The training loss is based on features from a facial recognition network, computed on-the-fly by rendering the predicted faces with a differentiable renderer. To make training from features feasible and avoid network fooling effects, we introduce three objectives: a batch distribution loss that encourages the output distribution to match the distribution of the morphable model, a loopback loss that ensures the network can correctly reinterpret its own output, and a multi-view identity loss that compares the features of the predicted 3D face and the input photograph from multiple viewing angles. We train a regression network using these objectives, a set of unlabeled photographs, and the morphable model itself, and demonstrate state-of-the-art results.

* Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8377-8386 
* CVPR 2018 version with supplemental material (http://openaccess.thecvf.com/content_cvpr_2018/html/Genova_Unsupervised_Training_for_CVPR_2018_paper.html) 
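
The three objectives listed in the abstract can be sketched as follows; the regressor, differentiable renderer, and face-recognition feature extractor are passed in as opaque callables, and the exact formulations are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def batch_distribution_loss(pred_params, model_mean, model_std):
    # Match batch statistics of predicted morphable-model coordinates to the
    # morphable model's own distribution.
    return (F.mse_loss(pred_params.mean(0), model_mean)
            + F.mse_loss(pred_params.std(0), model_std))

def loopback_loss(regressor, renderer, recognizer, pred_params):
    # Re-render the predicted face and check that the network reproduces its
    # own parameters from the rendering.
    rendered = renderer(pred_params)
    return F.mse_loss(regressor(recognizer(rendered)), pred_params)

def identity_loss(recognizer, renderer, pred_params, input_features, views):
    # Recognition features of the predicted 3D face, rendered from several
    # viewpoints, should match the features of the input photograph.
    losses = [1 - F.cosine_similarity(recognizer(renderer(pred_params, view)),
                                      input_features).mean()
              for view in views]
    return sum(losses) / len(views)
```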

Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

Dec 20, 2017
Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, Antonio Torralba

The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper, Owens et al. 2016, with additional experiments and discussion.

* Journal preprint of arXiv:1608.07017 (unpublished submission to IJCV) 
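
One plausible instantiation of the supervisory signal described above: summarize each clip's audio with handcrafted statistics, cluster the summaries, and train an image CNN to predict the cluster of the audio that accompanies a frame. The network and the number of clusters below are assumptions.

```python
import torch
import torch.nn as nn

image_cnn = nn.Sequential(                # stand-in for the visual network
    nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 30),                    # 30 sound-statistic clusters (assumed)
)

def sound_supervision_loss(frames, sound_cluster_labels):
    """frames: (B, 3, H, W) video frames; labels: (B,) cluster index of each
    clip's audio statistics (e.g. from k-means over handcrafted sound features)."""
    logits = image_cnn(frames)
    return nn.functional.cross_entropy(logits, sound_cluster_labels)
```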

Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling

Jan 04, 2017
Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T. Freeman, Joshua B. Tenenbaum

We study the problem of 3D object generation. We propose a novel framework, namely 3D Generative Adversarial Network (3D-GAN), which generates 3D objects from a probabilistic space by leveraging recent advances in volumetric convolutional networks and generative adversarial nets. The benefits of our model are three-fold: first, the use of an adversarial criterion, instead of traditional heuristic criteria, enables the generator to capture object structure implicitly and to synthesize high-quality 3D objects; second, the generator establishes a mapping from a low-dimensional probabilistic space to the space of 3D objects, so that we can sample objects without a reference image or CAD models, and explore the 3D object manifold; third, the adversarial discriminator provides a powerful 3D shape descriptor which, learned without supervision, has wide applications in 3D object recognition. Experiments demonstrate that our method generates high-quality 3D objects, and our features, learned without supervision, achieve impressive performance on 3D object recognition, comparable with that of supervised learning methods.

* NIPS 2016. The first two authors contributed equally to this work 
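
A sketch of a 3D-GAN-style generator: a latent vector is upsampled through volumetric transposed convolutions into a 64x64x64 occupancy grid. Channel widths and the latent dimensionality below are assumptions.

```python
import torch
import torch.nn as nn

class VoxelGenerator(nn.Module):
    def __init__(self, z_dim=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(z_dim, 256, 4, stride=1),            # 1 -> 4
            nn.BatchNorm3d(256), nn.ReLU(),
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1),   # 4 -> 8
            nn.BatchNorm3d(128), nn.ReLU(),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1),    # 8 -> 16
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1),     # 16 -> 32
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),      # 32 -> 64
            nn.Sigmoid(),                                           # occupancy
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1, 1))

voxels = VoxelGenerator()(torch.randn(4, 200))   # (4, 1, 64, 64, 64)
```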

Ambient Sound Provides Supervision for Visual Learning

Dec 05, 2016
Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, Antonio Torralba

The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.

* ECCV 2016 

Learning Shape Templates with Structured Implicit Functions

Apr 12, 2019
Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna, William T. Freeman, Thomas Funkhouser

Template 3D shapes are useful for many tasks in graphics and vision, including fitting observation data, analyzing shape collections, and transferring shape attributes. Because of the variety of geometry and topology of real-world shapes, previous methods generally use a library of hand-made templates. In this paper, we investigate learning a general shape template from data. To allow for widely varying geometry and topology, we choose an implicit surface representation based on composition of local shape elements. While long known in computer graphics, this representation has not yet been explored in the context of machine learning for vision. We show that structured implicit functions are suitable for learning and allow a network to smoothly and simultaneously fit multiple classes of shapes. The learned shape template supports applications such as shape exploration, correspondence, abstraction, interpolation, and semantic segmentation from an RGB image.

* 12 pages, 9 figures, 4 tables 
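
The representation described above composes local shape elements into a single implicit function; a minimal sketch is a weighted sum of axis-aligned Gaussian blobs whose level set defines the surface. The parameter layout and iso-level below are assumptions.

```python
import torch

def structured_implicit(points, centers, radii, scales, iso=-0.07):
    """points: (N, 3) query points; centers: (K, 3) element centers;
    radii: (K, 3) per-axis standard deviations; scales: (K,) negative weights.
    Returns F(x) - iso; the surface is the zero crossing."""
    diff = points[:, None, :] - centers[None, :, :]            # (N, K, 3)
    mahala = (diff ** 2 / (2 * radii[None] ** 2)).sum(-1)      # (N, K)
    values = (scales[None] * torch.exp(-mahala)).sum(-1)       # (N,)
    return values - iso

# Example: one blob centered at the origin; points inside the shape return < 0.
pts = torch.tensor([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
f = structured_implicit(pts, centers=torch.zeros(1, 3),
                        radii=torch.full((1, 3), 0.3),
                        scales=torch.tensor([-1.0]))
```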

Seeing Tree Structure from Vibration

Sep 13, 2018
Tianfan Xue, Jiajun Wu, Zhoutong Zhang, Chengkai Zhang, Joshua B. Tenenbaum, William T. Freeman

Humans recognize object structure from both their appearance and motion; often, motion helps to resolve ambiguities in object structure that arise when we observe object appearance only. There are particular scenarios, however, where neither appearance nor spatial-temporal motion signals are informative: occluding twigs may look connected and have almost identical movements, though they belong to different, possibly disconnected branches. We propose to tackle this problem through spectrum analysis of motion signals, because vibrations of disconnected branches, though visually similar, often have distinctive natural frequencies. We propose a novel formulation of tree structure based on a physics-based link model, and validate its effectiveness by theoretical analysis, numerical simulation, and empirical experiments. With this formulation, we use nonparametric Bayesian inference to reconstruct tree structure from both spectral vibration signals and appearance cues. Our model performs well in recognizing hierarchical tree structure from real-world videos of trees and vessels.

* ECCV 2018. The first two authors contributed equally to this work. Project page: http://tree.csail.mit.edu/ 
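
The spectral cue can be sketched simply: take the dominant frequency of each tracked point's displacement signal and group points whose natural frequencies agree. The greedy grouping below stands in for the paper's nonparametric Bayesian inference and is only illustrative.

```python
import numpy as np

def dominant_frequencies(trajectories, fps=30.0):
    """trajectories: (P, T) 1-D displacement of P tracked points over T frames."""
    traj = trajectories - trajectories.mean(axis=1, keepdims=True)
    spectrum = np.abs(np.fft.rfft(traj, axis=1))
    freqs = np.fft.rfftfreq(trajectories.shape[1], d=1.0 / fps)
    return freqs[spectrum[:, 1:].argmax(axis=1) + 1]   # skip the DC bin

def group_by_frequency(trajectories, tol=0.2, fps=30.0):
    """Greedy grouping: points vibrating at (nearly) the same natural frequency
    are assigned to the same branch candidate."""
    f = dominant_frequencies(trajectories, fps)
    labels = np.full(len(f), -1, dtype=int)
    next_label = 0
    for i in np.argsort(f):
        if labels[i] == -1:
            labels[(labels == -1) & (np.abs(f - f[i]) < tol)] = next_label
            next_label += 1
    return labels
```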

Synthesizing Normalized Faces from Facial Identity Features

Oct 17, 2017
Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, William T. Freeman

We present a method for synthesizing a frontal, neutral-expression image of a person's face given an input face photograph. This is achieved by learning to generate facial landmarks and textures from features extracted from a facial-recognition network. Unlike previous approaches, our encoding feature vector is largely invariant to lighting, pose, and facial expression. Exploiting this invariance, we train our decoder network using only frontal, neutral-expression photographs. Since these photographs are well aligned, we can decompose them into a sparse set of landmark points and aligned texture maps. The decoder then predicts landmarks and textures independently and combines them using a differentiable image warping operation. The resulting images can be used for a number of applications, such as analyzing facial attributes, exposure and white balance adjustment, or creating a 3-D avatar.


Program-Guided Image Manipulators

Sep 04, 2019
Jiayuan Mao, Xiuming Zhang, Yikai Li, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu

Humans are capable of building holistic representations for images at various levels, from local objects, to pairwise relations, to global structures. The interpretation of structures involves reasoning over repetition and symmetry of the objects in the image. In this paper, we present the Program-Guided Image Manipulator (PG-IM), inducing neuro-symbolic program-like representations to represent and manipulate images. Given an image, PG-IM detects repeated patterns, induces symbolic programs, and manipulates the image using a neural network that is guided by the program. PG-IM learns from a single image, exploiting its internal statistics. Despite being trained only on image inpainting, PG-IM is directly capable of extrapolation and regularity editing in a unified framework. Extensive experiments show that PG-IM achieves superior performance on all the tasks.

* ICCV 2019. First two authors contributed equally. Project page: http://pgim.csail.mit.edu/ 

Visual Deprojection: Probabilistic Recovery of Collapsed Dimensions

Sep 01, 2019
Guha Balakrishnan, Adrian V. Dalca, Amy Zhao, John V. Guttag, Fredo Durand, William T. Freeman

We introduce visual deprojection: the task of recovering an image or video that has been collapsed along a dimension. Projections arise in various contexts, such as long-exposure photography, where a dynamic scene is collapsed in time to produce a motion-blurred image, and corner cameras, where reflected light from a scene is collapsed along a spatial dimension because of an edge occluder to yield a 1D video. Deprojection is ill-posed -- often there are many plausible solutions for a given input. We first propose a probabilistic model capturing the ambiguity of the task. We then present a variational inference strategy using convolutional neural networks as functional approximators. Sampling from the inference network at test time yields plausible candidates from the distribution of original signals that are consistent with a given input projection. We evaluate the method on several datasets for both spatial and temporal deprojection tasks. We demonstrate that the method can recover human gait videos and face images from spatial projections, and then show that it can recover videos of moving digits from dramatically motion-blurred images obtained via temporal projection.

* ICCV 2019 
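
At test time the method samples multiple plausible originals for one projection; the toy sketch below shows that pattern with a decoder conditioned on the projection and a fresh latent draw per sample. Sizes and architecture are placeholders, not the paper's model.

```python
import torch
import torch.nn as nn

class DeprojectionSampler(nn.Module):
    def __init__(self, z_dim=8):
        super().__init__()
        self.z_dim = z_dim
        # A 1-D projection is lifted back to a 2-D image, conditioned on z.
        self.decoder = nn.Sequential(
            nn.Linear(32 + z_dim, 256), nn.ReLU(),
            nn.Linear(256, 32 * 32), nn.Sigmoid(),
        )

    def forward(self, projection, n_samples=5):
        """projection: (B, 32) signal collapsed along one spatial dimension."""
        outputs = []
        for _ in range(n_samples):
            z = torch.randn(projection.size(0), self.z_dim)   # fresh latent draw
            x = self.decoder(torch.cat([projection, z], dim=1))
            outputs.append(x.view(-1, 1, 32, 32))
        return torch.stack(outputs, dim=1)      # (B, n_samples, 1, 32, 32)

candidates = DeprojectionSampler()(torch.rand(2, 32))
```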

Boundless: Generative Adversarial Networks for Image Extension

Aug 19, 2019
Piotr Teterwak, Aaron Sarna, Dilip Krishnan, Aaron Maschinot, David Belanger, Ce Liu, William T. Freeman

Image extension models have broad applications in image editing, computational photography and computer graphics. While image inpainting has been extensively studied in the literature, it is challenging to directly apply the state-of-the-art inpainting methods to image extension as they tend to generate blurry or repetitive pixels with inconsistent semantics. We introduce semantic conditioning to the discriminator of a generative adversarial network (GAN), and achieve strong results on image extension with coherent semantics and visually pleasing colors and textures. We also show promising results in extreme extensions, such as panorama generation.
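
Semantic conditioning of a GAN discriminator is often implemented with a projection term: the discriminator's score is augmented by the inner product between its features and a semantic embedding of the image (e.g. from a pretrained recognizer). The sketch below shows that general pattern; it is an assumption, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SemanticProjectionDiscriminator(nn.Module):
    def __init__(self, feat_dim=128, sem_dim=1000):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, feat_dim, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.linear = nn.Linear(feat_dim, 1)         # unconditional score
        self.project = nn.Linear(sem_dim, feat_dim)  # embed the semantic vector

    def forward(self, image, semantics):
        """semantics: (B, sem_dim) features from a pretrained recognizer (assumed)."""
        h = self.backbone(image)
        return self.linear(h) + (h * self.project(semantics)).sum(1, keepdim=True)

score = SemanticProjectionDiscriminator()(torch.randn(2, 3, 64, 64),
                                           torch.randn(2, 1000))
```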

