Models, code, and papers for "Bolei Zhou":

Image Processing Using Multi-Code GAN Prior

Dec 15, 2019
Jinjin Gu, Yujun Shen, Bolei Zhou

Despite the success of Generative Adversarial Networks (GANs) in image synthesis, applying trained GAN models to real image processing remains challenging. Because the generator in GANs typically maps the latent space to the image space, there leaves no space for it to take a real image as the input. To make a trained GAN handle real images, existing methods attempt to invert a target image back to the latent space either by back-propagation or by learning an additional encoder. However, the reconstructions from both of the methods are far from ideal. In this work, we propose a new inversion approach to incorporate the well-trained GANs as effective prior to a variety of image processing tasks. In particular, to invert a given GAN model, we employ multiple latent codes to generate multiple feature maps at some intermediate layer of the generator, then compose them with adaptive channel importance to output the final image. Such an over-parameterization of the latent space significantly improves the image reconstruction quality, outperforming existing GAN inversion methods. The resulting high-fidelity image reconstruction enables the trained GAN models as prior to many real-world applications, such as image colorization, super-resolution, image inpainting, and semantic manipulation. We further analyze the properties of the layer-wise representation learned by GAN models and shed light on what knowledge each layer is capable of representing.

* 19 pages, 22 figures, 4 tables 

  Access Model/Code and Paper
Semantic Hierarchy Emerges in Deep Generative Representations for Scene Synthesis

Nov 29, 2019
Ceyuan Yang, Yujun Shen, Bolei Zhou

Despite the success of Generative Adversarial Networks (GANs) in image synthesis, there lacks enough understanding on what networks have learned inside the deep generative representations and how photo-realistic images are able to be composed of random noises. In this work, we show that highly-structured semantic hierarchy emerges as variation factors for synthesizing scenes from the generative representations in state-of-the-art GAN models, like StyleGAN and BigGAN. By probing the layer-wise representations with a broad set of semantics at different abstraction levels, we are able to quantify the causality between the activations and semantics occurring in the output image. Such a quantification identifies the human-understandable variation factors learned by GANs to compose scenes. The qualitative and quantitative results suggest that the generative representations learned by the GANs with layer-wise latent codes are specialized to synthesize different hierarchical semantics: the early layers tend to determine the spatial layout and configuration, the middle layers control the objects, and the later layers finally render the scene attributes as well as color scheme. Identifying such a set of manipulatable latent variation factors facilitates semantic scene manipulation.

* 19 pages, 19 figures 

  Access Model/Code and Paper
Proceedings of AAAI 2019 Workshop on Network Interpretability for Deep Learning

Jan 25, 2019
Quanshi Zhang, Lixin Fan, Bolei Zhou

This is the Proceedings of AAAI 2019 Workshop on Network Interpretability for Deep Learning

  Access Model/Code and Paper
Optimization as Estimation with Gaussian Processes in Bandit Settings

Aug 12, 2018
Zi Wang, Bolei Zhou, Stefanie Jegelka

Recently, there has been rising interest in Bayesian optimization -- the optimization of an unknown function with assumptions usually expressed by a Gaussian Process (GP) prior. We study an optimization strategy that directly uses an estimate of the argmax of the function. This strategy offers both practical and theoretical advantages: no tradeoff parameter needs to be selected, and, moreover, we establish close connections to the popular GP-UCB and GP-PI strategies. Our approach can be understood as automatically and adaptively trading off exploration and exploitation in GP-UCB and GP-PI. We illustrate the effects of this adaptive tuning via bounds on the regret as well as an extensive empirical evaluation on robotics and vision tasks, demonstrating the robustness of this strategy for a range of performance criteria.

* Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain 

  Access Model/Code and Paper
ConceptLearner: Discovering Visual Concepts from Weakly Labeled Image Collections

Nov 19, 2014
Bolei Zhou, Vignesh Jagadeesh, Robinson Piramuthu

Discovering visual knowledge from weakly labeled data is crucial to scale up computer vision recognition system, since it is expensive to obtain fully labeled data for a large number of concept categories. In this paper, we propose ConceptLearner, which is a scalable approach to discover visual concepts from weakly labeled image collections. Thousands of visual concept detectors are learned automatically, without human in the loop for additional annotation. We show that these learned detectors could be applied to recognize concepts at image-level and to detect concepts at image region-level accurately. Under domain-specific supervision, we further evaluate the learned concepts for scene recognition on SUN database and for object detection on Pascal VOC 2007. ConceptLearner shows promising performance compared to fully supervised and weakly supervised methods.

* 9 pages, 8 figures, 3 tables 

  Access Model/Code and Paper
In-Domain GAN Inversion for Real Image Editing

Mar 31, 2020
Jiapeng Zhu, Yujun Shen, Deli Zhao, Bolei Zhou

Recent work has shown that a variety of controllable semantics emerges in the latent space of the Generative Adversarial Networks (GANs) when being trained to synthesize images. However, it is difficult to use these learned semantics for real image editing. A common practice of feeding a real image to a trained GAN generator is to invert it back to a latent code. However, we find that existing inversion methods typically focus on reconstructing the target image by pixel values yet fail to land the inverted code in the semantic domain of the original latent space. As a result, the reconstructed image cannot well support semantic editing through varying the latent code. To solve this problem, we propose an in-domain GAN inversion approach, which not only faithfully reconstructs the input image but also ensures the inverted code to be semantically meaningful for editing. We first learn a novel domain-guided encoder to project any given image to the native latent space of GANs. We then propose a domain-regularized optimization by involving the encoder as a regularizer to fine-tune the code produced by the encoder, which better recovers the target image. Extensive experiments suggest that our inversion method achieves satisfying real image reconstruction and more importantly facilitates various image editing tasks, such as image interpolation and semantic manipulation, significantly outperforming start-of-the-arts.

* 29 pages, 20 figures, 2 tables 

  Access Model/Code and Paper
Interpreting the Latent Space of GANs for Semantic Face Editing

Jul 25, 2019
Yujun Shen, Jinjin Gu, Xiaoou Tang, Bolei Zhou

Despite the recent advance of Generative Adversarial Networks (GANs) in high-fidelity image synthesis, there lacks enough understandings on how GANs are able to map the latent code sampled from a random distribution to a photo-realistic image. Previous work assumes the latent space learned by GAN follows a distributed representation but observes the vector arithmetic phenomenon of the output's semantics in latent space. In this work, we interpret the semantics hidden in the latent space of well-trained GANs. We find that the latent code for well-trained generative models, such as ProgressiveGAN and StyleGAN, actually learns a disentangled representation after some linear transformations. We make a rigorous analysis on the encoding of various semantics in the latent space as well as their properties, and then study how these semantics are correlated to each other. Based on our analysis, we propose a simple and general technique, called InterFaceGAN, for semantic face editing in latent space. Given a synthesized face, we are able to faithfully edit its various attributes such as pose, expression, age, presence of eyeglasses, without retraining the GAN model. Furthermore, we show that even the artifacts occurred in output images are able to be fixed using same approach. Extensive results suggest that learning to synthesize faces spontaneously brings a disentangled and controllable facial attribute representation

* 19 pages, 19 figures 

  Access Model/Code and Paper
FaceFeat-GAN: a Two-Stage Approach for Identity-Preserving Face Synthesis

Dec 04, 2018
Yujun Shen, Bolei Zhou, Ping Luo, Xiaoou Tang

The advance of Generative Adversarial Networks (GANs) enables realistic face image synthesis. However, synthesizing face images that preserve facial identity as well as have high diversity within each identity remains challenging. To address this problem, we present FaceFeat-GAN, a novel generative model that improves both image quality and diversity by using two stages. Unlike existing single-stage models that map random noise to image directly, our two-stage synthesis includes the first stage of diverse feature generation and the second stage of feature-to-image rendering. The competitions between generators and discriminators are carefully designed in both stages with different objective functions. Specially, in the first stage, they compete in the feature domain to synthesize various facial features rather than images. In the second stage, they compete in the image domain to render photo-realistic images that contain high diversity but preserve identity. Extensive experiments show that FaceFeat-GAN generates images that not only retain identity information but also have high diversity and quality, significantly outperforming previous methods.

* 12 pages and 6 figures 

  Access Model/Code and Paper
Temporal Relational Reasoning in Videos

Jul 25, 2018
Bolei Zhou, Alex Andonian, Aude Oliva, Antonio Torralba

Temporal relational reasoning, the ability to link meaningful transformations of objects or entities over time, is a fundamental property of intelligent species. In this paper, we introduce an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales. We evaluate TRN-equipped networks on activity recognition tasks using three recent video datasets - Something-Something, Jester, and Charades - which fundamentally depend on temporal relational reasoning. Our results demonstrate that the proposed TRN gives convolutional neural networks a remarkable capacity to discover temporal relations in videos. Through only sparsely sampled video frames, TRN-equipped networks can accurately predict human-object interactions in the Something-Something dataset and identify various human gestures on the Jester dataset with very competitive performance. TRN-equipped networks also outperform two-stream networks and 3D convolution networks in recognizing daily activities in the Charades dataset. Further analyses show that the models learn intuitive and interpretable visual common sense knowledge in videos.

* camera-ready version for ECCV'18 

  Access Model/Code and Paper
Interpreting Deep Visual Representations via Network Dissection

Jun 26, 2018
Bolei Zhou, David Bau, Aude Oliva, Antonio Torralba

The success of recent deep convolutional neural networks (CNNs) depends on learning hidden representations that can summarize the important factors of variation behind the data. However, CNNs often criticized as being black boxes that lack interpretability, since they have millions of unexplained model parameters. In this work, we describe Network Dissection, a method that interprets networks by providing labels for the units of their deep visual representations. The proposed method quantifies the interpretability of CNN representations by evaluating the alignment between individual hidden units and a set of visual semantic concepts. By identifying the best alignments, units are given human interpretable labels across a range of objects, parts, scenes, textures, materials, and colors. The method reveals that deep representations are more transparent and interpretable than expected: we find that representations are significantly more interpretable than they would be under a random equivalently powerful basis. We apply the method to interpret and compare the latent representations of various network architectures trained to solve different supervised and self-supervised training tasks. We then examine factors affecting the network interpretability such as the number of the training iterations, regularizations, different initializations, and the network depth and width. Finally we show that the interpreted units can be used to provide explicit explanations of a prediction given by a CNN for an image. Our results highlight that interpretability is an important property of deep neural networks that provides new insights into their hierarchical structure.

* *B. Zhou and D. Bau contributed equally to this work. 15 pages, 27 figures 

  Access Model/Code and Paper
Revisiting the Importance of Individual Units in CNNs via Ablation

Jun 07, 2018
Bolei Zhou, Yiyou Sun, David Bau, Antonio Torralba

We revisit the importance of the individual units in Convolutional Neural Networks (CNNs) for visual recognition. By conducting unit ablation experiments on CNNs trained on large scale image datasets, we demonstrate that, though ablating any individual unit does not hurt overall classification accuracy, it does lead to significant damage on the accuracy of specific classes. This result shows that an individual unit is specialized to encode information relevant to a subset of classes. We compute the correlation between the accuracy drop under unit ablation and various attributes of an individual unit such as class selectivity and weight L1 norm. We confirm that unit attributes such as class selectivity are a poor predictor for impact on overall accuracy as found previously in recent work \cite{morcos2018importance}. However, our results show that class selectivity along with other attributes are good predictors of the importance of one unit to individual classes. We evaluate the impact of random rotation, batch normalization, and dropout to the importance of units to specific classes. Our results show that units with high selectivity play an important role in network classification power at the individual class level. Understanding and interpreting the behavior of these units is necessary and meaningful.

  Access Model/Code and Paper
Understanding Intra-Class Knowledge Inside CNN

Jul 21, 2015
Donglai Wei, Bolei Zhou, Antonio Torrabla, William Freeman

Convolutional Neural Network (CNN) has been successful in image recognition tasks, and recent works shed lights on how CNN separates different classes with the learned inter-class knowledge through visualization. In this work, we instead visualize the intra-class knowledge inside CNN to better understand how an object class is represented in the fully-connected layers. To invert the intra-class knowledge into more interpretable images, we propose a non-parametric patch prior upon previous CNN visualization models. With it, we show how different "styles" of templates for an object class are organized by CNN in terms of location and content, and represented in a hierarchical and ensemble way. Moreover, such intra-class knowledge can be used in many interesting applications, e.g. style-based image retrieval and style-based object completion.

* tech report for: 

  Access Model/Code and Paper
Deep Flow-Guided Video Inpainting

May 08, 2019
Rui Xu, Xiaoxiao Li, Bolei Zhou, Chen Change Loy

Video inpainting, which aims at filling in missing regions of a video, remains challenging due to the difficulty of preserving the precise spatial and temporal coherence of video contents. In this work we propose a novel flow-guided video inpainting approach. Rather than filling in the RGB pixels of each frame directly, we consider video inpainting as a pixel propagation problem. We first synthesize a spatially and temporally coherent optical flow field across video frames using a newly designed Deep Flow Completion network. Then the synthesized flow field is used to guide the propagation of pixels to fill up the missing regions in the video. Specifically, the Deep Flow Completion network follows a coarse-to-fine refinement to complete the flow fields, while their quality is further improved by hard flow example mining. Following the guide of the completed flow, the missing video regions can be filled up precisely. Our method is evaluated on DAVIS and YouTube-VOS datasets qualitatively and quantitatively, achieving the state-of-the-art performance in terms of inpainting quality and speed.

* cvpr'19 

  Access Model/Code and Paper
Scene Graph Generation from Objects, Phrases and Region Captions

Sep 15, 2017
Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, Xiaogang Wang

Object detection, scene graph generation and region captioning, which are three scene understanding tasks at different semantic levels, are tied together: scene graphs are generated on top of objects detected in an image with their pairwise relationship predicted, while region captioning gives a language description of the objects, their attributes, relations, and other context information. In this work, to leverage the mutual connections across semantic levels, we propose a novel neural network model, termed as Multi-level Scene Description Network (denoted as MSDN), to solve the three vision tasks jointly in an end-to-end manner. Objects, phrases, and caption regions are first aligned with a dynamic graph based on their spatial and semantic connections. Then a feature refining structure is used to pass messages across the three levels of semantic tasks through the graph. We benchmark the learned model on three tasks, and show the joint learning across three tasks with our proposed method can bring mutual improvements over previous models. Particularly, on the scene graph generation task, our proposed method outperforms the state-of-art method with more than 3% margin.

* accepted by ICCV 2017 

  Access Model/Code and Paper
Policy Continuation with Hindsight Inverse Dynamics

Nov 01, 2019
Hao Sun, Zhizhong Li, Xiaotong Liu, Dahua Lin, Bolei Zhou

Solving goal-oriented tasks is an important but challenging problem in reinforcement learning (RL). For such tasks, the rewards are often sparse, making it difficult to learn a policy effectively. To tackle this difficulty, we propose a new approach called Policy Continuation with Hindsight Inverse Dynamics (PCHID). This approach learns from Hindsight Inverse Dynamics based on Hindsight Experience Replay, enabling the learning process in a self-imitated manner and thus can be trained with supervised learning. This work also extends it to multi-step settings with Policy Continuation. The proposed method is general, which can work in isolation or be combined with other on-policy and off-policy algorithms. On two multi-goal tasks GridWorld and FetchReach, PCHID significantly improves the sample efficiency as well as the final performance.

  Access Model/Code and Paper
Cross-view Semantic Segmentation for Sensing Surroundings

Jun 09, 2019
Bowen Pan, Jiankai Sun, Alex Andonian, Aude Oliva, Bolei Zhou

Sensing surroundings is ubiquitous and effortless to humans: It takes a single glance to extract the spatial configuration of objects and the free space from the scene. To help machine vision with spatial understanding capabilities, we introduce the View Parsing Network (VPN) for cross-view semantic segmentation. In this framework, the first-view observations are parsed into a top-down-view semantic map indicating precise object locations. VPN contains a view transformer module, designed to aggregate the first-view observations taken from multiple angles and modalities, in order to draw a bird-view semantic map. We evaluate the VPN framework for cross-view segmentation on two types of environments, indoors and driving-traffic scenes. Experimental results show that our model accurately predicts the top-down-view semantic mask of the visible objects from the first-view observations, as well as infer the location of contextually-relevant objects even if they are invisible.

  Access Model/Code and Paper
Unified Perceptual Parsing for Scene Understanding

Jul 26, 2018
Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun

Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect objects inside, while also identifying the textures and surfaces of the objects along with their different compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires the machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it is able to effectively segment a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes. Models are available at \url{}.

* Accepted to European Conference on Computer Vision (ECCV) 2018 

  Access Model/Code and Paper
Network Dissection: Quantifying Interpretability of Deep Visual Representations

Apr 19, 2017
David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, Antonio Torralba

We propose a general framework called Network Dissection for quantifying the interpretability of latent representations of CNNs by evaluating the alignment between individual hidden units and a set of semantic concepts. Given any CNN model, the proposed method draws on a broad data set of visual concepts to score the semantics of hidden units at each intermediate convolutional layer. The units with semantics are given labels across a range of objects, parts, scenes, textures, materials, and colors. We use the proposed method to test the hypothesis that interpretability of units is equivalent to random linear combinations of units, then we apply our method to compare the latent representations of various networks when trained to solve different supervised and self-supervised training tasks. We further analyze the effect of training iterations, compare networks trained with different initializations, examine the impact of network depth and width, and measure the effect of dropout and batch normalization on the interpretability of deep visual representations. We demonstrate that the proposed method can shed light on characteristics of CNN models and training methods that go beyond measurements of their discriminative power.

* First two authors contributed equally. Oral presentation at CVPR 2017 

  Access Model/Code and Paper