Models, code, and papers for "In So Kweon":

Deep Iterative Frame Interpolation for Full-frame Video Stabilization

Sep 05, 2019
Jinsoo Choi, In So Kweon

Video stabilization is a fundamental and important technique for higher quality videos. Prior works have extensively explored video stabilization, but most of them involve cropping of the frame boundaries and introduce moderate levels of distortion. We present a novel deep approach to video stabilization which can generate video frames without cropping and low distortion. The proposed framework utilizes frame interpolation techniques to generate in between frames, leading to reduced inter-frame jitter. Once applied in an iterative fashion, the stabilization effect becomes stronger. A major advantage is that our framework is end-to-end trainable in an unsupervised manner. In addition, our method is able to run in near real-time (15 fps). To the best of our knowledge, this is the first work to propose an unsupervised deep approach to full-frame video stabilization. We show the advantages of our method through quantitative and qualitative evaluations comparing to the state-of-the-art methods.

* Accepted to ACM Transactions on Graphics, To be presented at SIGGRAPH Asia 2019 

  Click for Model/Code and Paper
Learning Loss for Active Learning

May 09, 2019
Donggeun Yoo, In So Kweon

The performance of deep neural networks improves with more annotated data. The problem is that the budget for annotation is limited. One solution to this is active learning, where a model asks human to annotate data that it perceived as uncertain. A variety of recent methods have been proposed to apply active learning to deep networks but most of them are either designed specific for their target tasks or computationally inefficient for large networks. In this paper, we propose a novel active learning method that is simple but task-agnostic, and works efficiently with the deep networks. We attach a small parametric module, named "loss prediction module," to a target network, and learn it to predict target losses of unlabeled inputs. Then, this module can suggest data that the target model is likely to produce a wrong prediction. This method is task-agnostic as networks are learned from a single loss regardless of target tasks. We rigorously validate our method through image classification, object detection, and human pose estimation, with the recent network architectures. The results demonstrate that our method consistently outperforms the previous methods over the tasks.

* Accepted to CVPR 2019 (Oral) 

  Click for Model/Code and Paper
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles

Nov 24, 2018
Dahun Kim, Donghyeon Cho, In So Kweon

Self-supervised tasks such as colorization, inpainting and zigsaw puzzle have been utilized for visual representation learning for still images, when the number of labeled images is limited or absent at all. Recently, this worthwhile stream of study extends to video domain where the cost of human labeling is even more expensive. However, the most of existing methods are still based on 2D CNN architectures that can not directly capture spatio-temporal information for video applications. In this paper, we introduce a new self-supervised task called as \textit{Space-Time Cubic Puzzles} to train 3D CNNs using large scale video dataset. This task requires a network to arrange permuted 3D spatio-temporal crops. By completing \textit{Space-Time Cubic Puzzles}, the network learns both spatial appearance and temporal relation of video frames, which is our final goal. In experiments, we demonstrate that our learned 3D representation is well transferred to action recognition tasks, and outperforms state-of-the-art 2D CNN-based competitors on UCF101 and HMDB51 datasets.

* Accepted to AAAI 2019 

  Click for Model/Code and Paper
Light-weight place recognition and loop detection using road markings

Oct 20, 2017
Oleksandr Bailo, Francois Rameau, In So Kweon

In this paper, we propose an efficient algorithm for robust place recognition and loop detection using camera information only. Our pipeline purely relies on spatial localization and semantic information of road markings. The creation of the database of road markings sequences is performed online, which makes the method applicable for real-time loop closure for visual SLAM techniques. Furthermore, our algorithm is robust to various weather conditions, occlusions from vehicles, and shadows. We have performed an extensive number of experiments which highlight the effectiveness and scalability of the proposed method.

  Click for Model/Code and Paper
StairNet: Top-Down Semantic Aggregation for Accurate One Shot Detection

Sep 18, 2017
Sanghyun Woo, Soonmin Hwang, In So Kweon

One-stage object detectors such as SSD or YOLO already have shown promising accuracy with small memory footprint and fast speed. However, it is widely recognized that one-stage detectors have difficulty in detecting small objects while they are competitive with two-stage methods on large objects. In this paper, we investigate how to alleviate this problem starting from the SSD framework. Due to their pyramidal design, the lower layer that is responsible for small objects lacks strong semantics(e.g contextual information). We address this problem by introducing a feature combining module that spreads out the strong semantics in a top-down manner. Our final model StairNet detector unifies the multi-scale representations and semantic distribution effectively. Experiments on PASCAL VOC 2007 and PASCAL VOC 2012 datasets demonstrate that StairNet significantly improves the weakness of SSD and outperforms the other state-of-the-art one-stage detectors.

  Click for Model/Code and Paper
Robust Depth Estimation from Auto Bracketed Images

Mar 21, 2018
Sunghoon Im, Hae-Gon Jeon, In So Kweon

As demand for advanced photographic applications on hand-held devices grows, these electronics require the capture of high quality depth. However, under low-light conditions, most devices still suffer from low imaging quality and inaccurate depth acquisition. To address the problem, we present a robust depth estimation method from a short burst shot with varied intensity (i.e., Auto Bracketing) or strong noise (i.e., High ISO). We introduce a geometric transformation between flow and depth tailored for burst images, enabling our learning-based multi-view stereo matching to be performed effectively. We then describe our depth estimation pipeline that incorporates the geometric transformation into our residual-flow network. It allows our framework to produce an accurate depth map even with a bracketed image sequence. We demonstrate that our method outperforms state-of-the-art methods for various datasets captured by a smartphone and a DSLR camera. Moreover, we show that the estimated depth is applicable for image quality enhancement and photographic editing.

* To appear in CVPR 2018. Total 9 pages 

  Click for Model/Code and Paper
Contextually Customized Video Summaries via Natural Language

Mar 02, 2018
Jinsoo Choi, Tae-Hyun Oh, In So Kweon

The best summary of a long video differs among different people due to its highly subjective nature. Even for the same person, the best summary may change with time or mood. In this paper, we introduce the task of generating customized video summaries through simple text. First, we train a deep architecture to effectively learn semantic embeddings of video frames by leveraging the abundance of image-caption data via a progressive and residual manner. Given a user-specific text description, our algorithm is able to select semantically relevant video segments and produce a temporally aligned video summary. In order to evaluate our textually customized video summaries, we conduct experimental comparison with baseline methods that utilize ground-truth information. Despite the challenging baselines, our method still manages to show comparable or even exceeding performance. We also show that our method is able to generate semantically diverse video summaries by only utilizing the learned visual embeddings.

* To appear in WACV 2018 

  Click for Model/Code and Paper
Human Attention Estimation for Natural Images: An Automatic Gaze Refinement Approach

Jan 12, 2016
Jinsoo Choi, Tae-Hyun Oh, In So Kweon

Photo collections and its applications today attempt to reflect user interactions in various forms. Moreover, photo collections aim to capture the users' intention with minimum effort through applications capturing user intentions. Human interest regions in an image carry powerful information about the user's behavior and can be used in many photo applications. Research on human visual attention has been conducted in the form of gaze tracking and computational saliency models in the computer vision community, and has shown considerable progress. This paper presents an integration between implicit gaze estimation and computational saliency model to effectively estimate human attention regions in images on the fly. Furthermore, our method estimates human attention via implicit calibration and incremental model updating without any active participation from the user. We also present extensive analysis and possible applications for personal photo collections.

  Click for Model/Code and Paper
Learning Residual Flow as Dynamic Motion from Stereo Videos

Sep 16, 2019
Seokju Lee, Sunghoon Im, Stephen Lin, In So Kweon

We present a method for decomposing the 3D scene flow observed from a moving stereo rig into stationary scene elements and dynamic object motion. Our unsupervised learning framework jointly reasons about the camera motion, optical flow, and 3D motion of moving objects. Three cooperating networks predict stereo matching, camera motion, and residual flow, which represents the flow component due to object motion and not from camera motion. Based on rigid projective geometry, the estimated stereo depth is used to guide the camera motion estimation, and the depth and camera motion are used to guide the residual flow estimation. We also explicitly estimate the 3D scene flow of dynamic objects based on the residual flow and scene depth. Experiments on the KITTI dataset demonstrate the effectiveness of our approach and show that our method outperforms other state-of-the-art algorithms on the optical flow and visual odometry tasks.

* IROS 2019. 

  Click for Model/Code and Paper
LinkNet: Relational Embedding for Scene Graph

Nov 15, 2018
Sanghyun Woo, Dahun Kim, Donghyeon Cho, In So Kweon

Objects and their relationships are critical contents for image understanding. A scene graph provides a structured description that captures these properties of an image. However, reasoning about the relationships between objects is very challenging and only a few recent works have attempted to solve the problem of generating a scene graph from an image. In this paper, we present a method that improves scene graph generation by explicitly modeling inter-dependency among the entire object instances. We design a simple and effective relational embedding module that enables our model to jointly represent connections among all related objects, rather than focus on an object in isolation. Our method significantly benefits the main part of the scene graph generation task: relationship classification. Using it on top of a basic Faster R-CNN, our model achieves state-of-the-art results on the Visual Genome benchmark. We further push the performance by introducing global context encoding module and geometrical layout encoding module. We validate our final model, LinkNet, through extensive ablation studies, demonstrating its efficacy in scene graph generation.

* Accepted to NIPS 2018 

  Click for Model/Code and Paper
Learning Image Representations by Completing Damaged Jigsaw Puzzles

Feb 06, 2018
Dahun Kim, Donghyeon Cho, Donggeun Yoo, In So Kweon

In this paper, we explore methods of complicating self-supervised tasks for representation learning. That is, we do severe damage to data and encourage a network to recover them. First, we complicate each of three powerful self-supervised task candidates: jigsaw puzzle, inpainting, and colorization. In addition, we introduce a novel complicated self-supervised task called "Completing damaged jigsaw puzzles" which is puzzles with one piece missing and the other pieces without color. We train a convolutional neural network not only to solve the puzzles, but also generate the missing content and colorize the puzzles. The recovery of the aforementioned damage pushes the network to obtain robust and general-purpose representations. We demonstrate that complicating the self-supervised tasks improves their original versions and that our final task learns more robust and transferable representations compared to the previous methods, as well as the simple combination of our candidate tasks. Our approach achieves state-of-the-art performance in transfer learning on PASCAL classification and semantic segmentation.

* accepted at WACV 2018 

  Click for Model/Code and Paper
Two-Phase Learning for Weakly Supervised Object Localization

Aug 16, 2017
Dahun Kim, Donghyeon Cho, Donggeun Yoo, In So Kweon

Weakly supervised semantic segmentation and localiza- tion have a problem of focusing only on the most important parts of an image since they use only image-level annota- tions. In this paper, we solve this problem fundamentally via two-phase learning. Our networks are trained in two steps. In the first step, a conventional fully convolutional network (FCN) is trained to find the most discriminative parts of an image. In the second step, the activations on the most salient parts are suppressed by inference conditional feedback, and then the second learning is performed to find the area of the next most important parts. By combining the activations of both phases, the entire portion of the tar- get object can be captured. Our proposed training scheme is novel and can be utilized in well-designed techniques for weakly supervised semantic segmentation, salient region detection, and object location prediction. Detailed experi- ments demonstrate the effectiveness of our two-phase learn- ing in each task.

* Accepted at ICCV 2017 

  Click for Model/Code and Paper
Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence

May 08, 2019
Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon

Blind video decaptioning is a problem of automatically removing text overlays and inpainting the occluded parts in videos without any input masks. While recent deep learning based inpainting methods deal with a single image and mostly assume that the positions of the corrupted pixels are known, we aim at automatic text removal in video sequences without mask information. In this paper, we propose a simple yet effective framework for fast blind video decaptioning. We construct an encoder-decoder model, where the encoder takes multiple source frames that can provide visible pixels revealed from the scene dynamics. These hints are aggregated and fed into the decoder. We apply a residual connection from the input frame to the decoder output to enforce our network to focus on the corrupted regions only. Our proposed model was ranked in the first place in the ECCV Chalearn 2018 LAP Inpainting Competition Track2: Video decaptioning. In addition, we further improve this strong model by applying a recurrent feedback. The recurrent feedback not only enforces temporal coherence but also provides strong clues on where the corrupted pixels are. Both qualitative and quantitative experiments demonstrate that our full model produces accurate and temporally consistent video results in real time (50+ fps).

* Accepted at CVPR 2019 

  Click for Model/Code and Paper
Deep Video Inpainting

May 05, 2019
Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon

Video inpainting aims to fill spatio-temporal holes with plausible content in a video. Despite tremendous progress of deep neural networks for image inpainting, it is challenging to extend these methods to the video domain due to the additional time dimension. In this work, we propose a novel deep network architecture for fast video inpainting. Built upon an image-based encoder-decoder model, our framework is designed to collect and refine information from neighbor frames and synthesize still-unknown regions. At the same time, the output is enforced to be temporally consistent by a recurrent feedback and a temporal memory module. Compared with the state-of-the-art image inpainting algorithm, our method produces videos that are much more semantically correct and temporally smooth. In contrast to the prior video completion method which relies on time-consuming optimization, our method runs in near real-time while generating competitive video results. Finally, we applied our framework to video retargeting task, and obtain visually pleasing results.

* Accepted at CVPR 2019 

  Click for Model/Code and Paper
DPSNet: End-to-end Deep Plane Sweep Stereo

May 02, 2019
Sunghoon Im, Hae-Gon Jeon, Stephen Lin, In So Kweon

Multiview stereo aims to reconstruct scene depth from images acquired by a camera under arbitrary motion. Recent methods address this problem through deep learning, which can utilize semantic cues to deal with challenges such as textureless and reflective regions. In this paper, we present a convolutional neural network called DPSNet (Deep Plane Sweep Network) whose design is inspired by best practices of traditional geometry-based approaches for dense depth reconstruction. Rather than directly estimating depth and/or optical flow correspondence from image pairs as done in many previous deep learning methods, DPSNet takes a plane sweep approach that involves building a cost volume from deep features using the plane sweep algorithm, regularizing the cost volume via a context-aware cost aggregation, and regressing the dense depth map from the cost volume. The cost volume is constructed using a differentiable warping process that allows for end-to-end training of the network. Through the effective incorporation of conventional multiview stereo concepts within a deep learning framework, DPSNet achieves state-of-the-art reconstruction results on a variety of challenging datasets.

* ICLR2019 accepted 

  Click for Model/Code and Paper
CBAM: Convolutional Block Attention Module

Jul 18, 2018
Sanghyun Woo, Jongchan Park, Joon-Young Lee, In So Kweon

We propose Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs. We validate our CBAM through extensive experiments on ImageNet-1K, MS~COCO detection, and VOC~2007 detection datasets. Our experiments show consistent improvements in classification and detection performances with various models, demonstrating the wide applicability of CBAM. The code and models will be publicly available.

* Accepted to ECCV 2018 

  Click for Model/Code and Paper
BAM: Bottleneck Attention Module

Jul 18, 2018
Jongchan Park, Sanghyun Woo, Joon-Young Lee, In So Kweon

Recent advances in deep neural networks have been developed via architecture search for stronger representational power. In this work, we focus on the effect of attention in general deep neural networks. We propose a simple and effective attention module, named Bottleneck Attention Module (BAM), that can be integrated with any feed-forward convolutional neural networks. Our module infers an attention map along two separate pathways, channel and spatial. We place our module at each bottleneck of models where the downsampling of feature maps occurs. Our module constructs a hierarchical attention at bottlenecks with a number of parameters and it is trainable in an end-to-end manner jointly with any feed-forward models. We validate our BAM through extensive experiments on CIFAR-100, ImageNet-1K, VOC 2007 and MS COCO benchmarks. Our experiments show consistent improvement in classification and detection performances with various models, demonstrating the wide applicability of BAM. The code and models will be publicly available.

* Accepted to BMVC 2018 (oral) 

  Click for Model/Code and Paper
Distort-and-Recover: Color Enhancement using Deep Reinforcement Learning

Apr 16, 2018
Jongchan Park, Joon-Young Lee, Donggeun Yoo, In So Kweon

Learning-based color enhancement approaches typically learn to map from input images to retouched images. Most of existing methods require expensive pairs of input-retouched images or produce results in a non-interpretable way. In this paper, we present a deep reinforcement learning (DRL) based method for color enhancement to explicitly model the step-wise nature of human retouching process. We cast a color enhancement process as a Markov Decision Process where actions are defined as global color adjustment operations. Then we train our agent to learn the optimal global enhancement sequence of the actions. In addition, we present a 'distort-and-recover' training scheme which only requires high-quality reference images for training instead of input and retouched image pairs. Given high-quality reference images, we distort the images' color distribution and form distorted-reference image pairs for training. Through extensive experiments, we show that our method produces decent enhancement results and our DRL approach is more suitable for the 'distort-and-recover' training scheme than previous supervised approaches. Supplementary material and code are available at

* Accepted to CVPR 2018 

  Click for Model/Code and Paper
Robust and Globally Optimal Manhattan Frame Estimation in Near Real Time

Apr 13, 2018
Kyungdon Joo, Tae-Hyun Oh, Junsik Kim, In So Kweon

Most man-made environments, such as urban and indoor scenes, consist of a set of parallel and orthogonal planar structures. These structures are approximated by the Manhattan world assumption, in which notion can be represented as a Manhattan frame (MF). Given a set of inputs such as surface normals or vanishing points, we pose an MF estimation problem as a consensus set maximization that maximizes the number of inliers over the rotation search space. Conventionally, this problem can be solved by a branch-and-bound framework, which mathematically guarantees global optimality. However, the computational time of the conventional branch-and-bound algorithms is rather far from real-time. In this paper, we propose a novel bound computation method on an efficient measurement domain for MF estimation, i.e., the extended Gaussian image (EGI). By relaxing the original problem, we can compute the bound with a constant complexity, while preserving global optimality. Furthermore, we quantitatively and qualitatively demonstrate the performance of the proposed method for various synthetic and real-world data. We also show the versatility of our approach through three different applications: extension to multiple MF estimation, 3D rotation based video stabilization, and vanishing point estimation (line clustering).

* To appear in TPAMI 

  Click for Model/Code and Paper
Co-domain Embedding using Deep Quadruplet Networks for Unseen Traffic Sign Recognition

Dec 05, 2017
Junsik Kim, Seokju Lee, Tae-Hyun Oh, In So Kweon

Recent advances in visual recognition show overarching success by virtue of large amounts of supervised data. However,the acquisition of a large supervised dataset is often challenging. This is also true for intelligent transportation applications, i.e., traffic sign recognition. For example, a model trained with data of one country may not be easily generalized to another country without much data. We propose a novel feature embedding scheme for unseen class classification when the representative class template is given. Traffic signs, unlike other objects, have official images. We perform co-domain embedding using a quadruple relationship from real and synthetic domains. Our quadruplet network fully utilizes the explicit pairwise similarity relationships among samples from different domains. We validate our method on three datasets with two experiments involving one-shot classification and feature generalization. The results show that the proposed method outperforms competing approaches on both seen and unseen classes.

* To appear in AAAI 2018 

  Click for Model/Code and Paper