Divya Kothandaraman

AerialBooth: Mutual Information Guidance for Text Controlled Aerial View Synthesis from a Single Image

Nov 27, 2023
Divya Kothandaraman, Tianyi Zhou, Ming Lin, Dinesh Manocha

We present a novel method, AerialBooth, for synthesizing the aerial view from a single input image using its text description. We leverage the pretrained text-to-2D image stable diffusion model as prior knowledge of the 3D world. The model is finetuned in two steps to optimize for the text embedding and the UNet that reconstruct the input image and its inverse perspective mapping respectively. The inverse perspective mapping creates variance within the text-image space of the diffusion model, while providing weak guidance for aerial view synthesis. At inference, we steer the contents of the generated image towards the input image using novel mutual information guidance that maximizes the information content between the probability distributions of the two images. We evaluate our approach on a wide spectrum of real and synthetic data, including natural scenes, indoor scenes, human action, etc. Through extensive experiments and ablation studies, we demonstrate the effectiveness of AerialBooth and also its generalizability to other text-controlled views. We also show that AerialBooth achieves the best viewpoint-fidelity trade-off through quantitative evaluation on 7 metrics analyzing viewpoint and fidelity w.r.t. the input image. Code and data are available at https://github.com/divyakraman/AerialBooth2023.
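
As a rough illustration of the quantity behind mutual information guidance, the sketch below estimates the mutual information between two grayscale images from a joint intensity histogram. It is not the AerialBooth implementation: the function name `mutual_information`, the flat-list image representation, and the binning scheme are assumptions made for this example, and the abstract does not say how the guidance term is computed inside the diffusion sampler.

```python
import math
from collections import Counter


def mutual_information(img_a, img_b, bins=8, max_value=256):
    """Estimate mutual information (in nats) between two grayscale images.

    Both images are flat lists of pixel intensities in [0, max_value).
    The histogram estimator here is an illustrative assumption, not the
    estimator used by the paper.
    """
    if len(img_a) != len(img_b) or not img_a:
        raise ValueError("images must be non-empty and the same size")

    # Quantize intensities into a small number of bins.
    width = max_value / bins
    qa = [min(int(v / width), bins - 1) for v in img_a]
    qb = [min(int(v / width), bins - 1) for v in img_b]

    n = len(qa)
    joint = Counter(zip(qa, qb))   # joint histogram counts
    count_a = Counter(qa)          # marginal counts for image A
    count_b = Counter(qb)          # marginal counts for image B

    # MI = sum p(a, b) * log( p(a, b) / (p(a) * p(b)) )
    mi = 0.0
    for (a, b), c in joint.items():
        mi += (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
    return mi
```

An image compared with itself gives the entropy of its quantized intensities, while two images whose intensities are statistically independent give a value near zero. A guidance scheme of the kind the abstract describes would push this quantity upward during sampling.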

PMI Sampler: Patch similarity guided frame selection for Aerial Action Recognition

Apr 14, 2023
Ruiqi Xian, Xijun Wang, Divya Kothandaraman, Dinesh Manocha

We present a new algorithm for selection of informative frames in video action recognition. Our approach is designed for aerial videos captured using a moving camera where human actors occupy a small spatial resolution of video frames. Our algorithm utilizes the motion bias within aerial videos, which enables the selection of motion-salient frames. We introduce the concept of patch mutual information (PMI) score to quantify the motion bias between adjacent frames, by measuring the similarity of patches. We use this score to assess the amount of discriminative motion information contained in one frame relative to another. We present an adaptive frame selection strategy using a shifted leaky ReLU and a cumulative distribution function, which ensures that the sampled frames comprehensively cover all the essential segments with high motion salience. Our approach can be integrated with any action recognition model to enhance its accuracy. In practice, our method achieves a relative improvement of 2.2-13.8% in top-1 accuracy on UAV-Human, 6.8% on NEC Drone, and 9.0% on Diving48 datasets.
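
The adaptive selection step can be sketched as follows, under several assumptions. The per-frame motion scores are taken as given, since the abstract does not specify how the PMI score is computed. The shift, slope, and the offset that keeps the weights positive are hypothetical choices made here so that the cumulative distribution is well defined. The names `shifted_leaky_relu` and `select_frames` are invented for this example.

```python
from bisect import bisect_left


def shifted_leaky_relu(x, shift=0.5, slope=0.1):
    """Leaky ReLU applied to (x - shift); shift and slope are assumed values."""
    d = x - shift
    return d if d > 0 else slope * d


def select_frames(motion_scores, k, shift=0.5, slope=0.1, eps=1e-6):
    """Pick k frame indices by sampling a CDF built from motion scores.

    Frames with high motion scores take up more of the CDF, so evenly
    spaced quantiles land on them more often. The returned list may
    repeat an index when few frames carry most of the motion.
    """
    if k <= 0 or not motion_scores:
        raise ValueError("need at least one frame and k >= 1")

    activated = [shifted_leaky_relu(s, shift, slope) for s in motion_scores]

    # Assumption: offset the activations so every weight is positive.
    floor = min(activated)
    weights = [a - floor + eps for a in activated]

    # Normalized cumulative distribution over frames.
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w
        cdf.append(running / total)

    # Read off the frames at k evenly spaced quantiles.
    last = len(cdf) - 1
    return [min(bisect_left(cdf, (i + 0.5) / k), last) for i in range(k)]
```

With scores `[0.0, 0.0, 0.9, 0.9, 0.0, 0.0]` and `k=2`, the two high-motion frames (indices 2 and 3) are selected. With uniform scores the selection falls back to evenly spaced frames.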

Aerial Diffusion: Text Guided Ground-to-Aerial View Translation from a Single Image using Diffusion Models

Mar 15, 2023
Divya Kothandaraman, Tianyi Zhou, Ming Lin, Dinesh Manocha

We present a novel method, Aerial Diffusion, for generating aerial views from a single ground-view image using text guidance. Aerial Diffusion leverages a pretrained text-image diffusion model for prior knowledge. We address two main challenges: the domain gap between the ground view and the aerial view, and the two views being far apart in the text-image embedding manifold. Our approach uses a homography inspired by inverse perspective mapping prior to finetuning the pretrained diffusion model. Additionally, using the text corresponding to the ground-view to finetune the model helps us capture the details in the ground-view image at a relatively low bias towards the ground-view image. Aerial Diffusion uses an alternating sampling strategy to compute the optimal solution on a complex high-dimensional manifold and generate a high-fidelity (w.r.t. ground view) aerial image. We demonstrate the quality and versatility of Aerial Diffusion on a plethora of images from various domains including nature, human actions, indoor scenes, etc. We qualitatively demonstrate the effectiveness of our method with extensive ablations and comparisons. To the best of our knowledge, Aerial Diffusion is the first approach that performs ground-to-aerial translation in an unsupervised manner.

* Code: https://github.com/divyakraman/AerialDiffusion

Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition

Sep 15, 2022
Divya Kothandaraman, Ming Lin, Dinesh Manocha

We present a learning algorithm for human activity recognition in videos. Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras that contain a human actor along with background motion. Typically, the human actors occupy less than one-tenth of the spatial resolution. Our approach simultaneously harnesses the benefits of frequency domain representations, a classical analysis tool in signal processing, and data driven neural networks. We build a differentiable static-dynamic frequency mask prior to model the salient static and dynamic pixels in the video, crucial for the underlying task of action recognition. We use this differentiable mask prior to enable the neural network to intrinsically learn disentangled feature representations via an identity loss function. Our formulation empowers the network to inherently compute disentangled salient features within its layers. Further, we propose a cost-function encapsulating temporal relevance and spatial content to sample the most important frame within uniformly spaced video segments. We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset and demonstrate relative improvements of 5.72% - 13.00% over the state-of-the-art and 14.28% - 38.05% over the corresponding baseline model.
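
One simple way to picture a static-dynamic split in the frequency domain is to take a temporal Fourier transform of each pixel and compare the energy at zero frequency (static content) with the energy at all other frequencies (temporal change). The sketch below does that with a plain discrete Fourier transform. It is an illustration under assumptions, not the paper's mask: the energy-ratio definition and the names `dynamic_energy_ratio` and `dynamic_mask` are hypothetical, and the differentiable formulation and identity loss from the abstract are not reproduced.

```python
import cmath


def temporal_dft(series):
    """Plain O(n^2) discrete Fourier transform of one pixel's time series."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series))
        for k in range(n)
    ]


def dynamic_energy_ratio(series):
    """Fraction of spectral energy outside the zero-frequency (static) term."""
    energies = [abs(c) ** 2 for c in temporal_dft(series)]
    total = sum(energies)
    if total == 0:
        return 0.0
    return sum(energies[1:]) / total


def dynamic_mask(frames):
    """Per-pixel dynamic ratio for a video given as a list of flat frames.

    Values near 0 mark pixels that stay constant over time; values near 1
    mark pixels whose intensity is dominated by temporal change.
    """
    num_pixels = len(frames[0])
    return [
        dynamic_energy_ratio([frame[p] for frame in frames])
        for p in range(num_pixels)
    ]
```

For a two-pixel video where the first pixel stays constant and the second alternates in sign, the mask is approximately `[0, 1]`.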

Placing Human Animations into 3D Scenes by Learning Interaction- and Geometry-Driven Keyframes

Sep 13, 2022
James F. Mullen Jr, Divya Kothandaraman, Aniket Bera, Dinesh Manocha

We present a novel method for placing a 3D human animation into a 3D scene while maintaining any human-scene interactions in the animation. We use the notion of computing the most important meshes in the animation for the interaction with the scene, which we call "keyframes." These keyframes allow us to better optimize the placement of the animation into the scene such that interactions in the animations (standing, lying, sitting, etc.) match the affordances of the scene (e.g., standing on the floor or lying in a bed). We compare our method, which we call PAAK, with prior approaches, including POSA, PROX ground truth, and a motion synthesis method, and highlight the benefits of our method with a perceptual study. Human raters preferred our PAAK method over the PROX ground truth data 64.6% of the time. Additionally, in direct comparisons, the raters preferred PAAK over competing methods, including 61.5% of the time over POSA.

* WACV 2023. Our project website is available at https://gamma.umd.edu/paak/

DistillAdapt: Source-Free Active Visual Domain Adaptation

May 24, 2022
Divya Kothandaraman, Sumit Shekhar, Abhilasha Sancheti, Manoj Ghuhan, Tripti Shukla, Dinesh Manocha

We present a novel method, DistillAdapt, for the challenging problem of Source-Free Active Domain Adaptation (SF-ADA). The problem requires adapting a pretrained source domain network to a target domain, within a provided budget for acquiring labels in the target domain, while assuming that the source data is not available for adaptation due to privacy concerns or otherwise. DistillAdapt is one of the first approaches for SF-ADA, and holistically addresses the challenges of SF-ADA via a novel Guided Attention Transfer Network (GATN) and an active learning heuristic, H_AL. The GATN enables selective distillation of features from the pretrained network to the target network using a small subset of annotated target samples mined by H_AL. H_AL acquires samples at batch-level and balances transferability from the pretrained network and uncertainty of the target network. DistillAdapt is task-agnostic, and can be applied across visual tasks such as classification, segmentation and detection. Moreover, DistillAdapt can handle shifts in output label space. We conduct experiments and extensive ablation studies across 3 visual tasks, viz. digits classification (MNIST, SVHN), synthetic (GTA5) to real (CityScapes) image segmentation, and document layout detection (PubLayNet to DSSE). We show that our source-free approach, DistillAdapt, results in an improvement of 0.5% - 31.3% (across datasets and tasks) over prior adaptation methods that assume access to large amounts of annotated source data for adaptation.

* 22 pages

Fourier Disentangled Space-Time Attention for Aerial Video Recognition

Mar 21, 2022
Divya Kothandaraman, Tianrui Guan, Xijun Wang, Sean Hu, Ming Lin, Dinesh Manocha

We present an algorithm, Fourier Activity Recognition (FAR), for UAV video activity recognition. Our formulation uses a novel Fourier object disentanglement method to innately separate out the human agent (which is typically small) from the background. Our disentanglement technique operates in the frequency domain to characterize the extent of temporal change of spatial pixels, and exploits convolution-multiplication properties of the Fourier transform to map this representation to the corresponding object-background entangled features obtained from the network. To encapsulate contextual information and long-range space-time dependencies, we present a novel Fourier Attention algorithm, which emulates the benefits of self-attention by modeling the weighted outer product in the frequency domain. Our Fourier attention formulation uses far fewer computations than self-attention. We have evaluated our approach on multiple UAV datasets including UAV Human RGB, UAV Human Night, Drone Action, and NEC Drone. We demonstrate a relative improvement of 8.02% - 38.69% in top-1 accuracy and run up to 3 times faster than prior works.

GANav: Group-wise Attention Network for Classifying Navigable Regions in Unstructured Outdoor Environments

Mar 07, 2021
Tianrui Guan, Divya Kothandaraman, Rohan Chandra, Dinesh Manocha

We present a new learning-based method for identifying safe and navigable regions in off-road terrains and unstructured environments from RGB images. Our approach consists of classifying groups of terrain classes based on their navigability levels using coarse-grained semantic segmentation. We propose a bottleneck transformer-based deep neural network architecture that uses a novel group-wise attention mechanism to distinguish between navigability levels of different terrains. Our group-wise attention heads enable the network to explicitly focus on the different groups and improve the accuracy. In addition, we propose a dynamic weighted cross entropy loss function to handle the long-tailed nature of the dataset. We show through extensive evaluations on the RUGD and RELLIS-3D datasets that our learning algorithm improves the accuracy of visual perception in off-road terrains for navigation. We compare our approach with prior work on these datasets and achieve an improvement over the state-of-the-art mIoU by 6.74-39.1% on RUGD and 3.82-10.64% on RELLIS-3D.
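
A weighted cross entropy for long-tailed labels can be sketched as below. The abstract does not state how the weights are made dynamic, so this example assumes one common choice: weights recomputed for every batch as the inverse of each class's frequency in that batch. The function names and the normalization are hypothetical and should not be read as GANav's loss.

```python
import math
from collections import Counter


def batch_class_weights(labels, num_classes):
    """Inverse-frequency weights recomputed from the current batch.

    Assumed scheme: weight_c = n / (classes_present * count_c), so rare
    classes in the batch get larger weights. Absent classes get weight 0.
    """
    counts = Counter(labels)
    n = len(labels)
    present = len(counts)
    return [
        n / (present * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(probs, labels, num_classes):
    """Weighted mean of -log p(true class) over a batch of pixels.

    probs is a list of per-pixel class probability lists; labels holds
    the true class index of each pixel.
    """
    weights = batch_class_weights(labels, num_classes)
    loss, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        loss += -weights[y] * math.log(max(p[y], 1e-12))
        weight_sum += weights[y]
    return loss / weight_sum
```

For a batch with labels `[0, 0, 0, 1]`, the single pixel of class 1 receives weight 2.0 while each pixel of class 0 receives about 0.67, so both classes contribute equally to the loss despite the imbalance.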

SAfE: Self-Attention Based Unsupervised Road Safety Classification in Hazardous Environments

Nov 27, 2020
Divya Kothandaraman, Rohan Chandra, Dinesh Manocha

We present a novel approach, SAfE, that can identify parts of an outdoor scene that are safe for driving, based on attention models. Our formulation is designed for hazardous weather conditions that can impair the visibility of human drivers as well as autonomous vehicles, increasing the risk of accidents. Our approach is unsupervised and uses domain adaptation, with entropy minimization and attention transfer discriminators, to leverage the large amounts of labeled data corresponding to clear weather conditions. Our attention transfer discriminator uses attention maps from the clear weather image to help the network learn relevant regions to attend to, on the images from the hazardous weather dataset. We conduct experiments on CityScapes simulated datasets depicting various weather conditions such as rain, fog and snow under different intensities, and additionally on Berkeley Deep Drive. Our results show that using attention models improves the standard unsupervised domain adaptation performance by 29.29%. Furthermore, we also compare with unsupervised domain adaptation methods and show an improvement of at least 12.02% (mIoU) over the state-of-the-art.

* 16 pages, 10 figures, 5 tables

Unsupervised Domain Adaptive Knowledge Distillation for Semantic Segmentation

Nov 03, 2020
Divya Kothandaraman, Athira Nambiar, Anurag Mittal

Practical autonomous driving systems face two crucial challenges: memory constraints and domain gap issues. We present an approach to learn domain adaptive knowledge in models with limited memory, thus bestowing the model with the ability to deal with these issues in a comprehensive manner. We delve into this in the context of unsupervised domain-adaptive semantic segmentation and propose a multi-level distillation strategy to effectively distil knowledge at different levels. Further, we introduce a cross entropy loss that leverages pseudo labels from the teacher. These pseudo teacher labels play two roles: (i) knowledge distillation from the teacher network to the student network, and (ii) serving as a proxy for the ground truth for target domain images, where the problem is completely unsupervised. We introduce four paradigms for distilling domain adaptive knowledge and carry out extensive experiments and ablation studies on real-to-real and synthetic-to-real scenarios. Our experiments demonstrate the profound success of our proposed method.

* 11 pages, 5 tables, 3 figures
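
The cross entropy on teacher pseudo labels described in the abstract above can be illustrated with a minimal sketch. It assumes the pseudo label for each pixel is the teacher's highest-scoring class and that the loss is averaged over pixels. Both choices, and the function names, are assumptions for this example. The paper's multi-level distillation terms are not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def pseudo_label_cross_entropy(student_logits, teacher_logits):
    """Mean cross entropy of student predictions against teacher pseudo labels.

    Each argument is a list of per-pixel logit lists. The pseudo label is
    assumed to be the argmax of the teacher logits for that pixel.
    """
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        pseudo = max(range(len(t)), key=t.__getitem__)  # teacher argmax
        total += -math.log(max(softmax(s)[pseudo], 1e-12))
    return total / len(student_logits)
```

A student that is undecided between two classes incurs a loss of log 2 per pixel, and the loss approaches zero as the student's prediction agrees with the teacher's label. No target-domain ground truth is needed, which matches the unsupervised setting.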
