Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Danna Gurari

Collecting Consistently High Quality Object Tracks with Minimal Human Involvement by Using Self-Supervised Learning to Detect Tracker Errors

May 06, 2024
Samreen Anjum, Suyog Jain, Danna Gurari

We propose a hybrid framework for consistently producing high-quality object tracks by combining an automated object tracker with little human input. The key idea is to tailor a module for each dataset to intelligently decide when an object tracker is failing and so humans should be brought in to re-localize an object for continued tracking. Our approach leverages self-supervised learning on unlabeled videos to learn a tailored representation for a target object that is then used to actively monitor its tracked region and decide when the tracker fails. Since labeled data is not needed, our approach can be applied to novel object categories. Experiments on three datasets demonstrate our method outperforms existing approaches, especially for small, fast moving, or occluded objects.

Via

Access Paper or Ask Questions

Interpreting COVID Lateral Flow Tests' Results with Foundation Models

Apr 21, 2024
Stuti Pandey, Josh Myers-Dean, Jarek Reynolds, Danna Gurari

Lateral flow tests (LFTs) enable rapid, low-cost testing for health conditions including Covid, pregnancy, HIV, and malaria. Automated readers of LFT results can yield many benefits including empowering blind people to independently learn about their health and accelerating data entry for large-scale monitoring (e.g., for pandemics such as Covid) by using only a single photograph per LFT test. Accordingly, we explore the abilities of modern foundation vision language models (VLMs) in interpreting such tests. To enable this analysis, we first create a new labeled dataset with hierarchical segmentations of each LFT test and its nested test result window. We call this dataset LFT-Grounding. Next, we benchmark eight modern VLMs in zero-shot settings for analyzing these images. We demonstrate that current VLMs frequently fail to correctly identify the type of LFT test, interpret the test results, locate the nested result window of the LFT tests, and recognize LFT tests when they partially obfuscated. To facilitate community-wide progress towards automated LFT reading, we publicly release our dataset at https://iamstuti.github.io/lft_grounding_foundation_models/.

Via

Access Paper or Ask Questions

Fully Authentic Visual Question Answering Dataset from Online Communities

Nov 27, 2023
Chongyan Chen, Mengchen Liu, Noel Codella, Yunsheng Li, Lu Yuan, Danna Gurari

Visual Question Answering (VQA) entails answering questions about images. We introduce the first VQA dataset in which all contents originate from an authentic use case. Sourced from online question answering community forums, we call it VQAonline. We then characterize our dataset and how it relates to eight other VQA datasets. Observing that answers in our dataset tend to be much longer (e.g., with a mean of 173 words) and thus incompatible with standard VQA evaluation metrics, we next analyze which of the six popular metrics for longer text evaluation align best with human judgments. We then use the best-suited metrics to evaluate six state-of-the-art vision and language foundation models on VQAonline and reveal where they struggle most. We will release the dataset soon to facilitate future extensions.

Via

Access Paper or Ask Questions

VQA Therapy: Exploring Answer Differences by Visually Grounding Answers

Aug 24, 2023
Chongyan Chen, Samreen Anjum, Danna Gurari

Figure 1 for VQA Therapy: Exploring Answer Differences by Visually Grounding Answers

Figure 2 for VQA Therapy: Exploring Answer Differences by Visually Grounding Answers

Figure 3 for VQA Therapy: Exploring Answer Differences by Visually Grounding Answers

Figure 4 for VQA Therapy: Exploring Answer Differences by Visually Grounding Answers

Visual question answering is a task of predicting the answer to a question about an image. Given that different people can provide different answers to a visual question, we aim to better understand why with answer groundings. We introduce the first dataset that visually grounds each unique answer to each visual question, which we call VQAAnswerTherapy. We then propose two novel problems of predicting whether a visual question has a single answer grounding and localizing all answer groundings. We benchmark modern algorithms for these novel problems to show where they succeed and struggle. The dataset and evaluation server can be found publicly at https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/.

* IEEE/CVF International Conference on Computer Vision

Via

Access Paper or Ask Questions

Interactive Segmentation for Diverse Gesture Types Without Context

Jul 20, 2023
Josh Myers-Dean, Yifei Fan, Brian Price, Wilson Chan, Danna Gurari

Figure 1 for Interactive Segmentation for Diverse Gesture Types Without Context

Figure 2 for Interactive Segmentation for Diverse Gesture Types Without Context

Figure 3 for Interactive Segmentation for Diverse Gesture Types Without Context

Figure 4 for Interactive Segmentation for Diverse Gesture Types Without Context

Interactive segmentation entails a human marking an image to guide how a model either creates or edits a segmentation. Our work addresses limitations of existing methods: they either only support one gesture type for marking an image (e.g., either clicks or scribbles) or require knowledge of the gesture type being employed, and require specifying whether marked regions should be included versus excluded in the final segmentation. We instead propose a simplified interactive segmentation task where a user only must mark an image, where the input can be of any gesture type without specifying the gesture type. We support this new task by introducing the first interactive segmentation dataset with multiple gesture types as well as a new evaluation metric capable of holistically evaluating interactive segmentation algorithms. We then analyze numerous interactive segmentation algorithms, including ones adapted for our novel task. While we observe promising performance overall, we also highlight areas for future improvement. To facilitate further extensions of this work, we publicly share our new dataset at https://github.com/joshmyersdean/dig.

Via

Access Paper or Ask Questions

Helping Visually Impaired People Take Better Quality Pictures

May 14, 2023
Maniratnam Mandal, Deepti Ghadiyaram, Danna Gurari, Alan C. Bovik

Figure 1 for Helping Visually Impaired People Take Better Quality Pictures

Figure 2 for Helping Visually Impaired People Take Better Quality Pictures

Figure 3 for Helping Visually Impaired People Take Better Quality Pictures

Figure 4 for Helping Visually Impaired People Take Better Quality Pictures

Perception-based image analysis technologies can be used to help visually impaired people take better quality pictures by providing automated guidance, thereby empowering them to interact more confidently on social media. The photographs taken by visually impaired users often suffer from one or both of two kinds of quality issues: technical quality (distortions), and semantic quality, such as framing and aesthetic composition. Here we develop tools to help them minimize occurrences of common technical distortions, such as blur, poor exposure, and noise. We do not address the complementary problems of semantic quality, leaving that aspect for future work. The problem of assessing and providing actionable feedback on the technical quality of pictures captured by visually impaired users is hard enough, owing to the severe, commingled distortions that often occur. To advance progress on the problem of analyzing and measuring the technical quality of visually impaired user-generated content (VI-UGC), we built a very large and unique subjective image quality and distortion dataset. This new perceptual resource, which we call the LIVE-Meta VI-UGC Database, contains $40$K real-world distorted VI-UGC images and $40$K patches, on which we recorded $2.7$M human perceptual quality judgments and $2.7$M distortion labels. Using this psychometric resource we also created an automatic blind picture quality and distortion predictor that learns local-to-global spatial quality relationships, achieving state-of-the-art prediction performance on VI-UGC pictures, significantly outperforming existing picture quality models on this unique class of distorted picture data. We also created a prototype feedback system that helps to guide users to mitigate quality issues and take better quality pictures, by creating a multi-task learning framework.

Via

Access Paper or Ask Questions

Salient Object Detection for Images Taken by People With Vision Impairments

Jan 12, 2023
Jarek Reynolds, Chandra Kanth Nagesh, Danna Gurari

Figure 1 for Salient Object Detection for Images Taken by People With Vision Impairments

Figure 2 for Salient Object Detection for Images Taken by People With Vision Impairments

Figure 3 for Salient Object Detection for Images Taken by People With Vision Impairments

Figure 4 for Salient Object Detection for Images Taken by People With Vision Impairments

Salient object detection is the task of producing a binary mask for an image that deciphers which pixels belong to the foreground object versus background. We introduce a new salient object detection dataset using images taken by people who are visually impaired who were seeking to better understand their surroundings, which we call VizWiz-SalientObject. Compared to seven existing datasets, VizWiz-SalientObject is the largest (i.e., 32,000 human-annotated images) and contains unique characteristics including a higher prevalence of text in the salient objects (i.e., in 68\% of images) and salient objects that occupy a larger ratio of the images (i.e., on average, $\sim$50\% coverage). We benchmarked seven modern salient object detection methods on our dataset and found they struggle most with images featuring salient objects that are large, have less complex boundaries, and lack text as well as for lower quality images. We invite the broader community to work on our new dataset challenge by publicly sharing the dataset at https://vizwiz.org/tasks-and-datasets/salient-object .

* Computer Vision and Pattern Recognition

Via

Access Paper or Ask Questions

Line Search-Based Feature Transformation for Fast, Stable, and Tunable Content-Style Control in Photorealistic Style Transfer

Oct 12, 2022
Tai-Yin Chiu, Danna Gurari

Figure 1 for Line Search-Based Feature Transformation for Fast, Stable, and Tunable Content-Style Control in Photorealistic Style Transfer

Figure 2 for Line Search-Based Feature Transformation for Fast, Stable, and Tunable Content-Style Control in Photorealistic Style Transfer

Figure 3 for Line Search-Based Feature Transformation for Fast, Stable, and Tunable Content-Style Control in Photorealistic Style Transfer

Figure 4 for Line Search-Based Feature Transformation for Fast, Stable, and Tunable Content-Style Control in Photorealistic Style Transfer

Photorealistic style transfer is the task of synthesizing a realistic-looking image when adapting the content from one image to appear in the style of another image. Modern models commonly embed a transformation that fuses features describing the content image and style image and then decodes the resulting feature into a stylized image. We introduce a general-purpose transformation that enables controlling the balance between how much content is preserved and the strength of the infused style. We offer the first experiments that demonstrate the performance of existing transformations across different style transfer models and demonstrate how our transformation performs better in its ability to simultaneously run fast, produce consistently reasonable results, and control the balance between content and style in different models. To support reproducing our method and models, we share the code at https://github.com/chiutaiyin/LS-FT.

* WACV2023

Via

Access Paper or Ask Questions

VizWiz-FewShot: Locating Objects in Images Taken by People With Visual Impairments

Jul 24, 2022
Yu-Yun Tseng, Alexander Bell, Danna Gurari

Figure 1 for VizWiz-FewShot: Locating Objects in Images Taken by People With Visual Impairments

Figure 2 for VizWiz-FewShot: Locating Objects in Images Taken by People With Visual Impairments

Figure 3 for VizWiz-FewShot: Locating Objects in Images Taken by People With Visual Impairments

Figure 4 for VizWiz-FewShot: Locating Objects in Images Taken by People With Visual Impairments

We introduce a few-shot localization dataset originating from photographers who authentically were trying to learn about the visual content in the images they took. It includes nearly 10,000 segmentations of 100 categories in over 4,500 images that were taken by people with visual impairments. Compared to existing few-shot object detection and instance segmentation datasets, our dataset is the first to locate holes in objects (e.g., found in 12.3\% of our segmentations), it shows objects that occupy a much larger range of sizes relative to the images, and text is over five times more common in our objects (e.g., found in 22.4\% of our segmentations). Analysis of three modern few-shot localization algorithms demonstrates that they generalize poorly to our new dataset. The algorithms commonly struggle to locate objects with holes, very small and very large objects, and objects lacking text. To encourage a larger community to work on these unsolved challenges, we publicly share our annotated few-shot dataset at https://vizwiz.org .

* Accepted to ECCV 2022. The first two authors contributed equally

Via

Access Paper or Ask Questions

PCA-Based Knowledge Distillation Towards Lightweight and Content-Style Balanced Photorealistic Style Transfer Models

Mar 25, 2022
Tai-Yin Chiu, Danna Gurari

Figure 1 for PCA-Based Knowledge Distillation Towards Lightweight and Content-Style Balanced Photorealistic Style Transfer Models

Figure 2 for PCA-Based Knowledge Distillation Towards Lightweight and Content-Style Balanced Photorealistic Style Transfer Models

Figure 3 for PCA-Based Knowledge Distillation Towards Lightweight and Content-Style Balanced Photorealistic Style Transfer Models

Figure 4 for PCA-Based Knowledge Distillation Towards Lightweight and Content-Style Balanced Photorealistic Style Transfer Models

Photorealistic style transfer entails transferring the style of a reference image to another image so the result seems like a plausible photo. Our work is inspired by the observation that existing models are slow due to their large sizes. We introduce PCA-based knowledge distillation to distill lightweight models and show it is motivated by theory. To our knowledge, this is the first knowledge distillation method for photorealistic style transfer. Our experiments demonstrate its versatility for use with different backbone architectures, VGG and MobileNet, across six image resolutions. Compared to existing models, our top-performing model runs at speeds 5-20x faster using at most 1\% of the parameters. Additionally, our distilled models achieve a better balance between stylization strength and content preservation than existing models. To support reproducing our method and models, we share the code at \textit{https://github.com/chiutaiyin/PCA-Knowledge-Distillation}.

Via

Access Paper or Ask Questions