Yongliang Wu

Reframe Anything: LLM Agent for Open World Video Reframing

Mar 10, 2024
Jiawang Cao, Yongliang Wu, Weiheng Chi, Wenbo Zhu, Ziyue Su, Jay Wu

The proliferation of mobile devices and social media has revolutionized content dissemination, with short-form video becoming increasingly prevalent. This shift has introduced the challenge of video reframing: fitting a video to various screen aspect ratios while highlighting its most compelling parts. Traditionally, video reframing is a manual, time-consuming task that requires professional expertise and incurs high production costs. One potential solution is to automate the process with machine learning models, such as video salient object detection models. However, these methods often lack generalizability because they rely on specific training data. The advent of powerful large language models (LLMs) opens new avenues for AI capabilities. Building on this, we introduce Reframe Any Video Agent (RAVA), an LLM-based agent that leverages visual foundation models and human instructions to restructure visual content for video reframing. RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes the editing tools to produce the final video. Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.
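
To make the reframing task concrete, here is a minimal sketch of the geometric step that any reframing pipeline has to solve once a subject is located: choosing a crop window of a target aspect ratio inside the source frame. It is not RAVA's implementation; the `Box` type, the `crop_window` function, and the "center on the subject, then clamp to the frame" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels: left, top, width, height (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float


def crop_window(frame_w: float, frame_h: float, subject: Box, target_ratio: float) -> Box:
    """Return the largest window with width/height == target_ratio that fits
    in the frame, centered on the subject as closely as the frame allows."""
    # Pick the largest window of the requested ratio that fits in the frame.
    if frame_w / frame_h > target_ratio:
        crop_h = float(frame_h)
        crop_w = crop_h * target_ratio
    else:
        crop_w = float(frame_w)
        crop_h = crop_w / target_ratio

    # Center the window on the subject's midpoint.
    cx = subject.x + subject.w / 2
    cy = subject.y + subject.h / 2

    # Clamp so the window never leaves the frame.
    x = min(max(cx - crop_w / 2, 0.0), frame_w - crop_w)
    y = min(max(cy - crop_h / 2, 0.0), frame_h - crop_h)
    return Box(x, y, crop_w, crop_h)
```

For a 1920x1080 frame and a 9:16 target, this yields a window 607.5 pixels wide and 1080 pixels tall that follows the subject horizontally. A real system would also need to smooth the window over time to avoid jitter, which this sketch leaves out.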

* 14 pages, 6 figures

Exploring Diverse In-Context Configurations for Image Captioning

May 26, 2023
Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, Xin Geng

Since the discovery that language models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in vision-language (VL) domains have also developed their own few-shot learners, but they use only the simplest approach, random sampling, to configure in-context image-text pairs. To explore the effects of varying configurations on VL in-context learning, we devise four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Image captioning serves as the case study because it can be seen as a visually conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning due to multi-modal synergy, as compared to the NLP case.
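
As an illustration of the random-sampling baseline mentioned above, the following sketch draws in-context image-caption pairs at random and lays them out before a query image. The function names, the `<image:...>` placeholder syntax, and the prompt layout are assumptions for illustration only; they are not the paper's code, and the paper's four image-selection and four caption-assignment strategies are not reproduced here.

```python
import random
from typing import List, Sequence, Tuple

# (image identifier, caption); both are hypothetical placeholders.
Pair = Tuple[str, str]


def sample_in_context(pool: Sequence[Pair], k: int, seed: int = 0) -> List[Pair]:
    """Randomly pick k distinct image-caption pairs from the pool."""
    rng = random.Random(seed)  # seeded so a configuration can be reproduced
    return rng.sample(list(pool), k)


def build_prompt(examples: Sequence[Pair], query_image: str) -> str:
    """Place the in-context pairs first, then the query image with an empty
    caption slot for the model to complete."""
    lines = [f"<image:{img}> Caption: {cap}" for img, cap in examples]
    lines.append(f"<image:{query_image}> Caption:")
    return "\n".join(lines)
```

Replacing `sample_in_context` with a different selection rule while keeping `build_prompt` fixed is one way to compare configurations under otherwise identical conditions.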
