Songtao Jiang

MoE-TinyMed: Mixture of Experts for Tiny Medical Large Vision-Language Models

Apr 16, 2024
Songtao Jiang, Tuo Zheng, Yan Zhang, Yeying Jin, Zuozhu Liu

Mixture of Experts Tuning (MoE-Tuning) has effectively enhanced the performance of general MLLMs with fewer parameters, yet its application in resource-limited medical settings has not been fully explored. To address this gap, we developed MoE-TinyMed, a model tailored for medical applications that significantly lowers parameter demands. In evaluations on the VQA-RAD, SLAKE, and PathVQA datasets, MoE-TinyMed outperformed LLaVA-Med in all Med-VQA closed settings with just 3.6B parameters. Additionally, a streamlined version with 2B parameters surpassed LLaVA-Med's performance on PathVQA, showcasing its effectiveness in resource-limited healthcare settings.
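The abstract does not describe the routing mechanism, so the following is only a generic sketch of sparse top-k expert routing, the idea commonly meant by "mixture of experts", and not MoE-TinyMed's actual architecture. The names `moe_forward`, `router_logits`, and `top_k` are hypothetical, and the experts are plain Python callables standing in for feed-forward sub-networks.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: float,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    top_k: int = 2,
) -> Tuple[float, List[int]]:
    """Run only the top_k highest-scoring experts and mix their outputs.

    Hypothetical illustration: a real model would compute router_logits
    from the token representation with a learned linear layer.
    """
    probs = softmax(router_logits)
    # Indices of the top_k experts by router probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalise the selected probabilities so the mixture weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    output = sum(probs[i] / norm * experts[i](x) for i in chosen)
    return output, chosen
```

Because only `top_k` experts run for each input, the compute per input stays well below what the total number of experts would suggest, which is the usual motivation for this design.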

Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models

Apr 06, 2024
Songtao Jiang, Yan Zhang, Chenyi Zhou, Yeying Jin, Yang Feng, Jian Wu, Zuozhu Liu

Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA), particularly in object-oriented perception tasks which demand fine-grained understanding of object identities, locations or attributes, as indicated by empirical findings. This is mainly due to their limited capability to effectively integrate complex visual cues with textual information and potential object hallucinations. In this paper, we present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA, especially for object-oriented perception. VTPrompt merges visual and text prompts to extract key concepts from textual questions and employs a detection model to highlight relevant objects as visual prompts in images. The processed images alongside text prompts are subsequently fed into MLLMs to produce more accurate answers. Our experiments with GPT-4V and Gemini Pro, on three benchmarks, i.e., MME, MMB and POPE, demonstrate significant improvements. Particularly, our method led to a score improvement of up to 183.5 for GPT-4V on MME and enhanced MMB performance by 8.17% for GPT-4V and 15.69% for Gemini Pro.
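The three stages described above (extract key concepts from the question, highlight matching detections, pass the marked image and prompt to the MLLM) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Detection`, `extract_key_concepts`, `select_visual_prompts`, `build_text_prompt`, `vtprompt_sketch`) is hypothetical, concept extraction is replaced by a naive vocabulary match, and the detector and MLLM are passed in as plain callables rather than real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Detection:
    label: str
    box: Box


def extract_key_concepts(question: str, vocabulary: List[str]) -> List[str]:
    """Naive stand-in for concept extraction: keep vocabulary words that occur in the question."""
    words = {w.strip("?.,!").lower() for w in question.split()}
    return [concept for concept in vocabulary if concept in words]


def select_visual_prompts(detections: List[Detection], concepts: List[str]) -> List[Detection]:
    """Keep only detections whose label matches an extracted concept."""
    wanted = set(concepts)
    return [d for d in detections if d.label in wanted]


def build_text_prompt(question: str, marked: List[Detection]) -> str:
    """Describe the highlighted objects ahead of the question (assumed prompt layout)."""
    if not marked:
        return question
    hints = "; ".join(f"[{i}] {d.label} at {d.box}" for i, d in enumerate(marked, start=1))
    return f"Highlighted objects: {hints}\nQuestion: {question}"


def vtprompt_sketch(
    question: str,
    vocabulary: List[str],
    detector: Callable[[List[str]], List[Detection]],
    mllm: Callable[[str, List[Detection]], str],
) -> str:
    """Chain the three stages; detector and mllm are caller-supplied stand-ins."""
    concepts = extract_key_concepts(question, vocabulary)
    marked = select_visual_prompts(detector(concepts), concepts)
    return mllm(build_text_prompt(question, marked), marked)
```

In a real pipeline the `marked` boxes would be drawn onto the image before it is sent to the model; here they are passed along as data so the sketch stays self-contained.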
