
PROMPTING LARGE LANGUAGE MODELS WITH FINE-GRAINED VISUAL RELATIONS FROM SCENE GRAPH FOR VISUAL QUESTION ANSWERING

DOI:
10.60864/s5mx-qk95
Citation Author(s):
Submitted by:
P Liu
Last updated:
17 April 2024 - 4:45am
Document Type:
Presentation Slides
Event:
 

Visual Question Answering (VQA) is a task that requires models to comprehend both questions and images. An increasing number of works leverage the strong reasoning capabilities of Large Language Models (LLMs) to address VQA. These methods typically use image captions as textual descriptions of the visual content to help LLMs comprehend images. However, these captions often overlook the relations between fine-grained objects, which limits the reasoning capability of LLMs. In this paper, we present PFVR, a modular framework that Prompts LLMs with Fine-grained Visual Relationships for VQA. PFVR primarily consists of an answer-guided generation module (AGG) and a question-guided filtering module (QGF). The two modules combine to extract fine-grained visual relations from the scene graph, which then serve as crucial context for LLMs to comprehend the image. Extensive experiments on the popular VQA dataset GQA confirm that PFVR achieves state-of-the-art results compared to other strong VQA competitors, demonstrating its effectiveness.
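To make the general idea concrete, the following is a minimal sketch of prompting an LLM with scene-graph relations selected by the question. It is not the PFVR implementation: the abstract does not describe how AGG or QGF work internally, so the keyword-overlap filter, the stopword list, the prompt layout, and every function name here are illustrative assumptions.

```python
from typing import List, Set, Tuple

# A scene-graph relation as a (subject, relation, object) triple.
Triple = Tuple[str, str, str]

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"the", "a", "an", "is", "are", "what", "which",
             "of", "on", "in", "to", "that", "this", "it"}


def question_keywords(question: str) -> Set[str]:
    """Lowercase the question, strip punctuation, and drop stopwords."""
    tokens = (tok.strip("?.,!").lower() for tok in question.split())
    return {tok for tok in tokens if tok and tok not in STOPWORDS}


def filter_triples(triples: List[Triple], question: str) -> List[Triple]:
    """Keep triples whose subject or object is mentioned in the question.

    Assumption: a simple keyword overlap stands in for question-guided
    filtering; the actual QGF module may work very differently.
    """
    keywords = question_keywords(question)
    return [t for t in triples if keywords & {t[0].lower(), t[2].lower()}]


def build_prompt(caption: str, triples: List[Triple], question: str) -> str:
    """Assemble caption, selected relations, and question into one prompt."""
    relations = "\n".join(f"- {s} {r} {o}" for s, r, o in triples)
    return (f"Caption: {caption}\n"
            f"Relations:\n{relations}\n"
            f"Question: {question}\n"
            f"Answer:")


if __name__ == "__main__":
    scene_graph = [
        ("man", "holding", "umbrella"),
        ("dog", "on", "grass"),
        ("umbrella", "above", "man"),
    ]
    question = "What is the man holding?"
    selected = filter_triples(scene_graph, question)
    print(build_prompt("A man walks a dog in a park.", selected, question))
```

The resulting string would be sent to an LLM of choice; the relation lines give the model object-level context that a single caption might omit.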
