PROMPTING LARGE LANGUAGE MODELS WITH FINE-GRAINED VISUAL RELATIONS FROM SCENE GRAPH FOR VISUAL QUESTION ANSWERING
- DOI: 10.60864/s5mx-qk95
- Citation Author(s):
- Submitted by: P Liu
- Last updated: 6 June 2024 - 10:23am
- Document Type: Presentation Slides
Visual Question Answering (VQA) is a task that requires models to comprehend both questions and images. A growing number of works leverage the strong reasoning capabilities of Large Language Models (LLMs) to address VQA. These methods typically use image captions as textual descriptions of the visual content to help LLMs comprehend images. However, such captions often overlook the relations among fine-grained objects, which limits the reasoning capability of LLMs. In this paper, we present PFVR, a modular framework that Prompts LLMs with Fine-grained Visual Relationships for VQA. PFVR primarily consists of an answer-guided generation module (AGG) and a question-guided filtering module (QGF). Together, the two modules extract fine-grained visual relations from the scene graph, which then serve as crucial context for LLMs to comprehend the image. Extensive experiments on the popular VQA dataset GQA confirm that PFVR achieves state-of-the-art results compared to other strong VQA competitors, demonstrating its effectiveness.
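The overall idea described above — selecting scene-graph relations relevant to the question and serializing them as textual context for an LLM prompt — can be sketched as follows. This is a minimal illustrative sketch only: the function names, the keyword-overlap filtering heuristic, and the example scene graph are all assumptions for exposition, not the authors' actual AGG/QGF implementation.

```python
# Hypothetical sketch: question-guided filtering of scene-graph triples,
# then building a prompt that supplies the kept relations as visual
# context for an LLM. All names and data here are illustrative.

def filter_relations(scene_graph, question_keywords):
    """Keep (subject, predicate, object) triples whose words overlap
    with the question's keywords (a stand-in for question-guided
    filtering, QGF)."""
    kept = []
    for subj, pred, obj in scene_graph:
        if {subj, pred, obj} & question_keywords:
            kept.append((subj, pred, obj))
    return kept

def build_prompt(question, relations):
    """Serialize the filtered relations as textual context for the LLM."""
    context = "; ".join(f"{s} {p} {o}" for s, p, o in relations)
    return f"Visual relations: {context}\nQuestion: {question}\nAnswer:"

# Toy scene graph for a single image.
scene_graph = [
    ("man", "holding", "umbrella"),
    ("umbrella", "above", "dog"),
    ("car", "parked_on", "street"),
]

relations = filter_relations(scene_graph, {"man", "holding"})
prompt = build_prompt("What is the man holding?", relations)
print(prompt)
```

Only the triple mentioning the man survives the filter, so the prompt presents the LLM with exactly the fine-grained relation it needs, rather than a coarse caption of the whole scene.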