LANGUAGE AND VISUAL RELATIONS ENCODING FOR VISUAL QUESTION ANSWERING

Citation Author(s):
Jing Liu, Zhiwei Fang, Hanqing Lu
Submitted by:
Fei Liu
Last updated:
19 September 2019 - 11:05am
Document Type:
Poster
Document Year:
2019
Presenters:
Fei Liu

Visual Question Answering (VQA) involves complex relations in two modalities: relations between words and relations between image regions. Encoding these relations is therefore important for accurate VQA. In this paper, we propose two modules, one for each type of relation. The language relation encoding module encodes multi-scale relations between words via a novel masked self-attention. The visual relation encoding module encodes relations between image regions by computing the response at each position as a weighted sum of the features at all other positions in the feature maps. Extensive experiments demonstrate the effectiveness of each module, and our model achieves state-of-the-art performance on the VQA 1.0 dataset.
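The sketch below illustrates the two relation-encoding ideas described in the abstract, written in PyTorch. The module names, the window-based construction of the multi-scale masks, and the feature dimensions are illustrative assumptions, not the authors' exact architecture; the visual module follows the stated "weighted sum over other positions" formulation (a non-local-style block).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedSelfAttention(nn.Module):
    """Self-attention over word features, restricted by an additive mask."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, words, mask):
        # words: (B, T, D) word features; mask: (T, T) with 0 = keep, -inf = block
        q, k, v = self.query(words), self.key(words), self.value(words)
        scores = torch.bmm(q, k.transpose(1, 2)) * self.scale + mask
        attn = F.softmax(scores, dim=-1)
        return torch.bmm(attn, v)


def window_mask(seq_len, window):
    """Local mask so each word attends only to words within `window` positions.
    Using several window sizes gives attention at several scales (an assumption
    about how "multi-scale" word relations could be realized)."""
    idx = torch.arange(seq_len)
    keep = (idx[None, :] - idx[:, None]).abs() <= window
    return torch.where(keep,
                       torch.zeros(seq_len, seq_len),
                       torch.full((seq_len, seq_len), float('-inf')))


class VisualRelationEncoding(nn.Module):
    """The response at each image region is a weighted sum of the (transformed)
    features at all other regions, added back through a residual connection."""

    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim)
        self.phi = nn.Linear(dim, dim)
        self.g = nn.Linear(dim, dim)

    def forward(self, regions):
        # regions: (B, N, D) image-region features
        weights = F.softmax(torch.bmm(self.theta(regions),
                                      self.phi(regions).transpose(1, 2)), dim=-1)
        return regions + torch.bmm(weights, self.g(regions))
```

For example, applying `MaskedSelfAttention` once with `window_mask(T, 2)` and once with `window_mask(T, 8)` would capture short-range and longer-range word relations, whose outputs could then be combined; the residual connection in `VisualRelationEncoding` is a common design choice for such blocks and is assumed here.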
