LANGUAGE AND VISUAL RELATIONS ENCODING FOR VISUAL QUESTION ANSWERING
- Submitted by: Fei Liu
- Last updated: 19 September 2019 - 11:05am
- Document Type: Poster
- Document Year: 2019
- Presenters: Fei Liu
Visual Question Answering (VQA) involves complex relations across two modalities, including relations between words and relations between image regions. Encoding these relations is therefore important for accurate VQA. In this paper, we propose two modules to encode the two types of relations, respectively. The language relation encoding module encodes multi-scale relations between words via a novel masked self-attention. The visual relation encoding module encodes the relations between image regions: it computes the response at a position as a weighted sum of the features at other positions in the feature maps. Extensive experiments demonstrate the effectiveness of each module. Our model achieves state-of-the-art performance on the VQA 1.0 dataset.
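To make the two modules concrete, below is a minimal sketch of the weighted-sum relation encoding described in the abstract, written in PyTorch. It assumes a scaled dot-product form of self-attention; the paper's exact projections, scale handling, and mask construction are not given on this page, so the `window_mask` helper is only a hypothetical illustration of how a multi-scale masked self-attention over words could be restricted, and the same routine is applied to image-region features without a mask.

```python
# Minimal sketch, assuming PyTorch and a single-head scaled dot-product
# formulation; projections, multi-scale details, and mask design are assumptions.
import torch
import torch.nn.functional as F


def relation_encoding(features, mask=None):
    """Compute each position's response as a weighted sum of the features
    at other positions (self-attention-style relation encoding).

    features: (batch, num_positions, dim) -- word or image-region features
    mask:     optional boolean (num_positions, num_positions) or broadcastable
              shape, True = this pair of positions may interact
    """
    d = features.size(-1)
    # Pairwise affinities between positions.
    scores = torch.matmul(features, features.transpose(1, 2)) / d ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    # Response at each position = weighted sum of features at all positions.
    return torch.matmul(weights, features)


def window_mask(num_words, window, device=None):
    """Hypothetical multi-scale mask: word i may attend to word j only when
    |i - j| <= window; varying the window size gives different relation scales."""
    idx = torch.arange(num_words, device=device)
    return (idx[None, :] - idx[:, None]).abs() <= window


# Example usage with toy shapes (batch of 2, 14 words, 36 image regions).
words = torch.randn(2, 14, 512)
regions = torch.randn(2, 36, 512)
lang_out = relation_encoding(words, window_mask(14, window=3))  # masked, word-level
vis_out = relation_encoding(regions)                            # all-pairs, region-level
```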