LANGUAGE AND VISUAL RELATIONS ENCODING FOR VISUAL QUESTION ANSWERING
- Submitted by: Fei Liu
- Last updated: 19 September 2019 - 11:05am
- Document Type: Poster
- Document Year: 2019
- Presenters: Fei Liu
Visual Question Answering (VQA) involves complex relations across two modalities, including relations between words and relations between image regions. Encoding these relations is therefore important for accurate VQA. In this paper, we propose two modules to encode the two types of relations, respectively. The language relation encoding module encodes multi-scale relations between words via a novel masked self-attention. The visual relation encoding module encodes the relations between image regions: it computes the response at a position as a weighted sum of the features at other positions in the feature maps. Extensive experiments demonstrate the effectiveness of each module. Our model achieves state-of-the-art performance on the VQA 1.0 dataset.
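To make the two modules concrete, below is a minimal sketch of the weighted-sum relation encoding described in the abstract, written in PyTorch. It assumes a scaled dot-product form of self-attention; the paper's exact projections, scale handling, and mask construction are not given on this page, so the `window_mask` helper is only a hypothetical illustration of how a multi-scale masked self-attention over words could be restricted, and the same routine is applied to image-region features without a mask.

```python
# Minimal sketch, assuming PyTorch and a single-head scaled dot-product
# formulation; projections, multi-scale details, and mask design are assumptions.
import torch
import torch.nn.functional as F


def relation_encoding(features, mask=None):
    """Compute each position's response as a weighted sum of the features
    at other positions (self-attention-style relation encoding).

    features: (batch, num_positions, dim) -- word or image-region features
    mask:     optional boolean (num_positions, num_positions) or broadcastable
              shape, True = this pair of positions may interact
    """
    d = features.size(-1)
    # Pairwise affinities between positions.
    scores = torch.matmul(features, features.transpose(1, 2)) / d ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    # Response at each position = weighted sum of features at all positions.
    return torch.matmul(weights, features)


def window_mask(num_words, window, device=None):
    """Hypothetical multi-scale mask: word i may attend to word j only when
    |i - j| <= window; varying the window size gives different relation scales."""
    idx = torch.arange(num_words, device=device)
    return (idx[None, :] - idx[:, None]).abs() <= window


# Example usage with toy shapes (batch of 2, 14 words, 36 image regions).
words = torch.randn(2, 14, 512)
regions = torch.randn(2, 36, 512)
lang_out = relation_encoding(words, window_mask(14, window=3))  # masked, word-level
vis_out = relation_encoding(regions)                            # all-pairs, region-level
```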