AUDIO-VISUAL FUSION AND CONDITIONING WITH NEURAL NETWORKS FOR EVENT RECOGNITION

Citation Author(s):
Jean Rouat, Stéphane Dupont
Submitted by:
Mathilde Brousmiche
Last updated:
14 October 2019 - 8:52pm
Document Type:
Presentation Slides
Document Year:
2019
Event:
Presenters:
Mathilde Brousmiche
Paper Code:
60

Video event recognition based on audio and visual modalities is an open research problem. The mainstream literature on video event recognition focuses on the visual modality and does not take into account the relevant information carried by the audio modality. We study several fusion architectures for the audio-visual recognition of video events. We first build classical fusion architectures using concatenation, addition, or Multimodal Compact Bilinear pooling (MCB). We then propose to create connections between the visual and audio processing streams with Feature-wise Linear Modulation (FiLM) layers. In this way, information from the audio modality is exploited to modulate the behaviour of the visual classification stream. We find that multimodal event classification performance is always better than unimodal performance, regardless of the fusion or conditioning method used. Classification accuracy based on one modality improves when we add modulation by the other modality through FiLM layers.
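To make the conditioning idea concrete, below is a minimal sketch of a FiLM layer in PyTorch: an audio embedding predicts a per-channel scale (gamma) and shift (beta) that modulate the visual feature maps. The class name, dimensions, and overall wiring are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical FiLM (Feature-wise Linear Modulation) layer sketch in PyTorch.
# Names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Modulates visual feature maps with scale (gamma) and shift (beta)
    parameters predicted from an audio embedding."""
    def __init__(self, audio_dim: int, num_channels: int):
        super().__init__()
        # A single linear layer predicts gamma and beta for every channel.
        self.film = nn.Linear(audio_dim, 2 * num_channels)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (batch, channels, H, W); audio: (batch, audio_dim)
        gamma, beta = self.film(audio).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (batch, channels, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * visual + beta

# Example: a 128-d audio embedding modulating 64 visual channels.
film = FiLM(audio_dim=128, num_channels=64)
visual = torch.randn(8, 64, 14, 14)
audio = torch.randn(8, 128)
out = film(visual, audio)  # same shape as `visual`
```

In contrast, the classical fusion baselines mentioned above (concatenation or addition) would simply combine the pooled audio and visual feature vectors before a shared classifier, whereas FiLM injects the audio information inside the visual processing path itself.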
