AUDIO-VISUAL FUSION AND CONDITIONING WITH NEURAL NETWORKS FOR EVENT RECOGNITION
- Submitted by:
- Mathilde Brousmiche
- Last updated:
- 14 October 2019 - 8:52pm
- Document Type:
- Presentation Slides
- Document Year:
- 2019
- Presenters:
- Mathilde Brousmiche
- Paper Code:
- 60
Video event recognition based on audio and visual modalities is an open research problem. The mainstream literature on video event recognition focuses on the visual modality and does not exploit the relevant information present in the audio modality. We study several fusion architectures for the audio-visual recognition of video events. We first build classical fusion architectures based on concatenation, addition, or Multimodal Compact Bilinear pooling (MCB). We then propose to create connections between the visual and audio processing streams with Feature-wise Linear Modulation (FiLM) layers, so that information from the audio modality can alter the behaviour of the visual classifier. We find that multimodal event classification always outperforms unimodal classification, regardless of the fusion or conditioning method used. Moreover, classification accuracy based on one modality improves when that modality is modulated by the other through FiLM layers.
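As a rough illustration of the FiLM-style conditioning described above, here is a minimal PyTorch sketch. The tensor shapes, layer sizes, and single-linear-layer FiLM generator are illustrative assumptions, not the architecture from the slides: a FiLM layer predicts a per-channel scale (gamma) and shift (beta) from the audio embedding and applies them to the visual feature maps.

```python
# Minimal sketch of FiLM-based audio-visual conditioning (assumed shapes).
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scales and shifts visual feature
    maps channel-wise using parameters predicted from audio features."""
    def __init__(self, audio_dim: int, num_channels: int):
        super().__init__()
        # One linear layer predicts both gamma (scale) and beta (shift).
        self.to_gamma_beta = nn.Linear(audio_dim, 2 * num_channels)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W); audio: (B, audio_dim)
        gamma, beta = self.to_gamma_beta(audio).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        # Broadcast the audio-conditioned affine transform over H and W.
        return gamma * visual + beta

# Usage with hypothetical feature sizes: modulate CNN feature maps
# with an audio embedding.
visual_feats = torch.randn(4, 64, 28, 28)  # assumed visual feature maps
audio_embed = torch.randn(4, 128)          # assumed audio embedding
film = FiLM(audio_dim=128, num_channels=64)
out = film(visual_feats, audio_embed)
print(out.shape)  # torch.Size([4, 64, 28, 28])
```

By contrast, the classical fusion baselines mentioned above (concatenation or addition) would simply combine pooled audio and visual embeddings before a shared classifier, without any cross-modal modulation of intermediate features.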