AUDIO-VISUAL FUSION AND CONDITIONING WITH NEURAL NETWORKS FOR EVENT RECOGNITION
- Submitted by:
- Mathilde Brousmiche
- Last updated:
- 14 October 2019 - 8:52pm
- Document Type:
- Presentation Slides
- Document Year:
- 2019
- Presenters:
- Mathilde Brousmiche
- Paper Code:
- 60
Video event recognition based on audio and visual modalities is an open research problem. The mainstream literature on video event recognition focuses on the visual modality and does not exploit the relevant information present in the audio modality. We study several fusion architectures for the audio-visual recognition of video events. We first build classical fusion architectures based on concatenation, addition, or Multimodal Compact Bilinear pooling (MCB). We then propose to create connections between the visual and audio processing streams with Feature-wise Linear Modulation (FiLM) layers, so that information from the audio modality can alter the behaviour of the visual classifier. We find that multimodal event classification always outperforms unimodal classification, regardless of the fusion or conditioning method used. Moreover, classification accuracy based on one modality improves when that modality is modulated by the other through FiLM layers.
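As a rough illustration of the FiLM-style conditioning described above, here is a minimal PyTorch sketch. The tensor shapes, layer sizes, and single-linear-layer FiLM generator are illustrative assumptions, not the architecture from the slides: a FiLM layer predicts a per-channel scale (gamma) and shift (beta) from the audio embedding and applies them to the visual feature maps.

```python
# Minimal sketch of FiLM-based audio-visual conditioning (assumed shapes).
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scales and shifts visual feature
    maps channel-wise using parameters predicted from audio features."""
    def __init__(self, audio_dim: int, num_channels: int):
        super().__init__()
        # One linear layer predicts both gamma (scale) and beta (shift).
        self.to_gamma_beta = nn.Linear(audio_dim, 2 * num_channels)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W); audio: (B, audio_dim)
        gamma, beta = self.to_gamma_beta(audio).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        # Broadcast the audio-conditioned affine transform over H and W.
        return gamma * visual + beta

# Usage with hypothetical feature sizes: modulate CNN feature maps
# with an audio embedding.
visual_feats = torch.randn(4, 64, 28, 28)  # assumed visual feature maps
audio_embed = torch.randn(4, 128)          # assumed audio embedding
film = FiLM(audio_dim=128, num_channels=64)
out = film(visual_feats, audio_embed)
print(out.shape)  # torch.Size([4, 64, 28, 28])
```

By contrast, the classical fusion baselines mentioned above (concatenation or addition) would simply combine pooled audio and visual embeddings before a shared classifier, without any cross-modal modulation of intermediate features.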