WATCH, LISTEN ONCE, AND SYNC: AUDIO-VISUAL SYNCHRONIZATION WITH MULTI-MODAL REGRESSION CNN

Citation Author(s): Toshiki Kikuchi, Yuko Ozasa
Submitted by: Toshiki Kikuchi
Last updated: 13 April 2018 - 12:19am
Document Type: Presentation Slides
Document Year: 2018
Event: ICASSP 2018
Presenters: Toshiki Kikuchi
Paper Code: MMSP-L1.3

Categories: Multimodal signal processing

Recovering audio-visual synchronization is an important task in the field of visual speech processing.
In this paper, we present a multi-modal regression model that uses a convolutional neural network (CNN) to recover the audio-visual synchronization of single-person speech videos. The proposed model takes audio and visual features from multiple frames as input and predicts the number of frames by which the input audio-visual pair has drifted. Because we treat synchronization as a regression problem, the model does not need a sliding-window search, which would increase the computational cost. Experimental results show that the proposed method outperforms the baseline methods in both recovery accuracy and computational cost.
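
The cost argument in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model: `toy_score` and `toy_regressor` are hypothetical stand-ins for a learned synchronization scorer and the regression CNN, the one-dimensional "features" are invented for illustration, and the sign convention (a positive drift means the audio lags the video) is an assumption. The sketch only shows that a sliding-window search evaluates its scorer once per candidate offset, while a regressor is evaluated once.

```python
from typing import Callable, List, Sequence, Tuple

Frames = Sequence[float]


def shift_frames(frames: Frames, shift: int, pad: float = 0.0) -> List[float]:
    """Delay `frames` by `shift` positions (negative advances), keeping the length."""
    n = len(frames)
    if shift >= 0:
        return ([pad] * min(shift, n) + list(frames))[:n]
    return list(frames[-shift:]) + [pad] * min(-shift, n)


def sliding_window_drift(
    score: Callable[[Frames, Frames], float],
    video: Frames,
    audio: Frames,
    max_shift: int,
) -> Tuple[int, int]:
    """Baseline search: undo every candidate drift and keep the best-scoring one.

    Returns (estimated drift, number of scorer evaluations).
    """
    candidates = range(-max_shift, max_shift + 1)
    best = max(candidates, key=lambda d: score(video, shift_frames(audio, -d)))
    return best, len(candidates)


def regression_drift(
    regressor: Callable[[Frames, Frames], float],
    video: Frames,
    audio: Frames,
) -> Tuple[int, int]:
    """Regression view: a single evaluation predicts the drift directly."""
    return round(regressor(video, audio)), 1


# Hypothetical stand-ins (assumptions, not the paper's networks).
def toy_score(video: Frames, audio: Frames) -> float:
    # Higher is better: negative squared error between the two streams.
    return -sum((v - a) ** 2 for v, a in zip(video, audio))


def toy_regressor(video: Frames, audio: Frames) -> float:
    # Difference between the peak positions of the two streams.
    peak_audio = max(range(len(audio)), key=audio.__getitem__)
    peak_video = max(range(len(video)), key=video.__getitem__)
    return float(peak_audio - peak_video)


video = [0, 0, 0, 1, 5, 1, 0, 0, 0, 0]
audio = shift_frames(video, 2)  # audio lags the video by 2 frames

print(sliding_window_drift(toy_score, video, audio, max_shift=3))  # (2, 7)
print(regression_drift(toy_regressor, video, audio))               # (2, 1)
```

Both toy estimators recover the 2-frame drift, but the search needs 2 × `max_shift` + 1 scorer evaluations, so its cost grows with the range of offsets considered, whereas the regression call count stays at one.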

slides_for_sigport.pdf
