WATCH, LISTEN ONCE, AND SYNC: AUDIO-VISUAL SYNCHRONIZATION WITH MULTI-MODAL REGRESSION CNN

Citation Author(s):
Toshiki Kikuchi, Yuko Ozasa
Submitted by:
Toshiki Kikuchi
Last updated:
13 April 2018 - 12:19am
Document Type:
Presentation Slides
Document Year:
2018
Presenters:
Toshiki Kikuchi
Paper Code:
MMSP-L1.3

Recovering audio-visual synchronization is an important task in visual speech processing. In this paper, we present a multi-modal regression model that uses a convolutional neural network (CNN) to recover audio-visual synchronization in single-person speech videos. The proposed model takes audio and visual features from multiple frames as input and predicts the number of frames by which the input audio-visual pair has drifted. We treat this synchronization task as a regression problem, so the model does not need a sliding-window search, which would increase the computational cost. Experimental results show that the proposed method outperforms baseline methods in both recovery accuracy and computational cost.
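To make the regression formulation concrete, below is a minimal PyTorch sketch of a two-stream CNN that fuses an audio feature window and a stack of video frames and regresses a single scalar drift value. All layer sizes, feature shapes, and the frame-stacking choices are illustrative assumptions, not the architecture from the paper.

```python
# Sketch of a two-stream regression CNN for audio-visual drift prediction.
# Every shape and layer size here is an illustrative assumption, not the
# authors' actual architecture.
import torch
import torch.nn as nn

class AVSyncRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        # Audio branch: e.g. a log-mel spectrogram window, shape (B, 1, 80, T).
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Visual branch: a stack of 5 grayscale frames treated as channels,
        # shape (B, 5, 112, 112).
        self.visual_cnn = nn.Sequential(
            nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fused head regresses one scalar: the drift in frames.
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, audio, video):
        a = self.audio_cnn(audio).flatten(1)   # (B, 64)
        v = self.visual_cnn(video).flatten(1)  # (B, 64)
        return self.head(torch.cat([a, v], dim=1)).squeeze(1)  # (B,)

model = AVSyncRegressor()
audio = torch.randn(2, 1, 80, 20)    # batch of 2 audio feature windows
video = torch.randn(2, 5, 112, 112)  # 5 stacked frames per sample
drift = model(audio, video)          # one predicted drift value per pair
loss = nn.functional.mse_loss(drift, torch.tensor([3.0, -1.0]))
```

Because the network outputs the drift directly in a single forward pass, there is no need to slide the audio window over candidate offsets and score each one, which is where the computational savings over window-search baselines would come from.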
