WATCH, LISTEN ONCE, AND SYNC: AUDIO-VISUAL SYNCHRONIZATION WITH MULTI-MODAL REGRESSION CNN
- Submitted by: Toshiki Kikuchi
- Last updated: 13 April 2018
- Document Type: Presentation Slides
- Document Year: 2018
- Presenters: Toshiki Kikuchi
- Paper Code: MMSP-L1.3
Recovering audio-visual synchronization is an important task in the field of visual speech processing.
In this paper, we present a multi-modal regression model that uses a convolutional neural network (CNN) to recover audio-visual synchronization in single-person speech videos. The proposed model takes audio and visual features from multiple frames as input and predicts the frame offset (drift) of the input audio-visual pair. Because we treat synchronization as a regression problem, the model does not need a sliding-window search, which would increase the computational cost. Experimental results show that the proposed method outperforms baseline methods in both recovery accuracy and computational cost.
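To illustrate the regression formulation described in the abstract, the sketch below shows how a two-stream CNN could fuse audio and visual features and regress a single scalar drift value. This is a hypothetical PyTorch sketch, not the authors' implementation: the input shapes (a stack of mouth-region frames and an MFCC window), all layer sizes, and the class name `AVSyncRegressor` are assumptions made for illustration.

```python
# Hypothetical sketch of a two-stream regression CNN for audio-visual
# drift prediction (NOT the authors' released code). Assumptions:
#   - visual input: a stack of n_frames grayscale mouth-region frames
#   - audio input: an MFCC window covering the same time span
import torch
import torch.nn as nn

class AVSyncRegressor(nn.Module):
    def __init__(self, n_frames=5, n_mfcc=13, mfcc_steps=20):
        super().__init__()
        # Visual stream: treats the stacked frames as input channels.
        self.visual = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),  # -> 64 * 4 * 4 = 1024 features
        )
        # Audio stream: 2-D convolutions over the MFCC "image".
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),  # -> 1024 features
        )
        # Fused head outputs one scalar: the predicted drift in frames.
        # A single regression pass replaces a sliding-window search.
        self.head = nn.Sequential(
            nn.Linear(1024 + 1024, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, frames, mfcc):
        # frames: (B, n_frames, H, W); mfcc: (B, 1, n_mfcc, mfcc_steps)
        v = self.visual(frames)
        a = self.audio(mfcc)
        return self.head(torch.cat([v, a], dim=1)).squeeze(1)

# Training would minimize a regression loss between the predicted drift
# and a known, synthetically applied drift:
model = AVSyncRegressor()
frames = torch.randn(8, 5, 64, 64)       # batch of stacked frames
mfcc = torch.randn(8, 1, 13, 20)         # matching MFCC windows
drift_true = torch.randn(8)              # known drift labels (frames)
loss = nn.functional.mse_loss(model(frames, mfcc), drift_true)
```

Under this formulation, one forward pass yields the drift estimate directly; a classification or correlation-based approach would instead have to score many candidate offsets, which is the sliding-window cost the abstract refers to.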