Poster: Synchformer: Efficient Synchronization from Sparse Cues

- DOI: 10.60864/v8m9-j241
- Citation Author(s):
- Submitted by: Vladimir Iashin
- Last updated: 6 June 2024 - 10:27am
- Document Type: Poster
- Document Year: 2024
- Event:
- Presenters: Vladimir Iashin
- Paper Code: MLSP-P4.1
- Categories:
- Keywords:
 
Our objective is audio-visual synchronization with a focus on ‘in-the-wild’ videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model and a training scheme that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale ‘in-the-wild’ dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability. Code, models, and project page: https://www.robots.ox.ac.uk/~vgg/research/synchformer/
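
To make the idea of segment-level contrastive pre-training concrete, the sketch below shows a symmetric InfoNCE-style loss between per-segment audio and visual embeddings. It is only an illustration of the general technique, not the Synchformer implementation: the function name, tensor shapes, and temperature value are assumptions for the example.

```python
# Hedged sketch: symmetric InfoNCE-style contrastive loss between per-segment
# audio and visual embeddings. Names, shapes, and the temperature are
# illustrative assumptions, not the authors' exact training code.
import torch
import torch.nn.functional as F


def segment_contrastive_loss(audio_emb: torch.Tensor,
                             visual_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, visual_emb: (num_segments, dim) embeddings of temporally
    aligned audio/visual segments taken from the same videos."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    # Similarity of every audio segment against every visual segment.
    logits = a @ v.t() / temperature
    # Matching audio-visual segments lie on the diagonal.
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: audio-to-visual and visual-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Random features stand in for segment-level encoder outputs (assumed 256-dim).
    torch.manual_seed(0)
    audio = torch.randn(16, 256)
    video = torch.randn(16, 256)
    print(segment_contrastive_loss(audio, video))
```

Pre-training the feature extractors with such a segment-level objective means the downstream synchronization module can be trained separately on the frozen (or lightly fine-tuned) features, which is the decoupling referred to in the abstract.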