UNIFIED PRETRAINING TARGET BASED CROSS-MODAL VIDEO-MUSIC RETRIEVAL
- DOI: 10.60864/d13s-hx80
- Submitted by: Tianjun Mao
- Last updated: 6 June 2024 - 10:28am
- Document Type: Poster
- Presenters: Tianjun Mao
- Paper Code: MMSP-P5.8
Background music (BGM) can enhance a video’s emotion and thus make it more engaging. However, selecting an appropriate BGM often requires domain knowledge or a deep understanding of the video, which has motivated the development of video-music retrieval techniques. Most existing approaches use pre-trained video/music feature extractors trained with different target sets to obtain average video/music-level embeddings for cross-modal matching. The drawbacks are two-fold. First, the differing target sets used for video and music pretraining can make the generated embeddings difficult to match. Second, using average embeddings forfeits the underlying temporal correlation between video and music. The proposed approach leverages a unified target set to perform video/music pretraining and produces clip-level embeddings that preserve temporal information. The downstream cross-modal matching is based on these clip-level audio-visual features together with a cross-modal attention mechanism. We also explore the use of rhythm and optical flow information in this work. As there are no suitable video-music retrieval datasets, we evaluate the proposed method on our internal QQ little world dataset. Experiments demonstrate that the proposed method achieves superior video-music retrieval performance over state-of-the-art methods.
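
For illustration, below is a minimal PyTorch sketch of how clip-level video and music embeddings could be matched through a cross-modal attention mechanism, as the abstract describes. The module and parameter names, dimensions, and pooling/scoring choices are assumptions for this sketch and do not reproduce the poster's actual architecture, pretraining targets, or losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalMatcher(nn.Module):
    """Sketch: clip-level cross-modal attention matching (hypothetical design)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Each modality attends to the clip sequence of the other modality,
        # so the temporal correlation between video and music is used
        # rather than collapsing each side to a single average embedding.
        self.video_to_music = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.music_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_clips: torch.Tensor, music_clips: torch.Tensor) -> torch.Tensor:
        # video_clips: (B, Tv, dim) clip-level video embeddings
        # music_clips: (B, Tm, dim) clip-level music embeddings
        v_attended, _ = self.video_to_music(video_clips, music_clips, music_clips)
        m_attended, _ = self.music_to_video(music_clips, video_clips, video_clips)

        # Temporal mean pooling after attention, then cosine similarity
        # as the retrieval score for each video-music pair in the batch.
        v = F.normalize(v_attended.mean(dim=1), dim=-1)
        m = F.normalize(m_attended.mean(dim=1), dim=-1)
        return (v * m).sum(dim=-1)  # (B,) matching scores


if __name__ == "__main__":
    matcher = CrossModalMatcher()
    video = torch.randn(4, 16, 512)  # 4 videos, 16 clips each
    music = torch.randn(4, 20, 512)  # 4 music tracks, 20 clips each
    print(matcher(video, music).shape)  # torch.Size([4])
```

At retrieval time, such scores would be computed between a query video and each candidate music track, and the candidates ranked by score; the clip-level features themselves would come from the video/music encoders pretrained on the unified target set.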