
M3SUM: A Novel Unsupervised Language-guided Video Summarization

Submitted by: Hongru Wang
Last updated: 28 March 2024 - 11:24pm
Document Type: Presentation Slides

Language-guided video summarization lets users pose natural language queries to condense lengthy videos into concise, relevant summaries tailored to their information needs, making the content easier to access and digest. However, most previous work relies on large amounts of expensive annotated video and on complex designs that align different modalities at the feature level. In this paper, we explore combining off-the-shelf models for each modality to solve this complex multi-modal problem, proposing a novel unsupervised language-guided video summarization method, Modular Multi-Modal Summarization (M3Sum), which requires no training data or parameter updates. Specifically, instead of training an alignment module at the feature level, we convert the information from each modality (e.g., audio and frames) into textual descriptions and design a parameter-free alignment mechanism to fuse the descriptions from different modalities. Benefiting from the strong long-context understanding of large language models (LLMs), our approach performs comparably to most unsupervised methods and even outperforms certain supervised methods.
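As a rough illustration of the modular pipeline described above, the sketch below assumes hypothetical off-the-shelf components (frame captioning, audio transcription, and an LLM callable) that are not specified in the abstract; it only shows one plausible way to interleave per-modality text descriptions by timestamp and hand them, together with the user query, to a long-context LLM. It is not the authors' actual alignment mechanism.

```python
# Hypothetical sketch of a training-free, modular language-guided video
# summarization pipeline. The component functions passed in are placeholders
# (assumptions), not the paper's implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextSegment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    text: str      # textual description of this segment
    source: str    # which modality produced it ("frame" or "audio")


def fuse_by_time(frame_captions: List[TextSegment],
                 audio_transcripts: List[TextSegment]) -> str:
    """Parameter-free fusion: interleave per-modality descriptions by their
    timestamps so the LLM sees one chronologically ordered text document."""
    merged = sorted(frame_captions + audio_transcripts, key=lambda s: s.start)
    return "\n".join(
        f"[{s.start:6.1f}s-{s.end:6.1f}s][{s.source}] {s.text}" for s in merged
    )


def summarize_video(frame_captions: List[TextSegment],
                    audio_transcripts: List[TextSegment],
                    query: str,
                    llm: Callable[[str], str]) -> str:
    """Build one long-context prompt from the fused descriptions and the
    user's natural language query, then let the LLM select relevant content."""
    context = fuse_by_time(frame_captions, audio_transcripts)
    prompt = (
        "You are given time-stamped textual descriptions of a video.\n"
        f"{context}\n\n"
        f"User query: {query}\n"
        "Return a concise summary of the segments most relevant to the query."
    )
    return llm(prompt)
```

In this sketch the only "alignment" is sorting segment descriptions by time and labeling their source modality in plain text, which keeps the whole pipeline free of trainable parameters and leaves the cross-modal reasoning to the LLM.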
