Cross-modal Multiscale Difference-aware Network for Joint Moment Retrieval and Highlight Detection

DOI:
10.60864/nyy5-1398
Citation Author(s):
Mingyao Zhou, Wenjing Chen, Hao Sun, Wei Xie
Submitted by:
Mingyao Zhou
Last updated:
6 June 2024 - 10:28am
Document Type:
Methodology
Document Year:
2024
Event:
Presenters:
Mingyao Zhou
Paper Code:
MMSP-P5.11

Since both Moment Retrieval (MR) and Highlight Detection (HD) aim to quickly obtain the content a user needs from a video, several works have exploited the commonality between the two tasks to design transformer-based networks for joint MR and HD. Although these methods achieve impressive performance, they still face several problems: (a) semantic gaps across different modalities; (b) the varying durations of query-relevant moments and highlights; (c) smooth transitions among diverse events. To this end, we propose a Cross-modal Multiscale Difference-aware Network, named CMDNet. First, a clip-text alignment module is constructed to narrow the semantic gaps between modalities. Second, a multiscale difference perception module mines the differential information between adjacent clips and performs multiscale modeling to obtain discriminative representations. Finally, these representations are fed into the MR and HD task heads to retrieve relevant moments and estimate highlight scores precisely. Extensive experiments on three popular datasets demonstrate that CMDNet achieves state-of-the-art performance.
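
For readers who want a concrete picture of the pipeline described above, the following is a minimal PyTorch-style sketch: clip-text alignment via cross-attention, a multiscale difference perception stage over adjacent-clip differences, and separate MR and HD heads. All class names, dimensions, and layer choices (cross-attention fusion, dilated 1D convolutions, linear heads) are illustrative assumptions, not the authors' released implementation.

# Illustrative sketch of the CMDNet-style pipeline from the abstract.
# Module names, dimensions, and layer choices are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClipTextAlignment(nn.Module):
    """Fuses clip features with query-text features via cross-attention
    to narrow the cross-modal semantic gap (assumed design)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clip_feats, text_feats):
        fused, _ = self.attn(clip_feats, text_feats, text_feats)
        return self.norm(clip_feats + fused)


class MultiscaleDifferencePerception(nn.Module):
    """Mines differences between adjacent clips and aggregates them at
    several temporal scales (here: dilated 1D convolutions, assumed)."""
    def __init__(self, dim=256, scales=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=s, dilation=s)
            for s in scales
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                         # x: (B, T, D)
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]     # previous clip, zero for t=0
        diff = (x - prev).transpose(1, 2)         # (B, D, T) adjacent-clip difference
        multi = sum(F.relu(b(diff)) for b in self.branches).transpose(1, 2)
        return self.norm(x + multi)               # residual, back to (B, T, D)


class CMDNetSketch(nn.Module):
    """End-to-end sketch: alignment -> difference perception -> task heads."""
    def __init__(self, dim=256):
        super().__init__()
        self.align = ClipTextAlignment(dim)
        self.diff = MultiscaleDifferencePerception(dim)
        self.mr_head = nn.Linear(dim, 2)   # per-clip (center, width) span regression
        self.hd_head = nn.Linear(dim, 1)   # per-clip highlight score

    def forward(self, clip_feats, text_feats):
        x = self.diff(self.align(clip_feats, text_feats))
        return self.mr_head(x), self.hd_head(x).squeeze(-1)


if __name__ == "__main__":
    clips = torch.randn(2, 75, 256)   # 2 videos, 75 clips each
    query = torch.randn(2, 20, 256)   # 20 query tokens each
    spans, scores = CMDNetSketch()(clips, query)
    print(spans.shape, scores.shape)  # (2, 75, 2) and (2, 75)

The intuition suggested by the abstract is that the difference branch is what distinguishes this design from a plain transformer encoder: modeling adjacent-clip differences at several temporal scales emphasizes event boundaries and changes, which benefits both moment localization and highlight scoring.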
