
Cross-modal Multiscale Difference-aware Network for Joint Moment Retrieval and Highlight Detection

Citation Author(s):
Mingyao Zhou, Wenjing Chen, Hao Sun, Wei Xie
Submitted by:
Mingyao Zhou
Last updated:
12 April 2024 - 11:05pm
Document Type:
Methodology
Document Year:
2024
Presenters:
Mingyao Zhou
Paper Code:
MMSP-P5.11
 

Since both Moment Retrieval (MR) and Highlight Detection (HD) aim to quickly locate the content a user needs within a video, several works have exploited the commonality between the two tasks to design transformer-based networks for joint MR and HD. Although these methods achieve impressive performance, they still face several problems: a) semantic gaps across different modalities; b) the varied durations of query-relevant moments and highlights; c) smooth transitions among diverse events. To this end, we propose a Cross-modal Multiscale Difference-aware Network, named CMDNet. First, a clip-text alignment module is constructed to narrow the semantic gaps between modalities. Second, a multiscale difference perception module mines the differential information between adjacent clips and performs multiscale modeling to obtain discriminative representations. Finally, these representations are fed into the MR and HD task heads to retrieve relevant moments and estimate highlight scores precisely. Extensive experiments on three popular datasets demonstrate that CMDNet achieves state-of-the-art performance.
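The pipeline described above can be summarized as: align clip features with the text query, enhance them with adjacent-clip differences at multiple temporal scales, then decode MR spans and HD scores from shared representations. The following PyTorch sketch illustrates that flow under simplifying assumptions; the module structure, dimensions, cross-attention, and convolutional scales are illustrative guesses, not the authors' released CMDNet implementation.

```python
# Minimal illustrative sketch of the described pipeline (assumed design, not the official code).
import torch
import torch.nn as nn


class ClipTextAlignment(nn.Module):
    """Aligns video clip features with query text features via cross-attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clips: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # clips: (B, T, D) clip features; text: (B, L, D) query token features
        aligned, _ = self.attn(query=clips, key=text, value=text)
        return self.norm(clips + aligned)


class MultiscaleDifferencePerception(nn.Module):
    """Mines differences between adjacent clips and models them at several temporal scales."""
    def __init__(self, dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )
        self.fuse = nn.Linear(dim * len(kernel_sizes), dim)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # Adjacent-clip differences; the first position is padded with its own feature.
        diff = clips - torch.cat([clips[:, :1], clips[:, :-1]], dim=1)
        x = diff.transpose(1, 2)  # (B, D, T) for Conv1d
        multi = [conv(x).transpose(1, 2) for conv in self.convs]
        return clips + self.fuse(torch.cat(multi, dim=-1))


class CMDNetSketch(nn.Module):
    """End-to-end sketch: alignment -> multiscale differences -> MR / HD task heads."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.align = ClipTextAlignment(dim)
        self.diff = MultiscaleDifferencePerception(dim)
        self.mr_head = nn.Linear(dim, 2)  # per-clip (start, end) offsets for moment retrieval
        self.hd_head = nn.Linear(dim, 1)  # per-clip highlight (saliency) score

    def forward(self, clips: torch.Tensor, text: torch.Tensor):
        feats = self.diff(self.align(clips, text))
        return self.mr_head(feats), self.hd_head(feats).squeeze(-1)


if __name__ == "__main__":
    model = CMDNetSketch()
    clips = torch.randn(2, 75, 256)  # e.g. 75 clips per video
    text = torch.randn(2, 12, 256)   # e.g. 12 query tokens
    spans, saliency = model(clips, text)
    print(spans.shape, saliency.shape)  # (2, 75, 2) (2, 75)
```

In this sketch both task heads share the same difference-aware representations, mirroring the joint-modeling idea in the abstract; the actual CMDNet heads and losses may differ.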
