MTAF: SHOPPING GUIDE MICRO-VIDEOS POPULARITY PREDICTION USING MULTIMODAL AND TEMPORAL ATTENTION FUSION APPROACH
- Citation Author(s):
- Submitted by:
- Ningrui Ou
- Last updated:
- 10 May 2022 - 2:50am
- Document Type:
- Presentation Slides
- Document Year:
- 2022
- Event:
- Presenters:
- Ningrui Ou
- Paper Code:
- MLSP-52.2
- Categories:
Predicting the popularity of shopping guide micro-videos that incorporate merchandise is crucial for online advertising. What are the significant factors affecting the popularity of a micro-video? How can multiple modalities be extracted and effectively fused for micro-video popularity prediction? These questions urgently need answers in order to give advertisers better insights. In this paper, we propose a Multimodal and Temporal Attention Fusion (MTAF) framework to represent and combine multimodal features. Specifically, we first explore the importance of the content-agnostic factors of the micro-video using two existing tree-based ensemble methods. Furthermore, we employ three state-of-the-art pre-trained models, BERT, VGGish and ResNet152, to obtain high-level multimodal content representations of uploaders' descriptions of products, vocal emotion, and facial attractiveness, respectively. In addition, a bi-directional GRU is used to learn the early popularity trend characteristics of the micro-video. Finally, a multimodal and temporal attention mechanism layer is designed to combine all features from the multiple sources. Comprehensive experiments are conducted on a TikTok e-commerce micro-video dataset to evaluate the effectiveness of our model and of the different modalities.
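The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the form of the attention layer, so the code below assumes a generic scaled dot-product attention over four source vectors that have already been projected to a common dimension. The function names, the `query` vector, and all numeric values are hypothetical stand-ins and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, query):
    """Fuse per-source feature vectors with scaled dot-product attention.

    features: dict mapping a source name to a vector (all the same length).
    query:    hypothetical learned vector of that same length.
    Returns (fused_vector, weights_by_source).
    """
    dim = len(query)
    names = list(features)
    # One relevance score per source: dot(feature, query) / sqrt(dim).
    scores = [
        sum(f * q for f, q in zip(features[name], query)) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of the source vectors.
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy stand-ins for the projected output of each encoder (made-up numbers).
features = {
    "text": [0.2, 0.1, 0.0, 0.4],      # e.g. from BERT
    "audio": [0.0, 0.3, 0.1, 0.1],     # e.g. from VGGish
    "visual": [0.5, 0.0, 0.2, 0.1],    # e.g. from ResNet152
    "temporal": [0.1, 0.2, 0.6, 0.0],  # e.g. from the bi-directional GRU
}
query = [1.0, 0.0, 0.5, 0.0]  # hypothetical; would be learned in practice

fused, weights = attention_fuse(features, query)
```

In a trained model the query and the projections would be learned jointly with the popularity predictor; here the weights only show how a single attention layer can rank the sources and blend them into one vector.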