Multi-Source Dynamic Interactive Network Collaborative Reasoning Image Captioning
- Submitted by:
- Zhixin Li
- Last updated:
- 6 April 2024 - 5:27am
- Document Type:
- Presentation Slides
- Document Year:
- 2024
- Presenters:
- Qiang Su
- Paper Code:
- MMSP-L1.2
Rich image and text features can greatly improve the training of image captioning models. However, rich features also bring in a large amount of unnecessary information. To fully explore and exploit the key information in images and text, we treat the combination of image and text features as a data-screening problem: the combined features are dynamically screened through a series of inference strategies so as to select the optimal image and text features. First, to strengthen the model's prior knowledge, we design three input features: grid image, region image, and text. Second, the multi-source dynamic interaction network contains three feature enhancement channels: a global scene enhancement channel, a regional feature enhancement channel, and a multimodal semantic enhancement channel. Finally, a dynamic selection mechanism chooses the most appropriate enhancement features to feed to the decoder. We validate the effectiveness of the approach through comparison with baseline models, and an in-depth analysis of each module shows that the method makes fuller use of the available resources and achieves better results.
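The abstract does not specify how the dynamic selection mechanism picks among the three enhancement channels. A minimal sketch, assuming a learned softmax gate that scores each channel's output and either mixes them or hard-selects the best one, might look like the following; all names (`dynamic_select`, `gate_w`) and the gating scheme itself are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_select(channel_feats, gate_w, hard=False):
    """Select or mix enhancement-channel outputs with a learned gate.

    channel_feats: (C, seq_len, dim) stacked outputs of the C channels
                   (here C = 3: global scene, regional, multimodal semantic)
    gate_w:        (dim, C) hypothetical gate projection weights
    hard:          if True, pick a single channel; else blend them
    Returns (selected_features, gate_weights).
    """
    # Score each channel from its mean-pooled features.
    pooled = channel_feats.mean(axis=1)              # (C, dim)
    scores = np.einsum('cd,dc->c', pooled, gate_w)   # (C,)
    weights = softmax(scores)
    if hard:
        # Hard selection: route only the top-scoring channel to the decoder.
        return channel_feats[np.argmax(weights)], weights
    # Soft selection: weighted blend of all channel outputs.
    return np.einsum('c,csd->sd', weights, channel_feats), weights
```

In practice the hard path would need a differentiable relaxation (e.g. Gumbel-softmax) to train end to end; the soft path is directly differentiable.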