
Multi-Source Dynamic Interactive Network Collaborative Reasoning Image Captioning

Citation Author(s):
Submitted by: Zhixin Li
Last updated: 6 April 2024 - 5:27am
Document Type: Presentation Slides
Document Year: 2024
Event:
Presenters: Qiang Su
Paper Code: MMSP-L1.2

Rich image and text features can substantially improve the training of image captioning models. However, rich features also bring in a large amount of unnecessary information. To fully explore and exploit the key information in images and text, we treat the combination of image and text features as a data-screening problem: the combined features are dynamically screened through a series of inference strategies, with the aim of selecting the optimal image and text features. First, to strengthen the model's prior knowledge, we design three input features: grid image features, region image features, and text features. Second, we build a multi-source dynamic interaction network with three feature-enhancement channels: a global scene enhancement channel, a regional feature enhancement channel, and a multimodal semantic enhancement channel. Finally, a dynamic selection mechanism chooses the most appropriate enhanced features to feed into the decoder. We validate the effectiveness of the approach by comparing it against baseline models, and an in-depth analysis of each module shows that the method makes fuller use of the available resources to achieve better results.
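The slides themselves are not reproduced here. As a rough illustration of the architecture the abstract describes, the PyTorch sketch below wires three enhancement channels (global scene, regional feature, multimodal semantic) into a softmax gate that performs the dynamic selection before the decoder. The class name, the use of a TransformerEncoderLayer for each channel, and the linear-plus-softmax gate are all assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn


class MultiSourceDynamicInteraction(nn.Module):
    """Sketch of three feature-enhancement channels plus a dynamic
    selection gate, per the abstract. All layer choices are assumed."""

    def __init__(self, d_model: int = 512, nhead: int = 8):
        super().__init__()

        def channel() -> nn.Module:
            return nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

        # One enhancement channel per input source.
        self.global_channel = channel()    # global scene (grid features)
        self.region_channel = channel()    # regional features
        self.semantic_channel = channel()  # multimodal semantics (text)
        # Dynamic selection: score each channel, softmax over the three.
        self.gate = nn.Linear(d_model, 1)

    def forward(self, grid: torch.Tensor, region: torch.Tensor,
                text: torch.Tensor) -> torch.Tensor:
        g = self.global_channel(grid)      # (B, Ng, D)
        r = self.region_channel(region)    # (B, Nr, D)
        s = self.semantic_channel(text)    # (B, Nt, D)
        # Mean-pool each channel and turn its score into a weight.
        pooled = torch.stack([g.mean(1), r.mean(1), s.mean(1)], dim=1)  # (B, 3, D)
        weights = torch.softmax(self.gate(pooled).squeeze(-1), dim=-1)  # (B, 3)
        # Re-weight each channel's tokens and concatenate them as the
        # memory sequence handed to the caption decoder.
        return torch.cat([
            weights[:, 0, None, None] * g,
            weights[:, 1, None, None] * r,
            weights[:, 2, None, None] * s,
        ], dim=1)


# Toy shapes only; real features would come from a CNN grid extractor,
# an object detector, and a text encoder.
model = MultiSourceDynamicInteraction()
memory = model(torch.randn(2, 49, 512),   # 7x7 grid features
               torch.randn(2, 36, 512),   # detected-region features
               torch.randn(2, 20, 512))   # text features
print(memory.shape)  # torch.Size([2, 105, 512])

A single softmax gate is the simplest reading of "dynamic selection"; the paper's mechanism may instead pick channels discretely or condition the gate on the decoder state.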
