Cross-Modal Deep Networks For Document Image Classification

Citation Author(s):
Souhail Bakkali, Zuheng Ming, Mickaël Coustaty, Marçal Rusiñol
Submitted by:
Souhail Bakkali
Last updated:
2 November 2020 - 11:26am
Document Type:
Presentation Slides
Presenters Name:
Souhail Bakkali
As a fundamental step in document-related tasks, document classification has been widely adopted in various document image processing applications. Unlike the general image classification problem in computer vision, text document images contain both visual cues and the corresponding text within the image. However, how to bridge these two modalities and leverage textual and visual features to classify text document images remains challenging. In this paper, we present a cross-modal deep network that captures both the textual content and the visual information contained in document images. By efficiently learning text and image features jointly, the proposed cross-modal approach outperforms state-of-the-art single-modal methods. We propose to use NASNet-Large and BERT to extract image and text features, respectively. Experimental results demonstrate that the proposed cross-modal approach achieves new state-of-the-art results for text document image classification on the benchmark Tobacco-3482 dataset, outperforming the previous state-of-the-art method by 3.91% in classification accuracy.
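The abstract describes combining image features (from NASNet-Large) and text features (from BERT) into a joint representation for classification. The exact fusion scheme is not specified here; the following is a minimal sketch assuming a simple late-fusion design (concatenation of pre-extracted feature vectors followed by a linear scorer), written in plain Python for illustration only. All names (`fuse_features`, `classify`, the toy class labels) are hypothetical.

```python
# Minimal sketch of cross-modal late fusion for document classification.
# Assumptions (not taken from the paper): feature vectors have already been
# extracted (e.g., image features by NASNet-Large, text features by BERT),
# and fusion is simple concatenation followed by a linear classifier.
# The paper's actual fusion architecture may differ.

def fuse_features(image_feats, text_feats):
    """Concatenate image and text feature vectors into one joint vector."""
    return list(image_feats) + list(text_feats)

def linear_score(joint_feats, weights, bias):
    """Score one class: dot product of the fused vector with its weights."""
    return sum(f * w for f, w in zip(joint_feats, weights)) + bias

def classify(image_feats, text_feats, class_params):
    """Return the class whose linear score on the fused vector is highest."""
    joint = fuse_features(image_feats, text_feats)
    scores = {cls: linear_score(joint, w, b)
              for cls, (w, b) in class_params.items()}
    return max(scores, key=scores.get)

# Toy example: 2-dim image features, 2-dim text features, two classes.
params = {
    "letter": ([1.0, 0.0, 1.0, 0.0], 0.0),
    "memo":   ([0.0, 1.0, 0.0, 1.0], 0.0),
}
print(classify([0.9, 0.1], [0.8, 0.2], params))  # → letter
```

In a real implementation the two feature extractors and the fusion layer would typically be trained jointly end to end, which is what "jointly learning of text and image features" in the abstract refers to.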