Segmentation of Text-Lines and Words from JPEG Compressed Printed Text Documents Using DCT Coefficients
- Citation Author(s):
- Submitted by:
- Mohammed Javed
- Last updated:
- 7 April 2020 - 5:04am
- Document Type:
- Poster
- Document Year:
- 2020
- Event:
- Presenters:
- Mohammed Javed
- Paper Code:
- DCC2020-181
- Categories:
Segmenting a document image into text-lines and words has applications in many areas of Document Image Analysis (DIA), such as OCR, word spotting, and document retrieval. However, performing segmentation directly on compressed document images remains an unexplored and challenging research area. Since JPEG is the most widely accepted compression algorithm, this paper attempts to segment a JPEG compressed printed text document image into text-lines and words without fully decompressing the image. During JPEG compression, the non-overlapping 8x8 DCT blocks can encode the text contents of two adjacent text-lines or words within a single block, leaving no visible clue for segmentation. This paper proposes two-stage algorithms for the segmentation of text-lines and words. In the first stage, approximate text-line and word boundaries are located by analyzing the DC coefficients. In the second stage, the AC coefficients of selected DCT blocks are used to extract the exact line and word boundaries. Experimental results on a JPEG compressed document dataset (with variable spacing between lines and words, and different font sizes and styles) show good computational performance.
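The two-stage idea can be illustrated with a minimal Python sketch. It is not the algorithm from the poster, only one plausible reading of the abstract under stated assumptions: the DCT coefficients have already been entropy-decoded and dequantized, the DC values are scaled to the mean block intensity, and the coefficients use the orthonormal 8x8 DCT. The function names `approx_line_bands` and `row_means_from_block`, and the `background` and `tol` parameters, are hypothetical.

```python
import math

BLOCK = 8  # JPEG operates on 8x8 pixel blocks


def approx_line_bands(dc, background, tol=1.0):
    """Stage 1 (sketch): approximate text-line bands at block resolution.

    dc is a 2D list holding one DC value per 8x8 block, assumed to be
    scaled to the block's mean intensity. A block row counts as text when
    any of its blocks differs from the background level by more than tol.
    Returns inclusive (first_block_row, last_block_row) pairs.
    """
    bands, start = [], None
    for r, row in enumerate(dc):
        has_text = any(abs(v - background) > tol for v in row)
        if has_text and start is None:
            start = r
        elif not has_text and start is not None:
            bands.append((start, r - 1))
            start = None
    if start is not None:
        bands.append((start, len(dc) - 1))
    return bands


def row_means_from_block(coeffs):
    """Stage 2 (sketch): mean intensity of each of the 8 pixel rows in a block.

    coeffs[u][v] are orthonormal 2D DCT coefficients, with u the vertical
    frequency. Averaging along a pixel row cancels every term with v != 0,
    so only the first column (the DC term and seven vertical AC terms) is
    needed, which avoids a full inverse DCT of the block.
    """
    means = []
    for x in range(BLOCK):
        s = 0.0
        for u in range(BLOCK):
            cu = math.sqrt(1 / BLOCK) if u == 0 else math.sqrt(2 / BLOCK)
            angle = (2 * x + 1) * u * math.pi / (2 * BLOCK)
            s += cu * coeffs[u][0] * math.cos(angle)
        means.append(s / math.sqrt(BLOCK))
    return means
```

In this sketch, the first function would be run over the whole DC image, and the second only on the blocks at the top and bottom edge of each band, where a pixel-row profile can show where text actually begins or ends inside the block. The same reasoning would apply to word boundaries by using the first row of coefficients instead of the first column.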