Sorry, you need to enable JavaScript to visit this website.

Improving Run Length Encoding by Preprocessing

Citation Author(s):
Sven Fiergolla, Petra Wolf
Submitted by:
Sven Fiergolla
Last updated:
28 February 2021 - 11:04am
Document Type:
Poster
Document Year:
2021
Event:
Presenters:
Sven Fiergolla and Petra Wolf
Categories:
Keywords:
 

The Run Length Encoding (RLE) compression method is a long standing simple lossless compression scheme which is easy to implement and achieves a good compression on input data which contains repeating consecutive symbols. In its pure form RLE is not applicable on natural text or other input data with short sequences of identical symbols. We present a combination of preprocessing steps that turn arbitrary input data in a byte-wise encoding into a bit-string which is highly suitable for RLE compression. The main idea is to first read all most significant bits of the input byte-string, followed by the second most significant bit and so on. We combine this approach by a dynamic byte remapping as well as a Burrows-Wheeler-Scott transform on a byte level. Finally, we apply a Huffman Encoding on the output of the bit-wise RLE encoding to allow for more dynamic lengths of code words encoding runs of the RLE. With our technique we can achieve a lossless average compression which is better than the standard RLE compression by a factor of 8 on average.

up
0 users have voted:

Comments

Thanks for this poster presentation.
I think your ideas have potential, but need further analysis.
The first thing I noted when reading your arXiv pre-print is that there is no discussion about how to extract the BWST (Burrows-Wheeler-Scott transform)
from the C library LibDivSufSort. As far as I am aware of, this library computes the standard BWT, not the bijective one.
There is an arXiv paper [1] available on how to compute the BWST in linear time, but unfortunately, there is currently no implementation available.

Regarding compression, the common approach used for the Burrows-Wheeler transform is to apply a move-to-front (MTF) encoding, then a run-length encoding (RLE),
and finally a Huffman coding. You can see in [2, Table 2] (aliased as bwtzip), that this already gives a good compression.
In that sense, your comparison of your proposed compression scheme with just the plain RLE is unfair.
It would make much more sense to compare it against BWT+MTF+RLE+Huffman or BWST+MTF+RLE+Huffman.
In particular, I would be really interested in seeing whether (a) BWTS or BWT, as well as whether (b) MTF or Dynamic-Byte-Remapping+Vertical-Byte-Reading does a better job.

[1]: https://arxiv.org/abs/1911.06985
[2]: https://drops.dagstuhl.de/opus/volltexte/2017/7601/