
Binary-Coded Ternary Number Representation in Natural Language Text Compression

Submitted by: Igor Zavadskyi
Last updated: 23 February 2022 - 4:18am
Document Type: Presentation Slides
Document Year: 2022
Presenters: Igor Zavadskyi
Paper Code: 169

A lossless data compression code based on a binary-coded ternary number representation is investigated. The code is synchronizable and supports direct search in a compressed file. Owing to the code's properties, a simple monotone encoding and very fast decoding algorithms are constructed. Experiments show that, in natural-language text compression, the new code outperforms the byte-aligned codes SCDC and RPBC in either compression ratio or decoding speed.
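
To make the delimiter idea concrete, here is a minimal Python sketch of one possible binary-coded ternary layout (an illustration under assumed conventions, not necessarily the exact code from the slides): each trit occupies 2 bits (00, 01, 10), and the spare pattern 11 terminates a codeword.

# Hypothetical BCT layout: trits 0..2 -> 00/01/10, delimiter -> 11.
def bct_encode(n: int) -> str:
    """Encode a non-negative integer as 2-bit trits plus a '11' delimiter."""
    trits = []
    while True:
        n, t = divmod(n, 3)
        trits.append(t)          # least significant trit first
        if n == 0:
            break
    return ''.join(format(t, '02b') for t in trits) + '11'

def bct_decode(bits: str):
    """Stream integers back out of a concatenation of BCT codewords."""
    value, weight = 0, 1
    for i in range(0, len(bits), 2):
        pair = bits[i:i + 2]
        if pair == '11':         # delimiter: the codeword is complete
            yield value
            value, weight = 0, 1
        else:
            value += int(pair, 2) * weight
            weight *= 3

encoded = ''.join(bct_encode(n) for n in [0, 7, 42])
assert list(bct_decode(encoded)) == [0, 7, 42]

Since the aligned 2-bit pair 11 never occurs inside a codeword in this sketch, a decoder that knows the 2-bit alignment can resynchronize after any delimiter, and a compressed-domain search can match whole codewords against the bit stream.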


Comments

I wonder whether the ranking in the table on your first slide, "compression ratio" versus "decoding speed", is a well-known fact.
Are there any practical benchmark suites I can compile and run to observe these rankings?

I suppose that the "Codewords and blocks of trits intersection" handling is important if you want to decode not only the binary digits but also the actual codewords.
I think there are use cases where you just want to stream your decoded bits. Do you know how the measured approaches behave in such a setting?

Finally, have you thought about using the quinary numeral system instead of the ternary one?
Maybe it gives another trade-off with better compression ratio but slower decompression speed.

Regarding the table on the first slide: I measured the compression ratio and decoding speed of the RPBC, SCDC, multi-delimiter, and Fibonacci codes myself, for the texts given in the presentation (and other texts), while the comparison with Huffman, arithmetic, and ANS coding can be derived from publicly available papers.

>>I think there are use cases where you just want to stream your decoded bits. Do you know how the measured approaches behave in such a setting?

Thanks for the comment. Actually, I measured two cases: decoding the array of numbers and full decompression of the original text. Both are shown on slide 19: the number-decoding time appears before the slash, and the full decompression time after the slash.

However, if you mean streaming the bits of the decoded numbers, I do not think this is a useful use case, because how would we know where one number ends and the next one starts?

>>have you thought about using the quinary numeral system instead of the ternary one?

I thought about (and ran some experiments with) the septenary numeral system, since we can encode 7 digits plus 1 delimiter in 3 bits. By compression ratio, this approach seems to outperform the BCT code only for extremely large texts. The quinary numeral system leaves room for only 2 extra values in 3 bits, and thus it doesn't seem too efficient. However, maybe we can somehow store 5^3 = 125 numbers in 7 bits, and this could give yet another trade-off. Thanks for the idea, but it needs to be developed further.
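
For illustration, the septenary variant can be sketched along the same lines as the BCT sketch above (again an assumed layout, not taken from the paper): each 3-bit group holds one base-7 digit (000 through 110), and the spare pattern 111 serves as the delimiter.

# Hypothetical septenary layout: digits 0..6 -> 000..110, delimiter -> 111.
def sept_encode(n: int) -> str:
    """Encode a non-negative integer as 3-bit base-7 digits plus a '111' delimiter."""
    digits = []
    while True:
        n, d = divmod(n, 7)
        digits.append(d)         # least significant digit first
        if n == 0:
            break
    return ''.join(format(d, '03b') for d in digits) + '111'

def sept_decode(bits: str):
    """Stream integers back out of a concatenation of septenary codewords."""
    value, weight = 0, 1
    for i in range(0, len(bits), 3):
        group = bits[i:i + 3]
        if group == '111':       # delimiter: the codeword is complete
            yield value
            value, weight = 0, 1
        else:
            value += int(group, 2) * weight
            weight *= 7

encoded = ''.join(sept_encode(n) for n in [5, 48, 1000])
assert list(sept_decode(encoded)) == [5, 48, 1000]

The 125-numbers-in-7-bits idea would, analogously, pack three quinary digits into each 7-bit group, leaving 3 of the 128 patterns free to act as delimiters or other markers.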

Thanks for your insightful reply.
For slide 19, it is interesting to see that going from the decoded numbers to the fully decompressed text takes a significant amount of time. I wonder what the reason for that is. Also, this additional step takes roughly the same amount of time for each of the measured encodings, I suppose?

Indeed, if we just stream the bits of the decoded numbers, we need delimiters between the binary representations of the numbers for the stream to make sense.

The use of the septenary numeral system sounds very interesting to me. By "extremely large texts", do you mean particularly long numbers that need to be encoded? Would it make sense to run a scaling experiment over the bit lengths of the input numbers and measure at which bit lengths the septenary encoding beats the ternary one?