
Binary-Coded Ternary Number Representation in Natural Language Text Compression

Submitted by: Igor Zavadskyi
Last updated: 23 February 2022 - 4:18am
Document Type: Presentation Slides
Document Year: 2022
Presenters: Igor Zavadskyi
Paper Code: 169

A lossless data compression code based on a binary-coded ternary number representation is investigated. The code is synchronizable and supports direct search in a compressed file. Owing to the code's properties, a simple monotone encoding and very fast decoding algorithms are constructed. Experiments show that, in natural-language text compression, the new code outperforms the byte-aligned codes SCDC and RPBC in either compression ratio or decoding speed.
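
To make the delimiter idea concrete, here is a minimal Python sketch of one possible binary-coded ternary layout (an illustration under assumed conventions, not necessarily the exact code from the slides): each trit occupies 2 bits (00, 01, 10), and the spare pattern 11 terminates a codeword.

# Hypothetical BCT layout: trits 0..2 -> 00/01/10, delimiter -> 11.
def bct_encode(n: int) -> str:
    """Encode a non-negative integer as 2-bit trits plus a '11' delimiter."""
    trits = []
    while True:
        n, t = divmod(n, 3)
        trits.append(t)          # least significant trit first
        if n == 0:
            break
    return ''.join(format(t, '02b') for t in trits) + '11'

def bct_decode(bits: str):
    """Stream integers back out of a concatenation of BCT codewords."""
    value, weight = 0, 1
    for i in range(0, len(bits), 2):
        pair = bits[i:i + 2]
        if pair == '11':         # delimiter: the codeword is complete
            yield value
            value, weight = 0, 1
        else:
            value += int(pair, 2) * weight
            weight *= 3

encoded = ''.join(bct_encode(n) for n in [0, 7, 42])
assert list(bct_decode(encoded)) == [0, 7, 42]

Since the aligned 2-bit pair 11 never occurs inside a codeword in this sketch, a decoder that knows the 2-bit alignment can resynchronize after any delimiter, and a compressed-domain search can match whole codewords against the bit stream.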


Comments

I wonder whether the ranking in the table on your first slide, "compression ratio" versus "decoding speed", is a well-known fact.
Are there any practical benchmark suites I can compile and run to observe these rankings?

I suppose that the "Codewords and blocks of trits intersection" handling is important if you want to decode not only the binary digits but also the actual codewords.
I think there are use cases where you just want to stream your decoded bits. Do you know how the measured approaches behave in such a setting?

Finally, have you thought about using the quinary numeral system instead of the ternary one?
Maybe it gives another trade-off with better compression ratio but slower decompression speed.

Regarding the table on the first slide: I measured the compression ratio and decoding speed of the RPBC, SCDC, multi-delimiter, and Fibonacci codes myself, for the texts given in the presentation (and other texts), while the comparison with Huffman, arithmetic, and ANS coding can be derived from publicly available papers.

>>I think there are use cases where you just want to stream your decoded bits. Do you know how the measured approaches behave in such a setting?

Thanks for the comment. Actually, I measured two cases: decoding the array of numbers and full decompression of the original text. Both are shown on slide 19: the number-decoding time appears before the slash, and the full decompression time after the slash.

However, if you mean streaming the bits of the decoded numbers, I do not think this is a useful use case, because how would we know where one number ends and the next one starts?

>>have you thought about using the quinary numeral system instead of the ternary one?

I thought about (and ran some experiments with) the septenary numeral system, since we can encode 7 digits plus 1 delimiter in 3 bits. By compression ratio, this approach seems to outperform the BCT code only for extremely large texts. The quinary numeral system leaves room for only 2 extra values in 3 bits, and thus it doesn't seem too efficient. However, maybe we can somehow store 5^3 = 125 numbers in 7 bits, and this could give yet another trade-off. Thanks for the idea, but it needs to be developed further.
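
For illustration, the septenary variant can be sketched along the same lines as the BCT sketch above (again an assumed layout, not taken from the paper): each 3-bit group holds one base-7 digit (000 through 110), and the spare pattern 111 serves as the delimiter.

# Hypothetical septenary layout: digits 0..6 -> 000..110, delimiter -> 111.
def sept_encode(n: int) -> str:
    """Encode a non-negative integer as 3-bit base-7 digits plus a '111' delimiter."""
    digits = []
    while True:
        n, d = divmod(n, 7)
        digits.append(d)         # least significant digit first
        if n == 0:
            break
    return ''.join(format(d, '03b') for d in digits) + '111'

def sept_decode(bits: str):
    """Stream integers back out of a concatenation of septenary codewords."""
    value, weight = 0, 1
    for i in range(0, len(bits), 3):
        group = bits[i:i + 3]
        if group == '111':       # delimiter: the codeword is complete
            yield value
            value, weight = 0, 1
        else:
            value += int(group, 2) * weight
            weight *= 7

encoded = ''.join(sept_encode(n) for n in [5, 48, 1000])
assert list(sept_decode(encoded)) == [5, 48, 1000]

The 125-numbers-in-7-bits idea would, analogously, pack three quinary digits into each 7-bit group, leaving 3 of the 128 patterns free to act as delimiters or other markers.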

Thanks for your insightful reply.
For slide 19, it is interesting to see that going from the decoded numbers to the fully decompressed text takes a significant amount of time. I wonder what the reason for that is. Also, this additional step takes roughly the same amount of time for each of the measured encodings, I suppose?

Indeed, if we just stream the bits of the decoded numbers, we need delimiters between the binary representations of the numbers for the stream to make sense.

The use of the septenary numeral system sounds very interesting to me. By "extremely large texts", do you mean particularly long numbers that need to be encoded? Would it make sense to run a scaling experiment over the bit lengths of the input numbers and measure at which bit lengths the septenary encoding beats the ternary one?