Approximating Optimal Bidirectional Macro Schemes.

Lempel-Ziv is an easy-to-compute member of a wide family of so-called macro schemes; it restricts pointers to go in one direction only. Optimal bidirectional macro schemes are NP-complete to find, but they may provide much better compression on highly repetitive sequences. We consider the problem of approximating optimal bidirectional macro schemes. We describe a simulated annealing algorithm that usually converges quickly. Moreover, in some cases, we obtain bidirectional macro schemes that are provably a 2-approximation of the optimal. We test our algorithm on a number of artificial repetitive texts and verify that it is efficient in practice and outperforms Lempel-Ziv, sometimes by a wide margin.

DCC-2020-Approximating_Optimal_Bidirectional_Macro_Schemes.pptx

DCC-2020-Approximating_Optimal_Bidirectional_Macro_Schemes.pptx (501)

Thumbs Up

Comments

Thanks for the video!

Permalink Submitted by Dominik Koeppl on 12 April 2020 - 1:05am

Thanks for the video!
I already wondered about how to efficiently compute such an approximation, but with the link/cut tree data structure, it seems feasible.
If I understood correctly, the gist is that in some cases, your compression scheme based on simulated annealing has a size of at most 2γ, where γ is the size of the smallest string attractor.

While watching your video, I've collected some small questions:

With what configuration do you start the simulated annealing? Do you start with the LZ77 phrases (in particular, the LZ77 scheme that introduces a new character at the end of each referencing phrase)?

If I interpreted the figure in Slide 23 correctly, the black-shaded characters are the endings of the phrases. However, are there any explicit characters?

How did you evaluate the 2-approximation? Do you compute the smallest bidirectional macro scheme or the smallest string attractor via brute force?

Some references that could be interesting for future evaluation:
LZRR (https://ieeexplore.ieee.org/document/8712715) and plcpcomp (https://doi.org/10.4230/LIPIcs.ESA.2019.41) are other bidirectional macro schemes with empirically well-received approximations.
In particular, the plcpcomp approach uses ideas similar to the link/cut tree data structure for merging or splitting phrases, but with the difference that there the focus was put on improving the decompression in external memory.

For the future work: If you propose tests with large files, I wonder how fast your algorithm is.
Simulated annealing and the merge operation on the link cut tree seems to be very slow.

Finally, a didactic hint: Instead of introducing LZ78 in detail (Slide 4), it would have been better to introduce string attractors (needed in Slide 24) and a common macro scheme (e.g. LZ77).

CITE

Documents

Presentation Slides

Approximating Optimal Bidirectional Macro Schemes.

DCC-2020-Approximating_Optimal_Bidirectional_Macro_Schemes.pptx

Comments

Thanks for the video!

QUESTIONS?