Converting RLBWT to LZ77 in smaller space

Comments

About parallelism

Permalink Submitted by Dominik Koeppl on 2 March 2022 - 5:33am

If I understood correctly, the basic idea is to compute the lengths of the factors without their references, for which it is not necessary to have the suffix array entries available. Instead, a bit vector with dynamic flipping operation, augmented with a rank support data structure is used to determine the factor lengths. Subsequently, we only need to compute those SA values that would have been requested, which we subsequently do in a second pass.
Although your experiments show, for each number of threads, a time measure for the first pass, the first pass is purely sequential, so no matter how many threads you use, the time should stay roughly the same? Regarding parallelism, I am thinking of a heuristic that first partitions the text into blocks of equal length and then runs the factorization for each part; this should also give a valid LZ77-like parsing that could be computed completely in parallel. Also, maybe we could reconstruct from that the correct LZ77 parsing, but I think that you need to compute the parsing from scratch for certain bad examples - nevertheless it would be an interesting heuristic to try out on real data.
I think that your algorithm also works when you do not include the extra characters in your phrases such that you can represent them in the LZSS-fashioned pairs of lengths and positions (π_h, λ_h)?

CITE

Documents

Presentation Slides

Converting RLBWT to LZ77 in smaller space

RLBWT2LZ77.pdf

Comments

About parallelism

QUESTIONS?