If I understood correctly, the basic idea is to compute the lengths of the factors without their references, for which it is not necessary to have the suffix array entries available. Instead, a bit vector with dynamic flipping operation, augmented with a rank support data structure is used to determine the factor lengths. Subsequently, we only need to compute those SA values that would have been requested, which we subsequently do in a second pass.
Although your experiments show, for each number of threads, a time measure for the first pass, the first pass is purely sequential, so no matter how many threads you use, the time should stay roughly the same? Regarding parallelism, I am thinking of a heuristic that first partitions the text into blocks of equal length and then runs the factorization for each part; this should also give a valid LZ77-like parsing that could be computed completely in parallel. Also, maybe we could reconstruct from that the correct LZ77 parsing, but I think that you need to compute the parsing from scratch for certain bad examples - nevertheless it would be an interesting heuristic to try out on real data.
I think that your algorithm also works when you do not include the extra characters in your phrases such that you can represent them in the LZSS-fashioned pairs of lengths and positions (π_h, λ_h)?
[1] Masaki Shigekuni, Tomohiro I,
"Converting RLBWT to LZ77 in smaller space",
IEEE Signal Processing Society SigPort,
2022. [Online]. Available: https://sigport.org/documents/converting-rlbwt-lz77-smaller-space. Accessed: Oct. 07, 2024.
@misc{6195-22,
url = {https://sigport.org/documents/converting-rlbwt-lz77-smaller-space},
author = {Masaki Shigekuni; Tomohiro I },
publisher = {IEEE Signal Processing Society SigPort},
title = {Converting RLBWT to LZ77 in smaller space},
year = {2022}
}
TY - DATA
T1 - Converting RLBWT to LZ77 in smaller space
AU - Masaki Shigekuni; Tomohiro I
PY - 2022
PB - IEEE Signal Processing Society SigPort
UR - https://sigport.org/documents/converting-rlbwt-lz77-smaller-space
ER -
Masaki Shigekuni, Tomohiro I.
(2022).
Converting RLBWT to LZ77 in smaller space.
IEEE Signal Processing Society SigPort.
https://sigport.org/documents/converting-rlbwt-lz77-smaller-space
Masaki Shigekuni, Tomohiro I,
2022.
Converting RLBWT to LZ77 in smaller space.
Available at:
https://sigport.org/documents/converting-rlbwt-lz77-smaller-space.
1. Masaki Shigekuni, Tomohiro I.
Converting RLBWT to LZ77 in smaller space [Internet].
IEEE Signal Processing Society SigPort; 2022.
Available from :
https://sigport.org/documents/converting-rlbwt-lz77-smaller-space
Comments
About parallelism
If I understood correctly, the basic idea is to compute the lengths of the factors without their references, for which it is not necessary to have the suffix array entries available. Instead, a bit vector with dynamic flipping operation, augmented with a rank support data structure is used to determine the factor lengths. Subsequently, we only need to compute those SA values that would have been requested, which we subsequently do in a second pass.
Although your experiments show, for each number of threads, a time measure for the first pass, the first pass is purely sequential, so no matter how many threads you use, the time should stay roughly the same? Regarding parallelism, I am thinking of a heuristic that first partitions the text into blocks of equal length and then runs the factorization for each part; this should also give a valid LZ77-like parsing that could be computed completely in parallel. Also, maybe we could reconstruct from that the correct LZ77 parsing, but I think that you need to compute the parsing from scratch for certain bad examples - nevertheless it would be an interesting heuristic to try out on real data.
I think that your algorithm also works when you do not include the extra characters in your phrases such that you can represent them in the LZSS-fashioned pairs of lengths and positions (π_h, λ_h)?