Computing Matching Statistics on Repetitive Texts

Comments

Omitting pattern positions

Permalink Submitted by Dominik Koeppl on 22 March 2022 - 7:38am

If I understood correctly, you adapted the LZ78-index for matching statistics.
The $m^2$ term in your time bounds (at least for the first solution) is for querying the labeled binary relation for every pattern position, and traversing both Patricia tries for each match.
I wonder whether the number of those pattern positions can be reduced if we would exchange the LZ78-index with a grammar index using Lyndon words or locality sensitive parsing such as:

Kazuya Tsuruta, Dominik Köppl, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda:
Grammar-compressed Self-index with Lyndon Words. CoRR abs/2004.05309 (2020)

Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza:
Optimal-Time Dictionary-Compressed Indexes. ACM Trans. Algorithms 17(1): 8:1-8:39 (2021)

RE: Omitting pattern positions

Permalink Submitted by Younan Gao on 22 March 2022 - 4:07pm

Hi,

Thanks for the question. Indeed, a patter of length of m could cross at most (m-1) phrase boundaries. For each of them, we check at most m positions; as a result, we have the term O(m^2).

Except the first solution, the other solutions require O(gamma log n) words of space, where gamma is the size of smallest string attractor.
It would be exiting to find a data structure of O(g) or O(gamma log (n / gamma)) words that supports the computation in O(m polylog n) time, although I don’t have a specific idea of how to do that at this moment. But both papers you shared could be good directions. Especially, I think it might be worth a try to apply the grammar-compressed index.

Thanks!

CITE

Documents

Presentation Slides

Computing Matching Statistics on Repetitive Texts

dcc-slides-0305.pdf

Comments

Omitting pattern positions

RE: Omitting pattern positions

QUESTIONS?