# Computing Matching Statistics on Repetitive Texts

Citation Author(s):
Younan Gao
Submitted by:
Younan Gao
Last updated:
5 March 2022 - 2:51pm
Document Type:
Presentation Slides
Document Year:
2022
Event:
Presenters:
Younan Gao
Paper Code:
123
Categories:
Keywords:

## dcc-slides-0305.pdf

### Omitting pattern positions

If I understood correctly, you adapted the LZ78-index for matching statistics.
The \$m^2\$ term in your time bounds (at least for the first solution) is for querying the labeled binary relation for every pattern position, and traversing both Patricia tries for each match.
I wonder whether the number of those pattern positions can be reduced if we would exchange the LZ78-index with a grammar index using Lyndon words or locality sensitive parsing such as:

Kazuya Tsuruta, Dominik Köppl, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda:
Grammar-compressed Self-index with Lyndon Words. CoRR abs/2004.05309 (2020)

Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka￼, Gonzalo Navarro, Nicola Prezza￼:
Optimal-Time Dictionary-Compressed Indexes. ACM Trans. Algorithms 17(1): 8:1-8:39 (2021)

### RE: Omitting pattern positions

Hi,

Thanks for the question. Indeed, a patter of length of m could cross at most (m-1) phrase boundaries. For each of them, we check at most m positions; as a result, we have the term O(m^2).

Except the first solution, the other solutions require O(gamma log n) words of space, where gamma is the size of smallest string attractor.
It would be exiting to find a data structure of O(g) or O(gamma log (n / gamma)) words that supports the computation in O(m polylog n) time, although I don’t have a specific idea of how to do that at this moment. But both papers you shared could be good directions. Especially, I think it might be worth a try to apply the grammar-compressed index.

Thanks!