Computing Matching Statistics on Repetitive Texts
- Citation Author(s):
- Submitted by:
- Younan Gao
- Last updated:
- 5 March 2022 - 2:51pm
- Document Type:
- Presentation Slides
- Document Year:
- Younan Gao
- Paper Code:
Omitting pattern positions
If I understood correctly, you adapted the LZ78-index for matching statistics.
The $m^2$ term in your time bounds (at least for the first solution) is for querying the labeled binary relation for every pattern position, and traversing both Patricia tries for each match.
I wonder whether the number of those pattern positions can be reduced if we would exchange the LZ78-index with a grammar index using Lyndon words or locality sensitive parsing such as:
Kazuya Tsuruta, Dominik Köppl, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda:
Grammar-compressed Self-index with Lyndon Words. CoRR abs/2004.05309 (2020)
Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka￼, Gonzalo Navarro, Nicola Prezza￼:
Optimal-Time Dictionary-Compressed Indexes. ACM Trans. Algorithms 17(1): 8:1-8:39 (2021)
RE: Omitting pattern positions
Thanks for the question. Indeed, a patter of length of m could cross at most (m-1) phrase boundaries. For each of them, we check at most m positions; as a result, we have the term O(m^2).
Except the first solution, the other solutions require O(gamma log n) words of space, where gamma is the size of smallest string attractor.
It would be exiting to find a data structure of O(g) or O(gamma log (n / gamma)) words that supports the computation in O(m polylog n) time, although I don’t have a specific idea of how to do that at this moment. But both papers you shared could be good directions. Especially, I think it might be worth a try to apply the grammar-compressed index.