
A Grammar Compressor for Collections of Reads with Applications to the Construction of the BWT

Citation Author(s):
Diego Diaz-Dominguez, Gonzalo Navarro
Submitted by:
Diego Diaz
Last updated:
1 March 2021 - 12:58am
Document Type:
Presentation Slides
Document Year:
2021
Presenters:
Diego Diaz

We describe a grammar for collections of DNA sequencing reads from which we can compute the BWT directly. Our motivation is to perform, in succinct space, genomic analyses that require complex string queries not yet supported by repetition-based self-indexes. Our approach is to store the set of reads as a grammar and, when required, compute its BWT so that the analysis can be carried out using self-indexes. Our experiments on real data show that the space reduction achieved by our compressor is competitive with LZ-based methods and better than entropy-based approaches. Compared to other popular grammar compressors on this kind of data, we achieve, on average, 12% extra compression while requiring less working space and time.


Comments

Your proposed method seems very similar to the preprint [1], which introduces a linear-time construction of the bijective BWT. You can also use that construction algorithm to compute the eBWT. The connection is the following: the eBWT requires a set of primitive strings, and every primitive string has a conjugate that is a Lyndon word. So you take all these Lyndon conjugates and sort them in descending order. If you concatenate them into a single string and compute the bijective BWT of that string, you get the eBWT of your primitive strings (a sketch checking this is below).
I just wonder whether you could take advantage of the algorithm of [1] for your computation. You would probably have to change the definition of the grammar to use LMS inf-substrings ([1] does not use dollar signs to separate the strings). Otherwise, ideas like simulating the circularity in Algorithm 2 of your preprint [2] are quite similar to those in [1]. In that respect, you might get the time complexity on slide 6 down to $O(n \cdot t_{dict})$, where $t_{dict}$ is the time for looking up and storing the rules in a dictionary.
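
To make the connection concrete, here is a minimal sketch in Python that checks the claim on small inputs. Everything in it is naive and quadratic (least rotation by enumerating conjugates, omega-order comparison via a Fine-and-Wilf length bound, Duval's algorithm for the Lyndon factorization), and the function names and example strings are mine, not taken from [1] or [2].

from functools import cmp_to_key

def lyndon_conjugate(s):
    # The lexicographically least rotation of a primitive string
    # is its unique Lyndon conjugate.
    return min(s[i:] + s[:i] for i in range(len(s)))

def omega_cmp(u, v):
    # Compare the infinite periodic extensions u^w and v^w.
    # By Fine and Wilf, any difference shows up within
    # len(u) + len(v) symbols.
    l = len(u) + len(v)
    uu = (u * (l // len(u) + 1))[:l]
    vv = (v * (l // len(v) + 1))[:l]
    return (uu > vv) - (uu < vv)

def ebwt(strings):
    # eBWT of a multiset of primitive strings: omega-sort all
    # their rotations and take the last symbol of each.
    rots = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    rots.sort(key=cmp_to_key(omega_cmp))
    return "".join(r[-1] for r in rots)

def lyndon_factors(s):
    # Duval's algorithm: factor s into a non-increasing
    # sequence of Lyndon words.
    i, n, out = 0, len(s), []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            out.append(s[i:i + j - k])
            i += j - k
    return out

def bbwt(s):
    # Bijective BWT of s: the eBWT of its Lyndon factors.
    return ebwt(lyndon_factors(s))

# Arbitrary primitive example strings.
reads = ["tagc", "cgta", "aac"]
conjugates = sorted((lyndon_conjugate(r) for r in reads), reverse=True)
assert bbwt("".join(conjugates)) == ebwt(reads)

The descending sort is the crucial step: concatenating the Lyndon conjugates in non-increasing order makes the Lyndon factorization of the concatenation coincide with the conjugates themselves, so the bijective BWT sorts exactly the same multiset of rotations as the eBWT of the original strings.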

[1]: https://arxiv.org/abs/1911.06985
[2]: https://arxiv.org/abs/2102.03961

Some small comments:

page 6:
- what is $k$ in the time complexity?
- the grayed-out numbers to the right of $BWT(T^3)$ represent $SA(T^3)$, right?

page 12: is $r$ the number of runs in your BWT?
page 13: how are the genomes stored such that you get 12.77 GB per genome? At one byte per base pair, the entire *diploid* human genome (roughly 2 x 3.1 Gbp) would only take about 6 GB.
page 14: is the RLFM also built with PFP, as RePair is? (see https://arxiv.org/abs/1803.11245)