A Grammar Compressor for Collections of Reads with Applications to the Construction of the BWT

We describe a grammar for DNA sequencing reads from which we can compute the BWT directly. Our motivation is to perform, in succinct space, genomic analyses that require complex string queries not yet supported by repetition-based self-indexes. Our approach is to store the set of reads as a grammar and, when required, compute its BWT so that the analysis can be carried out with self-indexes. Our experiments on real data showed that the space reduction we achieve with our compressor is competitive with LZ-based methods and better than entropy-based approaches. Compared to other popular grammars, on this kind of data we achieve, on average, 12% extra compression and require less working space and time.

DCC21_ddiaz_gnav_slides.pdf

Comments

Possible Speed-Up?

Permalink Submitted by Dominik Koeppl on 14 March 2021 - 6:39am

Your proposed method seems to be very similar to the preprint [1]
introducing a linear-time construction of the bijective
BWT. You can use this construction algorithm also for computing the
eBWT. The connection is the following: The eBWT requires a set of primitive
strings, but every primitive string has a conjugate that is a Lyndon
word. So you take all these Lyndon conjugates and sort them in
descending order. If you concatenate them together to a single string, and compute
the bijective BWT of this single string, you get the eBWT of your
primitive strings.
I just wonder whether you can take advantage of the algorithm of [1] for your computation?
You would probably have to change the definition of the grammar to use the
definition of LMS inf-substrings ([1] does not use the dollar signs to
separate the strings). Otherwise, ideas such as simulating the circularity
in Algorithm 2 of your preprint [2] are quite similar to those in [1].
In that respect, you might get the time complexity on slide 6 down to O(n t_dict),
where t_dict is the time for looking up and storing the rules in a dictionary.
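
To make the connection between the eBWT and the bijective BWT concrete, here is a minimal Python sketch. It is not taken from [1] or [2], and it uses naive sorting of rotations rather than a linear-time construction; the function names and the example strings "banana" and "cab" are illustrative assumptions. It rotates each primitive string to its Lyndon conjugate, sorts the conjugates in descending order, concatenates them, and checks that Duval's Lyndon factorization of the concatenation gives back exactly those conjugates. The bijective BWT of the concatenation, taken here as the eBWT of its Lyndon factors, then coincides with the eBWT of the original strings.

```python
from functools import cmp_to_key


def lyndon_conjugate(s):
    # Smallest rotation of s; for a primitive string this is its Lyndon conjugate.
    return min(s[i:] + s[:i] for i in range(len(s)))


def duval(s):
    # Lyndon factorization of s using Duval's algorithm.
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def omega_cmp(u, v):
    # Compare the infinite repetitions of u and v; a prefix of
    # length |u| + |v| is enough to decide the order.
    n = len(u) + len(v)
    a, b = (u * n)[:n], (v * n)[:n]
    return (a > b) - (a < b)


def ebwt(strings):
    # Naive eBWT: sort all rotations of all strings in omega-order
    # and read off the last character of each rotation.
    rotations = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    rotations.sort(key=cmp_to_key(omega_cmp))
    return "".join(r[-1] for r in rotations)


def bbwt(text):
    # Bijective BWT, computed as the eBWT of the Lyndon factors of text.
    return ebwt(duval(text))


reads = ["banana", "cab"]  # illustrative primitive strings, no dollar signs
conjugates = sorted((lyndon_conjugate(s) for s in reads), reverse=True)
text = "".join(conjugates)

# The Lyndon factors of the concatenation are exactly the sorted conjugates,
# so both transforms sort the same set of rotations.
assert duval(text) == conjugates
assert bbwt(text) == ebwt(reads)
```

The descending order matters: a Lyndon factorization is always non-increasing, so any other order of the conjugates could merge or split factors and change the result.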

[1]: https://arxiv.org/abs/1911.06985
[2]: https://arxiv.org/abs/2102.03961

Some small comments:

page 6:
- what is $k$ in the time complexity?
- the grayed out numbers to the right of the BWT(T^3) represent SA(T^3), right?

page 12: is $r$ the number of runs of your BWT?
page 13: how are the genomes stored such that you get 12.77 GB per genome? One byte per base pair would only give ~6GB for the entire *diploid* human genome.
page 14: is the RLFM also built like RePair with PFP? (see https://arxiv.org/abs/1803.11245)
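
The back-of-the-envelope estimate behind the page 13 question can be sketched as follows. It assumes a haploid human genome of roughly 3.1 billion base pairs, which is an approximate figure and not taken from the slides; the function name is hypothetical.

```python
HAPLOID_BP = 3.1e9  # assumed approximate haploid human genome length in base pairs


def genome_size_gb(bytes_per_base, ploidy=2, haploid_bp=HAPLOID_BP):
    # Storage in GB (10^9 bytes) for a genome at the given encoding density.
    return bytes_per_base * ploidy * haploid_bp / 1e9


one_byte_per_base = genome_size_gb(1.0)   # plain one-byte-per-base text, diploid
two_bits_per_base = genome_size_gb(0.25)  # 2-bit packed encoding, diploid

# Under these assumptions, one byte per base stays well below 12.77 GB.
assert one_byte_per_base < 12.77
```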
