MPEG-G Reference-Based Compression of Unaligned Reads Through Ultra-Fast Alignments
- Submitted by: Unsal Ozturk
- Last updated: 4 March 2022 - 4:14pm
- Document Type: Poster
- Document Year: 2022
- Presenters: Unsal Ozturk
- Paper Code: 186
With the widespread application of next-generation sequencing technologies, the volume of sequencing data has become comparable to that of big data domains. Compressing sequencing reads (nucleotide sequences, quality values, and read names), whether raw or aligned, alleviates the bandwidth, transfer, and storage requirements of genomics pipelines. ISO/IEC MPEG-G standardizes the compressed representation (i.e., storage and streaming) of structured, indexed sets of genomic sequencing data, both raw and aligned. For aligned data, reference-based compression uses the alignment of each read to a reference sequence: a read's nucleotide sequence is represented by the starting position of its alignment on the reference and the differences between the reference and the actual read. Genomic data compressors that operate on aligned reads, such as DeeZ, Quip, and CRAM, implement this general scheme in different ways.
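The general scheme described above can be illustrated with a minimal Python sketch. It is not taken from MPEG-G, DeeZ, Quip, or CRAM: the names `EncodedRead`, `encode_read`, and `decode_read` are hypothetical, and the sketch assumes a gap-free alignment (substitutions only, no insertions, deletions, or clipping).

```python
from typing import List, NamedTuple, Tuple


class EncodedRead(NamedTuple):
    """Hypothetical container: a read stored relative to a reference."""
    position: int                      # 0-based start of the alignment on the reference
    length: int                        # number of bases in the read
    mismatches: List[Tuple[int, str]]  # (offset within the read, base seen in the read)


def encode_read(reference: str, read: str, position: int) -> EncodedRead:
    """Keep only the start position and the bases that differ from the reference."""
    if position < 0:
        raise ValueError("position must be non-negative")
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("alignment runs past the end of the reference")
    mismatches = [
        (offset, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]
    return EncodedRead(position, len(read), mismatches)


def decode_read(reference: str, encoded: EncodedRead) -> str:
    """Rebuild the read by copying the reference window and patching the differences."""
    bases = list(reference[encoded.position:encoded.position + encoded.length])
    for offset, base in encoded.mismatches:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    read = "ACGATAG"  # aligns at position 4 with a single substitution
    encoded = encode_read(reference, read, position=4)
    print(encoded)
    assert decode_read(reference, encoded) == read
```

A read that matches the reference closely reduces to a position, a length, and a short list of differences, which is why the quality of the alignment directly affects the compressed size.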
This work presents a preprocessing stage for the reference-based compression of unaligned reads using the MPEG-G standard. Reads are first aligned to a reference sequence with a fast aligner, with the aim of finding "good-enough" alignments without regard to biological constraints. The output of the alignment step is then transcoded into the MPEG-G records defined in ISO/IEC 23092-2, and any alignment information not useful for reference-based compression is discarded. Finally, the MPEG-G records are encoded with a proprietary MPEG-G encoder. The experiments were performed on two public sequencing read sets, ERR174310.chr9 and G15511.HCC1143_BL.1.chr9, with hs37d5 as the reference sequence. For the fast-alignment step, GEM2, BWA-MEM, and minimap were tested and tuned to run as fast as possible. GEM2 was the fastest of the three and was therefore used for the rest of the experiments. A software package called g2tc was implemented to transcode GEM2 output into MPEG-G records, which were then compressed by the MPEG-G encoder. When the whole pipeline was allowed as much time as gzip compression at level 6, fast alignment improved on the compression rates of unaligned MPEG-G by 5%, CRAM by 5%, and gzip by 15-20%. Further improvements in compression speed are expected from running multiple GEM2 instances on the data (loading the GEM index into memory multiple times), parallel transcoding, and in-memory rather than on-disk communication between programs.
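The idea of discarding alignment information that reference-based compression does not need can be sketched in Python as follows. This is not g2tc and it does not model GEM2's native output or the ISO/IEC 23092-2 record layout. As an assumption made purely for illustration, it takes a tab-separated SAM alignment line as input, and `MinimalAlignment` and `strip_sam_line` are hypothetical names.

```python
from typing import NamedTuple, Optional


class MinimalAlignment(NamedTuple):
    """Hypothetical record: only what is needed to rebuild the read from a reference."""
    read_name: str
    reference_name: str
    position: int         # 1-based leftmost mapping position, as in SAM
    reverse_strand: bool  # SAM flag bit 0x10
    cigar: str
    sequence: str
    qualities: str


def strip_sam_line(line: str) -> Optional[MinimalAlignment]:
    """Drop MAPQ, mate fields, and optional tags; return None for unmapped reads."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 mandatory fields")
    # Mandatory SAM columns; anything after column 11 is an optional tag.
    (qname, flag, rname, pos, _mapq, cigar,
     _rnext, _pnext, _tlen, seq, qual) = fields[:11]
    flag_bits = int(flag)
    if flag_bits & 0x4:  # read is unmapped, so there is no alignment to keep
        return None
    return MinimalAlignment(
        read_name=qname,
        reference_name=rname,
        position=int(pos),
        reverse_strand=bool(flag_bits & 0x10),
        cigar=cigar,
        sequence=seq,
        qualities=qual,
    )


if __name__ == "__main__":
    sam_line = "read1\t16\t9\t1000\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
    print(strip_sam_line(sam_line))
```

In this sketch the mapping quality, mate fields, and optional tags are dropped because they describe the alignment itself rather than the read's content. Which fields an actual transcoder keeps depends on the target record format.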
Comments
Comparison with RLZ?
Your approach reminds me of relative Lempel-Ziv (RLZ, https://dx.doi.org/10.1007%2F978-3-642-16321-0_20).
It would be interesting to see whether a connection can be drawn between your approach and RLZ. As far as I understand your approach, you could use RLZ instead of an aligner.