
A Benchmark of Entropy Coders for the Compression of Genome Sequencing Data

Citation Author(s):
Simone Casale-Brunet, Paolo Ribeca, Claudio Alberti, Unsal Ozturk, Marco Mattavelli
Submitted by:
Unsal Ozturk
Last updated:
4 March 2022 - 4:32pm
Document Type:
Poster
Document Year:
2022
Event:
Data Compression Conference (DCC) 2022
Presenters:
Unsal Ozturk
Paper Code:
181

Genomic sequencing data contain three different data fields: read names, quality values, and nucleotide sequences. In this work, a variety of entropy coders and compression algorithms were benchmarked, in terms of compression and decompression rates and times, separately on each data field, both as raw data extracted from FASTQ files (implemented in the Fastq analysis script) and as uncompressed MPEG-G descriptor symbols decoded from MPEG-G bitstreams (implemented in the symbols analysis script). The results are compared against CABAC, the encoder adopted for all descriptor types in the first edition of the ISO/IEC MPEG-G standard because it achieved the best overall compression rates across the three data types. However, in some use cases encoding and decoding speed matters more than compression rate, and for specific datasets, data types, or descriptor streams, other entropy coders may offer higher speed and/or better compression than CABAC.
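
As an illustration of the per-field approach, the following minimal sketch (not the authors' benchmark harness) splits a FASTQ file into its three data fields and times each codec on each field. Only zlib, bzip2, and lzma are shown here because they ship with Python's standard library; the other benchmarked codecs require third-party bindings. The input path reads.fastq is a hypothetical placeholder.

import bz2
import lzma
import time
import zlib


def split_fastq_fields(path):
    """Split a FASTQ file into its three data fields.

    FASTQ records are four lines each: name, sequence, '+', qualities.
    """
    names, seqs, quals = [], [], []
    with open(path, "rb") as fh:
        while True:
            name = fh.readline()
            seq = fh.readline()
            _plus = fh.readline()  # separator line, not benchmarked
            qual = fh.readline()
            if not qual:  # end of file (or truncated record)
                break
            names.append(name)
            seqs.append(seq)
            quals.append(qual)
    return b"".join(names), b"".join(seqs), b"".join(quals)


def benchmark(field_name, data):
    """Time compression of one field and report ratio and throughput."""
    for codec_name, compress in [
        ("zlib", lambda d: zlib.compress(d, 9)),
        ("bzip2", lambda d: bz2.compress(d, 9)),
        ("lzma", lambda d: lzma.compress(d, preset=6)),
    ]:
        t0 = time.perf_counter()
        out = compress(data)
        dt = time.perf_counter() - t0
        print(f"{field_name:>9} {codec_name:>5}: "
              f"ratio={len(data) / len(out):5.2f}, "
              f"speed={len(data) / dt / 1e6:7.1f} MB/s")


names, seqs, quals = split_fastq_fields("reads.fastq")  # hypothetical input
for label, blob in [("names", names), ("sequences", seqs), ("qualities", quals)]:
    benchmark(label, blob)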

rANS, bsc, bzip2, lzma, LZHAM, zlib, brotli, LZ4, and zstd were benchmarked separately on each field and on each uncompressed descriptor stream, using both aligned and unaligned sequencing reads derived from the publicly available ERR174310 and G15511.HCC1143BL.1 sequencing datasets. The full experimental results are hosted on a GitHub repository (https://github.com/epfl-scistimm/2022-DCC). Since the performance metrics form a partial order, the codecs cannot be ranked outright; we conclude instead that replacing CABAC with other compressors for different data fields in different use cases could improve overall encoding-decoding performance. The results suggest that bsc is the most suitable codec for archival purposes, as it provides high compression rates at the cost of lower compression and decompression speeds. In general, zstd (and to a similar extent brotli) sat closer to the Pareto frontier of speed versus compression rate than the other codecs (depending on the encoding parameters), and could therefore serve as a general-purpose compressor for read names and quality values. For use cases where compression-decompression speed or throughput matters more than compression rate, LZ4 could be used: it provided by far the fastest compression, albeit at lower compression rates.
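
To make the partial-order point concrete: one codec dominates another only if it is at least as good in both speed and compression ratio, and the Pareto frontier is the set of non-dominated codecs. The sketch below extracts that frontier from (speed, ratio) points; the numbers are illustrative placeholders, not measurements from this benchmark.

def pareto_frontier(points):
    """Keep points not dominated in both speed and ratio by any other."""
    frontier = []
    for name, speed, ratio in points:
        dominated = any(
            s >= speed and r >= ratio and (s > speed or r > ratio)
            for _n, s, r in points
        )
        if not dominated:
            frontier.append((name, speed, ratio))
    return frontier


codecs = [  # (name, MB/s, ratio) -- placeholder values only
    ("LZ4", 500.0, 2.1),
    ("zstd", 120.0, 3.4),
    ("brotli", 40.0, 3.6),
    ("bsc", 15.0, 4.2),
    ("zlib", 60.0, 2.9),  # dominated by zstd in this toy data
]
print(pareto_frontier(codecs))  # zlib drops out; the rest form the frontier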


Comments

Thanks for this elaborate benchmark of compression methods for FASTQ.
Your benchmark covers several general-purpose compressors, but
may I ask whether you also had a look at compressors specialized for FASTQ data,
such as LW-FQZip, MZPAQ, genozip (https://github.com/divonlan/genozip), etc.?