Contact Matrix Compressor

Studying the three-dimensional folding of chromosomes is important for understanding genomic processes. Techniques such as Hi-C analyze the spatial organization of chromosomes in a cell. The resulting data are two-dimensional quantitative maps indexed by genomic coordinates. We present a novel approach called Contact Matrix Compressor (CMC) for the efficient compression of Hi-C data. By exploiting properties of the data, such as diagonal dominance and symmetry, CMC achieves much higher compression.
CMC outperforms the existing method Cooler as well as the general-purpose compressors LZMA and BZip2.
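
To illustrate why symmetry and diagonal dominance help, here is a minimal Python sketch. It is not CMC's actual transform; the function names and the toy matrix are assumptions made for illustration. It keeps only the upper triangle of a symmetric matrix and groups the values by diagonal, so that the large near-diagonal counts end up together and the redundant lower triangle is never stored.

```python
def upper_triangle_by_diagonals(matrix):
    """Return the upper triangle of a symmetric square matrix, one list per diagonal.

    Diagonal 0 is the main diagonal; diagonal d holds entries matrix[i][i + d].
    """
    n = len(matrix)
    return [[matrix[i][i + d] for i in range(n - d)] for d in range(n)]


def rebuild_symmetric(diagonals):
    """Inverse of upper_triangle_by_diagonals: mirror each value across the diagonal."""
    n = len(diagonals)
    matrix = [[0] * n for _ in range(n)]
    for d, diagonal in enumerate(diagonals):
        for i, value in enumerate(diagonal):
            matrix[i][i + d] = value
            matrix[i + d][i] = value
    return matrix


# Toy symmetric contact matrix (hypothetical values): counts decay away from the diagonal.
contacts = [
    [9, 4, 1, 0],
    [4, 8, 3, 1],
    [1, 3, 7, 2],
    [0, 1, 2, 9],
]

diagonals = upper_triangle_by_diagonals(contacts)
stored_values = sum(len(diagonal) for diagonal in diagonals)
print(diagonals)
print(stored_values, "values stored instead of", len(contacts) ** 2)
```

For an n x n symmetric matrix this stores n(n+1)/2 values instead of n^2, and each diagonal tends to contain values of similar magnitude, which is the kind of regularity a compressor can exploit.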

DCC2022_presentation_v2.pptx

Comments

Experiments on general-purpose compressors

Submitted by Dominik Koeppl on 10 March 2022 - 12:34am

If I understood correctly, then the contact matrices are quadratic (i.e., square) matrices whose entries are natural numbers?
In some settings your matrices are symmetric, but that is not always the case?

Can you give more details on how the experiments with LZMA and bzip2 were performed?
Did you apply them on the raw data or on the Cooler format?

Finally, a nifty comment: On Slide 28, I think you only need 7 bits for each row, since you can represent the range [0..127] with just 7 bits.

Re:Experiments on general-purpose compressors

Submitted by Yeremia Gunawan... on 21 March 2022 - 7:23am

Thanks for the questions!

> If I understood correctly, then the contact matrices are quadratic (i.e., square) matrices whose entries are natural numbers?
Correct.
> In some settings your matrices are symmetric, but that is not always the case?
The full contact matrix is always quadratic (square), but sub-contact matrices and tile matrices need not be.
The contact matrix is split into sub-contact matrices according to chromosome pairs (intra- or inter-chromosomal).
Because chromosomes differ in length, inter-chromosomal sub-contact matrices may be non-quadratic.
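
As a rough illustration of the split described above, the following Python sketch lists the shape of each sub-contact matrix for a set of chromosomes. The chromosome names, bin counts, and the function `sub_matrix_shapes` are hypothetical and are not taken from CMC.

```python
from itertools import combinations_with_replacement


def sub_matrix_shapes(bins_per_chromosome):
    """Map each unordered chromosome pair to the (rows, cols) of its sub-contact matrix."""
    shapes = {}
    for a, b in combinations_with_replacement(bins_per_chromosome, 2):
        shapes[(a, b)] = (bins_per_chromosome[a], bins_per_chromosome[b])
    return shapes


# Hypothetical number of genomic bins per chromosome at some resolution.
bins = {"chr1": 5, "chr2": 3, "chr3": 3}
shapes = sub_matrix_shapes(bins)

for (a, b), (rows, cols) in shapes.items():
    kind = "intra" if a == b else "inter"
    print(f"{a} x {b} ({kind}): {rows} x {cols}")
```

Intra-chromosomal blocks such as chr1 x chr1 are always square, while an inter-chromosomal block is square only when both chromosomes happen to have the same number of bins (chr2 x chr3 in this toy input).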

> Can you give more details on how the experiments with LZMA and bzip2 were performed?
> Did you apply them on the raw data or on the Cooler format?
You can find the Cooler schema here: https://cooler.readthedocs.io/en/latest/schema.html
It contains the two-table representation of Hi-C data.
All of the payloads are extracted and then re-encoded using either LZMA or BZip2.
There are additional data that might not be described in the paper/presentation, such as `bins/weight`, but they are also encoded using LZMA, BZip2, and CMC.
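
For readers who want to try a comparable baseline, here is a minimal Python sketch of re-encoding one payload with the standard-library `lzma` and `bz2` modules. It is only an assumption about the general shape of such an experiment: the payload is synthetic, the 0..20 value range is made up, and reading the actual arrays out of a Cooler file is not shown.

```python
import bz2
import lzma
import random
from array import array

random.seed(0)

# Hypothetical payload: 10,000 small contact counts stored as unsigned integers.
counts = array("I", (random.randint(0, 20) for _ in range(10_000)))
raw = counts.tobytes()

compressed = {
    "lzma": lzma.compress(raw),
    "bz2": bz2.compress(raw),
}

print(f"raw: {len(raw)} bytes")
for name, blob in compressed.items():
    print(f"{name}: {len(blob)} bytes")

# Both codecs are lossless, so the payload can be recovered exactly.
restored = array("I")
restored.frombytes(lzma.decompress(compressed["lzma"]))
```

Each payload (for example a column of the pixel table) would be compressed separately in this way and the compressed sizes summed to obtain a total per method.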

> Finally, a nifty comment: On Slide 28, I think you only need 7 bits for each row, since you can represent the range [0..127] with just 7 bits.
Ah yes, I forgot to fix it. Thanks!
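
The bit count can be checked quickly: 7 bits give 2^7 = 128 distinct values, which covers [0..127]. A small Python sketch (the helper name `bits_needed` is just for illustration):

```python
def bits_needed(max_value):
    """Minimum number of bits that can represent every integer in [0, max_value]."""
    return max(1, max_value.bit_length())


print(bits_needed(127))  # 7
print(bits_needed(128))  # 8
```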
