
Contact Matrix Compressor

Citation Author(s):
Jörn Ostermann
Submitted by:
Yeremia Gunawan...
Last updated:
21 March 2022 - 7:34am
Document Type:
Presentation Slides
Document Year:
2022
Presenters:
Yeremia Gunawan Adhisantoso
Paper Code:
#139

Studying the three-dimensional folding of chromosomes is important for understanding genomic processes. This is done with techniques such as Hi-C, which analyze the spatial organization of chromosomes in a cell. The resulting data are two-dimensional quantitative maps indexed by genomic coordinates. We present a novel approach called Contact Matrix Compressor (CMC) for the efficient compression of Hi-C data. By exploiting properties of the data, such as diagonal dominance and symmetry, CMC achieves much higher compression.
CMC outperforms the existing method Cooler as well as the generic compression methods LZMA and BZip2.
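
As a rough illustration of why those two properties can help, here is a toy sketch. It is not the CMC algorithm itself: the matrix values and the helper name `upper_triangle_by_diagonal` are made up for this example. A symmetric matrix only needs its upper triangle stored, and reading that triangle diagonal by diagonal groups the large near-diagonal counts together and the small far-from-diagonal counts together.

```python
def upper_triangle_by_diagonal(matrix):
    """Return the upper triangle of a square matrix, one list per diagonal.

    Diagonal 0 is the main diagonal; diagonal k holds the entries (i, i + k).
    """
    n = len(matrix)
    return [[matrix[i][i + k] for i in range(n - k)] for k in range(n)]


# Hypothetical 4x4 symmetric contact matrix with counts concentrated
# near the main diagonal (values invented for illustration).
toy = [
    [90, 12, 3, 0],
    [12, 85, 10, 1],
    [3, 10, 88, 9],
    [0, 1, 9, 92],
]

# Symmetry check: the lower triangle carries no extra information.
assert all(toy[i][j] == toy[j][i] for i in range(4) for j in range(4))

diagonals = upper_triangle_by_diagonal(toy)
stored = sum(len(d) for d in diagonals)
print(diagonals)
print(f"{stored} of {4 * 4} entries need to be stored")
```

In this toy case each diagonal holds values of similar magnitude, which is the kind of regularity an entropy coder can take advantage of.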


Comments

If I understood correctly, the contact matrices are square matrices whose entries are natural numbers?
In some settings your matrices are symmetric, but that is not always the case?

Can you give more details on how the experiments with LZMA and bzip2 were performed?
Did you apply them on the raw data or on the Cooler format?

Finally, a nifty comment: On Slide 28, I think you only need 7 bits for each row, since you can represent the range [0..127] with just 7 bits.

Thanks for the questions!

> If I understood correctly, the contact matrices are square matrices whose entries are natural numbers?
Correct.
> In some settings your matrices are symmetric, but that is not always the case?
The contact matrix itself is always square, but the sub-contact matrices and tile matrices need not be.
The contact matrix is split into sub-contact matrices depending on the chromosome pairs (intra/inter-chromosomal).
Because chromosomes differ in length, inter-chromosomal sub-contact matrices may be non-square.
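
To make the shapes concrete, here is a minimal sketch. The bin counts per chromosome are invented, the helper `sub_matrix_shapes` is hypothetical, and listing each unordered chromosome pair only once is an assumption of this example, not something stated above. Intra-chromosomal sub-matrices come out square (quadratic), while inter-chromosomal ones follow the two chromosome lengths.

```python
from itertools import combinations_with_replacement

# Hypothetical number of bins per chromosome at some fixed resolution.
bins_per_chrom = {"chr1": 5, "chr2": 3, "chr3": 4}


def sub_matrix_shapes(bins_per_chrom):
    """Map each chromosome pair to the (rows, cols) shape of its sub-matrix."""
    shapes = {}
    # Each unordered pair is listed once, including a chromosome with itself.
    for a, b in combinations_with_replacement(bins_per_chrom, 2):
        shapes[(a, b)] = (bins_per_chrom[a], bins_per_chrom[b])
    return shapes


shapes = sub_matrix_shapes(bins_per_chrom)
for (a, b), (rows, cols) in shapes.items():
    kind = "intra" if a == b else "inter"
    square = "square" if rows == cols else "non-square"
    print(f"{a} x {b}: {rows} x {cols} ({kind}, {square})")
```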

> Can you give more details on how the experiments with LZMA and bzip2 were performed?
> Did you apply them on the raw data or on the Cooler format?
You can find the content/schema of Cooler here: https://cooler.readthedocs.io/en/latest/schema.html
It contains the two-table representation of Hi-C data.
All of the payloads are extracted and then re-encoded using either LZMA or BZip2.
There is additional data that might not be described in the paper/presentation, such as `bins/weight`, but it is also encoded using LZMA, BZip2, and CMC.
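
The re-encoding step could look roughly like the sketch below. This is not the script used for the experiments: the column names are borrowed from the pixel table in the linked schema, the values are made up, and the payloads are simply serialized as 64-bit integers before being handed to Python's standard `lzma` and `bz2` modules.

```python
import bz2
import lzma
from array import array

# Hypothetical stand-ins for extracted payloads (pixel-table columns);
# real data would be read from the Cooler file instead.
columns = {
    "bin1_id": array("q", [0, 0, 0, 1, 1, 2] * 200),
    "bin2_id": array("q", [0, 1, 2, 1, 2, 2] * 200),
    "count": array("q", [90, 12, 3, 85, 10, 88] * 200),
}

sizes = {}
for name, values in columns.items():
    raw = values.tobytes()  # the extracted payload as plain bytes
    sizes[name] = {
        "raw": len(raw),
        "lzma": len(lzma.compress(raw)),
        "bz2": len(bz2.compress(raw)),
    }

for name, size in sizes.items():
    print(name, size)
```

Compressing each payload separately, as here, is only one possible setup; the sizes it prints say nothing about the results reported in the presentation.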

> Finally, a nifty comment: On Slide 28, I think you only need 7 bits for each row, since you can represent the range [0..127] with just 7 bits.
Ah yes, I forgot to fix it. Thanks!
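
For reference, the bit count from the question can be checked with a small sketch (the helper name `bits_needed` is made up for this example):

```python
def bits_needed(max_value):
    """Minimum number of bits for unsigned integers in [0, max_value]."""
    # bit_length() is 0 for the value 0, so reserve at least one bit.
    return max(1, max_value.bit_length())


print(bits_needed(127))  # 127 is 0b1111111
print(bits_needed(128))  # 128 is 0b10000000
```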