Contact Matrix Compressor
- Citation Author(s):
- Submitted by:
- Yeremia Gunawan...
- Last updated:
- 21 March 2022 - 7:34am
- Document Type:
- Presentation Slides
- Document Year:
- 2022
- Event:
- Presenters:
- Yeremia Gunawan Adhisantoso
- Paper Code:
- #139
- Categories:
- Keywords:
Studying the three-dimensional folding of chromosomes is important for understanding genomic processes. Techniques such as Hi-C analyze the spatial organization of chromosomes in a cell. The resulting data are two-dimensional quantitative maps indexed by genomic coordinates. We present a novel approach called Contact Matrix Compressor (CMC) for the efficient compression of Hi-C data. By exploiting properties of the data, such as diagonal dominance and symmetry, CMC achieves much higher compression.
CMC outperforms the existing method Cooler as well as the generic compression methods LZMA and BZip2.
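To illustrate why symmetry and diagonal dominance help, here is a minimal sketch that stores only the upper triangle of a symmetric matrix, ordered diagonal by diagonal so that the large near-diagonal values come first. This is not CMC's actual algorithm; the function names and the toy matrix are made up for illustration.

```python
def pack_upper_by_diagonal(matrix):
    """Flatten the upper triangle of a square symmetric matrix,
    main diagonal first, then each successive off-diagonal."""
    n = len(matrix)
    return [matrix[i][i + d] for d in range(n) for i in range(n - d)]


def unpack_upper_by_diagonal(values, n):
    """Rebuild the full symmetric n x n matrix from the packed values."""
    matrix = [[0] * n for _ in range(n)]
    it = iter(values)
    for d in range(n):
        for i in range(n - d):
            v = next(it)
            matrix[i][i + d] = v
            matrix[i + d][i] = v  # mirror across the main diagonal
    return matrix


# Hypothetical 4 x 4 symmetric toy matrix with large values near the diagonal.
toy = [
    [9, 4, 1, 0],
    [4, 8, 3, 1],
    [1, 3, 7, 2],
    [0, 1, 2, 9],
]
packed = pack_upper_by_diagonal(toy)
restored = unpack_upper_by_diagonal(packed, len(toy))
```

Only n(n+1)/2 of the n² entries are kept (10 of 16 here), and the values tend to shrink toward the end of the packed list, which is the kind of regularity an entropy coder can exploit.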
Comments
Experiments on general-purpose compressors
If I understood correctly, the contact matrices are quadratic matrices whose entries are natural numbers?
In some settings your matrices are symmetric, but that is not always the case?
Can you give more details on how the experiments with LZMA and bzip2 were performed?
Did you apply them on the raw data or on the Cooler format?
Finally, a nifty comment: On Slide 28, I think you only need 7 bits for each row, since you can represent the range [0..127] with just 7 bits.
Re: Experiments on general-purpose compressors
Thanks for the questions!
> If I understood correctly, the contact matrices are quadratic matrices whose entries are natural numbers?
Correct.
> In some settings your matrices are symmetric, but that is not always the case?
The contact matrix itself is always quadratic (square), but sub-contact matrices and tile matrices need not be.
The contact matrix is split into sub-contact matrices according to chromosome pairs (intra- or inter-chromosomal).
Because chromosomes differ in length, an inter-chromosomal sub-contact matrix may be non-quadratic, and a non-square matrix cannot be symmetric.
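The split described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name, the chromosome names, and the bin counts are hypothetical, and only pairs (a, b) with a not after b are kept, assuming the full matrix is symmetric so that the remaining blocks are transposes.

```python
def split_by_chromosome(matrix, bins_per_chrom):
    """Cut a genome-wide contact matrix into per-chromosome-pair blocks.

    bins_per_chrom maps chromosome name -> number of bins, in genome order.
    """
    # Compute the [start, end) bin range of every chromosome.
    offsets = {}
    start = 0
    for name, n_bins in bins_per_chrom.items():
        offsets[name] = (start, start + n_bins)
        start += n_bins

    names = list(bins_per_chrom)
    blocks = {}
    for idx, a in enumerate(names):
        for b in names[idx:]:
            r0, r1 = offsets[a]
            c0, c1 = offsets[b]
            blocks[(a, b)] = [row[c0:c1] for row in matrix[r0:r1]]
    return blocks


# Hypothetical genome with two chromosomes of 3 and 2 bins.
bins_per_chrom = {"chrA": 3, "chrB": 2}
full = [[min(i, j) for j in range(5)] for i in range(5)]  # symmetric toy data
blocks = split_by_chromosome(full, bins_per_chrom)

intra = blocks[("chrA", "chrA")]  # 3 x 3, square
inter = blocks[("chrA", "chrB")]  # 3 x 2, not square
```

The intra-chromosomal blocks stay square, while the inter-chromosomal block takes its shape from the two chromosome lengths.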
> Can you give more details on how the experiments with LZMA and bzip2 were performed?
> Did you apply them on the raw data or on the Cooler format?
The content/schema of Cooler is described here: https://cooler.readthedocs.io/en/latest/schema.html
It contains the two-table representation of Hi-C data.
All of the payloads are extracted and then re-encoded using either LZMA or BZip2.
There are additional data that might not be described in the paper/presentation, such as `bins/weight`, but these are also encoded using LZMA, BZip2, and CMC.
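As a rough sketch of what re-encoding an extracted payload with a general-purpose compressor can look like, the snippet below serializes one hypothetical integer column and compresses it with Python's standard `lzma` and `bz2` modules. The column contents, the 32-bit little-endian layout, and the default compressor settings are assumptions for illustration, not the exact setup used in the experiments.

```python
import bz2
import lzma
import struct

# Hypothetical payload: one column of contact counts (made-up values).
counts = [5, 3, 3, 1, 0, 0, 2, 1] * 1000

# Serialize as little-endian 32-bit signed integers (assumed layout).
payload = struct.pack(f"<{len(counts)}i", *counts)

lzma_blob = lzma.compress(payload)
bz2_blob = bz2.compress(payload)

sizes = {
    "raw": len(payload),
    "lzma": len(lzma_blob),
    "bz2": len(bz2_blob),
}

# Both codecs are lossless, so decompression returns the original bytes.
lzma_ok = lzma.decompress(lzma_blob) == payload
bz2_ok = bz2.decompress(bz2_blob) == payload
```

Comparing the entries of `sizes` per payload gives the kind of per-column size comparison discussed above.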
> Finally, a nifty comment: On Slide 28, I think you only need 7 bits for each row, since you can represent the range [0..127] with just 7 bits.
Ah yes, I forgot to fix it. Thanks!
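For reference, the bit count from the comment can be checked with a small helper. The function name is made up here; it simply reports how many bits are needed to represent every integer in the range [0..max_value].

```python
def bits_needed(max_value):
    """Number of bits needed to represent all integers in [0..max_value]."""
    # int.bit_length() is 0 for the value 0, so use at least one bit.
    return max(1, max_value.bit_length())


bits_for_127 = bits_needed(127)  # range [0..127] has 128 = 2**7 values
bits_for_128 = bits_needed(128)  # one more value needs an extra bit
```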