Measuring the Similarity of Files by Data Compression
- Submitted by: Hubert Schoelnast
- Last updated: 28 February 2023 - 5:20pm
- Document Type: Presentation Slides
- Document Year: 2023
- Presenters: Hubert Schölnast
- Paper Code: 199
Lossless data compression algorithms were developed to shrink files, but they can also be used to measure how similar two files are. In this article, the meta-algorithms Concat Compress and Cross Compress are put through an extensive practical test in combination with the compression algorithms Re-Pair, gzip and bz2: five labeled datasets are classified using these algorithms. The two meta-algorithms were analyzed theoretically about 10 years ago, but little work has followed since then, and their practical implementation is still in its infancy. The results presented here are promising and show the potential of this approach. They also make clear that many research questions in this area remain open.
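To make the general idea concrete, the following minimal sketch measures similarity by compressing two inputs separately and concatenated, then classifies a sample by its nearest labeled example. The abstract does not define Concat Compress or Cross Compress, so this is not the presenter's method: it uses the well-known normalized compression distance (NCD) as a stand-in. The function names `compressed_size`, `ncd` and `classify` are hypothetical, and Re-Pair is left out because the Python standard library does not provide it.

```python
import bz2
import gzip
from typing import Callable, Mapping, Sequence

# Any function mapping raw bytes to compressed bytes can be plugged in.
Compressor = Callable[[bytes], bytes]


def compressed_size(data: bytes, compress: Compressor = gzip.compress) -> int:
    """Return the length in bytes of the compressed form of data."""
    return len(compress(data))


def ncd(x: bytes, y: bytes, compress: Compressor = gzip.compress) -> float:
    """Normalized compression distance (a stand-in, not Concat Compress itself).

    Values near 0 suggest similar inputs, values near 1 unrelated inputs.
    """
    cx = compressed_size(x, compress)
    cy = compressed_size(y, compress)
    cxy = compressed_size(x + y, compress)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(
    sample: bytes,
    labeled: Mapping[str, Sequence[bytes]],
    compress: Compressor = gzip.compress,
) -> str:
    """Assign the label of the labeled example closest to sample (1-nearest neighbor)."""
    best_label, best_distance = None, float("inf")
    for label, examples in labeled.items():
        for example in examples:
            distance = ncd(sample, example, compress)
            if distance < best_distance:
                best_label, best_distance = label, distance
    if best_label is None:
        raise ValueError("labeled must contain at least one example")
    return best_label


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    text = b"the quick brown fox jumps over the lazy dog " * 50
    binary = bytes(range(256)) * 10
    sample = b"the lazy dog jumps over the quick brown fox " * 40
    print(classify(sample, {"text": [text], "binary": [binary]}))
    print(ncd(sample, text, compress=bz2.compress))
```

The compressor is passed in as a parameter so that gzip and bz2 can be compared on the same data without changing the distance or classification code.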