
Measuring the Similarity of Files by Data Compression

Citation Author(s):
Submitted by:
Hubert Schoelnast
Last updated:
28 February 2023 - 5:20pm
Document Type:
Presentation Slides
Document Year:
Hubert Schölnast
Paper Code:

Lossless data compression algorithms were developed to shrink files, but they can also be used to measure how similar two files are. This article subjects the meta-algorithms Concat Compress and Cross Compress to an extensive practical test in combination with the compression algorithms Re-Pair, gzip, and bz2: five labeled datasets are classified using these algorithms. Theoretical work on the two meta-algorithms dates back about 10 years, but little has happened since then, and practical implementations of these methods are still in their infancy. The results presented here are promising and show the potential of the approach, but they also make clear that many research questions in this area remain open.
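To make the idea of compression-based similarity concrete, here is a minimal sketch in Python. The abstract does not define Concat Compress or Cross Compress, so the sketch substitutes the normalized compression distance (NCD), a well-known measure that also compresses the concatenation of two files; it should not be read as the author's exact method. Only gzip and bz2 are used because Re-Pair is not part of the Python standard library. The function names `compressed_size`, `ncd`, and `classify` are hypothetical and chosen for illustration only.

```python
import bz2
import gzip
from typing import Callable, Sequence, Tuple

# Any function mapping raw bytes to compressed bytes,
# e.g. bz2.compress or gzip.compress from the standard library.
Compressor = Callable[[bytes], bytes]


def compressed_size(data: bytes, compress: Compressor) -> int:
    """Length in bytes of the compressed representation of data."""
    return len(compress(data))


def ncd(x: bytes, y: bytes, compress: Compressor = bz2.compress) -> float:
    """Normalized compression distance between x and y.

    Assumption: NCD stands in here for a concatenation-based similarity
    measure. Values near 0 suggest similar inputs, values near 1
    suggest unrelated inputs. Both inputs are expected to be non-empty.
    """
    cx = compressed_size(x, compress)
    cy = compressed_size(y, compress)
    cxy = compressed_size(x + y, compress)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(
    sample: bytes,
    labeled: Sequence[Tuple[bytes, str]],
    compress: Compressor = bz2.compress,
) -> str:
    """Label of the labeled item closest to sample (1-nearest neighbour)."""
    nearest = min(labeled, key=lambda item: ncd(sample, item[0], compress))
    return nearest[1]
```

A call such as `classify(open("unknown.txt", "rb").read(), training_items, gzip.compress)` would assign the label of the nearest training file, where `unknown.txt` and `training_items` are placeholders. Nearest-neighbour classification is only one possible procedure; the abstract does not say which one was applied to the five datasets.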
