# DZip: improved general-purpose lossless compression based on novel neural network modeling

Abstract:

We consider lossless compression based on statistical data modeling followed by prediction-based encoding, where an accurate statistical model for the input data leads to substantial improvements in compression. We propose DZip, a general-purpose compressor for sequential data that exploits the well-known modeling capabilities of neural networks (NNs) for prediction, followed by arithmetic coding. DZip uses a novel hybrid architecture based on adaptive and semi-adaptive training. Unlike most NN based compressors, DZip does not require additional training data and is not restricted to specific data types, only needing the alphabet size of the input data. The proposed compressor outperforms general-purpose compressors such as Gzip (on average 26% reduction) on a variety of real datasets, achieves near-optimal compression on synthetic datasets, and performs close to specialized compressors for large sequence lengths, without any human input. The main limitation of DZip in its current implementation is the encoding/decoding time, which limits its practicality. Nevertheless, the results showcase the potential of developing improved general-purpose compressors based on neural networks and hybrid modeling.

## DZip_DCC_Poster_2020.pdf

Poster (34)

### As far as I understood the

As far as I understood the poster,
your main idea is to optimize the compression by looking on how to compress a sliding window of fixed K characters most efficiently,
so you are not aiming at compressing highly repetitive texts with a high K-th order entropy for small K?
I think that the maximum context length K may affect your running time considerably.
Can you limit the context length K, like as a command-line parameter?

Unfortunately, some parts of the poster were hard to understand:
I have not figured out why there are $\hat{y}_{rk}$ and $\hat{y}_{-r}$.
Where is $\mathcal{L}$ used?
There is no Fig.1 / Fig.2, but Fig.3 and Fig. 4
Where is Equation (1) used?
Are Stage I and Stage II pre-computation stages, and the figure below the actual control flow for compressing a file?
It is hard for me to understand the figure (there are some terms that seem not to be explained like biGRU).

Hi Dominik,

Firstly, I would like to refer you to this article "https://arxiv.org/abs/1911.03572" corresponding to our paper for a detailed explanation of DZip (refer to the Methods section of the article above). The simplest way to understand our method is to think of arithmetic encoding which requires a probability distribution over a discrete set of symbols. Now, our method is essentially supplying the probability distribution for encoding each character in the sequence using a Neural Network (NN). To train our Neural Network, we minimize the loss ($\mathcal{L}$) (Eq. 1) between predicted probability distribution (\hat{y}_{rk}) and the actual symbol (y_{rk}). To predict the probability distribution, we look at the past K symbols (similar to a markov model) where K is referred to as the context length. Stage I denotes the training prior to encoding, and stage II denotes the encoding procedure. The figure below (at the bottom in left column) illustrates the combined model used in stage II with more details about individual layers used in our NN model. biGRU (bidirectional Gated Recurrent Unit)[1] is one of the layers in our proposed NN architecture.

1. K. Cho et al., “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in 2014 (EMNLP), Doha, Qatar, Oct. 2014, pp. 1724–1734

Apologies for indexing the figures/table inappropriately. Fig 3/4 should have been Tables 1/2 respectively. We do not refer to the figures in our writeup inside the poster, which might lead to confusion. The first two figures below Eq. 1 represent our method and Tables 1 and 2 (mistakenly labelled as Fig 3 and 4) describe our quantitative comparison with benchmark compressors.

### Paper Details

Authors:
Mohit Goyal, Kedar Tatwawadi, Idoia Ochoa
Submitted On:
18 March 2020 - 5:07pm
Type:
Poster
Event:
Presenter's Name:
Shubham Chandak
Paper Code:
143
Session:
Posters
Document Year:
2020

(34)

### Subscribe

[1] Mohit Goyal, Kedar Tatwawadi, Idoia Ochoa, "DZip: improved general-purpose lossless compression based on novel neural network modeling", IEEE SigPort, 2020. [Online]. Available: http://sigport.org/4998. Accessed: Jun. 02, 2020.
@article{4998-20,
url = {http://sigport.org/4998},
author = {Mohit Goyal; Kedar Tatwawadi; Idoia Ochoa },
publisher = {IEEE SigPort},
title = {DZip: improved general-purpose lossless compression based on novel neural network modeling},
year = {2020} }
TY - EJOUR
T1 - DZip: improved general-purpose lossless compression based on novel neural network modeling
AU - Mohit Goyal; Kedar Tatwawadi; Idoia Ochoa
PY - 2020
PB - IEEE SigPort
UR - http://sigport.org/4998
ER -
Mohit Goyal, Kedar Tatwawadi, Idoia Ochoa. (2020). DZip: improved general-purpose lossless compression based on novel neural network modeling. IEEE SigPort. http://sigport.org/4998
Mohit Goyal, Kedar Tatwawadi, Idoia Ochoa, 2020. DZip: improved general-purpose lossless compression based on novel neural network modeling. Available at: http://sigport.org/4998.
Mohit Goyal, Kedar Tatwawadi, Idoia Ochoa. (2020). "DZip: improved general-purpose lossless compression based on novel neural network modeling." Web.
1. Mohit Goyal, Kedar Tatwawadi, Idoia Ochoa. DZip: improved general-purpose lossless compression based on novel neural network modeling [Internet]. IEEE SigPort; 2020. Available from : http://sigport.org/4998