Grammar compression with probabilistic context-free grammar

We propose a new approach for universal lossless text compression, based on grammar compression. In the literature, a target string T has been compressed as a context-free grammar G in Chomsky normal form satisfying L(G) = T. Such a grammar is often called a straight-line program (SLP). In this work, we consider a probabilistic grammar G that generates T, but not necessarily as a unique element of L(G). In order to recover the original text T unambiguously, we keep both the grammar G and the derivation tree of T from the start symbol in G, in compressed form. We show some simple evidence that our proposal is indeed more efficient than SLPs for certain texts, both from theoretical and practical points of view.

dcc_poster.pdf

dcc_poster.pdf (643)

Thumbs Up

Comments

While reading your poster, I

Permalink Submitted by Dominik Koeppl on 12 April 2020 - 1:28am

While reading your poster, I've collected some small questions:

About the idea for improvement:
- how do you find the noise? (I think this step is also critical for the needed time to compute your proposed PCFG)
- With 'derivation sequence' you mean that you augment the derivation tree representing a deviation rule with a flag that stores which deviation is used. These flags are written out as a derivation sequence, which then is compressed by arithmetic encoding?

For the experimental results
- is "prototype" your approach based on Re-Pair or on another grammar?
- you compare the compression ratio without giving details of the used encoding of the grammars. Did you encode the Re-Pair grammar output with an artihmetic coder?
- what it G_k ?

About the idea for

Permalink Submitted by Diptarama Hendrian on 21 April 2020 - 4:17am

About the idea for improvement:
- how do you find the noise? (I think this step is also critical for the needed time to compute your proposed PCFG)
> In the prototype algorithm we see the right and left context for each bigram to find different bigrams with the same context.

- With 'derivation sequence' you mean that you augment the derivation tree representing a deviation rule with a flag that stores which deviation is used. These flags are written out as a derivation sequence, which then is compressed by arithmetic encoding?
> Yes.

For the experimental results
- is "prototype" your approach based on Re-Pair or on another grammar?
> Yes, our prototype algorithm is based on Re-Pair.

- you compare the compression ratio without giving details of the used encoding of the grammars. Did you encode the Re-Pair grammar output with an artihmetic coder?
> We use this https://code.google.com/archive/p/re-pair/ implementation for Re-Pair.

- what it G_k ?
> Ideal grammar for type-k noise.

See https://arxiv.org/abs/2003.08097 for the details

CITE

Documents

Poster

Grammar compression with probabilistic context-free grammar

dcc_poster.pdf

Comments

While reading your poster, I

About the idea for

QUESTIONS?