
Grammar compression with probabilistic context-free grammar

Citation Author(s):
Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi
Submitted by:
Diptarama Hendrian
Last updated:
31 March 2020 - 2:53am
Document Type:
Poster
Document Year:
2020
Event:
Presenters:
Hiroaki Naganuma
 

We propose a new approach to universal lossless text compression based on grammar compression. In the literature, a target string T is compressed as a context-free grammar G in Chomsky normal form satisfying L(G) = {T}. Such a grammar is often called a straight-line program (SLP). In this work, we consider a probabilistic grammar G that generates T, but not necessarily as the unique element of L(G). In order to recover the original text T unambiguously, we keep both the grammar G and the derivation tree of T from the start symbol of G, in compressed form. We show some simple evidence that our proposal is indeed more efficient than SLPs for certain texts, from both theoretical and practical points of view.
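As a toy illustration (not taken from the poster or paper; the grammars and names below are made up for this example), the difference between expanding an SLP, whose start symbol derives exactly one string, and expanding an ambiguous grammar together with a stored derivation sequence can be sketched as follows:

```python
# A minimal sketch (not the authors' implementation) of decoding a string from
# (1) an SLP, where every nonterminal has exactly one rule, and
# (2) an ambiguous grammar plus a stored derivation sequence that records
#     which alternative was chosen at each ambiguous nonterminal.

def expand_slp(rules, symbol):
    """rules: dict nonterminal -> (Y, Z) for binary rules, or a terminal string."""
    rhs = rules[symbol]
    if isinstance(rhs, str):          # terminal rule X -> a
        return rhs
    y, z = rhs                        # binary rule X -> Y Z
    return expand_slp(rules, y) + expand_slp(rules, z)

def expand_with_choices(rules, symbol, choices):
    """rules: dict nonterminal -> list of alternatives; choices: iterator of
    indices, consumed in preorder whenever a nonterminal has several alternatives."""
    alts = rules[symbol]
    rhs = alts[next(choices)] if len(alts) > 1 else alts[0]
    if isinstance(rhs, str):
        return rhs
    y, z = rhs
    return expand_with_choices(rules, y, choices) + expand_with_choices(rules, z, choices)

# SLP for "abab": S -> A A, A -> X Y, X -> 'a', Y -> 'b'
slp = {"S": ("A", "A"), "A": ("X", "Y"), "X": "a", "Y": "b"}
assert expand_slp(slp, "S") == "abab"

# Ambiguous grammar: A can expand to "ab" or to "ac"; the stored derivation
# sequence [0, 1] disambiguates the two occurrences of A, recovering "abac".
pcfg = {"S": [("A", "A")], "A": [("X", "Y"), ("X", "Z")],
        "X": ["a"], "Y": ["b"], "Z": ["c"]}
assert expand_with_choices(pcfg, "S", iter([0, 1])) == "abac"
```

In the toy example the same grammar can generate several strings, so the choice indices (here [0, 1]) must be kept alongside the grammar; the proposal stores this derivation information in compressed form.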


Comments

While reading your poster, I've collected some small questions:

About the idea for improvement:
- how do you find the noise? (I think this step is also critical for the time needed to compute your proposed PCFG)
- By 'derivation sequence', do you mean that you augment the derivation tree, wherever a deviation rule is applied, with a flag that stores which deviation is used? These flags are then written out as a derivation sequence, which is compressed by arithmetic coding?

For the experimental results
- is your "prototype" approach based on Re-Pair or on another grammar?
- you compare the compression ratios without giving details of the encoding used for the grammars. Did you encode the Re-Pair grammar output with an arithmetic coder?
- what is G_k?

About the idea for improvement:
- how do you find the noise? (I think this step is also critical for the time needed to compute your proposed PCFG)
> In the prototype algorithm, we look at the right and left context of each bigram to find different bigrams with the same context.
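A rough sketch of this kind of context-based grouping (an illustrative assumption on my part, not the prototype's actual code):

```python
from collections import defaultdict

def bigrams_by_context(text):
    """Group the distinct bigrams of `text` by their (left, right) context
    characters; groups containing more than one bigram are candidates for
    being different 'noisy' variants of the same underlying substring."""
    groups = defaultdict(set)
    for i in range(1, len(text) - 2):
        left, bigram, right = text[i - 1], text[i:i + 2], text[i + 2]
        groups[(left, right)].add(bigram)
    return {ctx: bgs for ctx, bgs in groups.items() if len(bgs) > 1}

print(bigrams_by_context("xaby_xacy_xaby"))
# -> {('x', 'y'): {'ab', 'ac'}}: 'ab' and 'ac' occur between the same neighbours
```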

- By 'derivation sequence', do you mean that you augment the derivation tree, wherever a deviation rule is applied, with a flag that stores which deviation is used? These flags are then written out as a derivation sequence, which is compressed by arithmetic coding?
> Yes.

For the experimental results
- is your "prototype" approach based on Re-Pair or on another grammar?
> Yes, our prototype algorithm is based on Re-Pair.
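For readers unfamiliar with Re-Pair, here is a minimal and deliberately naive sketch of the core idea (repeatedly replace the most frequent adjacent pair with a fresh nonterminal); the names and the quadratic-time loop are mine, and the implementation linked in the next answer is far more efficient:

```python
from collections import Counter

def re_pair(text):
    """Naive Re-Pair sketch: replace the most frequent adjacent pair of
    symbols with a new nonterminal until no pair occurs twice.
    Returns the final sequence and the grammar rules."""
    seq = list(text)
    rules = {}
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        nonterminal = f"R{next_id}"
        next_id += 1
        rules[nonterminal] = pair
        out, i = [], 0
        while i < len(seq):                  # greedy left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

print(re_pair("abababab"))
# (['R1', 'R1'], {'R0': ('a', 'b'), 'R1': ('R0', 'R0')})
```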

- you compare the compression ratios without giving details of the encoding used for the grammars. Did you encode the Re-Pair grammar output with an arithmetic coder?
> We use the implementation at https://code.google.com/archive/p/re-pair/ for Re-Pair.

- what is G_k?
> G_k is the ideal grammar for type-k noise.

See https://arxiv.org/abs/2003.08097 for the details.