Documents
Poster
Poster
Grammar compression with probabilistic context-free grammar
- Citation Author(s):
- Submitted by:
- Diptarama Hendrian
- Last updated:
- 31 March 2020 - 2:53am
- Document Type:
- Poster
- Document Year:
- 2020
- Event:
- Presenters:
- Hiroaki Naganuma
- Categories:
- Keywords:
- Log in to post comments
We propose a new approach for universal lossless text compression, based on grammar compression. In the literature, a target string T has been compressed as a context-free grammar G in Chomsky normal form satisfying L(G) = T. Such a grammar is often called a straight-line program (SLP). In this work, we consider a probabilistic grammar G that generates T, but not necessarily as a unique element of L(G). In order to recover the original text T unambiguously, we keep both the grammar G and the derivation tree of T from the start symbol in G, in compressed form. We show some simple evidence that our proposal is indeed more efficient than SLPs for certain texts, both from theoretical and practical points of view.
dcc_poster.pdf
dcc_poster.pdf (491)
Comments
While reading your poster, I
While reading your poster, I've collected some small questions:
About the idea for improvement:
- how do you find the noise? (I think this step is also critical for the needed time to compute your proposed PCFG)
- With 'derivation sequence' you mean that you augment the derivation tree representing a deviation rule with a flag that stores which deviation is used. These flags are written out as a derivation sequence, which then is compressed by arithmetic encoding?
For the experimental results
- is "prototype" your approach based on Re-Pair or on another grammar?
- you compare the compression ratio without giving details of the used encoding of the grammars. Did you encode the Re-Pair grammar output with an artihmetic coder?
- what it G_k ?
About the idea for
About the idea for improvement:
- how do you find the noise? (I think this step is also critical for the needed time to compute your proposed PCFG)
> In the prototype algorithm we see the right and left context for each bigram to find different bigrams with the same context.
- With 'derivation sequence' you mean that you augment the derivation tree representing a deviation rule with a flag that stores which deviation is used. These flags are written out as a derivation sequence, which then is compressed by arithmetic encoding?
> Yes.
For the experimental results
- is "prototype" your approach based on Re-Pair or on another grammar?
> Yes, our prototype algorithm is based on Re-Pair.
- you compare the compression ratio without giving details of the used encoding of the grammars. Did you encode the Re-Pair grammar output with an artihmetic coder?
> We use this https://code.google.com/archive/p/re-pair/ implementation for Re-Pair.
- what it G_k ?
> Ideal grammar for type-k noise.
See https://arxiv.org/abs/2003.08097 for the details