Abstract:

We propose a new approach for universal lossless text compression, based on grammar compression. In the literature, a target string T has been compressed as a context-free grammar G in Chomsky normal form satisfying L(G) = T. Such a grammar is often called a straight-line program (SLP). In this work, we consider a probabilistic grammar G that generates T, but not necessarily as a unique element of L(G). In order to recover the original text T unambiguously, we keep both the grammar G and the derivation tree of T from the start symbol in G, in compressed form. We show some simple evidence that our proposal is indeed more efficient than SLPs for certain texts, both from theoretical and practical points of view.

### Paper Details

- Authors:
- Submitted On:
- 31 March 2020 - 2:53am
- Short Link:
- Type:
- Poster
- Event:
- Presenter's Name:
- Hiroaki Naganuma
- Session:
- Posters
- Document Year:
- 2020
- Cite

### Subscribe

[1] Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi,
"Grammar compression with probabilistic context-free grammar",
IEEE SigPort,
2020. [Online]. Available: http://sigport.org/5074. Accessed: Aug. 13, 2020.

@article{5074-20,

url = {http://sigport.org/5074},

author = {Hiroaki Naganuma; Diptarama Hendrian; Ryo Yoshinaka; Ayumi Shinohara; Naoki Kobayashi },

publisher = {IEEE SigPort},

title = {Grammar compression with probabilistic context-free grammar},

year = {2020} }

url = {http://sigport.org/5074},

author = {Hiroaki Naganuma; Diptarama Hendrian; Ryo Yoshinaka; Ayumi Shinohara; Naoki Kobayashi },

publisher = {IEEE SigPort},

title = {Grammar compression with probabilistic context-free grammar},

year = {2020} }

TY - EJOUR

T1 - Grammar compression with probabilistic context-free grammar

AU - Hiroaki Naganuma; Diptarama Hendrian; Ryo Yoshinaka; Ayumi Shinohara; Naoki Kobayashi

PY - 2020

PB - IEEE SigPort

UR - http://sigport.org/5074

ER -

T1 - Grammar compression with probabilistic context-free grammar

AU - Hiroaki Naganuma; Diptarama Hendrian; Ryo Yoshinaka; Ayumi Shinohara; Naoki Kobayashi

PY - 2020

PB - IEEE SigPort

UR - http://sigport.org/5074

ER -

Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi.
(2020).
Grammar compression with probabilistic context-free grammar.
IEEE SigPort.
http://sigport.org/5074

Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi,
2020.
Grammar compression with probabilistic context-free grammar.
Available at:
http://sigport.org/5074.

Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi.
(2020).
"Grammar compression with probabilistic context-free grammar."
Web.

1. Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi.
Grammar compression with probabilistic context-free grammar [Internet].
IEEE SigPort; 2020.
Available from :
http://sigport.org/5074

## Comments

## While reading your poster, I

While reading your poster, I've collected some small questions:

About the idea for improvement:

- how do you find the noise? (I think this step is also critical for the needed time to compute your proposed PCFG)

- With 'derivation sequence' you mean that you augment the derivation tree representing a deviation rule with a flag that stores which deviation is used. These flags are written out as a derivation sequence, which then is compressed by arithmetic encoding?

For the experimental results

- is "prototype" your approach based on Re-Pair or on another grammar?

- you compare the compression ratio without giving details of the used encoding of the grammars. Did you encode the Re-Pair grammar output with an artihmetic coder?

- what it G_k ?

## About the idea for

About the idea for improvement:

- how do you find the noise? (I think this step is also critical for the needed time to compute your proposed PCFG)

> In the prototype algorithm we see the right and left context for each bigram to find different bigrams with the same context.

- With 'derivation sequence' you mean that you augment the derivation tree representing a deviation rule with a flag that stores which deviation is used. These flags are written out as a derivation sequence, which then is compressed by arithmetic encoding?

> Yes.

For the experimental results

- is "prototype" your approach based on Re-Pair or on another grammar?

> Yes, our prototype algorithm is based on Re-Pair.

- you compare the compression ratio without giving details of the used encoding of the grammars. Did you encode the Re-Pair grammar output with an artihmetic coder?

> We use this https://code.google.com/archive/p/re-pair/ implementation for Re-Pair.

- what it G_k ?

> Ideal grammar for type-k noise.

See https://arxiv.org/abs/2003.08097 for the details