Sorry, you need to enable JavaScript to visit this website.

facebooktwittermailshare

Grammar compression with probabilistic context-free grammar

Abstract: 

We propose a new approach for universal lossless text compression, based on grammar compression. In the literature, a target string T has been compressed as a context-free grammar G in Chomsky normal form satisfying L(G) = T. Such a grammar is often called a straight-line program (SLP). In this work, we consider a probabilistic grammar G that generates T, but not necessarily as a unique element of L(G). In order to recover the original text T unambiguously, we keep both the grammar G and the derivation tree of T from the start symbol in G, in compressed form. We show some simple evidence that our proposal is indeed more efficient than SLPs for certain texts, both from theoretical and practical points of view.

up
0 users have voted:

Comments

While reading your poster, I've collected some small questions:

About the idea for improvement:
- how do you find the noise? (I think this step is also critical for the needed time to compute your proposed PCFG)
- With 'derivation sequence' you mean that you augment the derivation tree representing a deviation rule with a flag that stores which deviation is used. These flags are written out as a derivation sequence, which then is compressed by arithmetic encoding?

For the experimental results
- is "prototype" your approach based on Re-Pair or on another grammar?
- you compare the compression ratio without giving details of the used encoding of the grammars. Did you encode the Re-Pair grammar output with an artihmetic coder?
- what it G_k ?

About the idea for improvement:
- how do you find the noise? (I think this step is also critical for the needed time to compute your proposed PCFG)
> In the prototype algorithm we see the right and left context for each bigram to find different bigrams with the same context.

- With 'derivation sequence' you mean that you augment the derivation tree representing a deviation rule with a flag that stores which deviation is used. These flags are written out as a derivation sequence, which then is compressed by arithmetic encoding?
> Yes.

For the experimental results
- is "prototype" your approach based on Re-Pair or on another grammar?
> Yes, our prototype algorithm is based on Re-Pair.

- you compare the compression ratio without giving details of the used encoding of the grammars. Did you encode the Re-Pair grammar output with an artihmetic coder?
> We use this https://code.google.com/archive/p/re-pair/ implementation for Re-Pair.

- what it G_k ?
> Ideal grammar for type-k noise.

See https://arxiv.org/abs/2003.08097 for the details

Paper Details

Authors:
Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi
Submitted On:
31 March 2020 - 2:53am
Short Link:
Type:
Poster
Event:
Presenter's Name:
Hiroaki Naganuma
Session:
Posters
Document Year:
2020
Cite

Document Files

dcc_poster.pdf

(56)

Keywords

Additional Categories

Subscribe

[1] Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi, "Grammar compression with probabilistic context-free grammar", IEEE SigPort, 2020. [Online]. Available: http://sigport.org/5074. Accessed: Aug. 13, 2020.
@article{5074-20,
url = {http://sigport.org/5074},
author = {Hiroaki Naganuma; Diptarama Hendrian; Ryo Yoshinaka; Ayumi Shinohara; Naoki Kobayashi },
publisher = {IEEE SigPort},
title = {Grammar compression with probabilistic context-free grammar},
year = {2020} }
TY - EJOUR
T1 - Grammar compression with probabilistic context-free grammar
AU - Hiroaki Naganuma; Diptarama Hendrian; Ryo Yoshinaka; Ayumi Shinohara; Naoki Kobayashi
PY - 2020
PB - IEEE SigPort
UR - http://sigport.org/5074
ER -
Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi. (2020). Grammar compression with probabilistic context-free grammar. IEEE SigPort. http://sigport.org/5074
Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi, 2020. Grammar compression with probabilistic context-free grammar. Available at: http://sigport.org/5074.
Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi. (2020). "Grammar compression with probabilistic context-free grammar." Web.
1. Hiroaki Naganuma, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara, Naoki Kobayashi. Grammar compression with probabilistic context-free grammar [Internet]. IEEE SigPort; 2020. Available from : http://sigport.org/5074