Thank you for your presentation!
May I ask whether you can shed more light on the details of your algorithm?
If it is just a parallelization, then the output of the original Re-Pair
algorithm should coincide with your parallel version. Yet, you report
that your compression ratio is slightly worse. My guess is that you
keep a global frequency table, and every CPU processes its own part
of the text (called "Block" in your slides) without synchronization
barriers, such that bigrams that are not the globally most frequent may get replaced.
Also, what is the Re-Pair software you compare with? (There are multiple
available, all with different trade-offs.)
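
To make my guess about the ratio gap concrete, here is a minimal Python sketch. It is not your algorithm: it assumes the simplest possible variant, in which every block is compressed completely independently, rather than with a shared frequency table. All function names (`repair`, `repair_blockwise`, `grammar_size`) and the size measure (remaining symbols plus two symbols per rule) are my own choices for illustration.

```python
from collections import Counter


def count_bigrams(seq):
    """Count non-overlapping occurrences of each adjacent pair, left to right."""
    counts = Counter()
    last_pos = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # An occurrence starting right after a counted one of the same pair
        # overlaps it (only possible in runs such as "aaa"), so skip it.
        if last_pos.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_pair(seq, pair, new_sym):
    """Replace every non-overlapping occurrence of pair, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Naive sequential Re-Pair: replace the most frequent bigram until none repeats."""
    seq, rules = list(text), {}
    while True:
        counts = count_bigrams(seq)
        if not counts:
            break
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break
        sym = ("R", len(rules))  # fresh non-terminal
        rules[sym] = pair
        seq = replace_pair(seq, pair, sym)
    return seq, rules


def expand(sym, rules):
    """Decompress a single symbol back to its terminals."""
    if sym not in rules:
        return [sym]
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)


def repair_blockwise(text, num_blocks):
    """ASSUMPTION: blocks are compressed independently, each with its own rules."""
    size = -(-len(text) // num_blocks)  # ceiling division
    return [repair(text[b * size:(b + 1) * size]) for b in range(num_blocks)]


def grammar_size(seq, rules):
    """Remaining sequence length plus two symbols per rule."""
    return len(seq) + 2 * len(rules)


text = "abababab"
sequential = grammar_size(*repair(text))
blockwise = sum(grammar_size(s, r) for s, r in repair_blockwise(text, 2))
print(sequential, blockwise)  # prints: 6 8
```

On this toy input, the sequential run finds the rule for "abab", while the two blocks each stop after "ab" and duplicate that rule. A gap of this kind is what I would expect if the blocks do not agree on the globally most frequent bigram.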