A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters

Sardar Jaf
25 November 2016 - 3:44am
In this study, we outline a potential problem
in normalising texts written in modified versions
of the Arabic alphabet. One of the main resources
available for processing resource-scarce languages is
raw text collected from the Internet. Many less-resourced
languages, such as Kurdish, Farsi, Urdu and Pashto,
use a modified version of the Arabic writing
system. In data harvested from the Internet, many
characters may have exactly the same visual form but be
encoded with different Unicode values (ambiguous characters).
The existence of ambiguous characters in words leads to
word duplication, so it is important to identify and
unify ambiguous characters during the normalisation
stage. Here, we demonstrate cases related to ambiguous
Kurdish and Farsi characters and propose a semi-automatic
approach to identify and unify them.
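
To make the problem concrete, the following is a minimal sketch of how ambiguous characters might be identified and unified. It is not the procedure proposed in the paper. The mapping table, the choice of canonical code points, and the function names (CANONICAL, find_ambiguous, report, unify) are all assumptions made for illustration. It uses two commonly confused pairs: ARABIC LETTER KAF (U+0643) versus ARABIC LETTER KEHEH (U+06A9), and ARABIC LETTER YEH (U+064A) versus ARABIC LETTER FARSI YEH (U+06CC).

```python
from collections import Counter
import unicodedata

# Hypothetical mapping (an assumption, not the paper's table):
# Arabic code points on the left are rewritten to the Kurdish/Farsi
# code points on the right. A real table would be reviewed by a
# person who knows the target language.
CANONICAL = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}
_TABLE = str.maketrans(CANONICAL)


def find_ambiguous(text):
    """Count occurrences of each code point listed in CANONICAL."""
    return Counter(ch for ch in text if ch in CANONICAL)


def report(text):
    """Return (code point, Unicode name, count) rows for manual review."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), count)
        for ch, count in find_ambiguous(text).most_common()
    ]


def unify(text):
    """Replace every listed code point with its canonical counterpart."""
    return text.translate(_TABLE)


# Two encodings of the same five-letter word; they differ only in the
# code points used for the first and last letters.
arabic_encoded = "\u0643\u0648\u0631\u062F\u064A"
kurdish_encoded = "\u06A9\u0648\u0631\u062F\u06CC"

# Before unification the corpus contains two distinct word types.
types_before = {arabic_encoded, kurdish_encoded}
# After unification both strings collapse to a single word type.
types_after = {unify(arabic_encoded), unify(kurdish_encoded)}
```

In this sketch the semi-automatic element is the split between report, which lists candidate characters for a human to inspect, and unify, which applies the mapping once it has been approved. Using str.translate keeps the replacement to a single pass over the text.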

