
A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters

Submitted by: Sardar Jaf
Last updated: 25 November 2016 - 3:44am
Presenter's Name: Sardar Jaf



In this study, we outline a potential problem in normalising texts
written in a modified version of the Arabic alphabet. One of the main
resources available for processing resource-scarce languages is raw
text collected from the Internet. Many less-resourced languages, such
as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the
Arabic writing system. In data harvested from the Internet, many
characters may have exactly the same form but be encoded with
different Unicode values (ambiguous characters). Ambiguous characters
lead to word duplication, because the same word is stored under more
than one encoding, so it is important to identify and unify them
during the normalisation stage. Here, we demonstrate cases involving
ambiguous Kurdish and Farsi characters and propose a semi-automatic
approach to identifying and unifying them.
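
To make the duplication problem concrete, the following is a minimal sketch rather than the procedure proposed in the paper. It assumes a small, hand-written mapping (the hypothetical `UNIFY` table) from two Arabic code points to the look-alike code points commonly used in Farsi text; the function names `unify` and `find_ambiguous` are also illustrative. In a semi-automatic workflow, the groups it reports would be candidates for a human to confirm before any mapping is applied.

```python
# Hypothetical mapping for illustration only (not taken from the paper):
# variant code point -> code point chosen as canonical.
UNIFY = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}
TABLE = str.maketrans(UNIFY)


def unify(text: str) -> str:
    """Replace every mapped variant character with its canonical form."""
    return text.translate(TABLE)


def find_ambiguous(tokens):
    """Group distinct tokens that become identical after unification."""
    groups = {}
    for tok in tokens:
        groups.setdefault(unify(tok), set()).add(tok)
    # Keep only groups where more than one raw encoding was observed.
    return {canon: raw for canon, raw in groups.items() if len(raw) > 1}


# The same word typed once with KEHEH and once with KAF.
word_keheh = "\u06A9\u062A\u0627\u0628"
word_kaf = "\u0643\u062A\u0627\u0628"
tokens = [word_keheh, word_kaf, word_keheh]

vocab_before = set(tokens)                    # two entries for one word
vocab_after = {unify(t) for t in tokens}      # one entry after unification
candidates = find_ambiguous(tokens)           # groups to review by hand
```

Here the raw vocabulary holds two entries for a single word, and unification collapses them into one. A fixed table only covers pairs that are already known; discovering new pairs is the part that needs manual inspection.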

