Poster
A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters
- Citation Author(s):
- Submitted by:
- Sardar Jaf
- Last updated:
- 25 November 2016 - 3:44am
- Document Type:
- Poster
- Document Year:
- 2016
- Event:
- Presenters:
- Sardar Jaf
- Paper Code:
- 56
- Categories:
In this study, we outline a potential problem
in normalising texts that are based on a modified version
of the Arabic alphabet. One of the main resources
available for processing resource-scarce languages is
raw text collected from the Internet. Many less-resourced
languages, such as Kurdish, Farsi, Urdu and Pashtu, use a
modified version of the Arabic writing system. In data
harvested from the Internet, many characters may have
exactly the same form but be encoded with different
Unicode values (ambiguous characters).
The existence of ambiguous characters in words leads to
word duplication (the same word stored as several distinct
strings), so it is important to identify and unify
ambiguous characters during the normalisation stage.
Here, we demonstrate cases related to ambiguous Kurdish
and Farsi characters and propose a semi-automatic
approach to identify and unify them.
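
The idea of unifying ambiguous characters can be illustrated with a minimal Python sketch. It is not the approach presented in the poster; it only shows how a variant-to-canonical table might collapse duplicated words and surface candidate groups for a human to review (the "semi-automatic" part). The mapping table, the choice of canonical code points, and the function names `unify` and `duplicate_groups` are all assumptions made for illustration. The listed letters are rendered identically in at least some positions of a word.

```python
from collections import defaultdict

# Hypothetical mapping (assumed for illustration, not taken from the poster):
# variant code point -> code point chosen here as canonical.
AMBIGUOUS_TO_CANONICAL = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
}

# Build the translation table once; str.translate applies it per character.
_TABLE = str.maketrans(AMBIGUOUS_TO_CANONICAL)


def unify(text):
    """Replace every mapped variant character with its canonical form."""
    return text.translate(_TABLE)


def duplicate_groups(words):
    """Group words that become identical after unification.

    Returns a dict from the unified form to the sorted list of distinct
    original spellings, keeping only groups with more than one spelling.
    These groups are candidates for manual review.
    """
    groups = defaultdict(set)
    for word in words:
        groups[unify(word)].add(word)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


if __name__ == "__main__":
    # The same word typed with ARABIC KAF and with KEHEH, plus an unrelated word.
    sample = ["\u0643\u062A\u0627\u0628", "\u06A9\u062A\u0627\u0628", "\u0646\u0627\u0646"]
    for unified, variants in duplicate_groups(sample).items():
        print(unified, [v.encode("unicode_escape").decode() for v in variants])
```

In practice the table would be built and checked by someone who knows the target language, since a replacement that is safe for Kurdish or Farsi text may be wrong for Arabic text in the same corpus.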