A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters

Abstract: 

In this study, we outline a potential problem in normalising texts written in modified versions of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. In data harvested from the Internet, many characters may have exactly the same form but be encoded with different Unicode values (ambiguous characters). Ambiguous characters in words lead to word duplication, so it is important to identify and unify them during the normalisation stage. Here, we demonstrate cases of ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
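
The duplication problem described above can be illustrated with a minimal Python sketch. It is not the method from the poster, only an illustration under stated assumptions: the mapping table, the choice of target code points, and the helper names (UNIFY, unify, describe, find_duplicates) are all hypothetical. The two pairs used, ARABIC LETTER KAF (U+0643) versus ARABIC LETTER KEHEH (U+06A9) and ARABIC LETTER YEH (U+064A) versus ARABIC LETTER FARSI YEH (U+06CC), are commonly cited look-alikes that render the same in several positions within a word.

```python
import unicodedata
from collections import defaultdict

# Hypothetical mapping for illustration only; not taken from the poster.
# Each Arabic code point is rewritten to the look-alike used in Farsi/Kurdish text.
UNIFY = str.maketrans({
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
})


def unify(text: str) -> str:
    """Rewrite every mapped character to its chosen canonical code point."""
    return text.translate(UNIFY)


def describe(word: str) -> str:
    """List code points and Unicode names so a person can review a word."""
    return ", ".join(
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in word
    )


def find_duplicates(words):
    """Group words that differ only in ambiguous characters.

    Returns a dict from unified form to the set of original spellings,
    keeping only groups with more than one spelling.
    """
    groups = defaultdict(set)
    for word in words:
        groups[unify(word)].add(word)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


if __name__ == "__main__":
    with_kaf = "\u0643\u062A\u0627\u0628"    # word spelled with ARABIC LETTER KAF
    with_keheh = "\u06A9\u062A\u0627\u0628"  # same word spelled with KEHEH
    for unified, forms in find_duplicates([with_kaf, with_keheh]).items():
        print("unified form:", describe(unified))
        for form in sorted(forms):
            print("  variant:", describe(form))
```

The grouping step is what makes this sketch semi-automatic in spirit: candidate duplicates are produced mechanically, while describe prints the code points so that a person can decide whether a given pair should really be merged.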

Paper Details

Submitted On: 25 November 2016 - 3:44am
Type: Poster
Presenter's Name: Sardar Jaf
Paper Code: 56
Document Year: 2016

Document Files

sarar_jaf_poster.pdf


[1] , "A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters", IEEE SigPort, 2016. [Online]. Available: http://sigport.org/1306. Accessed: Aug. 20, 2017.