In this paper, we study recurrent neural network language models on three corpora in two languages. We implement basic recurrent neural networks (RNNs) and refined RNNs with long short-term memory (LSTM) cells. We use the Penn Treebank (PTB) and AMI corpora in English, and the Academia Sinica Balanced Corpus (ASBC) in Chinese. On ASBC, we investigate word-based and character-based language models. For character-based language models, we examine the cases where the inter-word space is or is not treated as a token.
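
A minimal sketch of the two character-based tokenization schemes (the marker name and the example sentence are our own, not from the paper):

```python
# Sketch: tokenize a segmented Chinese sentence into character tokens,
# with and without an explicit inter-word space token.
SPACE = "<sp>"  # hypothetical marker for the inter-word boundary

def char_tokens(words, keep_space):
    """Flatten segmented words into a character-token sequence."""
    tokens = []
    for i, word in enumerate(words):
        tokens.extend(word)  # extending with a string adds its characters
        if keep_space and i < len(words) - 1:
            tokens.append(SPACE)
    return tokens

words = ["今天", "天氣", "很好"]  # a segmented example sentence
print(char_tokens(words, keep_space=True))   # ['今', '天', '<sp>', '天', '氣', '<sp>', '很', '好']
print(char_tokens(words, keep_space=False))  # ['今', '天', '天', '氣', '很', '好']
```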

It has been argued that recurrent neural network language models are better at capturing long-range dependencies than n-gram language models. In this paper, we attempt to verify this claim by investigating the prediction accuracy and the perplexity of these language models as a function of word position, i.e., the position of a word within a sentence. It is expected that, as word position increases, the advantage of using recurrent neural network language models over n-gram language models will become increasingly evident.
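
A minimal sketch of the position-wise evaluation, where `logprob` is an assumed interface standing in for any of the compared models:

```python
import math
from collections import defaultdict

def perplexity_by_position(sentences, logprob):
    """Perplexity as a function of word position.

    `logprob(history, word)` is an assumed interface returning the
    model's natural-log probability of `word` given its history,
    for an RNN/LSTM or n-gram language model alike.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for sent in sentences:
        for pos, word in enumerate(sent):
            sums[pos] += logprob(sent[:pos], word)
            counts[pos] += 1
    return {pos: math.exp(-sums[pos] / counts[pos]) for pos in sums}
```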

In this paper, we propose a question representation based on entity labeling and question classification for an automatic question answering system for Chinese Gaokao history questions. A CRF model is used for entity labeling, and SVM, CNN, and LSTM models are tested for question classification. Our experiments show that the CRF model performs well at labeling informative entities, while the neural networks show promising performance on the question classification task.
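
A minimal sketch of the SVM branch of question classification, assuming simple bag-of-words features; the paper's actual features, label set, and training data are not shown here:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder data; real inputs would be word-segmented Chinese
# Gaokao history questions paired with question-type labels.
questions = ["what caused the opium war",
             "compare the tang and song dynasties"]
labels = ["cause", "comparison"]  # hypothetical label set

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(questions, labels)
print(clf.predict(["what caused the taiping rebellion"]))
```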

This paper presents initial English-Tigrinya statistical machine translation (SMT) research. Tigrinya is a highly inflected Semitic language spoken in Eritrea and Ethiopia. Translation involving morphologically complex languages is challenged by factors including data sparseness and source-target word alignment. We try to address these problems through morphological segmentation of Tigrinya words. After segmentation, the difference in token count between the two languages dropped significantly, from 37.7% to 0.1%, and the out-of-vocabulary rate was reduced by 46%.
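
A minimal sketch of how the out-of-vocabulary (OOV) reduction can be measured, with `segment` standing in for the morphological segmenter:

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens never seen in training."""
    vocab = set(train_tokens)
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)

def segment_all(tokens, segment):
    """Apply a word-to-morpheme segmenter to every token."""
    return [m for t in tokens for m in segment(t)]

# Compare oov_rate(train, test) against
# oov_rate(segment_all(train, segment), segment_all(test, segment)).
```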

We present a word sense disambiguation (WSD) tool for Japanese Hiragana words. Unlike other WSD tasks, which output something like "sense #3" as the result, our task rewrites the target word into a Kanji word, which is a different orthography. This means that the task is a kind of orthographical normalization as well as WSD. In this paper we present the task, our method, and its performance.
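
A minimal sketch of the rewriting view of the task; the candidate lexicon and context scorer below are stand-ins, not the paper's method:

```python
# One Hiragana spelling can correspond to several Kanji senses.
KANJI_CANDIDATES = {
    "かく": ["書く", "描く", "欠く"],  # write / draw / lack
}

def rewrite(target, context, score):
    """Rewrite a Hiragana target word into its best Kanji spelling.

    `score(candidate, context)` is an assumed context model, e.g.
    an n-gram score or a classifier confidence.
    """
    candidates = KANJI_CANDIDATES.get(target, [target])
    return max(candidates, key=lambda c: score(c, context))
```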

Uyghur is a minority language in China and one of the official languages of the Xinjiang Uyghur Autonomous Region. More than 10 million people use Uyghur in daily life and even on the Internet. However, the lack of a Uyghur entity relation corpus constrains relation extraction applications in Uyghur. In this paper, we describe annotation schemes for creating an annotated corpus of Uyghur named entities and named entity relations.
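
One hypothetical shape such an annotation could take (the paper's actual tag set and file format may differ): entity spans over the token sequence plus typed relations between entity pairs.

```python
record = {
    "tokens": ["..."],  # a tokenized Uyghur sentence
    "entities": [
        {"id": "e1", "span": [0, 2], "type": "PER"},  # token offsets
        {"id": "e2", "span": [4, 5], "type": "LOC"},
    ],
    "relations": [
        {"head": "e1", "tail": "e2", "type": "born_in"},
    ],
}
```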

We have investigated the effect of normalizing Japanese orthographical variants into a uniform orthography on statistical machine translation (SMT) between Japanese and English. In Japanese, reportedly 10% of words have more than one orthographical variant, which suggests that normalizing these variants could improve translation quality.
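
A minimal sketch of table-driven normalization applied before SMT training; the variant entries are illustrative:

```python
# Map orthographical variants to one canonical spelling
# (e.g. common spellings of hikkoshi, "moving house").
VARIANTS = {
    "引越し": "引っ越し",
    "引越": "引っ越し",
}

def normalize(tokens):
    """Rewrite each token to its canonical orthography, if any."""
    return [VARIANTS.get(t, t) for t in tokens]
```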

This paper presents our work on developing fundamental Vietnamese text-analysis tools and a resource: tools for word segmentation, part-of-speech tagging, and diacritics restoration, and a dictionary of orthographical variants. Until now, these have either not been publicly available or have not attained sufficient performance. We have developed them with state-of-the-art methods, achieved high accuracy, and released them to the public as both software packages and web tools.
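
As a toy illustration of one of these tasks, diacritics restoration can be framed as mapping each stripped word back to its most frequent diacritized form; the released tool uses stronger, context-aware methods than this sketch:

```python
import unicodedata
from collections import Counter

def strip_diacritics(word):
    """Remove Vietnamese diacritics ("tiếng Việt" -> "tieng Viet")."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.replace("đ", "d").replace("Đ", "D")  # đ has no combining mark

def build_table(corpus_words):
    """Map each stripped form to its most frequent diacritized form."""
    counts = Counter(corpus_words)
    table = {}
    for word, _ in counts.most_common():  # most frequent first
        table.setdefault(strip_diacritics(word), word)
    return table

def restore(tokens, table):
    """Restore diacritics token by token, leaving unknown words as-is."""
    return [table.get(t, t) for t in tokens]
```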
