Kais Dukes | 2 May 2011 11:23

New Version 0.4 of the Quranic Arabic Corpus

Apologies for cross-posting.

The Quranic Arabic Corpus (http://corpus.quran.com) is an international collaborative linguistic
project, initiated at the University of Leeds, that aims to bridge the gap between the traditional Arabic
grammar of i'rab and techniques from modern computational linguistics. This open-source resource
includes part-of-speech tagging for the Quran, morphological segmentation, and a formal
representation of Quranic syntax using dependency graphs. Version 0.4 of the corpus provides several
improvements over the previous release:

*** [Increased coverage for the syntactic treebank]. Version 0.4 of the treebank covers 40% of the Quran by
word count (30,895 out of 77,429 words). The treebank provides syntactic annotation using dependency
grammar for chapters 1-8 and 59-114 of the Quran.
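
As a quick check of the coverage figure, the arithmetic can be sketched in a few lines of Python. The word counts and chapter ranges are taken from the paragraph above; the variable and function names are illustrative only and are not part of the corpus software.

```python
# Figures quoted in the announcement; names below are illustrative only.
annotated_words = 30_895
total_words = 77_429

def coverage_percent(annotated, total):
    """Return annotated/total expressed as a percentage."""
    return 100.0 * annotated / total

coverage = coverage_percent(annotated_words, total_words)

# Chapters with syntactic annotation: 1-8 and 59-114 (inclusive ranges).
annotated_chapters = list(range(1, 9)) + list(range(59, 115))

print(f"{coverage:.1f}% of words annotated")
print(f"{len(annotated_chapters)} of 114 chapters annotated")
```

The word-level figure comes out just under 40%, which matches the rounded value quoted for the treebank.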

*** [Revised morphological analysis]. Following online collaboration by volunteer annotators, over
500 suggestions have been cross-checked against traditional sources of Arabic grammar, resulting in more
accurate morphological tagging.

*** [Improved Quran dictionary and lemmatization]. The list of roots and lemmas that group related
derived words has been made more consistent with traditional Arabic lexicons. The online Quran
dictionary now also includes concordance lines from Quranic verses as context.

*** [Readability and navigation improvements]. The content of the website has been better organized,
with improvements to navigation and layout. Several typing mistakes and omissions have been corrected
in the word-by-word interlinear translation into English.

*** [More accurate tagging of proper nouns]. Eight new named entities have been added to the semantic
ontology that were previously tagged only as nouns: Al-Ahqaf, Al-Jahiliyah, Al-Jumu'ah, Baal,
Magians, Salsabil, Sirius, and Zaqqum.

*** [More accurate tagging for particles waw and fa]. In accordance with traditional Arabic grammar, for