Jonas Brekle | 21 Dec 2011 15:13

[ANN] Wiktionary RDF-extraction with DBpedia for en and de

Dear all,

Over the past months, Sebastian and I have worked on a DBpedia-based
extractor for Wiktionary. The main goal was to make it so configurable
that applying it to the different language versions of Wiktionary is
a matter of configuration, not programming. The configuration should be
doable by someone who has a good understanding of the wiki syntax (and
currently XML, though we plan to hide that too, via a web-based frontend
similar to the mappings wiki) but not of Scala or RDF.

We now have configs and dumps for the English and German Wiktionary as
examples, to show the state of our development and to initiate a
discussion about design and implementation. If you are not interested in
the technical details, you may skip the detailed description below and
just evaluate the dump.
The English dump contains 16M triples and took 9 days on my dual-core
2 GHz laptop; that's about 4 articles per second. The German dump
contains 1.3M triples and took 3 h (German has only 300k articles versus
3M for English, and its config is smaller). We know that there is some
"noise" in the data (incorrectly parsed data); we will fix that in the
coming weeks.
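
As a quick sanity check on those throughput figures, here is a minimal Python sketch that recomputes the average rates from the rounded article counts and run times quoted above. The function name is only illustrative; the numbers are the ones from this mail, not new measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60


def articles_per_second(articles, seconds):
    """Average extraction throughput over the whole run."""
    return articles / seconds


# Rounded figures from the mail: 3M articles in 9 days, 300k articles in 3 h.
english_rate = articles_per_second(3_000_000, 9 * SECONDS_PER_DAY)
german_rate = articles_per_second(300_000, 3 * SECONDS_PER_HOUR)

print(f"English: {english_rate:.1f} articles/s")
print(f"German:  {german_rate:.1f} articles/s")
```

With these rounded inputs, the German run comes out roughly seven times faster per article than the English one, which would be consistent with the smaller config.
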
The source code is in the "wiktionary" branch of the DBpedia SVN repository; the dump files can be downloaded here:
http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_en.nt.bz2
http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_de.nt.bz2
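
For those who just want to evaluate the dump, the following is a minimal Python sketch for streaming one of the files without unpacking it first. It assumes the file is plain N-Triples compressed with bzip2 (as the ".nt.bz2" suffix suggests) and has been downloaded locally; the function name and the `limit` parameter are made up for this example.

```python
import bz2
from collections import Counter


def count_predicates(path, limit=None):
    """Stream a bz2-compressed N-Triples file and count triples per predicate.

    `limit` (optional) stops after that many input lines, for a quick look.
    """
    counts = Counter()
    with bz2.open(path, mode="rt", encoding="utf-8") as lines:
        for i, line in enumerate(lines):
            if limit is not None and i >= limit:
                break
            line = line.strip()
            # Skip blank lines and N-Triples comments.
            if not line or line.startswith("#"):
                continue
            # Subject and predicate contain no whitespace in N-Triples,
            # so splitting twice isolates them from the object.
            parts = line.split(None, 2)
            if len(parts) == 3:
                counts[parts[1]] += 1
    return counts
```

A call such as `count_predicates("wiktionary.dbpedia.org_en.nt.bz2", limit=100000).most_common(10)` would then list the most frequent predicates in the first 100,000 lines, which gives a rough impression of the vocabulary and of how much noise there is.
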

The idea is somewhat different from DBpedia (although we use its framework):
instead of infoboxes and very specific extractors, we tried to build a meta-extractor that is declarative
rather than imperative.
The rationale is that, although some scrapers exist, none of them can parse more than 3-5 languages.
So we encode the language-specific characteristics of each Wiktionary in a machine-readable format
(e.g. the "config-de.xml").
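
To illustrate the declarative idea in general terms, here is a small Python sketch: one generic engine, with everything language-specific kept in data. This is explicitly not the schema of "config-de.xml" and not the actual extractor (which is written in Scala); the rule format, the property names, and the simplified wiki markup are all assumptions made up for this example.

```python
import re

# Hypothetical per-language rules: a regex with one capture group,
# plus the property to emit for the captured value.
CONFIGS = {
    "en": [
        {"pattern": r"^==\s*([^=]+?)\s*==$", "property": "language"},
        {"pattern": r"^===\s*([^=]+?)\s*===$", "property": "partOfSpeech"},
    ],
    "de": [
        {"pattern": r"\{\{Sprache\|([^}]+)\}\}", "property": "language"},
        {"pattern": r"\{\{Wortart\|([^|}]+)", "property": "partOfSpeech"},
    ],
}


def extract(page_title, wikitext, rules):
    """Apply every rule to every line; emit (subject, property, value) tuples."""
    triples = []
    for line in wikitext.splitlines():
        for rule in rules:
            match = re.search(rule["pattern"], line.strip())
            if match:
                triples.append((page_title, rule["property"], match.group(1)))
    return triples


english = extract("house", "==English==\n===Noun===\nA building.", CONFIGS["en"])
german = extract(
    "Haus",
    "== Haus ({{Sprache|Deutsch}}) ==\n=== {{Wortart|Substantiv|Deutsch}}, {{n}} ===",
    CONFIGS["de"],
)
print(english)
print(german)
```

The point of the design is that `extract` never changes: supporting another language edition means adding an entry to the config, not writing new code.
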