21 Dec 2011 15:13
[ANN] Wiktionary RDF-extraction with DBpedia for en and de
Dear all,

over the past months Sebastian and I have worked on a DBpedia-based extractor for Wiktionary. The main goal was to make it configurable enough that applying it to the different language editions of Wiktionary is a matter of configuration, not programming. The configuration should be manageable by someone who has a good understanding of the wiki syntax (and currently XML, but we plan to hide that too, via a web-based frontend similar to the mappings wiki) but who does not know Scala or RDF.

We now have configs and dumps for the English and German Wiktionary as examples, to show the state of our development and to start a discussion about design and implementation. If you are not interested in the technical details, you may skip the detailed description below and just evaluate the dump.

The English dump contains 16M triples and took 9 days on my dual-core 2 GHz laptop; that is about 4 articles per second. The German dump contains 1.3M triples and took 3 h (German has only 300k articles, whereas English has 3M, and the German config is smaller). We know that there is some "noise" in the data (incorrectly parsed data); we will fix that in the coming weeks.

The source code is in the "wiktionary" branch of the DBpedia SVN repository. The dump files can be downloaded here:

http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_en.nt.bz2
http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_de.nt.bz2

The idea is somewhat different from DBpedia (although we use the framework): instead of relying on infoboxes and very specific extractors, we tried to build a meta-extractor that is declarative rather than imperative. The rationale is that although some scrapers exist, none of them can parse more than 3-5 languages. So we encode the language-specific characteristics of each Wiktionary in a machine-readable format (e.g. the "config-de.xml").
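For anyone who only wants to evaluate the dumps linked above, here is a minimal Python sketch that streams a bz2-compressed N-Triples file and counts how often each predicate occurs. It is not part of the extractor; the function names are made up for illustration, and it assumes the dump follows the usual N-Triples layout of one triple per line with whitespace-separated subject, predicate and object.

```python
import bz2
from collections import Counter


def count_predicates(lines):
    """Count predicate occurrences in an iterable of N-Triples lines.

    Assumption: one triple per line, subject and predicate contain no
    whitespace (as N-Triples IRIs and blank node labels do not).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Split off subject and predicate; the rest is the object plus " ."
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts


def count_predicates_in_dump(path):
    """Stream a .nt.bz2 file without decompressing it to disk first."""
    # errors="replace" keeps the scan going if a line is badly encoded.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return count_predicates(fh)
```

After downloading a dump, something like `count_predicates_in_dump("wiktionary.dbpedia.org_de.nt.bz2").most_common(10)` should give a first impression of which properties dominate (the local file name here is an assumption). Lines that do not split into three parts are silently skipped, so the total may be slightly below the triple count if the dump contains malformed lines.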