16 Dec 2011 22:13
Parser status update
Gabriel Wicke <wicke <at> wikidev.net>
2011-12-16 21:13:33 GMT
Hello,

so far the HTML5 parser integration seems to have turned out quite well. 180 parser tests are now passing, and most of the remaining ones are about missing functionality in later stages of the parser pipeline.

Since this week, the produced HTML DOM can also be converted to WikiDom (or close to it). A sample result for the [[en:Barack Obama]] article is available at http://dev.wikidev.net/gabriel/tmp/obama.wikidom.txt. The unoptimized parse to WikiDom, without template expansions etc., currently takes about 35 seconds on my laptop.

The various moving parts of the setup (and how to try it out) are described in https://www.mediawiki.org/wiki/Future/Parser_development. In glorious ASCII, it might look roughly like this:

    PEG wiki/HTML tokenizer    (could also be any SAX-style parser)
     -> Token stream transformations
     -> HTML5 tree builder
     -> HTML DOM tree
     -> DOM Postprocessors
         +-> (X)HTML
         +-> DOMConverter -> WikiDom -> Visual Editor

The tokenizer is built from a completely static grammar, and leaves all configuration-dependent behavior to later stages.

Most of the interesting bits happen in the token stream transformations, which are dispatched by token type using a registration mechanism. The order of handlers can be specified, and early handlers can abort further processing for a token. Syntax-specific transformations on a token can register for early processing, so that later transformations on a token can operate on a [...]
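To make the dispatch idea concrete, here is a minimal Python sketch of handlers registered per token type, run in a specified order, with the option to abort further processing. It is only an illustration under assumptions: the class name, the `rank` parameter, the dict-shaped tokens and the `(token, keep_going)` return convention are all hypothetical and are not taken from the actual parser code.

```python
from collections import defaultdict


class TokenTransformDispatcher:
    """Hypothetical dispatcher: handlers are registered per token type."""

    def __init__(self):
        # token type -> list of (rank, registration order, handler)
        self._handlers = defaultdict(list)
        self._seq = 0

    def register(self, token_type, handler, rank=1.0):
        """Register a handler; lower ranks run earlier (assumed convention)."""
        self._seq += 1
        entries = self._handlers[token_type]
        entries.append((rank, self._seq, handler))
        entries.sort(key=lambda entry: (entry[0], entry[1]))

    def transform(self, token):
        """Run the handlers registered for the token's original type, in order."""
        for _rank, _seq, handler in self._handlers.get(token["type"], []):
            token, keep_going = handler(token)
            if token is None or not keep_going:
                break  # an early handler aborted further processing
        return token

    def transform_stream(self, tokens):
        """Transform a token stream, skipping tokens that were dropped."""
        for token in tokens:
            result = self.transform(token)
            if result is not None:
                yield result


def drop_comment(token):
    # Early handler: drop the token and stop all later handlers.
    return None, False


def lowercase_name(token):
    # Syntax-level normalization that later handlers can rely on.
    return {**token, "name": token["name"].lower()}, True


def mark_late(token):
    # Late handler: records the name it saw, to show it ran after normalization.
    return {**token, "seen_name": token.get("name")}, True


dispatcher = TokenTransformDispatcher()
dispatcher.register("tag", mark_late, rank=2.0)
dispatcher.register("tag", lowercase_name, rank=1.0)
dispatcher.register("comment", drop_comment, rank=0.5)
dispatcher.register("comment", mark_late, rank=2.0)

tokens = [
    {"type": "tag", "name": "DIV"},
    {"type": "comment", "text": "x"},
    {"type": "text", "value": "hi"},
]
out = list(dispatcher.transform_stream(tokens))
print(out)
```

In this sketch the late handler sees the already lowercased tag name, the comment token never reaches its late handler, and the text token passes through untouched because nothing is registered for its type.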