Gabriel Wicke | 16 Dec 2011 22:13

Parser status update

Hello,

so far the HTML5 parser integration seems to have turned out quite well.
180 parser tests are now passing, and most of the remaining failures are
due to missing functionality in later stages of the parser pipeline.

As of this week, the resulting HTML DOM can also be converted to WikiDom
(or something close to it). A sample result for the [[en:Barack Obama]]
article is available at
http://dev.wikidev.net/gabriel/tmp/obama.wikidom.txt. The unoptimized
parse to WikiDom, without template expansions etc., currently takes
about 35 seconds on my laptop.

The various moving parts of the setup (and how to try it out) are
described in https://www.mediawiki.org/wiki/Future/Parser_development.
In glorious ASCII, it might look roughly like this:

PEG wiki/HTML tokenizer         (could also be any SAX-style parser)
-> Token stream transformations
-> HTML5 tree builder
-> HTML DOM tree
-> DOM Postprocessors +-> (X)HTML
                      +-> DOMConverter -> WikiDom -> Visual Editor

The tokenizer is built from a completely static grammar, and leaves all
configuration-dependent behavior to later stages. Most of the interesting
bits happen in the token stream transformations, which are dispatched by
token type using a registration mechanism. The order of handlers can be
specified, and early handlers can abort further processing for a token.
Syntax-specific transformations on a token can register for early
processing, so that later transformations on a token can operate on a
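
To illustrate the registration mechanism described above, here is a
minimal sketch in Python. It is not the actual parser code: the class
name TokenTransformDispatcher, the numeric rank parameter, the
(token, keep_going) return convention, and the example token types and
handlers are all assumptions made up for this illustration. It only
shows the general idea of handlers registered per token type, run in a
specified order, with an early handler able to abort further processing
for a token.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple


@dataclass
class Token:
    type: str                      # hypothetical type names, e.g. "text", "template"
    value: str = ""
    attrs: dict = field(default_factory=dict)


# Assumed convention: a handler returns (token, keep_going).
# keep_going=False aborts the remaining handlers for this token;
# token=None drops the token from the output stream.
Handler = Callable[[Token], Tuple[Optional[Token], bool]]


class TokenTransformDispatcher:
    def __init__(self):
        # token type -> list of (rank, handler), kept sorted by rank
        self._handlers = defaultdict(list)

    def register(self, token_type: str, handler: Handler, rank: int = 50):
        """Register a handler for one token type; lower ranks run earlier."""
        handlers = self._handlers[token_type]
        handlers.append((rank, handler))
        # sort() is stable, so equal ranks keep their registration order
        handlers.sort(key=lambda pair: pair[0])

    def transform(self, tokens):
        """Run each token through the handlers registered for its type."""
        for token in tokens:
            current = token
            # Handlers are looked up once, by the type of the incoming token.
            for _rank, handler in self._handlers.get(token.type, []):
                current, keep_going = handler(current)
                if current is None or not keep_going:
                    break
            if current is not None:
                yield current


def expand_pagename(token):
    # Early, syntax-specific handler: replace one known template with
    # text and stop any later handlers from seeing it.
    if token.value == "PAGENAME":
        return Token("text", "Main Page"), False
    return token, True


def mark_unexpanded(token):
    # Late handler: only reached if no earlier handler aborted.
    token.attrs["unexpanded"] = True
    return token, True


dispatcher = TokenTransformDispatcher()
dispatcher.register("template", mark_unexpanded, rank=90)
dispatcher.register("template", expand_pagename, rank=10)

stream = [
    Token("text", "Hello "),
    Token("template", "PAGENAME"),
    Token("template", "Infobox"),
]
result = list(dispatcher.transform(stream))
for tok in result:
    print(tok)
```

In this sketch the "PAGENAME" token is rewritten by the early handler
and never reaches mark_unexpanded, while the "Infobox" token passes
through both handlers. Tokens with no registered handlers are passed
along unchanged.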

