Ori Livneh | 2 Apr 2012 19:27
Picon
Gravatar

Re: Patch for parser

Awesome -- thanks!

I have a couple of questions to ask, too. These aren't urgent, so feel free to reply whenever you have the time.

1) Conformance vs. enhancement

I'm not sure if you saw my comment in the diff, but some of the ISBN validation code I wrote is deliberately unreachable because it's stricter than the naive validation the Mediawiki currently performs. (I stupidly wrote it before checking to see what exactly the current parser does.) In general, is it ever desirable to try and improve on the old parsing grammar or is total conformance the overriding goal?

2) ECMAScript target

Should I strive to write code that is compatible with older browsers? Thus far I've avoided ES 1.5 constructs like forEach, map, filter, etc. But if this is only going to run server-side under node for the foreseeable future, maybe that's an unnecessary handicap. 

3) Intermediate representation vs. parser hacks

The preliminary work I've done to parse behavior switches (__TOC__ & friends) has the parser directly toggle attributes on a configuration object. It does this rather than produce a token for some subsequent tree visitor to interpret. So the parsing is arguably somewhat lossy in this case, but possibly not: when serializing back wikitext, you could read all behaviors from the configuration object and encode them at the top; their position isn't significant, afaik.)

So: did I make the right decision? PEG.js's support for arbitrary JavaScript code in the grammar makes this tempting to do sometimes, but perhaps it's wiser to insist that the parser not overreach the goal of building a tree. What is your take?

I'm CCing wikitext-l, since your responses are likely to be useful to future contributors.

Best,
Ori

On Mon, Apr 2, 2012 at 12:41 PM, Gabriel Wicke <gwicke <at> wikimedia.org> wrote:
Hi Ori,

I committed your patch in https://gerrit.wikimedia.org/r/#change,4094. Looks good, and lets 14 more tests pass ;) Thanks!!

Gabriel


On Mon, Apr 2, 2012 at 9:59 AM, Ori Livneh <ori.livneh <at> gmail.com> wrote:
Hey Gabriel,

Working on this on stolen time, as it were, but I do have some modest progress to report. The attached patch revises the handling of RFC auto-links and adds support for ISBNs, PMID links, and preliminary support for behavior switches (things like __TOC__).

I applied for git access, so hopefully I'll have my own branch set up soon and we'll be able to do this in a slightly less haphazard way

Hope you've had a nice trip,

Ori


_______________________________________________
Wikitext-l mailing list
Wikitext-l <at> lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
Gabriel Wicke | 3 Apr 2012 13:12
Favicon

Re: Patch for parser

On 04/02/2012 07:27 PM, Ori Livneh wrote:
> Awesome -- thanks!
> 
> I have a couple of questions to ask, too. These aren't urgent, so feel
> free to reply whenever you have the time.
> 
> 1) Conformance vs. enhancement
> 
> I'm not sure if you saw my comment in the diff, but some of the ISBN
> validation code I wrote is deliberately unreachable because it's
> stricter than the naive validation the Mediawiki currently performs. (I
> stupidly wrote it before checking to see what exactly the current parser
> does.) In general, is it ever desirable to try and improve on the old
> parsing grammar or is total conformance the overriding goal?

Breaking commonly-used wikitext constructs would need a very good
justification, while enhancements / clean-ups for little-used edge cases
are a good idea. I created a dumpGrepper tool (see
VisualEditor/tests/parser/dumpGrepper.js) to check how common some
constructs are.

For ISBN in particular, the following bug discussed the trade-offs
involved in stricter validation:
https://bugzilla.wikimedia.org/show_bug.cgi?id=2391
There is now strict validation and a warning in Special:Booksources, but
the parser still recognizes invalid ISBNs.

> 2) ECMAScript target
> 
> Should I strive to write code that is compatible with older browsers?
> Thus far I've avoided ES 1.5 constructs like forEach, map, filter, etc.
> But if this is only going to run server-side under node for
> the foreseeable future, maybe that's an unnecessary handicap. 

We don't have concrete plans to run this code on old clients, and by the
time that would happen most browsers might be recent enough. Performance
is important though, and simple constructs like for / while loops were
still faster on V8 last time I checked.

> 3) Intermediate representation vs. parser hacks
> 
> The preliminary work I've done to parse behavior switches (__TOC__ &
> friends) has the parser directly toggle attributes on a configuration
> object. It does this rather than produce a token for some subsequent
> tree visitor to interpret. So the parsing is arguably somewhat lossy in
> this case, but possibly not: when serializing back wikitext, you could
> read all behaviors from the configuration object and encode them at the
> top; their position isn't significant, afaik.)

The position of the switch is still significant to avoid dirty diffs.
Adam Wight actually implemented a bit of the back-end infrastructure for
this in https://gerrit.wikimedia.org/r/#change,4050. We should be able
to combine the two patches by joining your tokenizer work with Adam's
token transformer work.

> So: did I make the right decision? PEG.js's support for arbitrary
> JavaScript code in the grammar makes this tempting to do sometimes, but
> perhaps it's wiser to insist that the parser not overreach the goal of
> building a tree. What is your take?

The tokenizer should try hard to produce something round-trippable while
still trying to perform most context-free parsing if it can. It does
build a parse tree in the process, but then immediately flattens it by
emitting a stream of tokens. In this case we should emit tokens for the
behavior switches, which are then handled in a token stream transformer
(see Adam's patch).

I plan to port the wikitext serializer in the editor to operate on
tokens instead, so that we can do round-trip testing at different levels
in the processing chain. Serializing a DOM subtree is then equivalent to
walking the tree and calling the start / end token serializers for each
DOM node. Behavior switches thus need to be represented as (invisible)
nodes in the DOM as well to support round-tripping via the DOM. HTML5
Microdata would suggest the meta element for this, while I am not 100%
sure which would be the preferred element for RDFa. Perhaps a span
without text content. The mapping from internal token to meta / span can
be performed in the token stream transformer. We should standardize on a
single way to mark up invisible content to simplify serialization.

Gabriel

Gmane