leszekp | 7 Feb 10:57

javac lexer parser rewrite

Hello

Javac scanner and parser now are handwritten. The code, especially in parser is quite messy and
hard to read and modify.
It is possible to rewrite lexer and parser using some kind of java parser generator.
It would improve readability and allows for easier modifications.

There is a project 'compiler grammar' (which seems dormant). Java lexer and parser were rewritten
using antlr. But antrl generated parsers are very slow.

Many lexer and parser generators exists which are able to process 'classic' regular expressions for lexer or
context free grammars for parser and produce fast code (ie. jflex, beaver, jikes parser generator and more)

What do you think about it? Is there a need for such thing? Is it worth the effort?

Regards
Leszek Piotrowicz

Picon
Favicon

Re: javac lexer parser rewrite

Hi
let me start by saying that I agree with you - the current parser/lexer 
architecture is messy and it represent a barrier for other people to 
chime in and start to contribute. However, when I was working on a 
parser improvement related to lambda expressions (I added lookahead 
support), I was surprised to see how fast javac lexer/parser actually 
are. Here are some 'unofficial' numbers taken on my machine (each run 
correspond to lexing the 'jdk/src' folder of the JDK 8 repo):

Run1: 0m6.501s
Run2: 0m6.205s
Run3: 0m6.936s

AVG: 6.547
TOTAL FILES: 7846
AVG TIME/FILE: 0.83 * 10-6 s

So, is it messy? Sure - is it fast? Yes, like hell. So, to summarise,  I 
think that any effort to try to improve our parser/lexer architecture is 
definitively welcome - however, anyone embarking on such a project 
should keep the above numbers in mind - if you can achieve the same 
speed (well, even marginally slower would be acceptable) than it'd be an 
option well worth considering.

Maurizio

On 07/02/12 09:57, leszekp@... wrote:
> Hello
>
> Javac scanner and parser now are handwritten. The code, especially in parser is quite messy and
(Continue reading)

Per Bothner | 7 Feb 17:35
Picon
Favicon

Re: javac lexer parser rewrite

On 02/07/2012 02:27 AM, Maurizio Cimadamore wrote:
> So, is it messy? Sure - is it fast? Yes, like hell. So, to summarise, I
> think that any effort to try to improve our parser/lexer architecture is
> definitively welcome - however, anyone embarking on such a project
> should keep the above numbers in mind - if you can achieve the same
> speed (well, even marginally slower would be acceptable) than it'd be an
> option well worth considering.

Another advantage of a hard-written parser/lexer, besides speed, is that
it is flexible (if you need to handle special cases).

Furthermore, it is easy to understand what is going on, and easier to debug.
Imagine trying to set breakpoints in a generated parser or otherwise try
to figure out why something is parsed the way it is.  A hard-written
parser/lexer is more verbose and shows all the details - but at least
what you see is plain debuggable Java code, and you don't need a special
tool or understanding.
--

-- 
	--Per Bothner
per.bothner@...  
per@...   http://per.bothner.com/

leszekp | 7 Feb 12:29

Re: javac lexer parser rewrite

Hello
I would gladly experiment with lexing/parsing to see if machine generated code are of comparable speed
I wonder how you measure pure lexing time. Do you have some 'special' wrapper around javac to do it?

Leszek

-------- Original Message --------
From: Maurizio Cimadamore <maurizio.cimadamore@...>
To: leszekp@...
Cc: compiler-dev@...
Subject: Re: javac lexer parser rewrite
Date: Tue, 07 Feb 2012 10:27:10 +0000

> Hi
> let me start by saying that I agree with you - the current parser/lexer 
> architecture is messy and it represent a barrier for other people to 
> chime in and start to contribute. However, when I was working on a 
> parser improvement related to lambda expressions (I added lookahead 
> support), I was surprised to see how fast javac lexer/parser actually 
> are. Here are some 'unofficial' numbers taken on my machine (each run 
> correspond to lexing the 'jdk/src' folder of the JDK 8 repo):
> 
> Run1: 0m6.501s
> Run2: 0m6.205s
> Run3: 0m6.936s
> 
> AVG: 6.547
> TOTAL FILES: 7846
> AVG TIME/FILE: 0.83 * 10-6 s
> 
(Continue reading)

Jonathan Gibbons | 7 Feb 16:15
Picon
Favicon

Re: javac lexer parser rewrite

For javac, just write a small program to call the (internal) javac 
classes directly.

-- Jon

On 02/07/2012 03:29 AM, leszekp@... wrote:
> Hello
> I would gladly experiment with lexing/parsing to see if machine generated code are of comparable speed
> I wonder how you measure pure lexing time. Do you have some 'special' wrapper around javac to do it?
>
> Leszek
>
> -------- Original Message --------
> From: Maurizio Cimadamore<maurizio.cimadamore@...>
> To: leszekp@...
> Cc: compiler-dev@...
> Subject: Re: javac lexer parser rewrite
> Date: Tue, 07 Feb 2012 10:27:10 +0000
>
>> Hi
>> let me start by saying that I agree with you - the current parser/lexer
>> architecture is messy and it represent a barrier for other people to
>> chime in and start to contribute. However, when I was working on a
>> parser improvement related to lambda expressions (I added lookahead
>> support), I was surprised to see how fast javac lexer/parser actually
>> are. Here are some 'unofficial' numbers taken on my machine (each run
>> correspond to lexing the 'jdk/src' folder of the JDK 8 repo):
>>
>> Run1: 0m6.501s
>> Run2: 0m6.205s
(Continue reading)

Picon
Favicon

Re: javac lexer parser rewrite

On 07/02/12 11:29, leszekp-X8oORJ6WFSEJGwgDXS7ZQA@public.gmane.org wrote:
Hello I would gladly experiment with lexing/parsing to see if machine generated code are of comparable speed I wonder how you measure pure lexing time. Do you have some 'special' wrapper around javac to do it?

There are various way to do that; one way is to use a custom ScannerFactory that returns your own scanner and register it onto the javac context. By doing so you will be able to plug in a new Lexer inside javac (see [1] for an example). Alternatively, a more simpler way to test pure lexer time you could create your own lexer, and call 'nextToken' until the end of the file is reached. But, if you go for the micro-benchmark approach, remember to do something with the result of nextToken() [i.e. compute the hash of the token and add it to a shared counter] - if the return value of nextToken is unused, hotspot optimizations will kick in and there's a chance that the loop will be optimized away ;-).

[1] - http://hg.openjdk.java.net/jdk8/tl/langtools/file/tip/test/tools/javac/api/TestJavacTaskScanner.java

Maurizio
Leszek -------- Original Message -------- From: Maurizio Cimadamore <maurizio.cimadamore-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> To: leszekp-X8oORJ6WFSEJGwgDXS7ZQA@public.gmane.org Cc: compiler-dev-0nJqbsLSQw0FDOXUYO6UHQ@public.gmane.org Subject: Re: javac lexer parser rewrite Date: Tue, 07 Feb 2012 10:27:10 +0000
Hi let me start by saying that I agree with you - the current parser/lexer architecture is messy and it represent a barrier for other people to chime in and start to contribute. However, when I was working on a parser improvement related to lambda expressions (I added lookahead support), I was surprised to see how fast javac lexer/parser actually are. Here are some 'unofficial' numbers taken on my machine (each run correspond to lexing the 'jdk/src' folder of the JDK 8 repo): Run1: 0m6.501s Run2: 0m6.205s Run3: 0m6.936s AVG: 6.547 TOTAL FILES: 7846 AVG TIME/FILE: 0.83 * 10-6 s So, is it messy? Sure - is it fast? Yes, like hell. So, to summarise, I think that any effort to try to improve our parser/lexer architecture is definitively welcome - however, anyone embarking on such a project should keep the above numbers in mind - if you can achieve the same speed (well, even marginally slower would be acceptable) than it'd be an option well worth considering. Maurizio On 07/02/12 09:57, leszekp-X8oORJ6WFSEJGwgDXS7ZQA@public.gmane.org wrote:
Hello Javac scanner and parser now are handwritten. The code, especially in parser is quite messy and hard to read and modify. It is possible to rewrite lexer and parser using some kind of java parser generator. It would improve readability and allows for easier modifications. There is a project 'compiler grammar' (which seems dormant). Java lexer and parser were rewritten using antlr. But antrl generated parsers are very slow. Many lexer and parser generators exists which are able to process 'classic' regular expressions for lexer or context free grammars for parser and produce fast code (ie. jflex, beaver, jikes parser generator and more) What do you think about it? Is there a need for such thing? Is it worth the effort? Regards Leszek Piotrowicz

leszekp | 8 Feb 16:52

Re: javac lexer parser rewrite

Both hand-coded parser and generated one has some advantages and disadvantages.
Of course it is good to have plain java code and have the possiblitity to debug it.

But as level of complication rises, at some point hand-written parser becomes unamanageable anyway.
In example Pascal language was designed to be LL(1) and hand-written recursive descent
parser for this language is probably quite understandable. But java wasn't designed that way
and its hand-written parser has a lot of quirks which made it complicated to understand.

I am experimenting with jflex generated java lexer. It is very fast - comparable
to original javac Scanner, it is promising.

regards
Leszek

-------- Original Message --------
From: Per Bothner <per.bothner@...>
To: Maurizio Cimadamore <maurizio.cimadamore@...>
Cc: leszekp@..., compiler-dev@...
Subject: Re: javac lexer parser rewrite
Date: Tue, 07 Feb 2012 08:35:51 -0800

> On 02/07/2012 02:27 AM, Maurizio Cimadamore wrote:
> > So, is it messy? Sure - is it fast? Yes, like hell. So, to summarise, I
> > think that any effort to try to improve our parser/lexer architecture is
> > definitively welcome - however, anyone embarking on such a project
> > should keep the above numbers in mind - if you can achieve the same
> > speed (well, even marginally slower would be acceptable) than it'd be an
> > option well worth considering.
> 
> Another advantage of a hard-written parser/lexer, besides speed, is that
> it is flexible (if you need to handle special cases).
> 
> Furthermore, it is easy to understand what is going on, and easier to debug.
> Imagine trying to set breakpoints in a generated parser or otherwise try
> to figure out why something is parsed the way it is.  A hard-written
> parser/lexer is more verbose and shows all the details - but at least
> what you see is plain debuggable Java code, and you don't need a special
> tool or understanding.
> -- 
> 	--Per Bothner
> per.bothner@...  
per@...   http://per.bothner.com/

Rémi Forax | 8 Feb 17:55
Picon

Re: javac lexer parser rewrite

On 02/08/2012 04:52 PM, leszekp@... wrote:
> Both hand-coded parser and generated one has some advantages and disadvantages.
> Of course it is good to have plain java code and have the possiblitity to debug it.
>
> But as level of complication rises, at some point hand-written parser becomes unamanageable anyway.

Do you have taken a look to the source code. ?

The parser is readable mostly because most the methods correspond to
a LR state and that the dotted productions of that state
are available in the documentation of the methods.

> In example Pascal language was designed to be LL(1) and hand-written recursive descent
> parser for this language is probably quite understandable. But java wasn't designed that way
> and its hand-written parser has a lot of quirks which made it complicated to understand.

Sorry, Java 1.0 was designed to be LALR(1), there is some discussions in 
the JLS 1 about that.

The main issue, is more that the grammar of the spec was not updated 
correctly after that.
This had some painful drawbacks like by example the fact that some parts 
of the grammar
(the one allocating a generics inner class by example) was missing from 
the parser
when the jdk 1.5 was released.

Also while I know that the grammar of Java 5 is LALR, I've no idea in 
the one of
the upcoming Java 8 is still LALR.

>
> I am experimenting with jflex generated java lexer. It is very fast - comparable
> to original javac Scanner, it is promising.

The problem is more the parser than the lexer.

A good project should be to write a parser generator that takes the Java 
grammar
and generates the same code as the existing parser, by the way.

>
> regards
> Leszek

cheers,
Rémi

Jonathan Gibbons | 8 Feb 18:25
Picon
Favicon

Re: javac lexer parser rewrite

On 02/08/2012 08:55 AM, Rémi Forax wrote:
>
> The problem is more the parser than the lexer.
>
> A good project should be to write a parser generator that takes the 
> Java grammar
> and generates the same code as the existing parser, by the way. 

The outcome of the compiler-grammar project was that there are 
effectively 3 grammars, close but not necessarily equal.

1. The rules embedded in chapters 1-17 of JLS.

2. The rules in chapter 18 of JLS

3. The grammar accepted by javac.

A good project would be to identify the differences between these 
grammars. :-)

-- Jon

leszekp | 9 Feb 14:17

Re: javac lexer parser rewrite

-------- Original Message --------
From: Rémi Forax<forax@...>
Apparently from: compiler-dev-bounces@...
To: compiler-dev@...
Subject: Re: javac lexer parser rewrite
Date: Wed, 08 Feb 2012 17:55:33 +0100

> On 02/08/2012 04:52 PM, leszekp@... wrote:
> > Both hand-coded parser and generated one has some advantages and disadvantages.
> > Of course it is good to have plain java code and have the possiblitity to debug it.
> >
> > But as level of complication rises, at some point hand-written parser becomes unamanageable anyway.
> 
> Do you have taken a look to the source code. ?
> 
> The parser is readable mostly because most the methods correspond to
> a LR state and that the dotted productions of that state
> are available in the documentation of the methods.

It depends on what you define as readable. For a creator of javac parser it is probably very readable :)
but for a developer trying to experiment with it represents a high barrier to entry.

What I'm going to say is that yacc like specification (rules + attached actions)
would be more readable and easier to modify.

> 
> > In example Pascal language was designed to be LL(1) and hand-written recursive descent
> > parser for this language is probably quite understandable. But java wasn't designed that way
> > and its hand-written parser has a lot of quirks which made it complicated to understand.
> 
> Sorry, Java 1.0 was designed to be LALR(1), there is some discussions in 
> the JLS 1 about that.
> 
> The main issue, is more that the grammar of the spec was not updated 
> correctly after that.
> This had some painful drawbacks like by example the fact that some parts 
> of the grammar
> (the one allocating a generics inner class by example) was missing from 
> the parser
> when the jdk 1.5 was released.
> 
> Also while I know that the grammar of Java 5 is LALR, I've no idea in 
> the one of
> the upcoming Java 8 is still LALR.
> 
I realize that java was designed to be LALR(1) - at least initially.
LL(1) grammars fits well for hand-made recursive descent implementation
LALR grammars are much harder to implement by hand, thats why they are rather processed by parser generators.

> >
> > I am experimenting with jflex generated java lexer. It is very fast - comparable
> > to original javac Scanner, it is promising.
> 
> The problem is more the parser than the lexer.
> 
> A good project should be to write a parser generator that takes the Java 
> grammar
> and generates the same code as the existing parser, by the way.

I agree with you. java parser is far more complicated than lexer,
so making it more clean would be more beneficial than rewriting the lexer.
But I understand it in another way.
A good project would be to write LALR grammar which after processing generates java parser and:
1 Grammar should be more readable than existing java parser code
2 Generated parser should generate the same output (abstract syntax tree) for the same program
3 Generated parser should not be significantly slower - ideally faster
4 Ideally it should report the same set of errors.

regards
Leszek

Brian Goetz | 20 Feb 05:27
Picon
Favicon

Re: javac lexer parser rewrite

There was a project that was undertaken a while ago to create an ANTLR 
'reference' grammar for Java.  This project had some success (it passed 
a tree-for-tree comparison with javac on a reasonable corpus) but has 
since decayed.  Performance-wise, I think it was about 10x slower than 
the existing javac parser.

I think it would be fantastic for someone to pick up the work that was 
done, and bring it up to date.  A ready-to-use ANTLR grammar would be 
great for anyone building tools that need to ingest java source code, 
even if javac didn't use it.  Because so many people know ANTLR, an 
ANTLR grammar would probably be more generally useful than one generated 
with less popular tools, since that would enable more people to be able 
to customize it for their own purposes.

Are you interested in working on this?  I'm sure we could dig up the 
existing work and test suites as a starting point.

On 2/7/2012 4:57 AM, leszekp@... wrote:
> Hello
>
> Javac scanner and parser now are handwritten. The code, especially in parser is quite messy and
> hard to read and modify.
> It is possible to rewrite lexer and parser using some kind of java parser generator.
> It would improve readability and allows for easier modifications.
>
> There is a project 'compiler grammar' (which seems dormant). Java lexer and parser were rewritten
> using antlr. But antrl generated parsers are very slow.
>
> Many lexer and parser generators exists which are able to process 'classic' regular expressions for
lexer or
> context free grammars for parser and produce fast code (ie. jflex, beaver, jikes parser generator and more)
>
> What do you think about it? Is there a need for such thing? Is it worth the effort?
>
> Regards
> Leszek Piotrowicz

leszekp | 20 Feb 15:33

Re: javac lexer parser rewrite

Brian
I agree with you than java grammar would be beneficiary for
tool builders, compiler hackers etc, even if the official javac does not use it.

I would rather write grammar for LALR parser generator than LL one like antlr
We have at lest 3 suitable tools: Beaver, Byaccj and Java-cup

Advantages of LALR grammar
1 The grammar would be more similar to 'official' java grammar
as it would not need left factoring required by LL grammars
Left lefactoring is the main reason which makes LL grammar hard to read.

2 Genarated parser would be faster than Antlr parser, maybe comparable
to existing hand-written parser (for beaver or byaccj). Java-cup wasn't
written with performance as top priority.

Disadvantages
Antlr is more popular - more people use it. However LALR grammar consists
of grammar rules + actions, the tools mentioned above are not that hard to use. 

In my opinion it would be better to try to write lalr grammar than
bringing antlr grammar up to date. I'm interested in it - some time ago I experimented
with java 5 lalr grammar.

Regards
Leszek

-------- Original Message --------
From: Brian Goetz <brian.goetz@...>
To: leszekp@...
Cc: compiler-dev@...
Subject: Re: javac lexer parser rewrite
Date: Sun, 19 Feb 2012 23:27:16 -0500

> There was a project that was undertaken a while ago to create an ANTLR 
> 'reference' grammar for Java.  This project had some success (it passed 
> a tree-for-tree comparison with javac on a reasonable corpus) but has 
> since decayed.  Performance-wise, I think it was about 10x slower than 
> the existing javac parser.
> 
> I think it would be fantastic for someone to pick up the work that was 
> done, and bring it up to date.  A ready-to-use ANTLR grammar would be 
> great for anyone building tools that need to ingest java source code, 
> even if javac didn't use it.  Because so many people know ANTLR, an 
> ANTLR grammar would probably be more generally useful than one generated 
> with less popular tools, since that would enable more people to be able 
> to customize it for their own purposes.
> 
> Are you interested in working on this?  I'm sure we could dig up the 
> existing work and test suites as a starting point.
> 
> On 2/7/2012 4:57 AM, leszekp@... wrote:
> > Hello
> >
> > Javac scanner and parser now are handwritten. The code, especially in parser is quite messy and
> > hard to read and modify.
> > It is possible to rewrite lexer and parser using some kind of java parser generator.
> > It would improve readability and allows for easier modifications.
> >
> > There is a project 'compiler grammar' (which seems dormant). Java lexer and parser were rewritten
> > using antlr. But antrl generated parsers are very slow.
> >
> > Many lexer and parser generators exists which are able to process 'classic' regular expressions for
lexer or
> > context free grammars for parser and produce fast code (ie. jflex, beaver, jikes parser generator and more)
> >
> > What do you think about it? Is there a need for such thing? Is it worth the effort?
> >
> > Regards
> > Leszek Piotrowicz

Dawid Weiss | 20 Feb 16:06
Picon
Gravatar

Re: javac lexer parser rewrite

Leszek,

Sorry for chiming in so late about this.

> I would rather write grammar for LALR parser generator than LL one like antlr
> We have at lest 3 suitable tools: Beaver, Byaccj and Java-cup

You may want to take a peek at what folks did in qdox here --
http://qdox.codehaus.org/

The home seems to be out of date but maven will come up with the sources:
http://repo1.maven.org/maven2/com/thoughtworks/qdox/qdox/1.12/qdox-1.12-project.zip

>From what I can see it's jflex/byaccj so you can take this and
improve/ refactor to your needs.

Dawid


Gmane