Aleksey Shipilev | 8 Jun 2012 21:47

Fork/Join traces: instrumentation and rendering

Hi everyone,

I was crawling through a nasty performance problem with FJP in my
particular use case. This forced me to add instrumentation to FJP,
and also to sketch up some FJP-specific analyzers.

While thinking about what to do with that next, I figured it would be
good to ask whether this tool is something the community wants/needs, or
whether I can just throw it in the trunk and leave it there. (Maybe it
is not even worth contaminating GitHub with.)

The bundle is at http://shipilev.net/pub/stuff/fjp-trace/. See README
there. In short, this provides instrumented FJP, which dumps the
execution traces, which then can be analyzed, down to fork-join
dependencies, steals, parks-unparks, etc.

For instance, this is one use case I was chasing:
 - http://shipilev.net/pub/stuff/fjp-trace/fjp-trace-sample.png
 - that's a 4x10x2 Nehalem running 8b39-lambda
 - doing FJP.submit().get(), hence waiting for external task to complete
 - note that FJP ramps up really quickly, in less than 5ms
 - two heavily unbalanced tasks are seen as green bars
 - six joiners are waiting on those tasks to complete
 - one could also estimate the actual integral parallelism
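
For readers who want to reproduce a similar picture, here is a minimal sketch of the submit().get() shape described in the list above. It is not the workload from the screenshot: the class name, the leaf count and the HEAVY/LIGHT iteration counts are made up for illustration, and the counting loop merely stands in for real work.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical demo: two leaves are far heavier than the rest, so most
// joiners end up waiting for them.
final class ImbalancedJoinSketch {
    static final long HEAVY = 5_000_000L; // assumed cost of the two big leaves
    static final long LIGHT = 1_000L;     // assumed cost of every other leaf

    static final class Work extends RecursiveTask<Long> {
        final int lo, hi;
        Work(int lo, int hi) { this.lo = lo; this.hi = hi; }

        @Override
        protected Long compute() {
            if (hi - lo <= 1) {
                // Leaves 0 and 1 are the heavy ones; the loop is placeholder work.
                long iters = (lo < 2) ? HEAVY : LIGHT;
                long done = 0;
                for (long i = 0; i < iters; i++) done++;
                return done;
            }
            int mid = (lo + hi) >>> 1;
            Work left = new Work(lo, mid);
            left.fork();                               // left half may be stolen
            long right = new Work(mid, hi).compute();  // right half runs inline
            return right + left.join();                // joiner waits for the left half
        }
    }

    // External submission followed by get(), as in the use case above.
    static long run(int leaves) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.submit(new Work(0, leaves)).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("iterations executed: " + run(8));
    }
}
```

Run against the instrumented pool instead of the stock one, a task tree like this should show a few long bars with joiners parked behind them.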

Thoughts?

-Aleksey.
Aleksey Shipilev | 9 Jun 2012 15:53

Re: Fork/Join traces: instrumentation and rendering

FYI, there is an updated version [1], which piggybacks heavily on
WorkQueue mechanics to avoid serial bottlenecks when writing the
trace, and uses Unsafe to provide a binary representation for traces
without resorting to string ops. With tracing enabled, it is about 20x
faster on my laptop than the previous one. This is only visible when
there are lots of FJP events, which is always the case when tasks are
tiny. With modest task sizes, the overhead of tracing is bearable.
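
To illustrate the general idea (per-thread buffers plus fixed-size binary records), here is a rough sketch. It is not the fjp-trace code: it swaps in a ThreadLocal ByteBuffer where the real thing uses WorkQueue and Unsafe, and the record layout, event names and buffer size are assumptions.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: each thread appends fixed-size binary records to its
// own buffer, so writers never contend and no strings are built.
final class BinaryTraceSketch {
    enum EventType { FORK, JOIN, STEAL, PARK, UNPARK }

    // Assumed layout: long timestamp, int event ordinal, int task id.
    static final int RECORD_BYTES = Long.BYTES + Integer.BYTES + Integer.BYTES;

    private static final ThreadLocal<ByteBuffer> BUF =
        ThreadLocal.withInitial(() -> ByteBuffer.allocate(RECORD_BYTES * 1024));

    static void record(EventType type, int taskId) {
        ByteBuffer b = BUF.get();
        if (b.remaining() < RECORD_BYTES) {
            b.clear(); // a real tracer would flush here; this sketch just overwrites
        }
        b.putLong(System.nanoTime()).putInt(type.ordinal()).putInt(taskId);
    }

    // Bytes written so far by the calling thread.
    static int recordedBytes() {
        return BUF.get().position();
    }

    // Decodes the record at the given index in the calling thread's buffer.
    static String decode(int index) {
        ByteBuffer b = BUF.get();
        int off = index * RECORD_BYTES;
        EventType type = EventType.values()[b.getInt(off + Long.BYTES)];
        int taskId = b.getInt(off + Long.BYTES + Integer.BYTES);
        return type + ":" + taskId;
    }

    public static void main(String[] args) {
        record(EventType.FORK, 1);
        record(EventType.STEAL, 1);
        System.out.println(decode(0) + " " + decode(1));
    }
}
```

String formatting is deferred to decode time, which is where most of the savings over a string-based trace would come from.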

-Aleksey.

[1] http://shipilev.net/pub/stuff/fjp-trace/


Chris Vest | 9 Jun 2012 19:28

Re: Fork/Join traces: instrumentation and rendering

I think it's an interesting way to visualise how the FJP crunches a task. As in your example, it can help show if the tasks are badly balanced, though I don't know how big a problem that is in practice. GitHub repositories are a pretty cheap commodity, so I don't think this would be contaminating it :)

One thing it makes me wonder: if the FJP is executing some slow tasks and a number of joiners are waiting on them, and then another task is submitted, will there be fewer threads to handle the new work?


_______________________________________________
Concurrency-interest mailing list
Concurrency-interest <at> cs.oswego.edu
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

Doug Lea | 10 Jun 2012 23:10

Re: Fork/Join traces: instrumentation and rendering


> One thing it makes me wonder is, if the FJP is executing some slow tasks
> and a number of joiners are waiting on them, and then another task is
> submitted, then there will be fewer threads to handle the new work?
>

Sometimes. ForkJoinPool may internally generate new ones -- always
enough to avoid starvation, but usually fewer than the parallelism level.
When some subtasks are expected to block or take
a much longer time than others, it is more efficient
to use CountedCompleters instead, which avoid cascaded blocking
at the expense of harder-to-use APIs.
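
As a minimal sketch of the CountedCompleter style mentioned above (not code from this thread): the array-sum task, the 1024-element threshold and the AtomicLong accumulator are all illustrative choices. It uses java.util.concurrent.CountedCompleter as shipped since Java 8.

```java
import java.util.concurrent.CountedCompleter;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example: sums an array without any task calling join().
final class CompleterSketch {

    static final class SumTask extends CountedCompleter<Void> {
        final int[] data;
        final int lo, hi;
        final AtomicLong total;

        SumTask(CountedCompleter<?> parent, int[] data, int lo, int hi, AtomicLong total) {
            super(parent);
            this.data = data;
            this.lo = lo;
            this.hi = hi;
            this.total = total;
        }

        @Override
        public void compute() {
            int l = lo, h = hi;
            while (h - l > 1024) {          // assumed split threshold
                int mid = (l + h) >>> 1;
                addToPendingCount(1);       // one more child must finish before we complete
                new SumTask(this, data, mid, h, total).fork();
                h = mid;
            }
            long s = 0;
            for (int i = l; i < h; i++) s += data[i];
            total.addAndGet(s);
            tryComplete();                  // signals the parent instead of blocking in join()
        }
    }

    static long sum(int[] data) {
        AtomicLong total = new AtomicLong();
        ForkJoinPool pool = new ForkJoinPool();
        try {
            pool.invoke(new SumTask(null, data, 0, data.length, total));
        } finally {
            pool.shutdown();
        }
        return total.get();
    }

    public static void main(String[] args) {
        int[] data = new int[100_000];
        java.util.Arrays.fill(data, 1);
        System.out.println(sum(data));
    }
}
```

No worker ever waits in join() here; completion propagates upward through the pending counts, which is the cascaded blocking being avoided. The price is the manual bookkeeping with addToPendingCount() and tryComplete().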

Thanks to Aleksey for making the profiler available!
It is, among other things, a helpful tool for showing
the impact of these kinds of design decisions.

-Doug
Aleksey Shipilev | 17 Jun 2012 17:21

Re: Fork/Join traces: instrumentation and rendering

On 06/11/2012 01:10 AM, Doug Lea wrote:
> Thanks to Aleksey for making the profiler available!
> It is, among other things, a helpful tool for showing
> the impact of these kinds of design decisions.

FYI, I have pushed fjp-trace to GitHub [1]. You can use the issue tracker
there to file bugs against it.

Additionally, I have pulled vanilla JSR166 FJP into the jsr166 branch, so
one can see the changes between the baseline and the instrumented FJP by
comparing the branches [2]. Doug, could you consider merging the
instrumentation code directly into JSR166?

-Aleksey.

[1] https://github.com/shipilev/fjp-trace
[2] https://github.com/shipilev/fjp-trace/compare/jsr166...master
Doug Lea | 17 Jun 2012 22:39

Re: Fork/Join traces: instrumentation and rendering

On 06/17/12 11:21, Aleksey Shipilev wrote:
> On 06/11/2012 01:10 AM, Doug Lea wrote:
>> Thanks to Aleksey for making the profiler available!
>> It is, among other things, a helpful tool for showing
>> the impact of these kinds of design decisions.
>
> FYI, I had pushed fjp-trace to GitHub [1].

Thanks!

> Additionally, I had pulled in vanilla JSR166 FJP into jsr166 branch, so
> one can see the changes between baseline and instrumented FJP by
> comparing the branches [2]. Doug, can you consider merging
> instrumentation code directly into JSR166?

I/We don't see a way to do it with zero guaranteed impact vs
non-instrumented code. But after giving fjp-trace some time to
settle in, we'll find some reasonable way to semi-integrate
to make it easier to use for diagnostics.

-Doug

> [1] https://github.com/shipilev/fjp-trace
> [2] https://github.com/shipilev/fjp-trace/compare/jsr166...master
>
Aleksey Shipilev | 17 Jun 2012 22:52

Re: Fork/Join traces: instrumentation and rendering

On 06/18/2012 12:39 AM, Doug Lea wrote:
> I/We don't see a way to do it with zero guaranteed impact vs
> non-instrumented code. But after giving fjp-trace some time to
> settle in, we'll find some reasonable way to semi-integrate
> to make it easier to use for diagnostics.

Sure, let it brew. In the meantime, let's consider more lightweight
ways to hook up the instrumentation code. I would go the code weaving
route, but that will require specific join points within FJP to
weave to. Would calling dummy (static?) FJP methods in the places where
registerEvent() is called in the current code work as a zero-impact marker?
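
To make the question concrete, this is roughly what such markers could look like. It is only a sketch: the class, the hook names onFork/onJoin and the surrounding method are hypothetical and not taken from FJP or fjp-trace.

```java
// Hypothetical sketch of "dummy static methods as weaving markers".
final class MarkerHookSketch {

    // Intentionally empty. A weaver could attach tracing advice to calls of
    // these methods; without weaving, a JIT is expected (not guaranteed) to
    // inline them away.
    static void onFork(int taskId) { }
    static void onJoin(int taskId) { }

    // Stand-in for a pool-internal code path with markers at the points of interest.
    static long sumWithMarkers(long[] items) {
        long sum = 0;
        for (int i = 0; i < items.length; i++) {
            onFork(i);
            sum += items[i];
            onJoin(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumWithMarkers(new long[] {1, 2, 3}));
    }
}
```

The markers take primitives on purpose, so that calling them does not allocate even in the interpreter.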

(Now the crazy talk.) Other options are:
 - protecting instrumentation calls with asserts and weaving them into
existence when tracing is enabled? (Not really different from an in-place
static final boolean check.)
 - local variables with special names that get written with the event,
hoping a smart JIT will otherwise optimize the write away. (This does not
really work for the interpreter and inferior "java" VMs.)
 - anything else?
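
A small sketch of the first option, using plain -ea/-da rather than weaving. The class name and event counter are hypothetical; the registerEvent() here is a stand-in, not the fjp-trace method.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the tracing call hides inside an assert, so it only
// runs when the JVM is started with -ea.
final class AssertGuardSketch {
    static final AtomicInteger EVENTS = new AtomicInteger();

    // Returns true so that it can be used as an assert condition.
    static boolean registerEvent(String name) {
        EVENTS.incrementAndGet();
        return true;
    }

    static int work(int x) {
        assert registerEvent("work"); // skipped entirely when assertions are disabled
        return x * 2;
    }

    public static void main(String[] args) {
        System.out.println(work(21) + ", events=" + EVENTS.get());
    }
}
```

Under the hood javac compiles the assert into a check of a synthetic static final boolean, which is why this is not really different from writing that check by hand.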

-Aleksey.

Mohan Radhakrishnan | 18 Jun 2012 16:41

Re: Fork/Join traces: instrumentation and rendering

I have used AspectJ and tried it with FJ. It works but the change to the code after weaving aspects could introduce subtle concurrency bugs. I thought that merging a framework written by experts and aspects written by me is cause for trouble. The technology will work very well because I have woven aspects into even containers like WebSphere.


My idea was to use JavaFX to show a GUI by weaving into FJ. The POC worked.


Thanks,
Mohan 

Nathan Reynolds | 25 Jun 2012 18:24

Re: Fork/Join traces: instrumentation and rendering

Code weaving is a good route.  BTrace, AspectJ or other tools (?) will make this simple.  Static and member empty methods will be inlined by JIT optimization and hence no instructions will be added to the optimized code.  These methods will impact interpreted code and hence impact start up and warm up.  These methods will add a bit more memory usage since there are additional bytecodes and metadata about the methods.  I kind of doubt these latter 2 points will have any significant impact.

Static final booleans initialized to a constant will cause javac to discard the check and dead code.  No interpreter impact and no bytecode impact.  Static final booleans initialized at runtime will *hopefully* cause JIT optimization to throw out the check and dead code.  I haven't checked this latter point.
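
The two flavours of flag can be put side by side in a short sketch. The class name, the "sketch.trace" property and the event counter are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch contrasting a compile-time constant flag with a flag
// initialized at class-initialization time.
final class TraceFlagSketch {
    // Compile-time constant: javac emits no bytecode for blocks guarded by it.
    static final boolean TRACE_CONST = false;

    // Not a compile-time constant: the check stays in the bytecode and it is
    // up to the JIT to fold it once the class is initialized.
    static final boolean TRACE_RUNTIME = Boolean.getBoolean("sketch.trace");

    static final AtomicInteger EVENTS = new AtomicInteger();

    static int step(int x) {
        if (TRACE_CONST) {
            EVENTS.incrementAndGet();   // dropped by javac
        }
        if (TRACE_RUNTIME) {
            EVENTS.incrementAndGet();   // kept in bytecode, hopefully removed by the JIT
        }
        return x + 1;
    }

    public static void main(String[] args) {
        System.out.println(step(1) + ", events=" + EVENTS.get());
    }
}
```

Running javap -c on the compiled class shows the difference: only the TRACE_RUNTIME branch appears in step(). Starting the JVM with -Dsketch.trace=true turns the second branch on without recompiling.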

Nathan Reynolds | Consulting Member of Technical Staff | 602.333.9091
Oracle PSR Engineering | Server Technology
Mark Thornton | 25 Jun 2012 19:46

Re: Fork/Join traces: instrumentation and rendering

On 25/06/12 17:24, Nathan Reynolds wrote:
> Static final booleans initialized at runtime will *hopefully* cause JIT optimization to throw out the check and dead code.  I haven't checked this latter point.

They do in my experience. My tracing code relies on it to give zero cost when disabled. I think this also applies to assertions.

Mark Thornton

Aleksey Shipilev | 25 Jun 2012 20:18

Re: Fork/Join traces: instrumentation and rendering

It is somewhat different with concurrency code and pessimistic
compilers, and IMO that is why Doug is paranoid about this (not that I
disagree with him). The current code for fjp-trace does a similar
thing: it checks a static final variable first thing in the
instrumented call. Granted, there is still possible overhead for
calling the virtual method, which smart JITs are able to optimize. At
this point, re-read the first sentence.

I'm not sure if there exists any practical way to guarantee zero
overhead without bringing the "smart compiler" argument into the
discussion. Bytecode rewriting is one questionable solution, due to
the rather complicated hot paths in the FJP code. I would like to think
in this direction without assuming smart compilers.

-Aleksey.

Mohan Radhakrishnan | 26 Jun 2012 06:09

Re: Fork/Join traces: instrumentation and rendering

 
"Static and member empty methods will be inlined by JIT optimization and hence no instructions will be added to the optimized code."

Methods like 'buildQueues' that the new AspectJ compilers use are empty methods.
 
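import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;
import org.aspectj.lang.annotation.After;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Pointcut;

@Aspect
public class ForkJoinProjector {

    // Matches ForkJoinPool.registerWorker(..) and binds the worker thread and the pool.
    @Pointcut("execution(int java.util.concurrent.ForkJoinPool.registerWorker(java.util.concurrent.ForkJoinWorkerThread)) &&" +
              " args(thread) &&" +
              " target(pool)")
    public void buildQueues(ForkJoinWorkerThread thread,
                            ForkJoinPool pool) {}

    // Runs after each worker registration and prints the worker's id and name.
    @After("buildQueues(thread, pool)")
    public void build(ForkJoinWorkerThread thread,
                      ForkJoinPool pool) {
        System.out.println("ID " + thread.getId() + " Name " + thread.getName());
    }
}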
 
Thanks.



Re: Fork/Join traces: instrumentation and rendering

Hi Nathan,

I would actually like to see instrumentation added to many of the core libraries, but not in the way proposed here, because I agree with you that this is best left to method-level (typically timing/tracing/metering based) agents built on BCI (AspectJ, BCEL, ASM, ...). The type of instrumentation I would like to see added would not be possible with such approaches, at least not in a maintainable fashion.

The approach I have in mind would allow library developers to expose (and control the exposure of) underlying signals which may or may not be driving factors in the performance of the library in servicing a particular request/call. I have actually created this approach for extreme low-latency environments in which timing is not reasonable (sub-microsecond), and used it in our own software to allow the software to observe its own behavior, learn and understand which signals drive its performance (throughput or latency) under conditions not under its control, and then adapt.

I had hoped to present the approach at the JVM Language Summit this year, but unfortunately I am traveling (around Asia) at that time. We plan to release it, Signals, to the public in August, but hopefully next week or the week after I can get out a blog post which introduces some of the concepts and the reasoning behind it, drawn from our experience building a metering engine that measures the performance of low-latency (though variable depending on context/state) Java libraries.

Getting back on track a little, the reason for my response is that I do believe it is time to revisit the approach to serviceability (in particular observation) in the runtime and to (re)consider/focus on runtime adaptation (in particular optimization) above the HotSpot engine itself. I would like to see the JVM and JDK enhanced to make this far easier and standardized across libraries, frameworks and stacks. But to do this in a way that minimizes the risks due to changes, we might very well need some special help from the language and the JVM itself. This means the JVM/JDK team(s) need to be open to instrumentation in a manual form, though probably not in the form they are accustomed to seeing.

Today a return (or exception) gives the result of an interaction. Signals will look to expose (though to some degree under the control of the library developer) what actually transpired in producing it, which might very well involve multiple thread execution contexts. Think of it a bit like someone not just asking for the work they gave you to do, but also asking how it went in delivering the work, for the purpose of improvement (which can take many forms).

William
