Scott Dillard | 24 Jun 21:07

Library-vs-local performance

Hi,

I've got a library that I'm in the process of uploading to hackage (waiting for account) but the darcs repo is here:

http://graphics.cs.ucdavis.edu/~sdillard/Vec

I notice a slight drop in performance when I install the library using cabal. Maybe 10-20%, on one particular function. This is in comparison to when the library is 'local', as in, the source files are in the same directory as the client application.

What could be causing the performance drop? The function in question requires impractical amounts of inlining (this is something of an experiment), but I don't see how installing it as a library affects that. The functions to be inlined are small, and surely available in the .hi files. It's only when they are applied that they agglomerate into a big mess: 80-200K lines of core.

The function in question is invertMany in examples/Examples.hs.
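For readers unfamiliar with the pattern, here is a hypothetical miniature of the style described above (the `V3` type and every function name are illustrative, not taken from Vec). Each operation is a one-liner with an `INLINE` pragma, so GHC records its unfolding in the interface file; only at a call site do the pieces fuse into one larger expression.

```haskell
-- Hypothetical fixed-size vector type; strict fields keep the components
-- unboxed-friendly after inlining.
data V3 = V3 !Double !Double !Double
  deriving (Eq, Show)

-- Each operation is tiny and marked INLINE, so its unfolding goes into the .hi file.
vzip :: (Double -> Double -> Double) -> V3 -> V3 -> V3
vzip f (V3 a b c) (V3 x y z) = V3 (f a x) (f b y) (f c z)
{-# INLINE vzip #-}

vadd :: V3 -> V3 -> V3
vadd = vzip (+)
{-# INLINE vadd #-}

vscale :: Double -> V3 -> V3
vscale k (V3 a b c) = V3 (k * a) (k * b) (k * c)
{-# INLINE vscale #-}

dot :: V3 -> V3 -> Double
dot (V3 a b c) (V3 x y z) = a * x + b * y + c * z
{-# INLINE dot #-}

-- A client function: once the helpers above are inlined here, the Core for
-- lerpDot becomes one larger arithmetic expression with no helper calls.
lerpDot :: Double -> V3 -> V3 -> Double
lerpDot t u v = dot w w
  where
    w = vadd (vscale (1 - t) u) (vscale t v)
```

Compiling a module like this with `-O` and `-ddump-simpl` is one way to watch `lerpDot` collapse into straight-line arithmetic; the exact Core depends on the GHC version.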

Scott

_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users <at> haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Don Stewart | 24 Jun 21:50

Re: Library-vs-local performance

sedillard:
>    Hi,
> 
>    I've got a library that I'm in the process of uploading to hackage
>    (waiting for account) but the darcs repo is here:
> 
>    http://graphics.cs.ucdavis.edu/~sdillard/Vec
> 
>    I notice a slight drop in performance when I install the library using
>    cabal. Maybe 10-20%, on one particular function. This is in comparison to
>    when the library is 'local', as in, the source files are in the same
>    directory as the client application.

Lack of unfolding and inlining when compiled in a package? Try compiling
with -O2, for maximum unfolding.

>    What could be causing the performance drop? The function in question
>    requires impractical amounts of inlining (This is something of an
>    experiment), but I don't see how installing it as a library affects that.
>    The functions to be inlined are small, surely available in the .hi files.

You can check this via --show-iface

>    Its only when they are applied do they agglomerate into a big mess -
>    80-200K lines of core.
> 
>    The function in question is invertMany in examples/Examples.hs.

-- Don
Don Stewart | 24 Jun 21:56

Re: Library-vs-local performance

dons:
> sedillard:
> >    Hi,
> > 
> >    I've got a library that I'm in the process of uploading to hackage
> >    (waiting for account) but the darcs repo is here:
> > 
> >    http://graphics.cs.ucdavis.edu/~sdillard/Vec
> > 
> >    I notice a slight drop in performance when I install the library using
> >    cabal. Maybe 10-20%, on one particular function. This is in comparison to
> >    when the library is 'local', as in, the source files are in the same
> >    directory as the client application.
> 
> Lack of unfolding and inlining when compiled in a package? Try compiling
> with -O2, for maximum unfolding.
>   
> >    What could be causing the performance drop? The function in question
> >    requires impractical amounts of inlining (This is something of an
> >    experiment), but I don't see how installing it as a library affects that.
> >    The functions to be inlined are small, surely available in the .hi files.
> 
> You can check this via -show-iface
>  
> >    Its only when they are applied do they agglomerate into a big mess -
> >    80-200K lines of core.
> > 
> >    The function in question is invertMany in examples/Examples.hs.

Some general remarks: GHC isn't (yet) a whole-program compiler by
default, so it doesn't inline entire packages across package
boundaries. You can therefore lose some specialisation/inlining,
sometimes, by breaking things across module boundaries.

That said, it's entirely possible to program libraries in a way to
specifically allow full inlining of the libraries. The Data.Binary and
Data.Array.Vector libraries are written in this style for example,
which means lots of {-# INLINE #-} pragmas, maximum unfolding and high
optimisation levels.
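One common idiom behind that style is sketched below. It is a hypothetical example (the names `sumMapFromTo` and `sumSquares` are illustrative, not code from Data.Binary or uvector): GHC does not inline a self-recursive binding, so the exported function is kept non-recursive and marked `INLINE`, and the recursion lives in a local worker that gets copied into each call site along with it.

```haskell
-- Non-recursive INLINE wrapper around a local recursive worker `go`.
-- Because the wrapper itself is not recursive, GHC can inline it, and the
-- whole loop (including `go`) is copied into the caller.
sumMapFromTo :: (Int -> Double) -> Int -> Int -> Double
sumMapFromTo f lo hi = go lo 0
  where
    go i acc
      | i > hi    = acc
      | otherwise = let acc' = acc + f i
                    in acc' `seq` go (i + 1) acc'  -- keep the accumulator evaluated
{-# INLINE sumMapFromTo #-}

-- After inlining, `f` is the known lambda below, so the loop can be
-- specialised to it instead of calling an unknown function each iteration.
sumSquares :: Int -> Double
sumSquares n = sumMapFromTo (\i -> fromIntegral (i * i)) 1 n
```

Had `sumMapFromTo` itself been the recursive function, the `INLINE` pragma would have no useful effect and callers would jump into the compiled library instead.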

-- Don
Scott Dillard | 24 Jun 22:56

Re: Library-vs-local performance


> That said, it's entirely possible to program libraries in a way to
> specifically allow full inlining of the libraries. The Data.Binary and
> Data.Array.Vector libraries are written in this style for example,
> which means lots of {-# INLINE #-} pragmas, maximum unfolding and high
> optimisation levels.
>
> -- Don

Every function has an inline pragma. Adding -O2 -funfolding-use-threshold999 -funfolding-creation-threshold999 does not significantly change the produced .hi files (--show-iface produces roughly the same output, just different headers). This makes sense to me because the library doesn't actually _do_ anything. There are no significant compiled functions; everything should be inlined. And since the .hi files are the same, I don't see why they wouldn't be. The two scenarios are these:

1) Library is installed via cabal.
2) Library source lives in the same directory as the application, so that ghc --make Examples.hs also builds the library.

When compiling the application I set all knobs to 11. In case 1, ./Examples 3000000 runs in 6.9s, case 2 in 5.2s. The module structure is the same in both cases, so I don't know what inlining across module boundaries has to do with it.
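Because the gap is between two builds of the same program, it helps to time exactly the same computation in both. A minimal CPU-time harness using only `base` might look like this; `workload` and `timed` are hypothetical names, and `workload` is a placeholder, not `invertMany`.

```haskell
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)

-- Hypothetical stand-in workload; substitute the real benchmark
-- (e.g. invertMany from examples/Examples.hs) when comparing builds.
workload :: Int -> Double
workload n = go 0 0
  where
    go i acc
      | i >= n    = acc
      | otherwise = let acc' = acc + sqrt (fromIntegral i)
                    in acc' `seq` go (i + 1) acc'

-- Evaluate a value to weak head normal form between two CPU-clock readings
-- and return it together with the elapsed CPU time in seconds
-- (getCPUTime reports picoseconds). For a Double, WHNF means fully
-- evaluated; a lazy structure would need deeper forcing to be timed fairly.
timed :: a -> IO (a, Double)
timed x = do
  t0 <- getCPUTime
  r  <- evaluate x
  t1 <- getCPUTime
  return (r, fromIntegral (t1 - t0) / 1e12)
```

In a driver, `(r, secs) <- timed (workload n)` with `n` read from the command line keeps the measured region identical between the installed-library build and the local build.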

By the way, the library is now on hackage,
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Vec
but the documentation does not show up. What do I have to do to make this happen?


Scott
Don Stewart | 24 Jun 23:01

Re: Library-vs-local performance

sedillard:
>      That said, it's entirely possible to program libraries in a way to
>      specifically allow full inlining of the libraries. The Data.Binary and
>      Data.Array.Vector libraries are written in this style for example,
>      which means lots of {-# INLINE #-} pragmas, maximum unfolding and high
>      optimisation levels.
>      -- Don
> 
>    Every function has an inline pragma. Adding -O2
>    -funfolding-use-threshold999 -funfolding-creation-threshold999 does not
>    significantly change the produced .hi files (--show-iface produces roughly
>    the same files, just different headers)  This makes sense to me because
>    the library doesn't actually _do_ anything. There are no significant
>    compiled functions, everything should be inlined. And since the .hi files
>    are the same, I don't see why they wouldn't be. The two scenarios are
>    these:
> 
>    1) Library is installed via cabal.
>    2) Library source lives in the same directory as the application, so that
>    ghc --make Examples.hs also builds the library.

That's compiling Examples with full access to the source though!
So ghc has the entire source available. Once you've installed the
library, however, only what is exposed via the .hi files can be used
for optimisation purposes. So *something* is not being inlined fully (or
some other optimisation is interfering).

Boiling this down to the smallest test case you can would be *really* 
useful!!

> 
>    When compiling the application I set all knobs to 11. In case 1,
>    ./Examples 3000000 runs in 6.9s, case 2 in 5.2s. The module structure is
>    the same in both cases, so I don't know what inlining across module
>    boundaries has to do with it.

>    By the way, the library is now on hackage,
>    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Vec
>    but the documentation does not show up. What do I have to do to make this
>    happen?

Oh, assuming haddock can process it, it'll appear in a few hours. Haddock
is run periodically.
Ian Lynagh | 24 Jun 23:16

Re: Library-vs-local performance

On Tue, Jun 24, 2008 at 02:01:58PM -0700, Donald Bruce Stewart wrote:
> > 
> >    1) Library is installed via cabal.
> >    2) Library source lives in the same directory as the application, so that
> >    ghc --make Examples.hs also builds the library.
> 
> That's compiling Examples with full access to the source though!
> So ghc has the entire source available.

That shouldn't make any difference. I suspect a flag difference is to
blame - giving cabal build the -v flag will show which flags it is
using.

Thanks
Ian
Scott Dillard | 24 Jun 23:45

Re: Library-vs-local performance

I can't reproduce the behavior on any of the less egregiously inlined functions. For everything else the running times are the same using either local packages or installed libraries.

On Tue, Jun 24, 2008 at 3:16 PM, Ian Lynagh <igloo <at> earth.li> wrote:
> On Tue, Jun 24, 2008 at 02:01:58PM -0700, Donald Bruce Stewart wrote:
> > >
> > >    1) Library is installed via cabal.
> > >    2) Library source lives in the same directory as the application, so that
> > >    ghc --make Examples.hs also builds the library.
> >
> > That's compiling Examples with full access to the source though!
> > So ghc has the entire source available.
>
> That shouldn't make any difference. I suspect a flag difference is to
> blame - giving cabal build the -v flag will show which flags it is
> using.

I've taken all optimization flags out of the .cabal file. They don't have any effect. My understanding of things is this: (please correct if wrong) All functions have inline pragmas, and all are small (1 or 2 lines) so their definitions are all spewed into the .hi files. So in both scenarios (library vs local) GHC can "see" the whole library. Since every function is inlined, it doesn't matter what flags the library is compiled with. That compiled code will never be used so long as the application is compiled with optimization on. 

Now the particulars of the situation are this: the function in question is inlined very deeply, it has many instance constraints, and during simplification the core blows up to _ridiculous_ sizes. (Compilation with -ddump-simpl is taking about 5-10 min.) I think I'm pushing the compiler to unreasonable limits, and I think maybe something non-obvious is going on inside.

On the other hand, pushing the compiler in this way gets me a 3x speedup, which is nothing to sneeze at. In the meantime I'll see what I can do to make this function (Gaussian elimination) more amenable to simplification. The rest of the library works great.
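For reference, here is a plain list-based Gauss-Jordan inverse with partial pivoting. It is a hypothetical sketch (`invert` and `Matrix` are illustrative names), not Vec's `invertMany`, and it makes no attempt at the fixed-size unrolling that produces the large Core described above. The pivot search and the singularity check are the kind of data-dependent branches that straight-line vector arithmetic does not have.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

type Matrix = [[Double]]

-- Gauss-Jordan inversion with partial pivoting on an n x n matrix.
-- Returns Nothing if a pivot is (numerically) zero, i.e. the matrix is singular.
invert :: Matrix -> Maybe Matrix
invert m = fmap (map (drop n)) (go 0 augmented)
  where
    n = length m
    identityRow i = [if j == i then 1 else 0 | j <- [0 .. n - 1]]
    -- [A | I]; after elimination this becomes [I | A^-1]
    augmented = zipWith (++) m (map identityRow [0 .. n - 1])

    go k rows
      | k == n               = Just rows
      | abs pivotVal < 1e-12 = Nothing
      | otherwise            = go (k + 1) eliminated
      where
        -- row (at or below k) with the largest magnitude in column k
        p = maximumBy (comparing (\i -> abs (rows !! i !! k))) [k .. n - 1]
        pivotVal = rows !! p !! k
        -- swap rows k and p
        pick i
          | i == k    = rows !! p
          | i == p    = rows !! k
          | otherwise = rows !! i
        swapped = map pick [0 .. n - 1]
        -- scale the pivot row so its column-k entry is 1
        pivotRow = map (/ pivotVal) (swapped !! k)
        -- clear column k from every other row
        eliminated =
          [ if i == k then pivotRow
            else zipWith (\a b -> a - (r !! k) * b) r pivotRow
          | (i, r) <- zip [0 ..] swapped ]
```

The `1e-12` tolerance is an arbitrary choice for the sketch; a real library would likely scale it to the magnitude of the input.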

Scott
Don Stewart | 24 Jun 23:51

Re: Library-vs-local performance

sedillard:
>    I can't reproduce the behavior on any of the less egregiously inlined
>    functions. For everything else the running times are the same using either
>    local packages or installed libraries.
> 
>    On Tue, Jun 24, 2008 at 3:16 PM, Ian Lynagh <igloo <at> earth.li> wrote:
> 
>      On Tue, Jun 24, 2008 at 02:01:58PM -0700, Donald Bruce Stewart wrote:
>      > >
>      > >    1) Library is installed via cabal.
>      > >    2) Library source lives in the same directory as the application, so that
>      > >    ghc --make Examples.hs also builds the library.
>      >
>      > That's compiling Examples with full access to the source though!
>      > So ghc has the entire source available.
> 
>      That shouldn't make any difference. I suspect a flag difference is to
>      blame - giving cabal build the -v flag will show which flags it is
>      using.
> 
>    I've taken all optimization flags out of the .cabal file. They don't have
>    any effect. My understanding of things is this: (please correct if wrong)
>    All functions have inline pragmas, and all are small (1 or 2 lines) so
>    their definitions are all spewed into the .hi files. So in both scenarios
>    (library vs local) GHC can "see" the whole library. Since every function
>    is inlined, it doesn't matter what flags the library is compiled with.
>    That compiled code will never be used so long as the application is
>    compiled with optimization on. 
> 
>    Now the particulars of the situation are this: the function in question is
>    inlined very deeply, it has many instance constraints, and during
>    simplification the core blows up to _ridiculous_ sizes. (Compilation with
>    -ddump-simpl is taking about 5-10 min.) I think I'm pushing the compiler
>    to unreasonable limits, and I think maybe something non-obvious is going
>    on inside.
> 
>    On the otherhand, pushing the compiler in this way gets me a 3x speedup,
>    which is nothing to sneeze at. In the meantime I'll see what I can do to
>    make this function (gaussian elimination) more amenable to simplification.
>    The rest of the library works great.

You might want to give the simplifier enough time to unwind things.
I use, e.g.

            -O2
            -fvia-C -optc-O2
            -fdicts-cheap
            -fno-method-sharing
            -fmax-simplifier-iterations10
            -fliberate-case-threshold100

in my ghc-options for 'whole program' libraries.

Raise these limits if you find they're having an effect.
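The same kind of settings can also be attached to a single module with an `OPTIONS_GHC` pragma rather than a package-wide `ghc-options` field. The sketch below is an adaptation, not Don's configuration: it uses the `=`-separated spelling documented by current GHC releases (`-fmax-simplifier-iterations=10`, `-fliberate-case-threshold=100`) and leaves out `-fvia-C -optc-O2` and `-fno-method-sharing`, which may be rejected by current releases. The module body (`dotList`) is a hypothetical placeholder.

```haskell
{-# OPTIONS_GHC -O2 -fdicts-cheap -fmax-simplifier-iterations=10 -fliberate-case-threshold=100 #-}

-- Hypothetical placeholder body; the point is the pragma above, which
-- applies these simplifier settings to this module only.
dotList :: [Double] -> [Double] -> Double
dotList xs ys = sum (zipWith (*) xs ys)
```

Check the user's guide for your GHC version before copying flag values; defaults and accepted spellings have changed over time.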

-- Don
Scott Dillard | 25 Jun 00:05

Re: Library-vs-local performance



On Tue, Jun 24, 2008 at 3:51 PM, Don Stewart <dons <at> galois.com> wrote:
> >
> >    I've taken all optimization flags out of the .cabal file. They don't have
> >    any effect. My understanding of things is this: (please correct if wrong)
> >    All functions have inline pragmas, and all are small (1 or 2 lines) so
> >    their definitions are all spewed into the .hi files. So in both scenarios
> >    (library vs local) GHC can "see" the whole library. Since every function
> >    is inlined, it doesn't matter what flags the library is compiled with.
> >    That compiled code will never be used so long as the application is
> >    compiled with optimization on.
>
> You might want to give the simplifier enough time to unwind things.
> I use, e.g.
>
>            -O2
>            -fvia-C -optc-O2
>            -fdicts-cheap
>            -fno-method-sharing
>            -fmax-simplifier-iterations10
>            -fliberate-case-threshold100
>
> in my ghc-options for 'whole program' libraries.
>
> Raise these limits if you find they're having an effect
>
> -- Don


Yeah, I saw those in your uvector library and was going to ask: what do they do? Are they documented anywhere? I can't find any info on them. Specifically, what is the case liberation threshold? (I can't even find that on Google.) That sounds germane because the function in question is one of the few with branches.

And what effect do -fvia-C -optc-O2 have? Those refer to the generation of machine code, do they not? If the library is essentially a core-only library, why use them? As far as I can tell, even -O2 is ineffectual when compiling the library. 'Compiling' here is even a misnomer: we're just transliterating from Haskell to core.

Scott

Don Stewart | 25 Jun 00:15

Re: Library-vs-local performance

sedillard:
>    On Tue, Jun 24, 2008 at 3:51 PM, Don Stewart <dons <at> galois.com> wrote:
> 
>      >
>      >    I've taken all optimization flags out of the .cabal file. They don't have
>      >    any effect. My understanding of things is this: (please correct if wrong)
>      >    All functions have inline pragmas, and all are small (1 or 2 lines) so
>      >    their definitions are all spewed into the .hi files. So in both scenarios
>      >    (library vs local) GHC can "see" the whole library. Since every function
>      >    is inlined, it doesn't matter what flags the library is compiled with.
>      >    That compiled code will never be used so long as the application is
>      >    compiled with optimization on.
> 
>      You might want to give the simplifier enough time to unwind things.
>      I use, e.g.
> 
>                 -O2
>                 -fvia-C -optc-O2
>                 -fdicts-cheap
>                 -fno-method-sharing
>                 -fmax-simplifier-iterations10
>                 -fliberate-case-threshold100
> 
>      in my ghc-options for 'whole program' libraries.
> 
>      Raise these limits if you find they're having an effect
>      -- Don
> 
>    Yeah I saw those in your uvector library and was going to ask: what do
>    they do? Are they documented anywhere? I can't find any info on them.
>    Speicifcally, what is the case liberation threshold? (Can't even find that
>    on google.) That sounds germane because the function in question is one of
>    the few with branches.
> 
>    And what effect does -fvia-C -optc-O2 have? Those refer to the generation
>    of machine code, do they not? If the library is essentially a core-only
>    library, why use them? As far as I can tell, even -O2 is ineffectual when
>    compiling the library. 'Compiling' here is even a misnomer. We're just
>    transliterating from haskell to core.

Nope, there's a lot of optimisations taking place on the core-to-core
phase, to ensure the core that gets unfolded into your .hi files is as
nice as possible. And then still there's things that actually stay as
calls into your compiled library -- for those, you'll want direct jumps
and so forth, which you get with -fvia-C -optc-O2 and above.
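The liberate-case pass Scott asks about is one of these core-to-core optimisations. `-fliberate-case-threshold` sets the size limit for the transformation (which `-O2` enables). It targets a recursive function that repeatedly scrutinises a free variable: GHC unrolls the function once inside the outer `case`, so in the copy the variable is already known to be evaluated and the inner loop no longer tests it. The pair below is a hand-written, source-level approximation with hypothetical names; GHC performs the real transformation on Core.

```haskell
-- Hypothetical data type for illustration.
data V = V Int Int

-- Before: `v` is free in the recursive `loop` and is examined by `case`
-- on every iteration.
before :: V -> Int -> [Int]
before v n = loop n
  where
    loop k
      | k <= 0    = []
      | otherwise = case v of
          V a _ -> a : loop (k - 1)

-- After (hand-written approximation): the loop is unrolled once under the
-- outer `case`, so the inner copy already has `a` in scope and never
-- re-examines `v`.
after :: V -> Int -> [Int]
after v n = loop n
  where
    loop k
      | k <= 0    = []
      | otherwise = case v of
          V a _ ->
            let inner j
                  | j <= 0    = []
                  | otherwise = a : inner (j - 1)
            in a : inner (k - 1)
```

The threshold exists because this duplicates the function body; raising it lets GHC unroll larger functions at the cost of code size.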

See my recent post on micro optimisations.

-- Don
Scott Dillard | 25 Jun 01:18

Re: Library-vs-local performance



On Tue, Jun 24, 2008 at 4:15 PM, Don Stewart <dons <at> galois.com> wrote:

> Nope, there's a lot of optimisations taking place on the core-to-core
> phase, to ensure the core that gets unfolded into your .hi files is as
> nice as possible. And then still there's things that actually stay as
> calls into your compiled library -- for those, you'll want direct jumps
> and so forth, which you get with -fvia-C -optc-O2 and above.
>
> See my recent post on micro optimisations.
>
> -- Don

Fair enough, but I don't think that's what's going on here specifically. I can't get ghc-options to effect any change, one way or the other. I guess it's a mystery for now.

Thanks for the replies.

Scott
