Justin Bailey | 21 Dec 00:04

Optimizing cellular automata & the beauty of unlifted types

I'm back with another version of my cellular automata simulator. Since the last iteration I discovered GHC's unlifted types and the primitive operations that go with them. Using these types, rather than Ints, sped my code up by 2x or more.

http://hpaste.org/4151#a2 -- first half of program
http://hpaste.org/4151#a3 -- remaining (note 3 lines or so from first half are repeated)

The key observation came from looking at the Core files, which showed a lot of int2word# and word2int# conversions going on. Figuring out how to remove those led me to the unlifted types. Coding with these types is not easy (for example, I couldn't see a way to write a Word# value directly - I had to write stuff like "int2Word# 1#"), but because I had an existing algorithm to guide me, combined with type checking, it was a fairly straightforward implementation.

At first I felt kind of bad about using these operations, but then I noticed they are used pretty heavily by GHC itself. If it's good enough for GHC, it's good enough for me. The 2x performance gain didn't hurt either. Finally, the safety that comes from using the ST monad is just awesome. I've got low-level bit munging combined with high-level abstraction where I need it. So cool!

I was disappointed to find early on that using higher-order functions in tight loops was a performance killer. It's unfortunate because I started with a very elegant, short implementation based on a simple Ring buffer and map. The current version is certainly harder to understand and has some weird limitations. However, having the simple implementation let me use quickcheck to compare their results on random rules and inputs, which gave me high confidence that my complex implemenation is correct.

One thing I absolutely love about this program is its memory performance. It manages to stay within 1 - 10 MB of memory, depending on how much output is produced. How cool is that?

On Dec 3, 2007 2:44 AM, Mirko Rahn <rahn <at> ira.uka.de > wrote:
It is interesting, that the naive implementation

...

is only 3 times slower than your quite complex, hard to follow and hard
to debug implementation.

Now the naive implementation is 100x slower, so I don't feel so bad about this comment any more.
 

As always, I prefer to write most code in Haskell, quick, easy, nice,
reasonable fast, ... If speed matters, I switch to some lower level
language, as you did staying inside Haskell.

I have to take exception here - I *want* to write my code in Haskell. If Haskell isn't fast enough to beat the C implementation, I'd rather find a way to make my Haskell program faster than switch to some other language.

Justin
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Sterling Clover | 21 Dec 04:42

Re: Optimizing cellular automata & the beauty of unlifted types

I'm curious how much of the unboxing helped performance and how much  
didn't. In my experience playing with this stuff, GHC's strictness  
analyzer has consistently been really excellent, given the right  
hints. Unboxed tuples are generally a win, but GHC was often smarter  
at unboxing ints than I was. Also, higher-order functions can be used  
fine, assuming they're not awful recursive, as long as you set GHC's  
unfolding threshold high enough. You can also manually force their  
inlining, to really get control. I found there tended to be a "sweet  
spot" for any given program, where much higher or lower would degrade  
performance. As far as removing the word2int# and soforth, you could  
probably just use words throughout?

Also, as we discovered the other week, you may want to be careful  
with bang patterns, as forcing strictness on already evaluated values  
takes some time. Although, SPJ did point out that this was  
significantly improved in the new GHC.

But yes, I found that going through and manually unboxing a working  
implementation with the help of type errors was actually a sort of  
relaxing and zenlike exercise.

--S

On Dec 20, 2007, at 6:07 PM, Justin Bailey wrote:

> I'm back with another version of my cellular automata simulator.  
> Since the last iteration I discovered GHC's unlifted types and the  
> primitive operations that go with them. Using these types, rather  
> than Ints, sped my code up by 2x or more.
>
> http://hpaste.org/4151#a2 -- first half of program
> http://hpaste.org/4151#a3 -- remaining (note 3 lines or so from  
> first half are repeated)
>
> The key observation came from looking at the Core files, which  
> showed a lot of int2word# and word2int# conversions going on.  
> Figuring out how to remove those led me to the unlifted types.  
> Coding with these types is not easy (for example, I couldn't see a  
> way to write a Word# value directly - I had to write stuff like  
> "int2Word# 1#"), but because I had an existing algorithm to guide  
> me, combined with type checking, it was a fairly straightforward  
> implementation.
>
> At first I felt kind of bad about using these operations, but then  
> I noticed they are used pretty heavily by GHC itself. If it's good  
> enough for GHC, it's good enough for me. The 2x performance gain  
> didn't hurt either. Finally, the safety that comes from using the  
> ST monad is just awesome. I've got low-level bit munging combined  
> with high-level abstraction where I need it. So cool!
>
> I was disappointed to find early on that using higher-order  
> functions in tight loops was a performance killer. It's unfortunate  
> because I started with a very elegant, short implementation based  
> on a simple Ring buffer and map. The current version is certainly  
> harder to understand and has some weird limitations. However,  
> having the simple implementation let me use quickcheck to compare  
> their results on random rules and inputs, which gave me high  
> confidence that my complex implemenation is correct.
>
> One thing I absolutely love about this program is its memory  
> performance. It manages to stay within 1 - 10 MB of memory,  
> depending on how much output is produced. How cool is that?
>
> On Dec 3, 2007 2:44 AM, Mirko Rahn <rahn <at> ira.uka.de > wrote:
> It is interesting, that the naive implementation
>
> ...
>
> is only 3 times slower than your quite complex, hard to follow and  
> hard
> to debug implementation.
>
> Now the naive implementation is 100x slower, so I don't feel so bad  
> about this comment any more.
>
>
> As always, I prefer to write most code in Haskell, quick, easy, nice,
> reasonable fast, ... If speed matters, I switch to some lower level
> language, as you did staying inside Haskell.
>
> I have to take exception here - I *want* to write my code in  
> Haskell. If Haskell isn't fast enough to beat the C implementation,  
> I'd rather find a way to make my Haskell program faster than switch  
> to some other language.
>
> Justin
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
Justin Bailey | 21 Dec 18:00

Re: Optimizing cellular automata & the beauty of unlifted types

On Dec 20, 2007 7:42 PM, Sterling Clover <s.clover <at> gmail.com> wrote:
> I'm curious how much of the unboxing helped performance and how much
> didn't. In my experience playing with this stuff, GHC's strictness
> analyzer has consistently been really excellent, given the right
> hints. Unboxed tuples are generally a win, but GHC was often smarter
> at unboxing ints than I was.

It really did help. I started with an implementation that used Ints,
and this sped the program up by at least 2x. I think that's because of
the bit-manipulation I'm doing. For example, Data.Bits defines the
bitwise and operation on Ints as:

  (I# x#) .&.   (I# y#)  = I# (word2Int# (int2Word# x# `and#` int2Word# y#))

Which you can see has 3 conversions in it. The I#
deconstruction/construction is also suspicious to me, but I don't know
if there are performance costs there or not. Regardless, using Word#
directly lets me write (assuming w1# and w2# are already words):

  w1# `and#` w2#

Maybe better rewrite rules in the Data.Bits library would eliminate
unnecessary conversions and this wouldn't be necessary/

> Also, higher-order functions can be used
> fine, assuming they're not awful recursive, as long as you set GHC's
> unfolding threshold high enough. You can also manually force their
> inlining, to really get control. I found there tended to be a "sweet

I didn't know about unfolding, and I'll have to try that sometime. I
found that inlining seemed to have negative affects as often as it did
positive. I spent most of my time so far coming up with different
algorithms - now that I've got a decent one, I'll have to play with
inlining.

> spot" for any given program, where much higher or lower would degrade
> performance. As far as removing the word2int# and soforth, you could
> probably just use words throughout?

I try to. I just couldn't figure out how to write a Word# value
literally. For example, eqWord# compares Word# values. But this gives
a type error:

  1# `eqWord#` 1#

I have to write

  int2word# 1# `eqWord#` int2word# 1#

> Also, as we discovered the other week, you may want to be careful
> with bang patterns, as forcing strictness on already evaluated values
> takes some time. Although, SPJ did point out that this was
> significantly improved in the new GHC.

Good point, thanks.

Justin
Felipe Lessa | 21 Dec 20:22

Re: Optimizing cellular automata & the beauty of unlifted types

On Dec 21, 2007 3:00 PM, Justin Bailey <jgbailey <at> gmail.com> wrote:
> It really did help. I started with an implementation that used Ints,
> and this sped the program up by at least 2x. I think that's because of
> the bit-manipulation I'm doing. For example, Data.Bits defines the
> bitwise and operation on Ints as:
>
>   (I# x#) .&.   (I# y#)  = I# (word2Int# (int2Word# x# `and#` int2Word# y#))
>
> Which you can see has 3 conversions in it. The I#
> deconstruction/construction is also suspicious to me, but I don't know
> if there are performance costs there or not. Regardless, using Word#
> directly lets me write (assuming w1# and w2# are already words):
>
>   w1# `and#` w2#
>
> Maybe better rewrite rules in the Data.Bits library would eliminate
> unnecessary conversions and this wouldn't be necessary

Wouldn't it be sufficient to use Data.Word and rely on GHC unboxing
capabilities? Compiling

import Data.Bits
import Data.Word

main = do
    a <- getLine
    b <- getLine
    print (read a .&. read b :: Word)

with GHC 6.6.1 gives me

	   (case GHC.Read.read @ GHC.Word.Word GHC.Word.$f40 a_afE
	    of wild1_a1tT { GHC.Word.W# x#_a1tV ->
	    case GHC.Read.read @ GHC.Word.Word GHC.Word.$f40 a87_a1ub
	    of wild11_a1tW { GHC.Word.W# y#_a1tY ->
	    GHC.Word.$w$dmshow1 (GHC.Prim.and# x#_a1tV y#_a1tY)
	    }
	    })

which of course does the right thing.

Cheers,

--

-- 
Felipe.
Bertram Felgenhauer | 21 Dec 23:55

Re: Optimizing cellular automata & the beauty of unlifted types

Justin Bailey wrote:
> On Dec 20, 2007 7:42 PM, Sterling Clover <s.clover <at> gmail.com> wrote:
> > I'm curious how much of the unboxing helped performance and how much
> > didn't. In my experience playing with this stuff, GHC's strictness
> > analyzer has consistently been really excellent, given the right
> > hints. Unboxed tuples are generally a win, but GHC was often smarter
> > at unboxing ints than I was.
> 
> It really did help. I started with an implementation that used Ints,
> and this sped the program up by at least 2x. I think that's because of
> the bit-manipulation I'm doing. For example, Data.Bits defines the
> bitwise and operation on Ints as:
> 
>   (I# x#) .&.   (I# y#)  = I# (word2Int# (int2Word# x# `and#` int2Word# y#))
> 
> Which you can see has 3 conversions in it.

I believe you're barking up the wrong tree here. Consider

| module Q (f, g) where
| 
| import GHC.Base
| 
| f :: Word# -> Word# -> Word#
| f a b = a `and#` b
| 
| g :: Int# -> Int# -> Int#
| g a b = word2Int# (int2Word# a `and#` int2Word# b)

If you look at the generated machine code, you'll find that f and g
are identical functions. The sole purpose of the int2Word# and
word2Int# operations is to satisfy the type checker. (This is
even true at the core level. The core language is typed, so you
still need to do the conversions there. No code is generated for
them though.)

> The I# deconstruction/construction is also suspicious to me, but I
> don't know if there are performance costs there or not. Regardless,
> using Word# directly lets me write (assuming w1# and w2# are already
> words):
> 
>   w1# `and#` w2#

The I# deconstruction has a cost, but with proper strictness annotations
ghc should optimize those away - check the intermediate Core to see
whether it actually does.

In fact most of the speedup you got seems to come from the use of
uncheckedShiftL# and uncheckedShiftRL# - just using

  shiftL, shiftR :: Int -> Int -> Int
  I# a `shiftR` I# b = I# (word2Int# (int2Word# a `uncheckedShiftRL#` b))
  I# a `shiftL` I# b = I# (word2Int# (int2Word# a `uncheckedShiftL#` b))

speeds up the stepWithUArray code by a factor of 2 here, starting with
the first program at http://hpaste.org/4151.

The normal shiftL and shiftR implementations include bounds checks to
get the results for shifts that are bigger than the word size correct.
(For example  1 `shiftL` 65 = 0, while  1 `uncheckedShiftL#` 65 = 2
(probably. it depends on your architecture)). Eliminating those checks
results in a major speedup.

Apparently those bounds checks aren't eliminated even if the shifting
width is constant. I wonder why, there's clearly some room for
optimization here.

(I used ghc 6.8.1 on an Athlon XP. I used ghc -O2, no fancy options.)

enjoy,

Bertram
Justin Bailey | 22 Dec 00:40

Re: Optimizing cellular automata & the beauty of unlifted types

On Dec 21, 2007 2:55 PM, Bertram Felgenhauer
<bertram.felgenhauer <at> googlemail.com> wrote:

>
> If you look at the generated machine code, you'll find that f and g
> are identical functions. The sole purpose of the int2Word# and
> word2Int# operations is to satisfy the type checker. (This is
> even true at the core level. The core language is typed, so you
> still need to do the conversions there. No code is generated for
> them though.)

Good to know. They are scary!

>
> The I# deconstruction has a cost, but with proper strictness annotations
> ghc should optimize those away - check the intermediate Core to see
> whether it actually does.

If I see things like GHC.Prim.Intzh, is that a clue its the "unlifted" type?

>
> In fact most of the speedup you got seems to come from the use of
> uncheckedShiftL# and uncheckedShiftRL# - just using
>
>   shiftL, shiftR :: Int -> Int -> Int
>   I# a `shiftR` I# b = I# (word2Int# (int2Word# a `uncheckedShiftRL#` b))
>   I# a `shiftL` I# b = I# (word2Int# (int2Word# a `uncheckedShiftL#` b))
>
> speeds up the stepWithUArray code by a factor of 2 here, starting with
> the first program at http://hpaste.org/4151.

That is great. I tried your speedup and you are right - just
redefining those makes the "lifted" version faster than the unlifted.
Too bad there isn't an "unsafe" version of the shifts available.

Justin

Gmane