14 Nov 2012 22:43

## I killed performance of my code with Eval and Strategies

```Dear Haskellers,

I am reading Simon Marlow's tutorial on parallelism and I have problems with correctly using Eval
monad and Strategies. I *thought* I understand them but after writing some code it turns out that
obviously I don't because parallelized code is about 20 times slower. Here's a short example
(code + criterion benchmarks):

{-# LANGUAGE BangPatterns #-}
module Main where

import Control.Parallel.Strategies
import Criterion.Main

main :: IO ()
main = defaultMain [
bench "Seq" \$ nf calculateSeq xs
, bench "Par" \$ nf calculatePar xs ]
where xs = [1..16384]

calculateSeq :: [Double] -> [Double]
calculateSeq [] = []
calculateSeq (x:xs) = (sin . sqrt \$ x) : xs

calculatePar :: [Double] -> [Double]
calculatePar xss = runEval \$ go xss
where
go :: Strategy [Double]
go [] = return []
go xs = do
lsh <- (rpar `dot` rdeepseq) \$ calculateSeq as
```

15 Nov 2012 02:07

### Re: I killed performance of my code with Eval and Strategies

On Nov 14, 2012 10:44 PM, "Janek S." <fremenzone <at> poczta.onet.pl> wrote:
> calculateSeq :: [Double] -> [Double]
> calculateSeq [] = []
> calculateSeq (x:xs) = (sin . sqrt \$ x) : xs

Do you really mean to calculate the 'sin . sqrt' of just the head of the list, or do you mean:

calculateSeq = map (sin . sqrt)   ?

Bas

```_______________________________________________
```
15 Nov 2012 13:01

### Re: I killed performance of my code with Eval and Strategies

```> Do you really mean to calculate the 'sin . sqrt' of just the head of the list, or do you mean:
> calculateSeq = map (sin . sqrt)   ?
Argh.. of course not! That's what you get when you code in the middle of a night. But in my code I
will not be able to use map because elements will be processed in pairs, so let's say that my
sequential function looks like this:

calculateSeq :: [Double] -> [Double]
calculateSeq [] = []
calculateSeq [x] = [sin . sqrt \$ x]
calculateSeq (x:y:xs) = (sin . sqrt \$ x) : (cos . sqrt \$ y) : calculateSeq xs

> I don't think there's a memory leak. It looks more like you're just
> allocating much more than is sane for such a simple function.
> On a recent processor, sin . sqrt is two instructions. Meanwhile, you have
> a list of (boxed?) integers being split up, then recombined. That's bound
> to hurt the GC.
I am not entirely convinced that my idea of using eval+strategies is bound to be slow, because
there are functions like parListChunk that do exactly this: split the list into chunks, process
them in parallel and then concatenate the result. Functions in Control.Parallel.Strategies were
designed to deal with list so I assume it is possible to process lists in parallel without GC
problems. However I do not see a way to apply these functions in my setting where elements of
lists are processed in pairs, not one at a time (parList and parMap will not do). Also, working
on a list of tuples will not do.

> Also, you might want to configure criterion to GC between
> runs. That might help.
The -g flag passed to criterion executable does that.

> What I'd suggest doing instead, is breaking the input into chucks of, say,
> 1024, and representing it with a [Vector]. Then, run your sin.sqrt's on
> each vector in parallel. Finally, use Data.Vector.concat to combine your
> result.
As stated in my post scriptum I am aware of that solution :) Here I'm trying to figure what am I
doing wrong with Eval.

Thanks!
Janek

>
> Hope that helps,
>   - Clark
>
> On Wed, Nov 14, 2012 at 4:43 PM, Janek S. <fremenzone <at> poczta.onet.pl> wrote:
> >
> > I am reading Simon Marlow's tutorial on parallelism and I have problems
> > with correctly using Eval
> > monad and Strategies. I *thought* I understand them but after writing
> > some code it turns out that
> > obviously I don't because parallelized code is about 20 times slower.
> > Here's a short example
> > (code + criterion benchmarks):
> >
> > {-# LANGUAGE BangPatterns #-}
> > module Main where
> >
> > import Control.Parallel.Strategies
> > import Criterion.Main
> >
> > main :: IO ()
> > main = defaultMain [
> >         bench "Seq" \$ nf calculateSeq xs
> >       , bench "Par" \$ nf calculatePar xs ]
> >     where xs = [1..16384]
> >
> > calculateSeq :: [Double] -> [Double]
> > calculateSeq [] = []
> > calculateSeq (x:xs) = (sin . sqrt \$ x) : xs
> >
> > calculatePar :: [Double] -> [Double]
> > calculatePar xss = runEval \$ go xss
> >     where
> >       go :: Strategy [Double]
> >       go [] = return []
> >       go xs = do
> >           lsh <- (rpar `dot` rdeepseq) \$ calculateSeq as
> >           lst <- go bs
> >           return (lsh ++ lst)
> >           where
> >             !(as, bs) = splitAt 8192 xs
> >
> > Compiling and running with:
> >
> > ghc -O2 -Wall -threaded -rtsopts -fforce-recomp -eventlog evalleak.hs
> > ./evalleak -oreport.html -g +RTS -N2 -ls -s
> >
> > I get:
> >
> > benchmarking Seq
> > mean: 100.5990 us, lb 100.1937 us, ub 101.1521 us, ci 0.950
> > std dev: 2.395003 us, lb 1.860923 us, ub 3.169562 us, ci 0.950
> >
> > benchmarking Par
> > mean: 2.233127 ms, lb 2.169669 ms, ub 2.296155 ms, ci 0.950
> > std dev: 323.5201 us, lb 310.2844 us, ub 344.8252 us, ci 0.950
> >
> > That's a hopeless result. Looking at the spark allocation everything
> > looks fine:
> >
> > SPARKS: 202 (202 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
> >
> > But analyzing eventlog with ThreadScope I see that parallel function
> > spends most of the time doing
> > garbage collection, which suggests that I have a memory leak somewhere. I
> > suspected that problem
> > might be caused by appending two lists together in the parallel
> > implementation, but replacing
> > this with difference lists doesn't help. Changing granularity (e.g.
> > splitAt 512) also brings no
> > improvement. Can anyone point me to what am I doing wrong?
> >
> > Janek
> >
> > PS. This is of course not a real world code - I know that I'd be better
> > of using unboxed data
> > structures for doing computations on Doubles.
> >
> > _______________________________________________

_______________________________________________
```
16 Nov 2012 11:05

### Re: I killed performance of my code with Eval and Strategies

Just another definition of calculateSeq:

calculateSeq = zipWith (\$) (cycle [sin,cos]) . map sqrt

2012/11/15 Janek S.
> Do you really mean to calculate the 'sin . sqrt' of just the head of the list, or do you mean:
> calculateSeq = map (sin . sqrt)   ?
Argh.. of course not! That's what you get when you code in the middle of a night. But in my code I
will not be able to use map because elements will be processed in pairs, so let's say that my
sequential function looks like this:

calculateSeq :: [Double] -> [Double]
calculateSeq [] = []
calculateSeq [x] = [sin . sqrt \$ x]
calculateSeq (x:y:xs) = (sin . sqrt \$ x) : (cos . sqrt \$ y) : calculateSeq xs

> I don't think there's a memory leak. It looks more like you're just
> allocating much more than is sane for such a simple function.
> On a recent processor, sin . sqrt is two instructions. Meanwhile, you have
> a list of (boxed?) integers being split up, then recombined. That's bound
> to hurt the GC.
I am not entirely convinced that my idea of using eval+strategies is bound to be slow, because
there are functions like parListChunk that do exactly this: split the list into chunks, process
them in parallel and then concatenate the result. Functions in Control.Parallel.Strategies were
designed to deal with list so I assume it is possible to process lists in parallel without GC
problems. However I do not see a way to apply these functions in my setting where elements of
lists are processed in pairs, not one at a time (parList and parMap will not do). Also, working
on a list of tuples will not do.

> Also, you might want to configure criterion to GC between
> runs. That might help.
The -g flag passed to criterion executable does that.

> What I'd suggest doing instead, is breaking the input into chucks of, say,
> 1024, and representing it with a [Vector]. Then, run your sin.sqrt's on
> each vector in parallel. Finally, use Data.Vector.concat to combine your
> result.
As stated in my post scriptum I am aware of that solution :) Here I'm trying to figure what am I
doing wrong with Eval.

Thanks!
Janek

>
> Hope that helps,
>   - Clark
>
> On Wed, Nov 14, 2012 at 4:43 PM, Janek S. <fremenzone <at> poczta.onet.pl> wrote:
> >
> > I am reading Simon Marlow's tutorial on parallelism and I have problems
> > with correctly using Eval
> > monad and Strategies. I *thought* I understand them but after writing
> > some code it turns out that
> > obviously I don't because parallelized code is about 20 times slower.
> > Here's a short example
> > (code + criterion benchmarks):
> >
> > {-# LANGUAGE BangPatterns #-}
> > module Main where
> >
> > import Control.Parallel.Strategies
> > import Criterion.Main
> >
> > main :: IO ()
> > main = defaultMain [
> >         bench "Seq" \$ nf calculateSeq xs
> >       , bench "Par" \$ nf calculatePar xs ]
> >     where xs = [1..16384]
> >
> > calculateSeq :: [Double] -> [Double]
> > calculateSeq [] = []
> > calculateSeq (x:xs) = (sin . sqrt \$ x) : xs
> >
> > calculatePar :: [Double] -> [Double]
> > calculatePar xss = runEval \$ go xss
> >     where
> >       go :: Strategy [Double]
> >       go [] = return []
> >       go xs = do
> >           lsh <- (rpar `dot` rdeepseq) \$ calculateSeq as
> >           lst <- go bs
> >           return (lsh ++ lst)
> >           where
> >             !(as, bs) = splitAt 8192 xs
> >
> > Compiling and running with:
> >
> > ghc -O2 -Wall -threaded -rtsopts -fforce-recomp -eventlog evalleak.hs
> > ./evalleak -oreport.html -g +RTS -N2 -ls -s
> >
> > I get:
> >
> > benchmarking Seq
> > mean: 100.5990 us, lb 100.1937 us, ub 101.1521 us, ci 0.950
> > std dev: 2.395003 us, lb 1.860923 us, ub 3.169562 us, ci 0.950
> >
> > benchmarking Par
> > mean: 2.233127 ms, lb 2.169669 ms, ub 2.296155 ms, ci 0.950
> > std dev: 323.5201 us, lb 310.2844 us, ub 344.8252 us, ci 0.950
> >
> > That's a hopeless result. Looking at the spark allocation everything
> > looks fine:
> >
> > SPARKS: 202 (202 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
> >
> > But analyzing eventlog with ThreadScope I see that parallel function
> > spends most of the time doing
> > garbage collection, which suggests that I have a memory leak somewhere. I
> > suspected that problem
> > might be caused by appending two lists together in the parallel
> > implementation, but replacing
> > this with difference lists doesn't help. Changing granularity (e.g.
> > splitAt 512) also brings no
> > improvement. Can anyone point me to what am I doing wrong?
> >
> > Janek
> >
> > PS. This is of course not a real world code - I know that I'd be better
> > of using unboxed data
> > structures for doing computations on Doubles.
> >
> > _______________________________________________

_______________________________________________

```_______________________________________________
```
17 Nov 2012 11:20

### Re: I killed performance of my code with Eval and Strategies

```> Just another definition of calculateSeq:
>
> calculateSeq = zipWith (\$) (cycle [sin,cos]) . map sqrt
This is just slightly slower than my implementation.

I came up with a better implementation of parallel function:

calculatePar2 :: [Double] -> [Double]
calculatePar2 xss = runEval \$
concat `fmap` parList (rdeepseq . calculateSeq) (chunk 2048 xss)

chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = as : chunk n bs where !(as, bs) = splitAt n xs

This mimics behaviour of parListChunk method. It is faster than my original calculatePar, but
still 3-4 times slower than sequential version.

Janek

>
>
> 2012/11/15 Janek S. <fremenzone <at> poczta.onet.pl>
>
> > > Do you really mean to calculate the 'sin . sqrt' of just the head of
> > > the
> >
> > list, or do you mean:
> > > calculateSeq = map (sin . sqrt)   ?
> >
> > Argh.. of course not! That's what you get when you code in the middle of
> > a night. But in my code I
> > will not be able to use map because elements will be processed in pairs,
> > so let's say that my
> > sequential function looks like this:
> >
> > calculateSeq :: [Double] -> [Double]
> > calculateSeq [] = []
> > calculateSeq [x] = [sin . sqrt \$ x]
> > calculateSeq (x:y:xs) = (sin . sqrt \$ x) : (cos . sqrt \$ y) :
> > calculateSeq xs
> >
> > > I don't think there's a memory leak. It looks more like you're just
> > > allocating much more than is sane for such a simple function.
> > > On a recent processor, sin . sqrt is two instructions. Meanwhile, you
> >
> > have
> >
> > > a list of (boxed?) integers being split up, then recombined. That's
> > > bound to hurt the GC.
> >
> > I am not entirely convinced that my idea of using eval+strategies is
> > bound to be slow, because
> > there are functions like parListChunk that do exactly this: split the
> > list into chunks, process
> > them in parallel and then concatenate the result. Functions in
> > Control.Parallel.Strategies were
> > designed to deal with list so I assume it is possible to process lists in
> > parallel without GC
> > problems. However I do not see a way to apply these functions in my
> > setting where elements of
> > lists are processed in pairs, not one at a time (parList and parMap will
> > not do). Also, working
> > on a list of tuples will not do.
> >
> > > Also, you might want to configure criterion to GC between
> > > runs. That might help.
> >
> > The -g flag passed to criterion executable does that.
> >
> > > What I'd suggest doing instead, is breaking the input into chucks of,
> >
> > say,
> >
> > > 1024, and representing it with a [Vector]. Then, run your sin.sqrt's on
> > > each vector in parallel. Finally, use Data.Vector.concat to combine
> >
> > As stated in my post scriptum I am aware of that solution :) Here I'm
> > trying to figure what am I
> > doing wrong with Eval.
> >
> > Thanks!
> > Janek
> >
> > > Hope that helps,
> > >   - Clark
> > >
> > > On Wed, Nov 14, 2012 at 4:43 PM, Janek S. <fremenzone <at> poczta.onet.pl>
> >
> > wrote:
> > > > Dear Haskellers,
> > > >
> > > > I am reading Simon Marlow's tutorial on parallelism and I have
> > > > problems with correctly using Eval
> > > > monad and Strategies. I *thought* I understand them but after writing
> > > > some code it turns out that
> > > > obviously I don't because parallelized code is about 20 times slower.
> > > > Here's a short example
> > > > (code + criterion benchmarks):
> > > >
> > > > {-# LANGUAGE BangPatterns #-}
> > > > module Main where
> > > >
> > > > import Control.Parallel.Strategies
> > > > import Criterion.Main
> > > >
> > > > main :: IO ()
> > > > main = defaultMain [
> > > >         bench "Seq" \$ nf calculateSeq xs
> > > >       , bench "Par" \$ nf calculatePar xs ]
> > > >     where xs = [1..16384]
> > > >
> > > > calculateSeq :: [Double] -> [Double]
> > > > calculateSeq [] = []
> > > > calculateSeq (x:xs) = (sin . sqrt \$ x) : xs
> > > >
> > > > calculatePar :: [Double] -> [Double]
> > > > calculatePar xss = runEval \$ go xss
> > > >     where
> > > >       go :: Strategy [Double]
> > > >       go [] = return []
> > > >       go xs = do
> > > >           lsh <- (rpar `dot` rdeepseq) \$ calculateSeq as
> > > >           lst <- go bs
> > > >           return (lsh ++ lst)
> > > >           where
> > > >             !(as, bs) = splitAt 8192 xs
> > > >
> > > > Compiling and running with:
> > > >
> > > > ghc -O2 -Wall -threaded -rtsopts -fforce-recomp -eventlog evalleak.hs
> > > > ./evalleak -oreport.html -g +RTS -N2 -ls -s
> > > >
> > > > I get:
> > > >
> > > > benchmarking Seq
> > > > mean: 100.5990 us, lb 100.1937 us, ub 101.1521 us, ci 0.950
> > > > std dev: 2.395003 us, lb 1.860923 us, ub 3.169562 us, ci 0.950
> > > >
> > > > benchmarking Par
> > > > mean: 2.233127 ms, lb 2.169669 ms, ub 2.296155 ms, ci 0.950
> > > > std dev: 323.5201 us, lb 310.2844 us, ub 344.8252 us, ci 0.950
> > > >
> > > > That's a hopeless result. Looking at the spark allocation everything
> > > > looks fine:
> > > >
> > > > SPARKS: 202 (202 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
> > > >
> > > > But analyzing eventlog with ThreadScope I see that parallel function
> > > > spends most of the time doing
> > > > garbage collection, which suggests that I have a memory leak
> >
> > somewhere. I
> >
> > > > suspected that problem
> > > > might be caused by appending two lists together in the parallel
> > > > implementation, but replacing
> > > > this with difference lists doesn't help. Changing granularity (e.g.
> > > > splitAt 512) also brings no
> > > > improvement. Can anyone point me to what am I doing wrong?
> > > >
> > > > Janek
> > > >
> > > > PS. This is of course not a real world code - I know that I'd be
> > > > better of using unboxed data
> > > > structures for doing computations on Doubles.
> > > >
> > > > _______________________________________________
> > > > Haskell-Cafe mailing list
> >
> > _______________________________________________
```
18 Nov 2012 02:41

### Re: I killed performance of my code with Eval and Strategies

```Dear Janek,

> I am reading Simon Marlow's tutorial on parallelism and I have problems
> with correctly using Eval monad and Strategies. I *thought* I understand
> them but after writing some code it turns out that  obviously I don't
> because parallelized code is about 20 times slower. Here's a short
> example  (code + criterion benchmarks):

Actually, (sin . sqrt) is simply too cheap. The overhead of constructing
chunks (which have to be constructed on the heap) and concatenating the
results far outweighs the cost of computing the list elements.

If, for example, you replace sin . sqrt by f defined by

f :: Double -> Double
f x | x < 10 = x*x
| otherwise = sin x * f (x-100)

the picture will change. The loss also becomes far less dramatic if
you construct the chunks outside of the benchmark:

main :: IO ()
main = defaultMain [
bench "Seq" \$ nf (map calculateSeq) xs
, bench "Par" \$ nf calculatePar xs ]
where xs = chunk 2048 [1..16384]

f, f' :: Double -> Double
f x = sqrt (sin x)
f' x | x < 10 = x*x
| otherwise = sin x * f' (x-100)

calculateSeq :: [Double] -> [Double]
calculateSeq [] = []
calculateSeq (x:xs) = f x : calculateSeq xs

calculatePar :: [[Double]] -> [[Double]]
calculatePar xss = runEval \$ parList (rdeepseq . calculateSeq) xss

chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = as : chunk n bs where !(as, bs) = splitAt n xs

The parallel version (with f = sqrt . sin) is still somewhat slower
than the sequential version with -N1 -- probably due to rdeepseq.

Best regards,

Bertram
```

Gmane