Branimir Maksimovic | 1 Dec 17:09 2012
Picon

Why is boxed mutable array so slow?

I have made benchmark test inspired by http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/

What surprised me is that unboxed array is much faster than boxed array.
Actually boxed array performance is on par with standard Haskell list
which is very slow.
All in all even unboxed array is about 10 times slower than Java version.
I don't understand why is even unboxed array so slow.
But! unboxed array consumes least amount of RAM.
(warning, program consumes more than 3gb of ram)

 bmaxa <at> maxa:~/examples$ time ./Cumul
boxed array
last 262486571 seconds 4.972
unboxed array
last 262486571 seconds 0.776
list
last 262486571 seconds 6.812

real    0m13.086s
user    0m11.996s
sys     0m1.080s

-------------------------------------------------------------------------
{-# LANGUAGE CPP, BangPatterns #-}
import System.CPUTime
import Text.Printf
import Data.Array.IO
import Data.Array.Base
import Data.Int
import Control.DeepSeq
import System.Mem

main :: IO()
main = do
(newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array" 
performGC
(newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array" 
performGC
begin <- getCPUTime
printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
end <- getCPUTime
let diff = (fromIntegral (end - begin)) / (10^12)
printf " seconds %.3f\n" (diff::Double)

test s a = do
putStrLn s
begin <- getCPUTime
init' a
partial_sum a
end <- getCPUTime
let diff = (fromIntegral (end - begin)) / (10^12)
last <- readArray a (n'-1)
printf "last %d seconds %.3f\n" last (diff::Double)

n' :: Int
n' = 50 * 1000 * 1000

type A = IOArray Int Int32
type B = IOUArray Int Int32

init' a = do
(_,n) <- getBounds a
init a 0 n
where 
init a k n 
| k > n = return ()
| otherwise = do
let  !x = fromIntegral $ k + k `div` 3
unsafeWrite a k x
init a (k+1) n

partial_sum a = do
(_,n) <- getBounds a
k <- unsafeRead a 0
ps a 1 n k
where 
ps a i n s
| i > n = return ()
| otherwise = do
k <- unsafeRead a i
let !l = fromIntegral $ s + k
unsafeWrite a i l
ps a (i+1) n l

data' :: [Int32]
data' = [k + k `div` 3 | k <- [0..] ]

sum' = scanl1 (+)

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
KC | 1 Dec 17:30 2012
Picon

Re: Why is boxed mutable array so slow?

Boxed arrays have a wrapper (extra layer of indirection) to allow for
a fully evaluated value, an unevaluated thunk, or the special value
bottom (a value that can contain bottom is referred to as lifted).

Unboxed arrays always have some value; that is, they cannot represent
a thunk nor bottom.

On Sat, Dec 1, 2012 at 8:09 AM, Branimir Maksimovic <bmaxa <at> hotmail.com> wrote:
> I have made benchmark test inspired by
> http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
>
> What surprised me is that unboxed array is much faster than boxed array.
> Actually boxed array performance is on par with standard Haskell list
> which is very slow.
> All in all even unboxed array is about 10 times slower than Java version.
> I don't understand why is even unboxed array so slow.
> But! unboxed array consumes least amount of RAM.
> (warning, program consumes more than 3gb of ram)
>
>  bmaxa <at> maxa:~/examples$ time ./Cumul
> boxed array
> last 262486571 seconds 4.972
> unboxed array
> last 262486571 seconds 0.776
> list
> last 262486571 seconds 6.812
>
> real    0m13.086s
> user    0m11.996s
> sys     0m1.080s
>
> -------------------------------------------------------------------------
> {-# LANGUAGE CPP, BangPatterns #-}
> import System.CPUTime
> import Text.Printf
> import Data.Array.IO
> import Data.Array.Base
> import Data.Int
> import Control.DeepSeq
> import System.Mem
>
> main :: IO()
> main = do
> (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array"
> performGC
> (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array"
> performGC
> begin <- getCPUTime
> printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
> end <- getCPUTime
> let diff = (fromIntegral (end - begin)) / (10^12)
> printf " seconds %.3f\n" (diff::Double)
>
> test s a = do
> putStrLn s
> begin <- getCPUTime
> init' a
> partial_sum a
> end <- getCPUTime
> let diff = (fromIntegral (end - begin)) / (10^12)
> last <- readArray a (n'-1)
> printf "last %d seconds %.3f\n" last (diff::Double)
>
> n' :: Int
> n' = 50 * 1000 * 1000
>
> type A = IOArray Int Int32
> type B = IOUArray Int Int32
>
> init' a = do
> (_,n) <- getBounds a
> init a 0 n
> where
> init a k n
> | k > n = return ()
> | otherwise = do
> let  !x = fromIntegral $ k + k `div` 3
> unsafeWrite a k x
> init a (k+1) n
>
> partial_sum a = do
> (_,n) <- getBounds a
> k <- unsafeRead a 0
> ps a 1 n k
> where
> ps a i n s
> | i > n = return ()
> | otherwise = do
> k <- unsafeRead a i
> let !l = fromIntegral $ s + k
> unsafeWrite a i l
> ps a (i+1) n l
>
> data' :: [Int32]
> data' = [k + k `div` 3 | k <- [0..] ]
>
> sum' = scanl1 (+)
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>

--

-- 
--
Regards,
KC
Don Stewart | 1 Dec 18:20 2012
Picon

Re: Why is boxed mutable array so slow?

The obvious difference between boxed and unboxed arrays is that the
boxed arrays are full of pointers to heap allocated objects. This
means you pay indirection to access the values, much more time in GC
spent chasing pointers (though card marking helps), and generally do
more allocation.

Compare the GC stats below, for

* Boxed vector: 88M bytes copied; 75% of time in GC, 0.472s
* Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s

So there's your main answer. The increased data density of unboxed
arrays also helps a too.

Now, you can help out  the GC signifcantly by hinting at how much
you're going to allocated in the youngest generation (see the
ghc-gc-tune app for a methodical approach to this, though it needs
updating to ghc 7 --
http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settings-for-fun-and-profit/
 and http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection
).

Use the +RTS -A flag to set an initial youngest generation heap size
to the size of your array, and watch the GC cost disappear. For our
boxed vector, we'd use +RTS -A50M, resulting in:

* Boxed vector: 8k copied, 1% of time in GC, 0.157s

So not bad. 3x speedup through a RTS flag. -A is very useful if you
are working with boxed, mutable arrays.

For reference, there's a generic version below that specializes based
on the vector type parameter.

---------------------------------

{-# LANGUAGE BangPatterns #-}

import System.CPUTime
import Text.Printf
import Data.Int
import Control.DeepSeq
import System.Mem

import qualified Data.Vector.Mutable as V
import qualified Data.Vector.Unboxed.Mutable as U
import qualified Data.Vector.Generic.Mutable as G

main :: IO()
main = do

--   (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector"
--   performGC
   (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector"
   performGC

test' s a = do
    putStrLn s
    begin <- getCPUTime
    init'' a
    partial_sum' a
    end <- getCPUTime
    let diff = (fromIntegral (end - begin)) / (10**12)
    last <- G.read a (n'-1)
    printf "last %d seconds %.3f\n" last (diff::Double)

n' :: Int
n' = 1000 * 1000

init'' !a = init 0 (n'-1)
  where
    init :: Int -> Int -> IO ()
    init !k !n
        | k > n     = return ()
        | otherwise = do
            let !x = fromIntegral $ k + k `div` 3
            G.write a k x
            init (k+1) n

partial_sum' !a = do
    k <- G.read a 0
    ps 1 (n'-1) k
  where
    ps :: Int -> Int -> Int32 -> IO ()
    ps i n s
        | i > n     = return ()
        | otherwise = do
            k <- G.read a i
            let !l = fromIntegral $ s + k
            G.write a i l
            ps (i+1) n l

---------------------------------

$ time ./A +RTS -s
boxed vector
last 945735787 seconds 0.420
      40,121,448 bytes allocated in the heap
      88,355,272 bytes copied during GC
      24,036,456 bytes maximum residency (6 sample(s))
         380,632 bytes maximum slop
              54 MB total memory in use (0 MB lost due to fragmentation)

  %GC     time      75.2%  (75.9% elapsed)

  Alloc rate    359,655,602 bytes per MUT second

./A +RTS -s  0.40s user 0.07s system 98% cpu 0.475 total

$ time ./A +RTS -s
unboxed vector
last 945735787 seconds 0.080
       4,113,568 bytes allocated in the heap
          11,288 bytes copied during GC
       4,003,256 bytes maximum residency (3 sample(s))
         182,856 bytes maximum slop
               5 MB total memory in use (0 MB lost due to fragmentation)

  %GC     time       1.3%  (1.3% elapsed)

  Alloc rate    51,416,660 bytes per MUT second

./A +RTS -s  0.08s user 0.01s system 98% cpu 0.088 total

$ time ./A +RTS -A50M -s
boxed vector
last 945735787 seconds 0.127
      40,121,504 bytes allocated in the heap
           8,032 bytes copied during GC
          44,704 bytes maximum residency (2 sample(s))
          20,832 bytes maximum slop
              59 MB total memory in use (0 MB lost due to fragmentation)

  %GC     time       1.0%  (1.0% elapsed)

  Productivity  97.4% of total user, 99.6% of total elapsed

./A +RTS -A50M -s  0.10s user 0.05s system 97% cpu 0.157 total

---------------------------------

On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic <bmaxa <at> hotmail.com> wrote:
> I have made benchmark test inspired by
> http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
>
> What surprised me is that unboxed array is much faster than boxed array.
> Actually boxed array performance is on par with standard Haskell list
> which is very slow.
> All in all even unboxed array is about 10 times slower than Java version.
> I don't understand why is even unboxed array so slow.
> But! unboxed array consumes least amount of RAM.
> (warning, program consumes more than 3gb of ram)
>
>  bmaxa <at> maxa:~/examples$ time ./Cumul
> boxed array
> last 262486571 seconds 4.972
> unboxed array
> last 262486571 seconds 0.776
> list
> last 262486571 seconds 6.812
>
> real    0m13.086s
> user    0m11.996s
> sys     0m1.080s
>
> -------------------------------------------------------------------------
> {-# LANGUAGE CPP, BangPatterns #-}
> import System.CPUTime
> import Text.Printf
> import Data.Array.IO
> import Data.Array.Base
> import Data.Int
> import Control.DeepSeq
> import System.Mem
>
> main :: IO()
> main = do
> (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array"
> performGC
> (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array"
> performGC
> begin <- getCPUTime
> printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
> end <- getCPUTime
> let diff = (fromIntegral (end - begin)) / (10^12)
> printf " seconds %.3f\n" (diff::Double)
>
> test s a = do
> putStrLn s
> begin <- getCPUTime
> init' a
> partial_sum a
> end <- getCPUTime
> let diff = (fromIntegral (end - begin)) / (10^12)
> last <- readArray a (n'-1)
> printf "last %d seconds %.3f\n" last (diff::Double)
>
> n' :: Int
> n' = 50 * 1000 * 1000
>
> type A = IOArray Int Int32
> type B = IOUArray Int Int32
>
> init' a = do
> (_,n) <- getBounds a
> init a 0 n
> where
> init a k n
> | k > n = return ()
> | otherwise = do
> let  !x = fromIntegral $ k + k `div` 3
> unsafeWrite a k x
> init a (k+1) n
>
> partial_sum a = do
> (_,n) <- getBounds a
> k <- unsafeRead a 0
> ps a 1 n k
> where
> ps a i n s
> | i > n = return ()
> | otherwise = do
> k <- unsafeRead a i
> let !l = fromIntegral $ s + k
> unsafeWrite a i l
> ps a (i+1) n l
>
> data' :: [Int32]
> data' = [k + k `div` 3 | k <- [0..] ]
>
> sum' = scanl1 (+)
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
Branimir Maksimovic | 1 Dec 19:04 2012
Picon

Re: Why is boxed mutable array so slow?

Wow that sped it up 5 times.
I see that boxed Vector is 25% faster than IOArray.
What is the difference and when to use Vector,
when IOArray?
Thanks!

bmaxa <at> maxa:~/examples$ time ./Cumul +RTS -A1600M
boxed array
last 262486571 seconds 1.196
unboxed array
last 262486571 seconds 0.748
boxed vector
last 262486571 seconds 0.908
unboxed vector
last 262486571 seconds 0.720

real    0m3.805s
user    0m3.428s
sys     0m0.372s


> Date: Sat, 1 Dec 2012 12:20:37 -0500
> Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow?
> From: dons00 <at> gmail.com
> To: bmaxa <at> hotmail.com
> CC: haskell-cafe <at> haskell.org
>
> The obvious difference between boxed and unboxed arrays is that the
> boxed arrays are full of pointers to heap allocated objects. This
> means you pay indirection to access the values, much more time in GC
> spent chasing pointers (though card marking helps), and generally do
> more allocation.
>
> Compare the GC stats below, for
>
> * Boxed vector: 88M bytes copied; 75% of time in GC, 0.472s
> * Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s
>
> So there's your main answer. The increased data density of unboxed
> arrays also helps a too.
>
> Now, you can help out the GC signifcantly by hinting at how much
> you're going to allocated in the youngest generation (see the
> ghc-gc-tune app for a methodical approach to this, though it needs
> updating to ghc 7 --
> http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settings-for-fun-and-profit/
> and http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection
> ).
>
> Use the +RTS -A flag to set an initial youngest generation heap size
> to the size of your array, and watch the GC cost disappear. For our
> boxed vector, we'd use +RTS -A50M, resulting in:
>
> * Boxed vector: 8k copied, 1% of time in GC, 0.157s
>
> So not bad. 3x speedup through a RTS flag. -A is very useful if you
> are working with boxed, mutable arrays.
>
> For reference, there's a generic version below that specializes based
> on the vector type parameter.
>
> ---------------------------------
>
> {-# LANGUAGE BangPatterns #-}
>
> import System.CPUTime
> import Text.Printf
> import Data.Int
> import Control.DeepSeq
> import System.Mem
>
> import qualified Data.Vector.Mutable as V
> import qualified Data.Vector.Unboxed.Mutable as U
> import qualified Data.Vector.Generic.Mutable as G
>
> main :: IO()
> main = do
>
> -- (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector"
> -- performGC
> (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector"
> performGC
>
> test' s a = do
> putStrLn s
> begin <- getCPUTime
> init'' a
> partial_sum' a
> end <- getCPUTime
> let diff = (fromIntegral (end - begin)) / (10**12)
> last <- G.read a (n'-1)
> printf "last %d seconds %.3f\n" last (diff::Double)
>
> n' :: Int
> n' = 1000 * 1000
>
> init'' !a = init 0 (n'-1)
> where
> init :: Int -> Int -> IO ()
> init !k !n
> | k > n = return ()
> | otherwise = do
> let !x = fromIntegral $ k + k `div` 3
> G.write a k x
> init (k+1) n
>
>
>
> partial_sum' !a = do
> k <- G.read a 0
> ps 1 (n'-1) k
> where
> ps :: Int -> Int -> Int32 -> IO ()
> ps i n s
> | i > n = return ()
> | otherwise = do
> k <- G.read a i
> let !l = fromIntegral $ s + k
> G.write a i l
> ps (i+1) n l
>
>
> ---------------------------------
>
> $ time ./A +RTS -s
> boxed vector
> last 945735787 seconds 0.420
> 40,121,448 bytes allocated in the heap
> 88,355,272 bytes copied during GC
> 24,036,456 bytes maximum residency (6 sample(s))
> 380,632 bytes maximum slop
> 54 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 75.2% (75.9% elapsed)
>
> Alloc rate 359,655,602 bytes per MUT second
>
> ./A +RTS -s 0.40s user 0.07s system 98% cpu 0.475 total
>
>
> $ time ./A +RTS -s
> unboxed vector
> last 945735787 seconds 0.080
> 4,113,568 bytes allocated in the heap
> 11,288 bytes copied during GC
> 4,003,256 bytes maximum residency (3 sample(s))
> 182,856 bytes maximum slop
> 5 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 1.3% (1.3% elapsed)
>
> Alloc rate 51,416,660 bytes per MUT second
>
> ./A +RTS -s 0.08s user 0.01s system 98% cpu 0.088 total
>
>
> $ time ./A +RTS -A50M -s
> boxed vector
> last 945735787 seconds 0.127
> 40,121,504 bytes allocated in the heap
> 8,032 bytes copied during GC
> 44,704 bytes maximum residency (2 sample(s))
> 20,832 bytes maximum slop
> 59 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 1.0% (1.0% elapsed)
>
> Productivity 97.4% of total user, 99.6% of total elapsed
>
> ./A +RTS -A50M -s 0.10s user 0.05s system 97% cpu 0.157 total
>
>
>
> ---------------------------------
>
>
> On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic <bmaxa <at> hotmail.com> wrote:
> > I have made benchmark test inspired by
> > http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
> >
> > What surprised me is that unboxed array is much faster than boxed array.
> > Actually boxed array performance is on par with standard Haskell list
> > which is very slow.
> > All in all even unboxed array is about 10 times slower than Java version.
> > I don't understand why is even unboxed array so slow.
> > But! unboxed array consumes least amount of RAM.
> > (warning, program consumes more than 3gb of ram)
> >
> > bmaxa <at> maxa:~/examples$ time ./Cumul
> > boxed array
> > last 262486571 seconds 4.972
> > unboxed array
> > last 262486571 seconds 0.776
> > list
> > last 262486571 seconds 6.812
> >
> > real 0m13.086s
> > user 0m11.996s
> > sys 0m1.080s
> >
> > -------------------------------------------------------------------------
> > {-# LANGUAGE CPP, BangPatterns #-}
> > import System.CPUTime
> > import Text.Printf
> > import Data.Array.IO
> > import Data.Array.Base
> > import Data.Int
> > import Control.DeepSeq
> > import System.Mem
> >
> > main :: IO()
> > main = do
> > (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array"
> > performGC
> > (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array"
> > performGC
> > begin <- getCPUTime
> > printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
> > end <- getCPUTime
> > let diff = (fromIntegral (end - begin)) / (10^12)
> > printf " seconds %.3f\n" (diff::Double)
> >
> > test s a = do
> > putStrLn s
> > begin <- getCPUTime
> > init' a
> > partial_sum a
> > end <- getCPUTime
> > let diff = (fromIntegral (end - begin)) / (10^12)
> > last <- readArray a (n'-1)
> > printf "last %d seconds %.3f\n" last (diff::Double)
> >
> > n' :: Int
> > n' = 50 * 1000 * 1000
> >
> > type A = IOArray Int Int32
> > type B = IOUArray Int Int32
> >
> > init' a = do
> > (_,n) <- getBounds a
> > init a 0 n
> > where
> > init a k n
> > | k > n = return ()
> > | otherwise = do
> > let !x = fromIntegral $ k + k `div` 3
> > unsafeWrite a k x
> > init a (k+1) n
> >
> > partial_sum a = do
> > (_,n) <- getBounds a
> > k <- unsafeRead a 0
> > ps a 1 n k
> > where
> > ps a i n s
> > | i > n = return ()
> > | otherwise = do
> > k <- unsafeRead a i
> > let !l = fromIntegral $ s + k
> > unsafeWrite a i l
> > ps a (i+1) n l
> >
> > data' :: [Int32]
> > data' = [k + k `div` 3 | k <- [0..] ]
> >
> > sum' = scanl1 (+)
> >
> >
> > _______________________________________________
> > Haskell-Cafe mailing list
> > Haskell-Cafe <at> haskell.org
> > http://www.haskell.org/mailman/listinfo/haskell-cafe
> >
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Don Stewart | 1 Dec 19:22 2012
Picon

Re: Why is boxed mutable array so slow?

Regarding when to use mutable arrays versus vectors, I would always
use vectors -- they optimize better, and have a better interface.

Also, I have updated and released a new version of the tool mentioned below.
You can get it on Hackage, updated to ghc 7 series.

    http://hackage.haskell.org/package/ghc-gc-tune-0.3

For your boxed vector program, we get results that show a clear
performance peak with a -A of around 64M, about  the size of the
allocated array ...

http://i.imgur.com/dZ2Eo.png

Best settings for Running time:
0.16s:  +RTS -A67108864 -H1048576
0.16s:  +RTS -A67108864 -H2097152
0.16s:  +RTS -A67108864 -H8388608

E.g.

$ time ./A +RTS -A67M -H1M
boxed vector
last 945735787 seconds 0.123

-- Don

On Sat, Dec 1, 2012 at 12:20 PM, Don Stewart <dons00 <at> gmail.com> wrote:
> The obvious difference between boxed and unboxed arrays is that the
> boxed arrays are full of pointers to heap allocated objects. This
> means you pay indirection to access the values, much more time in GC
> spent chasing pointers (though card marking helps), and generally do
> more allocation.
>
> Compare the GC stats below, for
>
> * Boxed vector: 88M bytes copied; 75% of time in GC, 0.472s
> * Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s
>
> So there's your main answer. The increased data density of unboxed
> arrays also helps a too.
>
> Now, you can help out  the GC signifcantly by hinting at how much
> you're going to allocated in the youngest generation (see the
> ghc-gc-tune app for a methodical approach to this, though it needs
> updating to ghc 7 --
> http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settings-for-fun-and-profit/
>  and http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection
> ).
>
> Use the +RTS -A flag to set an initial youngest generation heap size
> to the size of your array, and watch the GC cost disappear. For our
> boxed vector, we'd use +RTS -A50M, resulting in:
>
> * Boxed vector: 8k copied, 1% of time in GC, 0.157s
>
> So not bad. 3x speedup through a RTS flag. -A is very useful if you
> are working with boxed, mutable arrays.
>
> For reference, there's a generic version below that specializes based
> on the vector type parameter.
>
> ---------------------------------
>
> {-# LANGUAGE BangPatterns #-}
>
> import System.CPUTime
> import Text.Printf
> import Data.Int
> import Control.DeepSeq
> import System.Mem
>
> import qualified Data.Vector.Mutable as V
> import qualified Data.Vector.Unboxed.Mutable as U
> import qualified Data.Vector.Generic.Mutable as G
>
> main :: IO()
> main = do
>
> --   (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector"
> --   performGC
>    (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector"
>    performGC
>
> test' s a = do
>     putStrLn s
>     begin <- getCPUTime
>     init'' a
>     partial_sum' a
>     end <- getCPUTime
>     let diff = (fromIntegral (end - begin)) / (10**12)
>     last <- G.read a (n'-1)
>     printf "last %d seconds %.3f\n" last (diff::Double)
>
> n' :: Int
> n' = 1000 * 1000
>
> init'' !a = init 0 (n'-1)
>   where
>     init :: Int -> Int -> IO ()
>     init !k !n
>         | k > n     = return ()
>         | otherwise = do
>             let !x = fromIntegral $ k + k `div` 3
>             G.write a k x
>             init (k+1) n
>
>
>
> partial_sum' !a = do
>     k <- G.read a 0
>     ps 1 (n'-1) k
>   where
>     ps :: Int -> Int -> Int32 -> IO ()
>     ps i n s
>         | i > n     = return ()
>         | otherwise = do
>             k <- G.read a i
>             let !l = fromIntegral $ s + k
>             G.write a i l
>             ps (i+1) n l
>
>
> ---------------------------------
>
> $ time ./A +RTS -s
> boxed vector
> last 945735787 seconds 0.420
>       40,121,448 bytes allocated in the heap
>       88,355,272 bytes copied during GC
>       24,036,456 bytes maximum residency (6 sample(s))
>          380,632 bytes maximum slop
>               54 MB total memory in use (0 MB lost due to fragmentation)
>
>   %GC     time      75.2%  (75.9% elapsed)
>
>   Alloc rate    359,655,602 bytes per MUT second
>
> ./A +RTS -s  0.40s user 0.07s system 98% cpu 0.475 total
>
>
> $ time ./A +RTS -s
> unboxed vector
> last 945735787 seconds 0.080
>        4,113,568 bytes allocated in the heap
>           11,288 bytes copied during GC
>        4,003,256 bytes maximum residency (3 sample(s))
>          182,856 bytes maximum slop
>                5 MB total memory in use (0 MB lost due to fragmentation)
>
>   %GC     time       1.3%  (1.3% elapsed)
>
>   Alloc rate    51,416,660 bytes per MUT second
>
> ./A +RTS -s  0.08s user 0.01s system 98% cpu 0.088 total
>
>
> $ time ./A +RTS -A50M -s
> boxed vector
> last 945735787 seconds 0.127
>       40,121,504 bytes allocated in the heap
>            8,032 bytes copied during GC
>           44,704 bytes maximum residency (2 sample(s))
>           20,832 bytes maximum slop
>               59 MB total memory in use (0 MB lost due to fragmentation)
>
>   %GC     time       1.0%  (1.0% elapsed)
>
>   Productivity  97.4% of total user, 99.6% of total elapsed
>
> ./A +RTS -A50M -s  0.10s user 0.05s system 97% cpu 0.157 total
>
>
>
> ---------------------------------
>
>
> On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic <bmaxa <at> hotmail.com> wrote:
>> I have made benchmark test inspired by
>> http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
>>
>> What surprised me is that unboxed array is much faster than boxed array.
>> Actually boxed array performance is on par with standard Haskell list
>> which is very slow.
>> All in all even unboxed array is about 10 times slower than Java version.
>> I don't understand why is even unboxed array so slow.
>> But! unboxed array consumes least amount of RAM.
>> (warning, program consumes more than 3gb of ram)
>>
>>  bmaxa <at> maxa:~/examples$ time ./Cumul
>> boxed array
>> last 262486571 seconds 4.972
>> unboxed array
>> last 262486571 seconds 0.776
>> list
>> last 262486571 seconds 6.812
>>
>> real    0m13.086s
>> user    0m11.996s
>> sys     0m1.080s
>>
>> -------------------------------------------------------------------------
>> {-# LANGUAGE CPP, BangPatterns #-}
>> import System.CPUTime
>> import Text.Printf
>> import Data.Array.IO
>> import Data.Array.Base
>> import Data.Int
>> import Control.DeepSeq
>> import System.Mem
>>
>> main :: IO()
>> main = do
>> (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array"
>> performGC
>> (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array"
>> performGC
>> begin <- getCPUTime
>> printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
>> end <- getCPUTime
>> let diff = (fromIntegral (end - begin)) / (10^12)
>> printf " seconds %.3f\n" (diff::Double)
>>
>> test s a = do
>> putStrLn s
>> begin <- getCPUTime
>> init' a
>> partial_sum a
>> end <- getCPUTime
>> let diff = (fromIntegral (end - begin)) / (10^12)
>> last <- readArray a (n'-1)
>> printf "last %d seconds %.3f\n" last (diff::Double)
>>
>> n' :: Int
>> n' = 50 * 1000 * 1000
>>
>> type A = IOArray Int Int32
>> type B = IOUArray Int Int32
>>
>> init' a = do
>> (_,n) <- getBounds a
>> init a 0 n
>> where
>> init a k n
>> | k > n = return ()
>> | otherwise = do
>> let  !x = fromIntegral $ k + k `div` 3
>> unsafeWrite a k x
>> init a (k+1) n
>>
>> partial_sum a = do
>> (_,n) <- getBounds a
>> k <- unsafeRead a 0
>> ps a 1 n k
>> where
>> ps a i n s
>> | i > n = return ()
>> | otherwise = do
>> k <- unsafeRead a i
>> let !l = fromIntegral $ s + k
>> unsafeWrite a i l
>> ps a (i+1) n l
>>
>> data' :: [Int32]
>> data' = [k + k `div` 3 | k <- [0..] ]
>>
>> sum' = scanl1 (+)
>>
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> Haskell-Cafe <at> haskell.org
>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>>
Branimir Maksimovic | 1 Dec 19:58 2012
Picon

Re: Why is boxed mutable array so slow?

Thanks! I have downloaded tool and playing with it.
I will use boxed vectors in the future ;)



> Date: Sat, 1 Dec 2012 13:22:47 -0500
> Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow?
> From: dons00 <at> gmail.com
> To: bmaxa <at> hotmail.com
> CC: haskell-cafe <at> haskell.org
>
> Regarding when to use mutable arrays versus vectors, I would always
> use vectors -- they optimize better, and have a better interface.
>
> Also, I have updated and released a new version of the tool mentioned below.
> You can get it on Hackage, updated to ghc 7 series.
>
> http://hackage.haskell.org/package/ghc-gc-tune-0.3
>
> For your boxed vector program, we get results that show a clear
> performance peak with a -A of around 64M, about the size of the
> allocated array ...
>
> http://i.imgur.com/dZ2Eo.png
>
> Best settings for Running time:
> 0.16s: +RTS -A67108864 -H1048576
> 0.16s: +RTS -A67108864 -H2097152
> 0.16s: +RTS -A67108864 -H8388608
>
> E.g.
>
> $ time ./A +RTS -A67M -H1M
> boxed vector
> last 945735787 seconds 0.123
>
> -- Don
>
> On Sat, Dec 1, 2012 at 12:20 PM, Don Stewart <dons00 <at> gmail.com> wrote:
> > The obvious difference between boxed and unboxed arrays is that the
> > boxed arrays are full of pointers to heap allocated objects. This
> > means you pay indirection to access the values, much more time in GC
> > spent chasing pointers (though card marking helps), and generally do
> > more allocation.
> >
> > Compare the GC stats below, for
> >
> > * Boxed vector: 88M bytes copied; 75% of time in GC, 0.472s
> > * Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s
> >
> > So there's your main answer. The increased data density of unboxed
> > arrays also helps a too.
> >
> > Now, you can help out the GC signifcantly by hinting at how much
> > you're going to allocated in the youngest generation (see the
> > ghc-gc-tune app for a methodical approach to this, though it needs
> > updating to ghc 7 --
> > http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settings-for-fun-and-profit/
> > and http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection
> > ).
> >
> > Use the +RTS -A flag to set an initial youngest generation heap size
> > to the size of your array, and watch the GC cost disappear. For our
> > boxed vector, we'd use +RTS -A50M, resulting in:
> >
> > * Boxed vector: 8k copied, 1% of time in GC, 0.157s
> >
> > So not bad. 3x speedup through a RTS flag. -A is very useful if you
> > are working with boxed, mutable arrays.
> >
> > For reference, there's a generic version below that specializes based
> > on the vector type parameter.
> >
> > ---------------------------------
> >
> > {-# LANGUAGE BangPatterns #-}
> >
> > import System.CPUTime
> > import Text.Printf
> > import Data.Int
> > import Control.DeepSeq
> > import System.Mem
> >
> > import qualified Data.Vector.Mutable as V
> > import qualified Data.Vector.Unboxed.Mutable as U
> > import qualified Data.Vector.Generic.Mutable as G
> >
> > main :: IO()
> > main = do
> >
> > -- (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector"
> > -- performGC
> > (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector"
> > performGC
> >
> > test' s a = do
> > putStrLn s
> > begin <- getCPUTime
> > init'' a
> > partial_sum' a
> > end <- getCPUTime
> > let diff = (fromIntegral (end - begin)) / (10**12)
> > last <- G.read a (n'-1)
> > printf "last %d seconds %.3f\n" last (diff::Double)
> >
> > n' :: Int
> > n' = 1000 * 1000
> >
> > init'' !a = init 0 (n'-1)
> > where
> > init :: Int -> Int -> IO ()
> > init !k !n
> > | k > n = return ()
> > | otherwise = do
> > let !x = fromIntegral $ k + k `div` 3
> > G.write a k x
> > init (k+1) n
> >
> >
> >
> > partial_sum' !a = do
> > k <- G.read a 0
> > ps 1 (n'-1) k
> > where
> > ps :: Int -> Int -> Int32 -> IO ()
> > ps i n s
> > | i > n = return ()
> > | otherwise = do
> > k <- G.read a i
> > let !l = fromIntegral $ s + k
> > G.write a i l
> > ps (i+1) n l
> >
> >
> > ---------------------------------
> >
> > $ time ./A +RTS -s
> > boxed vector
> > last 945735787 seconds 0.420
> > 40,121,448 bytes allocated in the heap
> > 88,355,272 bytes copied during GC
> > 24,036,456 bytes maximum residency (6 sample(s))
> > 380,632 bytes maximum slop
> > 54 MB total memory in use (0 MB lost due to fragmentation)
> >
> > %GC time 75.2% (75.9% elapsed)
> >
> > Alloc rate 359,655,602 bytes per MUT second
> >
> > ./A +RTS -s 0.40s user 0.07s system 98% cpu 0.475 total
> >
> >
> > $ time ./A +RTS -s
> > unboxed vector
> > last 945735787 seconds 0.080
> > 4,113,568 bytes allocated in the heap
> > 11,288 bytes copied during GC
> > 4,003,256 bytes maximum residency (3 sample(s))
> > 182,856 bytes maximum slop
> > 5 MB total memory in use (0 MB lost due to fragmentation)
> >
> > %GC time 1.3% (1.3% elapsed)
> >
> > Alloc rate 51,416,660 bytes per MUT second
> >
> > ./A +RTS -s 0.08s user 0.01s system 98% cpu 0.088 total
> >
> >
> > $ time ./A +RTS -A50M -s
> > boxed vector
> > last 945735787 seconds 0.127
> > 40,121,504 bytes allocated in the heap
> > 8,032 bytes copied during GC
> > 44,704 bytes maximum residency (2 sample(s))
> > 20,832 bytes maximum slop
> > 59 MB total memory in use (0 MB lost due to fragmentation)
> >
> > %GC time 1.0% (1.0% elapsed)
> >
> > Productivity 97.4% of total user, 99.6% of total elapsed
> >
> > ./A +RTS -A50M -s 0.10s user 0.05s system 97% cpu 0.157 total
> >
> >
> >
> > ---------------------------------
> >
> >
> > On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic <bmaxa <at> hotmail.com> wrote:
> >> I have made benchmark test inspired by
> >> http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
> >>
> >> What surprised me is that unboxed array is much faster than boxed array.
> >> Actually boxed array performance is on par with standard Haskell list
> >> which is very slow.
> >> All in all even unboxed array is about 10 times slower than Java version.
> >> I don't understand why is even unboxed array so slow.
> >> But! unboxed array consumes least amount of RAM.
> >> (warning, program consumes more than 3gb of ram)
> >>
> >> bmaxa <at> maxa:~/examples$ time ./Cumul
> >> boxed array
> >> last 262486571 seconds 4.972
> >> unboxed array
> >> last 262486571 seconds 0.776
> >> list
> >> last 262486571 seconds 6.812
> >>
> >> real 0m13.086s
> >> user 0m11.996s
> >> sys 0m1.080s
> >>
> >> -------------------------------------------------------------------------
> >> {-# LANGUAGE CPP, BangPatterns #-}
> >> import System.CPUTime
> >> import Text.Printf
> >> import Data.Array.IO
> >> import Data.Array.Base
> >> import Data.Int
> >> import Control.DeepSeq
> >> import System.Mem
> >>
> >> main :: IO()
> >> main = do
> >> (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array"
> >> performGC
> >> (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array"
> >> performGC
> >> begin <- getCPUTime
> >> printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
> >> end <- getCPUTime
> >> let diff = (fromIntegral (end - begin)) / (10^12)
> >> printf " seconds %.3f\n" (diff::Double)
> >>
> >> test s a = do
> >> putStrLn s
> >> begin <- getCPUTime
> >> init' a
> >> partial_sum a
> >> end <- getCPUTime
> >> let diff = (fromIntegral (end - begin)) / (10^12)
> >> last <- readArray a (n'-1)
> >> printf "last %d seconds %.3f\n" last (diff::Double)
> >>
> >> n' :: Int
> >> n' = 50 * 1000 * 1000
> >>
> >> type A = IOArray Int Int32
> >> type B = IOUArray Int Int32
> >>
> >> init' a = do
> >> (_,n) <- getBounds a
> >> init a 0 n
> >> where
> >> init a k n
> >> | k > n = return ()
> >> | otherwise = do
> >> let !x = fromIntegral $ k + k `div` 3
> >> unsafeWrite a k x
> >> init a (k+1) n
> >>
> >> partial_sum a = do
> >> (_,n) <- getBounds a
> >> k <- unsafeRead a 0
> >> ps a 1 n k
> >> where
> >> ps a i n s
> >> | i > n = return ()
> >> | otherwise = do
> >> k <- unsafeRead a i
> >> let !l = fromIntegral $ s + k
> >> unsafeWrite a i l
> >> ps a (i+1) n l
> >>
> >> data' :: [Int32]
> >> data' = [k + k `div` 3 | k <- [0..] ]
> >>
> >> sum' = scanl1 (+)
> >>
> >>
> >> _______________________________________________
> >> Haskell-Cafe mailing list
> >> Haskell-Cafe <at> haskell.org
> >> http://www.haskell.org/mailman/listinfo/haskell-cafe
> >>
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Daniel Fischer | 1 Dec 21:12 2012

Re: Why is boxed mutable array so slow?

On Samstag, 1. Dezember 2012, 16:09:05, Branimir Maksimovic wrote:
> All in all even unboxed array is about 10 times slower than Java version.
> I don't understand why is even unboxed array so slow.

It's not the unboxed arrays that are slow.

Your code has a couple of weak spots, and GHC's native code generator has a 
weakness that bites here.

On my box, I don't quite have a 10× difference to my translation to Java, it's 
a bit less than 7× (0.82s vs 0.12s - I don't want to bring my box to its knees 
by running something that takes 3GB+ of RAM, so I run the unboxed array part 
only) with the LLVM backend and 8× (0.93s) with the native code generator. 
That's in the same ballpark, though.

So what's the deal?

Main.main_$s$wa1 [Occ=LoopBreaker]
  :: GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> GHC.Types.Int
     -> GHC.Types.Int
     -> GHC.Types.Int
     -> ...

Your loops carry boxed Ints around, that's always a bad sign. In this case it 
doesn't hurt too much, however, since these values are neither read nor 
substituted during the loop (they're first and last index of the array and 
number of elements). Additionally, they carry an IOUArray constructor around. 
That is unnecessary. Eliminating a couple of dead parameters

init' a = do
    (_,n) <- getBounds a
    let init k
          | k > n     = return ()
          | otherwise = do
              let x = fromIntegral $ k + k `div` 3
              unsafeWrite a k x
              init (k+1)
    init 0

partial_sum a = do
    (_,n) <- getBounds a
    let ps i s
          | i > n     = return ()
          | otherwise = do
              k <- unsafeRead a i
              let l = s + k
              unsafeWrite a i l
              ps (i+1) l
    k <- unsafeRead a 0
    ps 1 k

brings the time for the native code generator down to 0.82s, and for the LLVM 
backend the time remains the same.

Next problem, you're using `div` for the division.

`div` does some checking and potentially fixup (not here, since everything is 
non-negative) after the machine division because `div` is specified to satisfy

a = (a `div` b) * b + (a `mod` b)

with 0 <= a `mod` b < abs b.

That is in itself slower than the pure machine division you get with quot.

So let's see what we get with `quot`.

0.65s with the native code generator, and 0.13 with the LLVM backend.

Whoops, what's that?

The problem is, as can be seen by manually replacing k `quot` 3 with

(k *2863311531) `shiftR` 33

(requires 64-bit Ints; equivalent in Java: k*28..1L >> 33), when the native 
backend, the LLVM backend and Java (as well as C) all take more or less the 
same time [well, the NCG is a bit slower than the other two, 0.11s, 0.11s, 
0.14s], that division is a **very** slow operation.

Java and LLVM know how to replace the division by the constant 3 with a 
mulitplication, a couple of shifts and an addition (since we never have 
negative numbers here, just one multiplication and shift suffice, but neither 
Java nor LLVM can do that on their own because it's not guaranteed by the 
type). The native code generator doesn't - not yet.

So the programme spends the majority of the time dividing. The array reads and 
writes are on par with Java's (and, for that matter, C's).

If you make the divisor a parameter instead of a compile time constant, the 
NCG is not affected at all, the LLVM backend gives you equal performance (it 
can't optimise a division by a divisor it doesn't know). Java is at an 
advantage there, after a while the JIT sees that it might be a good idea to 
optimise the division and so its time only trebles.

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Branimir Maksimovic | 2 Dec 06:57 2012
Picon

Re: Why is boxed mutable array so slow?

Wow, now performance is on par with Java ;)
So slow division was main problem, that and GC .

Thanks!


> From: daniel.is.fischer <at> googlemail.com
> To: haskell-cafe <at> haskell.org
> CC: bmaxa <at> hotmail.com
> Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow?
> Date: Sat, 1 Dec 2012 21:12:29 +0100
>
> On Samstag, 1. Dezember 2012, 16:09:05, Branimir Maksimovic wrote:
> > All in all even unboxed array is about 10 times slower than Java version.
> > I don't understand why is even unboxed array so slow.
>
> It's not the unboxed arrays that are slow.
>
> Your code has a couple of weak spots, and GHC's native code generator has a
> weakness that bites here.
>
> On my box, I don't quite have a 10× difference to my translation to Java, it's
> a bit less than 7× (0.82s vs 0.12s - I don't want to bring my box to its knees
> by running something that takes 3GB+ of RAM, so I run the unboxed array part
> only) with the LLVM backend and 8× (0.93s) with the native code generator.
> That's in the same ballpark, though.
>
> So what's the deal?
>
> Main.main_$s$wa1 [Occ=LoopBreaker]
> :: GHC.Prim.Int#
> -> GHC.Prim.Int#
> -> GHC.Prim.State# GHC.Prim.RealWorld
> -> GHC.Types.Int
> -> GHC.Types.Int
> -> GHC.Types.Int
> -> ...
>
> Your loops carry boxed Ints around, that's always a bad sign. In this case it
> doesn't hurt too much, however, since these values are neither read nor
> substituted during the loop (they're first and last index of the array and
> number of elements). Additionally, they carry an IOUArray constructor around.
> That is unnecessary. Eliminating a couple of dead parameters
>
>
> init' a = do
> (_,n) <- getBounds a
> let init k
> | k > n = return ()
> | otherwise = do
> let x = fromIntegral $ k + k `div` 3
> unsafeWrite a k x
> init (k+1)
> init 0
>
> partial_sum a = do
> (_,n) <- getBounds a
> let ps i s
> | i > n = return ()
> | otherwise = do
> k <- unsafeRead a i
> let l = s + k
> unsafeWrite a i l
> ps (i+1) l
> k <- unsafeRead a 0
> ps 1 k
>
> brings the time for the native code generator down to 0.82s, and for the LLVM
> backend the time remains the same.
>
> Next problem, you're using `div` for the division.
>
> `div` does some checking and potentially fixup (not here, since everything is
> non-negative) after the machine division because `div` is specified to satisfy
>
> a = (a `div` b) * b + (a `mod` b)
>
> with 0 <= a `mod` b < abs b.
>
> That is in itself slower than the pure machine division you get with quot.
>
> So let's see what we get with `quot`.
>
> 0.65s with the native code generator, and 0.13 with the LLVM backend.
>
> Whoops, what's that?
>
> The problem is, as can be seen by manually replacing k `quot` 3 with
>
> (k *2863311531) `shiftR` 33
>
> (requires 64-bit Ints; equivalent in Java: k*28..1L >> 33), when the native
> backend, the LLVM backend and Java (as well as C) all take more or less the
> same time [well, the NCG is a bit slower than the other two, 0.11s, 0.11s,
> 0.14s], that division is a **very** slow operation.
>
> Java and LLVM know how to replace the division by the constant 3 with a
> mulitplication, a couple of shifts and an addition (since we never have
> negative numbers here, just one multiplication and shift suffice, but neither
> Java nor LLVM can do that on their own because it's not guaranteed by the
> type). The native code generator doesn't - not yet.
>
> So the programme spends the majority of the time dividing. The array reads and
> writes are on par with Java's (and, for that matter, C's).
>
> If you make the divisor a parameter instead of a compile time constant, the
> NCG is not affected at all, the LLVM backend gives you equal performance (it
> can't optimise a division by a divisor it doesn't know). Java is at an
> advantage there, after a while the JIT sees that it might be a good idea to
> optimise the division and so its time only trebles.
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Gmane