David Powell | 29 Jun 13:30 2012
Picon

Slowdown with GHC when using multiple CPUs

I recently ran into a problem writing a program that I hoped I could speed up by running over multiple CPUs.  I want non-haskell users to be able to run the tool and take advantage of multiple CPUs.  *But* there is a serious slowdown when the RTS is run with -N and some of the CPUs are already busy.

This is already mentioned in the GHC docs, but the problem I experienced was more serious than I expected causing a slowdown of around 2x compared to running with a single CPU. 

I reproduced the problem with the following code from the haskell wiki:

{-# LANGUAGE BangPatterns #-}
    import Data.Digest.Pure.MD5
    import qualified Data.ByteString.Lazy as L
    import System.Environment
    import Control.Concurrent
    import Control.Monad (replicateM_)
 
    main = do
        files <- getArgs
        str <- newEmptyMVar
        mapM_ (forkIO . hashAndPrint str) files
        printNrResults (length files) str
 
    printNrResults i var = replicateM_ i (takeMVar var >>= putStrLn)
 
    hashAndPrint str f = do
        bs <- L.readFile f
        let !h = show $ md5 bs
        putMVar str (f ++ ": " ++ h)


When run on 4 idle CPU cores, I get the following wall clock times:
  ./run +RTS -N1   : 20.4 sec
  ./run +RTS -N2   : 11.0 sec
  ./run +RTS -N4   : 6.7 sec

When run on the same 4 core machine, but with 2 cores already busy:
  ./run +RTS -N1   : 23.5 sec
  ./run +RTS -N2   : 14.1 sec
  ./run +RTS -N4   : 57.8 sec   <---- Blowout...

This is quite a problem in practice when running on a shared server.  Is there anything that can be done to address this?


Thanks,

--
David Powell
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Alexander Morozov | 29 Jun 23:00 2012
Picon

Re: Slowdown with GHC when using multiple CPUs

Garbage collector have to stop all threads while it's running. Then
each thread collects garbage in its private zone. If some thread is
paused by OS before it's done, all other threads have to do nothing
and wait while it gets resheduled by OS some timeslices later and
completes its work.

You can try to use single-threaded garbage collector (-gq), it should
work better as only one garbage-collecting thread is running and it's
unlikely to be sheduled out by os since there are still an other free
core.

On my 4-core machine with 2 cores busy by other processes:
Threaded collector:
-N1 : 12.0
-N2 : 6.4
-N3 : 8.3
-N4 : 14.6

Single-threaded (-qg):
-N1 : 11.8
-N2 : 7.1
-N3 : 7.1
-N4 : 7.0

On Fri, Jun 29, 2012 at 3:30 PM, David Powell <david <at> drp.id.au> wrote:
> I recently ran into a problem writing a program that I hoped I could speed
> up by running over multiple CPUs.  I want non-haskell users to be able to
> run the tool and take advantage of multiple CPUs.  *But* there is a serious
> slowdown when the RTS is run with -N and some of the CPUs are already busy.
>
> This is already mentioned in the GHC docs, but the problem I experienced was
> more serious than I expected causing a slowdown of around 2x compared to
> running with a single CPU.
>
> I reproduced the problem with the following code from the haskell wiki:
>
> {-# LANGUAGE BangPatterns #-}
>     import Data.Digest.Pure.MD5
>     import qualified Data.ByteString.Lazy as L
>     import System.Environment
>     import Control.Concurrent
>     import Control.Monad (replicateM_)
>
>     main = do
>         files <- getArgs
>         str <- newEmptyMVar
>         mapM_ (forkIO . hashAndPrint str) files
>         printNrResults (length files) str
>
>     printNrResults i var = replicateM_ i (takeMVar var >>= putStrLn)
>
>     hashAndPrint str f = do
>         bs <- L.readFile f
>         let !h = show $ md5 bs
>         putMVar str (f ++ ": " ++ h)
>
>
> When run on 4 idle CPU cores, I get the following wall clock times:
>   ./run +RTS -N1   : 20.4 sec
>   ./run +RTS -N2   : 11.0 sec
>   ./run +RTS -N4   : 6.7 sec
>
> When run on the same 4 core machine, but with 2 cores already busy:
>   ./run +RTS -N1   : 23.5 sec
>   ./run +RTS -N2   : 14.1 sec
>   ./run +RTS -N4   : 57.8 sec   <---- Blowout...
>
> This is quite a problem in practice when running on a shared server.  Is
> there anything that can be done to address this?
>
> (I wrote up a few more details
> here: http://thunking.drp.id.au/2012/06/slowdown-with-ghc-when-using-multiple.html)
>
> Thanks,
>
> --
> David Powell
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>

Gmane