Branimir Maksimovic | 7 Jan 19:55 2014
Picon

Slow mvar when compiled with threaded

I have test network client, something like apache bench tool.
It uses mvars to synchronize and everything is ok when
compiled without -threaded.
real    0m2.995s
user    0m0.601s
sys    0m2.391s

With -threaded compile option result is following:
real    0m18.196s
user    0m2.054s
sys    0m3.313s

Seems that program is sleeping most of the time for some
reason. I can't explain behavior as it seems that
program is ok. It starts `concurrency` threads which
wait on mvar to process next task.

Program follows:

{-# Language OverloadedStrings #-}
import System.CPUTime
import System.IO
--import System.IO.Error
import Network.Socket hiding(recv)
import Network.Socket.ByteString
import System.Environment
import Control.Concurrent
import Control.Exception

main = do
(Continue reading)

Bob Ippolito | 7 Jan 21:39 2014

Re: Slow mvar when compiled with threaded

Here's a much simpler implementation for that sort of pattern, using channels to fan out work to threads. I added a dependency on Criterion because getCPUTime is basically useless for this kind of measurement on Mac OS X since it doesn't include the time that the process spent waiting on IO:

{-# Language OverloadedStrings #-}
import System.CPUTime
import Network.Socket hiding(recv)
import Network.Socket.ByteString
import Control.Exception (handle, IOException)
import System.Environment
import Control.Concurrent
import Control.Monad
import Data.Either (partitionEithers)
import Data.List (intercalate)

import Criterion.Measurement (time, time_, secs)

main :: IO ()
main = do
  (host, port, conc, reqs) <- fmap parse getArgs
  putStrLn $ "Connecting to " ++ host ++ " " ++ port
  (servAddr:_) <- getAddrInfo Nothing (Just host) (Just port)
  (diff, results) <- time $ process servAddr conc reqs
  let (errs, succs) = partitionEithers results
      numSuccs = length succs
      numErrs = length errs
      succTime = sum succs
      succAvg = succTime / fromIntegral numSuccs
  putStrLn $ unwords
    [show numSuccs, "successes,", show numErrs, "errors in", secs diff]
  when (numSuccs > 0) $ do
    putStrLn $ "min/max/avg request time: " ++
      intercalate " / " (map secs [minimum succs, maximum succs, succAvg])
    putStrLn $ show (round (fromIntegral reqs / diff) :: Int) ++ " r/s"

parse :: [String] -> (String, String, Int, Int)
parse [h,p,conc,reqs] = (h, p, read conc, read reqs)
parse _ = error "usage client host port concurrency requests"

process :: AddrInfo -> Int -> Int -> IO [Either IOException Double]
process servAddr conc reqs = do
  reqChan <- newChan
  ackChan <- newChan
  let processThread = forever $ do
        _ <- readChan reqChan
        handle (return . Left) (fmap Right socketAction) >>= writeChan ackChan
      socketAction = time_ $ do
        sock <- socket (addrFamily servAddr) Stream defaultProtocol
        connect sock (addrAddress servAddr)
        sendAll sock "GET /\r\n\r\n"
        void $ recv sock 1024
        close sock
  replicateM_ reqs (writeChan reqChan ())
  replicateM_ (min conc reqs) (forkIO processThread)
  replicateM reqs (readChan ackChan)



On Tue, Jan 7, 2014 at 10:55 AM, Branimir Maksimovic <branimir.maksimovic <at> gmail.com> wrote:
I have test network client, something like apache bench tool.
It uses mvars to synchronize and everything is ok when
compiled without -threaded.
real    0m2.995s
user    0m0.601s
sys    0m2.391s

With -threaded compile option result is following:
real    0m18.196s
user    0m2.054s
sys    0m3.313s

Seems that program is sleeping most of the time for some
reason. I can't explain behavior as it seems that
program is ok. It starts `concurrency` threads which
wait on mvar to process next task.

Program follows:

{-# Language OverloadedStrings #-}
import System.CPUTime
import System.IO
--import System.IO.Error
import Network.Socket hiding(recv)
import Network.Socket.ByteString
import System.Environment
import Control.Concurrent
import Control.Exception

main = do
    n <- getArgs
    let (host,port,conc,reqs) = parse n
    putStrLn $ "Connecting to " ++ host ++ " " ++ port
    s <- getAddrInfo Nothing (Just host) (Just port)
    let servAddr = head s
    begin <- getCPUTime
    process servAddr conc reqs
    end <- getCPUTime
    let diff = (fromIntegral (end - begin))/(10^12) :: Double
    putStrLn $ show (round (fromIntegral reqs / diff)) ++ " r/s"

parse [h,p,conc,reqs] = (h,p,read conc::Int,read reqs::Int)
parse _ = error "usage client host port concurrency requests"

process servAddr conc reqs = do
    let niter = if reqs >= conc then conc else reqs
    putStrLn $ "loop " ++ show niter
    mvars <- initThreads niter []
    putStrLn $ "Initialized " ++ show niter
    let loop n (m:mvs) f | n>0  = do
            flag <- isEmptyMVar m
            if f > length mvars then putStrLn "busy" else return ()
            if flag || f > length mvars
                then do
                    putMVar m ()
                    loop (n-1) mvs 0
                else loop n mvs (f+1)
            | otherwise = return ()
        loop n [] f = if n>0 then loop n mvars f else return ()
    putStrLn $ "length " ++ show (length mvars)
    loop (reqs-niter) mvars 0
    where
        initThreads niter vars | niter > 0 =  do
            mvar <- newMVar ()
            forkIO $ process mvar
            initThreads (niter-1) (mvar:vars)
            | otherwise = return vars
        process mvar = do
            sock <- socket (addrFamily servAddr) Stream defaultProtocol
            connect sock (addrAddress servAddr)
            sendAll sock "Hello World!\n"
            buf <- recv sock 1024
            close sock
            takeMVar mvar
            process mvar

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Branimir Maksimovic | 8 Jan 04:01 2014
Picon

Re: Slow mvar when compiled with threaded

On 01/07/2014 09:39 PM, Bob Ippolito wrote:
> Here's a much simpler implementation for that sort of pattern, using 
> channels to fan out work to threads. I added a dependency on Criterion 
> because getCPUTime is basically useless for this kind of measurement 
> on Mac OS X since it doesn't include the time that the process spent 
> waiting on IO:
Great, thank you very much. You gave me material for learning ;)
However, my version is significantly faster when compiling without 
-threaded.
With -threaded option, your version is much faster than mine, but both 
are significantly slower
then compile without -threaded.

Thanks!
Bob Ippolito | 8 Jan 04:15 2014

Re: Slow mvar when compiled with threaded

On Tuesday, January 7, 2014, Branimir Maksimovic wrote:

On 01/07/2014 09:39 PM, Bob Ippolito wrote:
Here's a much simpler implementation for that sort of pattern, using channels to fan out work to threads. I added a dependency on Criterion because getCPUTime is basically useless for this kind of measurement on Mac OS X since it doesn't include the time that the process spent waiting on IO:
Great, thank you very much. You gave me material for learning ;)
However, my version is significantly faster when compiling without -threaded.
With -threaded option, your version is much faster than mine, but both are significantly slower
then compile without -threaded.

Happy to look into it, I didn't have time today to do benchmarks (and Mac OS X is the worst platform to do this kind of testing on regardless of language, its network stack is inconsistent at best). I need to know more:

How exactly are you compiling?
Which OS?
What version of GHC and Haskell Platform? What is the exact command line you execute it with?
What timings do you get?
What's the code for the server are you connecting to?
Loopback, local network, or internet?

-bob
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Branimir Maksimovic | 8 Jan 04:27 2014
Picon

Re: Slow mvar when compiled with threaded

On 01/08/2014 04:15 AM, Bob Ippolito wrote:
On Tuesday, January 7, 2014, Branimir Maksimovic wrote:
On 01/07/2014 09:39 PM, Bob Ippolito wrote:
Here's a much simpler implementation for that sort of pattern, using channels to fan out work to threads. I added a dependency on Criterion because getCPUTime is basically useless for this kind of measurement on Mac OS X since it doesn't include the time that the process spent waiting on IO:
Great, thank you very much. You gave me material for learning ;)
However, my version is significantly faster when compiling without -threaded.
With -threaded option, your version is much faster than mine, but both are significantly slower
then compile without -threaded.

Happy to look into it, I didn't have time today to do benchmarks (and Mac OS X is the worst platform to do this kind of testing on regardless of language, its network stack is inconsistent at best). I need to know more:

How exactly are you compiling?

ghc-7.6.3 --make -O2 client.hs

Which OS?

Ubuntu 13.10 with 3.13-rc7 kernel.

What version of GHC and Haskell Platform?

bmaxa <at> maxa:~$ apt-cache policy haskell-platform
haskell-platform:
  Installed: 2013.2.0.0
  Candidate: 2013.2.0.0
  Version table:
 *** 2013.2.0.0 0
        500 http://archive.ubuntu.com/ubuntu/ saucy/universe amd64 Packages
        100 /var/lib/dpkg/status


What is the exact command line you execute it with?

time ./client maxa 5055 1000 100000

What timings do you get?
with your version:
real    0m4.235s
user    0m1.589s
sys    0m2.642s

with my version
real    0m3.010s
user    0m0.590s
sys    0m2.417s
that is, of course, without -threaded

What's the code for the server are you connecting to?
import Network (listenOn,PortID(..))
import Network.Socket (accept,close)
import Network.Socket.ByteString
import System.Environment
import Control.Concurrent (forkIO)

main = do
    n <- getArgs
    let nn = (read.head) n :: Int
    putStrLn $ "Listening on " ++ show nn
    sock <- listenOn $ PortNumber $ fromIntegral nn
    serve sock
   
serve sock = do
    (s,_) <- accept sock
    forkIO $ process s
    serve sock

process sock = do
    buf <- recv sock 1024
    sendAll sock buf
    close sock

Loopback, local network, or internet?

loopback. Testing is on same computer.

Thanks!
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Bob Ippolito | 8 Jan 04:47 2014

Re: Slow mvar when compiled with threaded



On Tuesday, January 7, 2014, Branimir Maksimovic wrote:
On 01/08/2014 04:15 AM, Bob Ippolito wrote:
On Tuesday, January 7, 2014, Branimir Maksimovic wrote:
On 01/07/2014 09:39 PM, Bob Ippolito wrote:
Here's a much simpler implementation for that sort of pattern, using channels to fan out work to threads. I added a dependency on Criterion because getCPUTime is basically useless for this kind of measurement on Mac OS X since it doesn't include the time that the process spent waiting on IO:
Great, thank you very much. You gave me material for learning ;)
However, my version is significantly faster when compiling without -threaded.
With -threaded option, your version is much faster than mine, but both are significantly slower
then compile without -threaded.

Happy to look into it, I didn't have time today to do benchmarks (and Mac OS X is the worst platform to do this kind of testing on regardless of language, its network stack is inconsistent at best). I need to know more:

How exactly are you compiling?

ghc-7.6.3 --make -O2 client.hs

Which OS?

Ubuntu 13.10 with 3.13-rc7 kernel.

What version of GHC and Haskell Platform?

bmaxa <at> maxa:~$ apt-cache policy haskell-platform
haskell-platform:
  Installed: 2013.2.0.0
  Candidate: 2013.2.0.0
  Version table:
 *** 2013.2.0.0 0
        500 http://archive.ubuntu.com/ubuntu/ saucy/universe amd64 Packages
        100 /var/lib/dpkg/status


What is the exact command line you execute it with?

time ./client maxa 5055 1000 100000

What happens if you add +RTS -N to the end of that command line?
 


What timings do you get?
with your version:
real    0m4.235s
user    0m1.589s
sys    0m2.642s

with my version
real    0m3.010s
user    0m0.590s
sys    0m2.417s
that is, of course, without -threaded

What's the code for the server are you connecting to?
import Network (listenOn,PortID(..))
import Network.Socket (accept,close)
import Network.Socket.ByteString
import System.Environment
import Control.Concurrent (forkIO)

main = do
    n <- getArgs
    let nn = (read.head) n :: Int
    putStrLn $ "Listening on " ++ show nn
    sock <- listenOn $ PortNumber $ fromIntegral nn
    serve sock
   
serve sock = do
    (s,_) <- accept sock
    forkIO $ process s
    serve sock

process sock = do
    buf <- recv sock 1024
    sendAll sock buf
    close sock

Loopback, local network, or internet?

loopback. Testing is on same computer.

Thanks!

Great, I'll look closer later tonight or tomorrow morning. 
 
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Bob Ippolito | 8 Jan 16:50 2014

Re: Slow mvar when compiled with threaded

I haven't yet been able to sort out the performance difference, and probably won't have time to dig in deeper with profiling tools today. I've put together a gist with all of the files and a cabal file so it's easy to build with the right options.



On Tue, Jan 7, 2014 at 7:47 PM, Bob Ippolito <bob <at> redivi.com> wrote:


On Tuesday, January 7, 2014, Branimir Maksimovic wrote:
On 01/08/2014 04:15 AM, Bob Ippolito wrote:
On Tuesday, January 7, 2014, Branimir Maksimovic wrote:
On 01/07/2014 09:39 PM, Bob Ippolito wrote:
Here's a much simpler implementation for that sort of pattern, using channels to fan out work to threads. I added a dependency on Criterion because getCPUTime is basically useless for this kind of measurement on Mac OS X since it doesn't include the time that the process spent waiting on IO:
Great, thank you very much. You gave me material for learning ;)
However, my version is significantly faster when compiling without -threaded.
With -threaded option, your version is much faster than mine, but both are significantly slower
then compile without -threaded.

Happy to look into it, I didn't have time today to do benchmarks (and Mac OS X is the worst platform to do this kind of testing on regardless of language, its network stack is inconsistent at best). I need to know more:

How exactly are you compiling?

ghc-7.6.3 --make -O2 client.hs

Which OS?

Ubuntu 13.10 with 3.13-rc7 kernel.

What version of GHC and Haskell Platform?

bmaxa <at> maxa:~$ apt-cache policy haskell-platform
haskell-platform:
  Installed: 2013.2.0.0
  Candidate: 2013.2.0.0
  Version table:
 *** 2013.2.0.0 0
        500 http://archive.ubuntu.com/ubuntu/ saucy/universe amd64 Packages
        100 /var/lib/dpkg/status


What is the exact command line you execute it with?

time ./client maxa 5055 1000 100000

What happens if you add +RTS -N to the end of that command line?
 


What timings do you get?
with your version:
real    0m4.235s
user    0m1.589s
sys    0m2.642s

with my version
real    0m3.010s
user    0m0.590s
sys    0m2.417s
that is, of course, without -threaded

What's the code for the server are you connecting to?
import Network (listenOn,PortID(..))
import Network.Socket (accept,close)
import Network.Socket.ByteString
import System.Environment
import Control.Concurrent (forkIO)

main = do
    n <- getArgs
    let nn = (read.head) n :: Int
    putStrLn $ "Listening on " ++ show nn
    sock <- listenOn $ PortNumber $ fromIntegral nn
    serve sock
   
serve sock = do
    (s,_) <- accept sock
    forkIO $ process s
    serve sock

process sock = do
    buf <- recv sock 1024
    sendAll sock buf
    close sock

Loopback, local network, or internet?

loopback. Testing is on same computer.

Thanks!

Great, I'll look closer later tonight or tomorrow morning. 
 

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Bob Ippolito | 10 Jan 01:03 2014

Re: Slow mvar when compiled with threaded

So I dug a bit deeper (trying N different ways to write the code, ThreadScope, etc.) and I have some good news some bad news.

The good news is that I figured out how to fix it, and it's an "easy" fix.

The bad news is that the fix is to compile it with GHC HEAD (~7.8, I built ec4af3f), which has the new Mio high-performance multicore IO manager [1]. Apparently the old IO manager wasn't well suited to this use case.

With the latest GHC, the -threaded version outperforms the single threaded version, and the numbers for both are better than with GHC 7.6.3.

I've updated the gist with a README that has some more details:


-bob



On Wed, Jan 8, 2014 at 7:50 AM, Bob Ippolito <bob <at> redivi.com> wrote:
I haven't yet been able to sort out the performance difference, and probably won't have time to dig in deeper with profiling tools today. I've put together a gist with all of the files and a cabal file so it's easy to build with the right options.



On Tue, Jan 7, 2014 at 7:47 PM, Bob Ippolito <bob <at> redivi.com> wrote:


On Tuesday, January 7, 2014, Branimir Maksimovic wrote:
On 01/08/2014 04:15 AM, Bob Ippolito wrote:
On Tuesday, January 7, 2014, Branimir Maksimovic wrote:
On 01/07/2014 09:39 PM, Bob Ippolito wrote:
Here's a much simpler implementation for that sort of pattern, using channels to fan out work to threads. I added a dependency on Criterion because getCPUTime is basically useless for this kind of measurement on Mac OS X since it doesn't include the time that the process spent waiting on IO:
Great, thank you very much. You gave me material for learning ;)
However, my version is significantly faster when compiling without -threaded.
With -threaded option, your version is much faster than mine, but both are significantly slower
then compile without -threaded.

Happy to look into it, I didn't have time today to do benchmarks (and Mac OS X is the worst platform to do this kind of testing on regardless of language, its network stack is inconsistent at best). I need to know more:

How exactly are you compiling?

ghc-7.6.3 --make -O2 client.hs

Which OS?

Ubuntu 13.10 with 3.13-rc7 kernel.

What version of GHC and Haskell Platform?

bmaxa <at> maxa:~$ apt-cache policy haskell-platform
haskell-platform:
  Installed: 2013.2.0.0
  Candidate: 2013.2.0.0
  Version table:
 *** 2013.2.0.0 0
        500 http://archive.ubuntu.com/ubuntu/ saucy/universe amd64 Packages
        100 /var/lib/dpkg/status


What is the exact command line you execute it with?

time ./client maxa 5055 1000 100000

What happens if you add +RTS -N to the end of that command line?
 


What timings do you get?
with your version:
real    0m4.235s
user    0m1.589s
sys    0m2.642s

with my version
real    0m3.010s
user    0m0.590s
sys    0m2.417s
that is, of course, without -threaded

What's the code for the server are you connecting to?
import Network (listenOn,PortID(..))
import Network.Socket (accept,close)
import Network.Socket.ByteString
import System.Environment
import Control.Concurrent (forkIO)

main = do
    n <- getArgs
    let nn = (read.head) n :: Int
    putStrLn $ "Listening on " ++ show nn
    sock <- listenOn $ PortNumber $ fromIntegral nn
    serve sock
   
serve sock = do
    (s,_) <- accept sock
    forkIO $ process s
    serve sock

process sock = do
    buf <- recv sock 1024
    sendAll sock buf
    close sock

Loopback, local network, or internet?

loopback. Testing is on same computer.

Thanks!

Great, I'll look closer later tonight or tomorrow morning. 
 


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Branimir Maksimovic | 10 Jan 01:13 2014
Picon

Re: Slow mvar when compiled with threaded

On 01/10/2014 01:03 AM, Bob Ippolito wrote:
>
> The bad news is that the fix is to compile it with GHC HEAD (~7.8, I 
> built ec4af3f), which has the new Mio high-performance multicore IO 
> manager [1]. Apparently the old IO manager wasn't well suited to this 
> use case.
>
> With the latest GHC, the -threaded version outperforms the single 
> threaded version, and the numbers for both are better than with GHC 7.6.3.
>
Thanks. Glad to hear that!

Gmane