Konstantin Litvinenko | 18 Mar 09:53 2013
Picon

Streaming bytes and performance

Hi All!

I tune my toy project for performance and hit the wall on simple, in 
imperative world, task. Here is the code that model what I'm trying to 
achieve

import qualified Data.ByteString.Lazy as L
import Data.Word8(isSpace)
import Data.Word
import Control.Monad.State

type Stream = State L.ByteString

get_byte :: Stream (Maybe Word8)
get_byte = do
     s <- get
     case L.uncons s of
         Nothing -> return Nothing
         Just (x, xs) -> put xs >> return (Just x)

main = do
     f <- L.readFile "test.txt"
     let r = evalState count_spaces f
     print r
   where
     count_spaces = go 0
       where
         go a = do
             x <- get_byte
             case x of
(Continue reading)

Gregory Collins | 18 Mar 13:14 2013
Picon

Re: Streaming bytes and performance

Put a bang pattern on your accumulator in "go". Since the value is not demanded until the end of the program, you're actually just building up a huge space leak there.

Secondly, unconsing from the lazy bytestring will cause a lot of allocation churn in the garbage collector -- each byte read in the input forces the creation of a new "L.ByteString", which is many times larger.

Also please consider trying the "io-streams" library that I wrote (http://hackage.haskell.org/package/io-streams). It provides primitives for streaming IO in "basic Haskell" style. To provide a Word8 stream (which is probably a bad idea performance-wise) it would be most efficient allocation-wise to implement a mutable index cursor (i.e. IORef Int) that pointed to your current position within the ByteString chunk, other strategies will probably allocate too much.

G



On Mon, Mar 18, 2013 at 9:53 AM, Konstantin Litvinenko <to.darkangel <at> gmail.com> wrote:
Hi All!

I tune my toy project for performance and hit the wall on simple, in imperative world, task. Here is the code that model what I'm trying to achieve

import qualified Data.ByteString.Lazy as L
import Data.Word8(isSpace)
import Data.Word
import Control.Monad.State

type Stream = State L.ByteString

get_byte :: Stream (Maybe Word8)
get_byte = do
    s <- get
    case L.uncons s of
        Nothing -> return Nothing
        Just (x, xs) -> put xs >> return (Just x)

main = do
    f <- L.readFile "test.txt"
    let r = evalState count_spaces f
    print r
  where
    count_spaces = go 0
      where
        go a = do
            x <- get_byte
            case x of
                Just x' ->  if isSpace x' then go (a + 1) else go a
                Nothing -> return a

It takes the file and count spaces, in imperative way, consuming bytes one by one. The problem is: How to rewrite this to get rid of constant allocation of state but still working with stream of bytes? I can rewrite this as one-liner L.foldl, but that doesn't help me in any way to optimize my toy project where all algorithms build upon consuming stream of bytes.

PS. My main lang is C++ over 10 years and I only learn Haskell :)


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe



--
Gregory Collins <greg <at> gregorycollins.net>
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Konstantin Litvinenko | 19 Mar 14:40 2013
Picon

Re: Streaming bytes and performance

On 03/18/2013 02:14 PM, Gregory Collins wrote:
> Put a bang pattern on your accumulator in "go". Since the value is not
> demanded until the end of the program, you're actually just building up
> a huge space leak there.

Fixed that

> Secondly, unconsing from the lazy bytestring will cause a lot of
> allocation churn in the garbage collector -- each byte read in the input
> forces the creation of a new "L.ByteString", which is many times larger.

Nope. L.ByteString is created along with strict ByteString but content 
not copied. And, in fact, that not a problem. The problem is that GHC 
unable to optimize constantly changing state in State monad. I don't 
know is it posible or not and if it is than what should I do to allow 
such optimization.

import Control.Monad.State.Strict

data S6 = S6 Int Int

main_6 = do
     let r = evalState go (S6 10000 0)
     print r
   where
     go = do
         (S6 i a) <- get
         if (i == 0) then return a else put (S6 (i - 1) (a + i)) >> go

main_7 = do
     let r = go (S6 10000 0)
     print r
   where
     go (S6 i a)
         | i == 0 = a
         | otherwise = go $ S6 (i - 1) (a + i)

main = main_7

If I run main_6 I get constant allocations. If I run main_7 I get no 
allocations.

Does anybody know how to overcome this inefficiency?
Don Stewart | 19 Mar 20:36 2013
Picon

Re: Streaming bytes and performance

Just for fun. Here's some improvements. about 6x faster. 
I'd be interested to see what io-streams could do on this.

Using a 250M test file.

-- strict state monad and bang patterns on the uncons and accumulator argument:

$ time ./A
4166680
./A  8.42s user 0.57s system 99% cpu 9.037 total

-- just write a loop

$ time ./A 
4166680
./A  3.84s user 0.26s system 99% cpu 4.121 total

-- switch to Int

$ time ./A 
4166680
./A  1.89s user 0.23s system 99% cpu 2.134 total

-- custom isSpace function

$ time ./A
4166680
./A  1.56s user 0.24s system 99% cpu 1.808 total

-- mmap IO

$ time ./A
4166680
./A  1.54s user 0.09s system 99% cpu 1.636 total

Here's the final program:


{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString        as S
import qualified Data.ByteString.Lazy.Char8 as L
import System.IO.Posix.MMap.Lazy

main = do
    f <- unsafeMMapFile "test.txt"
    print $ go 0 f
  where
    go :: Int -> L.ByteString -> Int
    go !a !s = case L.uncons s of
        Nothing     -> a
        Just (x,xs) | isSpaceChar8 x -> go (a+1) xs
                    | otherwise      -> go a     xs

isSpaceChar8 c = c == '\n'    || c == ' '
{-# INLINE isSpaceChar8 #-}


On Mon, Mar 18, 2013 at 8:53 AM, Konstantin Litvinenko <to.darkangel <at> gmail.com> wrote:
Hi All!

I tune my toy project for performance and hit the wall on simple, in imperative world, task. Here is the code that model what I'm trying to achieve

import qualified Data.ByteString.Lazy as L
import Data.Word8(isSpace)
import Data.Word
import Control.Monad.State

type Stream = State L.ByteString

get_byte :: Stream (Maybe Word8)
get_byte = do
    s <- get
    case L.uncons s of
        Nothing -> return Nothing
        Just (x, xs) -> put xs >> return (Just x)

main = do
    f <- L.readFile "test.txt"
    let r = evalState count_spaces f
    print r
  where
    count_spaces = go 0
      where
        go a = do
            x <- get_byte
            case x of
                Just x' ->  if isSpace x' then go (a + 1) else go a
                Nothing -> return a

It takes the file and count spaces, in imperative way, consuming bytes one by one. The problem is: How to rewrite this to get rid of constant allocation of state but still working with stream of bytes? I can rewrite this as one-liner L.foldl, but that doesn't help me in any way to optimize my toy project where all algorithms build upon consuming stream of bytes.

PS. My main lang is C++ over 10 years and I only learn Haskell :)


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Don Stewart | 19 Mar 21:32 2013
Picon

Re: Streaming bytes and performance

Oh, I forgot the technique of inlining the lazy bytestring chunks, and processing each chunk seperately.


$ time ./fast
4166680
./fast  1.25s user 0.07s system 99% cpu 1.325 total

Essentially inline Lazy.foldlChunks and specializes is (the inliner should really get that).
And now we have a nice unboxed inner loop, which llvm might spot:

$ ghc -O2 -funbox-strict-fields fast.hs  --make -fllvm 
$ time ./fast                                                          
4166680
./fast  1.07s user 0.06s system 98% cpu 1.146 total

So about 8x faster. Waiting for some non-lazy bytestring benchmarks... :)

{-# LANGUAGE BangPatterns #-}

import Data.ByteString.Internal
import Data.ByteString.Unsafe
import qualified Data.ByteString.Char8      as S
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.ByteString.Lazy.Internal as L
import System.IO.Posix.MMap.Lazy

main = do
    f <- unsafeMMapFile "test.txt"
    print . new 0 $ L.toChunks f

new :: Int -> [ByteString] -> Int
new i [] = i
new i (x:xs) = new (add i x) xs

-- jump into the fast path
{-# INLINE add #-}
add :: Int -> ByteString -> Int
add !i !s | S.null s   = i
          | isSpace' x = add (i+1) xs
          | otherwise  = add i     xs
  where T x xs = uncons s

data T = T !Char ByteString

uncons s = T (w2c (unsafeHead s)) (unsafeTail s)

isSpace' c = c == '\n'    || c == ' '
{-# INLINE isSpace' #-}




On Tue, Mar 19, 2013 at 7:36 PM, Don Stewart <dons00 <at> gmail.com> wrote:
Just for fun. Here's some improvements. about 6x faster. 
I'd be interested to see what io-streams could do on this.

Using a 250M test file.

-- strict state monad and bang patterns on the uncons and accumulator argument:

$ time ./A
4166680
./A  8.42s user 0.57s system 99% cpu 9.037 total

-- just write a loop

$ time ./A 
4166680
./A  3.84s user 0.26s system 99% cpu 4.121 total

-- switch to Int

$ time ./A 
4166680
./A  1.89s user 0.23s system 99% cpu 2.134 total

-- custom isSpace function

$ time ./A
4166680
./A  1.56s user 0.24s system 99% cpu 1.808 total

-- mmap IO

$ time ./A
4166680
./A  1.54s user 0.09s system 99% cpu 1.636 total

Here's the final program:


{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString        as S
import qualified Data.ByteString.Lazy.Char8 as L
import System.IO.Posix.MMap.Lazy

main = do
    f <- unsafeMMapFile "test.txt"
    print $ go 0 f
  where
    go :: Int -> L.ByteString -> Int
    go !a !s = case L.uncons s of
        Nothing     -> a
        Just (x,xs) | isSpaceChar8 x -> go (a+1) xs
                    | otherwise      -> go a     xs

isSpaceChar8 c = c == '\n'    || c == ' '
{-# INLINE isSpaceChar8 #-}


On Mon, Mar 18, 2013 at 8:53 AM, Konstantin Litvinenko <to.darkangel <at> gmail.com> wrote:
Hi All!

I tune my toy project for performance and hit the wall on simple, in imperative world, task. Here is the code that model what I'm trying to achieve

import qualified Data.ByteString.Lazy as L
import Data.Word8(isSpace)
import Data.Word
import Control.Monad.State

type Stream = State L.ByteString

get_byte :: Stream (Maybe Word8)
get_byte = do
    s <- get
    case L.uncons s of
        Nothing -> return Nothing
        Just (x, xs) -> put xs >> return (Just x)

main = do
    f <- L.readFile "test.txt"
    let r = evalState count_spaces f
    print r
  where
    count_spaces = go 0
      where
        go a = do
            x <- get_byte
            case x of
                Just x' ->  if isSpace x' then go (a + 1) else go a
                Nothing -> return a

It takes the file and count spaces, in imperative way, consuming bytes one by one. The problem is: How to rewrite this to get rid of constant allocation of state but still working with stream of bytes? I can rewrite this as one-liner L.foldl, but that doesn't help me in any way to optimize my toy project where all algorithms build upon consuming stream of bytes.

PS. My main lang is C++ over 10 years and I only learn Haskell :)


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Konstantin Litvinenko | 19 Mar 21:49 2013
Picon

Re: Streaming bytes and performance

On 03/19/2013 10:32 PM, Don Stewart wrote:
> Oh, I forgot the technique of inlining the lazy bytestring chunks, and
> processing each chunk seperately.
>
> $ time ./fast
> 4166680
> ./fast  1.25s user 0.07s system 99% cpu 1.325 total
>
> Essentially inline Lazy.foldlChunks and specializes is (the inliner
> should really get that).
> And now we have a nice unboxed inner loop, which llvm might spot:
>
> $ ghc -O2 -funbox-strict-fields fast.hs  --make -fllvm
> $ time ./fast
> 4166680
> ./fast  1.07s user 0.06s system 98% cpu *1.146 total*
>
> So about 8x faster. Waiting for some non-lazy bytestring benchmarks... :)

Thanks Don, but after some investigation I came to conclusion that 
problem is in State monad

{-# LANGUAGE BangPatterns #-}

import Control.Monad.State.Strict

data S6 = S6 !Int !Int

main_6 = do
     let r = evalState go (S6 10000 0)
     print r
   where
     go = do
         (S6 i a) <- get
         if (i == 0) then return a else (put (S6 (i - 1) (a + i))) >> go

main_7 = do
     let r = go (S6 10000 0)
     print r
   where
     go (S6 i a)
         | i == 0 = a
         | otherwise = go $ S6 (i - 1) (a + i)

main = main_6

main_6 doing constant allocations while main_7 run in constant space. 
Can you suggest something that improve situation? I don't want to 
manually unfold all my code that I want to be fast :(.
Konstantin Litvinenko | 19 Mar 22:27 2013
Picon

Re: Streaming bytes and performance

On 03/19/2013 10:49 PM, Konstantin Litvinenko wrote:
> {-# LANGUAGE BangPatterns #-}
>
> import Control.Monad.State.Strict
>
> data S6 = S6 !Int !Int
>
> main_6 = do
>      let r = evalState go (S6 10000 0)
>      print r
>    where
>      go = do
>          (S6 i a) <- get
>          if (i == 0) then return a else (put (S6 (i - 1) (a + i))) >> go
>
> main_7 = do
>      let r = go (S6 10000 0)
>      print r
>    where
>      go (S6 i a)
>          | i == 0 = a
>          | otherwise = go $ S6 (i - 1) (a + i)
>
> main = main_6
>
> main_6 doing constant allocations while main_7 run in constant space.
> Can you suggest something that improve situation? I don't want to
> manually unfold all my code that I want to be fast :(.

Correction - they both run in constant space, that's not a problem. The 
problem is main_6 doing constant allocation/destroying and main_7 doesn't.
Branimir Maksimovic | 19 Mar 23:47 2013
Picon

Re: Streaming bytes and performance



> To: haskell-cafe <at> haskell.org
> From: to.darkangel <at> gmail.com
> Date: Tue, 19 Mar 2013 23:27:09 +0200
> Subject: Re: [Haskell-cafe] Streaming bytes and performance
>
> On 03/19/2013 10:49 PM, Konstantin Litvinenko wrote:
> > {-# LANGUAGE BangPatterns #-}
> >
> > import Control.Monad.State.Strict
> >
> > data S6 = S6 !Int !Int
> >
> > main_6 = do
> > let r = evalState go (S6 10000 0)
> > print r
> > where
> > go = do
> > (S6 i a) <- get
> > if (i == 0) then return a else (put (S6 (i - 1) (a + i))) >> go
> >
> > main_7 = do
> > let r = go (S6 10000 0)
> > print r
> > where
> > go (S6 i a)
> > | i == 0 = a
> > | otherwise = go $ S6 (i - 1) (a + i)
> >
> > main = main_6
> >
> > main_6 doing constant allocations while main_7 run in constant space.
> > Can you suggest something that improve situation? I don't want to
> > manually unfold all my code that I want to be fast :(.

Your problem is that main_6 thunks 'i' and 'a' .
If you write (S6 !i !a) <- get
than there is no problem any more...

>
> Correction - they both run in constant space, that's not a problem. The
> problem is main_6 doing constant allocation/destroying and main_7 doesn't.

No main_6 does not runs in constant space if you dont use bang patterns...


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Konstantin Litvinenko | 20 Mar 07:04 2013
Picon

Re: Streaming bytes and performance

On 03/20/2013 12:47 AM, Branimir Maksimovic wrote:
> Your problem is that main_6 thunks 'i' and 'a' .
> If you write (S6 !i !a) <- get
> than there is no problem any more...
>

Nope :( Unfortunately that doesn't change anything. Still allocating...
Branimir Maksimovic | 20 Mar 10:17 2013
Picon

Re: Streaming bytes and performance

Are you sure? I use ghc 7.6.2 (compiled with -O2) and without bang patterns
for 1million iterations it blows stack space.
With bang patterns it runs in constant space , same as 
other version?

bmaxa <at> maxa:~/haskell$ ./state +RTS -s
500000500000
          52,080 bytes allocated in the heap
           3,512 bytes copied during GC
          44,416 bytes maximum residency (1 sample(s))
          17,024 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.00s  (  0.00s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.00s  (  0.00s elapsed)

  %GC     time       0.0%  (6.2% elapsed)

  Alloc rate    0 bytes per MUT second

  Productivity 100.0% of total user, 0.0% of total elapsed

> Date: Wed, 20 Mar 2013 08:04:01 +0200
> From: to.darkangel <at> gmail.com
> To: bmaxa <at> hotmail.com
> CC: haskell-cafe <at> haskell.org
> Subject: Re: [Haskell-cafe] Streaming bytes and performance
>
> On 03/20/2013 12:47 AM, Branimir Maksimovic wrote:
> > Your problem is that main_6 thunks 'i' and 'a' .
> > If you write (S6 !i !a) <- get
> > than there is no problem any more...
> >
>
> Nope :( Unfortunately that doesn't change anything. Still allocating...
>
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Konstantin Litvinenko | 20 Mar 11:41 2013
Picon

Re: Streaming bytes and performance

On 03/20/2013 11:17 AM, Branimir Maksimovic wrote:
> Are you sure? I use ghc 7.6.2

Huh, I use 7.4.2, and if 7.6.2 can handle this I will try to switch. Not 
sure how to do that on ubuntu 12.10...
Bas van Dijk | 20 Mar 13:52 2013
Picon

Re: Streaming bytes and performance

On 20 March 2013 11:41, Konstantin Litvinenko <to.darkangel <at> gmail.com> wrote:
> On 03/20/2013 11:17 AM, Branimir Maksimovic wrote:
>>
>> Are you sure? I use ghc 7.6.2
>
>
> Huh, I use 7.4.2, and if 7.6.2 can handle this I will try to switch. Not
> sure how to do that on ubuntu 12.10...

I always install ghcs under my home directory:

wget http://www.haskell.org/ghc/dist/7.6.2/ghc-7.6.2-x86_64-unknown-linux.tar.bz2
tar -xf ghc-7.6.2-x86_64-unknown-linux.tar.bz2
cd ghc-7.6.2
configure --prefix=$HOME/ghcs/7.6.2
make install

Then put $HOME/ghcs/7.6.2/bin in front of your $PATH.

You could also use:
hsenv --ghc=ghc-7.6.2-x86_64-unknown-linux.tar.bz2 for this:
http://hackage.haskell.org/package/hsenv

My colleague Jason just made  a nice improvement:
https://github.com/tmhedberg/hsenv/pull/22

which allows you to do:

hsenv --ghc=7.6.2

which will automatically download the right ghc for your platform and
install it in a new fresh environment isolated from the rest of your
system.

Bas
Konstantin Litvinenko | 20 Mar 17:21 2013
Picon

Re: Streaming bytes and performance

On 03/20/2013 11:17 AM, Branimir Maksimovic wrote:

> Are you sure? I use ghc 7.6.2 (compiled with -O2) and without bang patterns
> for 1million iterations it blows stack space.
> With bang patterns it runs in constant space , same as
> other version?

Okay, I have found the root of allocation problem. It is not because of 
7.4.2. If I use -auto-all it somehow change code generation and start 
allocating. If I remove -auto-all from command line than no allocation 
occurs. That really weird because now I don't know how to profile and 
get meaningful results :(
Nicolas Trangez | 19 Mar 21:53 2013

Re: Streaming bytes and performance

On Tue, 2013-03-19 at 20:32 +0000, Don Stewart wrote:
> Oh, I forgot the technique of inlining the lazy bytestring chunks, and
> processing each chunk seperately.
> 
> $ time ./fast
> 4166680
> ./fast  1.25s user 0.07s system 99% cpu 1.325 total
> 
> Essentially inline Lazy.foldlChunks and specializes is (the inliner should
> really get that).
> And now we have a nice unboxed inner loop, which llvm might spot:
> 
> $ ghc -O2 -funbox-strict-fields fast.hs  --make -fllvm
> $ time ./fast
> 4166680
> ./fast  1.07s user 0.06s system 98% cpu *1.146 total*
> 
> So about 8x faster. Waiting for some non-lazy bytestring benchmarks... :)

You could try something like this using Conduit:

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.Conduit
import qualified Data.Conduit.List as L
import qualified Data.Conduit.Binary as B
import qualified Data.ByteString.Char8 as BS8

main :: IO ()
main = print =<< runResourceT (
    B.sourceFile filename $$ L.fold (\(!a) (!b) -> a + BS8.count ' ' b)
(0 :: Int))
  where
    filename = ...

Nicolas

> 
> {-# LANGUAGE BangPatterns #-}
> 
> import Data.ByteString.Internal
> import Data.ByteString.Unsafe
> import qualified Data.ByteString.Char8      as S
> import qualified Data.ByteString.Lazy.Char8 as L
> import qualified Data.ByteString.Lazy.Internal as L
> import System.IO.Posix.MMap.Lazy
> 
> main = do
>     f <- unsafeMMapFile "test.txt"
>     print . new 0 $ L.toChunks f
> 
> new :: Int -> [ByteString] -> Int
> new i [] = i
> new i (x:xs) = new (add i x) xs
> 
> -- jump into the fast path
> {-# INLINE add #-}
> add :: Int -> ByteString -> Int
> add !i !s | S.null s   = i
>           | isSpace' x = add (i+1) xs
>           | otherwise  = add i     xs
>   where T x xs = uncons s
> 
> data T = T !Char ByteString
> 
> uncons s = T (w2c (unsafeHead s)) (unsafeTail s)
> 
> isSpace' c = c == '\n'    || c == ' '
> {-# INLINE isSpace' #-}
> 
> 
> 
> 
> On Tue, Mar 19, 2013 at 7:36 PM, Don Stewart <dons00 <at> gmail.com> wrote:
> 
> > Just for fun. Here's some improvements. about 6x faster.
> > I'd be interested to see what io-streams could do on this.
> >
> > Using a 250M test file.
> >
> > -- strict state monad and bang patterns on the uncons and accumulator
> > argument:
> >
> > $ time ./A
> > 4166680
> > ./A  8.42s user 0.57s system 99% cpu 9.037 total
> >
> > -- just write a loop
> >
> > $ time ./A
> > 4166680
> > ./A  3.84s user 0.26s system 99% cpu 4.121 total
> >
> > -- switch to Int
> >
> > $ time ./A
> > 4166680
> > ./A  1.89s user 0.23s system 99% cpu 2.134 total
> >
> > -- custom isSpace function
> >
> > $ time ./A
> > 4166680
> > ./A  1.56s user 0.24s system 99% cpu 1.808 total
> >
> > -- mmap IO
> >
> > $ time ./A
> > 4166680
> > ./A  1.54s user 0.09s system 99% cpu 1.636 total
> >
> > Here's the final program:
> >
> >
> > {-# LANGUAGE BangPatterns #-}
> >
> > import qualified Data.ByteString        as S
> > import qualified Data.ByteString.Lazy.Char8 as L
> > import System.IO.Posix.MMap.Lazy
> >
> > main = do
> >     f <- unsafeMMapFile "test.txt"
> >     print $ go 0 f
> >   where
> >     go :: Int -> L.ByteString -> Int
> >     go !a !s = case L.uncons s of
> >         Nothing     -> a
> >         Just (x,xs) | isSpaceChar8 x -> go (a+1) xs
> >                     | otherwise      -> go a     xs
> >
> > isSpaceChar8 c = c == '\n'    || c == ' '
> > {-# INLINE isSpaceChar8 #-}
> >
> >
> > On Mon, Mar 18, 2013 at 8:53 AM, Konstantin Litvinenko <
> > to.darkangel <at> gmail.com> wrote:
> >
> >> Hi All!
> >>
> >> I tune my toy project for performance and hit the wall on simple, in
> >> imperative world, task. Here is the code that model what I'm trying to
> >> achieve
> >>
> >> import qualified Data.ByteString.Lazy as L
> >> import Data.Word8(isSpace)
> >> import Data.Word
> >> import Control.Monad.State
> >>
> >> type Stream = State L.ByteString
> >>
> >> get_byte :: Stream (Maybe Word8)
> >> get_byte = do
> >>     s <- get
> >>     case L.uncons s of
> >>         Nothing -> return Nothing
> >>         Just (x, xs) -> put xs >> return (Just x)
> >>
> >> main = do
> >>     f <- L.readFile "test.txt"
> >>     let r = evalState count_spaces f
> >>     print r
> >>   where
> >>     count_spaces = go 0
> >>       where
> >>         go a = do
> >>             x <- get_byte
> >>             case x of
> >>                 Just x' ->  if isSpace x' then go (a + 1) else go a
> >>                 Nothing -> return a
> >>
> >> It takes the file and count spaces, in imperative way, consuming bytes
> >> one by one. The problem is: How to rewrite this to get rid of constant
> >> allocation of state but still working with stream of bytes? I can rewrite
> >> this as one-liner L.foldl, but that doesn't help me in any way to optimize
> >> my toy project where all algorithms build upon consuming stream of bytes.
> >>
> >> PS. My main lang is C++ over 10 years and I only learn Haskell :)
> >>
> >>
> >> ______________________________**_________________
> >> Haskell-Cafe mailing list
> >> Haskell-Cafe <at> haskell.org
> >> http://www.haskell.org/**mailman/listinfo/haskell-cafe<http://www.haskell.org/mailman/listinfo/haskell-cafe>
> >>
> >
> >
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
Konstantin Litvinenko | 19 Mar 22:01 2013
Picon

Re: Streaming bytes and performance

On 03/19/2013 10:53 PM, Nicolas Trangez wrote:
> On Tue, 2013-03-19 at 20:32 +0000, Don Stewart wrote:
>> So about 8x faster. Waiting for some non-lazy bytestring benchmarks... :)
>
> You could try something like this using Conduit:
>
> {-# LANGUAGE BangPatterns #-}
> module Main (main) where
>
> import Data.Conduit
> import qualified Data.Conduit.List as L
> import qualified Data.Conduit.Binary as B
> import qualified Data.ByteString.Char8 as BS8
>
> main :: IO ()
> main = print =<< runResourceT (
>      B.sourceFile filename $$ L.fold (\(!a) (!b) -> a + BS8.count ' ' b)
> (0 :: Int))
>    where
>      filename = ...

Please stops counting spaces! :) It was a MODEL that demonstrates 
constant allocation of state when I used State monad. That's the 
*problem*. I mention in my first email that I do know how to count 
spaces using one-line L.foldl with no allocations at all :).
Peter Simons | 19 Mar 22:03 2013
Picon

Re: Streaming bytes and performance

Don Stewart <dons00 <at> gmail.com> writes:

 > Here's the final program: [...]

Here is a version of the program that is just as fast:

  import Prelude hiding ( getContents, foldl )
  import Data.ByteString.Char8

  countSpace :: Int -> Char -> Int
  countSpace i c | c == ' ' || c == '\n' = i + 1
                 | otherwise             = i

  main :: IO ()
  main = getContents >>= print . foldl countSpace 0

Generally speaking, I/O performance is not about fancy low-level system
features, it's about having a proper evaluation order:

 | $ ghc --make -O2 -funbox-strict-fields test1 && time ./test1
 | 37627064
 |
 | real 0m0.381s
 | user 0m0.356s
 | sys  0m0.023s

Versus:

 | $ ghc --make -O2 -funbox-strict-fields test2 && time ./test2 <test.txt
 | Linking test2 ...
 | 37627064
 |
 | real 0m0.383s
 | user 0m0.316s
 | sys  0m0.065s

Using this input file stored in /dev/shm:

 | $ ls -l test.txt 
 | -rw-r--r-- 1 simons users 208745650 Mar 19 21:40 test.txt

Take care,
Peter
Don Stewart | 19 Mar 22:16 2013
Picon

Re: Streaming bytes and performance

This isn't a valid entry -- it uses strict IO (so allocates O(n) space) and reads from standard input, which pretty much swamps the interesting constant factors with buffered IO overhead.

Compare your program (made lazy) on lazy bytestrings using file IO:

import Prelude hiding ( readFile, foldl )
import Data.ByteString.Lazy.Char8

countSpace :: Int -> Char -> Int
countSpace i c | c == ' ' || c == '\n' = i + 1
               | otherwise             = i

main :: IO ()
main = readFile "test.txt" >>= print . foldl countSpace 0

Against my earlier optimized one (that manually specializes and does other tricks).

$ time ./C
4166680
./C  1.49s user 0.42s system 82% cpu 2.326 total

$ time ./fast
4166680
./fast  1.05s user 0.11s system 96% cpu 1.201 total

The optimized one is twice as fast. You can write the same program on lists , and it also runs in constant space but completes 32s instead  of 1.3

Constant factors matter.

On Tue, Mar 19, 2013 at 9:03 PM, Peter Simons <simons <at> cryp.to> wrote:
Don Stewart <dons00 <at> gmail.com> writes:

 > Here's the final program: [...]

Here is a version of the program that is just as fast:

  import Prelude hiding ( getContents, foldl )
  import Data.ByteString.Char8

  countSpace :: Int -> Char -> Int
  countSpace i c | c == ' ' || c == '\n' = i + 1
                 | otherwise             = i

  main :: IO ()
  main = getContents >>= print . foldl countSpace 0

Generally speaking, I/O performance is not about fancy low-level system
features, it's about having a proper evaluation order:

 | $ ghc --make -O2 -funbox-strict-fields test1 && time ./test1
 | 37627064
 |
 | real 0m0.381s
 | user 0m0.356s
 | sys  0m0.023s

Versus:

 | $ ghc --make -O2 -funbox-strict-fields test2 && time ./test2 <test.txt
 | Linking test2 ...
 | 37627064
 |
 | real 0m0.383s
 | user 0m0.316s
 | sys  0m0.065s

Using this input file stored in /dev/shm:

 | $ ls -l test.txt
 | -rw-r--r-- 1 simons users 208745650 Mar 19 21:40 test.txt

Take care,
Peter


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Peter Simons | 19 Mar 22:27 2013
Picon

Re: Streaming bytes and performance

Hi Don,

 > Compare your program (made lazy) on lazy bytestrings using file IO: [...]

if I make those changes, the program runs even faster than before:

  module Main ( main ) where

  import Prelude hiding ( foldl, readFile )
  import Data.ByteString.Lazy.Char8

  countSpace :: Int -> Char -> Int
  countSpace i c | c == ' ' || c == '\n' = i + 1
                 | otherwise             = i

  main :: IO ()
  main = readFile "test.txt" >>= print . foldl countSpace 0

This gives

 | $ ghc --make -O2 -funbox-strict-fields test1 && time ./test1
 | 37627064
 |
 | real    0m0.375s
 | user    0m0.346s
 | sys     0m0.028s

versus:

 | $ ghc --make -O2 -funbox-strict-fields test2 && time ./test2
 | 37627064
 |
 | real    0m0.324s
 | user    0m0.299s
 | sys     0m0.024s

Whether getFile or getContents is used doesn't seem to make difference.

Take care,
Peter
Don Stewart | 19 Mar 22:34 2013
Picon

Re: Streaming bytes and performance

Oh I see what you're doing ... "Using this input file stored in /dev/shm"

So not measuring the IO performance at all. :)

On Mar 19, 2013 9:27 PM, "Peter Simons" <simons <at> cryp.to> wrote:
Hi Don,

 > Compare your program (made lazy) on lazy bytestrings using file IO: [...]

if I make those changes, the program runs even faster than before:

  module Main ( main ) where

  import Prelude hiding ( foldl, readFile )
  import Data.ByteString.Lazy.Char8

  countSpace :: Int -> Char -> Int
  countSpace i c | c == ' ' || c == '\n' = i + 1
                 | otherwise             = i

  main :: IO ()
  main = readFile "test.txt" >>= print . foldl countSpace 0

This gives

 | $ ghc --make -O2 -funbox-strict-fields test1 && time ./test1
 | 37627064
 |
 | real    0m0.375s
 | user    0m0.346s
 | sys     0m0.028s

versus:

 | $ ghc --make -O2 -funbox-strict-fields test2 && time ./test2
 | 37627064
 |
 | real    0m0.324s
 | user    0m0.299s
 | sys     0m0.024s

Whether getFile or getContents is used doesn't seem to make difference.

Take care,
Peter
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Peter Simons | 19 Mar 22:49 2013
Picon

Re: Streaming bytes and performance

Hi Don,

 > "Using this input file stored in /dev/shm"
 >
 > So not measuring the IO performance at all. :)

of course the program measures I/O performance. It just doesn't measure
the speed of the disk.

Anyway, a highly optimized benchmark such as the one you posted is
eventually going to beat one that's not as highly optimized. I think
no-one disputes that fact.

I was merely trying to point out that a program which encodes its
evaluation order properly is going to be reasonably fast without any
further optimizations.

Take care,
Peter
Don Stewart | 19 Mar 22:55 2013
Picon

Re: Streaming bytes and performance

I guess the optimizations that went into making lazy bytestring IO fast (on disks) are increasingly irrelevant as SSDs take over.

On Mar 19, 2013 9:49 PM, "Peter Simons" <simons <at> cryp.to> wrote:
Hi Don,

 > "Using this input file stored in /dev/shm"
 >
 > So not measuring the IO performance at all. :)

of course the program measures I/O performance. It just doesn't measure
the speed of the disk.

Anyway, a highly optimized benchmark such as the one you posted is
eventually going to beat one that's not as highly optimized. I think
no-one disputes that fact.

I was merely trying to point out that a program which encodes its
evaluation order properly is going to be reasonably fast without any
further optimizations.

Take care,
Peter
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Gmane