Philip Müller | 17 May 19:21
Favicon

another Newbie performance question

Hi everybody,

I was doing an assignment in Java for my university concerning a program 
that reads, modifies and writes CSV files, when I suddenly had the idea 
of implementing parts of this in Haskell for fun.

When I finished the Haskell programm, I was disappointed by the performance:
To parse a 200k lines CSV file, insert a line (yes I know i could insert 
a line without parsing the file, that's just an example) at pos. 199999 
and write the file again, the Java program takes 1.1 seconds while the 
Haskell program takes 12.5 seconds.

I have read Don's blog post but am unsure how to implement his tips into 
my program, as I am still kind of a Haskell beginner.

The source code (40 lines incl. comments and empty lines) and the 200k 
CSV file I used for testing and a smaller CSV file demonstrating the 
special easy-to-parse CSV syntax are available on my ftp server,

ftp://baah.servegame.org/public/haskell

The call syntax is
<program> <csv file> <line to insert>
e.g.
main lang.csv "test","this","line"

I haven't posted the source code here directly because I thought it 
might be too long.

If someone here finds the time to look at my code and give me some 
(Continue reading)

Achim Schneider | 17 May 20:35
Picon

Re: another Newbie performance question

Philip Müller <mail <at> philip.in-aachen.net> wrote:

> I have read Don's blog post but am unsure how to implement his tips
> into my program, as I am still kind of a Haskell beginner.
> 
Dan, you seem to have opened a big can of worms. If Haskell is
successful, it's your fault.

Without doing any compiling, staring at core nor profiling myself, I
advice that you

1) traverse the list less often. 
2) use ByteStrings
3) use an intermediate data structure that has better insert behaviour
than a standard list, have a look at Data.Sequence
4) really, really use ByteStrings
5) listen to me if I tell you to use ByteStrings
6) if you already must write in pointless style, please don't also
order the functions in a backward way.

--

-- 
(c) this sig last receiving data processing entity. Inspect headers for
past copyright information. All rights reserved. Unauthorised copying,
hiring, renting, public performance and/or broadcasting of this
signature prohibited. 
Luke Palmer | 17 May 23:07
Picon

Re: another Newbie performance question

On Sat, May 17, 2008 at 5:22 PM, Philip Müller
<mail <at> philip.in-aachen.net> wrote:
> If someone here finds the time to look at my code and give me some hints,
> that would really be nice.

A little experimentation reveals that your main bottleneck is readCSVLine:

  readCSVLine = read . (\x -> "[" ++ x ++ "]")

(I just replaced it with (:[]) and it sped up enormously)

Thus, I rewrote it myself instead of with read.

readCSVLine       = unfoldr builder
    where
    builder [] = Nothing
    builder xs = Just $ readField xs

    readField []       = ([],[])
    readField (',':xs) = ([],xs)
    readField ('"':xs) =
        let (l,'"':r) = span (/= '"') xs
            (field,rest) = readField r

And decreased the runtime from 17 seconds to 4.2.  It probably admits
an even better implementation, but it's likely that this is not the
bottleneck anymore.

The other thing is that the whole table is stored in memory because of
your call to "length csv" in doInteraction.  This may have been the
(Continue reading)

apfelmus | 18 May 09:41
Picon
Favicon

Re: another Newbie performance question

Achim Schneider wrote:
> 1) traverse the list less often. 

Luke Palmer wrote:
> Philip Müller wrote:
>> If someone here finds the time to look at my code and give me some hints,
>> that would really be nice.
> 
> There are probably a few other tricks you could do, but I think I
> identified the main factors.

List concatenation (++) takes time proportional to the length of the 
first list. In other words, every (xs ++) traverses xs again. So, things 
like (x ++ "\n") are expensive. Better write  writeCSV as

   writeCSV = unlines . map (concat . intersperse "," . map show)

Here,  unlines  is a library function which implements your

   (\x -> x ++ "\n") . concat . intersperse "\n"

but slightly more efficiently.

In general, difference lists are better for output (concatenation) than 
Strings and ByteStrings. So, if you want to squeeze even more speed out 
of your program, use those.

Regards,
apfelmus
(Continue reading)

Peter Gammie | 18 May 14:36
Picon

Re: another Newbie performance question

You want to lazily read a CSV file? I had a crack at writing a module  
for that:

http://peteg.org/blog/AYAD/Project/2008-04-11-LazyCSVParser.autumn

It is not quite RFC compliant (I believe that is clearly commented  
upon).

As I say there, anything based on (older versions of) Parsec is likely  
to be too strict in the general case.

cheers
peter
Henning Thielemann | 18 May 15:15
Picon

Re: another Newbie performance question


On Sat, 17 May 2008, Luke Palmer wrote:

> insertLine line csv = let (l,r) =
>    splitLast csv in l ++ [readCSVLine line] ++ r
>    where
>    splitLast [x]    = ([],[x])
>    splitLast (x:xs) = let (l,r) = splitLast xs in (x:l,r)
>
> (Note that I got rid of the "pos" parameter)

I'd like to have a function like
    Data.List.viewR :: [a] -> Maybe ([a], a)
   analogously to Data.Sequence, which is a safe replacement for 'init' and 
'last'.

Gmane