John MacFarlane | 28 Oct 20:16 2012
Picon

Improving Data.Char.isSpace performance

I think that 'isSpace' from Data.Char (and hence also 'words' from the
Prelude) is not as fast as it could be.  Here is the definition
(which is actually found in GHC.Unicode):

isSpace :: Char -> Bool
isSpace c =
    c == ' '     ||
    c == '\t'    ||
    c == '\n'    ||
    c == '\r'    ||
    c == '\f'    ||
    c == '\v'    ||
    c == '\xa0'  ||
    iswspace (fromIntegral (C.ord c)) /= 0

I presume that the point of the disjuncts at the beginning is to
avoid the call to iswspace for the most common space characters.
The problem is that most characters (in most texts) are not space
characters, and for nonspace characters iswspace will always be
called.

So I investigated a possible optimization that would also check for
the most common nonspace characters before calling iswspace:

isSpace_Alt :: Char -> Bool
isSpace_Alt c | c > '\x20' && c < '\xa0' = False
              | c == ' '                 = True
              | '\t' <= c && c <= '\r'   = True
              | c == '\xa0'              = True
              | otherwise                = iswspace (fromIntegral (C.ord c)) /= 0
(Continue reading)

Johan Tibell | 28 Oct 20:21 2012
Picon

Re: Improving Data.Char.isSpace performance

On Sun, Oct 28, 2012 at 12:16 PM, John MacFarlane <jgm <at> berkeley.edu> wrote:
> I'd like to propose that we consider replacing the definition
> of isSpace with isSpace_Alt.

+1

Aside: I've seen a similar optimization in the Beautiful Code book, in
chapter 5.
Edward A Kmett | 28 Oct 22:08 2012
Picon

Re: Improving Data.Char.isSpace performance

+1

It might be worth hunting for similar low hanging fruit as well.

-Edward

On Oct 28, 2012, at 3:16 PM, John MacFarlane <jgm <at> berkeley.edu> wrote:

> I think that 'isSpace' from Data.Char (and hence also 'words' from the
> Prelude) is not as fast as it could be.  Here is the definition
> (which is actually found in GHC.Unicode):
> 
> isSpace :: Char -> Bool
> isSpace c =
>    c == ' '     ||
>    c == '\t'    ||
>    c == '\n'    ||
>    c == '\r'    ||
>    c == '\f'    ||
>    c == '\v'    ||
>    c == '\xa0'  ||
>    iswspace (fromIntegral (C.ord c)) /= 0
> 
> I presume that the point of the disjuncts at the beginning is to
> avoid the call to iswspace for the most common space characters.
> The problem is that most characters (in most texts) are not space
> characters, and for nonspace characters iswspace will always be
> called.
> 
> So I investigated a possible optimization that would also check for
(Continue reading)

John MacFarlane | 29 Oct 00:35 2012
Picon

Re: Improving Data.Char.isSpace performance

+++ Edward A Kmett [Oct 28 12 17:08 ]:
> It might be worth hunting for similar low hanging fruit as well.

No doubt a similar approach would work for isLetter, isDigit,
toUpper, toLower, etc.

One thing puzzles me, though.  Maybe someone here will have an
explanation.  I added a test case to my benchmark for isSpace
*imported* from Data.Char.  The imported isSpace benchmarks
much faster (up to 5X) than the isSpace_DataChar, even though the latter has
the same definition as the former.  Wren Thronton noticed this too,
and suggests (in
http://community.haskell.org/~wren/bytestring-lexing/test/bench/BenchIsSpace.hs)
that the difference "appears to be something special in how the base
library was compiled," but I have no idea what that could be.

John
Johan Tibell | 29 Oct 18:58 2012
Picon

Re: Improving Data.Char.isSpace performance

On Sun, Oct 28, 2012 at 4:35 PM, John MacFarlane <jgm <at> berkeley.edu> wrote:
> +++ Edward A Kmett [Oct 28 12 17:08 ]:
>> It might be worth hunting for similar low hanging fruit as well.
>
> No doubt a similar approach would work for isLetter, isDigit,
> toUpper, toLower, etc.
>
> One thing puzzles me, though.  Maybe someone here will have an
> explanation.  I added a test case to my benchmark for isSpace
> *imported* from Data.Char.  The imported isSpace benchmarks
> much faster (up to 5X) than the isSpace_DataChar, even though the latter has
> the same definition as the former.  Wren Thronton noticed this too,
> and suggests (in
> http://community.haskell.org/~wren/bytestring-lexing/test/bench/BenchIsSpace.hs)
> that the difference "appears to be something special in how the base
> library was compiled," but I have no idea what that could be.

Could be inlining. Compile your benchmark with -ddump-simpl and see if
isSpace got inlined or not.

-- Johan
Simon Peyton-Jones | 29 Oct 23:29 2012
Picon

RE: Improving Data.Char.isSpace performance

|  One thing puzzles me, though.  Maybe someone here will have an
|  explanation.  I added a test case to my benchmark for isSpace
|  *imported* from Data.Char.  The imported isSpace benchmarks
|  much faster (up to 5X) than the isSpace_DataChar, even though the latter has
|  the same definition as the former.  Wren Thronton noticed this too,
|  and suggests (in
|  http://community.haskell.org/~wren/bytestring-	
|  lexing/test/bench/BenchIsSpace.hs)
|  that the difference "appears to be something special in how the base
|  library was compiled," but I have no idea what that could be.

A way to reproduce this would be good.  (If so, open a ticket.)  The only thing I can think of is that libraries
may be compiled with -O2, and that might make a difference.

Simon

|  -----Original Message-----
|  From: libraries-bounces <at> haskell.org [mailto:libraries-bounces <at> haskell.org] On
|  Behalf Of John MacFarlane
|  Sent: 28 October 2012 23:35
|  To: Edward A Kmett
|  Cc: libraries <at> haskell.org
|  Subject: Re: Improving Data.Char.isSpace performance
|  
|  +++ Edward A Kmett [Oct 28 12 17:08 ]:
|  > It might be worth hunting for similar low hanging fruit as well.
|  
|  No doubt a similar approach would work for isLetter, isDigit,
|  toUpper, toLower, etc.
|  
(Continue reading)

John MacFarlane | 29 Oct 18:52 2012
Picon

Re: Improving Data.Char.isSpace performance

I have experimented with a couple of variants that seem
better than the definition I originally proposed.

The most promising is

isSpace_Alt6 :: Char -> Bool
{-# INLINE isSpace_Alt6 #-}
isSpace_Alt6 ' '               = True
isSpace_Alt6 '\n'              = True
isSpace_Alt6 '\t'              = True
isSpace_Alt6 '\r'              = True
isSpace_Alt6 '\x0B'            = True
isSpace_Alt6 '\x0C'            = True
isSpace_Alt6 '\xA0'            = True
isSpace_Alt6 c | c <  '\x1680' = False
               | otherwise     = iswspace (fromIntegral (C.ord c)) /= 0

Benchmarks can be found here:

the program          : http://johnmacfarlane.net/isSpace/BenchIsSpace.hs
results:
  with ghc --make    : http://johnmacfarlane.net/isSpace/unoptimized.html
  with ghc --make -O2: http://johnmacfarlane.net/isSpace/optimized.html

John

+++ John MacFarlane [Oct 28 12 12:16 ]:
> I think that 'isSpace' from Data.Char (and hence also 'words' from the
> Prelude) is not as fast as it could be.  Here is the definition
> (which is actually found in GHC.Unicode):
(Continue reading)

Brandon Allbery | 29 Oct 19:01 2012
Picon

Re: Improving Data.Char.isSpace performance

On Mon, Oct 29, 2012 at 1:52 PM, John MacFarlane <jgm <at> berkeley.edu> wrote:
I have experimented with a couple of variants that seem
better than the definition I originally proposed.

How about U+0085?

--
brandon s allbery kf8nh                               sine nomine associates
allbery.b <at> gmail.com                                  ballbery <at> sinenomine.net
unix/linux, openafs, kerberos, infrastructure          http://sinenomine.net

_______________________________________________
Libraries mailing list
Libraries <at> haskell.org
http://www.haskell.org/mailman/listinfo/libraries
Vincent Hanquez | 29 Oct 19:15 2012

Re: Improving Data.Char.isSpace performance

On 10/29/2012 06:01 PM, Brandon Allbery wrote:
> On Mon, Oct 29, 2012 at 1:52 PM, John MacFarlane <jgm <at> berkeley.edu 
> <mailto:jgm <at> berkeley.edu>> wrote:
>
>     I have experimented with a couple of variants that seem
>     better than the definition I originally proposed.
>
>
> How about U+0085?

U+0085 is not defined as a space char according to 
http://www.fileformat.info/info/unicode/char/85/index.htm.

--

-- 
Vincent
Ivan Lazar Miljenovic | 29 Oct 23:30 2012
Picon

Re: Improving Data.Char.isSpace performance

On 30 October 2012 04:52, John MacFarlane <jgm <at> berkeley.edu> wrote:
> I have experimented with a couple of variants that seem
> better than the definition I originally proposed.
>
> The most promising is
>
> isSpace_Alt6 :: Char -> Bool
> {-# INLINE isSpace_Alt6 #-}
> isSpace_Alt6 ' '               = True
> isSpace_Alt6 '\n'              = True
> isSpace_Alt6 '\t'              = True
> isSpace_Alt6 '\r'              = True
> isSpace_Alt6 '\x0B'            = True
> isSpace_Alt6 '\x0C'            = True
> isSpace_Alt6 '\xA0'            = True
> isSpace_Alt6 c | c <  '\x1680' = False
>                | otherwise     = iswspace (fromIntegral (C.ord c)) /= 0

Is there any particular reason you're using a guard rather than a
pattern match for the \x1680 case?

>
> Benchmarks can be found here:
>
> the program          : http://johnmacfarlane.net/isSpace/BenchIsSpace.hs
> results:
>   with ghc --make    : http://johnmacfarlane.net/isSpace/unoptimized.html
>   with ghc --make -O2: http://johnmacfarlane.net/isSpace/optimized.html
>
> John
>
> +++ John MacFarlane [Oct 28 12 12:16 ]:
>> I think that 'isSpace' from Data.Char (and hence also 'words' from the
>> Prelude) is not as fast as it could be.  Here is the definition
>> (which is actually found in GHC.Unicode):
>>
>> isSpace :: Char -> Bool
>> isSpace c =
>>     c == ' '     ||
>>     c == '\t'    ||
>>     c == '\n'    ||
>>     c == '\r'    ||
>>     c == '\f'    ||
>>     c == '\v'    ||
>>     c == '\xa0'  ||
>>     iswspace (fromIntegral (C.ord c)) /= 0
>>
>> I presume that the point of the disjuncts at the beginning is to
>> avoid the call to iswspace for the most common space characters.
>> The problem is that most characters (in most texts) are not space
>> characters, and for nonspace characters iswspace will always be
>> called.
>>
>> So I investigated a possible optimization that would also check for
>> the most common nonspace characters before calling iswspace:
>>
>> isSpace_Alt :: Char -> Bool
>> isSpace_Alt c | c > '\x20' && c < '\xa0' = False
>>               | c == ' '                 = True
>>               | '\t' <= c && c <= '\r'   = True
>>               | c == '\xa0'              = True
>>               | otherwise                = iswspace (fromIntegral (C.ord c)) /= 0
>>
>> In my benchmarks, this function significantly outperforms isSpace.
>> I also found that a version of isSpace that does not check
>> for nonspace characters, but uses case matching instead of
>> a disjunction of equality tests, outperformed isSpace (but
>> was usually not as fast as isSpace_Alt, and the difference
>> mostly disappears with -O2):
>>
>> isSpace_Pattern :: Char -> Bool
>> isSpace_Pattern c
>>    | c == ' '                  = True
>>    | '\t' <= c && c <= '\r'    = True
>>    | c == '\xa0'               = True
>>    | otherwise                 = iswspace (fromIntegral (C.ord c)) /= 0
>>
>> I benchmarked all three functions against five types of text
>> (all ascii, all Greek, Haskell code, characters 0..255, and
>> all spaces), and here are the (normalized) results:
>>
>> Compiled with 'ghc --make':
>> --------------------------------------------------------------
>> Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
>> --------------- ----------------  ---------------  -----------
>> ascii text      1.0               0.54             0.17
>> greek text      1.0               0.57             0.71
>> haskell code    1.0               0.57             0.24
>> chars 0..255    1.0               0.54             0.39
>> all space chars 1.0               0.70             0.90
>> --------------------------------------------------------------
>>
>> Compiled with 'ghc --make -O2':
>> --------------------------------------------------------------
>> Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
>> --------------- ----------------  ---------------  -----------
>> ascii text      1.0               0.93             0.40
>> greek text      1.0               0.98             0.99
>> haskell code    1.0               1.03             0.58
>> chars 0..255    1.0               0.92             0.62
>> all space chars 1.0               0.88             0.92
>> --------------------------------------------------------------
>>
>> My benchmark program can be found here:
>> https://gist.github.com/3967761
>>
>> I'd like to propose that we consider replacing the definition
>> of isSpace with isSpace_Alt.
>>
>> John
>>
>>
>> _______________________________________________
>> Libraries mailing list
>> Libraries <at> haskell.org
>> http://www.haskell.org/mailman/listinfo/libraries
>
> _______________________________________________
> Libraries mailing list
> Libraries <at> haskell.org
> http://www.haskell.org/mailman/listinfo/libraries

--

-- 
Ivan Lazar Miljenovic
Ivan.Miljenovic <at> gmail.com
http://IvanMiljenovic.wordpress.com
wren ng thornton | 1 Nov 03:39 2012

Re: Improving Data.Char.isSpace performance

On 10/29/12 6:30 PM, Ivan Lazar Miljenovic wrote:
> On 30 October 2012 04:52, John MacFarlane <jgm <at> berkeley.edu> wrote:
>> I have experimented with a couple of variants that seem
>> better than the definition I originally proposed.
>>
>> The most promising is
>>
>> isSpace_Alt6 :: Char -> Bool
>> {-# INLINE isSpace_Alt6 #-}
>> isSpace_Alt6 ' '               = True
>> isSpace_Alt6 '\n'              = True
>> isSpace_Alt6 '\t'              = True
>> isSpace_Alt6 '\r'              = True
>> isSpace_Alt6 '\x0B'            = True
>> isSpace_Alt6 '\x0C'            = True
>> isSpace_Alt6 '\xA0'            = True
>> isSpace_Alt6 c | c <  '\x1680' = False
>>                 | otherwise     = iswspace (fromIntegral (C.ord c)) /= 0
>
> Is there any particular reason you're using a guard rather than a
> pattern match for the \x1680 case?

The \x1680 case comes from the fact that (at present) no characters 
below there are spaces other than the ones listed. Note that it's using 
(<) not (==). The only other thing we could do here is to replace the 
otherwise branch with:

     isSpace_Alt6 c = iswspace (fromIntegral (C.ord c)) /= 0

but I doubt that'd help either performance or clarity.

The one thing I worry about using \x1680 as the threshold[1] is that I'm 
not sure whether every character below \x1680 has been allocated or 
whether some are still free. If any of them are free, then this will 
become incorrect in subsequent versions of Unicode so it's a maintenance 
timebomb. (Whereas if they're all specified then it should be fine.) Can 
someone verify that using \x1680 is sound in this manner?

[1] I discovered \x1680 by simply checking map isSpace ['\x0'..]. 
However, in my original proposal of this particular optimization I used 
\xFF as the boundary, since this is guaranteed by the fact that Unicode 
is interoperable with ISO-8859-1 (Latin-1)

--

-- 
Live well,
~wren
John MacFarlane | 1 Nov 04:10 2012
Picon

Re: Improving Data.Char.isSpace performance

+++ wren ng thornton [Oct 31 12 22:39 ]:
> The one thing I worry about using \x1680 as the threshold[1] is that
> I'm not sure whether every character below \x1680 has been allocated
> or whether some are still free. If any of them are free, then this
> will become incorrect in subsequent versions of Unicode so it's a
> maintenance timebomb. (Whereas if they're all specified then it
> should be fine.) Can someone verify that using \x1680 is sound in
> this manner?

Really good point.  This page [1], if I read it correctly,
seems to indicate that c < \x860 would be safe.
But I'm not a unicode expert and maybe there's a better
place to look.

[1] http://www.unicode.org/alloc/CurrentAllocation.html

One thing I discovered is that if I use c < \xFF, the Greek
benchmark goes out the window -- we are twice as slow as
the original GHC.Unicode.isChar.  Whatever threshold we
choose, it seems, the performance gains below the threshold
will be balanced by performance losses above the threshold.
This makes me disinclined to submit the patch.  It just
seems wrong to change the library in a way that
helps people who use western alphabets at the expense of
people who don't.
Joachim Breitner | 1 Nov 14:24 2012
Picon

Re: Improving Data.Char.isSpace performance

Hi,

have you considered investigating whether the ffi call can be sped up?
To me it seems to be a bit overheady to first check in Haskell-Land for
common cases and then call code that would (or does?) this check as
well.

Maybe the overhead for calling the C code can be reduced so that it is
no longer worth doing special casing in the Haskell code. Maybe using
primitive operations instead of the C calling convention help, as it
would effectively inline the call to u_iswspace?

Greetings,
Joachim

--

-- 
Joachim "nomeata" Breitner
  mail <at> joachim-breitner.de  |  nomeata <at> debian.org  |  GPG: 0x4743206C
  xmpp: nomeata <at> joachim-breitner.de | http://www.joachim-breitner.de/

_______________________________________________
Libraries mailing list
Libraries <at> haskell.org
http://www.haskell.org/mailman/listinfo/libraries
Joachim Breitner | 1 Nov 16:10 2012
Picon

Re: Improving Data.Char.isSpace performance

Hi,

Am Donnerstag, den 01.11.2012, 14:24 +0100 schrieb Joachim Breitner:
> have you considered investigating whether the ffi call can be sped up?
> To me it seems to be a bit overheady to first check in Haskell-Land for
> common cases and then call code that would (or does?) this check as
> well.

looking at the interface of GHC.Unicode I see that the special cases are
included in the unfolding of isSpace. This is, of course, something that
that even a faster ffi variant cannot achieve.

Greetings,
Joachim

--

-- 
Joachim "nomeata" Breitner
  mail <at> joachim-breitner.de  |  nomeata <at> debian.org  |  GPG: 0x4743206C
  xmpp: nomeata <at> joachim-breitner.de | http://www.joachim-breitner.de/

_______________________________________________
Libraries mailing list
Libraries <at> haskell.org
http://www.haskell.org/mailman/listinfo/libraries
Patrick Palka | 1 Nov 04:49 2012

Re: Improving Data.Char.isSpace performance

On Wed, Oct 31, 2012 at 10:39 PM, wren ng thornton <wren <at> freegeek.org> wrote:

The one thing I worry about using \x1680 as the threshold[1] is that I'm not sure whether every character below \x1680 has been allocated or whether some are still free. If any of them are free, then this will become incorrect in subsequent versions of Unicode so it's a maintenance timebomb. (Whereas if they're all specified then it should be fine.) Can someone verify that using \x1680 is sound in this manner?

According to GHCi:

Prelude Data.Char> length $ filter ((== NotAssigned) . generalCategory) ['\0'..'\x1680']
830
_______________________________________________
Libraries mailing list
Libraries <at> haskell.org
http://www.haskell.org/mailman/listinfo/libraries
wren ng thornton | 8 Nov 19:18 2012

Re: Improving Data.Char.isSpace performance

On 10/31/12 11:49 PM, Patrick Palka wrote:
> On Wed, Oct 31, 2012 at 10:39 PM, wren ng thornton <wren <at> freegeek.org>wrote:
>
>> The one thing I worry about using \x1680 as the threshold[1] is that I'm
>> not sure whether every character below \x1680 has been allocated or whether
>> some are still free. If any of them are free, then this will become
>> incorrect in subsequent versions of Unicode so it's a maintenance timebomb.
>> (Whereas if they're all specified then it should be fine.) Can someone
>> verify that using \x1680 is sound in this manner?
>>
>
> According to GHCi:
>
> Prelude Data.Char> length $ filter ((== NotAssigned) . generalCategory)
>> ['\0'..'\x1680']
>> 830

Guess I never looked closely at what Unicode queries Data.Char offers... 
Looks like the first unassigned character is '\888'

--

-- 
Live well,
~wren
Simon Peyton-Jones | 29 Oct 23:29 2012
Picon

RE: Improving Data.Char.isSpace performance

Sounds good to me.  Thanks for doing this.

When you think you are ready, just submit a patch.  (As others have noted, maybe isSpace isn't the only
function that could benefit from this kind of attention.)

Simon

|  -----Original Message-----
|  From: libraries-bounces <at> haskell.org [mailto:libraries-bounces <at> haskell.org] On
|  Behalf Of John MacFarlane
|  Sent: 28 October 2012 19:16
|  To: libraries <at> haskell.org
|  Subject: Improving Data.Char.isSpace performance
|  
|  I think that 'isSpace' from Data.Char (and hence also 'words' from the
|  Prelude) is not as fast as it could be.  Here is the definition
|  (which is actually found in GHC.Unicode):
|  
|  isSpace :: Char -> Bool
|  isSpace c =
|      c == ' '     ||
|      c == '\t'    ||
|      c == '\n'    ||
|      c == '\r'    ||
|      c == '\f'    ||
|      c == '\v'    ||
|      c == '\xa0'  ||
|      iswspace (fromIntegral (C.ord c)) /= 0
|  
|  I presume that the point of the disjuncts at the beginning is to
|  avoid the call to iswspace for the most common space characters.
|  The problem is that most characters (in most texts) are not space
|  characters, and for nonspace characters iswspace will always be
|  called.
|  
|  So I investigated a possible optimization that would also check for
|  the most common nonspace characters before calling iswspace:
|  
|  isSpace_Alt :: Char -> Bool
|  isSpace_Alt c | c > '\x20' && c < '\xa0' = False
|                | c == ' '                 = True
|                | '\t' <= c && c <= '\r'   = True
|                | c == '\xa0'              = True
|                | otherwise                = iswspace (fromIntegral (C.ord c)) /= 0
|  
|  In my benchmarks, this function significantly outperforms isSpace.
|  I also found that a version of isSpace that does not check
|  for nonspace characters, but uses case matching instead of
|  a disjunction of equality tests, outperformed isSpace (but
|  was usually not as fast as isSpace_Alt, and the difference
|  mostly disappears with -O2):
|  
|  isSpace_Pattern :: Char -> Bool
|  isSpace_Pattern c
|     | c == ' '                  = True
|     | '\t' <= c && c <= '\r'    = True
|     | c == '\xa0'               = True
|     | otherwise                 = iswspace (fromIntegral (C.ord c)) /= 0
|  
|  I benchmarked all three functions against five types of text
|  (all ascii, all Greek, Haskell code, characters 0..255, and
|  all spaces), and here are the (normalized) results:
|  
|  Compiled with 'ghc --make':
|  --------------------------------------------------------------
|  Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
|  --------------- ----------------  ---------------  -----------
|  ascii text      1.0               0.54             0.17
|  greek text      1.0               0.57             0.71
|  haskell code    1.0               0.57             0.24
|  chars 0..255    1.0               0.54             0.39
|  all space chars 1.0               0.70             0.90
|  --------------------------------------------------------------
|  
|  Compiled with 'ghc --make -O2':
|  --------------------------------------------------------------
|  Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
|  --------------- ----------------  ---------------  -----------
|  ascii text      1.0               0.93             0.40
|  greek text      1.0               0.98             0.99
|  haskell code    1.0               1.03             0.58
|  chars 0..255    1.0               0.92             0.62
|  all space chars 1.0               0.88             0.92
|  --------------------------------------------------------------
|  
|  My benchmark program can be found here:
|  https://gist.github.com/3967761
|  
|  I'd like to propose that we consider replacing the definition
|  of isSpace with isSpace_Alt.
|  
|  John
|  
|  
|  _______________________________________________
|  Libraries mailing list
|  Libraries <at> haskell.org
|  http://www.haskell.org/mailman/listinfo/libraries
John MacFarlane | 30 Oct 00:15 2012
Picon

Re: Improving Data.Char.isSpace performance

+++ Simon Peyton-Jones [Oct 29 12 22:29 ]:
> Sounds good to me.  Thanks for doing this.
> 
> When you think you are ready, just submit a patch.  (As others have noted, maybe isSpace isn't the only
function that could benefit from this kind of attention.)

Yes, if the general idea is agreeable, I'll do some of the other
functions in GHC.Unicode as well, and provide benchmarks for them
as well.
John MacFarlane | 31 Oct 03:33 2012
Picon

Re: Improving Data.Char.isSpace performance

I've done some further investigating.  The large differences
I was seeing on OSX went away when I moved to a linux machine,
and also when I compiled the benchmark with -O3.  (When I do
either of those things, the puzzling difference between
isSpace_DataChar and Data.Char.isSpace also goes away.) I
don't know enough about core to figure out what is going on
here, but I'll assume that the benchmarks I'm getting on the linux
box are the sober ones.   They look like this (the number is
the ratio new code time / old code time for isSpace):

Benchmark compiled without optimization:
ascii text                0.71
ascii text (short lines)  0.74
ascii text (long lines)   0.72
Greek text                1.08
Haskell code              0.77
chars 0..255              0.70
all spaces                1.02

Benchmark compiled with -O2:
ascii text                0.69
ascii text (short lines)  0.72
ascii text (long lines)   0.69
Greek text                1.11
Haskell code              0.77
chars 0..255              0.69
all spaces                0.94

This suggests that we can get a modest improvement for the
most common cases if we adopt the new definition of isSpace.
However, performance might actually decrease slightly for
non-latin text.

The changes I tried for other functions in GHC.Unicode did
not result in significant improvements.

So, the question is whether it's worth submitting the patch for
isSpace, given that the gains are more modest than I'd reported
before.  (Note that 'words' will also be affected by this, as
it uses isSpace.)  I have attached the proposed patch to this
email.

John

+++ John MacFarlane [Oct 29 12 16:15 ]:
> +++ Simon Peyton-Jones [Oct 29 12 22:29 ]:
> > Sounds good to me.  Thanks for doing this.
> > 
> > When you think you are ready, just submit a patch.  (As others have noted, maybe isSpace isn't the only
function that could benefit from this kind of attention.)
> 
> Yes, if the general idea is agreeable, I'll do some of the other
> functions in GHC.Unicode as well, and provide benchmarks for them
> as well.
> 
> 
> _______________________________________________
> Libraries mailing list
> Libraries <at> haskell.org
> http://www.haskell.org/mailman/listinfo/libraries
Attachment (isSpace.patch): text/x-diff, 1446 bytes
_______________________________________________
Libraries mailing list
Libraries <at> haskell.org
http://www.haskell.org/mailman/listinfo/libraries
Johan Tibell | 31 Oct 03:42 2012
Picon

Re: Improving Data.Char.isSpace performance

On Tue, Oct 30, 2012 at 7:33 PM, John MacFarlane <jgm <at> berkeley.edu> wrote:
> So, the question is whether it's worth submitting the patch for
> isSpace, given that the gains are more modest than I'd reported
> before.  (Note that 'words' will also be affected by this, as
> it uses isSpace.)  I have attached the proposed patch to this
> email.

I say go for it. You probably want to attach the patch to a ticket on
Trac so it doesn't get lost.

-- Johan

Gmane