Phil Dawes | 1 Sep 15:07

asm intrinsics for speed

Hi Slava,

I'm building (what amounts to) an olap database engine in factor, and 
part of that involves linear searching mmapped files. Basically the 
faster I can do linear searching operations, the less I need to use 
sorted indexes and the more data I can fit into memory (esp with 
compression). I've currently got a bunch of C code doing searches, but 
ideally I'd like to not rely on a separate C dll in the released code or 
require people to have the gcc tool chain to compile it.

I decided to experiment with using asm intrinsics to implement the tight 
loops, and took 'count' as a simple testcase. I'm searching a 48M file 
of 32bit integers.

By comparison some C code (minus the mmaping stuff):

int count_items(int *from,int *to,int item) {
   int count = 0;
   while(from != to){
     if(*from == item) count++;
     from ++;
   }
   return count;
}

Which when compiled with -O3 takes about 80ms to search the file on a 
processor locked at 800MHZ with the filebuffers running warm. The best I 
could do with factor in this setup was a few seconds (which I think is 
still pretty good).

(Continue reading)

Phil Dawes | 1 Sep 16:24

Re: asm intrinsics for speed

Phil Dawes wrote:
> Hi Slava,
> 
 > [ ... ]
> Which when compiled with -O3 takes about 80ms to search the file on a 
> processor locked at 800MHZ with the filebuffers running warm. The best I 
> could do with factor in this setup was a few seconds (which I think is 
> still pretty good).
 >

Incidentally, here is the best performing 'count-items' I could do in 
factor, which was basically to modify 'each' to jump 4 bytes at a time. 
This was ~70% faster than using the builtin factor 'count' and using '4 
fixnum*fast' each iteration to generate the addresses. It clocks in at 
~1400ms on the same setup as the asm test (see orig email). Any tricks 
I'm missing?

-Phil

: next+4 ( from to quot -- from' to quot )
   [ 4 fixnum+fast ] 2dip ; inline

: (each-elem-int) ( from to quot: ( i -- ) -- )
   [ iterate-step next+4 (each-elem-int) ]
   [ 3drop ] if-iterate? ; inline recursive

: each-step-4 ( to quot -- )
   0 -rot (each-elem-int) ; inline

: each-uint ( alien size quot -- )
(Continue reading)

Slava Pestov | 1 Sep 19:52

Re: asm intrinsics for speed

Hi Phil,

I'll reply in detail later, but one thing is that as long as you can
convince Factor that there's a fixnum on the stack, you can change
fixnum+fast and to + with no loss of performance.

Certainly I don't want people to be coding inner loops in ASM and
eventually the compiler should be able to generate equivalent code to
the C inner loop for something like 'count' applied to a specialized
int array.

Slava

On Mon, Sep 1, 2008 at 9:24 AM, Phil Dawes <phil@...> wrote:
> Incidentally, here is the best performing 'count-items' I could do in
> factor, which was basically to modify 'each' to jump 4 bytes at a time.
> This was ~70% faster than using the builtin factor 'count' and using '4
> fixnum*fast' each iteration to generate the addresses. It clocks in at
> ~1400ms on the same setup as the asm test (see orig email). Any tricks
> I'm missing?
>
> -Phil
>
>
> : next+4 ( from to quot -- from' to quot )
>   [ 4 fixnum+fast ] 2dip ; inline
>
> : (each-elem-int) ( from to quot: ( i -- ) -- )
>   [ iterate-step next+4 (each-elem-int) ]
>   [ 3drop ] if-iterate? ; inline recursive
(Continue reading)

Slava Pestov | 3 Sep 14:23

Re: asm intrinsics for speed

Hi Phil,

I found the time to look at your code and thought about what's going on here.

Here are the main sources of overhead, as far as I can tell:

- The accumulating integer for the count cannot be proven to be
fixnum. This would be the case even if the loop count was known to be
a fixnum (which it isn't, in your case). Consider something like this:

{ fixnum } declare 0 swap [ 1+ ] times

Even though the counter increments in lock-step with the internal
counter of 'times', which is known to be bounded by a fixnum, the
optimizer doesn't notice this and produces the following code:

[
    0 swap 0 swap \ ( gensym ) [
        over over fixnum< [
            >r
            >r 1 +-integer-fixnum r>
            r>
            >r 1 fixnum+fast r>
            ( gensym )
        ] [ 2drop ] if
    ] label
]

1 +-integer-fixnum involves a subroutine call and two conditional
branches and is not as fast as 1 fixnum+fast which is essentially a
(Continue reading)

Phil Dawes | 3 Sep 16:31

Re: asm intrinsics for speed

Hi Slava,

Thanks very much for your mail and for taking a look at this - I'm 
looking forward to investigating the new optimizer tonight.

(N.B. I wrote the rest of this mail last night so it doesn't take into 
account your last email.)
----

As far as I can see GC doesn't get called within a basic block, so as an 
experiment I re-implemented count-items using some custom intrinsics 
words which effectively unbox the variables involved:

untag-fixnum
unbox-alien
fetch32-unboxed
+4-unboxed

I also used words with existing intrinsics: eq?

I was able to get down to 338ms for count-items after these mods. For 
reference the optimized dataflow is:

( scratchpad ) \ count-items f optimized-word.
[
     untag-fixnum 0 swap
     ( #shuffle: @4928410 0 @4928406 -- 0 @4928410 @4928406 )
     >r mmap>from,to
     >r unbox-alien r>
     unbox-alien r>
(Continue reading)

Slava Pestov | 3 Sep 23:15

Re: asm intrinsics for speed

On Wed, Sep 3, 2008 at 9:31 AM, Phil Dawes <phil@...> wrote:
> Hi Slava,
>
> Thanks very much for your mail and for taking a look at this - I'm
> looking forward to investigating the new optimizer tonight.
>
> (N.B. I wrote the rest of this mail last night so it doesn't take into
> account your last email.)
> ----
>
> As far as I can see GC doesn't get called within a basic block, so as an
> experiment I re-implemented count-items using some custom intrinsics
> words which effectively unbox the variables involved:
>
> untag-fixnum
> unbox-alien
> fetch32-unboxed
> +4-unboxed
>
> I also used words with existing intrinsics: eq?

Now *that's* evil :-) I'm glad that the compiler internals were easy
enough to understand for someone to be able to pull this off.

> I was able to get down to 338ms for count-items after these mods.

The remaining difference between Factor and C can be attributed to
poor register allocation in the backend.

So you can see the compiler has some fundamental limitations right
(Continue reading)


Gmane