Hubert Feyrer | 11 Sep 2007 11:08

Re: optimizations [for non-debugging] amd64 kernels

On Tue, 11 Sep 2007, Blair Sadewitz wrote:
> I haven't done any formal benchmarks, but the difference seems clear
> when compiling packages.

Now - speed, time, diskspace, ...?
Got numbers?

  - Hubert

Hubert Feyrer | 11 Sep 2007 11:09

Re: optimizations [for non-debugging] amd64 kernels

On Tue, 11 Sep 2007, Hubert Feyrer wrote:
> Now - speed, time, diskspace, ...?

Doh, s/Now/How/

Blair Sadewitz | 11 Sep 2007 13:09

Re: optimizations [for non-debugging] amd64 kernels

On 9/11/07, Hubert Feyrer <hubert <at> feyrer.de> wrote:
> On Tue, 11 Sep 2007, Hubert Feyrer wrote:
> > Now - speed, time, diskspace, ...?
>
> Doh, s/Now/How/
>

Oh, speed/time.  It's a lot faster.  My GENERIC.MP kernel was built
with -march=nocona, so I know it's not that alone.  When I get a
chance, I'll time some compile jobs.

Also, at:

http://bahar.aydogan.net/~blair/amd64-string.diff

is an enhancement for x86_64 memcpy/bzero/bcopy functions in
common/libc.  This is authored by fuyuki <at> hadaly.org and is a slight
modification of the latest version (<see
http://www.hadaly.org/fuyuki>) of what was originally posted in a PR
back around Jan/Feb.
I changed the size given to the cmpq instruction right below the
remark on non-temporal hints to match the cache size of my CPU
(2MB)/4.  I'm not sure what it should be by default.  Also, I added
the #ifndef _KERNEL, as AFAIK the kernel doesn't copy such long
strings.  I've been using this for ~6 mos now with no ill effects
insofar as I can tell.

I shared this with christos <at>  about 6-8 weeks ago,
and he said that it looked good to him.  I posted it to the list also,
but there was no response.
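
For illustration only, the cache-size/4 cutoff described above can be
sketched in C.  The helper names are hypothetical and are not taken from
the diff; the direction of the comparison (>=) is likewise an assumption,
since the actual cmpq operand and branch are not quoted in this thread.

```c
#include <stddef.h>

/*
 * Hypothetical helper: size at or above which a copy would switch to
 * stores with non-temporal hints.  The "cache size / 4" rule is the one
 * mentioned in the message; the function itself is an illustration.
 */
static size_t
nt_threshold(size_t cache_bytes)
{
	return cache_bytes / 4;
}

/* Assumed policy: use non-temporal stores once len reaches the cutoff. */
static int
use_nontemporal(size_t len, size_t cache_bytes)
{
	return len >= nt_threshold(cache_bytes);
}
```

With a 2MB cache, nt_threshold() gives 524288 bytes (512KB), so only
copies of at least that size would take the non-temporal path.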

Andrew Doran | 11 Sep 2007 13:25

Re: optimizations [for non-debugging] amd64 kernels

On Tue, Sep 11, 2007 at 07:09:31AM -0400, Blair Sadewitz wrote:

> Also, at:
> 
> http://bahar.aydogan.net/~blair/amd64-string.diff
> 
> is an enhancement for x86_64 memcpy/bzero/bcopy functions in
> common/libc.  This is authored by fuyuki <at> hadaly.org and is a slight
> modification of the latest version (<see
> http://www.hadaly.org/fuyuki>) of what was originally posted in a PR
> back around Jan/Feb.
...
> I'd appreciate it if someone who actually knew x86_64 assembly would
> take a look at this and/or if others would test it so we could get it
> in the tree at some point.

The setup and teardown for stos/movs/cmps are really expensive and for small
strings (like under 512 bytes) you're better off with really simple loops
using the arithmetic instructions.

Andrew
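
A rough C sketch of the size-based dispatch suggested above, offered as
an illustration rather than as code from the patch.  The name
copy_sketch and the SMALL_COPY_MAX constant are hypothetical; the 512
figure is simply the "under 512 bytes" estimate from the message, and
memcpy() stands in for whatever bulk path (rep movs or otherwise) a real
routine would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff taken from the "under 512 bytes" remark; tune per CPU. */
#define SMALL_COPY_MAX 512

/*
 * Copy n bytes from src to dst (buffers must not overlap).
 * Small sizes use plain load/store loops; larger sizes are handed to
 * memcpy(), standing in for a string-instruction or other bulk path.
 */
static void *
copy_sketch(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (n >= SMALL_COPY_MAX)
		return memcpy(dst, src, n);

	/* 8 bytes at a time; a fixed-size memcpy is typically one load/store. */
	while (n >= sizeof(uint64_t)) {
		uint64_t w;

		memcpy(&w, s, sizeof(w));
		memcpy(d, &w, sizeof(w));
		s += sizeof(w);
		d += sizeof(w);
		n -= sizeof(w);
	}
	/* Trailing 0-7 bytes. */
	while (n-- > 0)
		*d++ = *s++;
	return dst;
}
```

Whether 512 is the right crossover would have to be measured on the
target CPU; the sketch only shows the shape of the dispatch.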

David Laight | 16 Sep 2007 00:44

Re: optimizations [for non-debugging] amd64 kernels

On Tue, Sep 11, 2007 at 12:25:49PM +0100, Andrew Doran wrote:
> On Tue, Sep 11, 2007 at 07:09:31AM -0400, Blair Sadewitz wrote:
> 
> > Also, at:
> > 
> > http://bahar.aydogan.net/~blair/amd64-string.diff
> > 
> > is an enhancement for x86_64 memcpy/bzero/bcopy functions in
> > common/libc.  This is authored by fuyuki <at> hadaly.org and is a slight
> > modification of the latest version (<see
> > http://www.hadaly.org/fuyuki>) of what was originally posted in a PR
> > back around Jan/Feb.
> ...
> > I'd appreciate it if someone who actually knew x86_64 assembly would
> > take a look at this and/or if others would test it so we could get it
> > in the tree at some point.
> 
> The setup and teardown for stos/movs/cmps are really expensive and for small
> strings (like under 512 bytes) you're better off with really simple loops
> using the arithmetic instructions.

Worse still, 'rep movsd' falls foul of the Athlon 'store-load' optimiser
when the source and destination addresses are separated by a multiple of
some (relatively small) power of 2 - as they would be for kernel COW.
The code must do 'load, load, store, store' to avoid this.

OTOH ISTR the latest Intel CPU has an optimiser for 'rep movsl' that
performs adequately aligned copies using cache line read-writes.
It might also have fast setup for them ....
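
As an illustration of the 'load, load, store, store' ordering mentioned
above, here is a minimal C sketch.  The function name copy_llss is
hypothetical.  C does not guarantee the order in which the compiler
emits the loads and stores, so this only shows the intended access
pattern; a real routine would presumably be written in assembly.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy nwords 64-bit words, reading two words before writing either.
 * Assumes the buffers do not overlap.  The compiler is free to reorder
 * these accesses, so this is a picture of the pattern, not a guarantee.
 */
static void
copy_llss(uint64_t *dst, const uint64_t *src, size_t nwords)
{
	size_t i = 0;

	for (; i + 2 <= nwords; i += 2) {
		uint64_t a = src[i];		/* load  */
		uint64_t b = src[i + 1];	/* load  */

		dst[i] = a;			/* store */
		dst[i + 1] = b;			/* store */
	}
	if (i < nwords)				/* odd trailing word */
		dst[i] = src[i];
}
```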


Blair Sadewitz | 16 Sep 2007 09:04

Re: optimizations [for non-debugging] amd64 kernels

Thanks for the feedback; it helps me better understand concerns I
should be aware of in these cases.  Needless to say, I now think that
a more conservative and less [potentially] costly approach is
warranted.

In the FreeBSD source tree (src/lib/libc/amd64/string, IIRC), they
made some changes to our string functions (mostly removing vestigial
i386 instructions).  Would these changes be a good thing to merge in?

Regards,

--Blair

