Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)
Thanks for working on this.
I think the assembly code can be made more tight. Also, the tests should
be arith-specific.
- gri
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s
File src/pkg/math/big/arith_amd64.s (right):
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode30
src/pkg/math/big/arith_amd64.s:30: TEXT ·addVV(SB),7,$0
This can be much improved. I'd like to see almost 0% slow-down for the
short vectors. For one, here's a routine that does what we have now with
less code (and no need to shuffle around carry bits).
The unrolled version should be along the same lines.
// func addVV(z, x, y []Word) (c Word)
TEXT ·addVV(SB),7,$0
MOVQ z+0(FP), R10
MOVQ x+16(FP), R8
MOVQ y+32(FP), R9
MOVL n+8(FP), CX
ANDQ $0x00000000ffffffff, CX // "sign-extension" (TODO determine
correct MOV w/ sign extension instruction)
MOVQ $0, DX
TESTQ CX, CX
JZ E1
MOVQ $0, BX // i = 0
CLC
L1: MOVQ (R8)(BX*8), AX
ADCQ (R9)(BX*8), AX
MOVQ AX, (R10)(BX*8)
INCQ BX // i++
LOOP L1 // n--
ADCQ $0, DX
E1: MOVQ DX, c+48(FP)
RET
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go
File src/pkg/math/big/nat_test.go (right):
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode214
src/pkg/math/big/nat_test.go:214: func benchmarkAdd(b *testing.B, sizex,
sizey int) {
You always provide the same size below for x and y - so just pass one
argument. If it needs to change later, it's trivially changed, but for
now it's not necessary.
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode217
src/pkg/math/big/nat_test.go:217: b.ResetTimer()
you should do this immediately before the loop, otherwise you are also
measuring the SetBits operation.
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode222
src/pkg/math/big/nat_test.go:222: _ = z.Add(&x, &y)
These benchmarks are explicitly testing the performance of a handfull a
few assembly routines. Should call them directly. Otherwise, when other
(Go) improvements to Add and Mul are made, the benchmark results are not
comparable anymore.
Specifically, the overhead for small numbers (1-5 words) is so large
that the assembly code barely matters - they almost all use the same
amount of time (around 50ns on your machine, around 43ns on mine). Thus,
you are not measuring your code. (It is correct that at the end we care
about the top-level operations, but these are the low-level primitives
that might be used in a variety of situations. We need to measure them
alone.)
These tests (in modified form) should be in arith_test.go.
Also, to try to compensate for caching effects, it might be useful to
run one operation outside the measured loop.
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode226
src/pkg/math/big/nat_test.go:226: func benchmarkMul(b *testing.B, sizex,
sizey int) {
same comments apply here
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode241
src/pkg/math/big/nat_test.go:241: func BenchmarkAdd_100kb(b *testing.B)
{ benchmarkAdd(b, 100e3, 100e3) }
100<10
(1kb == 1<<10)
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode242
src/pkg/math/big/nat_test.go:242: func BenchmarkAdd_1Mb(b *testing.B)
{ benchmarkAdd(b, 1e6, 1e6) }
1<<20
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode243
src/pkg/math/big/nat_test.go:243: func BenchmarkAdd_5Mb(b *testing.B)
{ benchmarkAdd(b, 5e6, 5e6) }
I might be wrong, but going past 100kb sized numbers is not really
important for all practical purposes. But more importantly, the
benchmark results are likely dominated by memory latency (the
improvements drop significantly).
It's more important that some of the "smaller" numbers perform
reasonably well. In particular, ideally there should be almost no
slowdown (less than 5%) for any size due to this change. Please test the
following sizes:
1w
2w
5w
10w
50w
1Kb
10Kb
100Kb
Also 1Kb == 1024.
http://codereview.appspot.com/6446165/