remyoudompheng | 21 Aug 2012 19:55
Picon

code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

Reviewers: gri, golang-dev_googlegroups.org,

Message:
Hello gri@...,
golang-dev@... (cc:
golang-dev@..., remy@...),

I'd like you to review this change to
https://go.googlecode.com/hg/

Description:
math/big: unroll loops a bit in amd64 assembly routines.

Processing 4 words at a time reduces the amount of instructions
needed to save and restore the carry flag, among other things.

Benchmarks on a Core 2 Quad Q8200 <at> 2.33GHz

benchmark             old ns/op    new ns/op    delta
BenchmarkAdd_100kb         4400         2517  -42.80%
BenchmarkAdd_1Mb          43965        24363  -44.59%
BenchmarkAdd_5Mb         335207       264941  -20.96%
BenchmarkAdd_10Mb       1148330      1158397   +0.88%
BenchmarkMul_1kb           1189          940  -20.94%
BenchmarkMul_10kb         58555        46842  -20.00%
BenchmarkMul_50kb        825427       642377  -22.18%
BenchmarkMul_100kb      2555077      1983290  -22.38%
BenchmarkMul_1Mb      105479750     82940000  -21.37%
BenchmarkMul_5Mb     1284871000   1004868000  -21.79%
BenchmarkMul_10Mb    3875074000   3038164000  -21.60%
(Continue reading)

Robert Griesemer | 21 Aug 2012 20:53
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

I will look at this a bit later (today or tomorrow). In the meantime could you add benchmarks showing that this doesn't slow down small numbers? Thanks.

- gri


On Tue, Aug 21, 2012 at 10:55 AM, <remyoudompheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
Reviewers: gri, golang-dev_googlegroups.org,

Message:
Hello gri-iFWiy5xATs8dnm+yROfE0A@public.gmane.org, golang-dev-/JYPxA39Uh4KIfpRIydVpw@public.gmane.org (cc:
golang-dev <at> googlegroups.com, remy-fd97jBR+K/6hPH1hqNUYSQ@public.gmane.org),

I'd like you to review this change to
https://go.googlecode.com/hg/


Description:
math/big: unroll loops a bit in amd64 assembly routines.

Processing 4 words at a time reduces the amount of instructions
needed to save and restore the carry flag, among other things.

Benchmarks on a Core 2 Quad Q8200 <at> 2.33GHz

benchmark             old ns/op    new ns/op    delta
BenchmarkAdd_100kb         4400         2517  -42.80%
BenchmarkAdd_1Mb          43965        24363  -44.59%
BenchmarkAdd_5Mb         335207       264941  -20.96%
BenchmarkAdd_10Mb       1148330      1158397   +0.88%
BenchmarkMul_1kb           1189          940  -20.94%
BenchmarkMul_10kb         58555        46842  -20.00%
BenchmarkMul_50kb        825427       642377  -22.18%
BenchmarkMul_100kb      2555077      1983290  -22.38%
BenchmarkMul_1Mb      105479750     82940000  -21.37%
BenchmarkMul_5Mb     1284871000   1004868000  -21.79%
BenchmarkMul_10Mb    3875074000   3038164000  -21.60%

Please review this at http://codereview.appspot.com/6446165/

Affected files:
  M src/pkg/math/big/arith_amd64.s
  M src/pkg/math/big/nat_test.go



remyoudompheng | 21 Aug 2012 21:03
Picon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

Hello gri@...,
golang-dev@... (cc:
golang-dev@..., remy@...),

Please take another look.

http://codereview.appspot.com/6446165/

cswenson | 21 Aug 2012 23:42
Picon
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

On 2012/08/21 19:03:31, remyoudompheng wrote:
> Hello mailto:gri@...,
mailto:golang-dev@... (cc:
> mailto:golang-dev@..., mailto:remy@...),

> Please take another look.

I saw this patch and took a quick look. I spotted a couple of potential
improvements.

Though, when we are talking about 1Mb+ numbers for multiplication, we
should really look into Toom-Cook multiplication, or even an FFT-based
multiply.

http://codereview.appspot.com/6446165/

cswenson | 21 Aug 2012 23:42
Picon
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)


http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s
File src/pkg/math/big/arith_amd64.s (right):

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode38
src/pkg/math/big/arith_amd64.s:38: MOVQ $0, BX		// i = 0
This instruction is 7 bytes long, whereas XOR BX, BX is only 3. Ditto
for the following instruction.

Although this might cause the pipeline to stall, I believe that the
instruction size difference should make up for it.

If the pipeline stall is too much, move MOVQ $0, BX to before the
shifts, and change the latter to MOVQ BX, DX.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode46
src/pkg/math/big/arith_amd64.s:46: RCRQ $1, DX
Move this rotate to before the MOVQs to ease the pipeline a little? On
most architectures the processor should reorder this already, but some
(like, say, Atom) it could cause a delay. The carry flag should be
preserved through the MOVQs.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode51
src/pkg/math/big/arith_amd64.s:51: RCLQ $1, DX
Ditto above.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode56
src/pkg/math/big/arith_amd64.s:56: ADDL $4, BX		// i+=4
I feel like we should move $4 into a register (perhaps AX) to save the
byte here.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode85
src/pkg/math/big/arith_amd64.s:85: MOVQ $0, BX		// i = 0
Ditto previous comments about instruction size.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode93
src/pkg/math/big/arith_amd64.s:93: RCRQ $1, DX
See previous comments about moving the RCRQ/RCLQ.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode103
src/pkg/math/big/arith_amd64.s:103: ADDL $4, BX		// i+=4
Ditto above about moving $4 into AX.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode114
src/pkg/math/big/arith_amd64.s:114: ADDL $1, BX		// i++
INCL BX, like above.

http://codereview.appspot.com/6446165/

nigeltao | 22 Aug 2012 03:32
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)


http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s
File src/pkg/math/big/arith_amd64.s (right):

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode38
src/pkg/math/big/arith_amd64.s:38: MOVQ $0, BX		// i = 0
On 2012/08/21 21:42:57, Christopher Swenson wrote:
> This instruction is 7 bytes long, whereas XOR BX, BX is only 3.

I forget exactly where in 6g/6l it happens, but "MOVQ $0, BX" gets
rewritten to "XORQ BX, BX".

http://codereview.appspot.com/6446165/

nigeltao | 22 Aug 2012 03:55
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)


http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s
File src/pkg/math/big/arith_amd64.s (right):

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode37
src/pkg/math/big/arith_amd64.s:37: SHLQ $2, CX             // CX =
(n/4)*4
Should the SHRQ/SHLQ be "ANDQ $~3, CX"? objdump -d tells me that ANDQ is
4 bytes and two shifts are 4+4 bytes:

   400c87:       48 c1 e8 02             shr    $0x2,%rax
   400c8b:       48 c1 e0 02             shl    $0x2,%rax
   400c8f:       48 83 e0 fc             and    $0xfffffffffffffffc,%rax

http://codereview.appspot.com/6446165/

Rob Pike | 22 Aug 2012 04:16
Favicon

Re: Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

The 'linkers' (6l etc.) turn the pseudo-ops into the best instruction
for the job. The assemblers are a peculiar mix of literal instructions
and pseudo-ops like MOV, but the correspondence with machine
instructions can sometimes be unpredictable.  Best to disassemble a
binary or run 6l -a to see what happens.

-rob

remyoudompheng | 22 Aug 2012 05:44
Picon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

On 2012/08/21 21:42:11, Christopher Swenson wrote:
> Though, when we are talking about 1Mb+ numbers for multiplication, we
should
> really look into Toom-Cook multiplication, or even an FFT-based
multiply.

I am currently working on a FFT-based implementation. It may not be
finished but it is already working, and gets closer to GMP performance
after applying this CL (1.2-1.5x slower, 2x slower for very large inputs
where the algorithm should be recursive but is not yet).

The assembly routines are the most deterministic improvement I can find
at the moment.

http://codereview.appspot.com/6446165/

remyoudompheng | 22 Aug 2012 05:45
Picon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

On 2012/08/22 03:44:59, remyoudompheng wrote:
> I am currently working on a FFT-based implementation.

I sent an e-mail a few days ago about this: it is implemented
externally: https://github.com/remyoudompheng/bigfft

http://codereview.appspot.com/6446165/

gri | 23 Aug 2012 01:35
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

Thanks for working on this.

I think the assembly code can be made more tight. Also, the tests should
be arith-specific.

- gri


http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s

File src/pkg/math/big/arith_amd64.s (right):

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode30

src/pkg/math/big/arith_amd64.s:30: TEXT ·addVV(SB),7,$0
This can be much improved. I'd like to see almost 0% slow-down for the
short vectors. For one, here's a routine that does what we have now with
less code (and no need to shuffle around carry bits).

The unrolled version should be along the same lines.

// func addVV(z, x, y []Word) (c Word)
TEXT ·addVV(SB),7,$0
	MOVQ z+0(FP), R10
	MOVQ x+16(FP), R8
	MOVQ y+32(FP), R9
	MOVL n+8(FP), CX
	ANDQ $0x00000000ffffffff, CX // "sign-extension" (TODO determine
correct MOV w/ sign extension instruction)
	
	MOVQ $0, DX
	TESTQ CX, CX
	JZ E1

	MOVQ $0, BX		// i = 0
	CLC
	
L1:	MOVQ (R8)(BX*8), AX
	ADCQ (R9)(BX*8), AX
	MOVQ AX, (R10)(BX*8)
	
	INCQ BX		// i++
	LOOP L1         // n--

	ADCQ $0, DX

E1:	MOVQ DX, c+48(FP)
	RET

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go

File src/pkg/math/big/nat_test.go (right):

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode214

src/pkg/math/big/nat_test.go:214: func benchmarkAdd(b *testing.B, sizex,
sizey int) {
You always provide the same size below for x and y - so just pass one
argument. If it needs to change later, it's trivially changed, but for
now it's not necessary.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode217

src/pkg/math/big/nat_test.go:217: b.ResetTimer()
you should do this immediately before the loop, otherwise you are also
measuring the SetBits operation.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode222

src/pkg/math/big/nat_test.go:222: _ = z.Add(&x, &y)
These benchmarks are explicitly testing the performance of a handfull a
few assembly routines. Should call them directly. Otherwise, when other
(Go) improvements to Add and Mul are made, the benchmark results are not
comparable anymore.

Specifically, the overhead for small numbers (1-5 words) is so large
that the assembly code barely matters - they almost all use the same
amount of time (around 50ns on your machine, around 43ns on mine). Thus,
you are not measuring your code. (It is correct that at the end we care
about the top-level operations, but these are the low-level primitives
that might be used in a variety of situations. We need to measure them
alone.)

These tests (in modified form) should be in arith_test.go.

Also, to try to compensate for caching effects, it might be useful to
run one operation outside the measured loop.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode226

src/pkg/math/big/nat_test.go:226: func benchmarkMul(b *testing.B, sizex,
sizey int) {
same comments apply here

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode241

src/pkg/math/big/nat_test.go:241: func BenchmarkAdd_100kb(b *testing.B)
{ benchmarkAdd(b, 100e3, 100e3) }
100<10

(1kb == 1<<10)

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode242

src/pkg/math/big/nat_test.go:242: func BenchmarkAdd_1Mb(b *testing.B)
{ benchmarkAdd(b, 1e6, 1e6) }
1<<20

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#newcode243

src/pkg/math/big/nat_test.go:243: func BenchmarkAdd_5Mb(b *testing.B)
{ benchmarkAdd(b, 5e6, 5e6) }
I might be wrong, but going past 100kb sized numbers is not really
important for all practical purposes. But more importantly, the
benchmark results are likely dominated by memory latency (the
improvements drop significantly).

  It's more important that some of the "smaller" numbers perform
reasonably well. In particular, ideally there should be almost no
slowdown (less than 5%) for any size due to this change. Please test the
following sizes:

1w
2w
5w
10w
50w
1Kb
10Kb
100Kb

Also 1Kb == 1024.

http://codereview.appspot.com/6446165/

nigeltao | 23 Aug 2012 06:36
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)


http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s

File src/pkg/math/big/arith_amd64.s (right):

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode30

src/pkg/math/big/arith_amd64.s:30: TEXT ·addVV(SB),7,$0
On 2012/08/22 23:35:02, gri wrote:
> 	MOVL n+8(FP), CX
> 	ANDQ $0x00000000ffffffff, CX // "sign-extension" (TODO determine
correct MOV w/
> sign extension instruction)

"MOVL n+8(FP), CX" will fill the top 32 bits of CX with zeroes; the MOVL
is equivalent to a MOVLQZX. If you need sign extension, say "MOVLQSX
n+8(FP), CX".

http://codereview.appspot.com/6446165/

Robert Griesemer | 23 Aug 2012 07:18
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

Good to know (MOVLQZX). So the first instructions can be:


        MOVL n+8(FP), CX
        TESTQ CX, CX
        JZ E1


On Wed, Aug 22, 2012 at 9:36 PM, <nigeltao-iFWiy5xATs8dnm+yROfE0A@public.gmane.org> wrote:
On 2012/08/22 23:35:02, gri wrote:
        MOVL n+8(FP), CX
        ANDQ $0x00000000ffffffff, CX // "sign-extension" (TODO determine
correct MOV w/
sign extension instruction)

"MOVL n+8(FP), CX" will fill the top 32 bits of CX with zeroes; the MOVL
is equivalent to a MOVLQZX. If you need sign extension, say "MOVLQSX
n+8(FP), CX".

http://codereview.appspot.com/6446165/

gri | 23 Aug 2012 07:10
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

FYI.


http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s

File src/pkg/math/big/arith_amd64.s (right):

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode30

src/pkg/math/big/arith_amd64.s:30: TEXT ·addVV(SB),7,$0
Here's a version which has even less setup instructions (and the same
tight inner loop):

// func addVV(z, x, y []Word) (c Word)
TEXT ·addVV(SB),7,$0
         MOVL n+8(FP), CX	// 32bit n
         ANDQ $0xffffffff, CX	// 64bit n; CF = 0; ZF = (n == 0)
	JZ E1			// if n == 0 goto E1
	// CX != 0 && CF == 0
	
	// initialize pointers to x, y, z
         MOVQ x+16(FP), R8
         MOVQ y+32(FP), R9
         MOVQ z+0(FP), R10

         MOVQ $0, BX		// i := 0
	// CX != 0 && CF == 0

L1:     MOVQ (R8)(BX*8), AX
         ADCQ (R9)(BX*8), AX
         MOVQ AX, (R10)(BX*8)

         INCQ BX			// i++, CF unchanged
	// Note: The LOOP instruction can be replaced with
	//	DECQ CX
	//	JNZ L1
	// if they run faster then the LOOP microcode.
         LOOP L1			// n--; if n != 0 goto L1
	// CX == 0

         ADCQ $0, CX		// CX = carry

E1:     MOVQ CX, c+48(FP)	// return carry
         RET
	
	
I'd like to see this code integrated with the unrolled loop body. You
need to check for the case n >= 4 and that can be done with just 2 extra
instructions in the standard path:

After the initialization of BX (i := 0), you need:

	TESTQ $-4, CX	// if n&0xff..ffC == 0 (i.e., n < 4) goto L1, CF = 0
	JZ L1
	// CX >= 4 && CF == 0

	... unrolled loop code here ...

L1:	MOVQ (R8)(BX*8), AX
	...

http://codereview.appspot.com/6446165/

remyoudompheng | 23 Aug 2012 08:58
Picon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

For some reason here DECQ/JNZ is 2x times slower than CMPQ/JL (for both
rolled/unrolled versions), I'm not sure why. Maybe someone can find an
architecture where it runs faster?

http://codereview.appspot.com/6446165/

Robert Griesemer | 23 Aug 2012 17:32
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

It may well be that DECQ/JNZ is slower than CMPQ/JL. It used to be the case a very long time ago (first Pentiums) that some of the fancier instructions that ran longer sequences of micro-instructions  (e.g. LOOP) were significantly slower than a much longer equivalent sequence of more basic "RISC" instructions - and that one was best advised to just stick to the very basic instructions. I was hoping this might have changed, and I am a bit surprised at DECQ being a problem, but I haven't measured myself yet. Also, different architectures may have wildly different results, but we should probably stick to some of the newer machines.


Either way, this is why it's important to measure just the assembly routines in the benchmark so we have a clear(er) picture. I think the effects on execution time we are going to see with these routines are a) cycle count for small vectors (data is in cache); b) memory latency for large vectors (data is in uncached memory); and c) various variations of the three. Unrolling will mostly be beneficial for case b) because extra memory fetches are overlapping outstanding ones.

- gri


On Wed, Aug 22, 2012 at 11:58 PM, <remyoudompheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
For some reason here DECQ/JNZ is 2x times slower than CMPQ/JL (for both
rolled/unrolled versions), I'm not sure why. Maybe someone can find an
architecture where it runs faster?

http://codereview.appspot.com/6446165/

Christopher Swenson | 23 Aug 2012 17:49
Picon
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

Could case b) be helped by using prefetching? I would guess that loop prediction + prefetching might be good enough to make up the difference for most superscalar chips (at least, I've heard Intel claim such things).


I also wonder how the benchmarks differ on different Intel and AMD chips.

--Christopher


On Thu, Aug 23, 2012 at 11:32 AM, Robert Griesemer <gri-iFWiy5xATs8dnm+yROfE0A@public.gmane.org> wrote:
It may well be that DECQ/JNZ is slower than CMPQ/JL. It used to be the case a very long time ago (first Pentiums) that some of the fancier instructions that ran longer sequences of micro-instructions  (e.g. LOOP) were significantly slower than a much longer equivalent sequence of more basic "RISC" instructions - and that one was best advised to just stick to the very basic instructions. I was hoping this might have changed, and I am a bit surprised at DECQ being a problem, but I haven't measured myself yet. Also, different architectures may have wildly different results, but we should probably stick to some of the newer machines.

Either way, this is why it's important to measure just the assembly routines in the benchmark so we have a clear(er) picture. I think the effects on execution time we are going to see with these routines are a) cycle count for small vectors (data is in cache); b) memory latency for large vectors (data is in uncached memory); and c) various variations of the three. Unrolling will mostly be beneficial for case b) because extra memory fetches are overlapping outstanding ones.

- gri


On Wed, Aug 22, 2012 at 11:58 PM, <remyoudompheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
For some reason here DECQ/JNZ is 2x times slower than CMPQ/JL (for both
rolled/unrolled versions), I'm not sure why. Maybe someone can find an
architecture where it runs faster?

http://codereview.appspot.com/6446165/




--
Christopher Swenson
Robert Griesemer | 24 Aug 2012 01:00
Favicon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

FYI: I just submitted http://codereview.appspot.com/6478055/ which contains benchmarks for some of the core vector routines. They measure the routines more precisely and thus permit more accurate tuning.


I also experimented a bit with DECQ and I can confirm that simply using DECQ reg (rather then SUBQ $1, reg) makes the code 2x slower. I suspect that for DECQ the condition code might not be readily available for the next instruction (Jcc) and cause a bad pipeline bubble. Anyway, the lesson is to avoid it.

Unless you beat me to it, I may have some rather fast versions of addVV and subVV later tonight.

- gri


On Wed, Aug 22, 2012 at 11:58 PM, <remyoudompheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
For some reason here DECQ/JNZ is 2x times slower than CMPQ/JL (for both
rolled/unrolled versions), I'm not sure why. Maybe someone can find an
architecture where it runs faster?

http://codereview.appspot.com/6446165/

Ian Lance Taylor | 24 Aug 2012 01:24
Picon
Favicon
Gravatar

Re: Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

On Thu, Aug 23, 2012 at 4:00 PM, Robert Griesemer <gri@...> wrote:
>
> I also experimented a bit with DECQ and I can confirm that simply using DECQ
> reg (rather then SUBQ $1, reg) makes the code 2x slower. I suspect that for
> DECQ the condition code might not be readily available for the next
> instruction (Jcc) and cause a bad pipeline bubble. Anyway, the lesson is to
> avoid it.

The Intel optimization manual says:

    The inc and dec instructions modify only a subset of the bits in the flag
    register. This creates a dependence on all previous writes of the flag
    register. This is especially problematic when these instructions are on
    the critical path because they are used to change an address for a load on
    which many other instructions depend.

    Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and
    dec instructions should be replaced with an add or sub instruction, because
    add and sub overwrite all flags, whereas inc and dec do not, therefore
    creating false dependencies on earlier instructions that set the flags.

Ian

Robert Griesemer | 24 Aug 2012 01:26
Favicon

Re: Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

There you go! Used to be true in 1994, and it's still true :-)

- gri


On Thu, Aug 23, 2012 at 4:24 PM, Ian Lance Taylor <iant-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
On Thu, Aug 23, 2012 at 4:00 PM, Robert Griesemer <gri-iFWiy5xATs9QFI55V6+gNQ@public.gmane.orgg> wrote:
>
> I also experimented a bit with DECQ and I can confirm that simply using DECQ
> reg (rather then SUBQ $1, reg) makes the code 2x slower. I suspect that for
> DECQ the condition code might not be readily available for the next
> instruction (Jcc) and cause a bad pipeline bubble. Anyway, the lesson is to
> avoid it.

The Intel optimization manual says:

    The inc and dec instructions modify only a subset of the bits in the flag
    register. This creates a dependence on all previous writes of the flag
    register. This is especially problematic when these instructions are on
    the critical path because they are used to change an address for a load on
    which many other instructions depend.

    Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and
    dec instructions should be replaced with an add or sub instruction, because
    add and sub overwrite all flags, whereas inc and dec do not, therefore
    creating false dependencies on earlier instructions that set the flags.

Ian

remyoudompheng | 24 Aug 2012 21:26
Picon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

Superseded by http://codereview.appspot.com/6482062/

http://codereview.appspot.com/6446165/

remyoudompheng | 24 Aug 2012 21:26
Picon

Re: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (issue 6446165)

*** Abandoned ***

http://codereview.appspot.com/6446165/


Gmane