A partial register stall happens when a partial register write (here, to al) is followed by a read of the full 32-bit register (eax). In your case, the stall probably happens in the next iteration when 'add eax, 0Ah' executes, so reordering the two adds in your manual patch does not help. Try changing
    add al, [rdx]

into two instructions (assuming esi is available here):

    movzx esi, byte ptr [rdx]
    add eax, esi

David

On Tue, Aug 23, 2011 at 10:47 AM, Oleg Smolsky <oleg.smol...@riverbed.com> wrote:
> Hey Andrew,
>
> On 2011/8/22 18:37, Andrew Pinski wrote:
>>
>> On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky<oleg.smol...@riverbed.com>
>> wrote:
>>>
>>> On 2011/8/22 18:09, Oleg Smolsky wrote:
>>>>
>>>> Both compilers fully inline the templated function and the emitted code
>>>> looks very similar. I am puzzled as to why one of these loops is
>>>> significantly slower than the other. I've attached disassembled listings -
>>>> perhaps someone could have a look please? (the body of the loop starts at
>>>> 0000000000400FD for gcc41 and at 0000000000400D90 for gcc46)
>>>
>>> The difference, theoretically, should be due to the inner loop:
>>>
>>> v4.6:
>>> .text:0000000000400DA0 loc_400DA0:
>>> .text:0000000000400DA0     add     eax, 0Ah
>>> .text:0000000000400DA3     add     al, [rdx]
>>> .text:0000000000400DA5     add     rdx, 1
>>> .text:0000000000400DA9     cmp     rdx, 5034E0h
>>> .text:0000000000400DB0     jnz     short loc_400DA0
>>>
>>> v4.1:
>>> .text:0000000000400FE0 loc_400FE0:
>>> .text:0000000000400FE0     movzx   eax, ds:data8[rdx]
>>> .text:0000000000400FE7     add     rdx, 1
>>> .text:0000000000400FEB     add     eax, 0Ah
>>> .text:0000000000400FEE     cmp     rdx, 1F40h
>>> .text:0000000000400FF5     lea     ecx, [rax+rcx]
>>> .text:0000000000400FF8     jnz     short loc_400FE0
>>>
>>> However, I cannot see how the first version would be slow... The custom
>>> templated "shifter" degenerates into "add 0xa", which is the point of the
>>> test... Hmm...
>>
>> It is slower because of the subregister dependency between eax and al.
>>
> Hmm... it is a little difficult to reason about these fragments as they are
> not equivalent in functionality. The g++4.1 version discards the result
> while the other version (correctly) accumulates.
> Oh, I've just realized that I grabbed the first iteration of the inner
> loop which was factored out (perhaps due to unrolling?) Oops, my apologies.
>
> Here are complete loops, out of a further digested test:
>
> g++ 4.1 (1.35 sec, 1185M ops/s):
>
> .text:0000000000400FDB loc_400FDB:
> .text:0000000000400FDB     xor     ecx, ecx
> .text:0000000000400FDD     xor     edx, edx
> .text:0000000000400FDF     nop
> .text:0000000000400FE0
> .text:0000000000400FE0 loc_400FE0:
> .text:0000000000400FE0     movzx   eax, ds:data8[rdx]
> .text:0000000000400FE7     add     rdx, 1
> .text:0000000000400FEB     add     eax, 0Ah
> .text:0000000000400FEE     cmp     rdx, 1F40h
> .text:0000000000400FF5     lea     ecx, [rax+rcx]
> .text:0000000000400FF8     jnz     short loc_400FE0
> .text:0000000000400FFA     movsx   eax, cl
> .text:0000000000400FFD     add     esi, 1
> .text:0000000000401000     add     ebx, eax
> .text:0000000000401002     cmp     esi, edi
> .text:0000000000401004     jnz     short loc_400FDB
>
> g++ 4.6 (2.86 sec, 563M ops/s):
>
> .text:0000000000400D80 loc_400D80:
> .text:0000000000400D80     mov     edx, offset data8
> .text:0000000000400D85     xor     eax, eax
> .text:0000000000400D87     db      66h, 66h
> .text:0000000000400D87     nop
> .text:0000000000400D8A     db      66h, 66h
> .text:0000000000400D8A     nop
> .text:0000000000400D8D     db      66h, 66h
> .text:0000000000400D8D     nop
> .text:0000000000400D90
> .text:0000000000400D90 loc_400D90:
> .text:0000000000400D90     add     eax, 0Ah
> .text:0000000000400D93     add     al, [rdx]
> .text:0000000000400D95     add     rdx, 1
> .text:0000000000400D99     cmp     rdx, 503480h
> .text:0000000000400DA0     jnz     short loc_400D90
> .text:0000000000400DA2     movsx   eax, al
> .text:0000000000400DA5     add     ecx, 1
> .text:0000000000400DA8     add     ebx, eax
> .text:0000000000400DAA     cmp     ecx, esi
> .text:0000000000400DAC     jnz     short loc_400D80
>
> Your observation still holds - there are two sequential instructions that
> operate on the same register.
> So, I manually patched the 4.6 binary's inner
> loop to the following:
>
> .text:0000000000400D90     add     al, [rdx]
> .text:0000000000400D92     add     rdx, 1
> .text:0000000000400D96     add     eax, 0Ah
> .text:0000000000400D99     cmp     rdx, 503480h
> .text:0000000000400DA0     jnz     short loc_400D90
>
> and that made no significant difference in performance.
>
> Is this dependency really a performance issue? BTW, the outer loop executes
> 200,000 times...
>
> Thanks!
>
> Oleg.
>
> P.S. GDB disassembles the v4.6 emitted padding as:
>
>    0x0000000000400d87 <+231>:   data32 xchg ax,ax
>    0x0000000000400d8a <+234>:   data32 xchg ax,ax
>    0x0000000000400d8d <+237>:   data32 xchg ax,ax
>
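
For reference, here is a minimal C++ sketch of why accumulating in a full 32-bit register and truncating at the end (the g++ 4.1 loop, or the movzx/add rewrite of the 4.6 loop) should give the same low byte as accumulating directly in al. The function names and the sample buffer are hypothetical and are not taken from the original test program; the comments only map each step loosely onto the listings quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Loosely mirrors the g++ 4.6 inner loop: the running sum is kept in an
// 8-bit value, so it wraps modulo 256 at every step. The real loop adds 10
// to all of eax, but only al is consumed afterwards (movsx eax, al).
inline std::uint8_t sum_in_byte(const std::uint8_t* data, std::size_t n) {
    std::uint8_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc = static_cast<std::uint8_t>(acc + 10);       // add eax, 0Ah
        acc = static_cast<std::uint8_t>(acc + data[i]);  // add al, [rdx]
    }
    return acc;
}

// Loosely mirrors the g++ 4.1 inner loop: each byte is zero-extended,
// summed in 32 bits, and only the low byte is kept at the very end.
inline std::uint8_t sum_in_dword(const std::uint8_t* data, std::size_t n) {
    std::uint32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t v = data[i];  // movzx eax, byte
        acc += v + 10;              // add eax, 0Ah ; lea ecx, [rax+rcx]
    }
    return static_cast<std::uint8_t>(acc);  // low byte, as read by movsx eax, cl
}

// Self-check on a hand-picked buffer whose sum overflows a byte.
inline void check_example() {
    const std::uint8_t sample[] = {250, 1, 2, 200, 77};
    assert(sum_in_byte(sample, 5) == sum_in_dword(sample, 5));
}
```

Calling check_example() from a small main() exercises both variants. The two agree because addition modulo 2^32 followed by truncation to 8 bits is the same as addition modulo 2^8, which is why widening the accumulator changes the register dependencies but not the result.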