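A partial register stall happens when a partial register write (here, to al)
is followed by a read of the full 32-bit register (eax). In your case, the
stall probably happens in the next iteration when 'add eax, 0Ah' executes,
which is why your manual patch made no difference: reordering the two
instructions leaves the same write-then-read pair in place across the loop
back-edge.  Try changing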

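add al, [rdx] into two instructions (assuming esi is available here):

movzx esi, byte ptr [rdx]
add   eax, esi

The upper bits of eax will now pick up carries, but the 'movsx eax, al'
after the loop only uses the low byte, so the result should be unchanged.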
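To illustrate why widening the add should be safe, here is a minimal C++
sketch of the two accumulation styles. It is not the benchmark source (which
is not shown in this thread): the array contents, kSize, fill_data,
sum_narrow, sum_wide and check_equivalence are all hypothetical names made up
for the example. The only claim is that both forms agree on the low 8 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the benchmark's data8 array. 8000 matches the
// 1F40h bound in the g++ 4.1 listing; the real contents are unknown.
static const std::size_t kSize = 8000;
static std::uint8_t data8[kSize];

// Fill the array with an arbitrary, deterministic pattern (assumption).
void fill_data(unsigned seed) {
    for (std::size_t i = 0; i < kSize; ++i)
        data8[i] = static_cast<std::uint8_t>(i * seed + 3u);
}

// Narrow form: the accumulator is truncated to 8 bits on every step,
// which mirrors what 'add al, [rdx]' does to the low byte.
std::uint8_t sum_narrow() {
    std::uint8_t acc = 0;
    for (std::size_t i = 0; i < kSize; ++i)
        acc = static_cast<std::uint8_t>(acc + 10u + data8[i]);
    return acc;
}

// Wide form: zero-extend each byte and add in 32 bits, which mirrors
// 'movzx esi, byte ptr [rdx]' followed by 'add eax, esi'. Truncate once
// at the end, where the listing has 'movsx eax, al'.
std::uint8_t sum_wide() {
    std::uint32_t acc = 0;
    for (std::size_t i = 0; i < kSize; ++i)
        acc += 10u + data8[i];
    return static_cast<std::uint8_t>(acc);
}

// Sanity check: both forms must agree on the low 8 bits.
void check_equivalence() {
    assert(sum_narrow() == sum_wide());
}
```

Calling fill_data with any seed and then check_equivalence should pass,
since addition modulo 256 does not care when the truncation happens. Whether
the compiler actually emits the movzx form for sum_wide depends on the
version and flags, so check the disassembly.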

David


On Tue, Aug 23, 2011 at 10:47 AM, Oleg Smolsky
<oleg.smol...@riverbed.com> wrote:
> Hey Andrew,
>
> On 2011/8/22 18:37, Andrew Pinski wrote:
>>
>> On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky<oleg.smol...@riverbed.com>
>>  wrote:
>>>
>>> On 2011/8/22 18:09, Oleg Smolsky wrote:
>>>>
>>>> Both compilers fully inline the templated function and the emitted code
>>>> looks very similar. I am puzzled as to why one of these loops is
>>>> significantly slower than the other. I've attached disassembled listings -
>>>> perhaps someone could have a look please? (the body of the loop starts at
>>>> 0000000000400FD for gcc41 and at 0000000000400D90 for gcc46)
>>>
>>> The difference, theoretically, should be due to the inner loop:
>>>
>>> v4.6:
>>> .text:0000000000400DA0 loc_400DA0:
>>> .text:0000000000400DA0                 add     eax, 0Ah
>>> .text:0000000000400DA3                 add     al, [rdx]
>>> .text:0000000000400DA5                 add     rdx, 1
>>> .text:0000000000400DA9                 cmp     rdx, 5034E0h
>>> .text:0000000000400DB0                 jnz     short loc_400DA0
>>>
>>> v4.1:
>>> .text:0000000000400FE0 loc_400FE0:
>>> .text:0000000000400FE0                 movzx   eax, ds:data8[rdx]
>>> .text:0000000000400FE7                 add     rdx, 1
>>> .text:0000000000400FEB                 add     eax, 0Ah
>>> .text:0000000000400FEE                 cmp     rdx, 1F40h
>>> .text:0000000000400FF5                 lea     ecx, [rax+rcx]
>>> .text:0000000000400FF8                 jnz     short loc_400FE0
>>>
>>> However, I cannot see how the first version would be slow... The custom
>>> templated "shifter" degenerates into "add 0xa", which is the point of the
>>> test... Hmm...
>>
>> It is slower because of the subregister dependency between eax and al.
>>
> Hmm... it is a little difficult to reason about these fragments as they are
> not equivalent in functionality. The g++4.1 version discards the result
> while the other version (correctly) accumulates. Oh, I've just realized that
> I grabbed the first iteration of the inner loop which was factored out
> (perhaps due to unrolling?) Oops, my apologies.
>
> Here are complete loops, out of a further digested test:
>
> g++ 4.1 (1.35 sec, 1185M ops/s):
>
> .text:0000000000400FDB loc_400FDB:
> .text:0000000000400FDB                 xor     ecx, ecx
> .text:0000000000400FDD                 xor     edx, edx
> .text:0000000000400FDF                 nop
> .text:0000000000400FE0
> .text:0000000000400FE0 loc_400FE0:
> .text:0000000000400FE0                 movzx   eax, ds:data8[rdx]
> .text:0000000000400FE7                 add     rdx, 1
> .text:0000000000400FEB                 add     eax, 0Ah
> .text:0000000000400FEE                 cmp     rdx, 1F40h
> .text:0000000000400FF5                 lea     ecx, [rax+rcx]
> .text:0000000000400FF8                 jnz     short loc_400FE0
> .text:0000000000400FFA                 movsx   eax, cl
> .text:0000000000400FFD                 add     esi, 1
> .text:0000000000401000                 add     ebx, eax
> .text:0000000000401002                 cmp     esi, edi
> .text:0000000000401004                 jnz     short loc_400FDB
>
> g++ 4.6 (2.86s, 563M ops/s) :
>
> .text:0000000000400D80 loc_400D80:
> .text:0000000000400D80                 mov     edx, offset data8
> .text:0000000000400D85                 xor     eax, eax
> .text:0000000000400D87                 db      66h, 66h
> .text:0000000000400D87                 nop
> .text:0000000000400D8A                 db      66h, 66h
> .text:0000000000400D8A                 nop
> .text:0000000000400D8D                 db      66h, 66h
> .text:0000000000400D8D                 nop
> .text:0000000000400D90
> .text:0000000000400D90 loc_400D90:
> .text:0000000000400D90                 add     eax, 0Ah
> .text:0000000000400D93                 add     al, [rdx]
> .text:0000000000400D95                 add     rdx, 1
> .text:0000000000400D99                 cmp     rdx, 503480h
> .text:0000000000400DA0                 jnz     short loc_400D90
> .text:0000000000400DA2                 movsx   eax, al
> .text:0000000000400DA5                 add     ecx, 1
> .text:0000000000400DA8                 add     ebx, eax
> .text:0000000000400DAA                 cmp     ecx, esi
> .text:0000000000400DAC                 jnz     short loc_400D80
>
> Your observation still holds - there are two sequential instructions that
> operate on the same register. So, I manually patched the 4.6 binary's inner
> loop to the following:
>
> .text:0000000000400D90                 add     al, [rdx]
> .text:0000000000400D92                 add     rdx, 1
> .text:0000000000400D96                 add     eax, 0Ah
> .text:0000000000400D99                 cmp     rdx, 503480h
> .text:0000000000400DA0                 jnz     short loc_400D90
>
> and that made no significant difference in performance.
>
> Is this dependency really a performance issue? BTW, the outer loop executes
> 200,000 times...
>
> Thanks!
>
> Oleg.
>
> P.S. GDB disassembles the v4.6 emitted padding as:
>
>   0x0000000000400d87 <+231>:   data32 xchg ax,ax
>   0x0000000000400d8a <+234>:   data32 xchg ax,ax
>   0x0000000000400d8d <+237>:   data32 xchg ax,ax
>
