On Wed, Aug 24, 2011 at 12:50 PM, Oleg Smolsky
<oleg.smol...@riverbed.com> wrote:
> On 2011/8/23 11:38, Xinliang David Li wrote:
>>
>> Partial register stall happens when there is a 32bit register read
>> followed by a partial register write. In your case, the stall probably
>> happens in the next iteration when 'add eax, 0Ah' executes, so your
>> manual patch does not work. Try change
>>
>> add al, [dx] into two instructions (assuming esi is available here)
>>
>> movzx esi, ds:data8[dx]
>> add eax, esi
>>
> I patched the code to use "movzx edi" but the result is a little clumsy as
> the loop is based on the virtual address rather than index.

my bad -- I did copy & paste without making it precise.

> Also, the
> sequence is a bit bigger so I had to spill the patch into the preceding
> padding:
>
> .text:0000000000400D80 loc_400D80:
> .text:0000000000400D80                 mov     edx, offset data8
> .text:0000000000400D85                 xor     eax, eax
> .text:0000000000400D87                 nop
> .text:0000000000400D88                 nop
> .text:0000000000400D89                 nop
> .text:0000000000400D8A                 nop
> .text:0000000000400D8B                 nop
> .text:0000000000400D8C
> .text:0000000000400D8C loc_400D8C:
> .text:0000000000400D8C                 movzx   edi, byte ptr [rdx+0]
> .text:0000000000400D90                 add     eax, edi
> .text:0000000000400D92                 add     eax, 0Ah
> .text:0000000000400D95                 add     rdx, 1
> .text:0000000000400D99                 cmp     rdx, 503480h
> .text:0000000000400DA0                 jnz     short loc_400D8C
> .text:0000000000400DA2                 movsx   eax, al
> .text:0000000000400DA5                 add     ecx, 1
> .text:0000000000400DA8                 add     ebx, eax
> .text:0000000000400DAA                 cmp     ecx, esi
> .text:0000000000400DAC                 jnz     short loc_400D80
>
> The performance improved from 2.84 sec (563.38 M ops/s) to 1.51 sec (1059.60
> M ops/s). It's close to the code emitted by g++4.1 now. Very funky!
>
> So, this is one test out of the suite. Many of them degraded... Are you guys
> interested in looking at other ones? Or is there something to be fixed in
> the register allocation logic?

File bugs --- the isolated examples like this one would be very helpful in
the bug report.

Thanks,

David

>
> Oleg.
>
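
For anyone reading along without the benchmark source, here is a rough C sketch of what the patched inner loop appears to compute, reconstructed only from the disassembly above. The function names (sum_narrow, sum_wide, sign_extend8) are made up for illustration and the real test code may look quite different. sum_narrow models the 8-bit accumulate ("add al, [mem]"), while sum_wide models the patched sequence: zero-extend the byte, add into a 32-bit accumulator, and sign-extend the low byte once at the end. Both should return the same value, since only the low 8 bits of the accumulator are ever observed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sign-extend the low 8 bits of x, as the
 * final "movsx eax, al" does. Written with arithmetic only, so it
 * does not rely on implementation-defined narrowing casts. */
static int32_t sign_extend8(uint32_t x)
{
    return (int32_t)((x & 0xFFu) ^ 0x80u) - 0x80;
}

/* Models the original form: every update goes through an 8-bit
 * accumulator (the "add al, [mem]" shape). */
int32_t sum_narrow(const uint8_t *data, size_t n)
{
    uint8_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc = (uint8_t)(acc + data[i] + 10u);
    return sign_extend8(acc);
}

/* Models the patched form: zero-extend each byte (movzx), add it
 * and the constant 0Ah into a 32-bit accumulator, then narrow once
 * after the loop. */
int32_t sum_wide(const uint8_t *data, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t widened = data[i]; /* movzx edi, byte ptr [rdx] */
        acc += widened;             /* add eax, edi */
        acc += 10u;                 /* add eax, 0Ah */
    }
    return sign_extend8(acc);       /* movsx eax, al */
}
```

The wide version never writes a sub-register inside the loop, which is the point of the movzx suggestion. Whether a given compiler emits the narrow or the wide form for such source is up to its register allocation and instruction selection, so treat this as an illustration of the two shapes rather than a reproducer.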