http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53133

             Bug #: 53133
           Summary: XOR AL,AL to zero lower 8 bits of EAX/RAX causes
                    partial register stall (Intel Core 2)
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: a...@consulting.net.nz


Processor is Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz

#include <stdint.h>
#include <stdio.h>

uint32_t mem = 0;

int main(void) {
  uint64_t sum=0;
  for (uint32_t i=3000000000; i>0; --i) {
    asm volatile ("" : : : "memory"); //load data from memory each time
    uint64_t data = mem;

    //partial register stall
    sum += (data & UINT64_C(0xFFFFFFFFFFFFFF00)) >> 2;

    //no partial register stall
    //sum += (data >> 2) & UINT64_C(0xFFFFFFFFFFFFFFC0);
  }
  printf("sum is %llu\n", (unsigned long long)sum);
}

$ gcc-4.7 -O3 -std=gnu99 partial_register_stall.c && time ./a.out 
sum is 0

real    0m4.504s
user    0m4.500s
sys    0m0.000s

Each loop iteration takes about 4.5 cycles (4.5 s at 3.00 GHz over
3,000,000,000 iterations).

Relevant assembly code:

  400410:       8b 05 ee 04 20 00       mov    eax,DWORD PTR [rip+0x2004ee]    # 600904 <mem>
  400416:       30 c0                   xor    al,al
  400418:       48 c1 e8 02             shr    rax,0x2
  40041c:       48 01 c6                add    rsi,rax
  40041f:       83 ea 01                sub    edx,0x1
  400422:       75 ec                   jne    400410 <main+0x10>

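mem is zero-extended into RAX by the 32-bit load. The lower 8 bits of RAX are
then zeroed via XOR AL,AL, and the result is shifted right by two bits. The
SHR reads the full RAX immediately after only AL was written, which is what
causes the partial register stall.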

An equivalent way of computing this is to first shift right by two bits and
then mask the lower six bits to zero. That is, replace the line:
   sum += (data & UINT64_C(0xFFFFFFFFFFFFFF00)) >> 2;
with:
   sum += (data >> 2) & UINT64_C(0xFFFFFFFFFFFFFFC0);
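
The equivalence of the two expressions can be checked with a small sketch
like the one below. The function names and the sample values are only
illustrative and are not part of the original test case. Because the shift is
a logical shift on an unsigned value, the top two bits of (data >> 2) are
already zero, so the wider mask 0xFFFFFFFFFFFFFFC0 gives the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Form that currently compiles to XOR AL,AL followed by SHR RAX,2. */
uint64_t mask_then_shift(uint64_t data) {
  return (data & UINT64_C(0xFFFFFFFFFFFFFF00)) >> 2;
}

/* Form that compiles to SHR RAX,2 followed by AND RAX,-64. */
uint64_t shift_then_mask(uint64_t data) {
  return (data >> 2) & UINT64_C(0xFFFFFFFFFFFFFFC0);
}

/* Compare both forms on a few hand-picked values (illustrative only). */
void check_equivalence(void) {
  const uint64_t samples[] = {
    UINT64_C(0), UINT64_C(1), UINT64_C(0xFF), UINT64_C(0x100),
    UINT64_C(0x1FF), UINT64_C(0xFFFFFFFF), UINT64_C(0xFFFFFFFFFFFFFFFF)
  };
  for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
    assert(mask_then_shift(samples[i]) == shift_then_mask(samples[i]));
  }
}
```

Calling check_equivalence() from main() aborts if any sample differs between
the two forms.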

$ gcc-4.7 -O3 -std=gnu99 partial_register_stall.c && time ./a.out 
sum is 0

real    0m2.002s
user    0m2.000s
sys    0m0.000s

Each loop iteration now takes about 2 cycles.
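
For reference, the cycles-per-iteration figures follow from the wall-clock
times as sketched below. This assumes the core runs at its nominal 3.00 GHz
for the whole run; the helper name is hypothetical.

```c
#include <stdint.h>

/* Convert a measured run time into an estimated cycles-per-iteration figure.
   clock_hz is assumed constant (nominal 3.00 GHz for the E8400). */
double cycles_per_iteration(double seconds, double clock_hz,
                            uint64_t iterations) {
  return seconds * clock_hz / (double)iterations;
}
```

With the measured real times, cycles_per_iteration(4.504, 3.0e9, 3000000000)
is about 4.5 and cycles_per_iteration(2.002, 3.0e9, 3000000000) is about 2.0.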

Relevant assembly code:

  400410:       8b 05 fe 04 20 00       mov    eax,DWORD PTR [rip+0x2004fe]    # 600914 <mem>
  400416:       48 c1 e8 02             shr    rax,0x2
  40041a:       48 83 e0 c0             and    rax,0xffffffffffffffc0
  40041e:       48 01 c6                add    rsi,rax
  400421:       83 ea 01                sub    edx,0x1
  400424:       75 ea                   jne    400410 <main+0x10>
