http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53133
             Bug #: 53133
           Summary: XOR AL,AL to zero lower 8 bits of EAX/RAX causes
                    partial register stall (Intel Core 2)
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: a...@consulting.net.nz

Processor is Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz

#include <stdint.h>
#include <stdio.h>

uint32_t mem = 0;

int main(void) {
  uint64_t sum=0;
  for (uint32_t i=3000000000; i>0; --i) {
    asm volatile ("" : : : "memory"); //load data from memory each time
    uint64_t data = mem;

    //partial register stall
    sum += (data & UINT64_C(0xFFFFFFFFFFFFFF00)) >> 2;

    //no partial register stall
    //sum += (data >> 2) & UINT64_C(0xFFFFFFFFFFFFFFC0);
  }
  //cast so the argument matches %llu regardless of how uint64_t is defined
  printf("sum is %llu\n", (unsigned long long)sum);
}

$ gcc-4.7 -O3 -std=gnu99 partial_register_stall.c && time ./a.out
sum is 0

real    0m4.504s
user    0m4.500s
sys     0m0.000s

At 3.00 GHz, 3,000,000,000 iterations in about 4.5 seconds works out to
roughly 4.5 cycles per loop iteration. Relevant assembly code:

  400410: 8b 05 ee 04 20 00    mov    eax,DWORD PTR [rip+0x2004ee]  # 600904 <mem>
  400416: 30 c0                xor    al,al
  400418: 48 c1 e8 02          shr    rax,0x2
  40041c: 48 01 c6             add    rsi,rax
  40041f: 83 ea 01             sub    edx,0x1
  400422: 75 ec                jne    400410 <main+0x10>

The 32-bit load into EAX zero-extends mem into RAX. The lower 8 bits of RAX
are then zeroed via XOR AL,AL, and the result is shifted right by two. The
shift reads the full RAX immediately after the partial write to AL, which is
the pattern that triggers the partial register stall.

An equivalent way of computing this is to shift right by two first and then
mask the lower six bits to zero. That is, replace the line:

    sum += (data & UINT64_C(0xFFFFFFFFFFFFFF00)) >> 2;

with:

    sum += (data >> 2) & UINT64_C(0xFFFFFFFFFFFFFFC0);

$ gcc-4.7 -O3 -std=gnu99 partial_register_stall.c && time ./a.out
sum is 0

real    0m2.002s
user    0m2.000s
sys     0m0.000s

Each loop iteration now takes about 2 cycles. Relevant assembly code:

  400410: 8b 05 fe 04 20 00    mov    eax,DWORD PTR [rip+0x2004fe]  # 600914 <mem>
  400416: 48 c1 e8 02          shr    rax,0x2
  40041a: 48 83 e0 c0          and    rax,0xffffffffffffffc0
  40041e: 48 01 c6             add    rsi,rax
  400421: 83 ea 01             sub    edx,0x1
  400424: 75 ea                jne    400410 <main+0x10>