http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693
David Edelsohn <dje at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-10-11
     Ever Confirmed|0                           |1

--- Comment #8 from David Edelsohn <dje at gcc dot gnu.org> 2011-10-11 01:11:47 UTC ---
Both loop1 and loop2 produce the same code on LLVM, presumably from its memset
pattern:

        movq    %rax, 8(%r15)
        movq    %rbx, (%r15)
        testq   %rbx, %rbx
        je      .LBB1_3
# BB#1:
        movq    %rbx, %rcx
        movq    %rax, %rdx
        .align  16, 0x90
.LBB1_2:                                # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
        movb    %r14b, (%rdx)
        incq    %rdx
        decq    %rcx
        jne     .LBB1_2
.LBB1_3:                                # %._crit_edge
        movb    $0, (%rax,%rbx)

Direct pointer arithmetic might not be recommended, but Intel makes do.

For loop1, GCC produces:

        testq   %rbx, %rbx
        movq    %rax, 8(%rbp)
        movq    %rbx, 0(%rbp)
        je      .L3
        xorl    %edx, %edx
        .p2align 4,,10
        .p2align 3
.L5:
        movb    %r12b, (%rax,%rdx)
        addq    $1, %rdx
        movq    8(%rbp), %rax
        cmpq    %rbx, %rdx
        jne     .L5
.L3:
        movb    $0, (%rax,%rbx)

For loop2, GCC produces:

        xorl    %edx, %edx
        testq   %rbx, %rbx
        movq    %rax, 8(%rbp)
        movq    %rbx, 0(%rbp)
        jne     .L13
        jmp     .L9
        .p2align 4,,10
        .p2align 3
.L11:
        movq    8(%rbp), %rax
.L8:
.L13:
.L10:
        movb    %r12b, (%rax,%rdx)
        addq    $1, %rdx
        cmpq    %rbx, %rdx
        jne     .L11
        movq    8(%rbp), %rax
.L9:
        movb    $0, (%rax,%rbx)

In both cases GCC unnecessarily re-reads v->chars (the movq 8(%rbp), %rax
inside the loop).

Is loop2 slower because the jne .L13 jump into the middle of the loop confuses
Intel's loop branch prediction logic? Or does the loop2 instruction order
crack into uops badly? The cause of the performance difference is not obvious.
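
To illustrate the re-read of v->chars at the source level, here is a minimal C
sketch. It is not the testcase from this PR. The struct layout is only
inferred from the stores to offsets 0 and 8 in the listings, and the names
struct vec, vec_new, fill_reload and fill_hoisted are hypothetical. One
plausible explanation (an assumption, not something confirmed in this report)
is that a store through a char pointer may alias v->chars itself, so the
compiler keeps reloading the pointer unless it can prove otherwise. Copying
the pointer into a local removes that dependence by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical layout, inferred from the stores to 0(%rbp) and 8(%rbp). */
struct vec {
    size_t length;  /* offset 0 on x86-64 */
    char  *chars;   /* offset 8 on x86-64 */
};

/* Allocate a vec with room for n characters plus a terminating NUL. */
static struct vec *vec_new(size_t n)
{
    struct vec *v = malloc(sizeof *v);
    assert(v != NULL);
    v->chars = malloc(n + 1);
    assert(v->chars != NULL);
    v->length = n;
    return v;
}

/* Goes through v->chars on every iteration. Each byte store could, as far
 * as the type rules are concerned, overwrite v->chars, so a compiler that
 * cannot rule this out has to reload the pointer inside the loop. */
static void fill_reload(struct vec *v, char c)
{
    size_t n = v->length;
    for (size_t i = 0; i < n; i++)
        v->chars[i] = c;
    v->chars[n] = '\0';
}

/* Reads v->chars once into a local. The loop body no longer contains a
 * load from *v in the source, so nothing needs to be re-read. */
static void fill_hoisted(struct vec *v, char c)
{
    size_t n = v->length;
    char *p = v->chars;
    for (size_t i = 0; i < n; i++)
        p[i] = c;
    p[n] = '\0';
}
```

Both functions leave the same bytes in the buffer; only the second hands the
compiler a loop with no load from v inside it. Whether this has any bearing
on the timing difference between loop1 and loop2 is not something the sketch
establishes.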