http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182
--- Comment #32 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-02 08:28:34 UTC ---
For me, 4.1 is as fast as 4.6 on my CPU and on the reduced testcase I've
attached (it is not clear whether it correctly models what the original
benchmark did), and the trunk regressed with
http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=176072
Before that the inner loop looked like:
.L12:
        addl    $10, %edx
        addb    0(%rbp,%rcx), %dl
        addq    $1, %rcx
        cmpl    %ecx, %ebx
        jg      .L12
and now it looks like:
.L12:
        movzbl  0(%rbp,%rdx), %r8d
        addq    $1, %rdx
        cmpl    %edx, %ebx
        leal    10(%rcx,%r8), %ecx
        jg      .L12