http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56200
--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> 2013-02-04 21:36:38 UTC ---
(In reply to comment #1)
> What happens if you also use -fno-ivopts ?

For me, -fno-ivopts gives a small improvement, but the result is still slower
than -O0.

I think the slowdown is related to code layout in the Icache and branch
predictors.  There is a hot region which is composed of three consecutive
conditional branches (cmp-jg-cmp-jg-cmp-jg in optimized code and
mov-cmp-jl-mov-cmp-jl-mov-cmp-jl at -O0).  If I align the first _and_ the
second to a 16-byte boundary, I get better performance than -O0, but aligning
only one of those is still slower than -O0:

--- o1.s	2013-02-05 00:04:44.405072150 +0400
+++ o1h.s	2013-02-05 01:17:43.648014420 +0400
@@ -119,9 +119,11 @@ find:
 	movq	%rdx, %rbp
 	leal	1(%r14), %eax
 	movl	%eax, 12(%rsp)
+	.p2align 4,,7
 .L18:
 	cmpl	file(%r12), %r14d
 	jg	.L17
+	.p2align 4,,7
 	cmpl	(%r15,%r12), %r14d
 	jg	.L17
 	cmpl	(%rbx), %r14d
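
A note on the directive used in the patch: in GNU as, `.p2align 4,,7` requests alignment to 2^4 = 16 bytes, leaves the fill value at its default, and caps the padding at 7 bytes. If more than 7 bytes would be needed, no padding is emitted at all, so each of the two directives may or may not take effect depending on where the code happens to land. The sketch below models that rule. The helper name `p2align_padding` and the addresses are made up for illustration; the real offsets in o1.s are not given in this comment.

```python
from typing import Optional


def p2align_padding(address: int, power: int, max_skip: Optional[int] = None) -> int:
    """Return how many padding bytes `.p2align power,,max_skip` would emit
    when the location counter is at `address` (hypothetical helper)."""
    alignment = 1 << power            # .p2align takes a power of two
    padding = -address % alignment    # bytes needed to reach the next boundary
    if max_skip is not None and padding > max_skip:
        return 0                      # too much padding needed: directive is skipped
    return padding


# Illustrative addresses only, not taken from the actual object file.
for addr in (0x129, 0x128, 0x130):
    pad = p2align_padding(addr, 4, 7)
    print(f"at {addr:#x}: pad {pad} bytes, next instruction at {addr + pad:#x}")
```

With these sample addresses, 0x129 is padded up to 0x130, 0x128 is left alone because it would need 8 bytes, and 0x130 is already aligned. To check whether both branches really ended up on 16-byte boundaries, the assembled addresses would have to be inspected, for example with objdump -d.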