http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56200
H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |areg.melikadamyan at gmail
                   |                            |dot com

--- Comment #5 from H.J. Lu <hjl.tools at gmail dot com> 2013-02-05 23:50:35 UTC ---
Optimized alignments are enabled for -O2 and above.  For -O2, there are:

    .p2align 4,,10
    .p2align 3
.L19:
    cmpl    file(,%rbx,4), %ebp
    jg      .L18
    cmpl    0(%r13,%rbx,4), %ebp
    jg      .L18
    cmpl    (%r12), %ebp
    jle     .L22
    .p2align 4,,10
    .p2align 3
.L18:

and generate

  400ab6:   66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)
  400ac0:   3b 2c 9d a0 1a 60 00            cmp    0x601aa0(,%rbx,4),%ebp
  400ac7:   7f 17                           jg     400ae0 <find+0x70>
  400ac9:   41 3b 6c 9d 00                  cmp    0x0(%r13,%rbx,4),%ebp
  400ace:   7f 10                           jg     400ae0 <find+0x70>
  400ad0:   41 3b 2c 24                     cmp    (%r12),%ebp
  400ad4:   7e 32                           jle    400b08 <find+0x98>
  400ad6:   66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)

The Branch Predict Unit fetches 32 bytes at a time.  There are 3
back-to-back fused cmp/jcc instructions in a 32-byte window, which causes
misprediction.  We can add a nop after the first cmp/jcc to avoid
back-to-back cmp/jccs.
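
To make the window arithmetic concrete, here is a rough Python sketch that takes the addresses and lengths from the disassembly above and counts how many cmp/jcc pairs sit back-to-back inside one 32-byte window. It assumes the windows are 32-byte aligned, which the comment does not state explicitly. The helper names and the 1-byte nop are hypothetical, and the sketch only checks adjacency; it does not model the predictor itself.

```python
WINDOW = 32  # bytes fetched at a time, per the comment; alignment is assumed

# (address, length in bytes, mnemonic), copied from the disassembly above
LISTING = [
    (0x400ac0, 7, "cmp"),
    (0x400ac7, 2, "jg"),
    (0x400ac9, 5, "cmp"),
    (0x400ace, 2, "jg"),
    (0x400ad0, 4, "cmp"),
    (0x400ad4, 2, "jle"),
]


def fused_pairs(listing):
    """Return (start, end) byte ranges of each cmp directly followed by a jcc."""
    pairs = []
    for (a1, n1, m1), (a2, n2, m2) in zip(listing, listing[1:]):
        is_jcc = m2.startswith("j") and m2 != "jmp"
        if m1 == "cmp" and is_jcc and a1 + n1 == a2:
            pairs.append((a1, a2 + n2))
    return pairs


def longest_back_to_back_run(listing, window=WINDOW):
    """Longest run of adjacent cmp/jcc pairs lying in one aligned window."""
    best = run = 0
    prev_end = prev_win = None
    for start, end in fused_pairs(listing):
        win = start // window
        fits = (end - 1) // window == win  # pair does not straddle a boundary
        if fits and prev_end == start and prev_win == win:
            run += 1          # directly follows the previous pair
        elif fits:
            run = 1           # starts a new run
        else:
            run = 0
        best = max(best, run)
        prev_end, prev_win = end, win
    return best


def insert_nop_after(listing, index, size=1):
    """Copy of listing with a hypothetical size-byte nop after listing[index].

    Later addresses are shifted by `size`; any re-alignment the assembler
    would do for a following .p2align is ignored.
    """
    addr, length, _ = listing[index]
    out = list(listing[: index + 1])
    out.append((addr + length, size, "nop"))
    out += [(a + size, n, m) for a, n, m in listing[index + 1:]]
    return out


if __name__ == "__main__":
    print(longest_back_to_back_run(LISTING))                       # 3
    print(longest_back_to_back_run(insert_nop_after(LISTING, 1)))  # 2
```

With the original layout all three pairs start at 0x400ac0 and end before 0x400ae0, so they share one window and the run is 3. Placing a nop after the first jg separates the first pair from the other two; the second and third pairs remain adjacent in this model.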