http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56200
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> 2013-02-06 09:57:13 UTC ---
(In reply to comment #5)
> Optimized alignments are enabled for -O2 and above.  For -O2, there are:
>
>     .p2align 4,,10
>     .p2align 3
> .L19:
>     cmpl    file(,%rbx,4), %ebp
>     jg      .L18
>     cmpl    0(%r13,%rbx,4), %ebp
>     jg      .L18
>     cmpl    (%r12), %ebp
>     jle     .L22
>     .p2align 4,,10
>     .p2align 3
> .L18:
>
> and generate
>
>   400ab6:  66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)
>   400ac0:  3b 2c 9d a0 1a 60 00            cmp    0x601aa0(,%rbx,4),%ebp
>   400ac7:  7f 17                           jg     400ae0 <find+0x70>
>   400ac9:  41 3b 6c 9d 00                  cmp    0x0(%r13,%rbx,4),%ebp
>   400ace:  7f 10                           jg     400ae0 <find+0x70>
>   400ad0:  41 3b 2c 24                     cmp    (%r12),%ebp
>   400ad4:  7e 32                           jle    400b08 <find+0x98>
>   400ad6:  66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)
>
> Branch Predict Unit fetches 32-byte at a time.  There are 3 back-to-back
> fused cmp/jcc instructions in 32-byte window, which causes misprediction.
> We can add a nop after the first cmp/jcc to avoid back-to-back cmp/jccs.

Yeah, I suppose if we bother with alignment we should do that.  Can we do it
with a peephole to only do it between two consecutive cmp/jccs?
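
To make the condition concrete, here is a rough Python sketch (not GCC code, and not a proposed implementation of the peephole) that takes the addresses and sizes from the disassembly quoted above and reports how many cmp/jcc pairs land in each 32-byte window, plus the addresses where one pair is immediately followed by another. The helper names (is_jcc, fused_pairs, pairs_per_window, back_to_back) are made up for illustration. It assumes any cmp directly followed by a conditional jump counts as a fused pair and that a pair belongs to the window holding its last byte; real macro-fusion rules depend on the microarchitecture.

```python
WINDOW = 32  # fetch window size in bytes, as stated in comment #5

# (address, size in bytes, mnemonic), copied from the quoted objdump listing
INSNS = [
    (0x400AB6, 10, "nopw"),
    (0x400AC0, 7, "cmp"),
    (0x400AC7, 2, "jg"),
    (0x400AC9, 5, "cmp"),
    (0x400ACE, 2, "jg"),
    (0x400AD0, 4, "cmp"),
    (0x400AD4, 2, "jle"),
    (0x400AD6, 10, "nopw"),
]


def is_jcc(mnemonic):
    # Simplification: any j* mnemonic other than an unconditional jmp.
    return mnemonic.startswith("j") and mnemonic != "jmp"


def fused_pairs(insns):
    """Return [start, end) byte ranges of each cmp directly followed by a jcc."""
    pairs = []
    for (addr, size, mn), (naddr, nsize, nmn) in zip(insns, insns[1:]):
        if mn == "cmp" and is_jcc(nmn) and addr + size == naddr:
            pairs.append((addr, naddr + nsize))
    return pairs


def pairs_per_window(insns, window=WINDOW):
    """Count cmp/jcc pairs per aligned window (keyed by window base address)."""
    counts = {}
    for _start, end in fused_pairs(insns):
        base = (end - 1) // window * window  # window holding the pair's last byte
        counts[base] = counts.get(base, 0) + 1
    return counts


def back_to_back(insns):
    """Addresses where one cmp/jcc pair ends and the next begins immediately."""
    pairs = fused_pairs(insns)
    return [nxt[0] for cur, nxt in zip(pairs, pairs[1:]) if cur[1] == nxt[0]]


if __name__ == "__main__":
    for base, count in pairs_per_window(INSNS).items():
        print(f"window {base:#x}: {count} cmp/jcc pairs")
    print("candidate nop sites:", [hex(a) for a in back_to_back(INSNS)])
```

For the listing above, all three pairs fall in the window starting at 0x400ac0, and the two boundaries between consecutive pairs (0x400ac9 and 0x400ad0) are the only places such a peephole would need to consider.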