------- Comment #13 from jakub at gcc dot gnu dot org 2009-04-29 09:32 -------
You are benchmarking something completely unrelated. What really matters is how well the CPU can predict the branches in code that has 4 branches/calls in one 16-byte block. And Core2, similarly to various AMD CPUs, is not able to predict them well.
In the #c6 testcase it considers whether the je, call, jne and ret can be in one 16-byte block or not. They can't: je is 2 bytes, call 5 bytes, leal 4 bytes (but gcc uses min_insn_size, which is 2 in this case), testl 2, jne 2, addq 4 (but again, min_insn_size is 2 in this case). min_insn_size seems to be very conservative; I guess teaching it about a bunch of prefixes couldn't hurt. For non-jump/call insns, ATM it estimates just the displacement size and doesn't consider any prefixes (even those that really can't change after machine reorg), etc.

-- 

jakub at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu dot
                   |                            |org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
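
To make the size arithmetic for the #c6 sequence concrete, here is a rough sketch (not gcc's actual code) that adds up the bytes from the je to the ret, once with the real sizes and once with the min_insn_size estimates quoted above. The names `SEQUENCE`, `span` and `may_share_window` are made up for illustration, and the 1-byte size for ret is an assumption, since it is not given above.

```python
# Each entry: (mnemonic, actual size, min_insn_size estimate, is branch/call).
# Sizes are the ones quoted for the #c6 testcase; the 1-byte ret is an
# assumption, not something stated in the comment.
SEQUENCE = [
    ("je",    2, 2, True),
    ("call",  5, 5, True),
    ("leal",  4, 2, False),
    ("testl", 2, 2, False),
    ("jne",   2, 2, True),
    ("addq",  4, 2, False),
    ("ret",   1, 1, True),
]

WINDOW = 16        # size of the block the predictor cares about, in bytes
MAX_BRANCHES = 3   # a fourth branch/call in the same block is the problem


def span(seq, use_min):
    """Bytes from the start of the first insn to the end of the last one."""
    return sum(min_size if use_min else size for _, size, min_size, _ in seq)


def may_share_window(seq, use_min):
    """True if more than MAX_BRANCHES branches could land in one window.

    Ignores alignment: it only asks whether the whole span is short
    enough to fit in WINDOW bytes at all.
    """
    branches = sum(1 for *_, is_branch in seq if is_branch)
    return branches > MAX_BRANCHES and span(seq, use_min) <= WINDOW


if __name__ == "__main__":
    for label, use_min in (("actual sizes", False), ("min_insn_size", True)):
        print(label, span(SEQUENCE, use_min), may_share_window(SEQUENCE, use_min))
```

Under these assumptions the real encoding spans more than 16 bytes, so the four branches cannot share a block, while the estimated span is short enough that they might, which is why a more accurate min_insn_size would help.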