------- Comment #2 from rguenth at gcc dot gnu dot org 2008-11-30 11:38 -------
Given the high density of branches in the code, this could easily be a code layout and/or padding issue. Different architectures place different constraints on their decoders and branch predictors with respect to branch density. Intel's Core microarchitecture introduces further branch limitations for loops that engage the loop stream detector.
We make no attempt to optimize for this properly (or even to model it), apart from inserting nops. YMMV with -fschedule-insns.
-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38306
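
For readers who want to experiment with the effect described in the comment, here is a minimal sketch of a branch-dense loop in C. It is not taken from the bug report: the function name `classify` and the byte classes are hypothetical, chosen only to pack several short conditional branches into one loop body. Whether GCC actually emits branches here (rather than branchless code), and whether layout changes the timing, depends on the GCC version and target, so treat it as a starting point rather than a reproducer.

```c
/* Illustrative only: a loop body made of several short conditional
 * branches, the kind of code whose speed may shift with layout or padding.
 *
 * Possible builds to compare (all are existing GCC options):
 *   gcc -O2 -c demo.c
 *   gcc -O2 -fschedule-insns -c demo.c
 *   gcc -O2 -falign-loops=32 -falign-jumps=32 -c demo.c
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts how many bytes fall into each of four
 * value ranges, using a chain of tiny branches per byte. */
void classify(const uint8_t *buf, size_t n, unsigned counts[4])
{
    assert(buf != NULL && counts != NULL);

    counts[0] = counts[1] = counts[2] = counts[3] = 0;

    for (size_t i = 0; i < n; i++) {
        uint8_t c = buf[i];
        if (c < 32)
            counts[0]++;        /* 0..31 */
        else if (c < 64)
            counts[1]++;        /* 32..63 */
        else if (c < 128)
            counts[2]++;        /* 64..127 */
        else
            counts[3]++;        /* 128..255 */
    }
}
```

To compare builds, call `classify` many times over a large buffer from a small timing driver of your own, and inspect the generated assembly (for example with `gcc -S`) to see where nops and alignment padding were inserted.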