I have a Pentium 4 HT 521 running in 64-bit mode here, which seems to have a branch-prediction or prefetch misfeature. Here is an example:
// stall.c
#include <stdlib.h>   /* abort() */

typedef int (*fn)(void*);

int nop(void* ip) {
  fn *next=((fn*)ip)+1;
#ifdef HEIMLICH
  // choked?
  if (ip==0) abort();
#endif
#ifdef NOSC
  return 1+(*next)(next);
#else
  return (*next)(next);
#endif
}

int ret(void* ip) {
  return 0;
}

int main() {
  int i;
  fn prog[]={&nop,&nop,&nop,&nop,&nop,&nop,&nop,&nop,&nop,&ret};
  for (i=0;i<100000000;i++) {
    (*prog)(prog);
  }
  return 0;
}
// eof

(gcc is 4.0.3; gcc-4.3 from svn isn't different)

gcc -march=nocona -fomit-frame-pointer -O3 stall.c
./a.out
runtime: 5.75 sec

gcc -march=nocona -fomit-frame-pointer -O3 -DHEIMLICH stall.c
./a.out
runtime: 1.92 sec

gcc -m32 -march=prescott -fomit-frame-pointer -O3 stall.c
./a.out
runtime: 7.06 sec

gcc -m32 -march=prescott -fomit-frame-pointer -DHEIMLICH -O3 stall.c
./a.out
runtime: 2.67 sec

It looks like the extra branch involved in the "if (ip==0) abort();" line shakes something up in a healthy way, bringing performance back to the regions of a Core Duo CPU. In fact, a simple "jz 0" somewhere before the generated sibling call has the same effect. A similar result can be obtained with -DNOSC (the "1+" prevents the sibling-call optimization, so an indirect call is emitted instead of an indirect jump).

Since this behaviour affects all kinds of dispatching code (switch, goto label, interpreter), I would like to know whether it is specific to my stepping or a more general problem of the Prescott core. So I'd like to ask whether you people can reproduce this with other models/steppings, in order to find out if it's significant enough to file an enhancement report for the optimizer.
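For anyone who wants to reproduce this without an external timer, here is a rough self-timing sketch of the same test. The names run_chain() and time_chain() are made up for this sketch and are not part of stall.c. The loop also sums the return values, so it is not byte-for-byte the same loop as above, and absolute numbers may differ slightly.

```c
/* stall_timed.c: self-timing sketch of the test above.
   run_chain() and time_chain() are names invented for this sketch. */
#include <stdlib.h>   /* abort() */
#include <time.h>     /* clock(), CLOCKS_PER_SEC */

typedef int (*fn)(void*);

int nop(void* ip)
{
    fn *next = ((fn*)ip) + 1;
#ifdef HEIMLICH
    if (ip == 0) abort();           /* the extra branch */
#endif
#ifdef NOSC
    return 1 + (*next)(next);       /* not a tail call */
#else
    return (*next)(next);           /* candidate for a sibling call */
#endif
}

int ret(void* ip)
{
    (void)ip;
    return 0;
}

/* Walk the nine-nop chain `iterations` times and return the sum of the
   results, so the calls have a visible effect. */
long run_chain(long iterations)
{
    fn prog[] = {&nop,&nop,&nop,&nop,&nop,&nop,&nop,&nop,&nop,&ret};
    long i, sum = 0;
    for (i = 0; i < iterations; i++)
        sum += (*prog)(prog);
    return sum;
}

/* CPU seconds spent in run_chain(iterations), measured with clock(). */
double time_chain(long iterations)
{
    clock_t start = clock();
    volatile long sink = run_chain(iterations);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_chain(100000000) from a small main() and printing the result, once with and once without -DHEIMLICH, should show whether the gap exists on a given machine. clock() measures CPU time rather than wall time, which seems adequate for a single-threaded loop like this one.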
Here's my relevant data (using http://www.etallen.com/cpuid.html):

version information (1/eax):
   processor type  = primary processor (0)
   family          = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
   model           = 0x4 (4)
   stepping id     = 0x1 (1)
   extended family = 0x0 (0)
   extended model  = 0x0 (0)
   (simple synth)  = Intel Pentium 4 (Prescott E0) / Xeon (Nocona E0) / Xeon MP (Cranford A0 / Potomac C0) / Celeron D (Prescott E0) / Mobile Pentium 4 (Prescott E0), 90nm
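If you only have the raw eax value from CPUID leaf 1 at hand, the fields listed above can be pulled out with a few shifts. This is a minimal sketch assuming the standard leaf-1 bit layout. struct cpu_sig and decode_sig() are names invented here, and the value 0x0F41 is what the fields above imply rather than something copied from the raw output.

```c
/* Decode the CPUID leaf 1 eax signature into its fields.
   struct cpu_sig and decode_sig() are hypothetical names for this sketch. */
struct cpu_sig {
    unsigned stepping;    /* bits  3:0  */
    unsigned model;       /* bits  7:4  */
    unsigned family;      /* bits 11:8  */
    unsigned type;        /* bits 13:12 */
    unsigned ext_model;   /* bits 19:16 */
    unsigned ext_family;  /* bits 27:20 */
};

struct cpu_sig decode_sig(unsigned eax)
{
    struct cpu_sig s;
    s.stepping   =  eax        & 0xFu;
    s.model      = (eax >> 4)  & 0xFu;
    s.family     = (eax >> 8)  & 0xFu;
    s.type       = (eax >> 12) & 0x3u;
    s.ext_model  = (eax >> 16) & 0xFu;
    s.ext_family = (eax >> 20) & 0xFFu;
    return s;
}
```

For example, decode_sig(0x0F41) yields family 15, model 4, stepping 1, which matches the listing above. Posting those three numbers along with your timings should be enough to compare steppings.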