I have a Pentium 4 HT 521 running in 64-bit mode here, which seems to have a 
branch prediction or prefetch misfeature. Here is an example: 

// stall.c

#include <stdlib.h>     /* abort(), only needed with -DHEIMLICH */

typedef int (*fn)(void*); 
int nop(void* ip) 
{
        fn *next=((fn*)ip)+1; 
#ifdef HEIMLICH
        // choked?
        if (ip==0) abort(); 
#endif
#ifdef NOSC
        return 1+(*next)(next);
#else
        return (*next)(next); 
#endif
}
int ret(void* ip)
{
        return 0; 
}

int main()
{
        int i; 
        fn prog[]={&nop,&nop,&nop,&nop,&nop,&nop,&nop,&nop,&nop,&ret}; 
        for (i=0;i<100000000;i++) 
        {
                (*prog)(prog); 
        }
        return 0; 
}

// eof

(gcc is 4.0.3; gcc 4.3 from svn behaves the same) 

gcc -march=nocona -fomit-frame-pointer -O3 stall.c
./a.out runtime: 5.75 sec

gcc -march=nocona -fomit-frame-pointer -O3 -DHEIMLICH stall.c
./a.out runtime: 1.92 sec

gcc -m32 -march=prescott -fomit-frame-pointer -O3 stall.c
./a.out runtime: 7.06 sec

gcc -m32 -march=prescott -fomit-frame-pointer -DHEIMLICH -O3 stall.c
./a.out runtime: 2.67 sec
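
For anyone who wants to report comparable numbers, here is a minimal sketch of 
a timing helper that measures only the dispatch loop. The names time_dispatch 
and stop are made up for illustration and are not part of stall.c; clock() 
reports CPU time, so the figures may differ slightly from wall-clock numbers. 

```c
#include <time.h>

typedef int (*fn)(void *);

/* Hypothetical terminator, same role as ret() in stall.c. */
int stop(void *ip)
{
        (void)ip;
        return 0;
}

/* Hypothetical helper: CPU seconds spent calling (*prog)(prog) iters times. */
double time_dispatch(fn *prog, long iters)
{
        clock_t start = clock();
        long i;

        for (i = 0; i < iters; i++)
                (*prog)(prog);

        return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Passing the prog array from main() and printing the returned value would give 
one number per build variant. 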

It looks like the extra branch introduced by the "if (ip==0) abort();" line 
shakes something up in a healthy way, bringing performance back into the 
range of a Core Duo CPU. In fact, a simple "jz 0" somewhere before the 
generated sibling call has the same effect. A similar result can be obtained 
with -DNOSC (which prevents the sibling call optimization, so an indirect 
call is generated instead of an indirect jump). 

Since this behaviour affects all kinds of dispatching code (switch, goto 
label, interpreter), I would like to know if this is specific to my stepping 
or a more general problem of the Prescott core. That is, I'd like to ask if 
you people can reproduce this with other models/steppings, in order to find 
out if it's significant enough to file an enhancement report for the optimizer.
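
As an illustration of the "goto label" style of dispatch mentioned above, here 
is a minimal sketch using GCC's labels-as-values extension. The opcode 
numbering and the name run_goto are assumptions made for this example, and I 
have not checked whether this variant shows the same stall; it is only meant 
as a second test case with one indirect jump per instruction. 

```c
/* Hypothetical opcode numbering, not taken from stall.c. */
enum { OP_NOP = 0, OP_RET = 1 };

/*
 * Threaded dispatch with computed goto (GNU C extension).
 * Executes opcodes starting at ip until OP_RET is reached and
 * returns the number of OP_NOP instructions that were executed.
 */
int run_goto(const unsigned char *ip)
{
        static void *const table[] = { &&do_nop, &&do_ret };
        int steps = 0;

        goto *table[*ip++];     /* indirect jump to the first handler */

do_nop:
        steps++;
        goto *table[*ip++];     /* each handler dispatches the next opcode */

do_ret:
        return steps;
}
```

Calling run_goto() on nine OP_NOP bytes followed by OP_RET mirrors the prog 
array in stall.c. At -O3 the compiler may inline or fold a constant program, 
so the program is better passed in from a place the optimizer cannot see 
through. 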

Here's my relevant data (using http://www.etallen.com/cpuid.html): 

   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
      model           = 0x4 (4)
      stepping id     = 0x1 (1)
      extended family = 0x0 (0)
      extended model  = 0x0 (0)
      (simple synth)  = Intel Pentium 4 (Prescott E0) / Xeon (Nocona E0) / Xeon MP (Cranford A0 / Potomac C0) / Celeron D (Prescott E0) / Mobile Pentium 4 (Prescott E0), 90nm
