Hi all... I'm posting this here because it may be of direct interest for kernel development (I recall many discussions about whether to inline or not...).
Testing other problems, I finally ran into this issue: the same short, trivial loop took 3 to 5 times longer when it was written directly in main() than when it was in an out-of-line function. The same (bad) thing happens if the function is inlined. The basic code looks like this:

    float data[];

    [inline] double one()
    {
        double sum;
        sum = 0;
        for (i=0; i<SIZE; i++)
            sum += data[i];
        return sum;
    }

    int main()
    {
        gettimeofday(&tv0,0);
        for (i=0; i<SIZE; i++)
            s0 += data[i];
        gettimeofday(&tv1,0);
        printf("T0: %6.2f ms\n",elap(tv0,tv1));

        gettimeofday(&tv0,0);
        s1 = one();
        gettimeofday(&tv1,0);
        printf("T1: %6.2f ms\n",elap(tv0,tv1));
    }

The times with one() not inlined (EM64T, 2.33GHz):

    apolo:~/e4> tst
    T0: 1145.12 ms
    S0: 268435456.00
    T1: 457.19 ms
    S1: 268435456.00

With one() inlined:

    apolo:~/e4> tst
    T0: 1200.52 ms
    S0: 268435456.00
    T1: 1200.14 ms
    S1: 268435456.00

Looking at the assembler, the non-inlined version does:

    .L2:
        cvtss2sd (%rdx,%rax,4), %xmm0
        incq     %rax
        cmpq     $268435456, %rax
        addsd    %xmm0, %xmm1
        jne      .L2

and the inlined one:

    .L13:
        cvtss2sd (%rdx,%rax,4), %xmm0
        incq     %rax
        cmpq     $268435456, %rax
        addsd    8(%rsp), %xmm0
        movsd    %xmm0, 8(%rsp)
        jne      .L13

It looks like it is updating the stack on each iteration: the inlined loop keeps the running sum in memory at 8(%rsp) and reads and writes it on every pass, while the out-of-line loop keeps it in %xmm1. This is -march=opteron code; the -march=pentium4 code is similar. Same behaviour with gcc3 and gcc4. tst.c and Makefile attached.

Nice, isn't it? Please, show me where my fault is...

--
J.A. Magallon <jamagallon()ono!com>     \   Software is like sex:
                                         \         It's better when it's free
Mandriva Linux release 2007.1 (Cooker) for i586
Linux 2.6.20-jam06 (gcc 4.1.2 20070302 (prerelease) (4.1.2-1mdv2007.1)) #1 SMP PREEMPT
tst.c:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define SIZE (256*1024*1024)

/* Elapsed milliseconds from t0 to t1; computed in double so that
   1000*tv_sec cannot overflow a 32-bit long. */
#define elap(t0,t1) \
    (1000.0*((t1).tv_sec - (t0).tv_sec) + 0.001*((t1).tv_usec - (t0).tv_usec))

double one(void);

float *data;

#ifdef INLINE
inline
#endif
double one(void)
{
    int i;
    double sum;

    sum = 0;
    asm("#FBGN");               /* marker: loop start in the .s output */
    for (i=0; i<SIZE; i++)
        sum += data[i];
    asm("#FEND");               /* marker: loop end */
    return sum;
}

int main(int argc,char** argv)
{
    struct timeval tv0,tv1;
    double s0,s1;
    int i;

    data = malloc(SIZE*sizeof(float));
    if (data == NULL) {
        perror("malloc");
        return 1;
    }
    for (i=0; i<SIZE; i++)
        data[i] = 1;

    /* Loop written directly in main() */
    gettimeofday(&tv0,0);
    s0 = 0;
    asm("#MBGN");
    for (i=0; i<SIZE; i++)
        s0 += data[i];
    asm("#MEND");
    gettimeofday(&tv1,0);
    printf("T0: %6.2f ms\n",elap(tv0,tv1));
    printf("S0: %0.2lf\n",s0);

    /* Same loop through one(), inlined only when built with -DINLINE */
    gettimeofday(&tv0,0);
    s1 = one();
    gettimeofday(&tv1,0);
    printf("T1: %6.2f ms\n",elap(tv0,tv1));
    printf("S1: %0.2lf\n",s1);

    free(data);
    return 0;
}