------- Comment #5 from ubizjak at gmail dot com  2008-03-21 20:58 -------

The inner loop is compiled (-O2 -march=pentium4 -malign-double) to:

.L4:
        movl    %ecx, %eax
        andl    $1, %eax
        movl    a(,%eax,4), %eax
        xorl    %edx, %edx
(*)     pushl   %edx
(*)     pushl   %eax
(*)     fildll  (%esp)
        addl    $8, %esp
        faddp   %st, %st(1)
        addl    $1, %ecx
        cmpl    $100000000, %ecx
        jne     .L4

The instructions marked with (*) form a partial memory access: two 32-bit
stores followed by one 64-bit load from the same location.

Runtime:

time ./a.out

real    0m0.794s
user    0m0.724s
sys     0m0.000s

Patched gcc creates:

.L4:
        movl    %edx, %eax
        andl    $1, %eax
        movd    a(,%eax,4), %xmm0
        movq    %xmm0, -16(%ebp)
        fildll  -16(%ebp)
        faddp   %st, %st(1)
        addl    $1, %edx
        cmpl    $100000000, %edx
        jne     .L4

time ./a.out

real    0m0.123s
user    0m0.124s
sys     0m0.000s

This represents a more than 5.8x speedup (user time, 0.724s vs. 0.124s).

The optimization is applicable to non-TARGET_INTER_UNIT_MOVES targets as
well, despite the extra (store forwarded!) memory access.

-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |http://gcc.gnu.org/ml/gcc-patches/2008-03/msg01295.html
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED
   Target Milestone|---                         |4.4.0

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13958
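
For reference, the comment shows only the generated assembly, not the C source. A loop of roughly the following shape would be expected to yield an inner loop like the ones above: the index is masked with 1, a 32-bit element of a is loaded, zero-extended to 64 bits, converted with fildll and added to an x87 accumulator. This is only a sketch reconstructed from the assembly. The element type, the accumulator type, the initial values of a and the function name sum_converted are all assumptions, not taken from the test case.

```c
/* Hypothetical reconstruction of the kind of loop discussed above.
   The array contents and the function name are assumptions. */
unsigned int a[2] = { 1, 2 };

/* Sums a[i & 1], converted from unsigned int to double, for i in [0, n).
   The unsigned -> double conversion in the loop body is what the compiler
   implements via a 64-bit fildll on x87. */
double sum_converted(unsigned int n)
{
    double sum = 0.0;
    unsigned int i;

    for (i = 0; i < n; i++)
        sum += a[i & 1];    /* implicit unsigned int -> double conversion */

    return sum;
}
```

Calling sum_converted(100000000) from main and timing the resulting binary would correspond to the iteration count visible in the cmpl instruction; the timings quoted above come from the original test case, not from this sketch.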