------- Comment #5 from ubizjak at gmail dot com  2008-03-21 20:58 -------
The inner loop is compiled (-O2 -march=pentium4 -malign-double) to:

.L4:
       movl    %ecx, %eax
       andl    $1, %eax
       movl    a(,%eax,4), %eax
       xorl    %edx, %edx
(*)    pushl   %edx
(*)    pushl   %eax
(*)    fildll  (%esp)
       addl    $8, %esp
       faddp   %st, %st(1)
       addl    $1, %ecx
       cmpl    $100000000, %ecx
       jne     .L4


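The instructions marked with (*) form a partial memory access: two 32-bit
pushes are followed by a single 64-bit load from the same stack slot, so the
load cannot be satisfied by store forwarding from either store alone.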
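For readers without the original test case at hand, the following is a rough
sketch of the kind of C loop that would produce assembly like the above. It is
a reconstruction, not the test case from the bug report: the function name
sum_unsigned, the parameter n, and the contents of a[] are assumptions. Only
the array name a, the "i & 1" index, and the unsigned-to-floating-point
accumulation are taken from the listing.

```c
/* Hypothetical reconstruction; the array contents are assumed. */
unsigned int a[2] = { 1, 2 };

/* Accumulate a[i & 1] into a double, forcing an unsigned int ->
   floating-point conversion on every iteration. */
double sum_unsigned(unsigned int n)
{
    double sum = 0.0;
    unsigned int i;

    for (i = 0; i < n; i++)
        sum += a[i & 1];    /* zero-extended to 64 bits, then fildll on x87 */

    return sum;
}
```

Calling sum_unsigned(100000000) would correspond to the loop bound visible in
the cmpl instruction.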

Runtime:

time ./a.out

real    0m0.794s
user    0m0.724s
sys     0m0.000s


Patched gcc creates:

.L4:
       movl    %edx, %eax
       andl    $1, %eax
       movd    a(,%eax,4), %xmm0
       movq    %xmm0, -16(%ebp)
       fildll  -16(%ebp)
       faddp   %st, %st(1)
       addl    $1, %edx
       cmpl    $100000000, %edx
       jne     .L4


time ./a.out

real    0m0.123s
user    0m0.124s
sys     0m0.000s


This is a more than 5.8x speedup in user time (0.724s vs. 0.124s). The
optimization also applies to non-TARGET_INTER_UNIT_MOVES targets, despite the
extra (store-forwarded!) memory access.
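Both sequences rely on the same arithmetic identity: a 32-bit unsigned value,
zero-extended to 64 bits, can be converted with a signed 64-bit integer load
(fildll) because the sign bit is always clear. A minimal C sketch of that
identity follows; the function name u32_to_double_via_i64 is made up for
illustration and does not appear in the patch.

```c
#include <stdint.h>

/* Zero-extend to 64 bits first, then do a signed 64-bit conversion.
   The upper 32 bits are zero, so the signed interpretation is
   identical to the unsigned 32-bit value. */
double u32_to_double_via_i64(uint32_t u)
{
    int64_t wide = (int64_t)(uint64_t)u;
    return (double)wide;
}
```

In the patched code, movd performs the zero extension and the movq store
writes all 64 bits at once, so the following fildll reads data that came from
a single store.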


-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |http://gcc.gnu.org/ml/gcc-
                   |                            |patches/2008-
                   |                            |03/msg01295.html
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED
   Target Milestone|---                         |4.4.0


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13958
