L.S.,

Prompted by the discussion on register allocation, I went back to a hobby of mine: studying the assembly output of the compiler.

For this Fortran subroutine (note: unless the Fortran front end is told otherwise, reals are 32-bit floating-point numbers):

      subroutine sum(a, b, c, n)
      integer i, n
      real a(n), b(n), c(n)
      do i = 1, n
         c(i) = a(i) + b(i)
      enddo
      end

with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:

        xorps   %xmm2, %xmm2
        ....
.L6:
        movaps  %xmm2, %xmm0
        movaps  %xmm2, %xmm1
        movlps  (%r9,%rax), %xmm0
        movlps  (%r8,%rax), %xmm1
        movhps  8(%r9,%rax), %xmm0
        movhps  8(%r8,%rax), %xmm1
        incl    %ecx
        addps   %xmm1, %xmm0
        movaps  %xmm0, 0(%rbp,%rax)
        addq    $16, %rax
        cmpl    %ebx, %ecx
        jb      .L6

I'm no master of x86_64 assembly, but it looks very much as if %xmm{0,1} are zeroed (%xmm2 is set to zero by xor'ing it with itself, then copied into each of them) just before they are completely overwritten by the mov{l,h}ps instructions. Do they really have to be zeroed first?

Am I missing something?
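
To make the question concrete, here is a minimal C sketch that models only the data movement in the loop. It assumes the documented behaviour that movlps from memory writes the low 64 bits of the destination and movhps writes the high 64 bits, each leaving the other half alone. The type xmm_model and the model_* functions are names made up for this illustration; they are not GCC or intrinsic APIs, and the model says nothing about timing or instruction scheduling.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of one 128-bit XMM register as four 32-bit floats. */
typedef struct { float lane[4]; } xmm_model;

/* movlps mem, reg: write the low two lanes, leave the high two untouched. */
static void model_movlps(xmm_model *reg, const float *mem)
{
    memcpy(&reg->lane[0], mem, 2 * sizeof(float));
}

/* movhps mem, reg: write the high two lanes, leave the low two untouched. */
static void model_movhps(xmm_model *reg, const float *mem)
{
    memcpy(&reg->lane[2], mem, 2 * sizeof(float));
}

/* Load four floats the way the loop does: start from `initial`
   (the copy of %xmm2 in the generated code), then movlps + movhps. */
static xmm_model model_load4(xmm_model initial, const float *mem)
{
    assert(mem != NULL);
    model_movlps(&initial, mem);
    model_movhps(&initial, mem + 2);   /* the 8(...) displacement: 2 floats */
    return initial;
}
```

In this model, calling model_load4 with an all-zero `initial` and with an `initial` full of arbitrary values gives the same four lanes, which is why the movaps copies of the zeroed %xmm2 look redundant as far as the loaded values are concerned.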

[ BTW, the induction variable %ecx could have been eliminated,
  because %rax also counts upwards (but 16 at a time instead of 1) ]
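
The remark about the induction variable can be sketched in C as well. The two functions below are hypothetical illustrations, not what GCC emits: the first keeps a block counter (in the role of %ecx) next to a byte offset (in the role of %rax), the second derives the loop bound from the byte offset alone. Both assume 4-byte floats and, unlike the assembly, use a test at the top of the loop so that a block count of zero is handled without a separate guard.

```c
#include <assert.h>
#include <stddef.h>

/* Two induction variables: `count` (like %ecx) and `off` (like %rax). */
static void add_two_counters(const float *a, const float *b, float *c,
                             unsigned nblocks)
{
    unsigned count = 0;          /* counts 16-byte blocks, 1 at a time */
    size_t off = 0;              /* byte offset, 16 at a time */
    assert(sizeof(float) == 4);  /* assumption stated above */
    while (count < nblocks) {
        size_t i = off / sizeof(float);
        for (int k = 0; k < 4; k++)
            c[i + k] = a[i + k] + b[i + k];
        count++;
        off += 16;
    }
}

/* One induction variable: compare the byte offset against a scaled bound. */
static void add_one_counter(const float *a, const float *b, float *c,
                            unsigned nblocks)
{
    size_t end = (size_t)nblocks * 16;   /* bound expressed in bytes */
    assert(sizeof(float) == 4);
    for (size_t off = 0; off < end; off += 16) {
        size_t i = off / sizeof(float);
        for (int k = 0; k < 4; k++)
            c[i + k] = a[i + k] + b[i + k];
    }
}
```

Both functions visit the same blocks in the same order; the second simply trades the incl/cmpl pair for a single compare against a precomputed end offset.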

Thanks for any insight,

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html
