H.J. Lu wrote:
On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene <t...@moene.org> wrote:
L.S.,

Due to the discussion on register allocation, I went back to a hobby of
mine: Studying the assembly output of the compiler.

For this Fortran subroutine (note: unless otherwise told to the Fortran
front end, reals are 32 bit floating point numbers):

     subroutine sum(a, b, c, n)
     integer i, n
     real a(n), b(n), c(n)
     do i = 1, n
        c(i) = a(i) + b(i)
     enddo
     end

with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:

       xorps   %xmm2, %xmm2
       ....
.L6:
       movaps  %xmm2, %xmm0
       movaps  %xmm2, %xmm1
       movlps  (%r9,%rax), %xmm0
       movlps  (%r8,%rax), %xmm1
       movhps  8(%r9,%rax), %xmm0
       movhps  8(%r8,%rax), %xmm1
       incl    %ecx
       addps   %xmm1, %xmm0
       movaps  %xmm0, 0(%rbp,%rax)
       addq    $16, %rax
       cmpl    %ebx, %ecx
       jb      .L6

I'm not a master of x86_64 assembly, but it strongly looks like %xmm{0,1}
have to be zeroed (%xmm2 is set to zero by xor'ing it with itself) before
they are completely filled by the mov{l,h}ps instructions?


I think it is used to avoid a partial SSE register stall.


You mean there's no movaps (%r9,%rax), %xmm0 instruction (and mutatis mutandis for %xmm1) to copy 4*32 bits into the register?
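
For anyone who wants to experiment with the two load strategies discussed here, the following is a minimal C sketch using SSE intrinsics, assuming an x86 target with SSE enabled. The function names (load_split, sum_split, sum_whole) are hypothetical and not taken from GCC. One relevant fact: movaps faults unless the address is 16-byte aligned, so a single full-width load from an array of unknown alignment would have to be movups (_mm_loadu_ps below) rather than movaps. The sketch stores with an unaligned store so that it makes no assumption about c, whereas the loop above stores with movaps.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */

/* Split load, mirroring the loop above: start from a zeroed register,
   then fill the low and high 64-bit halves separately.
   Works for any alignment of p. */
static __m128 load_split(const float *p)
{
    __m128 v = _mm_setzero_ps();                  /* xorps + movaps copy */
    v = _mm_loadl_pi(v, (const __m64 *)p);        /* movlps  (p)  */
    v = _mm_loadh_pi(v, (const __m64 *)(p + 2));  /* movhps 8(p)  */
    return v;
}

/* c[i] = a[i] + b[i] using the split loads (hypothetical name). */
void sum_split(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = load_split(a + i);
        __m128 vb = load_split(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb)); /* unaligned store */
    }
    for (; i < n; i++)                            /* scalar remainder */
        c[i] = a[i] + b[i];
}

/* Same computation with one full-width unaligned load per operand
   (movups), for comparison (hypothetical name). */
void sum_whole(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                            /* scalar remainder */
        c[i] = a[i] + b[i];
}
```

Both functions compute the same result; compiling each with -O3 -S and different -mtune settings is one way to see which instruction sequence the compiler actually emits for a given processor.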

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html
