L.S.,
Prompted by the discussion on register allocation, I went back to a hobby
of mine: studying the assembly output of the compiler.
For this Fortran subroutine (note: unless the Fortran front end is told
otherwise, reals are 32-bit floating-point numbers):
      subroutine sum(a, b, c, n)
      integer i, n
      real a(n), b(n), c(n)
      do i = 1, n
         c(i) = a(i) + b(i)
      enddo
      end
with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
        xorps   %xmm2, %xmm2
        ....
.L6:
        movaps  %xmm2, %xmm0
        movaps  %xmm2, %xmm1
        movlps  (%r9,%rax), %xmm0
        movlps  (%r8,%rax), %xmm1
        movhps  8(%r9,%rax), %xmm0
        movhps  8(%r8,%rax), %xmm1
        incl    %ecx
        addps   %xmm1, %xmm0
        movaps  %xmm0, 0(%rbp,%rax)
        addq    $16, %rax
        cmpl    %ebx, %ecx
        jb      .L6
I'm no master of x86_64 assembly, but this strongly looks as if %xmm{0,1}
have to be zeroed (%xmm2 is set to zero by xor'ing it with itself) before
they are completely filled by the mov{l,h}ps instructions?
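
To make that reading concrete, here is a minimal sketch in Python (a toy
model of my own, not anything taken from the compiler). It treats an SSE
register as a list of four float lanes and mimics one iteration of the loop
above. The helper names mirror the instructions, while load4 and the sample
data are made up for illustration. It only models the data movement, so it
says nothing about timing or about dependencies between instructions.

```python
# Toy model: one 128-bit SSE register is a list of four 32-bit float lanes.
# Memory is a plain list of floats; offsets are counted in floats, not bytes.

def movaps(src):
    """Register-to-register move: copy all four lanes."""
    return list(src)

def movlps(mem, offset, reg):
    """Load two floats into the low two lanes; the high lanes are kept."""
    return [mem[offset], mem[offset + 1], reg[2], reg[3]]

def movhps(mem, offset, reg):
    """Load two floats into the high two lanes; the low lanes are kept."""
    return [reg[0], reg[1], mem[offset], mem[offset + 1]]

def addps(a, b):
    """Lane-wise addition of two registers."""
    return [x + y for x, y in zip(a, b)]

def load4(mem, offset, initial):
    """Hypothetical helper: the movaps/movlps/movhps sequence from the loop.

    'initial' plays the role of %xmm2; the byte offset 8 in the assembly
    corresponds to two floats here.
    """
    reg = movaps(initial)
    reg = movlps(mem, offset, reg)
    reg = movhps(mem, offset + 2, reg)
    return reg

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

zero = [0.0, 0.0, 0.0, 0.0]        # what xorps %xmm2, %xmm2 produces
garbage = [-7.5, 99.0, 3.25, 1e30]  # arbitrary stale register contents

c_zeroed = addps(load4(b, 0, zero), load4(a, 0, zero))
c_garbage = addps(load4(b, 0, garbage), load4(a, 0, garbage))

print(c_zeroed)
print(c_garbage)
```

In this model both runs give the same four sums, because movlps and movhps
together overwrite all four lanes regardless of what the register held
before.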