L.S.,
Prompted by the discussion on register allocation, I went back to a hobby
of mine: studying the assembly output of the compiler.
For this Fortran subroutine (note: unless the Fortran front end is told
otherwise, reals are 32-bit floating-point numbers):

      subroutine sum(a, b, c, n)
      integer i, n
      real a(n), b(n), c(n)
      do i = 1, n
         c(i) = a(i) + b(i)
      enddo
      end

with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:

        xorps   %xmm2, %xmm2
        ....
.L6:
        movaps  %xmm2, %xmm0
        movaps  %xmm2, %xmm1
        movlps  (%r9,%rax), %xmm0
        movlps  (%r8,%rax), %xmm1
        movhps  8(%r9,%rax), %xmm0
        movhps  8(%r8,%rax), %xmm1
        incl    %ecx
        addps   %xmm1, %xmm0
        movaps  %xmm0, 0(%rbp,%rax)
        addq    $16, %rax
        cmpl    %ebx, %ecx
        jb      .L6

I'm not a master of x86_64 assembly, but it strongly looks as if
%xmm{0,1} are being zeroed (%xmm2 is set to zero by xor'ing it with
itself, then copied into each of them) just before they are completely
overwritten by the mov{l,h}ps instructions. Is that zeroing really
necessary?
Am I missing something?
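
To make the question concrete, here is a minimal C model of the two partial loads. It is only a sketch of the architectural effect as I read it (movlps writes the low 64 bits, movhps the high 64 bits); the names xmm_model, model_movlps, model_movhps and model_load4 are hypothetical, and the model says nothing about timing or register dependencies inside the processor.

```c
#include <string.h>

/* Hypothetical model of one 128-bit SSE register as four float lanes. */
typedef struct { float lane[4]; } xmm_model;

/* movlps mem, reg: overwrite lanes 0-1, leave lanes 2-3 untouched. */
static void model_movlps(xmm_model *reg, const float *mem)
{
    memcpy(&reg->lane[0], mem, 2 * sizeof(float));
}

/* movhps mem, reg: overwrite lanes 2-3, leave lanes 0-1 untouched. */
static void model_movhps(xmm_model *reg, const float *mem)
{
    memcpy(&reg->lane[2], mem, 2 * sizeof(float));
}

/* Load four floats the way the loop body does: low half first, then the
   high half (the 8-byte displacement in the assembly is mem + 2 here). */
static xmm_model model_load4(xmm_model initial, const float *mem)
{
    model_movlps(&initial, mem);
    model_movhps(&initial, mem + 2);
    return initial;
}
```

In this model, calling model_load4 with a zeroed register and with one holding arbitrary values gives the same four lanes, which is why the movaps from %xmm2 looks redundant as far as the computed result is concerned.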
[ BTW, the induction variable %ecx could have been eliminated,
because %rax also counts upwards (by 16 per iteration instead of 1) ]
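
The induction variable remark can be sketched in C as follows. This is an illustration under assumptions, not what the compiler does internally: the function names are hypothetical, the offset is counted in floats rather than bytes (4 floats = 16 bytes), and any remainder when n is not a multiple of 4 is simply ignored.

```c
#include <stddef.h>

/* Shape of the emitted loop: an offset (like %rax) plus a separate
   iteration counter (like %ecx) compared against a trip count (like %ebx). */
static void sum_two_counters(const float *a, const float *b, float *c, size_t n)
{
    size_t trips = n / 4;    /* number of 4-float vector iterations */
    size_t offset = 0;       /* advances by 4 floats per iteration */
    for (size_t count = 0; count < trips; count++) {
        for (size_t k = 0; k < 4; k++)   /* stands in for one addps */
            c[offset + k] = a[offset + k] + b[offset + k];
        offset += 4;
    }
}

/* Same loop with the counter eliminated: the offset alone is compared
   against a precomputed end bound. */
static void sum_one_counter(const float *a, const float *b, float *c, size_t n)
{
    size_t end = (n / 4) * 4;            /* trip count scaled to the offset unit */
    for (size_t offset = 0; offset < end; offset += 4)
        for (size_t k = 0; k < 4; k++)   /* stands in for one addps */
            c[offset + k] = a[offset + k] + b[offset + k];
}
```

Both functions write the same values to c; the second needs one fewer register and one fewer increment per iteration.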
Thanks for any insight,
--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html