Richard Guenther wrote:
On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince <n...@aol.com> wrote:
Toon Moene wrote:
H.J. Lu wrote:
On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene <t...@moene.org> wrote:
L.S.,
Prompted by the discussion on register allocation, I went back to a hobby of mine: studying the assembly output of the compiler.
For this Fortran subroutine (note: unless the Fortran front end is told otherwise, reals are 32-bit floating point numbers):
subroutine sum(a, b, c, n)
   integer i, n
   real a(n), b(n), c(n)
   do i = 1, n
      c(i) = a(i) + b(i)
   enddo
end
with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
        xorps   %xmm2, %xmm2
        ....
.L6:
        movaps  %xmm2, %xmm0
        movaps  %xmm2, %xmm1
        movlps  (%r9,%rax), %xmm0
        movlps  (%r8,%rax), %xmm1
        movhps  8(%r9,%rax), %xmm0
        movhps  8(%r8,%rax), %xmm1
        incl    %ecx
        addps   %xmm1, %xmm0
        movaps  %xmm0, 0(%rbp,%rax)
        addq    $16, %rax
        cmpl    %ebx, %ecx
        jb      .L6
I'm not a master of x86_64 assembly, but it strongly looks like %xmm{0,1} have to be zeroed (%xmm2 is set to zero by xor'ing it with itself) before they are completely filled with the mov{l,h}ps instructions?
I think it is used to avoid a partial SSE register stall.
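To make that concrete, here is a hand-written C sketch of the same split-load pattern using SSE intrinsics (illustrative only: the function name is made up, and this is just an equivalent of the generated loop body, not the compiler's code). movlps and movhps each write only half of the destination register, so each destination is cleared first with xorps, a dependency-breaking idiom, instead of leaving a false dependence on whatever the register held before.

#include <xmmintrin.h>

/* Hand-written illustration of the split-load loop body; not compiler output. */
void add4_split(const float *a, const float *b, float *c)
{
    __m128 va = _mm_setzero_ps();                   /* xorps: break dependence on old contents */
    __m128 vb = _mm_setzero_ps();
    va = _mm_loadl_pi(va, (const __m64 *)a);        /* movlps: load low 64 bits  */
    va = _mm_loadh_pi(va, (const __m64 *)(a + 2));  /* movhps: load high 64 bits */
    vb = _mm_loadl_pi(vb, (const __m64 *)b);
    vb = _mm_loadh_pi(vb, (const __m64 *)(b + 2));
    _mm_store_ps(c, _mm_add_ps(va, vb));            /* addps; the store assumes c is 16-byte
                                                       aligned, matching the movaps store above */
}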
You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for %xmm1) instruction (to copy 4*32 bits to the register)?
If you want those, you must request them with -mtune=barcelona.
Which would then get you movups (%r9,%rax), %xmm0 (unaligned move).
generic tuning prefers the split moves; AMD Fam10 and above handle unaligned moves just fine.
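For comparison, the unaligned-load form discussed here would look roughly like this in the same intrinsics style (again hand-written and illustrative, not actual -mtune=barcelona output): one movups per operand replaces the xorps + movlps + movhps sequence.

#include <xmmintrin.h>

/* Hand-written illustration of the movups variant; not compiler output. */
void add4_movups(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);            /* movups: single unaligned 128-bit load */
    __m128 vb = _mm_loadu_ps(b);
    _mm_store_ps(c, _mm_add_ps(va, vb));    /* store kept aligned, as in the original loop */
}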
Correct, the movaps would have been used if alignment were recognized.
The newer CPUs achieve full performance with movups.
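And the movaps form that recognized alignment would allow looks like this; _mm_load_ps maps to movaps and requires 16-byte-aligned pointers, so the sketch is only valid when a, b and c are all suitably aligned (hand-written illustration, not compiler output).

#include <xmmintrin.h>

/* Hand-written illustration of the fully aligned movaps variant; not compiler output. */
void add4_movaps(const float *a, const float *b, float *c)
{
    __m128 va = _mm_load_ps(a);             /* movaps: aligned 128-bit load */
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(c, _mm_add_ps(va, vb));
}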
Do you consider Core i7/Nehalem as included in "AMD Fam10 and above"?