On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince <n...@aol.com> wrote:
> Toon Moene wrote:
>>
>> H.J. Lu wrote:
>>>
>>> On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene <t...@moene.org> wrote:
>>>>
>>>> L.S.,
>>>>
>>>> Due to the discussion on register allocation, I went back to a hobby of
>>>> mine: Studying the assembly output of the compiler.
>>>>
>>>> For this Fortran subroutine (note: unless otherwise told to the Fortran
>>>> front end, reals are 32 bit floating point numbers):
>>>>
>>>>       subroutine sum(a, b, c, n)
>>>>       integer i, n
>>>>       real a(n), b(n), c(n)
>>>>       do i = 1, n
>>>>          c(i) = a(i) + b(i)
>>>>       enddo
>>>>       end
>>>>
>>>> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>>>>
>>>>         xorps   %xmm2, %xmm2
>>>>         ....
>>>> .L6:
>>>>         movaps  %xmm2, %xmm0
>>>>         movaps  %xmm2, %xmm1
>>>>         movlps  (%r9,%rax), %xmm0
>>>>         movlps  (%r8,%rax), %xmm1
>>>>         movhps  8(%r9,%rax), %xmm0
>>>>         movhps  8(%r8,%rax), %xmm1
>>>>         incl    %ecx
>>>>         addps   %xmm1, %xmm0
>>>>         movaps  %xmm0, 0(%rbp,%rax)
>>>>         addq    $16, %rax
>>>>         cmpl    %ebx, %ecx
>>>>         jb      .L6
>>>>
>>>> I'm not a master of x86_64 assembly, but this strongly looks like
>>>> %xmm{0,1} have to be zero'd (%xmm2 is set to zero by xor'ing it with
>>>> itself), before they are completely filled with the mov{l,h}ps
>>>> instructions ?
>>>>
>>>
>>> I think it is used to avoid partial SSE register stall.
>>>
>>>
>> You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for
>> %xmm1) instruction (to copy 4*32 bits to the register) ?
>>
> If you want those, you must request them with -mtune=barcelona.

Which would then get you movups (%r9,%rax), %xmm0 (unaligned move).
Generic tuning prefers the split moves; AMD Fam10 and above handle
unaligned moves just fine.

Richard.
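
For anyone who wants to compare the two load strategies outside the vectorizer, here is a rough C sketch using SSE intrinsics. It is an illustration, not what GCC does internally: the function names (load_split, load_unaligned, add4_split, add4_unaligned) are made up for this example, and the instruction named in each comment is only what a compiler would typically emit for that intrinsic, which is not guaranteed. It builds on x86/x86-64 only.

```c
#include <assert.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86 / x86-64 only */

/* Split load: start from a zeroed register, then fill the low and high
   64-bit halves separately (roughly movaps + movlps + movhps above). */
static inline __m128 load_split(const float *p)
{
    __m128 v = _mm_setzero_ps();                  /* zeroed register      */
    v = _mm_loadl_pi(v, (const __m64 *)p);        /* low two floats       */
    v = _mm_loadh_pi(v, (const __m64 *)(p + 2));  /* high two floats      */
    return v;
}

/* Single unaligned 128-bit load (roughly movups). */
static inline __m128 load_unaligned(const float *p)
{
    return _mm_loadu_ps(p);
}

/* c[0..3] = a[0..3] + b[0..3], using the split loads. */
static inline void add4_split(const float *a, const float *b, float *c)
{
    _mm_storeu_ps(c, _mm_add_ps(load_split(a), load_split(b)));
}

/* Same computation, using single unaligned loads. */
static inline void add4_unaligned(const float *a, const float *b, float *c)
{
    _mm_storeu_ps(c, _mm_add_ps(load_unaligned(a), load_unaligned(b)));
}
```

Both variants compute the same four sums; the only difference is how the operands reach the registers, which is the part the -mtune setting influences in the loop above.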