On Sat, Nov 28, 2009 at 5:31 PM, Tim Prince <n...@aol.com> wrote:
> Richard Guenther wrote:
>>
>> On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince <n...@aol.com> wrote:
>>>
>>> Toon Moene wrote:
>>>>
>>>> H.J. Lu wrote:
>>>>>
>>>>> On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene <t...@moene.org> wrote:
>>>>>>
>>>>>> L.S.,
>>>>>>
>>>>>> Due to the discussion on register allocation, I went back to a
>>>>>> hobby of mine: Studying the assembly output of the compiler.
>>>>>>
>>>>>> For this Fortran subroutine (note: unless otherwise told to the
>>>>>> Fortran front end, reals are 32 bit floating point numbers):
>>>>>>
>>>>>>       subroutine sum(a, b, c, n)
>>>>>>       integer i, n
>>>>>>       real a(n), b(n), c(n)
>>>>>>       do i = 1, n
>>>>>>          c(i) = a(i) + b(i)
>>>>>>       enddo
>>>>>>       end
>>>>>>
>>>>>> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>>>>>>
>>>>>>         xorps   %xmm2, %xmm2
>>>>>>         ....
>>>>>> .L6:
>>>>>>         movaps  %xmm2, %xmm0
>>>>>>         movaps  %xmm2, %xmm1
>>>>>>         movlps  (%r9,%rax), %xmm0
>>>>>>         movlps  (%r8,%rax), %xmm1
>>>>>>         movhps  8(%r9,%rax), %xmm0
>>>>>>         movhps  8(%r8,%rax), %xmm1
>>>>>>         incl    %ecx
>>>>>>         addps   %xmm1, %xmm0
>>>>>>         movaps  %xmm0, 0(%rbp,%rax)
>>>>>>         addq    $16, %rax
>>>>>>         cmpl    %ebx, %ecx
>>>>>>         jb      .L6
>>>>>>
>>>>>> I'm not a master of x86_64 assembly, but this strongly looks like
>>>>>> %xmm{0,1} have to be zero'd (%xmm2 is set to zero by xor'ing it
>>>>>> with itself), before they are completely filled with the
>>>>>> mov{l,h}ps instructions ?
>>>>>>
>>>>> I think it is used to avoid partial SSE register stall.
>>>>>
>>>>>
>>>> You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for
>>>> %xmm1) instruction (to copy 4*32 bits to the register) ?
>>>>
>>> If you want those, you must request them with -mtune=barcelona.
>>
>> Which would then get you movups (%r9,%rax), %xmm0 (unaligned move).
>> generic tuning prefers the split moves, AMD Fam10 and above handle
>> unaligned moves just fine.
>
> Correct, the movaps would have been used if alignment were recognized.
> The newer CPUs achieve full performance with movups.
> Do you consider Core i7/Nehalem as included in "AMD Fam10 and above?"

I'd have to consult the optimization manuals for those, but H.J. may know
offhand.

Richard.
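
For anyone following along, the two load strategies discussed in this
thread can be sketched in C with SSE intrinsics. This is only an
illustration, not GCC's generated code: the function names (load_split,
sum_split, sum_unaligned) are made up for the example, the compiler is
free to choose other instructions for these intrinsics than the ones
named in the comments, and an x86/x86_64 target with SSE is assumed.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86/x86_64 only */

/* Split load: start from a zeroed register, then fill the low and high
   64-bit halves separately. These intrinsics correspond to xorps,
   movlps and movhps, the pattern seen in the quoted loop. */
static __m128 load_split(const float *p)
{
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, (const __m64 *)p);        /* floats 0..1 */
    v = _mm_loadh_pi(v, (const __m64 *)(p + 2));  /* floats 2..3 */
    return v;
}

/* c[i] = a[i] + b[i], using split loads for the vector part. */
static void sum_split(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    assert(n >= 0);
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(c + i,
                      _mm_add_ps(load_split(a + i), load_split(b + i)));
    for (; i < n; i++)  /* scalar remainder */
        c[i] = a[i] + b[i];
}

/* Same loop with one unaligned 128-bit load per operand
   (_mm_loadu_ps corresponds to movups). */
static void sum_unaligned(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    assert(n >= 0);
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(c + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; i++)  /* scalar remainder */
        c[i] = a[i] + b[i];
}
```

Both functions compute the same result; they differ only in how the
operands reach the registers. If the arrays were known to be 16-byte
aligned, _mm_load_ps (movaps) could replace _mm_loadu_ps.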