On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince <n...@aol.com> wrote:
> Toon Moene wrote:
>>
>> H.J. Lu wrote:
>>>
>>> On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene <t...@moene.org> wrote:
>>>>
>>>> L.S.,
>>>>
>>>> Due to the discussion on register allocation, I went back to a hobby of
>>>> mine: Studying the assembly output of the compiler.
>>>>
>>>> For this Fortran subroutine (note: unless otherwise told to the Fortran
>>>> front end, reals are 32 bit floating point numbers):
>>>>
>>>>     subroutine sum(a, b, c, n)
>>>>     integer i, n
>>>>     real a(n), b(n), c(n)
>>>>     do i = 1, n
>>>>        c(i) = a(i) + b(i)
>>>>     enddo
>>>>     end
>>>>
>>>> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>>>>
>>>>       xorps   %xmm2, %xmm2
>>>>       ....
>>>> .L6:
>>>>       movaps  %xmm2, %xmm0
>>>>       movaps  %xmm2, %xmm1
>>>>       movlps  (%r9,%rax), %xmm0
>>>>       movlps  (%r8,%rax), %xmm1
>>>>       movhps  8(%r9,%rax), %xmm0
>>>>       movhps  8(%r8,%rax), %xmm1
>>>>       incl    %ecx
>>>>       addps   %xmm1, %xmm0
>>>>       movaps  %xmm0, 0(%rbp,%rax)
>>>>       addq    $16, %rax
>>>>       cmpl    %ebx, %ecx
>>>>       jb      .L6
>>>>
>>>> I'm not a master of x86_64 assembly, but it strongly looks like
>>>> %xmm{0,1} have to be zeroed (%xmm2 is set to zero by xor'ing it with
>>>> itself) before they are completely filled with the mov{l,h}ps
>>>> instructions?
>>>>
>>>>
>>>
>>> I think it is used to avoid a partial SSE register stall.
>>>
>>>
>> You mean there's no movaps (%r9,%rax), %xmm0 instruction (and mutatis
>> mutandis for %xmm1) to copy 4*32 bits to the register?
>>
> If you want those, you must request them with -mtune=barcelona.

Which would then get you movups (%r9,%rax), %xmm0 (an unaligned move).
Generic tuning prefers the split moves; AMD Fam10 and above handle
unaligned moves just fine.

Richard.
