On 01/01/2010 09:51 PM, Tim Prince wrote:
Benjamin Redelings I wrote:
Hi,

I have been playing with the GCC vectorizer and examining the assembly
code it produces for dot products over a variable number of elements.
(This comes up surprisingly often in scientific codes.) So far, the
generated code is no faster than non-vectorized code, and I think that
is because I can't find a way to tell the compiler that the target of
a double* is 16-byte aligned.
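
As an aside: the hint being asked for here was later added to GCC as the
__builtin_assume_aligned builtin (GCC 4.7 and later; it did not exist at
the time of this thread). A minimal sketch of the dot product with the
alignment promise, names illustrative:

```cpp
#include <cstddef>

// Dot product with an explicit 16-byte alignment promise.
// __builtin_assume_aligned returns its argument unchanged but tells
// GCC it may emit aligned loads (movapd) in the vectorized loop.
double dot_aligned(const double* p, const double* q, int n)
{
    const double* pa =
        static_cast<const double*>(__builtin_assume_aligned(p, 16));
    const double* qa =
        static_cast<const double*>(__builtin_assume_aligned(q, 16));
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += pa[i] * qa[i];
    return sum;
}
```

The caller is then responsible for actually passing 16-byte-aligned
pointers (e.g. memory from posix_memalign or a C++11 alignas buffer);
violating the promise is undefined behavior.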



From PR 27827 - http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827 :
"I just quickly glanced at the code, and I see that it never uses
"movapd" from memory, which is a key to getting decent performance."

How many people would take advantage of special machinery for some old
CPU, if that's your goal?
Actually, I think a lot of people would. I did some tests using -mtune=barcelona, and on a Core 2 Duo the code was no faster than a no-SSE version, even though movupd instructions were generated.

Color me confused. Does "some old CPU" include all CPUs except the Core i7? I hear that there is little or no penalty for unaligned loads on the Core i7, but apparently the Core 2 still needs aligned loads?




simplifying your example to

double f3(const double* p_, const double* q_, int n)
{
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += p_[i] * q_[i];

    return sum;
}
g++ -S -O3 -march=pentium-m -ffast-math -mtune=barcelona -mfpmath=sse
(options chosen for my discontinued OS on discontinued CPU)
produces loop body

.p2align 5,,24
L4:
movupd (%ebx,%eax), %xmm0
movupd (%ecx,%eax), %xmm2
incl %edx
addl $16, %eax
cmpl %edx, %edi
mulpd %xmm2, %xmm0
addpd %xmm0, %xmm1
ja L4

On CPUs introduced in the last 2 years, movupd should be as fast as
movapd, and -mtune=barcelona should work well in general, not only in
this example.
The bigger difference in performance, for longer loops, would come from
further batching of partial sums, favoring loop trip counts that are
multiples of 4 (or 8, with unrolling). Such alignment already favors a
fairly long loop.
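
As an illustration of the batching described above, here is a sketch of
f3 hand-unrolled with four independent accumulators (the compiler's
unroller can produce the same shape on its own):

```cpp
// Dot product with four independent partial sums, so the additions of
// different accumulators can overlap in the pipeline rather than
// serializing on one sum.  The tail loop handles n not divisible by 4.
double dot4(const double* p, const double* q, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i]     * q[i];
        s1 += p[i + 1] * q[i + 1];
        s2 += p[i + 2] * q[i + 2];
        s3 += p[i + 3] * q[i + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)          // remainder elements
        sum += p[i] * q[i];
    return sum;
}
```

Note that reassociating the sum this way changes rounding slightly,
which is why the vectorizer only does it under -ffast-math.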

As you're using C++, it seems you could have used inner_product() rather
than writing out a function.
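
For completeness, the inner_product() form mentioned above would look
like this (std::inner_product from <numeric>):

```cpp
#include <numeric>

// Equivalent of f3 using the standard library: std::inner_product
// multiplies element pairs from the two ranges and accumulates them
// into the initial value 0.0.
double dot_std(const double* p, const double* q, int n)
{
    return std::inner_product(p, p + n, q, 0.0);
}
```

The initial value must be 0.0 (not 0), or the accumulation is done in
int and the fractional parts are truncated.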

My Core i7 showed a 25x25 by 25x100 matrix multiply producing 17 GFLOPS
with gfortran in-line code. g++ produces about 80% of that.



