Benjamin Redelings I wrote:
Hi,

I have been playing with the GCC vectorizer and examining assembly code that is produced for dot products that are not for a fixed number of elements. (This comes up surprisingly often in scientific codes.) So far, the generated code is not faster than non-vectorized code, and I think that it is because I can't find a way to tell the compiler that the target of a double* is 16-byte aligned.
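
A minimal sketch of the kind of hint I mean, using GCC's __builtin_assume_aligned builtin (only present where the compiler supports it, so this is an illustration of what I'm after rather than something I can rely on):

double dot(const double* p_, const double* q_, int n)
{
      // Promise the compiler that both arrays are 16-byte aligned;
      // the builtin returns void*, so the result must be cast back.
      const double* p = static_cast<const double*>(__builtin_assume_aligned(p_, 16));
      const double* q = static_cast<const double*>(__builtin_assume_aligned(q_, 16));

      double sum = 0;
      for (int i = 0; i < n; i++)
            sum += p[i] * q[i];
      return sum;
}

With that promise the vectorizer could use aligned loads (movapd) instead of unaligned ones (movupd).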

From PR 27827 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827):
"I just quickly glanced at the code, and I see that it never uses "movapd" from memory, which is a key to getting decent performance."

If the goal is peak performance on some old CPU, how many people would actually take advantage of special machinery for it?

Simplifying your example to:

double f3(const double* p_, const double* q_, int n)
{
      double sum = 0;
      for (int i = 0; i < n; i++)
            sum += p_[i] * q_[i];

      return sum;
}
g++ -S -O3 -march=pentium-m -ffast-math -mtune=barcelona -mfpmath=sse
(options chosen for my discontinued OS on a discontinued CPU)
produces this loop body:

         .p2align 5,,24
 L4:
         movupd  (%ebx,%eax), %xmm0
         movupd  (%ecx,%eax), %xmm2
         incl    %edx
         addl    $16, %eax
         cmpl    %edx, %edi
         mulpd   %xmm2, %xmm0
         addpd   %xmm0, %xmm1
         ja      L4

On CPUs introduced in the last two years, movupd should be as fast as movapd, and -mtune=barcelona should work well in general, not only in this example. For longer loops, the bigger performance difference would come from further batching of partial sums, favoring loop lengths that are multiples of 4 (or 8, with unrolling). That alignment already favors a fairly long loop.
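
To illustrate the batching, here is a hand-unrolled scalar sketch with four independent partial sums; it shows the general technique, not the exact code either compiler emits (f3_batched is only an illustrative name):

double f3_batched(const double* p, const double* q, int n)
{
      // Four independent accumulators let the additions overlap in the
      // pipeline instead of serializing on one running sum.
      double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      int i = 0;
      for (; i + 4 <= n; i += 4) {
            s0 += p[i]   * q[i];
            s1 += p[i+1] * q[i+1];
            s2 += p[i+2] * q[i+2];
            s3 += p[i+3] * q[i+3];
      }
      double sum = (s0 + s1) + (s2 + s3);
      for (; i < n; i++)      // remainder when n is not a multiple of 4
            sum += p[i] * q[i];
      return sum;
}

-ffast-math is what permits the compiler to reassociate the sum this way on its own; without it, the strict left-to-right summation order blocks the transformation.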

As you're using C++, it seems you could have used std::inner_product() rather than writing out the function.
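
That is, something along these lines (f3_std is only an illustrative name):

#include <numeric>

// Same computation via the standard library; with -O3 -ffast-math this
// should vectorize much like the hand-written loop.
double f3_std(const double* p, const double* q, int n)
{
      return std::inner_product(p, p + n, q, 0.0);
}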

On my Core i7, a 25x25 by 25x100 matrix multiply reaches 17 Gflops with gfortran in-line code; g++ produces about 80% of that.

