Thanks for the information!

How many people would take advantage of special machinery for some old
CPU, if that's your goal?
Some, but I suppose the old machinery will be gone eventually. But, yes, I am most interested in current processors.

<assembly with movupd snipped>

On CPUs introduced in the last 2 years, movupd should be as fast as
movapd,
OK, I didn't know this.  Thanks for the information!

 and -mtune=barcelona should work well in general, not only in
this example.
The bigger difference in performance, for longer loops, would come with
further batching of sums, favoring loop lengths of multiples of 4 (or 8,
with unrolling). That alignment already favors a fairly long loop.
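If I understand the batching suggestion, it amounts to keeping several independent partial sums so the adds can overlap, roughly like the sketch below (x, y, and n are placeholder names, and it assumes n is a multiple of 4; a remainder loop would handle the rest):

    // Rough sketch of "batching of sums": four independent accumulators,
    // combined only after the loop.  Assumes n % 4 == 0.
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int j = 0; j < n; j += 4) {
        s0 += x[j]     * y[j];
        s1 += x[j + 1] * y[j + 1];
        s2 += x[j + 2] * y[j + 2];
        s3 += x[j + 3] * y[j + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);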

As you're using C++, it seems you could have used inner_product() rather
than writing out a function.

That was a reduced test case. The code that I'm modifying is doing two simultaneous inner products with the same number of iterations:

                for (int j = 0; j < kStateCount; j++) {
                    sum1 += matrices1w[j] * partials1v[j];
                    sum2 += matrices2w[j] * partials2v[j];
                }

I tried using two separate calls to inner_product, and it turns out to be slightly slower. GCC does not fuse the loops.
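For reference, the inner_product version I tried looked roughly like this (a sketch, assuming the arrays are plain double* of length kStateCount):

    #include <numeric>

    // Two separate inner products over the same index range.
    // GCC compiles these as two loops; it does not fuse them.
    void twoDotProducts(const double* matrices1w, const double* partials1v,
                        const double* matrices2w, const double* partials2v,
                        int kStateCount, double& sum1, double& sum2) {
        sum1 = std::inner_product(matrices1w, matrices1w + kStateCount,
                                  partials1v, sum1);
        sum2 = std::inner_product(matrices2w, matrices2w + kStateCount,
                                  partials2v, sum2);
    }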

My Core i7 showed a 25x25 times 25x100 matrix multiply producing 17 Gflops
with gfortran in-line code. g++ produces about 80% of that.

So, one reason that I incorrectly assumed that movapd is necessary for good performance is that the SSE code is actually being matched in performance by non-SSE code, on a Core 2 processor with the x86-64 ABI. I expected the SSE code to be two times faster if vectorization was working, since I am using double precision. But perhaps SSE should not be expected to give (much of) a performance advantage here?

For a recent gcc 4.5 with CXXFLAGS="-O3 -ffast-math -fno-tree-vectorize -march=native -mno-sse2 -mno-sse3 -mno-sse4" I got this code for the inner loop:

    be00:       dd 04 07                fldl   (%rdi,%rax,1)
    be03:       dc 0c 01                fmull  (%rcx,%rax,1)
    be06:       de c1                   faddp  %st,%st(1)
    be08:       dd 04 06                fldl   (%rsi,%rax,1)
    be0b:       dc 0c 02                fmull  (%rdx,%rax,1)
    be0e:       48 83 c0 08             add    $0x8,%rax
    be12:       de c2                   faddp  %st,%st(2)
    be14:       4c 39 c0                cmp    %r8,%rax
    be17:       75 e7                   jne    be00

Using the alternative CXXFLAGS="-O3 -march=native -g -ffast-math -mtune=generic" I get:

 1f1:   66 0f 57 c9             xorpd  %xmm1,%xmm1
 1f5:   31 c0                   xor    %eax,%eax
 1f7:   31 d2                   xor    %edx,%edx
 1f9:   66 0f 28 d1             movapd %xmm1,%xmm2
 1fd:   0f 1f 00                nopl   (%rax)

 200:   f2 42 0f 10 1c 10       movsd  (%rax,%r10,1),%xmm3
 206:   83 c2 01                add    $0x1,%edx
 209:   f2 42 0f 10 24 00       movsd  (%rax,%r8,1),%xmm4
 20f:   66 41 0f 16 5c 02 08    movhpd 0x8(%r10,%rax,1),%xmm3
 216:   66 42 0f 16 64 00 08    movhpd 0x8(%rax,%r8,1),%xmm4
 21d:   66 0f 28 c3             movapd %xmm3,%xmm0
 221:   f2 41 0f 10 1c 03       movsd  (%r11,%rax,1),%xmm3
 227:   66 0f 59 c4             mulpd  %xmm4,%xmm0
 22b:   66 41 0f 16 5c 03 08    movhpd 0x8(%r11,%rax,1),%xmm3
 232:   f2 42 0f 10 24 08       movsd  (%rax,%r9,1),%xmm4
 238:   66 42 0f 16 64 08 08    movhpd 0x8(%rax,%r9,1),%xmm4
 23f:   48 83 c0 10             add    $0x10,%rax
 243:   39 ea                   cmp    %ebp,%edx
 245:   66 0f 58 d0             addpd  %xmm0,%xmm2
 249:   66 0f 28 c3             movapd %xmm3,%xmm0
 24d:   66 0f 59 c4             mulpd  %xmm4,%xmm0
 251:   66 0f 58 c8             addpd  %xmm0,%xmm1
 255:   72 a9                   jb     200

 257:   44 39 f3                cmp    %r14d,%ebx
 25a:   66 0f 7c c9             haddpd %xmm1,%xmm1
 25e:   44 89 f0                mov    %r14d,%eax
 261:   66 0f 7c d2             haddpd %xmm2,%xmm2

(Note the presence of movsd / movhpd instead of movupd.)
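For what it's worth, I suppose the hand-written equivalent with explicit unaligned packed loads would look something like this SSE2 intrinsics sketch (untested; assumes kStateCount is even and the pointers need not be 16-byte aligned):

    #include <emmintrin.h>

    // Sketch: the two dot products using unaligned packed loads (movupd)
    // instead of the movsd/movhpd pairs gcc emitted above.
    // Assumes kStateCount is even; a scalar tail would handle the rest.
    __m128d acc1 = _mm_setzero_pd();
    __m128d acc2 = _mm_setzero_pd();
    for (int j = 0; j + 1 < kStateCount; j += 2) {
        __m128d m1 = _mm_loadu_pd(matrices1w + j);
        __m128d p1 = _mm_loadu_pd(partials1v + j);
        __m128d m2 = _mm_loadu_pd(matrices2w + j);
        __m128d p2 = _mm_loadu_pd(partials2v + j);
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(m1, p1));
        acc2 = _mm_add_pd(acc2, _mm_mul_pd(m2, p2));
    }
    // Horizontal sums of the two-element accumulators.
    double tmp1[2], tmp2[2];
    _mm_storeu_pd(tmp1, acc1);
    _mm_storeu_pd(tmp2, acc2);
    sum1 += tmp1[0] + tmp1[1];
    sum2 += tmp2[0] + tmp2[1];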

So... should I expect the SSE code to be any faster? If not, could you possibly say why not? Are there other operations (besides inner products) where SSE code would actually be expected to be faster?

-BenRI
