Thanks for the information!
How many people would take advantage of special machinery for some old
CPU, if that's your goal?
Some, but I suppose the old machinery will be gone eventually. But,
yes, I am most interested in current processors.
<assembly with movupd snipped>
On CPUs introduced in the last 2 years, movupd should be as fast as
movapd,
OK, I didn't know this. Thanks for the information!
and -mtune=barcelona should work well in general, not only in
this example.
The bigger difference in performance, for longer loops, would come with
further batching of sums, favoring loop lengths that are multiples of 4
(or 8, with unrolling). That alignment already favors a fairly long loop.
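If I understand the "batching of sums" suggestion correctly, it amounts to
something like the sketch below: several independent accumulators so the
additions don't form one long dependency chain. This is my own
illustration with placeholder names (a, b, n), not code from my project:

    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int j = 0;
    for (; j + 4 <= n; j += 4) {
        s0 += a[j]     * b[j];
        s1 += a[j + 1] * b[j + 1];
        s2 += a[j + 2] * b[j + 2];
        s3 += a[j + 3] * b[j + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; j < n; j++)          // remainder when n is not a multiple of 4
        sum += a[j] * b[j];

Is that what you had in mind?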
As you're using C++, it seems you could have used inner_product() rather
than writing out a function.
That was a reduced test case. The code that I'm modifying is doing two
simultaneous inner products with the same number of iterations:
for (int j = 0; j < kStateCount; j++) {
    sum1 += matrices1w[j] * partials1v[j];
    sum2 += matrices2w[j] * partials2v[j];
}
I tried using two separate calls to inner_product, and it turns out to
be slightly slower. GCC does not fuse the loops.
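For reference, the inner_product version I tried looked roughly like this
(the exact spelling in my code may differ slightly):

    // two separate reductions; needs <numeric>
    sum1 = std::inner_product(matrices1w, matrices1w + kStateCount,
                              partials1v, sum1);
    sum2 = std::inner_product(matrices2w, matrices2w + kStateCount,
                              partials2v, sum2);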
My Core i7 showed a 25x25 times 25x100 matrix multiply producing 17 Gflops
with gfortran in-line code. g++ produces about 80% of that.
So, one reason I incorrectly assumed that movapd is necessary for
good performance is that the SSE code is actually being matched in
performance by non-SSE code - on a Core 2 processor with the x86_64 ABI.
I expected the SSE code to be about twice as fast, if vectorization was
working, since I am using double precision. But perhaps SSE should not
be expected to give (much of) a performance advantage here?
For a recent gcc 4.5 with CXXFLAGS="-O3 -ffast-math -fno-tree-vectorize
-march=native -mno-sse2 -mno-sse3 -mno-sse4" I got this code for the
inner loop:
be00: dd 04 07 fldl (%rdi,%rax,1)
be03: dc 0c 01 fmull (%rcx,%rax,1)
be06: de c1 faddp %st,%st(1)
be08: dd 04 06 fldl (%rsi,%rax,1)
be0b: dc 0c 02 fmull (%rdx,%rax,1)
be0e: 48 83 c0 08 add $0x8,%rax
be12: de c2 faddp %st,%st(2)
be14: 4c 39 c0 cmp %r8,%rax
be17: 75 e7 jne be00
Using alternative CXXFLAGS="-O3 -march=native -g -ffast-math
-mtune=generic" I get:
1f1: 66 0f 57 c9 xorpd %xmm1,%xmm1
1f5: 31 c0 xor %eax,%eax
1f7: 31 d2 xor %edx,%edx
1f9: 66 0f 28 d1 movapd %xmm1,%xmm2
1fd: 0f 1f 00 nopl (%rax)
200: f2 42 0f 10 1c 10 movsd (%rax,%r10,1),%xmm3
206: 83 c2 01 add $0x1,%edx
209: f2 42 0f 10 24 00 movsd (%rax,%r8,1),%xmm4
20f: 66 41 0f 16 5c 02 08 movhpd 0x8(%r10,%rax,1),%xmm3
216: 66 42 0f 16 64 00 08 movhpd 0x8(%rax,%r8,1),%xmm4
21d: 66 0f 28 c3 movapd %xmm3,%xmm0
221: f2 41 0f 10 1c 03 movsd (%r11,%rax,1),%xmm3
227: 66 0f 59 c4 mulpd %xmm4,%xmm0
22b: 66 41 0f 16 5c 03 08 movhpd 0x8(%r11,%rax,1),%xmm3
232: f2 42 0f 10 24 08 movsd (%rax,%r9,1),%xmm4
238: 66 42 0f 16 64 08 08 movhpd 0x8(%rax,%r9,1),%xmm4
23f: 48 83 c0 10 add $0x10,%rax
243: 39 ea cmp %ebp,%edx
245: 66 0f 58 d0 addpd %xmm0,%xmm2
249: 66 0f 28 c3 movapd %xmm3,%xmm0
24d: 66 0f 59 c4 mulpd %xmm4,%xmm0
251: 66 0f 58 c8 addpd %xmm0,%xmm1
255: 72 a9 jb 200
257: 44 39 f3 cmp %r14d,%ebx
25a: 66 0f 7c c9 haddpd %xmm1,%xmm1
25e: 44 89 f0 mov %r14d,%eax
261: 66 0f 7c d2 haddpd %xmm2,%xmm2
(Note the presence of movsd / movhpd instead of movupd.)
So... should I expect the SSE code to be any faster? If not, could you
possibly say why not? Are there other operations (besides inner
products) where SSE code would actually be expected to be faster?
-BenRI