Thanks for the information!
How many people would take advantage of special machinery for some old
CPU, if that's your goal?
Some, but I suppose the old machinery will be gone eventually. But,
yes, I am most interested in current processors.
<assembly with movupd snipped>
On CPUs introduced in the last 2 years, movupd should be as fast as
movapd,
OK, I didn't know this. Thanks for the information!
and -mtune=barcelona should work well in general, not only in
this example.
The bigger difference in performance, for longer loops, would come with
further batching of sums, favoring loop lengths that are multiples of 4
(or 8, with unrolling). That alignment already favors a fairly long loop.
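If I understand the "batching of sums" suggestion correctly, it amounts to
something like the sketch below: several independent accumulators so the
additions don't form one long dependency chain. This is my own
illustration with placeholder names (a, b, n), not code from my project:

    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int j = 0;
    for (; j + 4 <= n; j += 4) {
        s0 += a[j]     * b[j];
        s1 += a[j + 1] * b[j + 1];
        s2 += a[j + 2] * b[j + 2];
        s3 += a[j + 3] * b[j + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; j < n; j++)          // remainder when n is not a multiple of 4
        sum += a[j] * b[j];

Is that what you had in mind?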
As you're using C++, it seems you could have used inner_product() rather
than writing out a function.
That was a reduced test case. The code that I'm modifying is doing two
simultaneous inner products with the same number of iterations:
for (int j = 0; j < kStateCount; j++) {
    sum1 += matrices1w[j] * partials1v[j];
    sum2 += matrices2w[j] * partials2v[j];
}
I tried using two separate calls to inner_product, and it turns out to
be slightly slower. GCC does not fuse the loops.
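For reference, the inner_product version I tried looked roughly like this
(the exact spelling in my code may differ slightly):

    // two separate reductions; needs <numeric>
    sum1 = std::inner_product(matrices1w, matrices1w + kStateCount,
                              partials1v, sum1);
    sum2 = std::inner_product(matrices2w, matrices2w + kStateCount,
                              partials2v, sum2);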
My Core i7 showed a 25x25 times 25x100 matrix multiply producing 17 Gflops
with gfortran in-line code. g++ produces about 80% of that.
So, one reason I incorrectly assumed that movapd is necessary for
good performance is that the SSE code is actually being matched in
performance by non-SSE code - on a Core 2 processor with the x86_64 ABI.
I expected the SSE code to be about twice as fast, if vectorization was
working, since I am using double precision. But perhaps SSE should not
be expected to give (much of) a performance advantage here?
For a recent gcc 4.5 with CXXFLAGS="-O3 -ffast-math -fno-tree-vectorize
-march=native -mno-sse2 -mno-sse3 -mno-sse4" I got this code for the
inner loop:
be00: dd 04 07 fldl (%rdi,%rax,1)
be03: dc 0c 01 fmull (%rcx,%rax,1)
be06: de c1 faddp %st,%st(1)
be08: dd 04 06 fldl (%rsi,%rax,1)
be0b: dc 0c 02 fmull (%rdx,%rax,1)
be0e: 48 83 c0 08 add $0x8,%rax
be12: de c2 faddp %st,%st(2)
be14: 4c 39 c0 cmp %r8,%rax
be17: 75 e7 jne be00
Using alternative CXXFLAGS="-O3 -march=native -g -ffast-math
-mtune=generic" I get:
1f1: 66 0f 57 c9 xorpd %xmm1,%xmm1
1f5: 31 c0 xor %eax,%eax
1f7: 31 d2 xor %edx,%edx
1f9: 66 0f 28 d1 movapd %xmm1,%xmm2
1fd: 0f 1f 00 nopl (%rax)
200: f2 42 0f 10 1c 10 movsd (%rax,%r10,1),%xmm3
206: 83 c2 01 add $0x1,%edx
209: f2 42 0f 10 24 00 movsd (%rax,%r8,1),%xmm4
20f: 66 41 0f 16 5c 02 08 movhpd 0x8(%r10,%rax,1),%xmm3
216: 66 42 0f 16 64 00 08 movhpd 0x8(%rax,%r8,1),%xmm4
21d: 66 0f 28 c3 movapd %xmm3,%xmm0
221: f2 41 0f 10 1c 03 movsd (%r11,%rax,1),%xmm3
227: 66 0f 59 c4 mulpd %xmm4,%xmm0
22b: 66 41 0f 16 5c 03 08 movhpd 0x8(%r11,%rax,1),%xmm3
232: f2 42 0f 10 24 08 movsd (%rax,%r9,1),%xmm4
238: 66 42 0f 16 64 08 08 movhpd 0x8(%rax,%r9,1),%xmm4
23f: 48 83 c0 10 add $0x10,%rax
243: 39 ea cmp %ebp,%edx
245: 66 0f 58 d0 addpd %xmm0,%xmm2
249: 66 0f 28 c3 movapd %xmm3,%xmm0
24d: 66 0f 59 c4 mulpd %xmm4,%xmm0
251: 66 0f 58 c8 addpd %xmm0,%xmm1
255: 72 a9 jb 200
257: 44 39 f3 cmp %r14d,%ebx
25a: 66 0f 7c c9 haddpd %xmm1,%xmm1
25e: 44 89 f0 mov %r14d,%eax
261: 66 0f 7c d2 haddpd %xmm2,%xmm2
(Note the presence of movsd / movhpd instead of movupd.)
So... should I expect the SSE code to be any faster? If not, could you
possibly say why not? Are there other operations (besides inner
products) where SSE code would actually be expected to be faster?
-BenRI