[sage-devel] Re: SSE2 not so useless after all

Bill Hart Thu, 22 May 2008 06:46:27 -0700

The asymptotics really only kick in when we are using Strassen. The
starting point is thus not 0, but the crossover point between M4RM and
Strassen. So I don't think you can read too much into the asymptotic
statement. But once the asymptotics do kick in, we aren't too far off.
The only thing throwing them off slightly is the inexact cut business,
which my patch mitigates to some extent, though not totally.

I wouldn't be surprised to find we were a factor of 1.5 or 2 from
optimal, but this would be because we could probably do more work in
our base case without falling out of cache.

There is still room for improvement. A few percent can be gained by
splitting A horizontally into two matrices.

I also believe we could get a significant gain by allowing exact cuts,
i.e. not along 64 bit boundaries. This could be done by copying out
the data and shifting it (for the shifting we'd benefit from SSE on
both the Opteron and C2D), from the two submatrices on the right when
doing Strassen. But this would also dramatically increase memory
usage, so I'm not sure how pronounced the overall effect would be.
Perhaps we'd gain 10-15%. The other problem would be that we'd have to
mask the end of each row in the submatrices on the left hand side
before using them, so this could complicate the code quite a bit.

Another option would be to represent our matrices a different way.
We'd store a 32000 by 32000 matrix as a series of 32000 x 2000
matrices, padded with zeroes out to a 64 bit boundary. Addition of
matrices wouldn't be affected much by this, but the m4rm code would
have to be substantially altered in the way it deals with the matrix
A.

At any rate I think most of the possible substantial improvements now
result in yucky code.

Bill.

On 22 May, 11:21, Martin Albrecht <[EMAIL PROTECTED]>
wrote:
> On Thursday 22 May 2008, Bill Hart wrote:
>
> > Hi Martin,
>
> > This version works great. Here are the times on my unburdened 2.8Ghz
> > Opteron. First the Magma times, then the times for an older version of
> > m4ri and now, for the first time ever, the new Magma beating times:
>
> > 10000x10000: 2.940s 3.13s 2.25s
>
> > 16384x16384: 9.250s 12.96s 8.80s
>
> > 20000x20000: 16.57s 22.43s 15.48
>
> > 32000x32000: 59.1s 90.2s 57.8s
>
> Bill, I suppose that also means that now we actually beat (or are close to
> beating) Magma on the C2D "for real". My M4RI times are quite similar on the
> C2D as your times on your Opteron. But my version of Magma (on the C2D) is
> much worse than your version of Magma (on the Opteron). So it is probably
> best to assume at least your times for Magma on my machine too.
>
> Martin
>
> PS: I wonder if this argument makes sense:
>
> We have a complexity of n^2.807 and a CPU of say 2.333 Ghz which can operate
> on 64 bits per clock (128 if we use SSE2). So if we had optimal code (no
> missed branch predictions, no caching issues, everything optimal) we would
> expect a running time of    n^2.807 / 64 / 2.333 / 10^9
>
> If we plug in 20,000 for n we'd get 7.923 seconds w.o. SSE2 and 3.961 with
> SSE2. So our implementation (12.2 s) is a factor of ~1.5 or ~3 away from
> being optimal? Does that sound correct or complete bollocks?
>
> --
> name: Martin Albrecht
> _pgp:http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> _www:http://www.informatik.uni-bremen.de/~malb
> _jab: [EMAIL PROTECTED]
--~--~---------~--~----~------------~-------~--~----~
To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org
-~----------~----~----~----~------~----~------~--~---

[sage-devel] Re: SSE2 not so useless after all

Reply via email to