I think it might just be possible to get down to the speed of Magma
with a highly optimised classical multiplication routine. At 3600X3600
one clearly has to do 3600x3600 scalar products of a row by a column.
We'll assume one of the matrices has been transposed to facilitate
this.

If we use SSE2 then we can operate 128 bits at a time. There are 16
SSE registers.

The basic idea would be to load 4 SSE registers from different rows of
matrix A and 2 from different columns of matrix B. We compute the
scalar products of all 8 combinations of rowsxcolumns simultaneously.
For this we need 2 temporary registers and 8 registers to hold the
running totals. So all in all we need 4+2+2+8 = 16 registers. We only
need to do an AND and an XOR to do the multiplication and addition
required.

Caching becomes irrelevant if we choose a large selection of rows from
A and a large selection of columns from B and do all the possible
scalar products or rows by columns before moving to the next part.

Assuming the Opteron can do the memory loads at the same time as doing
arithmetic not depending on those loads, careful instruction
scheduling should get everything down to the cost of an SSE AND and an
SSE OR per 128x128 bit pair in the classical algorithm. That puts the
entire algorithm at pretty close to what Magma is doing timewise.

Another option is to not use SSE and just use the 64 bit integer
registers. The disadvantage is one has to load things 64 bits at a
time. But the advantage is the Opteron can do three arithmetic
operations per cycle if properly scheduled. That gets 50% more work
done than the SSE registers, which can only do 1 x 128 bit operation
per cycle.

Of course I'm making the assumption here that the Opteron can indeed
do loads at the same time as arithmetic.

if not, then there is no way Magma can be using classical
multiplication out to 3600x3600 since there are simply too many
operations to perform in the number of cycles available.

Bill.

On 16 May, 00:03, Martin Albrecht <[EMAIL PROTECTED]>
wrote:
> On Thursday 15 May 2008, Bill Hart wrote:
>
> > Here is the graph of Magma times:
>
> >http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png
>
> > The crossover is not clear. The change from a smooth curve to a
> > squiggly line is about 3600. So presumably that is it, but the graph
> > also seems to change character at about 6200 or 7000 as well. One of
> > these changes may be cache related.
> > At 3600, the total data for all three matrices is almost 5mb and the
> > cache on my machine is 1024kb. But if Magma is using classical
> > multiplication, then this is pretty much irrelevant anyway, since you
> > can keep the working data within the cache for quite a while during
> > the algorithm.
>
> On the other hand: a squiggly line is what one one would expect for
> Strassen-Winograd due to the extra rows/columns that have to be taken care
> of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel
> once gave a talk about his implementation and stated that they don't use
> classical block multiplication (I saw some slides with that remark).
>
> Martin
>
> --
> name: Martin Albrecht
> _pgp:http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> _www:http://www.informatik.uni-bremen.de/~malb
> _jab: [EMAIL PROTECTED]
--~--~---------~--~----~------------~-------~--~----~
To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org
-~----------~----~----~----~------~----~------~--~---

Reply via email to