I think it might just be possible to get down to the speed of Magma with a highly optimised classical multiplication routine. At 3600X3600 one clearly has to do 3600x3600 scalar products of a row by a column. We'll assume one of the matrices has been transposed to facilitate this.
If we use SSE2 then we can operate 128 bits at a time. There are 16 SSE registers. The basic idea would be to load 4 SSE registers from different rows of matrix A and 2 from different columns of matrix B. We compute the scalar products of all 8 combinations of rowsxcolumns simultaneously. For this we need 2 temporary registers and 8 registers to hold the running totals. So all in all we need 4+2+2+8 = 16 registers. We only need to do an AND and an XOR to do the multiplication and addition required. Caching becomes irrelevant if we choose a large selection of rows from A and a large selection of columns from B and do all the possible scalar products or rows by columns before moving to the next part. Assuming the Opteron can do the memory loads at the same time as doing arithmetic not depending on those loads, careful instruction scheduling should get everything down to the cost of an SSE AND and an SSE OR per 128x128 bit pair in the classical algorithm. That puts the entire algorithm at pretty close to what Magma is doing timewise. Another option is to not use SSE and just use the 64 bit integer registers. The disadvantage is one has to load things 64 bits at a time. But the advantage is the Opteron can do three arithmetic operations per cycle if properly scheduled. That gets 50% more work done than the SSE registers, which can only do 1 x 128 bit operation per cycle. Of course I'm making the assumption here that the Opteron can indeed do loads at the same time as arithmetic. if not, then there is no way Magma can be using classical multiplication out to 3600x3600 since there are simply too many operations to perform in the number of cycles available. Bill. On 16 May, 00:03, Martin Albrecht <[EMAIL PROTECTED]> wrote: > On Thursday 15 May 2008, Bill Hart wrote: > > > Here is the graph of Magma times: > > >http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png > > > The crossover is not clear. The change from a smooth curve to a > > squiggly line is about 3600. So presumably that is it, but the graph > > also seems to change character at about 6200 or 7000 as well. One of > > these changes may be cache related. > > At 3600, the total data for all three matrices is almost 5mb and the > > cache on my machine is 1024kb. But if Magma is using classical > > multiplication, then this is pretty much irrelevant anyway, since you > > can keep the working data within the cache for quite a while during > > the algorithm. > > On the other hand: a squiggly line is what one one would expect for > Strassen-Winograd due to the extra rows/columns that have to be taken care > of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel > once gave a talk about his implementation and stated that they don't use > classical block multiplication (I saw some slides with that remark). > > Martin > > -- > name: Martin Albrecht > _pgp:http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99 > _www:http://www.informatik.uni-bremen.de/~malb > _jab: [EMAIL PROTECTED] --~--~---------~--~----~------------~-------~--~----~ To post to this group, send email to sage-devel@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/sage-devel URLs: http://www.sagemath.org -~----------~----~----~----~------~----~------~--~---