I tried cache blocking matrix B, but the times are exactly the same. I think the processor is happy to keep loading B one row at a time sequentially, and since it is only done a handful of times in the algorithm, it accounts for little of the runtime.
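For concreteness, here is a minimal sketch of the kind of row-blocking of B
described above, for word-packed GF(2) matrices. The layout, the names and
the BLOCK_ROWS value are illustrative assumptions, not code from the M4RI
sources:

    /* Illustrative only: walk B in blocks of rows small enough to stay
       resident in cache, instead of streaming all of B for every row
       of A.  Rows are packed 64 columns per uint64_t word. */
    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK_ROWS 256  /* hypothetical: tune so a block of B fits in L2 */

    /* C += A*B over GF(2); A is m x n bits, B has n rows of 'width' words. */
    void mul_blocked(uint64_t **C, uint64_t **A, uint64_t **B,
                     size_t m, size_t n, size_t width)
    {
        for (size_t jb = 0; jb < n; jb += BLOCK_ROWS) {
            size_t jend = jb + BLOCK_ROWS < n ? jb + BLOCK_ROWS : n;
            for (size_t i = 0; i < m; i++)          /* every row of A revisits...    */
                for (size_t j = jb; j < jend; j++)  /* ...only this block of B rows */
                    if ((A[i][j / 64] >> (j % 64)) & 1)
                        for (size_t w = 0; w < width; w++)
                            C[i][w] ^= B[j][w];     /* GF(2) row addition */
        }
    }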
It might go faster in combination with storing the gray tables, but the
optimal block size seems to be about 100 (*k) for the matrix B, which
would necessitate storing 100 gray tables, and that is an awful lot of
memory. I think it is probably best to avoid that kind of memory usage.
So I am not sure where the other factor of 2 will come from.

Bill.

On 16 May, 14:37, Bill Hart <[EMAIL PROTECTED]> wrote:
> Here is the modified _mzd_mul_m4rm_impl function which gives this
> speedup:
>
> http://sage.math.washington.edu/home/wbhart/m4rmul.c
>
> I used a crossover of 3600, and I indicate at the top of this file
> how the constants should be changed to get the speedups for the
> various values of n. I didn't put any code in to automatically choose
> the correct values.
>
> I am presuming that the test_multiplication code actually tests this
> function; otherwise maybe my code is just rubbish.
>
> The basic idea is to cache block the A matrix into groups of BLOCK
> rows. If one also cache blocked the B matrix and left the computed
> gray tables in memory, instead of overwriting and recalculating them
> all the time as I currently do, one could probably get the other
> factor of 2 that we need.
>
> Note I've turned SSE off in this function, but left it on in
> _mzd_combine, since it makes no real difference there.
>
> Bill.
>
> On 16 May, 14:16, Bill Hart <[EMAIL PROTECTED]> wrote:
> > 10000x10000 is down to 4.5s now, so nearly a 2x speedup.
> >
> > 20000x20000 is down to 32.0s, so again nearly a 2x speedup.
> >
> > Bill.
> >
> > On 16 May, 13:53, Bill Hart <[EMAIL PROTECTED]> wrote:
> > > I made some changes to the original code so it would use the
> > > cache a bit better. The test code seems to pass, so I don't think
> > > I've screwed anything up.
> > >
> > > The time for 16384x16384 on my machine is now 20s, so a factor of
> > > 2.15x faster. The time for 2000x2000 also seems to be the same as
> > > Magma's now. Hopefully I didn't just mistime things before, and
> > > this is a real improvement.
> > >
> > > I am still trying things out.
> > >
> > > Bill.
> > >
> > > On 16 May, 01:41, Bill Hart <[EMAIL PROTECTED]> wrote:
> > > > I think it might just be possible to get down to the speed of
> > > > Magma with a highly optimised classical multiplication routine.
> > > > At 3600x3600 one clearly has to do 3600x3600 scalar products of
> > > > a row by a column. We'll assume one of the matrices has been
> > > > transposed to facilitate this.
> > > >
> > > > If we use SSE2 then we can operate on 128 bits at a time. There
> > > > are 16 SSE registers.
> > > >
> > > > The basic idea would be to load 4 SSE registers from different
> > > > rows of matrix A and 2 from different columns of matrix B. We
> > > > compute the scalar products of all 8 combinations of rows x
> > > > columns simultaneously. For this we need 2 temporary registers
> > > > and 8 registers to hold the running totals. So all in all we
> > > > need 4+2+2+8 = 16 registers. We only need to do an AND and an
> > > > XOR to do the multiplication and addition required.
> > > >
> > > > Caching becomes irrelevant if we choose a large selection of
> > > > rows from A and a large selection of columns from B and do all
> > > > the possible scalar products of rows by columns before moving
> > > > to the next part.
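A minimal sketch of the 4x2 SSE2 kernel described in the message above,
using compiler intrinsics rather than hand-scheduled assembly. The function
names, the pre-transposed layout of B and the final parity reduction are
illustrative assumptions, not code from m4rmul.c:

    /* 4 rows of A against 2 columns of B (B assumed pre-transposed so
       columns are contiguous): 8 running accumulators, one AND and one
       XOR per 128-bit pair.  Assumes padding bits beyond the matrix
       width are zero in both operands. */
    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stddef.h>

    /* Reduce a 128-bit accumulator to the parity of its popcount, using
       parity(popcount(lo) + popcount(hi)) == parity(popcount(lo ^ hi)). */
    static int parity128(__m128i v)
    {
        uint64_t lo = (uint64_t)_mm_cvtsi128_si64(v);
        uint64_t hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));
        return __builtin_parityll(lo ^ hi);
    }

    /* out[r][c] = GF(2) dot product of row r of A with column c of B. */
    void dot4x2(const __m128i *a0, const __m128i *a1,
                const __m128i *a2, const __m128i *a3,
                const __m128i *b0, const __m128i *b1,
                size_t words, int out[4][2])
    {
        __m128i acc00 = _mm_setzero_si128(), acc01 = _mm_setzero_si128();
        __m128i acc10 = _mm_setzero_si128(), acc11 = _mm_setzero_si128();
        __m128i acc20 = _mm_setzero_si128(), acc21 = _mm_setzero_si128();
        __m128i acc30 = _mm_setzero_si128(), acc31 = _mm_setzero_si128();

        for (size_t i = 0; i < words; i++) {
            __m128i tb0 = _mm_load_si128(&b0[i]);  /* 2 loads from B */
            __m128i tb1 = _mm_load_si128(&b1[i]);
            __m128i ta  = _mm_load_si128(&a0[i]);  /* 4 loads from A */
            acc00 = _mm_xor_si128(acc00, _mm_and_si128(ta, tb0));
            acc01 = _mm_xor_si128(acc01, _mm_and_si128(ta, tb1));
            ta = _mm_load_si128(&a1[i]);
            acc10 = _mm_xor_si128(acc10, _mm_and_si128(ta, tb0));
            acc11 = _mm_xor_si128(acc11, _mm_and_si128(ta, tb1));
            ta = _mm_load_si128(&a2[i]);
            acc20 = _mm_xor_si128(acc20, _mm_and_si128(ta, tb0));
            acc21 = _mm_xor_si128(acc21, _mm_and_si128(ta, tb1));
            ta = _mm_load_si128(&a3[i]);
            acc30 = _mm_xor_si128(acc30, _mm_and_si128(ta, tb0));
            acc31 = _mm_xor_si128(acc31, _mm_and_si128(ta, tb1));
        }
        out[0][0] = parity128(acc00); out[0][1] = parity128(acc01);
        out[1][0] = parity128(acc10); out[1][1] = parity128(acc11);
        out[2][0] = parity128(acc20); out[2][1] = parity128(acc21);
        out[3][0] = parity128(acc30); out[3][1] = parity128(acc31);
    }

The register budget roughly matches the 4+2+2+8 = 16 count in the message;
here the compiler is left to do the instruction scheduling that the message
proposes doing by hand.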
> > > > Assuming the Opteron can do the memory loads at the same time
> > > > as arithmetic not depending on those loads, careful instruction
> > > > scheduling should get everything down to the cost of an SSE AND
> > > > and an SSE XOR per 128x128 bit pair in the classical algorithm.
> > > > That puts the entire algorithm pretty close, timewise, to what
> > > > Magma is doing.
> > > >
> > > > Another option is not to use SSE and just use the 64 bit
> > > > integer registers. The disadvantage is that one has to load
> > > > things 64 bits at a time. But the advantage is that the Opteron
> > > > can do three arithmetic operations per cycle if properly
> > > > scheduled. That gets 50% more work done than the SSE registers,
> > > > which can only do 1 x 128 bit operation per cycle.
> > > >
> > > > Of course I'm making the assumption here that the Opteron can
> > > > indeed do loads at the same time as arithmetic.
> > > >
> > > > If not, then there is no way Magma can be using classical
> > > > multiplication out to 3600x3600, since there are simply too
> > > > many operations to perform in the number of cycles available.
> > > >
> > > > Bill.
> > > >
> > > > On 16 May, 00:03, Martin Albrecht <[EMAIL PROTECTED]> wrote:
> > > > > On Thursday 15 May 2008, Bill Hart wrote:
> > > > > > Here is the graph of Magma times:
> > > > > >
> > > > > > http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png
> > > > > >
> > > > > > The crossover is not clear. The change from a smooth curve
> > > > > > to a squiggly line is at about 3600. So presumably that is
> > > > > > it, but the graph also seems to change character at about
> > > > > > 6200 or 7000 as well. One of these changes may be cache
> > > > > > related.
> > > > > >
> > > > > > At 3600, the total data for all three matrices is almost
> > > > > > 5MB and the cache on my machine is 1024KB. But if Magma is
> > > > > > using classical multiplication, then this is pretty much
> > > > > > irrelevant anyway, since you can keep the working data
> > > > > > within the cache for quite a while during the algorithm.
> > > > >
> > > > > On the other hand: a squiggly line is what one would expect
> > > > > for Strassen-Winograd, due to the extra rows/columns that
> > > > > have to be taken care of. In any case, 3600 seems rather late
> > > > > for a 1MB L2 cache. I think Allan Steel once gave a talk
> > > > > about his implementation and stated that they don't use
> > > > > classical block multiplication (I saw some slides with that
> > > > > remark).
> > > > >
> > > > > Martin
> > > > >
> > > > > --
> > > > > name: Martin Albrecht
> > > > > _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> > > > > _www: http://www.informatik.uni-bremen.de/~malb
> > > > > _jab: [EMAIL PROTECTED]
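And, for comparison with the SSE2 sketch above, a minimal sketch of the
64 bit integer alternative Bill describes in the quoted message; the names
and layout are again illustrative assumptions:

    #include <stdint.h>
    #include <stddef.h>

    /* GF(2) dot product of one row of A with one column of B^T, both
       packed into 'words' 64-bit words: one AND and one XOR per word. */
    static int dot64(const uint64_t *a, const uint64_t *b, size_t words)
    {
        uint64_t acc = 0;
        for (size_t i = 0; i < words; i++)
            acc ^= a[i] & b[i];
        /* In practice one would unroll this over several rows/columns
           with independent accumulators, so that the Opteron's three
           integer ALUs can all be kept busy each cycle. */
        return __builtin_parityll(acc);  /* parity of the popcount */
    }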