P.S.: I also tried cache hints, but no luck. They just slow it down. Bill.
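For reference, the kind of cache hint meant here might look like the
following sketch (placement and names are hypothetical, not from the
actual code; GCC/x86 intrinsics):

    #include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */
    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative only: ask the CPU to pull a row of 64-bit words
       into L1 ahead of use, one hint per 64-byte cache line. */
    static inline void prefetch_row(const uint64_t *row, size_t words)
    {
        for (size_t w = 0; w < words; w += 8)
            _mm_prefetch((const char *)(row + w), _MM_HINT_T0);
    }

If the access pattern is already sequential, the hardware prefetcher
usually gets there first, which would explain hints only adding overhead.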
On 16 May, 15:59, Bill Hart <[EMAIL PROTECTED]> wrote:
> I tried cache blocking matrix B, but the times are exactly the same. I
> think the processor is happy to keep loading B one row at a time
> sequentially, and since it is only done a handful of times in the
> algorithm, it accounts for little of the runtime.
>
> It might go faster in combination with storing the gray tables, but
> the optimal block size seems to be about 100 (*k) for the matrix B,
> which would necessitate storage of 100 gray tables, which is an awful
> lot of memory. I think it is probably best to avoid this kind of
> memory usage.
>
> So I am not sure where the other factor of 2 will come from.
>
> Bill.
>
> On 16 May, 14:37, Bill Hart <[EMAIL PROTECTED]> wrote:
> > Here is the modified _mzd_mul_m4rm_impl function which gives this
> > speedup:
> >
> > http://sage.math.washington.edu/home/wbhart/m4rmul.c
> >
> > I used a crossover of 3600, and I indicate at the top of this file
> > how the constants should be changed to get the speedups for the
> > various values of n. I didn't put any code in to automatically
> > choose the correct values.
> >
> > I am presuming that the test_multiplication code actually tests this
> > function, otherwise maybe my code is just rubbish.
> >
> > The basic idea is to cache block the A matrix into groups of BLOCK
> > rows. If one also cache blocked the B matrix and left the computed
> > gray tables in memory, instead of writing over them and
> > recalculating them all the time as I currently do, one could
> > probably get the other factor of 2 that we need.
> >
> > Note I've turned SSE off in this function, but left it on in
> > _mzd_combine, since it makes no real difference there.
> >
> > Bill.
> >
> > On 16 May, 14:16, Bill Hart <[EMAIL PROTECTED]> wrote:
> > > 10000x10000 is down to 4.5s now, so nearly a 2x speedup.
> > >
> > > 20000x20000 is down to 32.0s, so again nearly a 2x speedup.
> > >
> > > Bill.
> > >
> > > On 16 May, 13:53, Bill Hart <[EMAIL PROTECTED]> wrote:
> > > > I made some changes to the original code so it would use the
> > > > cache a bit better. The test code seems to pass, so I don't
> > > > think I've screwed anything up.
> > > >
> > > > The time for 16384x16384 on my machine is now 20s, a factor of
> > > > 2.15x faster. The time for 2000x2000 also seems to match Magma
> > > > now. Hopefully I didn't just mistime things before, and this is
> > > > a real improvement.
> > > >
> > > > I am still trying things out.
> > > >
> > > > Bill.
> > > >
> > > > On 16 May, 01:41, Bill Hart <[EMAIL PROTECTED]> wrote:
> > > > > I think it might just be possible to get down to the speed of
> > > > > Magma with a highly optimised classical multiplication
> > > > > routine. At 3600x3600 one clearly has to do 3600x3600 scalar
> > > > > products of a row by a column. We'll assume one of the
> > > > > matrices has been transposed to facilitate this.
> > > > >
> > > > > If we use SSE2 then we can operate on 128 bits at a time.
> > > > > There are 16 SSE registers.
> > > > >
> > > > > The basic idea would be to load 4 SSE registers from different
> > > > > rows of matrix A and 2 from different columns of matrix B. We
> > > > > compute the scalar products of all 8 combinations of rows by
> > > > > columns simultaneously. For this we need 2 temporary registers
> > > > > and 8 registers to hold the running totals, so all in all we
> > > > > need 4+2+2+8 = 16 registers. We only need an AND and an XOR to
> > > > > do the multiplication and addition required.
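To make the register scheme just described concrete, here is a rough
sketch of the 4x2 AND/XOR kernel it implies. This is illustrative code,
not Bill's implementation: the function name, the calling convention,
and the assumption that the needed columns of B have been pre-transposed
into rows (bt) are all invented for the example.

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical kernel: 4 rows of A against 2 pre-transposed
       columns of B, 8 running accumulators, one AND plus one XOR per
       128-bit chunk. 'words' is the row length in 64-bit words
       (assumed even here). */
    static void scalar_products_4x2(const uint64_t *a[4],
                                    const uint64_t *bt[2],
                                    size_t words, uint64_t out[4][2])
    {
        __m128i acc[4][2];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 2; j++)
                acc[i][j] = _mm_setzero_si128();

        for (size_t w = 0; w < words; w += 2) {  /* 128 bits per step */
            __m128i b0 = _mm_loadu_si128((const __m128i *)(bt[0] + w));
            __m128i b1 = _mm_loadu_si128((const __m128i *)(bt[1] + w));
            for (int i = 0; i < 4; i++) {
                __m128i av = _mm_loadu_si128((const __m128i *)(a[i] + w));
                /* Over GF(2): multiply = AND, accumulate = XOR. */
                acc[i][0] = _mm_xor_si128(acc[i][0], _mm_and_si128(av, b0));
                acc[i][1] = _mm_xor_si128(acc[i][1], _mm_and_si128(av, b1));
            }
        }

        /* Each product bit is the parity of its 128-bit accumulator. */
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 2; j++) {
                uint64_t t[2];
                _mm_storeu_si128((__m128i *)t, acc[i][j]);
                uint64_t p = t[0] ^ t[1];
                p ^= p >> 32; p ^= p >> 16; p ^= p >> 8;
                p ^= p >> 4;  p ^= p >> 2;  p ^= p >> 1;
                out[i][j] = p & 1;
            }
    }

Each 128-bit step costs exactly one AND and one XOR per accumulator,
matching the operation count in the email; the parity fold at the end
reduces each accumulator to a single product bit.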
> > > > > Caching becomes irrelevant if we choose a large selection of
> > > > > rows from A and a large selection of columns from B and do
> > > > > all the possible scalar products of rows by columns before
> > > > > moving to the next part.
> > > > >
> > > > > Assuming the Opteron can do the memory loads at the same time
> > > > > as arithmetic not depending on those loads, careful
> > > > > instruction scheduling should get everything down to the cost
> > > > > of an SSE AND and an SSE XOR per 128x128 bit pair in the
> > > > > classical algorithm. That puts the entire algorithm pretty
> > > > > close to what Magma is doing timewise.
> > > > >
> > > > > Another option is to not use SSE and just use the 64 bit
> > > > > integer registers. The disadvantage is that one has to load
> > > > > things 64 bits at a time. But the advantage is that the
> > > > > Opteron can do three arithmetic operations per cycle if
> > > > > properly scheduled. That gets 50% more work done than the SSE
> > > > > registers, which can only do one 128 bit operation per cycle.
> > > > >
> > > > > Of course, I'm making the assumption here that the Opteron can
> > > > > indeed do loads at the same time as arithmetic. If not, then
> > > > > there is no way Magma can be using classical multiplication
> > > > > out to 3600x3600, since there are simply too many operations
> > > > > to perform in the number of cycles available.
> > > > >
> > > > > Bill.
> > > > >
> > > > > On 16 May, 00:03, Martin Albrecht <[EMAIL PROTECTED]> wrote:
> > > > > > On Thursday 15 May 2008, Bill Hart wrote:
> > > > > > > Here is the graph of Magma times:
> > > > > > >
> > > > > > > http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png
> > > > > > >
> > > > > > > The crossover is not clear. The change from a smooth curve
> > > > > > > to a squiggly line is at about 3600, so presumably that is
> > > > > > > it, but the graph also seems to change character at about
> > > > > > > 6200 or 7000 as well. One of these changes may be cache
> > > > > > > related.
> > > > > > >
> > > > > > > At 3600, the total data for all three matrices is almost
> > > > > > > 5MB and the cache on my machine is 1024KB. But if Magma is
> > > > > > > using classical multiplication, then this is pretty much
> > > > > > > irrelevant anyway, since you can keep the working data
> > > > > > > within the cache for quite a while during the algorithm.
> > > > > >
> > > > > > On the other hand: a squiggly line is what one would expect
> > > > > > for Strassen-Winograd, due to the extra rows/columns that
> > > > > > have to be taken care of. In any case, 3600 seems rather
> > > > > > late for a 1MB L2 cache. I think Allan Steel once gave a
> > > > > > talk about his implementation and stated that they don't use
> > > > > > classical block multiplication (I saw some slides with that
> > > > > > remark).
> > > > > >
> > > > > > Martin
> > > > > >
> > > > > > --
> > > > > > name: Martin Albrecht
> > > > > > _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> > > > > > _www: http://www.informatik.uni-bremen.de/~malb
> > > > > > _jab: [EMAIL PROTECTED]
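As a companion to the SSE sketch earlier in the thread, the 64-bit
integer register variant Bill describes would amount to something like
the following (again purely illustrative, with the same invented names
and the same assumption that the columns of B are pre-transposed):

    #include <stdint.h>
    #include <stddef.h>

    /* Same hypothetical 4x2 AND/XOR kernel, but 64 bits at a time in
       general-purpose registers. The eight independent AND/XOR chains
       give the scheduler room to approach three ALU ops per cycle on
       a three-issue core like the Opteron. */
    static void scalar_products_4x2_u64(const uint64_t *a[4],
                                        const uint64_t *bt[2],
                                        size_t words, uint64_t out[4][2])
    {
        uint64_t acc[4][2] = {{0}};

        for (size_t w = 0; w < words; w++) {
            uint64_t b0 = bt[0][w], b1 = bt[1][w];
            for (int i = 0; i < 4; i++) {
                uint64_t av = a[i][w];
                acc[i][0] ^= av & b0;  /* GF(2) multiply-accumulate */
                acc[i][1] ^= av & b1;
            }
        }

        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 2; j++) {
                uint64_t p = acc[i][j];  /* parity = product bit */
                p ^= p >> 32; p ^= p >> 16; p ^= p >> 8;
                p ^= p >> 4;  p ^= p >> 2;  p ^= p >> 1;
                out[i][j] = p & 1;
            }
    }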