Hi Clement,

I heard you had a big smile on your face today. Well done.
Regarding your suggestion about copying into blocks: that is a very good
idea. The problem at present is that we break up into blocks vertically,
not horizontally, but we absolutely should be doing it horizontally,
because then we could break along cache line boundaries. If we then
copied into blocks as you suggest before calling M4RM, the time saving
should be quite noticeable. I'll investigate this when I have more time
to work on it.

Regarding the asymptotics, I think they are the same as Magma's. The
time for addition is absolutely tiny in comparison with multiplication
(roughly 1000 times smaller), and each matrix above our Strassen
crossover is already well outside the L2 cache. So there just isn't
anything to save, either in cache friendliness or in the additions, and
I think the issue is actually still in the M4RM routine. A 32000x32000
product breaks down into matrices of size 2000x2000 once it gets below
the crossover, and those smaller multiplications are simply taking too
long.

The reason I formerly felt there was a problem with Strassen is that
when we compared 10000x10000 and 20000x20000, Magma seemed to beat us
for the latter but not the former. But when 20000x20000 gets broken
down, it can't be broken into 10000x10000 pieces, because 10000 is not
divisible by 64, so the split can't fall on a limb boundary. Whatever it
does get broken into is too slow. That's where we need to focus our
efforts now, I think.

Bill.

On 20 May, 02:14, Clement Pernet <[EMAIL PROTECTED]> wrote:
> hi guys,
>
> I am finally up to date with this discussion (I was being interviewed,
> and then flying when it started).
> First, congrats on the great job you have achieved. I have started to
> dive into m4ri, and I really like the quality of the code.
>
> I have a few remarks:
>
> * The loop unrolling technique used for the creation of the table
> could maybe be used in the computation of the product as well.
> Is 8 optimal?
> I remember seeing 32 in ATLAS, but I don't know of any justification.
> Since some pipelines are longer than 8, it might be better to have a
> longer unrolled loop.
>
> * I am not sure about this, but wouldn't it be better to have a block
> decomposition that matches the babystep-giantstep structure?
> This could happen at the Strassen threshold: instead of simply copying
> the matrix (which already improves the data locality), copy it into a
> bunch of blocks of size blocksize and call m4rm on that structure.
> ATLAS does this kind of copy for dimensions not larger than 200, if I
> recall correctly.
> Maybe I am just missing something about your babystep/giantstep
> algorithm.
>
> Anyway, as you pointed out, the battle is now on the asymptotic
> comparison with Magma, and I still have no idea how to improve your
> Strassen implementation. Still thinking about it....
>
> Cheers,
>
> Clément
>
> Bill Hart wrote:
>
> > Yep, that's exactly the same thing as what M4RM does. Thanks for the
> > explanation.
>
> > Bill.
>
> > On 20 May, 00:22, Robert Miller <[EMAIL PROTECTED]> wrote:
>
> >>> I can't tell exactly what GAP does. It is beautifully documented,
> >>> but it talks about "grease units", which is terminology I don't
> >>> understand. It does look like M4RM though.
>
> >> Grease is a concept for speeding up certain things using caching.
> >> For example, suppose I have the permutation group S_{32} acting on
> >> ints. I can represent a particular permutation as a permutation
> >> matrix, which means that applying that permutation is just
> >> vector-matrix multiplication. However, instead of computing the
> >> full matrix multiplication, we can cut the matrix into pieces
> >> (probably of length "grease units" or something). Essentially, we
> >> compute every possible sum of the first four rows, then the next
> >> four rows, etc.
> >> Then, to see how to multiply the vector by the matrix, we cut the
> >> vector into chunks of four, and simply look up the corresponding
> >> entry in the "grease table", finally adding them together at the
> >> end.
>
> >> -- RLM

--
To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org