I tried copying out the input matrices to the M4RM routine, but only when the rows aren't all contiguous in memory. This didn't speed anything up. Of course the reason for that is the second matrix is only read a handful of times, to construct the Gray tables which are then used extensively. The first matrix is read out of order anyway, by M4RM, so there's no point making all its rows contiguous.
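To make the point about the second matrix concrete, here is a minimal sketch of how a Gray table can be built from k rows of B. This is not the M4RI code: the name `build_gray_table`, the choice K = 3, and the use of a single 64-bit word per row are assumptions made purely for illustration. Each row of B is read only while the table is being filled; after that, only the table is consulted.

```c
#include <assert.h>
#include <stdint.h>

#define K 3  /* rows of B covered by one table (hypothetical choice) */

/* Fill table[0 .. 2^K - 1] so that table[i] is the XOR of the rows
 * selected by the bits of i (bit r set means rows[r] is included).
 * Indices are visited in Gray-code order, so each new entry costs a
 * single row XOR on top of the previously computed entry. */
void build_gray_table(const uint64_t rows[K], uint64_t table[1u << K])
{
    unsigned prev = 0;
    table[0] = 0;
    for (unsigned j = 1; j < (1u << K); j++) {
        unsigned g = j ^ (j >> 1);   /* j-th Gray code */
        unsigned diff = g ^ prev;    /* consecutive codes differ in one bit */
        assert(diff != 0);
        unsigned r = 0;
        while (!((diff >> r) & 1u))  /* find the row whose bit flipped */
            r++;
        table[g] = table[prev] ^ rows[r];
        prev = g;
    }
}
```

In a multiplication along these lines, K bits of a row of A would be used as the index into `table`, and the entry found there XORed into the matching row of C, so the rows of B are touched once per table rather than once per row of A.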
It's hard to see how to get that last bit we need to beat Magma. What I don't quite understand now is that we are beating Magma all the way up to 10000x10000, which is surely past their crossover to Strassen, yet we start losing for large matrices when we use Strassen ourselves. I checked that the addition is not slower than Magma's (it's not; it's up to 5 times faster).

The only trick I have left to try is to use twice the number of Gray tables, but make them half the width. That seems like cheating, though, since we should already have a fast enough base case by now!

Bill.

On 18 May, 20:09, Bill Hart <[EMAIL PROTECTED]> wrote:
> The copying out makes 50% difference (it's better with copying) to the
> speed of 16384x16384 but no difference to 10000x10000 or 20000x20000.
>
> That's weird.
>
> Bill.
>
> On 18 May, 17:36, Martin Albrecht <[EMAIL PROTECTED]>
> wrote:
>
> > Hi,
>
> > first, I recorded the different speed-ups in a small table for an overview
> > in the attachment (I think we've come a long way :-)) To disable the
> > copying out one needs to edit
>
> >   /* we copy the matrix first since it is only constant memory
> >      overhead and improves data locality, if you remove it make sure
> >      there are no speed regressions */
> >   /* C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE); */
> >   packedmatrix *Cbar = mzd_init(C->nrows, C->ncols);
> >   Cbar = _mzd_mul_m4rm_impl(Cbar, A, B, 0, FALSE);
> >   mzd_copy(C, Cbar);
> >   mzd_free(Cbar);
> >   return C;
>
> > in strassen.c to
>
> >   /* we copy the matrix first since it is only constant memory
> >      overhead and improves data locality, if you remove it make sure
> >      there are no speed regressions */
> >   C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE);
> >   return C;
>
> > This disables the copying out.
>
> > Martin
>
> > PS: If I find some time later today I'll make some changes such that SSE2
> > can be used more often, i.e. align each row at 16-byte boundaries if
> > HAVE_SSE2 is used.
>
> > --
> > name: Martin Albrecht
> > _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> > _www: http://www.informatik.uni-bremen.de/~malb
> > _jab: [EMAIL PROTECTED]
>
> > timings.html (2K)

To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org