Well, apparently there are speed gains right up to 8 Gray tables, though I really have no idea why that is.
Here are the new times:

             Magma    Old      New
10000x10000: 2.940s   3.13s    2.32s
16384x16384: 9.250s   12.96s   9.17s
20000x20000: 16.57s   22.43s   16.49s
32000x32000: 59.1s    90.2s    81.6s    65.7s

So now we beat Magma all the way up to 20000x20000.

I'll put the adjusted code in the directory:

http://sage.math.washington.edu/home/wbhart/m4ri3gray/

in just a moment. I also found a memory leak in my code which I fixed.

Bill.

On 18 May, 19:19, Bill Hart <[EMAIL PROTECTED]> wrote:
> If two Gray tables is better than one, then you can't have enough of a
> good thing right? So I made 3 Gray tables now.
>
> The files I modified are in:
>
> http://sage.math.washington.edu/home/wbhart/m4ri3gray/
>
> There are three main modifications:
>
> 1) Make all matrices SSE aligned if we have SSE2, and make all rows of
> all matrices aligned. This required a fix in brilliantrussian.c, since
> it makes some assumptions about how the data is stored in matrices.
>
> 2) Introduce function combine2 and combine3 which use SSE to do a[i]
> ^= b[i] ^ c[i] and a[i] ^= b[i] ^ c[i] ^ d[i].
>
> 3) Code to use three Gray tables. I had to remove the a_nc%k's since
> these were causing segfaults for reasons I don't understand.
>
> Here are the Magma times, your old times and the new times on my
> Opteron:
>
> 10000x10000: 2.940s   3.13s    2.82s
> 16384x16384: 9.250s   12.96s   11.25s
> 20000x20000: 16.57s   22.43s   19.39s
> 32000x32000: 59.1s    90.2s    81.56s
>
> Note I do not use the new combine2 and combine3 functions as they slow
> it down on my machine. I cannot believe how useless SSE seems to be!
>
> I've commented the three lines out that use combine3 in
> brilliantrussian.c in case you want to try it with SSE on your CTD.
>
> Bill.
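
For reference, item 2 of the quoted message describes combine2 and combine3 as doing a[i] ^= b[i] ^ c[i] and a[i] ^= b[i] ^ c[i] ^ d[i] with SSE. What follows is only a minimal sketch of that idea, not the code from brilliantrussian.c: the signatures, the 64-bit `word` type and the `wide` (words per row) parameter are assumptions made for illustration. It uses unaligned loads so it works on any pointers; once every row is 16-byte aligned, `_mm_load_si128` and `_mm_store_si128` could be used instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

typedef uint64_t word; /* assumption: rows are arrays of 64-bit words */

/* a[i] ^= b[i] ^ c[i] for i in [0, wide) */
void combine2(word *a, const word *b, const word *c, size_t wide)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && c != NULL);
#ifdef __SSE2__
    /* Two 64-bit words fit in one 128-bit register. */
    for (; i + 2 <= wide; i += 2) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vc = _mm_loadu_si128((const __m128i *)(c + i));
        va = _mm_xor_si128(va, _mm_xor_si128(vb, vc));
        _mm_storeu_si128((__m128i *)(a + i), va);
    }
#endif
    /* Scalar tail (or the whole row when SSE2 is unavailable). */
    for (; i < wide; i++)
        a[i] ^= b[i] ^ c[i];
}

/* a[i] ^= b[i] ^ c[i] ^ d[i] for i in [0, wide) */
void combine3(word *a, const word *b, const word *c, const word *d,
              size_t wide)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && c != NULL && d != NULL);
#ifdef __SSE2__
    for (; i + 2 <= wide; i += 2) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vc = _mm_loadu_si128((const __m128i *)(c + i));
        __m128i vd = _mm_loadu_si128((const __m128i *)(d + i));
        va = _mm_xor_si128(_mm_xor_si128(va, vb), _mm_xor_si128(vc, vd));
        _mm_storeu_si128((__m128i *)(a + i), va);
    }
#endif
    for (; i < wide; i++)
        a[i] ^= b[i] ^ c[i] ^ d[i];
}
```

The loop does one XOR per load, so it is likely limited by memory traffic rather than arithmetic, which may be part of why SSE gave no gain in the timings quoted above.
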
>
> On 18 May, 17:36, Martin Albrecht <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > first, I recorded the different speed-ups in a small table for an
> > overview in the attachment (I think we've come a long way :-)).
> >
> > To disable the copying out, one needs to edit
> >
> >   /* we copy the matrix first since it is only constant memory
> >      overhead and improves data locality, if you remove it make sure
> >      there are no speed regressions */
> >   /* C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE); */
> >   packedmatrix *Cbar = mzd_init(C->nrows, C->ncols);
> >   Cbar = _mzd_mul_m4rm_impl(Cbar, A, B, 0, FALSE);
> >   mzd_copy(C, Cbar);
> >   mzd_free(Cbar);
> >   return C;
> >
> > in strassen.c to
> >
> >   /* we copy the matrix first since it is only constant memory
> >      overhead and improves data locality, if you remove it make sure
> >      there are no speed regressions */
> >   C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE);
> >   return C;
> >
> > This disables the copying out.
> >
> > Martin
> >
> > PS: If I find some time later today I'll make some changes such that
> > SSE2 can be used more often, i.e. align each row at 16-byte borders
> > if HAVE_SSE2 is used.
> >
> > --
> > name: Martin Albrecht
> > _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> > _www: http://www.informatik.uni-bremen.de/~malb
> > _jab: [EMAIL PROTECTED]
> >
> > timings.html
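
The row alignment mentioned in the PS above (each row starting on a 16-byte border) can be sketched as follows. This is only an illustration under stated assumptions, not M4RI's packedmatrix layout: `aligned_matrix`, `am_init`, `am_row` and `am_free` are hypothetical names, words are assumed to be 64 bits, and C11 `aligned_alloc` is assumed to be available. The idea is to pad each row to an even number of words, so that if the first row is 16-byte aligned then every row is.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t word; /* assumption: 64 matrix entries per word */

/* Hypothetical container, not M4RI's packedmatrix. */
typedef struct {
    size_t nrows;
    size_t ncols;     /* number of columns, in bits */
    size_t rowstride; /* words per row, padded to an even count */
    word *data;       /* 16-byte aligned block of nrows * rowstride words */
} aligned_matrix;

aligned_matrix *am_init(size_t nrows, size_t ncols)
{
    aligned_matrix *m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;

    size_t width = (ncols + 63) / 64;        /* words needed per row */
    m->nrows = nrows;
    m->ncols = ncols;
    m->rowstride = (width + 1) & ~(size_t)1; /* round up to even: 16 bytes */

    /* rowstride is even, so bytes is a multiple of 16 as aligned_alloc
       requires; use one 16-byte block for an empty matrix. */
    size_t bytes = nrows * m->rowstride * sizeof(word);
    if (bytes == 0)
        bytes = 16;

    m->data = aligned_alloc(16, bytes);
    if (m->data == NULL) {
        free(m);
        return NULL;
    }
    memset(m->data, 0, bytes);
    return m;
}

/* Pointer to the first word of row r; always 16-byte aligned. */
word *am_row(const aligned_matrix *m, size_t r)
{
    assert(r < m->nrows);
    word *p = m->data + r * m->rowstride;
    assert((uintptr_t)p % 16 == 0);
    return p;
}

void am_free(aligned_matrix *m)
{
    if (m != NULL) {
        free(m->data);
        free(m);
    }
}
```

The cost is at most one padding word per row, and any code that assumes rows are stored back to back with exactly `width` words has to be changed to step by `rowstride` instead.
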