The asymptotics really only kick in when we are using Strassen. The starting point is thus not 0, but the crossover point between M4RM and Strassen. So I don't think you can read too much into the asymptotic statement. But once the asymptotics do kick in, we aren't too far off. The only thing throwing them off slightly is the inexact cut business, which my patch mitigates to some extent, though not totally.
I wouldn't be surprised to find we were a factor of 1.5 or 2 from optimal, but this would be because we could probably do more work in our base case without falling out of cache. There is still room for improvement. A few percent can be gained by splitting A horizontally into two matrices. I also believe we could get a significant gain by allowing exact cuts, i.e. not along 64 bit boundaries. This could be done by copying out the data and shifting it (for the shifting we'd benefit from SSE on both the Opteron and C2D), from the two submatrices on the right when doing Strassen. But this would also dramatically increase memory usage, so I'm not sure how pronounced the overall effect would be. Perhaps we'd gain 10-15%. The other problem would be that we'd have to mask the end of each row in the submatrices on the left hand side before using them, so this could complicate the code quite a bit. Another option would be to represent our matrices a different way. We'd store a 32000 by 32000 matrix as a series of 32000 x 2000 matrices, padded with zeroes out to a 64 bit boundary. Addition of matrices wouldn't be affected much by this, but the m4rm code would have to be substantially altered in the way it deals with the matrix A. At any rate I think most of the possible substantial improvements now result in yucky code. Bill. On 22 May, 11:21, Martin Albrecht <[EMAIL PROTECTED]> wrote: > On Thursday 22 May 2008, Bill Hart wrote: > > > Hi Martin, > > > This version works great. Here are the times on my unburdened 2.8Ghz > > Opteron. First the Magma times, then the times for an older version of > > m4ri and now, for the first time ever, the new Magma beating times: > > > 10000x10000: 2.940s 3.13s 2.25s > > > 16384x16384: 9.250s 12.96s 8.80s > > > 20000x20000: 16.57s 22.43s 15.48 > > > 32000x32000: 59.1s 90.2s 57.8s > > Bill, I suppose that also means that now we actually beat (or are close to > beating) Magma on the C2D "for real". My M4RI times are quite similar on the > C2D as your times on your Opteron. But my version of Magma (on the C2D) is > much worse than your version of Magma (on the Opteron). So it is probably > best to assume at least your times for Magma on my machine too. > > Martin > > PS: I wonder if this argument makes sense: > > We have a complexity of n^2.807 and a CPU of say 2.333 Ghz which can operate > on 64 bits per clock (128 if we use SSE2). So if we had optimal code (no > missed branch predictions, no caching issues, everything optimal) we would > expect a running time of n^2.807 / 64 / 2.333 / 10^9 > > If we plug in 20,000 for n we'd get 7.923 seconds w.o. SSE2 and 3.961 with > SSE2. So our implementation (12.2 s) is a factor of ~1.5 or ~3 away from > being optimal? Does that sound correct or complete bollocks? > > -- > name: Martin Albrecht > _pgp:http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99 > _www:http://www.informatik.uni-bremen.de/~malb > _jab: [EMAIL PROTECTED] --~--~---------~--~----~------------~-------~--~----~ To post to this group, send email to sage-devel@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/sage-devel URLs: http://www.sagemath.org -~----------~----~----~----~------~----~------~--~---