I checked the coinflip and it is definitely fine. There is no greater probability of 6 zeroes in a row than there ought to be. So the speedup I just reported is quite a mystery.
Bill.

On 17 May, 22:57, Bill Hart <[EMAIL PROTECTED]> wrote:
> I suppose that this might be due to the ends of rows all being zero, as
> they aren't a multiple of 64 bits long. But I checked for 16384x16384
> and we are nearly down to the speed of Magma there too. I just don't
> get it. The coinflip has to be broken, I think.
>
> Bill.
>
> On 17 May, 22:40, Bill Hart <[EMAIL PROTECTED]> wrote:
> >
> > Martin,
> >
> > Here's a really unusual thing. Perhaps you can confirm it. I get a
> > 20% improvement if I add:
> >
> >     if (x)
> >     {
> >
> >     }
> >
> > in the three obvious places in the function _mzd_mul_m4rm_impl. This
> > stops it mzd_combine-ing the zero row.
> >
> > But I don't understand why this works. The time should be only 1.5%
> > better, since k = 6 and there are 2^k rows in the table, only one of
> > which is zero.
> >
> > Could it be that your coinflip function is not quite random?
> >
> > Anyway, I'm down to 3.40s for 10000x10000 with this change. Test
> > functions still pass.
> >
> > Bill.
> >
> > On 17 May, 22:05, Bill Hart <[EMAIL PROTECTED]> wrote:
> > >
> > > Yet another idea.
> > >
> > > Suppose we do not combine entire rows in the Gray table, but only
> > > half rows. Once half a row is bigger than a single cache line (512
> > > bits on the Opteron), we may as well work with half rows. This allows
> > > us to work with twice as many rows at once in the Gray tables (each
> > > of half the size). It means that we are dealing with twice as many
> > > bits from rows of A as usual and twice as many rows of B as usual,
> > > but we need to do it all again for the second half of the rows. So we
> > > get twice the work done in the same amount of cache space.
> > >
> > > Combined with the idea of using two Gray tables of 2^5 combinations
> > > of rows instead of a single table of 2^6 combinations of rows, this
> > > would equate to dealing with 20 bits of each row of A at a time and
> > > 20 rows of B at a time.
> > > > With this scheme, there would then be 4 arithmetic operations in
> > > > SSE registers, 5 loads and 1 store when combining rows from Gray
> > > > tables, instead of about 6.6 loads, 3.3 stores and 3.3 arithmetic
> > > > operations, changing the ratio of loads/stores to arithmetic ops
> > > > from 2.7 to 1.5.
> > > >
> > > > This is another example where copying the data (the half rows) out
> > > > and reordering it so that it has better locality would probably
> > > > make a big difference. That sort of thing always works
> > > > exceptionally well on AMD chips.
> > > >
> > > > Bill.
> > > >
> > > > On 17 May, 21:05, Bill Hart <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > That's looking good. Would you like me to run it on an unburdened
> > > > > Opteron to see how it goes? If you like, you can send me a
> > > > > tarball and I'll try it out.
> > > > >
> > > > > I think our best bet for a significant improvement now is the
> > > > > idea of using two Gray tables of half the size simultaneously. I
> > > > > also realised it possibly improves the cache performance for the
> > > > > A matrix too.
> > > > >
> > > > > I was casually wondering whether Magma might use a highly
> > > > > optimised Winograd algorithm instead of the naive algorithm. But
> > > > > over GF(2) I think it probably actually takes longer, since it
> > > > > basically replaces n^2 full-length scalar multiplies with n^2
> > > > > half-length ones and 2*n^2 half-row additions, plus a pile of
> > > > > other overhead.
> > > > >
> > > > > Bill.
> > > > >
> > > > > On 17 May, 20:32, Martin Albrecht <[EMAIL PROTECTED]> wrote:
> > > > > > On Saturday 17 May 2008, Martin Albrecht wrote:
> > > > > > > I think a better idea would be to explicitly force all
> > > > > > > matrices and all rows to be 128-bit aligned if the matrices
> > > > > > > are wide enough to benefit from SSE2. Then the combine
> > > > > > > function can always use SSE2 and there will be no need to
> > > > > > > check for alignment.
> > > > > >
> > > > > > That doesn't seem to make a noticeable difference for me (on
> > > > > > C2D).
> > > > > > However, I realised that the multiplications are slower where
> > > > > > the target matrix is a window rather than a real matrix
> > > > > > (windows have bad data locality). Copying everything over
> > > > > > seems not like a good idea, but it at least indicates an area
> > > > > > for improvement.
> > > > >
> > > > > Okay, if I only copy when we cross over to M4RM, then the memory
> > > > > overhead is constant (~ cutoff^2) and the performance still
> > > > > improves.
> > > > >
> > > > > Old: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo
> > > > >
> > > > > Matrix Dimension    Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
> > > > > 10,000 x 10,000      2.920                    3.610
> > > > > 16,384 x 16,384     11.140                   12.120
> > > > > 20,000 x 20,000     20.370                   24.390
> > > > > 32,000 x 32,000     74.290                   94.910
> > > > >
> > > > > New: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo
> > > > >
> > > > > Matrix Dimension    Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
> > > > > 10,000 x 10,000      2.920                    2.990
> > > > > 16,384 x 16,384     11.140                   11.750
> > > > > 20,000 x 20,000     20.370                   21.180
> > > > > 32,000 x 32,000     74.290                   86.570
> > > > >
> > > > > On Opteron things don't look this way, but I think sage.math is
> > > > > pretty heavily used right now, so my benchmarks there are not
> > > > > very telling.
> > > > >
> > > > > Martin
> > > > >
> > > > > --
> > > > > name: Martin Albrecht
> > > > > _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> > > > > _www: http://www.informatik.uni-bremen.de/~malb
> > > > > _jab: [EMAIL PROTECTED]

--~--~---------~--~----~------------~-------~--~----~
To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org
-~----------~----~----~----~------~----~------~--~---