Martin,

Here's a really unusual thing. Perhaps you can confirm this. I get a 20% improvement if I add:
if (x) { } in the three obvious places in the function _mzd_mul_m4rm_impl. This stops it mzd_combine-ing the zero row, but I don't understand why this works. The time should only be about 1.5% better, since k = 6 and there are 2^k rows in the table, only one of which is zero. Could it be that your coinflip function is not quite random?

Anyway, I'm down to 3.40s for 10000x10000 with this change. Test functions still pass.

Bill.

On 17 May, 22:05, Bill Hart <[EMAIL PROTECTED]> wrote:
> Yet another idea.
>
> Suppose we do not combine entire rows in the Gray table, but only half
> rows. Once half a row is bigger than a single cache line (512 bits on
> the Opteron) we may as well work with half rows. This allows us to
> work with twice as many rows at once in the Gray tables (each of half
> the size). This means that we are dealing with twice as many bits from
> rows of A as usual and twice as many rows of B as usual, but we need
> to do it all again for the second half of the rows. This means we get
> twice the work done in the same amount of cache space.
>
> Combined with the idea of using two Gray tables of 2^5 combinations of
> rows instead of a single table of 2^6 combinations of rows, this would
> equate to dealing with 20 bits of each row of A at a time and 20 rows
> of B at a time.
>
> With this scheme, there would then be 4 arithmetic operations in SSE
> registers, 5 loads and 1 store, when combining rows from Gray tables,
> instead of about 6.6 loads, 3.3 stores and 3.3 arithmetic operations,
> changing the ratio of loads/stores to arithmetic ops from 2.7 to 1.5.
>
> This is another example where copying the data (the half rows) out and
> reordering it so that it has better locality would probably make a big
> difference. That sort of thing always works exceptionally well on AMD
> chips.
>
> Bill.
>
> On 17 May, 21:05, Bill Hart <[EMAIL PROTECTED]> wrote:
> >
> > That's looking good. Would you like me to run it on an unburdened
> > opteron to see how it goes?
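[A minimal sketch of the guard Bill describes. This is not the actual _mzd_mul_m4rm_impl source; the function name and table layout below are assumptions for illustration. In M4RM, k bits read from a row of A index a Gray-code table of 2^k precomputed row combinations of B, and index 0 selects the all-zero combination, so XOR-ing it in is wasted work.]

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not M4RI code: XOR Gray-table row x of B-row
 * combinations into a destination row, skipping x == 0, which indexes
 * the all-zero table row.  With k = 6 a random index is zero only
 * 1/64 of the time (~1.5%), which is why the observed 20% speedup is
 * surprising. */
static void add_table_row(uint64_t *dst, const uint64_t *table,
                          size_t x, size_t width)
{
    if (x) {  /* the added check: skip the all-zero table row */
        const uint64_t *src = table + x * width;
        for (size_t i = 0; i < width; i++)
            dst[i] ^= src[i];
    }
}
```

On its own the guard should save roughly the 1.5% Bill computes, so the 20% he measures would have to come from somewhere else (e.g. a skewed distribution of indices, as he suspects of the coinflip function).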
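[The load/store arithmetic in the quoted half-row idea can be illustrated with a small sketch. The function and data layout here are assumptions, not M4RI code: it only shows where the "5 loads, 1 store, 4 arithmetic operations" count comes from when four Gray-table rows are accumulated into the destination in one pass.]

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Illustrative sketch: accumulate one run of 128-bit chunks of the
 * destination from four Gray-table rows at once.  Per chunk this is
 * 5 loads (dst plus four table rows), 4 XORs in SSE registers and
 * 1 store.  XOR-ing each table row into dst separately costs 2 loads,
 * 1 XOR and 1 store per row, i.e. far more memory traffic per
 * arithmetic op -- the ratio change Bill quotes (about 2.7 down
 * to 1.5). */
static void combine4(__m128i *dst,
                     const __m128i *t0, const __m128i *t1,
                     const __m128i *t2, const __m128i *t3,
                     size_t nchunks)
{
    for (size_t i = 0; i < nchunks; i++) {
        __m128i acc = _mm_loadu_si128(dst + i);            /* load dst */
        acc = _mm_xor_si128(acc, _mm_loadu_si128(t0 + i)); /* 4 loads  */
        acc = _mm_xor_si128(acc, _mm_loadu_si128(t1 + i)); /* + 4 XORs */
        acc = _mm_xor_si128(acc, _mm_loadu_si128(t2 + i));
        acc = _mm_xor_si128(acc, _mm_loadu_si128(t3 + i));
        _mm_storeu_si128(dst + i, acc);                    /* one store */
    }
}
```

The half-row scheme in the quoted message is what makes four simultaneous table rows fit in cache; the sketch only shows the register-level accounting once they do.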
> > If you like you can send me a tarball and I'll try it out.
> >
> > I think our best bet for a significant improvement now is the idea of
> > using two Gray tables of half the size simultaneously. I also realised
> > it possibly improves the cache performance for the A matrix too.
> >
> > I was casually wondering whether Magma might use a highly optimised
> > Winograd's algorithm instead of the naive algorithm. But over GF(2) I
> > think it probably actually takes longer, since it basically replaces
> > n^2 full-length scalar multiplies by n^2 half-length ones and 2*n^2
> > half-row additions, plus a pile of other overhead.
> >
> > Bill.
> >
> > On 17 May, 20:32, Martin Albrecht <[EMAIL PROTECTED]> wrote:
> > >
> > > On Saturday 17 May 2008, Martin Albrecht wrote:
> > > >
> > > > > I think a better idea would be to explicitly force all matrices
> > > > > and all rows to be 128-bit aligned if the matrices are wide
> > > > > enough to benefit from SSE2. Then the combine function can
> > > > > always use SSE2 and there will be no need to check for
> > > > > alignment.
> > > >
> > > > That doesn't seem to make a noticeable difference for me (on
> > > > C2D). However, I realised that the multiplications where the
> > > > target matrix is a real matrix rather than a window have bad
> > > > data locality. Copying everything over doesn't seem like a good
> > > > idea, but it at least indicates an area for improvement.
> > >
> > > Okay, if I only copy when we crossover to M4RM then the memory
> > > overhead is constant (~ cutoff^2) and the performance still
> > > improves.
> > > Old: 64-bit Debian/GNU Linux, 2.33Ghz Core2Duo
> > >
> > > Matrix Dimension    Magma 2.14-13 (64-bit)  M4RI-20080517 (64-bit)
> > > 10,000 x 10,000      2.920                   3.610
> > > 16,384 x 16,384     11.140                  12.120
> > > 20,000 x 20,000     20.370                  24.390
> > > 32,000 x 32,000     74.290                  94.910
> > >
> > > New: 64-bit Debian/GNU Linux, 2.33Ghz Core2Duo
> > >
> > > Matrix Dimension    Magma 2.14-13 (64-bit)  M4RI-20080517 (64-bit)
> > > 10,000 x 10,000      2.920                   2.990
> > > 16,384 x 16,384     11.140                  11.750
> > > 20,000 x 20,000     20.370                  21.180
> > > 32,000 x 32,000     74.290                  86.570
> > >
> > > On Opteron things don't look this way, but I think sage.math is
> > > pretty heavily used right now, so my benchmarks there are not
> > > very telling.
> > >
> > > Martin
> > >
> > > --
> > > name: Martin Albrecht
> > > _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> > > _www: http://www.informatik.uni-bremen.de/~malb
> > > _jab: [EMAIL PROTECTED]
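[The 128-bit alignment suggestion quoted in the thread can be sketched as follows. alloc_aligned_row is a hypothetical helper, not M4RI's allocator: the point is only that if every row is padded to a whole number of 16-byte chunks and starts on a 16-byte boundary, an SSE2 combine never needs a runtime alignment check.]

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical row allocator: pad the row to a multiple of 16 bytes
 * (one 128-bit SSE register) and place it on a 16-byte boundary, so
 * aligned SSE2 loads/stores are always safe on it. */
static uint64_t *alloc_aligned_row(size_t words)
{
    /* round up to a whole number of 128-bit chunks */
    size_t bytes = ((words * sizeof(uint64_t)) + 15) & ~(size_t)15;
    /* C11 aligned_alloc requires size to be a multiple of alignment,
     * which the rounding above guarantees */
    return (uint64_t *)aligned_alloc(16, bytes);
}
```

The trade-off discussed in the quoted messages is that alignment alone did not show a measurable win on Core2Duo; the data-locality of the target matrix appears to matter more.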