I suppose this might be due to the ends of the rows all being zero, since
they aren't a multiple of 64 bits long. But I checked 16384x16384, where
the rows are an exact multiple of 64 bits, and we are nearly down to the
speed of Magma there too. I just don't get it. I think the coinflip has
to be broken.
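
If it helps, here is the kind of quick bias check I have in mind (just a
sketch: swap the real coinflip in for the rand()-based stand-in below, since
I am only guessing that it returns 0 or 1):

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for the coinflip under suspicion; replace with the real one.
     * Assumed to return 0 or 1 with probability 1/2 each. */
    static int coinflip_standin(void)
    {
        return rand() & 1;
    }

    int main(void)
    {
        const long trials = 10000000;
        long ones = 0;
        for (long i = 0; i < trials; i++)
            ones += coinflip_standin();
        /* An unbiased source should print something very close to 0.5. */
        printf("fraction of ones: %f\n", (double)ones / trials);
        return 0;
    }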

Bill.

On 17 May, 22:40, Bill Hart <[EMAIL PROTECTED]> wrote:
> Martin,
>
> Here's a really unusual thing. Perhaps you can confirm this. I get a
> 20% improvement if I add:
>
> if (x)
> {
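>     /* ... the existing combine of the Gray-table row stays inside this guard ... */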
>
> }
>
> in the three obvious places in the function _mzd_mul_m4rm_impl. This
> stops it mzd_combine-ing the zero row.
>
> But I don't understand why this works. The time should be only 1.5%
> better since k = 6 and there are 2^k rows in the table, only one of
> which is zero.
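>
> In case a concrete picture helps, here is a self-contained sketch of the
> idea (made-up names, not the actual code in _mzd_mul_m4rm_impl): T[0] is
> the all-zero table row, so skipping x == 0 just avoids XOR-ing zeros into
> the destination row.
>
>     #include <stdint.h>
>     #include <stddef.h>
>
>     /* XOR one precomputed Gray-table row into a row of C. */
>     static void add_row(uint64_t *c_row, const uint64_t *t_row, size_t width)
>     {
>         for (size_t j = 0; j < width; j++)
>             c_row[j] ^= t_row[j];
>     }
>
>     /* x is the k-bit index read from the current row of A; T has 2^k
>      * entries and T[0] is the zero row, so it can be skipped outright. */
>     static void combine_guarded(uint64_t *c_row, uint64_t *const *T,
>                                 size_t x, size_t width)
>     {
>         if (x)
>             add_row(c_row, T[x], width);
>     }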
>
> Could it be that your coinflip function is not quite random?
>
> Anyway, I'm down to 3.40s for 10000x10000 with this change. Test
> functions still pass.
>
> Bill.
>
> On 17 May, 22:05, Bill Hart <[EMAIL PROTECTED]> wrote:
>
> > Yet another idea.
>
> > Suppose we do not combine entire rows in the Gray table, but only half
> > rows. Once half a row is bigger than a single cache line (512 bits on
> > the Opteron) we may as well work with half rows. This allows us to
> > work with twice as many rows at once in the Gray tables (each of half
> > the size), so we deal with twice as many bits from each row of A and
> > twice as many rows of B as usual. We do have to do it all again for
> > the second half of the rows, but we get twice the work done in the
> > same amount of cache space.
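>
> > To make "half rows" concrete, here is a sketch of the combine step
> > restricted to one half of a row (made-up names, nothing to do with the
> > current code): a table entry is then only width/2 words, which is what
> > lets twice as many tables fit in cache.
>
> >     #include <stdint.h>
> >     #include <stddef.h>
> >
> >     /* XOR a half-row table entry into the matching half of a row of C.
> >      * half = 0 covers words [0, width/2), half = 1 the remainder; the
> >      * second half is done in a separate pass with its own tables. */
> >     static void combine_half(uint64_t *c_row, const uint64_t *t_half,
> >                              size_t width, int half)
> >     {
> >         size_t lo = half ? width / 2 : 0;
> >         size_t hi = half ? width : width / 2;
> >         for (size_t j = lo, k = 0; j < hi; j++, k++)
> >             c_row[j] ^= t_half[k];
> >     }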
>
> > Combined with the idea of using two Gray tables of 2^5 combinations of
> > rows instead of a single table of 2^6 combinations of rows, this would
> > give four half-row tables in all, i.e. dealing with 20 bits of each row
> > of A at a time and 20 rows of B at a time.
>
> > With this scheme there would then be 4 arithmetic operations in SSE
> > registers, 5 loads and 1 store when combining rows from the Gray tables,
> > instead of about 6.6 loads, 3.3 stores and 3.3 arithmetic operations,
> > changing the ratio of loads/stores to arithmetic ops from 2.7 to 1.5.
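>
> > Roughly what the inner loop would look like with SSE2 intrinsics and
> > four tables (a sketch under the assumptions above: 16-byte-aligned rows
> > and a made-up function name, not existing code). Per 128-bit chunk it
> > is exactly 5 loads, 4 XORs and 1 store.
>
> >     #include <emmintrin.h>
> >     #include <stddef.h>
> >
> >     /* XOR the selected rows of four Gray tables into a (half) row of C,
> >      * 128 bits at a time; dst and t0..t3 must be 16-byte aligned. */
> >     static void combine4_sse2(__m128i *dst,
> >                               const __m128i *t0, const __m128i *t1,
> >                               const __m128i *t2, const __m128i *t3,
> >                               size_t n128)
> >     {
> >         for (size_t i = 0; i < n128; i++) {
> >             __m128i acc = _mm_load_si128(dst + i);            /* 1 load  */
> >             acc = _mm_xor_si128(acc, _mm_load_si128(t0 + i)); /* 4 loads */
> >             acc = _mm_xor_si128(acc, _mm_load_si128(t1 + i)); /* and     */
> >             acc = _mm_xor_si128(acc, _mm_load_si128(t2 + i)); /* 4 XORs  */
> >             acc = _mm_xor_si128(acc, _mm_load_si128(t3 + i));
> >             _mm_store_si128(dst + i, acc);                    /* 1 store */
> >         }
> >     }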
>
> > This is another example where copying the data (the half rows) out and
> > reordering it so that it has better locality would probably make a big
> > difference. That sort of thing always works exceptionally well on AMD
> > chips.
>
> > Bill.
>
> > On 17 May, 21:05, Bill Hart <[EMAIL PROTECTED]> wrote:
>
> > > That's looking good. Would you like me to run it on an unburdened
> > > Opteron to see how it goes? If you like, you can send me a tarball and
> > > I'll try it out.
>
> > > I think our best bet for a significant improvement now is the idea of
> > > using two Gray tables of half the size simultaneously. I also realised
> > > it possibly improves the cache performance for the A matrix too.
>
> > > I was casually wondering whether Magma might use a highly optimised
> > > version of Winograd's algorithm instead of the naive algorithm. But over
> > > GF(2) I think it probably actually takes longer, since it basically
> > > replaces n^2 full length scalar multiplies by n^2 half length ones and
> > > 2*n^2 half row additions, plus a pile of other overhead.
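>
> > > For reference, the identity I mean is Winograd's inner-product formula
> > > (standard statement, for n even), which is where the half-length
> > > multiplies and extra additions above come from:
>
> > >     x.y = \sum_{j=1}^{n/2} (x_{2j-1} + y_{2j}) (x_{2j} + y_{2j-1})
> > >           - \sum_{j=1}^{n/2} x_{2j-1} x_{2j}
> > >           - \sum_{j=1}^{n/2} y_{2j-1} y_{2j}
>
> > > The last two sums depend on only one row of A or one column of B, so
> > > they can be precomputed once per row/column; each of the n^2 inner
> > > products then needs only n/2 multiplications plus the extra additions.
> > > Over GF(2) all the additions and subtractions are just XORs.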
>
> > > Bill.
>
> > > On 17 May, 20:32, Martin Albrecht <[EMAIL PROTECTED]>
> > > wrote:
>
> > > > On Saturday 17 May 2008, Martin Albrecht wrote:
>
> > > > > > I think a better idea would be to explicitly force all matrices and
> > > > > > all rows to be 128-bit aligned if the matrices are wide enough to
> > > > > > benefit from SSE2. Then the combine function can always use SSE2 and
> > > > > > there will be no need to check for alignment.
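>
> > > > > > A minimal sketch of what forcing the alignment could look like
> > > > > > (posix_memalign is standard POSIX; this is only an illustration,
> > > > > > not how the library currently allocates its rows):
>
> > > > > >     #include <stdlib.h>
> > > > > >     #include <stdint.h>
> > > > > >
> > > > > >     /* Allocate a row of `words` 64-bit words on a 16-byte boundary,
> > > > > >      * padding the length to a multiple of 128 bits so an SSE2 loop
> > > > > >      * can run over the whole row without an unaligned tail. */
> > > > > >     static uint64_t *alloc_row_aligned(size_t words)
> > > > > >     {
> > > > > >         void *p = NULL;
> > > > > >         size_t bytes = (words * sizeof(uint64_t) + 15) & ~(size_t)15;
> > > > > >         if (posix_memalign(&p, 16, bytes) != 0)
> > > > > >             return NULL;
> > > > > >         return (uint64_t *)p;
> > > > > >     }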
>
> > > > > That doesn't seem to make a noticeable difference for me (on C2D).
> > > > > However, I realised that the multiplications are faster where the
> > > > > target matrix is a real matrix rather than a window (which has bad
> > > > > data locality). Copying everything over doesn't seem like a good
> > > > > idea, but it at least indicates an area for improvement.
>
> > > > Okay, if I only copy when we crossover to M4RM then the memory overhead
> > > > is constant (~ cutoff^2) and the performance still improves.
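>
> > > > In case the copy-at-the-crossover step is unclear, this is the kind of
> > > > thing I mean (sketch only, illustrative names): the rows of a window
> > > > are some stride apart in the parent matrix, so copying them into a
> > > > contiguous cutoff x cutoff buffer before M4RM costs only ~cutoff^2 bits
> > > > of extra memory and gives the base case dense, cache-friendly rows.
>
> > > >     #include <stdint.h>
> > > >     #include <string.h>
> > > >
> > > >     /* Copy `rows` rows of `words` words each from a strided window
> > > >      * into a contiguous buffer (copy back the same way afterwards). */
> > > >     static void window_to_dense(uint64_t *dense, const uint64_t *win,
> > > >                                 size_t rows, size_t words, size_t stride)
> > > >     {
> > > >         for (size_t i = 0; i < rows; i++)
> > > >             memcpy(dense + i * words, win + i * stride,
> > > >                    words * sizeof(uint64_t));
> > > >     }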
>
> > > > Old: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo
> > > > Matrix Dimension     Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
> > > > 10,000 x 10,000      2.920                    3.610
> > > > 16,384 x 16,384      11.140                   12.120
> > > > 20,000 x 20,000      20.370                   24.390
> > > > 32,000 x 32,000      74.290                   94.910
>
> > > > New: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo
> > > > Matrix Dimension     Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
> > > > 10,000 x 10,000      2.920                    2.990
> > > > 16,384 x 16,384      11.140                   11.750
> > > > 20,000 x 20,000      20.370                   21.180
> > > > 32,000 x 32,000      74.290                   86.570
>
> > > > On Opteron things don't look this way, but I think sage.math is pretty
> > > > heavily used right now, so my benchmarks there are not very telling.
>
> > > > Martin
>
> > > > --
> > > > name: Martin Albrecht
> > > > _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> > > > _www: http://www.informatik.uni-bremen.de/~malb
> > > > _jab: [EMAIL PROTECTED]