> Old: 64-bit Debian/GNU Linux, 2.33Ghz Core2Duo > Matrix Dimension Magma 2.14-13 (64-bit) M4RI-20080517 (64-bit) > 10,000 x 10,000 2.920 3.610 > 16,384 x 16,384 11.140 12.120 > 20,000 x 20,000 20.370 24.390 > 32,000 x 32,000 74.290 94.910 > > New: 64-bit Debian/GNU Linux, 2.33Ghz Core2Duo > Matrix Dimension Magma 2.14-13 (64-bit) M4RI-20080517 (64-bit) > 10,000 x 10,000 2.920 2.990 > 16,384 x 16,384 11.140 11.750 > 20,000 x 20,000 20.370 21.180 > 32,000 x 32,000 74.290 86.570
If you take this + Bill's idea: > For 10000x10000 we currently use k = 6. Instead of this, we could use > k = 5 and make two Gray tables simultaneously. This will still fit in > cache. > > Instead of doing 6 bits at a time, we can then do 10 bits at a time. > We'd load the appropriate line from the first Gray table, then the > appropriate one from the second and xor them, then xor with the output > matrix. This should decrease the number of loads and stores > considerably. Moreover, the SSE instructions will then be much more > efficient as the ratio of arithmetic instructions to loads and stores > is higher. > > Of course one could also do 16 bits at a time, by doing 4 tables, but > I think this might actually get slower again since you've only > increased the amount of work done by 60%, but you've had a 30 % > increase in instructions. You get (on the C2D): sage: B = random_matrix(GF(2), 3.2*10^4, 3.2*10^4) sage: A = random_matrix(GF(2), 3.2*10^4, 3.2*10^4) sage: time C= A._multiply_strassen(B,cutoff=2^11) CPU times: user 75.82 s, sys: 0.22 s, total: 76.04 s Wall time: 76.31 sage: A = random_matrix(GF(2), 2*10^4, 2*10^4) sage: B = random_matrix(GF(2), 2*10^4, 2*10^4) sage: time C= A._multiply_strassen(B,cutoff=2^11) CPU times: user 19.14 s, sys: 0.09 s, total: 19.24 s Wall time: 19.29 sage: B = random_matrix(GF(2), 2^14, 2^14) sage: A = random_matrix(GF(2), 2^14, 2^14) sage: time C= A._multiply_strassen(B,cutoff=2^11) CPU times: user 10.62 s, sys: 0.05 s, total: 10.67 s Wall time: 10.70 sage: B = random_matrix(GF(2), 10^4, 10^4) sage: A = random_matrix(GF(2), 10^4, 10^4) sage: time C= A._multiply_strassen(B,cutoff=2^11) CPU times: user 2.73 s, sys: 0.02 s, total: 2.75 s Wall time: 2.76 i.e the speed of my current Magma install on the same computer (mind you, this one might not be optimised for the C2D but for the Opteron, I don't know). The times above don't have SSE2 yet. I guess documenting Bill's tricks - process the rows of A in blocks - use two rather than one Gray code table well is in order since now M4RM looks quite different from the original algorithm. I'll do that tomorrow. Again, thanks Bill! Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED] --~--~---------~--~----~------------~-------~--~----~ To post to this group, send email to sage-devel@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/sage-devel URLs: http://www.sagemath.org -~----------~----~----~----~------~----~------~--~---