Hi Martin, The cycle counter in your bench code seems to give random values on the 2.4GHz Opteron with SUSE 9 linux that I have access to, which has Magma 2-14.10 on it.
Anyhow, here are the Magma times: > A := RandomMatrix(GF(2),10^4,10^4); > B := RandomMatrix(GF(2),10^4,10^4); > t := Cputime(); C := A*B; Cputime(t); 2.940 > A := RandomMatrix(GF(2),2*10^4,2*10^4); > B := RandomMatrix(GF(2),2*10^4,2*10^4); > t := Cputime(); C := A*B; Cputime(t); 16.570 > A := RandomMatrix(GF(2),2^14,2^14); > B := RandomMatrix(GF(2),2^14,2^14); > t := Cputime(); C := A*B; Cputime(t); 9.250 The corresponding times for your code are approximately (hand timed; adjusted for generation of random matrices): 10^4: 8s 2*10^4: 59s 2^14: 43s I think it is polynomials over GF2 which Magma 2-14.x is much faster at rather than linear algebra. Bill. On 14 May, 22:52, Martin Albrecht <[EMAIL PROTECTED]> wrote: > Oh gods of the cpu cycle ... or hi there, > > (this e-mail contains details on a particular implementation for > GF(2) linear algebra, feel free to ignore it if that doesn't get you going) > > I've just submitted a new (much improved) version of the M4RI library for > inclusion in Sage: > > http://trac.sagemath.org/sage_trac/ticket/3204 > > M4RI is a library by Gregory Bard and myself for fast arithmetic with dense > matrices over GF(2). M4RI stands for the "Method of the Four Russians" > inversion which is an algorithm for matrix reduction first published by Greg. > The website is here: > > http://sage.math.washington.edu/home/malb/m4ri/ > > This new version has among other things two new features: > - SSE2 support where support (XOR two 128-bit words in one instruction) > - Strassen-Winograd matrix multiplication on top of M4RM ("Method of the Four > Russians" multiplication or Kronrod's algorithm) > > I'm writing to [sage-devel] because I observed some odd performance > characteristics and hope to pick your brains on how to improve performance: > > At least in my experience enabling SSE2 results in slower code on AMD CPUs. I > think this is rumoured across the net but did anyone else also witness this > behaviour? I'm only using 128-bit integer XOR instrinsics. On my Core2Duo > notebook SSE2 speeds up the computation at least up to the size of the L2 > cache. > > One would expect that the M4RM + Strassen implementation would have an easy > time beating Magma since it uses a better algorithm. Same as Magma it does > Strassen-Winograd down to the cutoff but then it switches to the M4RM > algorithm (O(n^3/log_2(n))) rather than naive multiplication (O(n^3)). This > should give a nice constant speed-up. This theory so far seems to match > reality on a Core2Duo notebook with 4MB L2 cache: > > ---------------------------------------------------------------------- > | SAGE Version 3.0.1, Release Date: 2008-05-05 | > | Type notebook() for the GUI, and license() for information. | > ---------------------------------------------------------------------- > sage: A = random_matrix(GF(2),10^4,10^4) > sage: B = random_matrix(GF(2),10^4,10^4) > sage: time C = A._multiply_strassen(B,cutoff=3200) > CPU times: user 4.50 s, sys: 0.03 s, total: 4.53 s > Wall time: 4.56 > > Magma V2.13-5 Wed May 14 2008 21:55:31 on road [Seed = 2246483032] > Type ? for help. Type <Ctrl>-D to quit.> A := RandomMatrix(GF(2),10^4,10^4); > > B := RandomMatrix(GF(2),10^4,10^4); > > t := Cputime(); C := A*B; Cputime(t); > > 7.170 > > sage: A = random_matrix(GF(2),2*10^4,2*10^4) > sage: B = random_matrix(GF(2),2*10^4,2*10^4) > sage: time C = A._multiply_strassen(B,cutoff=3200) > CPU times: user 33.35 s, sys: 0.09 s, total: 33.44 s > Wall time: 33.59 > > > A := RandomMatrix(GF(2),2*10^4,2*10^4); > > B := RandomMatrix(GF(2),2*10^4,2*10^4); > > t := Cputime(); C := A*B; Cputime(t); > > 49.990 > > Finally, a perfect cutting example (no extra rows/columns to take care of): > > sage: A = random_matrix(GF(2),1<<14,1<<14) > sage: B = random_matrix(GF(2),1<<14,1<<14) > sage: time C = A._multiply_strassen(B,cutoff=1<<10) # !=3200 > CPU times: user 16.82 s, sys: 0.07 s, total: 16.88 s > Wall time: 16.96 > > > A:= RandomMatrix(GF(2),2^14,2^14); > > B:= RandomMatrix(GF(2),2^14,2^14); > > t := Cputime(); C := A*B; Cputime(t); > > 24.060 > > Note however that this comparison is NOT FAIR since this version of Magma is > not optimised for the C2D chip! > > However, on AMD CPUs the times don't look that nice. Both on sage.math and on > a > > AMD Athlon(tm) 64 X2 Dual Core Processor 6400+ > > the performance is way way behind Magma (or what I'd expect Magma to perform > like), e.g. on sage.math: > > sage: A = random_matrix(GF(2),10^4,10^4) > sage: B = random_matrix(GF(2),10^4,10^4) > sage: time C = A._multiply_strassen(B,cutoff=1500) > CPU times: user 14.40 s, sys: 0.17 s, total: 14.58 s > Wall time: 14.62 > > Magma V2.13-5 Wed May 14 2008 14:07:18 on sage [Seed = 4255048009] > Type ? for help. Type <Ctrl>-D to quit.> A := RandomMatrix(GF(2),10^4,10^4); > > B := RandomMatrix(GF(2),10^4,10^4); > > t := Cputime(); C := A*B; Cputime(t); > > 3.890 <<--------- They even beat the C2D time! > > sage: A = random_matrix(GF(2),2*10^4,2*10^4) > sage: B = random_matrix(GF(2),2*10^4,2*10^4) > sage: time C = A._multiply_strassen(B,cutoff=1500) > CPU times: user 99.42 s, sys: 0.64 s, total: 100.06 s > Wall time: 100.06 > > > B := RandomMatrix(GF(2),2*10^4,2*10^4); > > A := RandomMatrix(GF(2),2*10^4,2*10^4); > > t := Cputime(); C := A*B; Cputime(t); > > 27.470 > > # perfect cutoff > sage: A = random_matrix(GF(2),1<<14,1<<14) > sage: B = random_matrix(GF(2),1<<14,1<<14) > sage: time C = A._multiply_strassen(B,cutoff=1<<10) > CPU times: user 100.14 s, sys: 0.32 s, total: 100.46 s > Wall time: 100.45 > sage: time C = A._multiply_strassen(B,cutoff=1<<9) > CPU times: user 76.48 s, sys: 0.39 s, total: 76.87 s > Wall time: 76.87 > sage: time C = A._multiply_strassen(B,cutoff=1<<8) > CPU times: user 52.45 s, sys: 0.33 s, total: 52.78 s > Wall time: 52.78 > > > A := RandomMatrix(GF(2),2^14,2^14); > > B := RandomMatrix(GF(2),2^14,2^14); > > t := Cputime(); C := A*B; Cputime(t); > > 14.970 > > I guess the bad performance of the M4RI library is partly related to the > smaller L2 cache (AMD: 1MB, C2D: 4MB) which forces us to set the cutoff lower > (since we want all three C = A*B submatrices to fit in L2 cache). Also it > seems that Magma is almost always ~ 3.5 times faster than our implementation. > I am slowly running out of ideas what to improve in order to close to the gap > so if anyone has some ideas please let me hear them. It might be worth noting > that we are using the same cutting strategy as Sage in strassen.pyx which > ignores uneven rows/columns at every recursion step and deals with the extra > rows/columns after the Strassen-Winograd formula (see strassen.pyx for > details). We are also using the improved operation schedule from paper of > Clement et al.Though I don't know if the cutting strategy is optimal, we also > get beaten in a perfect cutting scenario, so I guess it is not that important > for now. Anyway, I'd happily provide profiling data if anyone wants to take a > look and squeeze a couple of cycles out. If not, then this e-mail is jut a > quite long status report :-) > > Btw. I don't have access to Magma 2.14 which I believe to be the fastest in > linear algebra over GF(2). In version 2.14 they added SSE2 support too. > So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear > about the results. I don't know what else to compare with except Magma. Any > suggestions? Thoughts? Rants? > > Cheers, > Martin > > -- > name: Martin Albrecht > _pgp:http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99 > _www:http://www.informatik.uni-bremen.de/~malb > _jab: [EMAIL PROTECTED] --~--~---------~--~----~------------~-------~--~----~ To post to this group, send email to sage-devel@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/sage-devel URLs: http://www.sagemath.org -~----------~----~----~----~------~----~------~--~---