Bill, I do get a small speed-up on the Core2Duo with SSE2, but I'm not sure it is worth the trouble (I agree that it makes the otherwise pretty-looking code ugly).
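To make the trade-off concrete, here is a minimal sketch of the kind of change in question. It is not the actual library code: the names xor_row_plain and xor_row_sse2 are hypothetical, and it assumes that rows are packed into 64-bit words and that combining two rows means XOR-ing them word by word. The SSE2 path handles two words per step and falls back to the plain loop for the tail, or entirely when SSE2 is not available at compile time.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Plain version: XOR src into dst one 64-bit word at a time. */
static void xor_row_plain(uint64_t *dst, const uint64_t *src, size_t nwords) {
    for (size_t i = 0; i < nwords; i++)
        dst[i] ^= src[i];
}

/* SSE2 version (hypothetical sketch): XOR two 64-bit words per step.
 * Unaligned loads/stores are used so that no alignment is assumed. */
static void xor_row_sse2(uint64_t *dst, const uint64_t *src, size_t nwords) {
    size_t i = 0;
#ifdef __SSE2__
    for (; i + 2 <= nwords; i += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
#endif
    /* Remaining words (or all of them when SSE2 is unavailable). */
    for (; i < nwords; i++)
        dst[i] ^= src[i];
}
```

Both functions compute the same result; the only question is whether the extra intrinsics and the tail loop buy enough speed to justify the clutter.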
I have some timings (for an old implementation) here:

http://trac.sagemath.org/sage_trac/ticket/3204#comment:2

My guess is that SSE2 is slower on the Opteron because SSE2 is basically an Intel thing and only provided by AMD for compatibility reasons. There are several reports of SSE2 being slow on the Opteron, and I guess the SSE2 integer operations were not optimised for speed since SSE is mainly about floating point.

One thing I noticed on the Opteron is that it makes a huge difference whether I put the code in mzd_combine or put the same code directly in the calling function. I blamed it on better cache prefetching support, but that conclusion was probably premature.

My proposal:

- This evening I'll update my code with your 8 Gray tables and check the performance on the C2D.
- Then I'll re-introduce SSE2 and check whether it makes a worthwhile difference; if not, we drop SSE2 from the multiplication.

Martin

PS: Yesterday I tried a quick and dirty OpenMP-based (which is cool, btw) parallel implementation of Strassen-Winograd, and as is it gives a speedup of 1.8 or so (so not optimal yet). But comparing that with Magma feels like cheating: first we should aim for better speed with the same resources, and then we switch to parallel implementations for even better times. Anyway, one week ago I wouldn't have believed that I could do a 10^4 x 10^4 matrix multiplication in 1.7 seconds on my notebook :-)

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]