You seemed to be getting up to 8% at points there. That's definitely worth it. I'll be interested to see this evening how it comes out, though I recommend optimising my combine3 function (which I suppose should now be combine8), and even including it inline rather than keeping it in a separate file.
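To make the inlining suggestion concrete, here is a rough sketch of what an eight-table combine might look like when it lives in the same file as its caller. The signature, the `wide` parameter (row length in 64-bit words) and the table layout are assumptions for illustration only, not the actual code under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical combine8: XOR one row from each of eight lookup tables
 * into dst. "wide" is the row length in 64-bit words. Declared
 * static inline so the compiler can expand it at the call site
 * instead of calling into a separate translation unit. */
static inline void combine8(uint64_t *dst, const uint64_t *const *t, size_t wide)
{
    for (size_t i = 0; i < wide; i++) {
        dst[i] ^= t[0][i] ^ t[1][i] ^ t[2][i] ^ t[3][i]
                ^ t[4][i] ^ t[5][i] ^ t[6][i] ^ t[7][i];
    }
}
```

Doing all eight XORs in one pass means the destination row is read and written once rather than eight times, which is presumably where most of the gain over repeated single-row combines would come from.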
Of course, on the Opteron SSE should be switched off, since it is definitely slower by about 5%-10%, even with careful optimisation.

Bill.

On 19 May, 14:23, Martin Albrecht <[EMAIL PROTECTED]> wrote:
> Bill,
>
> I do get a small speed-up on the Core2Duo for SSE2 but I'm not sure it is
> worth the trouble (I agree that it makes the otherwise pretty looking code
> ugly).
>
> I have some timings (for an old implementation) here:
>
> http://trac.sagemath.org/sage_trac/ticket/3204#comment:2
>
> My guess is that SSE2 is slower on the Opteron because SSE2 is basically an
> Intel thing and only provided by AMD for compatibility reasons. There are
> several reports of SSE2 being slow on the Opteron and I guess the SSE2
> integer operations were not focused for speed since MMX/SSE is all about
> floating point mainly.
>
> One thing I noticed on the Opteron is that if I put the code in mzd_combine
> vs. putting the same code directly in the function made a huge difference. I
> blamed it on better cache prefetching support but that was probably
> premature.
>
> My proposal:
> - This evening I'll update my code with your 8 Gray tables and check the
>   performance on the C2D
> - Then I'll re-introduce SSE2 and check whether it makes a worthy difference,
>   if not we drop SSE2 from the multiplication.
>
> Martin
>
> PS: I tried a quick and dirty OpenMP (which is cool, btw) based parallel
> implementation of Strassen-Winograd yesterday and it gives - as is - a
> speedup of 1.8 (so not optimal yet) or so. But comparing that with Magma
> feels like cheating, first we should aim for better speed with the same
> resources and then we switch to parallel implementations for even better
> times.
> Anyway, I wouldn't have believed that I can do a 10^4 x 10^4 matrix
> multiplication in 1.7 seconds on my notebook one week ago :-)
>
> --
> name: Martin Albrecht
> _pgp:http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> _www:http://www.informatik.uni-bremen.de/~malb
> _jab: [EMAIL PROTECTED]
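
For reference, one way the SSE2 path could be made switchable per machine is a compile-time flag around the row XOR. This is only a sketch: `USE_SSE2` and `xor_row` are made-up names, and unaligned loads are used so that no alignment of the rows is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(USE_SSE2) && defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical row XOR: dst ^= src over "wide" 64-bit words.
 * With -DUSE_SSE2 (and compiler SSE2 support) two words are handled
 * per iteration using 128-bit integer XOR; otherwise, or for the
 * leftover word, plain 64-bit XOR is used. */
static inline void xor_row(uint64_t *dst, const uint64_t *src, size_t wide)
{
    size_t i = 0;
#if defined(USE_SSE2) && defined(__SSE2__)
    for (; i + 2 <= wide; i += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
#endif
    for (; i < wide; i++)
        dst[i] ^= src[i];
}
```

Building without `-DUSE_SSE2` on the Opteron and with it on the Core2Duo would let the same source be timed both ways on each machine.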