You seemed to be getting a speed-up of up to 8% at points there. That's
definitely worth it. I'll be interested to see how it comes out this
evening, though I recommend optimising my combine3 function (which I
suppose should now be called combine8), perhaps even including it inline
rather than keeping it in a separate file.
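
To make the inlining suggestion concrete, here is a rough sketch of what
an inlined combine8 could look like. The name, the signature and the
64-bit word type are assumptions for illustration only, not the actual
code under discussion:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: XOR one row from each of eight Gray-code tables
   into the destination row in a single pass. `wide` is the number of
   64-bit words per row. Marked static inline so the compiler can expand
   it at the call site instead of calling into a separate file. */
static inline void combine8(uint64_t *dst,
                            const uint64_t *const tables[8],
                            size_t wide)
{
    for (size_t i = 0; i < wide; i++) {
        dst[i] ^= tables[0][i] ^ tables[1][i] ^ tables[2][i] ^ tables[3][i]
                ^ tables[4][i] ^ tables[5][i] ^ tables[6][i] ^ tables[7][i];
    }
}
```

Doing all eight XORs in one loop means each destination word is read and
written once, rather than once per table.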

Of course on the Opteron, SSE should be switched off, since it is
definitely slower by about 5%-10% even with careful optimisation.
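
One way to switch SSE off per machine is a compile-time guard. The
following is only a sketch: DISABLE_SSE2 and xor_row are made-up names,
and the real code may be organised quite differently:

```c
#include <stddef.h>
#include <stdint.h>

/* DISABLE_SSE2 is a hypothetical build flag, e.g. -DDISABLE_SSE2 when
   building on an Opteron. __SSE2__ is defined by GCC and Clang when the
   target supports SSE2. */
#if defined(__SSE2__) && !defined(DISABLE_SSE2)
#include <emmintrin.h>
#define HAVE_SSE2_XOR 1
#else
#define HAVE_SSE2_XOR 0
#endif

/* XOR `wide` 64-bit words of src into dst. */
static inline void xor_row(uint64_t *dst, const uint64_t *src, size_t wide)
{
    size_t i = 0;
#if HAVE_SSE2_XOR
    /* Two 64-bit words per 128-bit register; unaligned loads and stores
       so the sketch makes no assumption about row alignment. */
    for (; i + 2 <= wide; i += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
#endif
    /* Scalar path: handles the odd tail word, or the whole row when
       SSE2 is disabled. */
    for (; i < wide; i++) {
        dst[i] ^= src[i];
    }
}
```

Both paths give the same result, so the flag can be flipped to time the
two variants on each machine.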

Bill.

On 19 May, 14:23, Martin Albrecht <[EMAIL PROTECTED]>
wrote:
> Bill,
>
> I do get a small speed-up on the Core2Duo for SSE2 but I'm not sure it is
> worth the trouble (I agree that it makes the otherwise pretty-looking code
> ugly).
>
> I have some timings (for an old implementation) here:
>
>    http://trac.sagemath.org/sage_trac/ticket/3204#comment:2
>
> My guess is that SSE2 is slower on the Opteron because SSE2 is basically an
> Intel thing and only provided by AMD for compatibility reasons. There are
> several reports of SSE2 being slow on the Opteron, and I guess the SSE2
> integer operations were not optimised for speed since MMX/SSE is mainly
> about floating point.
>
> One thing I noticed on the Opteron is that putting the code in mzd_combine
> vs. putting the same code directly in the function made a huge difference.
> I blamed it on better cache prefetching support but that was probably
> premature.
>
> My proposal:
>  - This evening I'll update my code with your 8 Gray tables and check the
> performance on the C2D
>  - Then I'll re-introduce SSE2 and check whether it makes a worthy difference,
> if not we drop SSE2 from the multiplication.
>
> Martin
>
> PS: I tried a quick and dirty OpenMP (which is cool, btw) based parallel
> implementation of Strassen-Winograd yesterday and it gives - as is - a
> speedup of 1.8 or so (so not optimal yet). But comparing that with Magma
> feels like cheating: first we should aim for better speed with the same
> resources, and then we switch to parallel implementations for even better
> times. Anyway, one week ago I wouldn't have believed that I could do a
> 10^4 x 10^4 matrix multiplication in 1.7 seconds on my notebook :-)
>
> --
> name: Martin Albrecht
> _pgp:http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> _www:http://www.informatik.uni-bremen.de/~malb
> _jab: [EMAIL PROTECTED]