I had a look at gf2x. The reason they can make use of SSE is the shl and shr capabilities: shifting 128 bits at a time is more efficient, since a single SSE operation replaces quite a few instructions using general purpose registers.
Bill.

On 19 May, 13:52, Bill Hart <[EMAIL PROTECTED]> wrote:
> Martin,
>
> does your machine speed up at all with any of the SSE code? I spent considerable time yesterday trying to optimise the combine3 function I wrote (note: I didn't submit this improved code). Whilst I did speed it up considerably by removing %'s and /'s, changing the function prototype to accept rows directly, and lots of other minor improvements, the overall result was still much slower than the generic C code.
>
> If the SSE code doesn't ever speed it up (and it might not be faster to use SSE when there is at least one load/store per arithmetic operation), then it might be worth throwing all the SSE code out completely, since it just pollutes what is otherwise very nice looking code.
>
> I did check, by compiling the code to assembler without assembling it (the -S option in gcc), that it is actually using SSE instructions and that it is doing this in what I would consider an optimal way. So it really is SSE itself that is slow, not just inefficiencies introduced by the compiler.
>
> One possible avenue for exploration is the gf2x library by Zimmermann, Brent, et al. They use SSE in their code for polynomials over GF(2), apparently to good effect. We could look to see if there is some trick to using it efficiently. The difference there, however, is probably that the polynomials get quite long and multiple arithmetic operations need to be done for each store/load. I don't know if any of their tricks can help us for that reason.
>
> Bill.
>
> On 19 May, 02:00, Martin Albrecht <[EMAIL PROTECTED]> wrote:
> > On Monday 19 May 2008, Bill Hart wrote:
> > > Well, apparently there are speed gains right up to 8 Gray tables, though I really have no idea why that is.
> > > Here are the new times:
> > >
> > >              Magma    Old     New
> > > 10000x10000: 2.940s   3.13s   2.32s
> > > 16384x16384: 9.250s   12.96s  9.17s
> > > 20000x20000: 16.57s   22.43s  16.49s
> > > 32000x32000: 59.1s    90.2s 81.6s 65.7s
> > >
> > > So now we beat Magma all the way up to 20000x20000.
> > >
> > > I'll put the adjusted code in the directory:
> > >
> > > http://sage.math.washington.edu/home/wbhart/m4ri3gray/
> > >
> > > in just a moment. I also found a memory leak in my code which I fixed.
> > >
> > > Bill.
> >
> > Cool, I'll take a look tomorrow. It seems we're closing in on the classical algorithm though. Although, you don't seem to decrease k any more.
> >
> > As for the copying and 2^14 vs. 10^4: maybe the extra rows/columns cost too many cycles. I'll try valgrind/callgrind/cachegrind on the code to see if that is true. Also, maybe for 2^14 x 2^14 the submatrices are nicely aligned on cache lines. I'm wildly speculating here.
> >
> > Martin
> >
> > --
> > name: Martin Albrecht
> > _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> > _www: http://www.informatik.uni-bremen.de/~malb
> > _jab: [EMAIL PROTECTED]

--
To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org