Here's another idea which should speed things up a bit.

For 10000x10000 we currently use k = 6. Instead, we could use
k = 5 and build two Gray tables simultaneously. The two tables will
still fit in cache.

Instead of doing 6 bits at a time, we can then do 10 bits at a time.
We'd load the appropriate row from the first Gray table, then the
appropriate row from the second, xor the two together, and xor the
result into the output matrix row. This should decrease the number of
loads and stores considerably. Moreover, the SSE instructions will
then be much more efficient, as the ratio of arithmetic instructions
to loads and stores is higher.
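
To make the idea concrete, here is a minimal sketch of the inner step
of the two-table scheme. The names (T0, T1, combine_two_tables), the
flat row-major table layout and the 64-bit word type are assumptions
for illustration, not the actual m4ri code:

    #include <stddef.h>

    typedef unsigned long long word;      /* assume 64-bit words */

    /* T0 and T1 each hold 1 << 5 precomputed rows of `wide` words,
       indexed by the two 5-bit halves of the current 10-bit block. */
    static void combine_two_tables(word *dst,
                                   const word *T0, size_t idx0,
                                   const word *T1, size_t idx1,
                                   size_t wide)
    {
        const word *t0 = T0 + idx0 * wide;  /* row for the low 5 bits  */
        const word *t1 = T1 + idx1 * wide;  /* row for the high 5 bits */
        for (size_t i = 0; i < wide; i++)
            dst[i] ^= t0[i] ^ t1[i];        /* dst touched once per 10 bits */
    }

Compared with two separate k = 5 passes, the output row is loaded and
stored once per 10 bits instead of once per 5, which is where the
saving in loads and stores comes from.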

Of course one could also do 16 bits at a time by using 4 tables, but
I think this might actually get slower again, since you've only
increased the amount of work done by 60% but you've had a 30% increase
in instructions.

Bill.

On 17 May, 17:45, Bill Hart <[EMAIL PROTECTED]> wrote:
> Martin,
>
> The test code still passes if you change RADIX to 128. I've no idea
> how it passes, but it does. Shame the results are not correct, because
> this speeds the code up by a factor of 2.
>
> I notice that in the SSE code, you check whether alignment can be
> achieved and fall back to non-SSE code otherwise. But this introduces
> an unpredictable branch. Also, where there are three operands, you
> can't use SSE2 because the likelihood of all three being aligned is
> too small.
>
> I think a better idea would be to explicitly force all matrices and
> all rows to be 128-bit aligned whenever the matrices are wide enough
> to benefit from SSE2. Then the combine function can always use SSE2
> and there will be no need to check for alignment.
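>
> As a minimal sketch of what the always-aligned combine could look like
> (assuming 64-bit words, an even wide, and every row allocated with
> 16-byte alignment, e.g. via _mm_malloc; the name combine_sse2 is
> illustrative, not the m4ri API):
>
>     #include <emmintrin.h>
>     #include <stddef.h>
>
>     typedef unsigned long long word;    /* assume 64-bit words */
>
>     static void combine_sse2(word *dst, const word *src1,
>                              const word *src2, size_t wide)
>     {
>         __m128i *d = (__m128i *) dst;
>         const __m128i *a = (const __m128i *) src1;
>         const __m128i *b = (const __m128i *) src2;
>         for (size_t i = 0; i < wide / 2; i++)       /* 128 bits per step */
>             _mm_store_si128(d + i,
>                 _mm_xor_si128(_mm_load_si128(a + i),
>                               _mm_load_si128(b + i)));
>     }
>
> With alignment guaranteed by construction, the aligned load and store
> forms can be used unconditionally and the branch disappears.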
>
> I experimented with interleaving MMX and GPR XORs, but this doesn't
> speed anything up. There are more instructions emitted and the time
> stays about the same. The only way interleaving the MMX and GPR code
> would speed things up, I think, is if there were more computation
> going on in the registers and less memory loading and storing.
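>
> For reference, a minimal sketch of what the interleaving looks like,
> alternating one 64-bit xor through an MMX register with one xor in a
> general-purpose register per pair of words (the name and the even
> split of the work are illustrative assumptions):
>
>     #include <mmintrin.h>
>     #include <stddef.h>
>
>     typedef unsigned long long word;    /* assume 64-bit words */
>
>     static void combine_interleaved(word *dst, const word *src1,
>                                     const word *src2, size_t wide)
>     {
>         size_t i = 0;
>         for (; i + 2 <= wide; i += 2) {
>             /* one xor through an MMX register ... */
>             *(__m64 *)(dst + i) =
>                 _mm_xor_si64(*(const __m64 *)(src1 + i),
>                              *(const __m64 *)(src2 + i));
>             /* ... interleaved with one xor in a general-purpose register */
>             dst[i + 1] = src1[i + 1] ^ src2[i + 1];
>         }
>         for (; i < wide; i++)           /* odd word, if any */
>             dst[i] = src1[i] ^ src2[i];
>         _mm_empty();                    /* clear the MMX/FPU state */
>     }
>
> Both streams still issue the same loads and stores per word, which is
> consistent with the timing staying the same.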
>
> Bill.
>
> On 17 May, 15:45, Bill Hart <[EMAIL PROTECTED]> wrote:
>
> > Hi Martin,
>
> > Here is another 10% improvement. In the loop at the bottom of
> > mzd_combine you can explicitly unroll by a factor of 8:
>
> >     word * end = b1_ptr + wide;
> >     register word * end8 = end - 8;        /* 8 words before the end */
> >     while (b1_ptr < end8)                  /* unrolled: 8 words per iteration */
> >     {
> >          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
> >          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
> >          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
> >          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
> >          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
> >          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
> >          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
> >          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
> >     }
> >     while (b1_ptr < end)                   /* handle the remaining words */
> >     {
> >          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
> >     }
>
> > I did this in combination with changing the crossover for 10000x10000
> > from 3600 to 7200.
>
> > Bill.
>
> > On 17 May, 09:40, Martin Albrecht <[EMAIL PROTECTED]>
> > wrote:
>
> > > On Saturday 17 May 2008, Bill Hart wrote:
>
> > > > In going from 5000x5000 to 10000x10000 Magma's time increases by a
> > > > factor of less than 4. That is impossible with straightforward
> > > > methods: doubling the dimension multiplies the classical work by 8,
> > > > and even Strassen only reduces that to 7, so Strassen will never help
> > > > us there. They must be doing something else. Probably something
> > > > clever.
>
> > > > Bill.
>
> > >  I was stuck there too yesterday. Maybe the pipeline only gets fully
> > > utilised at 10000x10000?
>
> > > Martin
>
> > > PS: If we run out of ideas we can simply go for parallelism; that
> > > should help on sage.math ;-)
>
> > > --
> > > name: Martin Albrecht
> > > _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> > > _www: http://www.informatik.uni-bremen.de/~malb
> > > _jab: [EMAIL PROTECTED]