> > However, your other solutions are better.
> >
> >
> > > > >
> > > > > mask = (FM & 1);
> > > > > mask |= (FM << 3) & 0x10;
> > > > > mask |= (FM << 6) & 0x100;
> > > > > mask |= (FM << 9) & 0x1000;
> > > > > mask |= (FM << 12) & 0x10000;
> > > > > mask |= (FM << 15) & 0x100000;
> > > > > mask |= (FM << 18) & 0x1000000;
> > > > > mask |= (FM << 21) & 0x10000000;
> > > > > mask *= 15;
> > > > >
> > > > > should do the job, in less code space and without a single branch.
...
> > > > > Another way of optimizing this could be:
> > > > >
> > > > > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > > > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > > > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > > > mask *= 15;
...
> Ok, if you have measured that method1 is faster than method2, let us go for 
> it.
> I believe method2 would be faster if you had a large out-of-order execution
> window, because more parallelism can be extracted from it, but this is 
> probably
> only true for high end cores, which do not need FPU emulation in the first 
> place.

FWIW the second has a long dependency chain on 'mask', whereas in the first
the eight shift/and terms are independent and can execute in any order before
the results are merged.
So on most superscalar cpus, or ones with result delays for arithmetic, the
first is likely to be faster.

        David



_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev