On Sat, May 12, 2012 at 11:36 AM, Maciej W. Rozycki
<ma...@linux-mips.org> wrote:
> On Sun, 6 May 2012, Andrew Pinski wrote:
>
>> >  For the record: MIPS processors that implement CLZ/CLO (for some reason
>> > CTZ/CTO haven't been added to the architecture, but these operations can
>> > be cheaply transformed into CLZ/CLO) generally have a dedicated unit that
>> > causes no pipeline stall for these instructions even in the simplest
>> > pipeline designs like the M4K -- IOW they are issued at the usual one
>> > instruction per pipeline clock rate.
>>
>> Even on Octeon this is true.  Though Octeon has seq/sneq too so it
>> does not matter in the end.
>
>  Does Octeon's pipeline qualify as simple?  For some reason I've thought
> it is a high-performance core.  The M4K is one of the smallest/simplest
> MIPS chips ever built.

Yes, the Octeon's pipeline qualifies as simple.  It is still an
in-order pipeline with few stages.  The core's high performance comes
from its clock rate rather than from the pipeline design, and from the
number of cores on one chip.

>
>  And actually all MIPS processors (back to 1985's MIPS I ISA) support
> two-instruction set-if-equal and set-if-not-equal sequences:
>
>        xor     rd, rt, rs
>        sltiu   rd, rd, 1
>
> and:
>
>        xor     rd, rt, rs
>        sltu    rd, zero, rd
>
> respectively, that may still be more beneficial than any possible
> alternatives, especially ones involving branches.
>
>> Note I originally was the one who proposed this optimization for
>> PowerPC even before I saw what XLC did.  See PR 10588 (which I filed 9
>> years ago)  and it seems we are about to fix it soon.
>
>  For that -- set-if-zero and set-if-non-zero -- you can use the
> instructions as above (that are supported by all MIPS processors):
>
>        sltiu   rd, rs, 1
>
> and
>
>        sltu    rd, zero, rs
>
> However GCC doesn't seem smart enough to use them well with your example.
> I'd expect something like:
>
>        sltiu   $4, $4, 1
>        sltiu   $2, $5, 1
>        jr      $31
>         or     $2, $4, $2
>
> however I get:
>
>        beq     $4, $0, .L3
>         nop
>        jr      $31
>         sltiu  $2, $5, 1
> .L3:
>        jr      $31
>         li     $2, 1
>
> which is never faster and obviously not smaller either.  And there is
> really no need to avoid the second comparison as per logical OR rules here
> -- it's all in registers.


I have a few patches already in my queue to submit upstream to improve
the above case for MIPS.
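
For reference, here is a rough C model of the two forms being compared
above.  It is only a sketch: the test function is my reconstruction from
the quoted assembly (assumed to be "a == 0 || b == 0" on two register
arguments), and the helper names set_if_zero and set_if_nonzero are
made up for illustration, not anything in GCC.

```c
#include <stdint.h>

/* Models "sltiu rd, rs, 1": rd = 1 if rs == 0, else 0. */
static inline uint32_t set_if_zero(uint32_t rs)
{
    return rs < 1u;
}

/* Models "sltu rd, zero, rs": rd = 1 if rs != 0, else 0. */
static inline uint32_t set_if_nonzero(uint32_t rs)
{
    return 0u < rs;
}

/* Assumed source form: the short-circuit || is what invites the
   beq/nop sequence quoted above. */
uint32_t either_zero_branchy(uint32_t a, uint32_t b)
{
    return a == 0 || b == 0;
}

/* Branchless form matching the expected sltiu/sltiu/or sequence.
   Evaluating both comparisons is safe because neither has side
   effects; both operands are already in registers. */
uint32_t either_zero_branchless(uint32_t a, uint32_t b)
{
    return set_if_zero(a) | set_if_zero(b);
}
```

Both functions return the same value for every input, which is why the
compiler is free to pick the branchless sequence here.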


>
>  This pessimisation is avoided for MIPS IV and more recent processors that
> have move-if-non-zero however (and the second comparison is always
> evaluated):
>
>        sltiu   $5, $5, 1
>        li      $2, 1
>        jr      $31
>         movn   $2, $5, $4
>
> Any chance to get it better with the fix you've mentioned?

The above is worse than using the or, at least on the Octeon, as movn
takes 3 cycles while or takes only 1.  As I mentioned, I have a few
patches in my queue which improve the code for MIPS (and other targets
too), but I have not gotten around to submitting them upstream because
I have been busy working on more patches.
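
To make the comparison concrete, here is a small C model of the quoted
movn sequence next to the or-based one.  It is a sketch under the same
assumption as before, that the function being compiled is
"a == 0 || b == 0"; the names movn_model, either_zero_movn and
either_zero_or are hypothetical.  It only shows that the two sequences
compute the same result; it says nothing about latency by itself.

```c
#include <stdint.h>

/* Models "movn rd, rs, rt": rd = rs if rt != 0, otherwise rd keeps
   its old value. */
static inline uint32_t movn_model(uint32_t rd, uint32_t rs, uint32_t rt)
{
    return rt != 0 ? rs : rd;
}

/* The MIPS IV sequence quoted above, one statement per instruction. */
uint32_t either_zero_movn(uint32_t a, uint32_t b)
{
    uint32_t t = b < 1u;          /* sltiu $5, $5, 1 */
    uint32_t r = 1;               /* li    $2, 1     */
    return movn_model(r, t, a);   /* movn  $2, $5, $4 */
}

/* The or-based alternative: two sltiu feeding a single or. */
uint32_t either_zero_or(uint32_t a, uint32_t b)
{
    return (a < 1u) | (b < 1u);
}
```

Since the results are identical, the choice between the two comes down
to the cost of the final instruction on the target.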

Thanks,
Andrew Pinski
