On Sat, May 12, 2012 at 11:36 AM, Maciej W. Rozycki <ma...@linux-mips.org> wrote:
> On Sun, 6 May 2012, Andrew Pinski wrote:
>
>> > For the record: MIPS processors that implement CLZ/CLO (for some reason
>> > CTZ/CTO haven't been added to the architecture, but these operations can
>> > be cheaply transformed into CLZ/CLO) generally have a dedicated unit that
>> > causes no pipeline stall for these instructions even in the simplest
>> > pipeline designs like the M4K -- IOW they are issued at the usual one
>> > instruction per pipeline clock rate.
>>
>> Even on Octeon this is true.  Though Octeon has seq/sneq too so it
>> does not matter in the end.
>
> Does Octeon's pipeline qualify as simple?  For some reason I've thought
> it is a high-performance core.  The M4K is one of the smallest/simplest
> MIPS chips ever built.

Yes, the Octeon's pipeline qualifies as simple.  It is still an in-order
pipeline with few stages.  The core's high performance comes from its
clock rate rather than from the pipeline, and from the number of cores
on one chip.

>
> And actually all MIPS processors (back to 1985's MIPS I ISA) support
> two-instruction set-if-equal and set-if-not-equal sequences:
>
>        xor     rd, rt, rs
>        sltiu   rd, rd, 1
>
> and:
>
>        xor     rd, rt, rs
>        sltu    rd, zero, rd
>
> respectively, that may still be more beneficial than any possible
> alternatives, especially ones involving branches.
>
>> Note I originally was the one who proposed this optimization for
>> PowerPC even before I saw what XLC did.  See PR 10588 (which I filed 9
>> years ago) and it seems we are about to fix it soon.
>
> For that -- set-if-zero and set-if-non-zero -- you can use the
> instructions as above (that are supported by all MIPS processors):
>
>        sltiu   rd, rs, 1
>
> and
>
>        sltu    rd, zero, rs
>
> However GCC doesn't seem smart enough to use them well with your example.
> I'd expect something like:
>
>        sltiu   $4, $4, 1
>        sltiu   $2, $5, 1
>        jr      $31
>        or      $2, $4, $2
>
> however I get:
>
>        beq     $4, $0, .L3
>        nop
>        jr      $31
>        sltiu   $2, $5, 1
> .L3:
>        jr      $31
>        li      $2, 1
>
> which is never faster and obviously not smaller either.  And there is
> really no need to avoid the second comparison as per logical OR rules here
> -- it's all in registers.

I have a few patches already in my queue to submit upstream that improve
the above case for MIPS.

>
> This pessimisation is avoided for MIPS IV and more recent processors that
> have move-if-non-zero however (and the second comparison is always
> evaluated):
>
>        sltiu   $5, $5, 1
>        li      $2, 1
>        jr      $31
>        movn    $2, $5, $4
>
> Any chance to get it better with the fix you've mentioned?

The movn sequence above is worse than the one using or, at least on the
Octeon, as movn takes 3 cycles while or takes only 1 cycle.
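To make the idioms discussed above concrete, here is a minimal C sketch of
what the quoted instruction sequences compute.  The function names are made
up for illustration, and this is not the test case from the PR; it only
mirrors the xor/sltiu, xor/sltu and sltiu/sltiu/or patterns at the source
level.  Whether a given compiler actually emits those sequences for this
code is exactly the open question in this thread.

```c
#include <stdint.h>

/* Mirrors "xor rd, rt, rs; sltiu rd, rd, 1": the xor is zero only when
   the operands are equal, and "unsigned < 1" is true only for zero. */
static inline uint32_t set_if_equal(uint32_t a, uint32_t b)
{
    return (uint32_t)((a ^ b) < 1u);
}

/* Mirrors "xor rd, rt, rs; sltu rd, zero, rd": "0 < unsigned" is true
   for any non-zero xor result. */
static inline uint32_t set_if_not_equal(uint32_t a, uint32_t b)
{
    return (uint32_t)(0u < (a ^ b));
}

/* Short-circuit form: the second test may be skipped, which is what
   invites a branch. */
static inline int either_zero_logical(uint32_t a, uint32_t b)
{
    return a == 0 || b == 0;
}

/* Branch-free form: both tests are always evaluated and combined with a
   bitwise or, matching the sltiu/sltiu/or sequence. */
static inline int either_zero_bitwise(uint32_t a, uint32_t b)
{
    return (a < 1u) | (b < 1u);
}
```

Both either_zero variants return the same value for all inputs; since the
operands are plain register values with no side effects, evaluating the
second comparison unconditionally is safe.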
As I mentioned, I have a few patches already in my queue that improve
the code for MIPS (and other targets too), but I have not got around to
submitting them upstream because I have been busy working on more patches.

Thanks,
Andrew Pinski