On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:

    fld.s $f1,$r4,0
    fld.s $f0,$r4,4
    fld.s $f3,$r4,8
    fld.s $f2,$r4,12
    fcmp.slt.s $fcc1,$f0,$f3
    fcmp.sgt.s $fcc0,$f1,$f2
    movcf2gr $r13,$fcc1
    movcf2gr $r12,$fcc0

There is another problem here: on the 3A5000, MOVCF2GR takes 7 cycles, while MOVCF2FR+MOVFR2GR takes a cycle. The 3A6000 does not have this problem.

    or $r12,$r12,$r13
    bnez $r12,.L3
    fld.s $f4,$r4,16
    fld.s $f5,$r4,20
    or $r4,$r0,$r0
    fcmp.sgt.s $fcc1,$f1,$f5
    fcmp.slt.s $fcc0,$f0,$f4
    movcf2gr $r12,$fcc1
    movcf2gr $r13,$fcc0
    or $r12,$r12,$r13
    bnez $r12,.L2
    fcmp.sgt.s $fcc1,$f3,$f5
    fcmp.slt.s $fcc0,$f2,$f4
    movcf2gr $r4,$fcc1
    movcf2gr $r12,$fcc0
    or $r4,$r4,$r12
    xori $r4,$r4,1
    slli.w $r4,$r4,0
    jr $r1
    .align 4
.L3:
    or $r4,$r0,$r0
    .align 4
.L2:
    jr $r1

Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).

Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples). We may be able to handle it via
the ext_dce pass [1] in the future.

[1]: https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
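
For readers following the quoted assembly, here is a rough C-level reading of it. This is only a sketch reconstructed from the instructions, not the original test case: the function names, the `const float *a` parameter, and the mapping of loads to array indices are assumptions. The first function spells the conditions with short-circuit `||`; the second combines each pair of comparisons with `|`, which is roughly the shape the quoted code takes (two fcmp results OR-ed together, then one branch per pair). Since the comparisons have no side effects, both functions return the same value, and the compiler is free to pick either form for either source.

```c
#include <stdbool.h>

/* Hypothetical reconstruction of the quoted code, written with
   short-circuit operators: every comparison may become its own branch. */
bool check_short_circuit(const float *a)
{
    if (a[1] < a[2] || a[0] > a[3])
        return false;
    if (a[0] > a[5] || a[1] < a[4])
        return false;
    return !(a[2] > a[5] || a[3] < a[4]);
}

/* Same logic with each pair of comparisons combined by a bitwise OR,
   so both comparisons in a pair are always evaluated and there is
   only one conditional branch per pair. */
bool check_combined(const float *a)
{
    if ((a[1] < a[2]) | (a[0] > a[3]))
        return false;
    if ((a[0] > a[5]) | (a[1] < a[4]))
        return false;
    return !((a[2] > a[5]) | (a[3] < a[4]));
}
```

With unpredictable inputs, the combined form trades a few extra comparisons for fewer branches, which is the trade-off the micro-benchmark comment above is about.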