On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:

        fld.s   $f1,$r4,0
        fld.s   $f0,$r4,4
        fld.s   $f3,$r4,8
        fld.s   $f2,$r4,12
        fcmp.slt.s      $fcc1,$f0,$f3
        fcmp.sgt.s      $fcc0,$f1,$f2
        movcf2gr        $r13,$fcc1
        movcf2gr        $r12,$fcc0

There is also a problem here: on the 3A5000, MOVCF2GR takes 7 cycles,
while MOVCF2FR + MOVFR2GR takes one cycle.  The 3A6000 does not have
this problem.

        or      $r12,$r12,$r13
        bnez    $r12,.L3
        fld.s   $f4,$r4,16
        fld.s   $f5,$r4,20
        or      $r4,$r0,$r0
        fcmp.sgt.s      $fcc1,$f1,$f5
        fcmp.slt.s      $fcc0,$f0,$f4
        movcf2gr        $r12,$fcc1
        movcf2gr        $r13,$fcc0
        or      $r12,$r12,$r13
        bnez    $r12,.L2
        fcmp.sgt.s      $fcc1,$f3,$f5
        fcmp.slt.s      $fcc0,$f2,$f4
        movcf2gr        $r4,$fcc1
        movcf2gr        $r12,$fcc0
        or      $r4,$r4,$r12
        xori    $r4,$r4,1
        slli.w  $r4,$r4,0
        jr      $r1
        .align  4
.L3:
        or      $r4,$r0,$r0
        .align  4
.L2:
        jr      $r1

Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).
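
For readers who prefer C, here is a rough sketch of what the assembly above seems to compute. It is read back from the listing and is not the original test case: the function names and the parameter `v` (standing for the pointer in $r4, assumed to address six consecutive floats) are hypothetical. The first function uses short-circuit `||`, which may compile to one branch per comparison; the second evaluates both comparisons of each pair and ORs the results, which is the shape of the fcmp/movcf2gr/or sequences.

```c
#include <stdbool.h>

/* Hypothetical reconstruction: v models the pointer held in $r4. */

/* Short-circuit form: each || may become its own conditional branch. */
bool check_branchy(const float *v)
{
    if (v[1] < v[2] || v[0] > v[3])
        return false;
    if (v[0] > v[5] || v[1] < v[4])
        return false;
    return !(v[2] > v[5] || v[3] < v[4]);
}

/* Non-short-circuit form: both comparisons of a pair are always
 * evaluated and combined with a bitwise OR, leaving one branch per pair. */
bool check_paired(const float *v)
{
    if ((v[1] < v[2]) | (v[0] > v[3]))
        return false;
    if ((v[0] > v[5]) | (v[1] < v[4]))
        return false;
    return !((v[2] > v[5]) | (v[3] < v[4]));
}
```

Both forms return the same value for any input; they differ only in how many branches the compiler has to emit, which is what matters when the inputs make those branches hard to predict.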

Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples).  We may be able to handle it via
the ext_dce pass [1] in the future.

[1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
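
To spell out why that slli.w is redundant, here is a small C model of the last instructions. It is only an illustration: the helper names are made up, and the narrowing cast relies on the usual two's-complement behaviour (implementation-defined in ISO C, but what GCC does). `slli.w rd, rj, 0` keeps the low 32 bits of rj and sign-extends them to 64 bits, while the value it is applied to can only be 0 or 1.

```c
#include <stdint.h>

/* Models `slli.w rd, rj, 0`: take the low 32 bits of rj and
 * sign-extend them to 64 bits. */
int64_t slli_w_by_zero(int64_t rj)
{
    return (int64_t)(int32_t)(uint32_t)rj;
}

/* Models the `or` + `xori ..., 1` pair; fcc_a and fcc_b are 0 or 1,
 * as produced by movcf2gr. */
int64_t combine_and_negate(int64_t fcc_a, int64_t fcc_b)
{
    return (fcc_a | fcc_b) ^ 1;
}
```

Since combine_and_negate() only ever yields 0 or 1, slli_w_by_zero() returns its argument unchanged in every reachable case, so the instruction does no useful work here.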

