https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110751

--- Comment #23 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Hi, Richard and Richi.

I found a way to simulate an "undefined" ELSE value in the COND_LEN_xxx patterns
for the cases where that value doesn't matter.

First, return a size-type 0 from the else_value target hook:

/* Use a size-type 0, which is represented as const0_rtx in RTL, to simulate
   an undefined else value, since GCC has no notion of an undefined value in
   the TREE/GIMPLE representation.

   TODO: We may need to support undefined values in the TREE/GIMPLE middle-end
   IR.  But the current approach is good enough for RVV codegen/performance.  */
static tree
riscv_preferred_else_value (unsigned ifn, tree vectype, unsigned int nops,
                            tree *ops)
{
  if (riscv_v_ext_mode_p (TYPE_MODE (vectype)))
    return build_zero_cst (size_type_node);

  return default_preferred_else_value (ifn, vectype, nops, ops);
}
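
For reference, the hook would be registered with the standard target-hook
macro (a sketch; the exact placement in riscv.cc is assumed):

/* Let the vectorizer query the backend for its preferred ELSE value.  */
#undef TARGET_PREFERRED_ELSE_VALUE
#define TARGET_PREFERRED_ELSE_VALUE riscv_preferred_else_value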

Note that we can't return a VECTOR_CST of all zeros, since an all-zero
VECTOR_CST may actually matter: it could be the real ELSE value we need.

So, to simulate "undefined", I pass a scalar '0', which is represented as
const0_rtx in RTL.

So the IR will be:

vect__7.12_8 = .COND_LEN_DIV ({ -1, ... }, vect__4.8_22, vect__6.11_9, 0
(undefined ELSE value), _37, 0);

Then I relax the predicate in the COND_LEN_xxx patterns. It works and passes
all the tests.
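
Roughly, the relaxed check amounts to also accepting the scalar zero
placeholder for the merge operand. Written as plain C for illustration (the
real change is in the machine-description predicates; the function name here
is made up):

/* Accept the usual register merge operand, or the const0_rtx placeholder
   produced by the else_value hook, meaning "the ELSE value doesn't matter".  */
static bool
relaxed_merge_operand_p (rtx op, machine_mode mode)
{
  return register_operand (op, mode) || op == const0_rtx;
}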

Consider the following case:

#include <stdint.h>

void
foo (int32_t *__restrict a, int32_t *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = a[i] / b[i];
}

Before:
foo:
        ble     a2,zero,.L5
        mv      a4,a0
        vsetvli a5,zero,e32,m8,ta,ma
        vmv.v.i v4,0
.L3:
        vsetvli a5,a2,e32,m8,tu,ma
        vmv8r.v v1,v4
        slli    a3,a5,2
        vle32.v v3,0(a0)
        vle32.v v2,0(a1)
        sub     a2,a2,a5
        vdiv.vv v1,v3,v2
        vse32.v v1,0(a4)
        add     a0,a0,a3
        add     a1,a1,a3
        add     a4,a4,a3
        bne     a2,zero,.L3
.L5:
        ret

After:

foo:
        ble     a2,zero,.L5
        mv      a4,a0
.L3:
        vsetvli a5,a2,e32,m8,ta,ma
        slli    a3,a5,2
        vle32.v v8,0(a0)
        vle32.v v16,0(a1)
        sub     a2,a2,a5
        vdiv.vv v8,v8,v16
        vse32.v v8,0(a4)
        add     a0,a0,a3
        add     a1,a1,a3
        add     a4,a4,a3
        bne     a2,zero,.L3
.L5:
        ret


Not so elegant, but it does fix the performance/codegen issue for RVV.
