https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116142

            Bug ID: 116142
           Summary: vec_widen_smult_{odd,even}_M ineffective for a simple
                    widening dot product (vect_used_by_reduction is not
                    set?)
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: xry111 at gcc dot gnu.org
  Target Milestone: ---

On LoongArch, for this test case:

short x[8], y[8];

int dot() {
        int ret = 0;
        for (int i = 0; i < 8; i++)
                ret += x[i] * y[i];
        return ret;
}

The compiler produces:

        vld     $vr1,$r12,0
        vld     $vr0,$r12,32
        vslti.h $vr2,$vr1,0
        vslti.h $vr3,$vr0,0
        vilvh.h $vr4,$vr2,$vr1
        vilvh.h $vr5,$vr3,$vr0
        vilvl.h $vr1,$vr2,$vr1
        vilvl.h $vr0,$vr3,$vr0
        vmul.w  $vr0,$vr0,$vr1
        vmadd.w $vr0,$vr5,$vr4
        vhaddw.d.w      $vr0,$vr0,$vr0
        vhaddw.q.d      $vr0,$vr0,$vr0
        vpickve2gr.w    $r4,$vr0,0
        slli.w  $r4,$r4,0
        jr      $r1

This is needlessly inefficient. What we want is:

        vld     $vr1,$r12,0
        vld     $vr0,$r12,32
        vmulwev.w.h $vr2, $vr1, $vr0
        vmulwod.w.h $vr3, $vr1, $vr0
        vadd.w  $vr2, $vr2, $vr3
        vhaddw.d.w      $vr2,$vr2,$vr2
        vhaddw.q.d      $vr2,$vr2,$vr2
        vpickve2gr.w    $r4,$vr2,0
        jr      $r1
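
To illustrate why the shorter sequence is a valid replacement, here is a
minimal scalar sketch in C.  It only models the lane pairing: dot_even_odd
mimics what vmulwev.w.h/vmulwod.w.h followed by vadd.w compute, and
dot_lo_hi mimics the interleave-based low/high sequence.  The function names
and the LANES macro are made up for this illustration and are not part of GCC
or the LoongArch backend.  It also assumes the inputs are small enough that
the int sums do not overflow.

```c
#include <assert.h>

#define LANES 8                 /* V8HI: eight 16-bit elements */

/* Reference: the scalar loop from the test case. */
int dot_scalar(const short a[LANES], const short b[LANES])
{
        int ret = 0;
        for (int i = 0; i < LANES; i++)
                ret += a[i] * b[i];
        return ret;
}

/* Even/odd pairing: one widening multiply over elements 0,2,4,6 and one
   over elements 1,3,5,7, then a lane-wise add and a horizontal sum. */
int dot_even_odd(const short a[LANES], const short b[LANES])
{
        int even[LANES / 2], odd[LANES / 2];
        for (int i = 0; i < LANES / 2; i++) {
                even[i] = a[2 * i] * b[2 * i];
                odd[i] = a[2 * i + 1] * b[2 * i + 1];
        }
        int ret = 0;
        for (int i = 0; i < LANES / 2; i++)
                ret += even[i] + odd[i];
        return ret;
}

/* Low/high pairing: widen elements 0..3 and 4..7 separately, multiply,
   then a lane-wise add and a horizontal sum. */
int dot_lo_hi(const short a[LANES], const short b[LANES])
{
        int lo[LANES / 2], hi[LANES / 2];
        for (int i = 0; i < LANES / 2; i++) {
                lo[i] = a[i] * b[i];
                hi[i] = a[i + LANES / 2] * b[i + LANES / 2];
        }
        int ret = 0;
        for (int i = 0; i < LANES / 2; i++)
                ret += lo[i] + hi[i];
        return ret;
}

/* Self-check: all three orderings must give the same sum. */
void check_dot(const short a[LANES], const short b[LANES])
{
        assert(dot_even_odd(a, b) == dot_scalar(a, b));
        assert(dot_lo_hi(a, b) == dot_scalar(a, b));
}
```

Both pairings visit every product exactly once, so the final sum is the same.
The intermediate vectors differ only in lane order, which a reduction does
not care about.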

After reading the GCC Internals manual I found that the LoongArch backend
lacks the vec_widen_smult_even_v8hi and vec_widen_smult_odd_v8hi expanders,
so I added them:

+(define_expand "vec_widen_smult_even_v8hi"
+  [(match_operand:V4SI 0 "register_operand" "=f")
+   (match_operand:V8HI 1 "register_operand" " f")
+   (match_operand:V8HI 2 "register_operand" " f")]
+  "ISA_HAS_LSX"
+{
+  emit_insn (gen_lsx_vmulwev_w_h (operands[0], operands[1], operands[2]));
+  DONE;
+})
+
+(define_expand "vec_widen_smult_odd_v8hi"
+  [(match_operand:V4SI 0 "register_operand" "=f")
+   (match_operand:V8HI 1 "register_operand" " f")
+   (match_operand:V8HI 2 "register_operand" " f")]
+  "ISA_HAS_LSX"
+{
+  emit_insn (gen_lsx_vmulwod_w_h (operands[0], operands[1], operands[2]));
+  DONE;
+})
+

But they are not used at all, even though a comment in tree-vect-stmts.cc
suggests this approach should work:

         However, in the special case that the result of the widening
         operation is used in a reduction computation only, the order doesn't
         matter (because when vectorizing a reduction we change the order of
         the computation).  Some targets can take advantage of this and
         generate more efficient code.  For example, targets like Altivec,
         that support widen_mult using a sequence of {mult_even,mult_odd}
         generate the following vectors:
                vect1: [res1,res3,res5,res7],
                vect2: [res2,res4,res6,res8].

... ...

      if (vect_loop
          && STMT_VINFO_RELEVANT (stmt_info) == vect_used_by_reduction
          && !nested_in_vect_loop_p (vect_loop, stmt_info)
          && supportable_widening_operation (vinfo, VEC_WIDEN_MULT_EVEN_EXPR,
                                             stmt_info, vectype_out,
                                             vectype_in, code1,
                                             code2, multi_step_cvt,
                                             interm_types))
        {
          /* Elements in a vector with vect_used_by_reduction property cannot
             be reordered if the use chain with this property does not have the
             same operation.  One such an example is s += a * b, where elements
             in a and b cannot be reordered.  Here we check if the vector defined
             by STMT is only directly used in the reduction statement.  */
          tree lhs = gimple_assign_lhs (stmt_info->stmt);
          stmt_vec_info use_stmt_info = loop_info->lookup_single_use (lhs);
          if (use_stmt_info
              && STMT_VINFO_DEF_TYPE (use_stmt_info) == vect_reduction_def)
            return true;
        }
      c1 = VEC_WIDEN_MULT_LO_EXPR;
      c2 = VEC_WIDEN_MULT_HI_EXPR;
      break;

Debugging cc1 with gdb, I found that STMT_VINFO_RELEVANT (stmt_info) is never
vect_used_by_reduction for my test case, even though the result of the
multiplication is (obviously) only used by a reduction.  So IMO there's a bug
in the tree optimizers.  Or am I missing something?
