https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116142
Bug ID: 116142
Summary: vec_widen_smult_{odd,even}_M ineffective for a simple widening dot product (vect_used_by_reduction is not set?)
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: xry111 at gcc dot gnu.org
Target Milestone: ---

On LoongArch, for this test case:

    short x[8], y[8];

    int dot() {
      int ret = 0;
      for (int i = 0; i < 8; i++)
        ret += x[i] * y[i];
      return ret;
    }

The compiler produces:

    vld $vr1,$r12,0
    vld $vr0,$r12,32
    vslti.h $vr2,$vr1,0
    vslti.h $vr3,$vr0,0
    vilvh.h $vr4,$vr2,$vr1
    vilvh.h $vr5,$vr3,$vr0
    vilvl.h $vr1,$vr2,$vr1
    vilvl.h $vr0,$vr3,$vr0
    vmul.w $vr0,$vr0,$vr1
    vmadd.w $vr0,$vr5,$vr4
    vhaddw.d.w $vr0,$vr0,$vr0
    vhaddw.q.d $vr0,$vr0,$vr0
    vpickve2gr.w $r4,$vr0,0
    slli.w $r4,$r4,0
    jr $r1

This is stupid and we just want:

    vld $vr1,$r12,0
    vld $vr0,$r12,32
    vmulwev.w.h $vr2, $vr1, $vr0
    vmulwod.w.h $vr3, $vr1, $vr0
    vadd.w $vr2, $vr2, $vr3
    vhaddw.d.w $vr2,$vr2,$vr2
    vhaddw.q.d $vr2,$vr2,$vr2
    vpickve2gr.w $r4,$vr2,0
    jr $r1

After reading the GCC Internals manual I found that we were missing vec_widen_smult_even_v8hi and vec_widen_smult_odd_v8hi.
So I added them:

+(define_expand "vec_widen_smult_even_v8hi"
+  [(match_operand:V4SI 0 "register_operand" "=f")
+   (match_operand:V8HI 1 "register_operand" " f")
+   (match_operand:V8HI 2 "register_operand" " f")]
+  "ISA_HAS_LSX"
+{
+  emit_insn (gen_lsx_vmulwev_w_h (operands[0], operands[1], operands[2]));
+  DONE;
+})
+
+(define_expand "vec_widen_smult_odd_v8hi"
+  [(match_operand:V4SI 0 "register_operand" "=f")
+   (match_operand:V8HI 1 "register_operand" " f")
+   (match_operand:V8HI 2 "register_operand" " f")]
+  "ISA_HAS_LSX"
+{
+  emit_insn (gen_lsx_vmulwod_w_h (operands[0], operands[1], operands[2]));
+  DONE;
+})
+

But they are not used at all, even though a comment in tree-vect-stmts.cc suggests this approach should work:

         However, in the special case that the result of the widening
         operation is used in a reduction computation only, the order doesn't
         matter (because when vectorizing a reduction we change the order of
         the computation).  Some targets can take advantage of this and
         generate more efficient code.  For example, targets like Altivec,
         that support widen_mult using a sequence of {mult_even,mult_odd}
         generate the following vectors:
            vect1: [res1,res3,res5,res7],
            vect2: [res2,res4,res6,res8].

      ... ...

      if (vect_loop
          && STMT_VINFO_RELEVANT (stmt_info) == vect_used_by_reduction
          && !nested_in_vect_loop_p (vect_loop, stmt_info)
          && supportable_widening_operation (vinfo, VEC_WIDEN_MULT_EVEN_EXPR,
                                             stmt_info, vectype_out,
                                             vectype_in, code1, code2,
                                             multi_step_cvt, interm_types))
        {
          /* Elements in a vector with vect_used_by_reduction property cannot
             be reordered if the use chain with this property does not have the
             same operation.  One such an example is s += a * b, where elements
             in a and b cannot be reordered.  Here we check if the vector defined
             by STMT is only directly used in the reduction statement.  */
          tree lhs = gimple_assign_lhs (stmt_info->stmt);
          stmt_vec_info use_stmt_info = loop_info->lookup_single_use (lhs);
          if (use_stmt_info
              && STMT_VINFO_DEF_TYPE (use_stmt_info) == vect_reduction_def)
            return true;
        }
      c1 = VEC_WIDEN_MULT_LO_EXPR;
      c2 = VEC_WIDEN_MULT_HI_EXPR;
      break;

When I debugged cc1 with gdb, I found that STMT_VINFO_RELEVANT (stmt_info) is never vect_used_by_reduction for my test case, even though the result of the multiplication is (obviously) only used by a reduction. So IMO there's a bug in the tree optimizers. Or am I missing something?
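
For reference, here is a small scalar C model of why the even/odd form should be enough for this reduction. It is only a sketch: the helper names (dot_scalar, widen_mult_even, widen_mult_odd, dot_even_odd) are made up for illustration, and it assumes that vmulwev.w.h and vmulwod.w.h multiply the sign-extended even-indexed and odd-indexed 16-bit lanes, respectively, into 32-bit results. The final sum over all lanes stands in for the vadd.w plus the vhaddw steps.

```c
#include <stdint.h>

#define N 8

/* Reference: the scalar loop from the test case. */
int dot_scalar(const int16_t *x, const int16_t *y)
{
    int ret = 0;
    for (int i = 0; i < N; i++)
        ret += x[i] * y[i];
    return ret;
}

/* Assumed model of vmulwev.w.h: widen and multiply lanes 0, 2, 4, 6. */
void widen_mult_even(const int16_t *a, const int16_t *b, int32_t *out)
{
    for (int i = 0; i < N / 2; i++)
        out[i] = (int32_t)a[2 * i] * (int32_t)b[2 * i];
}

/* Assumed model of vmulwod.w.h: widen and multiply lanes 1, 3, 5, 7. */
void widen_mult_odd(const int16_t *a, const int16_t *b, int32_t *out)
{
    for (int i = 0; i < N / 2; i++)
        out[i] = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
}

/* Dot product built from the even/odd products.  The products come out
   in a permuted order, but every x[i] * y[i] appears exactly once, so
   the sum is unchanged.  */
int dot_even_odd(const int16_t *x, const int16_t *y)
{
    int32_t ev[N / 2], od[N / 2];
    int ret = 0;

    widen_mult_even(x, y, ev);
    widen_mult_odd(x, y, od);
    for (int i = 0; i < N / 2; i++)
        ret += ev[i] + od[i];   /* models vadd.w plus the horizontal adds */
    return ret;
}
```

Calling dot_scalar and dot_even_odd on the same pair of 8-element arrays should give the same value whenever the sum fits in an int, which is the property the quoted tree-vect-stmts.cc comment relies on.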