I see, so the vec_dup should go through rtx_cost again so that its cost is
added to the total for the vmv. I gave that a try, for example with the
change below:
+  switch (rcode)
+    {
+    case VEC_DUPLICATE:
+      *total += get_vector_costs ()->regmove->GR2VR * COSTS_N_INSNS (1);
+      break;
+    case PLUS:
+      {
+        rtx op_0 = XEXP (x, 0);
+        rtx op_1 = XEXP (x, 1);
+        if (GET_CODE (op_0) == VEC_DUPLICATE
+            || GET_CODE (op_1) == VEC_DUPLICATE)
+          *total += get_vector_costs ()->regmove->GR2VR * COSTS_N_INSNS (1);
+        else
+          *total = COSTS_N_INSNS (1);
+        break;
+      }
+    default:
+      *total = COSTS_N_INSNS (1);
+      break;
+    }
+
+  return true;
For case_0, where GR2VR is 0, late-combine gives the following:
trying to combine definition of r135 in:
   11: r135:RVVM1SI=vec_duplicate(r150:DI#0)
into:
   18: r147:RVVM1SI=r146:RVVM1SI+r135:RVVM1SI
      REG_DEAD r146:RVVM1SI
successfully matched this instruction to *add_vx_rvvm1si:
(set (reg:RVVM1SI 147 [ vect__6.8_16 ])
    (plus:RVVM1SI (vec_duplicate:RVVM1SI (subreg/s/u:SI (reg:DI 150 [ x ]) 0))
        (reg:RVVM1SI 146)))
original cost = 8 + 4 (weighted: 39.483637), replacement cost = 8 (weighted: 64.727273); rejecting replacement
The cost of vadd v, vec_dup(x) seems to be the same as that of vec_dup alone
here. I am also confused about how we calculate the cost of
vadd v, vec_dup(x): can we just set its cost to that of vadd.vx, given that we
have a define_insn_and_split that matches the pattern and emits vadd.vx
directly? That also matches the expression we mentioned, vadd.vv + vec
== vadd.vx.
Please correct me if I have misunderstood.
Yes, that doesn't look quite correct yet.
I think the issue is that with
*total += get_vector_costs ()->regmove->GR2VR * COSTS_N_INSNS (1);
as I suggested, total has already been initialized with an estimate based on
the mode size (total = 8 for me) before we get to riscv_rtx_cost. This makes
the rest of the costs (which we assume to be relative to 4) inaccurate.
So try
*total = get_vector_costs ()->regmove->GR2VR * COSTS_N_INSNS (1);
for the vec_dup case and
*total = COSTS_N_INSNS (1) + get_vector_costs ()->regmove->GR2VR * COSTS_N_INSNS (1);
for the vx case.
Then we should perform the combination for GR2VR == 0 and not for GR2VR > 0.
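
To illustrate why that should hold, here is a small standalone model of the
suggested costs. It is only a sketch and not riscv.cc code: the helper names,
the block frequencies passed in, and the "accept when the weighted replacement
cost is not higher" rule are all assumptions made for illustration.

```c
#include <stdbool.h>

/* Mirrors GCC's COSTS_N_INSNS: one fast instruction costs 4 units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper: cost of a standalone vec_duplicate, i.e. only
   the GR -> VR move, assigned with '=' rather than '+='.  */
static int
vec_dup_cost (int gr2vr)
{
  return gr2vr * COSTS_N_INSNS (1);
}

/* Hypothetical helper: cost of a plain vector-vector add (vadd.vv).  */
static int
vv_cost (void)
{
  return COSTS_N_INSNS (1);
}

/* Hypothetical helper: cost of PLUS with a VEC_DUPLICATE operand
   (vadd.vx): one instruction plus the GR -> VR move.  */
static int
vx_cost (int gr2vr)
{
  return COSTS_N_INSNS (1) + gr2vr * COSTS_N_INSNS (1);
}

/* Assumed acceptance rule: weight each cost by the frequency of the
   block it lives in and accept the combination when the replacement
   is not more expensive.  freq_dup is the frequency of the block
   holding the vec_dup, freq_add that of the block holding the add.  */
static bool
combine_is_profitable (int gr2vr, double freq_dup, double freq_add)
{
  double original = vec_dup_cost (gr2vr) * freq_dup
                    + vv_cost () * freq_add;
  double replacement = vx_cost (gr2vr) * freq_add;
  return replacement <= original;
}
```

With made-up frequencies of 1.0 for the vec_dup outside the loop and 8.0 for
the add inside it, combine_is_profitable (0, 1.0, 8.0) is true (32 vs. 32)
and combine_is_profitable (1, 1.0, 8.0) is false (64 vs. 36).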