Hi Juzhe, Demin, Jeff,

This email is intended to continue the discussion started in
https://marc.info/?l=gcc-patches&m=170927452922009&w=2
about combining vfmv.v.f and vfmxx.vv instructions into the scalar-vector
form vfmxx.vf.
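
For context, the transformation targets loops of roughly the following shape. This is only a minimal sketch for illustration, not the testcase linked in this email; the function name scale_accumulate and its parameters are hypothetical. When such a loop is auto-vectorized for RVV, the scalar is typically broadcast into a vector register first and then fed to a vector-vector multiply-add, where a single vector-scalar vfmacc.vf would be enough.

```c
#include <stddef.h>

/* Hypothetical illustration (not the Godbolt testcase): accumulate a
   scalar multiple of one array into another.  The scalar 'a' is the
   value that would be broadcast with vfmv.v.f before a vfmacc.vv, and
   that vfmacc.vf could consume directly from a floating-point register.  */
void
scale_accumulate (float *restrict d, const float *restrict row, float a,
                  size_t n)
{
  for (size_t i = 0; i < n; i++)
    d[i] += a * row[i];   /* d[i] = a * row[i] + d[i]  */
}
```

Whether the compiler emits the .vv or the .vf form for a loop like this depends on the GCC version and options used, so the generated assembly should be checked rather than assumed.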

There was a mention on that thread of the potential usefulness of the
late-combine pass (added last month in
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=792f97b44ffc5e6a967292b3747fd835e99396e7)
in making this transformation. However, when I tried it out with my testcase
at https://godbolt.org/z/o8oPzo7qY, I found it unable to handle these complex
post-split1 patterns for broadcast and vfmacc:

(insn 129 128 130 3 (set (reg:RVVM4SF 168 [ _61 ])
        (if_then_else:RVVM4SF (unspec:RVVMF8BI [
                    (const_vector:RVVMF8BI [
                            (const_int 1 [0x1]) repeated x16
                        ])
                    (const_int 16 [0x10])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (vec_duplicate:RVVM4SF (mem:SF (reg:SI 143 [ ivtmp.21 ]) [1 MEM[(float *)_145]+0 S4 A32]))
            (unspec:RVVM4SF [
                    (reg:SI 0 zero)
                ] UNSPEC_VUNDEF))) "/app/example.c":19:53 4019 {*pred_broadcastrvvm4sf_zvfh}
     (nil))
[ ... ]
(insn 131 130 34 3 (set (reg:RVVM4SF 139 [ D__lsm.10 ])
        (if_then_else:RVVM4SF (unspec:RVVMF8BI [
                    (const_vector:RVVMF8BI [
                            (const_int 1 [0x1]) repeated x16
                        ])
                    (const_int 16 [0x10])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (const_int 7 [0x7])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                    (reg:SI 69 frm)
                ] UNSPEC_VPREDICATE)
            (plus:RVVM4SF (mult:RVVM4SF (reg/v:RVVM4SF 135 [ row ])
                    (reg:RVVM4SF 168 [ _61 ]))
                (reg:RVVM4SF 139 [ D__lsm.10 ]))
            (unspec:RVVM4SF [
                    (reg:SI 0 zero)
                ] UNSPEC_VUNDEF))) "/app/example.c":19:36 15007 {*pred_mul_addrvvm4sf_undef}
     (nil))

I'm no expert on this, but what's stopping us from adding some vector-scalar
split patterns alongside vector-vector ones in autovec.md to fix this?

For instance, the addition of fma<mode>4_scalar insn_and_split like this:

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index d5793ac..bf54d71 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1229,2 +1229,22 @@
+(define_insn_and_split "fma<mode>4_scalar"
+  [(set (match_operand:V_VLSF 0 "register_operand")
+        (plus:V_VLSF
+          (mult:V_VLSF
+            (vec_duplicate:V_VLSF (match_operand:SF 1 "direct_broadcast_operand"))
+            (match_operand:V_VLSF 2 "register_operand"))
+          (match_operand:V_VLSF 3 "register_operand")))]
+  "TARGET_VECTOR && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+    rtx ops[] = {operands[0], operands[1], operands[2], operands[3],
+                 operands[0]};
+    riscv_vector::emit_vlmax_insn (code_for_pred_mul_scalar (PLUS, <MODE>mode),
+                                   riscv_vector::TERNARY_OP_FRM_DYN, ops);
+    DONE;
+  }
+  [(set_attr "type" "vector")])
+
 ;; -------------------------------------------------------------------------

does lead to vfmacc.vf instructions being emitted instead of vfmacc.vv's for
the testcase linked above.

What do you think about this approach to implement this optimization? Am I
missing anything important? Maybe split1 is too early to determine the final
instruction format (.vf vs .vv) and we should strive to recombine during
late-combine2?

Also, is there anyone working on this optimization at the present moment?

Many thanks in advance,
Artemiy