Hi,

For this case:
1. In late_combine1, the general pattern (plus (mult a b) c) can't be combined.
2. In late_combine2, vec_duplicate has already been expanded to broadcast, which late_combine can't handle.
For RVV, I think late_combine2 has no chance to combine anything because everything is already fully expanded. Do I understand correctly?

After Robin's review, a more general method is needed: the policy is to prevent vec_duplicate from being expanded to broadcast. So more testcases are affected, regardless of cmp. I'm checking and fixing the failing testcases now.

Regards,
Demin

From: 钟居哲 <juzhe.zh...@rivai.ai>
Sent: 25 July 2024 6:24
To: Artemiy Volkov <artemiy.vol...@synopsys.com>; Demin Han <demin....@starfivetech.com>; Jeff Law <jeffreya...@gmail.com>
Cc: gcc <gcc@gcc.gnu.org>; rdapp.gcc <rdapp....@gmail.com>
Subject: Re: [RISC-V] Combining vfmv and .vv instructions into a .vf instruction

I think Demin is working on it, and Robin is the reviewer for this stuff.

________________________________
juzhe.zh...@rivai.ai

From: Artemiy Volkov <artemiy.vol...@synopsys.com>
Date: 2024-07-25 01:25
To: juzhe.zh...@rivai.ai; demin....@starfivetech.com; jeffreya...@gmail.com
CC: gcc@gcc.gnu.org
Subject: [RISC-V] Combining vfmv and .vv instructions into a .vf instruction

Hi Juzhe, Demin, Jeff,

This email is intended to continue the discussion started in https://marc.info/?l=gcc-patches&m=170927452922009&w=2 about combining vfmv.v.f and vfmxx.vv instructions into the scalar-vector form vfmxx.vf. That thread mentioned that the late-combine pass (added last month in https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=792f97b44ffc5e6a967292b3747fd835e99396e7) could be useful for making this transformation.
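To make the transformation under discussion concrete, here is a minimal sketch in plain C of the two forms. It is not the testcase from the thread, and it only models the semantics: fma_vv_form first materializes the scalar into a temporary vector (the role vfmv.v.f plays) and then does an elementwise multiply-add, while fma_vf_form uses the scalar operand directly (the role of the .vf form). The names fma_vv_form, fma_vf_form and SPLAT_MAX are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound for the temporary "broadcast" buffer. */
#define SPLAT_MAX 64

/* Models broadcast + vector-vector multiply-add:
   splat[j] = s for all j, then acc[j] += row[j] * splat[j].  */
void fma_vv_form(float *acc, const float *row, float s, size_t n)
{
    float splat[SPLAT_MAX];

    assert(n <= SPLAT_MAX);
    for (size_t j = 0; j < n; j++)
        splat[j] = s;                 /* the separate broadcast step */
    for (size_t j = 0; j < n; j++)
        acc[j] += row[j] * splat[j];  /* vector-vector multiply-add */
}

/* Models the vector-scalar form: the scalar is an operand of the
   multiply-add itself, so no temporary vector is needed.  */
void fma_vf_form(float *acc, const float *row, float s, size_t n)
{
    for (size_t j = 0; j < n; j++)
        acc[j] += row[j] * s;
}
```

Both functions compute the same result; the point of the optimization is only that the second shape does not need the intermediate broadcast vector. What a given compiler version actually emits for either function is not something this sketch claims.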
However, when I tried it out with my testcase at https://godbolt.org/z/o8oPzo7qY, I found it unable to handle these complex post-split1 patterns for broadcast and vfmacc:

(insn 129 128 130 3 (set (reg:RVVM4SF 168 [ _61 ])
        (if_then_else:RVVM4SF (unspec:RVVMF8BI [
                    (const_vector:RVVMF8BI [
                            (const_int 1 [0x1]) repeated x16
                        ])
                    (const_int 16 [0x10])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (vec_duplicate:RVVM4SF (mem:SF (reg:SI 143 [ ivtmp.21 ]) [1 MEM[(float *)_145]+0 S4 A32]))
            (unspec:RVVM4SF [
                    (reg:SI 0 zero)
                ] UNSPEC_VUNDEF))) "/app/example.c":19:53 4019 {*pred_broadcastrvvm4sf_zvfh}
     (nil))

[ ... ]

(insn 131 130 34 3 (set (reg:RVVM4SF 139 [ D__lsm.10 ])
        (if_then_else:RVVM4SF (unspec:RVVMF8BI [
                    (const_vector:RVVMF8BI [
                            (const_int 1 [0x1]) repeated x16
                        ])
                    (const_int 16 [0x10])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (const_int 7 [0x7])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                    (reg:SI 69 frm)
                ] UNSPEC_VPREDICATE)
            (plus:RVVM4SF (mult:RVVM4SF (reg/v:RVVM4SF 135 [ row ])
                    (reg:RVVM4SF 168 [ _61 ]))
                (reg:RVVM4SF 139 [ D__lsm.10 ]))
            (unspec:RVVM4SF [
                    (reg:SI 0 zero)
                ] UNSPEC_VUNDEF))) "/app/example.c":19:36 15007 {*pred_mul_addrvvm4sf_undef}
     (nil))

I'm no expert on this, but what's stopping us from adding some vector-scalar split patterns alongside vector-vector ones in autovec.md to fix this?
For instance, the addition of fma<mode>4_scalar insn_and_split like this:

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index d5793ac..bf54d71 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1229,2 +1229,22 @@
 
+(define_insn_and_split "fma<mode>4_scalar"
+  [(set (match_operand:V_VLSF 0 "register_operand")
+        (plus:V_VLSF
+          (mult:V_VLSF
+            (vec_duplicate:V_VLSF (match_operand:SF 1 "direct_broadcast_operand"))
+            (match_operand:V_VLSF 2 "register_operand"))
+          (match_operand:V_VLSF 3 "register_operand")))]
+  "TARGET_VECTOR && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+    rtx ops[] = {operands[0], operands[1], operands[2], operands[3],
+                 operands[0]};
+    riscv_vector::emit_vlmax_insn (code_for_pred_mul_scalar (PLUS, <MODE>mode),
+                                   riscv_vector::TERNARY_OP_FRM_DYN, ops);
+    DONE;
+  }
+  [(set_attr "type" "vector")])
+
 ;; -------------------------------------------------------------------------

does lead to vfmacc.vf instructions being emitted instead of vfmacc.vv's for the testcase linked above.

What do you think about this approach to implement this optimization? Am I missing anything important? Maybe split1 is too early to determine the final instruction format (.vf vs .vv) and we should strive to recombine during late-combine2?

Also, is there anyone working on this optimization at the present moment?

Many thanks in advance,
Artemiy