https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121451

            Bug ID: 121451
           Summary: RISC-V: zero-stride load broadcast vs. vector-scalar
           Product: gcc
           Version: 15.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: parras at gcc dot gnu.org
  Target Milestone: ---

https://godbolt.org/z/brW6sG7KM

Reduced from 538.imagick topblock #0 (11 insns, 36.75%)

We get the following assembly:

        fld     fa5,0(a1)
        vfmv.v.f        v3,fa5
        vfmacc.vv       v1,v3,v2

But since PR119100 we should get:

        fld     fa5,0(a4)
        vfmacc.vf       v1,fa5,v2

What prevents the combination here is that the vec_duplicate operand is a mem:

(set (reg:RVVM1DF 157 [ _20 ])
    (vec_duplicate:RVVM1DF (mem:DF (reg/v/f:DI 153 [ g ]) [1 *g_16(D)+0 S8 A64])))

OTOH this seems to be a candidate for a zero-stride load broadcast:

        vlse64.v        v3,0(a1),zero
        vfmacc.vv       v1,v3,v2

However, since r16-2452-gf796f819c35cc0 this case is explicitly handled as a
regular broadcast (implying the vfmv).

Is there a reason to prefer unconditionally forcing the memory operand into a
register (fld + vfmv) over a zero-stride load (vlse)?

bool
can_be_broadcast_p (rtx op)
{
  ...
  if (FLOAT_MODE_P (mode)
      && (memory_operand (op, mode) || CONSTANT_P (op))
      && can_create_pseudo_p ())
    return true;

I also noticed the tunable discussed in PR118734, but the decision made here
does not involve it.
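
For readers without access to the godbolt link, here is a minimal sketch of the kind of kernel that can lead to a vec_duplicate of a memory operand feeding a multiply-accumulate. It is an assumption for illustration, not the reduced test case itself: the function name axpy_bcast and its signature are hypothetical, and whether it reproduces the exact sequences above depends on the compiler revision and flags (for example -O3 -march=rv64gcv).

```c
#include <stddef.h>

/* Hypothetical kernel, not the actual reduced test case.
   The scalar *g is loop-invariant and lives in memory, so a vectorizer
   has to broadcast it across a vector register before the
   multiply-accumulate, or use a vector-scalar form such as vfmacc.vf. */
void axpy_bcast(double *restrict acc, const double *restrict x,
                const double *restrict g, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc[i] += *g * x[i];  /* acc[i] = acc[i] + g[0] * x[i] */
}
```

The three code shapes discussed in this report (fld + vfmv.v.f + vfmacc.vv, fld + vfmacc.vf, and vlse64.v with a zero stride + vfmacc.vv) are all valid ways to implement the loop body of such a kernel; they differ only in how the scalar reaches the vector unit.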