https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121451

            Bug ID: 121451
           Summary: RISC-V: zero-stride load broadcast vs. vector-scalar
           Product: gcc
           Version: 15.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: parras at gcc dot gnu.org
  Target Milestone: ---

https://godbolt.org/z/brW6sG7KM
Reduced from 538.imagick topblock #0 (11 insns, 36.75%)

We get the following assembly:

fld     fa5,0(a1)
vfmv.v.f        v3,fa5
vfmacc.vv       v1,v3,v2

But since PR119100 we should get:

fld     fa5,0(a4)
vfmacc.vf       v1,fa5,v2

What is preventing the combination here is the vec_duplicate operand being a
mem:

(set (reg:RVVM1DF 157 [ _20 ])
        (vec_duplicate:RVVM1DF (mem:DF (reg/v/f:DI 153 [ g ]) [1 *g_16(D)+0 S8 A64])))
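
For readers without access to the Compiler Explorer link, a loop of roughly
the following shape is the kind of source that could lead to a vec_duplicate
of a memory operand feeding a multiply-accumulate. This is only a hypothetical
sketch, not the reduced testcase: the function name fma_bcast and the
parameters acc, x and n are assumptions, and only the pointer name g is taken
from the RTL dump above.

```c
#include <stddef.h>

/* Hypothetical reproducer sketch (not the testcase behind the link).
   The scalar *g is loop-invariant, so a vectorizer may load it once,
   broadcast it, and feed it to a fused multiply-accumulate.  */
void
fma_bcast (double *restrict acc, const double *restrict x,
           const double *restrict g, size_t n)
{
  for (size_t i = 0; i < n; i++)
    acc[i] += *g * x[i];
}
```

Whether this exact loop reproduces the sequences shown here depends on the
-march/-O options used, which are not stated in this report.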

OTOH this seems to be a candidate for a zero-stride load broadcast:

vlse64.v        v3,0(a1),zero
vfmacc.vv       v1,v3,v2
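
To make the equivalence concrete, here is a small scalar model in C of the
two ways of filling v3. It is an illustrative sketch only: model_vlse64 and
model_fld_vfmv are hypothetical helper names, the model ignores masking, tail
policy and any microarchitectural cost, and it only shows that a strided load
with a stride of zero yields the same register contents as fld + vfmv.v.f.

```c
#include <stddef.h>
#include <string.h>

/* Scalar model of a strided load of 64-bit elements:
   element i is read from base + i * stride_bytes.  */
void
model_vlse64 (double *vd, const void *base, ptrdiff_t stride_bytes, size_t vl)
{
  const unsigned char *p = base;
  for (size_t i = 0; i < vl; i++)
    memcpy (&vd[i], p + (ptrdiff_t) i * stride_bytes, sizeof (double));
}

/* Scalar model of fld + vfmv.v.f: load one scalar, then splat it.  */
void
model_fld_vfmv (double *vd, const double *addr, size_t vl)
{
  double f = *addr;
  for (size_t i = 0; i < vl; i++)
    vd[i] = f;
}
```

With stride_bytes == 0 every element is read from the same address, so both
helpers fill vd with identical values.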

However, since r16-2452-gf796f819c35cc0 this case is explicitly handled as a
regular broadcast (implying the vfmv). Is there a reason to prefer
unconditionally forcing the memory operand into a register (fld + vfmv) over
a zero-stride load (vlse)?

bool
can_be_broadcast_p (rtx op)
{
...
  if (FLOAT_MODE_P (mode)
      && (memory_operand (op, mode) || CONSTANT_P (op))
      && can_create_pseudo_p ())
    return true;

I also noticed the tunable discussed in PR118734, but the decision made here
does not involve it.
