Hi,

Over the last few weeks it has become clear that our current broadcast
handling needs an overhaul in order to improve maintainability.
PR121073 showed that my interim fix wasn't sufficient and caused
regressions.

This patch now takes a first step towards untangling broadcast
(vmv.v.x), "set first" (vmv.s.x), and zero-strided load (vlse).
can_be_broadcast_p is also rewritten and strided_broadcast_p is
introduced so the distinction is made directly in the predicates.
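
As a rough reminder of the three variants' semantics (an illustrative
sketch only, assuming vl = 4; not part of the patch):

  /* vmv.v.x  v1, a0          =>  v1 = {a0, a0, a0, a0}
     vmv.s.x  v1, a0          =>  v1[0] = a0, tail per tail policy
     vlse64.v v1, (a0), zero  =>  v1 = {*a0, *a0, *a0, *a0}  */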

Due to the pervasiveness of the patterns I needed to touch a lot
of places and tried to clean things up while at it.  The patch
therefore also introduces new helpers: expand_broadcast for vmv.v.x,
which dispatches to either the regular or the strided broadcast, and
expand_set_first, which does the same for vmv.s.x.
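
For illustration, a minimal usage sketch of the new helpers (with the
signatures as declared in riscv-protos.h below; dest/elt/vl are
placeholder names):

  rtx ops[] = {dest, elt};
  /* vmv.v.x/vfmv.v.f, falling back to strided broadcast if ELT
     cannot be broadcast directly; VLMAX without VL, length VL
     otherwise.  */
  riscv_vector::expand_broadcast (mode, ops);
  riscv_vector::expand_broadcast (mode, ops, vl);
  /* Same dispatch for the vmv.s.x/vfmv.s.f flavor.  */
  riscv_vector::expand_set_first (mode, ops);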

The non-strided fallbacks are now implemented as splitters of the
strided variants.  This makes it easier to see where and when things
happen.

The test cases I touched appeared wrong to me, so this patch sets a new
baseline for some of the scalar_move tests.

There is still work to be done but IMHO that can be deferred: it would
be clearer if the three broadcast-like variants differed not just in
name but also in RTL pattern so that matching is less confusing.  Right
now vmv.v.x and vmv.s.x only differ in the mask and are interchangeable
by just changing it from "all ones" to a "single one".
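
Concretely, the only difference today is the governing mask operand (a
sketch borrowing the helpers the patch itself uses in the .md bodies):

  /* vmv.v.x: all-ones mask ("Wc1").  */
  rtx mask_vvx = CONSTM1_RTX (<VM>mode);
  /* vmv.s.x: single-one mask ("Wb1").  */
  rtx mask_vsx = riscv_vector::gen_scalar_move_mask (<VM>mode);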

As last time, I regtested on rv64 and rv32 with strided_broadcast
turned on and off.  Note that there are regressions in
cond_fma_fnma-[78].c.  Those are due to the patch exposing more
fwprop/late-combine opportunities.  For fma/fnma we don't yet have
proper vv/vx costing in place, but I expect that to be addressed soon
and figured we can live with those regressions for the time being.

Regards
Robin

        PR target/121073

gcc/ChangeLog:

        * config/riscv/autovec-opt.md: Use new helpers.
        * config/riscv/autovec.md: Ditto.
        * config/riscv/predicates.md (strided_broadcast_mask_operand):
        New predicate.
        (strided_broadcast_operand): Ditto.
        (any_broadcast_operand): Ditto.
        * config/riscv/riscv-protos.h (expand_broadcast): Declare.
        (expand_set_first): Ditto.
        (expand_set_first_tu): Ditto.
        (strided_broadcast_p): Ditto.
        * config/riscv/riscv-string.cc (expand_vec_setmem): Use new
        helpers.
        * config/riscv/riscv-v.cc (expand_broadcast): New function.
        (expand_set_first): Ditto.
        (expand_set_first_tu): Ditto.
        (expand_const_vec_duplicate): Use new helpers.
        (expand_const_vector_duplicate_repeating): Ditto.
        (expand_const_vector_duplicate_default): Ditto.
        (sew64_scalar_helper): Ditto.
        (expand_vector_init_merge_repeating_sequence): Ditto.
        (expand_reduction): Ditto.
        (strided_broadcast_p): New function.
        (whole_reg_to_reg_move_p): Use new helpers.
        * config/riscv/riscv-vector-builtins-bases.cc: Use either
        broadcast or strided broadcast.
        * config/riscv/riscv-vector-builtins.cc
        (function_expander::use_ternop_insn): Ditto.
        (function_expander::use_widen_ternop_insn): Ditto.
        (function_expander::use_scalar_broadcast_insn): Ditto.
        * config/riscv/riscv-vector-builtins.h: Declare scalar
        broadcast.
        * config/riscv/vector.md (*pred_broadcast<mode>): Split into
        regular and strided broadcast.
        (*pred_broadcast<mode>_zvfh): Split.
        (pred_broadcast<mode>_zvfh): Ditto.
        (*pred_broadcast<mode>_zvfhmin): Ditto.
        (@pred_strided_broadcast<mode>): Ditto.
        (*pred_strided_broadcast<mode>): Ditto.
        (*pred_strided_broadcast<mode>_zvfhmin): Ditto.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-6.c: Adjust test
        expectation.
        * gcc.target/riscv/rvv/base/scalar_move-5.c: Ditto.
        * gcc.target/riscv/rvv/base/scalar_move-6.c: Ditto.
        * gcc.target/riscv/rvv/base/scalar_move-7.c: Ditto.
        * gcc.target/riscv/rvv/base/scalar_move-8.c: Ditto.
        * gcc.target/riscv/rvv/base/scalar_move-9.c: Ditto.
        * gcc.target/riscv/rvv/pr121073.c: New test.
---
gcc/config/riscv/autovec-opt.md               |   3 +-
gcc/config/riscv/autovec.md                   |   7 +-
gcc/config/riscv/predicates.md                |  13 +
gcc/config/riscv/riscv-protos.h               |   4 +
gcc/config/riscv/riscv-string.cc              |   6 +-
gcc/config/riscv/riscv-v.cc                   | 215 +++++++---
.../riscv/riscv-vector-builtins-bases.cc      |   3 +-
gcc/config/riscv/riscv-vector-builtins.cc     |  41 +-
gcc/config/riscv/riscv-vector-builtins.h      |   1 +
gcc/config/riscv/vector.md                    | 382 ++++++++++--------
.../riscv/rvv/autovec/vls-vlmax/repeat-6.c    |   2 +-
.../gcc.target/riscv/rvv/base/scalar_move-5.c |   6 +-
.../gcc.target/riscv/rvv/base/scalar_move-6.c |   4 +-
.../gcc.target/riscv/rvv/base/scalar_move-7.c |   6 +-
.../gcc.target/riscv/rvv/base/scalar_move-8.c |   5 +-
.../gcc.target/riscv/rvv/base/scalar_move-9.c |   1 -
gcc/testsuite/gcc.target/riscv/rvv/pr121073.c |  12 +
17 files changed, 461 insertions(+), 250 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/pr121073.c

diff --git a/gcc/config/riscv/autovec-opt.md b/gcc/config/riscv/autovec-opt.md
index f372f0e6a69..5300d370a11 100644
--- a/gcc/config/riscv/autovec-opt.md
+++ b/gcc/config/riscv/autovec-opt.md
@@ -1838,8 +1838,7 @@ (define_insn_and_split "*extend_vf_<mode>"
    emit_insn (gen_extend<vsubel><vel>2(tmp, operands[1]));

    rtx ops[] = {operands[0], tmp};
-    riscv_vector::emit_vlmax_insn (code_for_pred_broadcast (<MODE>mode),
-                                   riscv_vector::UNARY_OP, ops);
+    riscv_vector::expand_broadcast (<MODE>mode, ops);
    DONE;
  }
  [(set_attr "type" "vfwmuladd")]
diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index 2e86826f286..220a8828639 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1359,9 +1359,7 @@ (define_expand "vec_set<mode>"
  if (operands[2] == const0_rtx)
    {
      rtx ops[] = {operands[0], operands[0], operands[1]};
-      riscv_vector::emit_nonvlmax_insn (code_for_pred_broadcast (<MODE>mode),
-                                       riscv_vector::SCALAR_MOVE_MERGED_OP_TU,
-                                       ops, CONST1_RTX (Pmode));
+      riscv_vector::expand_set_first_tu (<MODE>mode, ops);
    }
  else
    {
@@ -1385,8 +1383,7 @@ (define_expand "vec_set<mode>"
         VL we need for the slide.  */
      rtx tmp = gen_reg_rtx (<MODE>mode);
      rtx ops1[] = {tmp, operands[1]};
-      emit_nonvlmax_insn (code_for_pred_broadcast (<MODE>mode),
-                           riscv_vector::UNARY_OP, ops1, length);
+      riscv_vector::expand_broadcast (<MODE>mode, ops1, length);

      /* Slide exactly one element up leaving the tail elements
         unchanged.  */
diff --git a/gcc/config/riscv/predicates.md b/gcc/config/riscv/predicates.md
index 1f9a6b562e5..381f96c3e72 100644
--- a/gcc/config/riscv/predicates.md
+++ b/gcc/config/riscv/predicates.md
@@ -517,6 +517,10 @@ (define_predicate "vector_mask_operand"
       (match_operand 0 "vector_all_trues_mask_operand")))

(define_predicate "vector_broadcast_mask_operand"
+  (ior (match_operand 0 "vector_least_significant_set_mask_operand")
+       (match_operand 0 "vector_all_trues_mask_operand")))
+
+(define_predicate "strided_broadcast_mask_operand"
  (ior (match_operand 0 "vector_least_significant_set_mask_operand")
    (ior (match_operand 0 "register_operand")
         (match_operand 0 "vector_all_trues_mask_operand"))))
@@ -619,6 +623,15 @@ (define_special_predicate "vector_any_register_operand"
(define_predicate "direct_broadcast_operand"
  (match_test "riscv_vector::can_be_broadcast_p (op)"))

+;; A strided broadcast is just a fallback pattern that loads from
+;; memory.
+(define_predicate "strided_broadcast_operand"
+  (match_test "riscv_vector::strided_broadcast_p (op)"))
+
+(define_predicate "any_broadcast_operand"
+  (ior (match_operand 0 "direct_broadcast_operand")
+       (match_operand 0 "strided_broadcast_operand")))
+
;; A CONST_INT operand that has exactly two bits cleared.
(define_predicate "const_nottwobits_operand"
  (and (match_code "const_int")
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index a41c4c299fa..0379f2ce256 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -695,6 +695,9 @@ bool expand_block_move (rtx, rtx, rtx, bool);
machine_mode preferred_simd_mode (scalar_mode);
machine_mode get_mask_mode (machine_mode);
void expand_vec_series (rtx, rtx, rtx, rtx = 0);
+void expand_broadcast (machine_mode, rtx *, rtx = 0);
+void expand_set_first (machine_mode, rtx *, rtx = 0);
+void expand_set_first_tu (machine_mode, rtx *, rtx = 0);
void expand_vec_init (rtx, rtx);
void expand_vec_perm (rtx, rtx, rtx, rtx);
void expand_select_vl (rtx *);
@@ -762,6 +765,7 @@ enum vlmul_type get_vlmul (rtx_insn *);
int count_regno_occurrences (rtx_insn *, unsigned int);
bool imm_avl_p (machine_mode);
bool can_be_broadcast_p (rtx);
+bool strided_broadcast_p (rtx);
bool gather_scatter_valid_offset_p (machine_mode);
HOST_WIDE_INT estimated_poly_value (poly_int64, unsigned int);
bool whole_reg_to_reg_move_p (rtx *, machine_mode, int);
diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index 90801899ec8..61c4a095ab4 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -1625,16 +1625,14 @@ expand_vec_setmem (rtx dst_in, rtx length_in, rtx fill_value_in)
     Otherwise, use a predicated store.  */
  if (known_eq (GET_MODE_SIZE (info.vmode), INTVAL (info.avl)))
    {
-      emit_vlmax_insn (code_for_pred_broadcast (info.vmode), UNARY_OP,
-                      broadcast_ops);
+      riscv_vector::expand_broadcast (info.vmode, broadcast_ops);
      emit_move_insn (dst, fill_value);
    }
  else
    {
      if (!satisfies_constraint_vl (info.avl))
        info.avl = force_reg (Pmode, info.avl);
-      emit_nonvlmax_insn (code_for_pred_broadcast (info.vmode),
-                         riscv_vector::UNARY_OP, broadcast_ops, info.avl);
+      riscv_vector::expand_broadcast (info.vmode, broadcast_ops, info.avl);
      machine_mode mask_mode
        = riscv_vector::get_vector_mode (BImode, GET_MODE_NUNITS (info.vmode))
          .require ();
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 242ac087764..a051bc94fc2 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -1190,6 +1190,59 @@ expand_vector_init_trailing_same_elem (rtx target,
  return false;
}

+/* Helper function to emit a vmv.v.x/vmv.v.i or one of its float
+   variants.  If VL is not given a VLMAX insn will be emitted,
+   otherwise a non-VLMAX insn with length VL.
+   If the value to be broadcast is not suitable for vmv.v.x we
+   fall back to a vlse with zero stride.  This itself has a
+   fallback if the uarch prefers not to use a strided load
+   for broadcast.  */
+
+void
+expand_broadcast (machine_mode mode, rtx *ops, rtx vl)
+{
+  rtx elt = ops[1];
+  avl_type type = vl ? NONVLMAX : VLMAX;
+  if (can_be_broadcast_p (elt))
+    emit_avltype_insn (code_for_pred_broadcast (mode), UNARY_OP, ops,
+                      type, vl);
+  else
+    emit_avltype_insn (code_for_pred_strided_broadcast (mode),
+                      UNARY_OP, ops, type, vl);
+}
+
+/* Similar to expand_broadcast but emits a vmv.s.x/vfmv.s.f instead.  */
+
+void
+expand_set_first (machine_mode mode, rtx *ops, rtx vl)
+{
+  rtx elt = ops[1];
+  avl_type type = vl ? NONVLMAX : VLMAX;
+  if (can_be_broadcast_p (elt))
+    emit_avltype_insn (code_for_pred_broadcast (mode),
+                       SCALAR_MOVE_OP, ops, type, vl);
+  else
+    emit_avltype_insn (code_for_pred_strided_broadcast (mode),
+                       SCALAR_MOVE_OP, ops, type, vl);
+}
+
+/* Similar to expand_set_first but keeps the tail elements
+   unchanged (TU).  */
+
+void
+expand_set_first_tu (machine_mode mode, rtx *ops, rtx vl)
+{
+  rtx elt = ops[2];
+  if (!vl)
+    vl = const1_rtx;
+  if (can_be_broadcast_p (elt))
+    emit_nonvlmax_insn (code_for_pred_broadcast (mode),
+                       SCALAR_MOVE_MERGED_OP_TU, ops, vl);
+  else
+    emit_nonvlmax_insn (code_for_pred_strided_broadcast (mode),
+                       SCALAR_MOVE_MERGED_OP_TU, ops, vl);
+}
+
static void
expand_const_vec_duplicate (rtx target, rtx src, rtx elt)
{
@@ -1226,7 +1279,7 @@ expand_const_vec_duplicate (rtx target, rtx src, rtx elt)
      if (lra_in_progress)
        {
          rtx ops[] = {result, elt};
-         emit_vlmax_insn (code_for_pred_broadcast (mode), UNARY_OP, ops);
+         expand_broadcast (mode, ops);
        }
      else
        {
@@ -1278,8 +1331,7 @@ expand_const_vector_duplicate_repeating (rtx target, rvv_builder *builder)
    {
      dup = gen_reg_rtx (builder->new_mode ());
      rtx ops[] = {dup, ele};
-      emit_vlmax_insn (code_for_pred_broadcast (builder->new_mode ()),
-                      UNARY_OP, ops);
+      expand_broadcast (builder->new_mode (), ops);
    }
  else
    dup = expand_vector_broadcast (builder->new_mode (), ele);
@@ -1322,8 +1374,7 @@ expand_const_vector_duplicate_default (rtx target, rvv_builder *builder)

  rtx tmp1 = gen_reg_rtx (builder->mode ());
  rtx dup_ops[] = {tmp1, builder->elt (0)};
-  emit_vlmax_insn (code_for_pred_broadcast (builder->mode ()), UNARY_OP,
-                  dup_ops);
+  expand_broadcast (builder->mode (), dup_ops);

  for (unsigned int i = 1; i < builder->npatterns (); i++)
    {
@@ -2136,18 +2187,27 @@ has_vi_variant_p (rtx_code code, rtx x)
    }
}

+/* This is a helper for binary ops with DImode scalar operands that are
+   broadcast (like vadd.vx v1, a1).
+   Instead of duplicating similar code in all the expanders this function
+   centralizes the handling: the scalar operand is either used directly
+   (via a vi or vx variant) or broadcast into a vector register first so
+   that a vv variant can be used.  */
+
bool
sew64_scalar_helper (rtx *operands, rtx *scalar_op, rtx vl,
                     machine_mode vector_mode, bool has_vi_variant_p,
                     void (*emit_vector_func) (rtx *, rtx), enum avl_type type)
{
  machine_mode scalar_mode = GET_MODE_INNER (vector_mode);
+
+  /* If the scalar broadcast op fits an immediate, use the
+     vop.vi variant if there is one.  */
  if (has_vi_variant_p)
    {
      *scalar_op = force_reg (scalar_mode, *scalar_op);
      return false;
    }

+  /* On a 64-bit target we can always use the vop.vx variant.  */
  if (TARGET_64BIT)
    {
      if (!rtx_equal_p (*scalar_op, const0_rtx))
@@ -2155,6 +2215,8 @@ sew64_scalar_helper (rtx *operands, rtx *scalar_op, rtx vl,
      return false;
    }

+  /* On a 32-bit target, if the value fits a 32-bit immediate we can
+     still use the sign-extending (SI -> DI) vop.vx variants.  */
  if (immediate_operand (*scalar_op, Pmode))
    {
      if (!rtx_equal_p (*scalar_op, const0_rtx))
@@ -2164,40 +2226,28 @@ sew64_scalar_helper (rtx *operands, rtx *scalar_op, rtx vl,
      return false;
    }

-  bool avoid_strided_broadcast = false;
+  /* Now we're left with a 64-bit immediate or a register.
+     We cannot use a vop.vx variant but must broadcast the value first
+     and switch to a vop.vv variant.
+     Broadcast can either be done via vlse64.v v1, reg, zero
+     or by loading one 64-bit element (vle64.v) and using a
+     broadcast vrgather.vi.  */
+  gcc_assert (!TARGET_64BIT
+             && (CONST_INT_P (*scalar_op)
+                 || register_operand (*scalar_op, scalar_mode)));
+
  if (CONST_INT_P (*scalar_op))
    {
      if (maybe_gt (GET_MODE_SIZE (scalar_mode), GET_MODE_SIZE (Pmode)))
-       {
-         if (strided_load_broadcast_p ())
-           *scalar_op = force_const_mem (scalar_mode, *scalar_op);
-         else
-           avoid_strided_broadcast = true;
-       }
+       *scalar_op = force_const_mem (scalar_mode, *scalar_op);
      else
        *scalar_op = force_reg (scalar_mode, *scalar_op);
    }

  rtx tmp = gen_reg_rtx (vector_mode);
-  if (!avoid_strided_broadcast)
-    {
-      rtx ops[] = {tmp, *scalar_op};
-      emit_avltype_insn (code_for_pred_broadcast (vector_mode), UNARY_OP, ops,
-                        type, vl);
-    }
-  else
-    {
-      /* Load scalar as V1DI and broadcast via vrgather.vi.  */
-      rtx tmp1 = gen_reg_rtx (V1DImode);
-      emit_move_insn (tmp1, lowpart_subreg (V1DImode, *scalar_op,
-                                           scalar_mode));
-      tmp1 = lowpart_subreg (vector_mode, tmp1, V1DImode);
-
-      rtx ops[] = {tmp, tmp1, CONST0_RTX (Pmode)};
-      emit_vlmax_insn (code_for_pred_gather_scalar (vector_mode),
-                      BINARY_OP, ops);
-    }
-
+  rtx ops[] = {tmp, *scalar_op};
+  emit_avltype_insn (code_for_pred_strided_broadcast (vector_mode),
+                    UNARY_OP, ops, type, vl);
  emit_vector_func (operands, tmp);

  return true;
@@ -2591,8 +2641,7 @@ expand_vector_init_merge_repeating_sequence (rtx target,

  /* Step 1: Broadcast the first pattern.  */
  rtx ops[] = {target, force_reg (builder.inner_mode (), builder.elt (0))};
-  emit_vlmax_insn (code_for_pred_broadcast (builder.mode ()),
-                   UNARY_OP, ops);
+  expand_broadcast (builder.mode (), ops);
  /* Step 2: Merge the rest iteration of pattern.  */
  for (unsigned int i = 1; i < builder.npatterns (); i++)
    {
@@ -2605,8 +2654,7 @@ expand_vector_init_merge_repeating_sequence (rtx target,
      if (full_nelts <= builder.inner_bits_size ()) /* vmv.s.x.  */
        {
          rtx ops[] = {dup, merge_mask};
-         emit_nonvlmax_insn (code_for_pred_broadcast (GET_MODE (dup)),
-                              SCALAR_MOVE_OP, ops, CONST1_RTX (Pmode));
+         expand_set_first (GET_MODE (dup), ops);
        }
      else /* vmv.v.x.  */
        {
@@ -2614,8 +2662,7 @@ expand_vector_init_merge_repeating_sequence (rtx target,
                       force_reg (GET_MODE_INNER (mask_int_mode), merge_mask)};
          rtx vl = gen_int_mode (CEIL (full_nelts, builder.inner_bits_size ()),
                                 Pmode);
-         emit_nonvlmax_insn (code_for_pred_broadcast (mask_int_mode), UNARY_OP,
-                              ops, vl);
+         expand_broadcast (mask_int_mode, ops, vl);
        }

      emit_move_insn (mask, gen_lowpart (mask_bit_mode, dup));
@@ -4706,20 +4753,20 @@ expand_reduction (unsigned unspec, unsigned unspec_for_vl0_safe,

  rtx m1_tmp = gen_reg_rtx (m1_mode);
  rtx scalar_move_ops[] = {m1_tmp, init};
-  insn_code icode = code_for_pred_broadcast (m1_mode);
  if (need_mask_operand_p (insn_flags))
    {
      if (need_vl0_safe)
-       emit_nonvlmax_insn (icode, SCALAR_MOVE_OP, scalar_move_ops, const1_rtx);
+       expand_set_first (m1_mode, scalar_move_ops, const1_rtx);
      else
-       emit_nonvlmax_insn (icode, SCALAR_MOVE_OP, scalar_move_ops, vl_op);
+       expand_set_first (m1_mode, scalar_move_ops, vl_op);
    }
  else
-    emit_vlmax_insn (icode, SCALAR_MOVE_OP, scalar_move_ops);
+    expand_set_first (m1_mode, scalar_move_ops);

  rtx m1_tmp2 = gen_reg_rtx (m1_mode);
  rtx reduc_ops[] = {m1_tmp2, vector_src, m1_tmp};

+  insn_code icode;
  if (need_vl0_safe)
    icode = code_for_pred (unspec_for_vl0_safe, vmode);
  else
@@ -5808,25 +5855,84 @@ count_regno_occurrences (rtx_insn *rinsn, unsigned int regno)
  return count;
}

-/* Return true if the OP can be directly broadcast.  */
+/* Return true if the OP can be broadcast with a
+   v[f]mv.v.[xif] instruction.  */
+
bool
can_be_broadcast_p (rtx op)
{
  machine_mode mode = GET_MODE (op);
-  /* We don't allow RA (register allocation) reload generate
-    (vec_duplicate:DI reg) in RV32 system wheras we allow
-    (vec_duplicate:DI mem) in RV32 system.  */
-  if (!can_create_pseudo_p () && !FLOAT_MODE_P (mode)
-      && maybe_gt (GET_MODE_SIZE (mode), GET_MODE_SIZE (Pmode))
-      && !satisfies_constraint_Wdm (op))
+
+  /* Zero always works and we can always put an immediate into a
+     register.
+     What's tricky is that for an immediate we don't know the mode
+     of the register it will end up in, i.e. what element size
+     we want to broadcast.  So even if the immediate is small it might
+     still end up in a DImode register that we cannot broadcast.
+     vmv.s.x, i.e. a single-element set can handle this, though,
+     because it implicitly sign-extends to SEW.  */
+  if (rtx_equal_p (op, CONST0_RTX (mode))
+      || const_int_operand (op, Xmode))
+    return true;
+
+  /* Do not accept DImode broadcasts on !TARGET_64BIT.  Those
+     are handled by strided broadcast.  */
+  if (INTEGRAL_MODE_P (mode)
+      && maybe_gt (GET_MODE_SIZE (mode), UNITS_PER_WORD))
+    return false;
+
+  /* We can handle non-register operands that can be forced into a
+     register.  These don't need to use strided broadcast.  */
+  if (INTEGRAL_MODE_P (mode)
+      && (memory_operand (op, mode) || CONST_POLY_INT_P (op))
+      && can_create_pseudo_p ())
+    return true;
+
+  /* Likewise, do not accept HFmode broadcast if we don't have
+     vfmv.v.f for 16-bit registers available.  */
+  if (mode == HFmode && !TARGET_ZVFH)
+    return false;
+
+  /* Same for float, just that we can always handle 64-bit doubles
+     even on !TARGET_64BIT.  We have ruled out 16-bit HF already
+     above.  */
+  if (FLOAT_MODE_P (mode)
+      && (memory_operand (op, mode) || CONSTANT_P (op))
+      && can_create_pseudo_p ())
+    return true;
+
+  /* After excluding all the cases we cannot handle, the register
+     operands that remain can always be broadcast.  */
+  if (register_operand (op, mode))
+    return true;
+
+  return false;
+}
+
+/* Return true for all operands that cannot use vmv.v.x, vfmv.v.f,
+   vmv.s.x, or vfmv.s.f but rather need to go via memory.  */
+
+bool
+strided_broadcast_p (rtx op)
+{
+  machine_mode mode = GET_MODE (op);
+  if (!memory_operand (op, mode)
+      && !register_operand (op, mode)
+      && !rtx_equal_p (op, CONST0_RTX (mode))
+      && !const_int_operand (op, mode))
    return false;

-  if (satisfies_constraint_K (op) || register_operand (op, mode)
-      || (strided_load_broadcast_p () && satisfies_constraint_Wdm (op))
-      || rtx_equal_p (op, CONST0_RTX (mode)))
+  /* !TARGET_64BIT does not have a vmv.v.x/vmv.s.x for 64-bit
+     DImode elements.  */
+  if (INTEGRAL_MODE_P (mode)
+      && maybe_gt (GET_MODE_SIZE (mode), UNITS_PER_WORD))
    return true;

-  return can_create_pseudo_p () && nonmemory_operand (op, mode);
+  /* Zvfhmin does not have a vfmv.v.f/vfmv.s.f for 16-bit elements.  */
+  if (!TARGET_ZVFH && mode == HFmode)
+    return true;
+
+  return false;
}

void
@@ -5941,7 +6047,10 @@ whole_reg_to_reg_move_p (rtx *ops, machine_mode mode, int avl_type_index)
  return false;
}

-/* Return true if we can transform vmv.v.x/vfmv.v.f to vmv.s.x/vfmv.s.f.  */
+/* Return true if we can transform vmv.v.x/vfmv.v.f to vmv.s.x/vfmv.s.f.
+   That's the case if we're dealing with a scalar broadcast that
+   has VL = 1.  */
+
bool
splat_to_scalar_move_p (rtx *ops)
{
diff --git a/gcc/config/riscv/riscv-vector-builtins-bases.cc b/gcc/config/riscv/riscv-vector-builtins-bases.cc
index bf5172c6e04..7e4d396f05a 100644
--- a/gcc/config/riscv/riscv-vector-builtins-bases.cc
+++ b/gcc/config/riscv/riscv-vector-builtins-bases.cc
@@ -643,7 +643,8 @@ public:
        return e.use_exact_insn (code_for_pred_mov (e.vector_mode ()));
      case OP_TYPE_x:
      case OP_TYPE_f:
-       return e.use_exact_insn (code_for_pred_broadcast (e.vector_mode ()));
+       return e.use_scalar_broadcast_insn
+         (code_for_pred_broadcast (e.vector_mode ()));
      default:
        gcc_unreachable ();
      }
diff --git a/gcc/config/riscv/riscv-vector-builtins.cc b/gcc/config/riscv/riscv-vector-builtins.cc
index 8810af0d9cc..0db7549fc5c 100644
--- a/gcc/config/riscv/riscv-vector-builtins.cc
+++ b/gcc/config/riscv/riscv-vector-builtins.cc
@@ -4753,7 +4753,10 @@ function_expander::use_ternop_insn (bool vd_accum_p, insn_code icode)
}

/* Implement the call using instruction ICODE, with a 1:1 mapping between
-   arguments and input operands.  */
+   arguments and input operands.
+   There are operands that cannot be broadcast using v[f]mv.  In that case
+   we switch to a strided broadcast.  */
+
rtx
function_expander::use_widen_ternop_insn (insn_code icode)
{
@@ -4794,7 +4797,10 @@ function_expander::use_widen_ternop_insn (insn_code icode)
}

/* Implement the call using instruction ICODE, with a 1:1 mapping between
-   arguments and input operands.  */
+   arguments and input operands.
+   There are operands that cannot be broadcast using v[f]mv.  In that case
+   we switch to a strided broadcast.  */
+
rtx
function_expander::use_scalar_move_insn (insn_code icode)
{
@@ -4812,6 +4818,37 @@ function_expander::use_scalar_move_insn (insn_code icode)
  for (int argno = arg_offset; argno < call_expr_nargs (exp); argno++)
    add_input_operand (argno);

+  if (!can_be_broadcast_p (m_ops[3].value))
+    icode = code_for_pred_strided_broadcast (vector_mode ());
+
+  add_input_operand (Pmode, get_tail_policy_for_pred (pred));
+  add_input_operand (Pmode, get_mask_policy_for_pred (pred));
+  add_input_operand (Pmode, get_avl_type_rtx (avl_type::NONVLMAX));
+  return generate_insn (icode);
+}
+
+/* Implement the call using instruction ICODE, with a 1:1 mapping between
+   arguments and input operands.  */
+rtx
+function_expander::use_scalar_broadcast_insn (insn_code icode)
+{
+  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
+
+  /* Record the offset to get the argument.  */
+  int arg_offset = 0;
+  add_all_one_mask_operand (mask_mode ());
+
+  if (use_real_merge_p (pred))
+    add_input_operand (arg_offset++);
+  else
+    add_vundef_operand (mode);
+
+  for (int argno = arg_offset; argno < call_expr_nargs (exp); argno++)
+    add_input_operand (argno);
+
+  if (!can_be_broadcast_p (m_ops[3].value))
+    icode = code_for_pred_strided_broadcast (vector_mode ());
+
  add_input_operand (Pmode, get_tail_policy_for_pred (pred));
  add_input_operand (Pmode, get_mask_policy_for_pred (pred));
  add_input_operand (Pmode, get_avl_type_rtx (avl_type::NONVLMAX));
diff --git a/gcc/config/riscv/riscv-vector-builtins.h b/gcc/config/riscv/riscv-vector-builtins.h
index 1f2587ab6af..86d81154db9 100644
--- a/gcc/config/riscv/riscv-vector-builtins.h
+++ b/gcc/config/riscv/riscv-vector-builtins.h
@@ -497,6 +497,7 @@ public:
  rtx use_ternop_insn (bool, insn_code);
  rtx use_widen_ternop_insn (insn_code);
  rtx use_scalar_move_insn (insn_code);
+  rtx use_scalar_broadcast_insn (insn_code);
  rtx generate_insn (insn_code);

  /* The function call expression.  */
diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index baf215b8a44..946819f537f 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -1551,20 +1551,44 @@ (define_expand "movmisalign<mode>"
(define_expand "vec_duplicate<mode>"
  [(set (match_operand:V_VLS 0 "register_operand")
        (vec_duplicate:V_VLS
-          (match_operand:<VEL> 1 "direct_broadcast_operand")))]
+          (match_operand:<VEL> 1 "any_broadcast_operand")))]
  "TARGET_VECTOR"
  {
-    /* Early expand DImode broadcast in RV32 system to avoid RA reload
-       generate (set (reg) (vec_duplicate:DI)).  */
+    /* Don't keep a DImode broadcast for RV32 in the vec_duplicate form.
+       Otherwise combine or late combine could end up doing
+             "64-bit broadcast" (!= vmv.v.x)
+            + vadd.vv
+           = vadd.vx
+       which would be invalid.  */
    bool gt_p = maybe_gt (GET_MODE_SIZE (<VEL>mode), GET_MODE_SIZE (Pmode));
    if (!FLOAT_MODE_P (<VEL>mode) && gt_p)
      {
-        riscv_vector::emit_vlmax_insn (code_for_pred_broadcast (<MODE>mode),
-                                      riscv_vector::UNARY_OP, operands);
-       DONE;
+        riscv_vector::emit_vlmax_insn
+         (code_for_pred_strided_broadcast
+           (<MODE>mode), riscv_vector::UNARY_OP, operands);
+       DONE;
      }
-    /* Otherwise, allow it fall into general vec_duplicate pattern
-       which allow us to have vv->vx combine optimization in later pass.  */
+
+    /* Even though we can eventually broadcast any permissible
+       constant by moving it into a register we need to force
+       any non-immediate one into a register here.
+       If we didn't do that we couldn't fwprop/late-combine
+             vec_duplicate 123.45f
+           + vfadd.vv
+           = vfadd.vf
+       because the constant is valid for vec_duplicate but not
+       for vfadd.vf.  Therefore we need to do
+             fa0 = 123.45f
+             vec_duplicate fa0
+           + vfadd.vv
+           = vfadd.vf  */
+    if (!satisfies_constraint_P (operands[1])
+       && !satisfies_constraint_J (operands[1])
+       && !rtx_equal_p (operands[1], CONST0_RTX (<VEL>mode))
+       && !memory_operand (operands[1], <VEL>mode))
+      operands[1] = force_reg (<VEL>mode, operands[1]);
+
+    /* Otherwise keep the vec_duplicate pattern until split.  */
  })

;; According to GCC internal:
@@ -1574,28 +1598,20 @@ (define_expand "vec_duplicate<mode>"
(define_insn_and_split "*vec_duplicate<mode>"
  [(set (match_operand:V_VLS 0 "register_operand")
        (vec_duplicate:V_VLS
-          (match_operand:<VEL> 1 "direct_broadcast_operand")))]
+          (match_operand:<VEL> 1 "any_broadcast_operand")))]
  "TARGET_VECTOR && can_create_pseudo_p ()"
  "#"
  "&& 1"
  [(const_int 0)]
  {
-    if (!strided_load_broadcast_p ()
-       && TARGET_ZVFHMIN && !TARGET_ZVFH && <VEL>mode == HFmode)
-      {
-       /* For Float16, reinterpret as HImode, broadcast and reinterpret
-          back.  */
-       poly_uint64 nunits = GET_MODE_NUNITS (<MODE>mode);
-       machine_mode vmodehi
-         = riscv_vector::get_vector_mode (HImode, nunits).require ();
-       rtx ops[] = {lowpart_subreg (vmodehi, operands[0], <MODE>mode),
-                    lowpart_subreg (HImode, operands[1], HFmode)};
-       riscv_vector::emit_vlmax_insn (code_for_pred_broadcast (vmodehi),
-                                      riscv_vector::UNARY_OP, ops);
-      }
-    else
+    if (riscv_vector::can_be_broadcast_p (operands[1]))
      riscv_vector::emit_vlmax_insn (code_for_pred_broadcast (<MODE>mode),
                                     riscv_vector::UNARY_OP, operands);
+    else
+      riscv_vector::emit_vlmax_insn (code_for_pred_strided_broadcast
+                                    (<MODE>mode), riscv_vector::UNARY_OP,
+                                    operands);
+
    DONE;
  }
  [(set_attr "type" "vector")]
@@ -2141,69 +2157,45 @@ (define_expand "@pred_broadcast<mode>"
          (match_operand:V_VLS 2 "vector_merge_operand")))]
  "TARGET_VECTOR"
{
-  /* Transform vmv.v.x/vfmv.v.f (avl = 1) into vmv.s.x since vmv.s.x/vfmv.s.f
-     has better chances to do vsetvl fusion in vsetvl pass.  */
  bool wrap_vec_dup = true;
  rtx vec_cst = NULL_RTX;
-  if (riscv_vector::splat_to_scalar_move_p (operands))
-    {
-      operands[1] = riscv_vector::gen_scalar_move_mask (<VM>mode);
-      operands[3] = force_reg (<VEL>mode, operands[3]);
-    }
-  else if (immediate_operand (operands[3], <VEL>mode)
-          && (vec_cst = gen_const_vec_duplicate (<MODE>mode, operands[3]))
-          && (/* -> pred_broadcast<mode>_zero */
-              (vector_least_significant_set_mask_operand (operands[1],
-                                                          <VM>mode)
-               && vector_const_0_operand (vec_cst, <MODE>mode))
-              || (/* pred_broadcast<mode>_imm */
-                  vector_all_trues_mask_operand (operands[1], <VM>mode)
-                  && vector_const_int_or_double_0_operand (vec_cst,
-                                                           <MODE>mode))))
+  if (immediate_operand (operands[3], <VEL>mode)
+      && (vec_cst = gen_const_vec_duplicate (<MODE>mode, operands[3]))
+      && (/* -> pred_broadcast<mode>_zero */
+         (vector_least_significant_set_mask_operand (operands[1],
+                                                     <VM>mode)
+          && vector_const_0_operand (vec_cst, <MODE>mode))
+         || (/* pred_broadcast<mode>_imm */
+             vector_all_trues_mask_operand (operands[1], <VM>mode)
+             && vector_const_int_or_double_0_operand (vec_cst,
+                                                      <MODE>mode))))
    {
      operands[3] = vec_cst;
      wrap_vec_dup = false;
    }
-  /* Handle vmv.s.x instruction (Wb1 mask) which has memory scalar.  */
-  else if (satisfies_constraint_Wdm (operands[3]))
-    {
-      if (satisfies_constraint_Wb1 (operands[1]))
-       {
-         /* Case 1: vmv.s.x (TA, x == memory) ==> vlse.v (TA)  */
-         if (satisfies_constraint_vu (operands[2]))
-           operands[1] = CONSTM1_RTX (<VM>mode);
-         else if (GET_MODE_BITSIZE (<VEL>mode) > GET_MODE_BITSIZE (Pmode))
-           {
-             /* Case 2: vmv.s.x (TU, x == memory) ==>
-                          vl = 0 or 1; + vlse.v (TU) in RV32 system  */
-             operands[4] = riscv_vector::gen_avl_for_scalar_move (operands[4]);
-             operands[1] = CONSTM1_RTX (<VM>mode);
-           }
-         else
-           /* Case 3: load x (memory) to register.  */
-           operands[3] = force_reg (<VEL>mode, operands[3]);
-       }
-    }
-  else if (GET_MODE_BITSIZE (<VEL>mode) > GET_MODE_BITSIZE (Pmode)
-          && (immediate_operand (operands[3], Pmode)
+  else if (GET_MODE_SIZE (<VEL>mode) > UNITS_PER_WORD
+          && satisfies_constraint_Wb1 (operands[1])
+          && (immediate_operand (operands[3], Xmode)
               || (CONST_POLY_INT_P (operands[3])
                   && known_ge (rtx_to_poly_int64 (operands[3]), 0U)
-                  && known_le (rtx_to_poly_int64 (operands[3]), GET_MODE_SIZE (<MODE>mode)))))
+                  && known_le (rtx_to_poly_int64 (operands[3]),
+                               GET_MODE_SIZE (<MODE>mode)))))
    {
      rtx tmp = gen_reg_rtx (Pmode);
      poly_int64 value = rtx_to_poly_int64 (operands[3]);
-      emit_move_insn (tmp, gen_int_mode (value, Pmode));
+      emit_move_insn (tmp, gen_int_mode (value, Xmode));
      operands[3] = gen_rtx_SIGN_EXTEND (<VEL>mode, tmp);
    }
-  /* Never load (const_int 0) into a register, that's silly.  */
-  else if (operands[3] == CONST0_RTX (<VEL>mode))
+
+  /* For a vmv.v.x never load (const_int 0) or valid immediate operands
+     into a register, because we can use vmv.v.i.  */
+  else if (satisfies_constraint_Wc1 (operands[1])
+      && (satisfies_constraint_P (operands[3])
+         || operands[3] == CONST0_RTX (<VEL>mode)))
    ;
-  /* If we're broadcasting [-16..15] across more than just
-     element 0, then we can use vmv.v.i directly, thus avoiding
-     the load of the constant into a GPR.  */
-  else if (CONST_INT_P (operands[3])
-          && IN_RANGE (INTVAL (operands[3]), -16, 15)
-          && !satisfies_constraint_Wb1 (operands[1]))
+  /* For vmv.s.x we have vmv.s.x v1, zero.  */
+  else if (satisfies_constraint_Wb1 (operands[1])
+          && operands[3] == CONST0_RTX (<VEL>mode))
    ;
  else
    operands[3] = force_reg (<VEL>mode, operands[3]);
@@ -2211,131 +2203,68 @@ (define_expand "@pred_broadcast<mode>"
    operands[3] = gen_rtx_VEC_DUPLICATE (<MODE>mode, operands[3]);
})

-(define_insn_and_split "*pred_broadcast<mode>"
-  [(set (match_operand:V_VLSI 0 "register_operand"                 "=vr, vr, vd, vd, vr, vr, vr, vr")
+(define_insn_and_rewrite "*pred_broadcast<mode>"
+  [(set (match_operand:V_VLSI 0 "register_operand"                 "=vr, vr, vr, vr")
        (if_then_else:V_VLSI
          (unspec:<VM>
-           [(match_operand:<VM> 1 "vector_broadcast_mask_operand" "Wc1,Wc1, vm, vm,Wc1,Wc1,Wb1,Wb1")
-            (match_operand 4 "vector_length_operand"              "rvl,rvl,rvl,rvl,rvl,rvl,rvl,rvl")
-            (match_operand 5 "const_int_operand"                  "  i,  i,  i,  i,  i,  i,  i,  i")
-            (match_operand 6 "const_int_operand"                  "  i,  i,  i,  i,  i,  i,  i,  i")
-            (match_operand 7 "const_int_operand"                  "  i,  i,  i,  i,  i,  i,  i,  i")
+           [(match_operand:<VM> 1 "vector_broadcast_mask_operand" "Wc1,Wc1,Wb1,Wb1")
+            (match_operand 4 "vector_length_operand"              "rvl,rvl,rvl,rvl")
+            (match_operand 5 "const_int_operand"                  "  i,  i,  i,  i")
+            (match_operand 6 "const_int_operand"                  "  i,  i,  i,  i")
+            (match_operand 7 "const_int_operand"                  "  i,  i,  i,  i")
             (reg:SI VL_REGNUM)
             (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
          (vec_duplicate:V_VLSI
-           (match_operand:<VEL> 3 "direct_broadcast_operand"       "rP,rP,Wdm,Wdm,Wdm,Wdm, rJ, rJ"))
-         (match_operand:V_VLSI 2 "vector_merge_operand"            "vu, 0, vu,  0, vu,  0, vu,  0")))]
+           (match_operand:<VEL> 3 "direct_broadcast_operand"      " rP, rP, rJ, rJ"))
+         (match_operand:V_VLSI 2 "vector_merge_operand"           " vu,  0, vu,  0")))]
  "TARGET_VECTOR"
  "@
   vmv.v.%o3\t%0,%3
   vmv.v.%o3\t%0,%3
-   vlse<sew>.v\t%0,%3,zero,%1.t
-   vlse<sew>.v\t%0,%3,zero,%1.t
-   vlse<sew>.v\t%0,%3,zero
-   vlse<sew>.v\t%0,%3,zero
   vmv.s.x\t%0,%z3
   vmv.s.x\t%0,%z3"
-  "(register_operand (operands[3], <VEL>mode)
-  || CONST_POLY_INT_P (operands[3]))
-  && GET_MODE_BITSIZE (<VEL>mode) > GET_MODE_BITSIZE (Pmode)"
-  [(const_int 0)]
-  {
-    gcc_assert (can_create_pseudo_p ());
-    if (CONST_POLY_INT_P (operands[3]))
-      {
-       rtx tmp = gen_reg_rtx (<VEL>mode);
-       emit_move_insn (tmp, operands[3]);
-       operands[3] = tmp;
-      }
-
-    /* For SEW = 64 in RV32 system, we expand vmv.s.x:
-       andi a2,a2,1
-       vsetvl zero,a2,e64
-       vlse64.v  */
-    if (satisfies_constraint_Wb1 (operands[1]))
-      {
-       operands[4] = riscv_vector::gen_avl_for_scalar_move (operands[4]);
-       operands[1] = CONSTM1_RTX (<VM>mode);
-      }
-
-    /* If the target doesn't want a strided-load broadcast we go with a regular
-       V1DImode load and a broadcast gather.  */
-    if (strided_load_broadcast_p ())
-      {
-       rtx mem = assign_stack_local (<VEL>mode, GET_MODE_SIZE (<VEL>mode),
-                                     GET_MODE_ALIGNMENT (<VEL>mode));
-       mem = validize_mem (mem);
-       emit_move_insn (mem, operands[3]);
-       mem = gen_rtx_MEM (<VEL>mode, force_reg (Pmode, XEXP (mem, 0)));
-
-       emit_insn
-         (gen_pred_broadcast<mode>
-          (operands[0], operands[1], operands[2], mem,
-           operands[4], operands[5], operands[6], operands[7]));
-      }
-    else
-      {
-       rtx tmp = gen_reg_rtx (V1DImode);
-       emit_move_insn (tmp, lowpart_subreg (V1DImode, operands[3],
-                                            <VEL>mode));
-       tmp = lowpart_subreg (<MODE>mode, tmp, V1DImode);
-
-       emit_insn
-         (gen_pred_gather<mode>_scalar
-          (operands[0], operands[1], operands[2], tmp, CONST0_RTX (Pmode),
-           operands[4], operands[5], operands[6], operands[7]));
-      }
-    DONE;
-  }
-  [(set_attr "type" "vimov,vimov,vlds,vlds,vlds,vlds,vimovxv,vimovxv")
+  "&& (operands[1] == CONSTM1_RTX (<VM>mode)
+       && operands[4] == CONST1_RTX (Pmode)
+       && (register_operand (operands[3], <VEL>mode)
+           || satisfies_constraint_J (operands[3])))"
+{
+  /* A broadcast of a single element is just a vmv.s.x.  */
+  operands[1] = riscv_vector::gen_scalar_move_mask (<VM>mode);
+}
+  [(set_attr "type" "vimov,vimov,vimovxv,vimovxv")
   (set_attr "mode" "<MODE>")])

-(define_insn "*pred_broadcast<mode>_zvfh"
-  [(set (match_operand:V_VLSF    0 "register_operand"              "=vr,  vr,  vr,  vr")
+(define_insn_and_rewrite "pred_broadcast<mode>_zvfh"
+  [(set (match_operand:V_VLSF    0 "register_operand"              "=vr, vr, vr, vr")
        (if_then_else:V_VLSF
          (unspec:<VM>
-           [(match_operand:<VM> 1 "vector_broadcast_mask_operand" "Wc1, Wc1, Wb1, Wb1")
-            (match_operand      4 "vector_length_operand"         "rvl, rvl, rvl, rvl")
-            (match_operand      5 "const_int_operand"             "  i,   i,   i,   i")
-            (match_operand      6 "const_int_operand"             "  i,   i,   i,   i")
-            (match_operand      7 "const_int_operand"             "  i,   i,   i,   i")
+           [(match_operand:<VM> 1 "vector_broadcast_mask_operand" "Wc1,Wc1,Wb1,Wb1")
+            (match_operand      4 "vector_length_operand"         "rvl,rvl,rvl,rvl")
+            (match_operand      5 "const_int_operand"             "  i,  i,  i,  i")
+            (match_operand      6 "const_int_operand"             "  i,  i,  i,  i")
+            (match_operand      7 "const_int_operand"             "  i,  i,  i,  i")
             (reg:SI VL_REGNUM)
             (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
          (vec_duplicate:V_VLSF
-           (match_operand:<VEL> 3 "direct_broadcast_operand"      "  f,   f,   f,   f"))
-         (match_operand:V_VLSF  2 "vector_merge_operand"          " vu,   0,  vu,   0")))]
+           (match_operand:<VEL> 3 "direct_broadcast_operand"      "  f,  f,  f,  f"))
+         (match_operand:V_VLSF  2 "vector_merge_operand"          " vu,  0, vu,  0")))]
  "TARGET_VECTOR"
  "@
   vfmv.v.f\t%0,%3
   vfmv.v.f\t%0,%3
   vfmv.s.f\t%0,%3
   vfmv.s.f\t%0,%3"
+  "&& (operands[1] == CONSTM1_RTX (<VM>mode)
+       && operands[4] == CONST1_RTX (Pmode)
+       && (register_operand (operands[3], <VEL>mode)
+           || satisfies_constraint_J (operands[3])))"
+{
+  /* A broadcast of a single element is just a vfmv.s.f.  */
+  operands[1] = riscv_vector::gen_scalar_move_mask (<VM>mode);
+}
  [(set_attr "type" "vfmov,vfmov,vfmovfv,vfmovfv")
   (set_attr "mode" "<MODE>")])

-(define_insn "*pred_broadcast<mode>_zvfhmin"
-  [(set (match_operand:V_VLSF_ZVFHMIN   0 "register_operand"              "=vr,  vr,  vr,  vr")
-       (if_then_else:V_VLSF_ZVFHMIN
-         (unspec:<VM>
-           [(match_operand:<VM>        1 "vector_broadcast_mask_operand" " vm,  vm, Wc1, Wc1")
-            (match_operand             4 "vector_length_operand"         "rvl, rvl, rvl, rvl")
-            (match_operand             5 "const_int_operand"             "  i,   i,   i,   i")
-            (match_operand             6 "const_int_operand"             "  i,   i,   i,   i")
-            (match_operand             7 "const_int_operand"             "  i,   i,   i,   i")
-            (reg:SI VL_REGNUM)
-            (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
-         (vec_duplicate:V_VLSF_ZVFHMIN
-           (match_operand:<VEL>        3 "direct_broadcast_operand"      "  A,   A,   A,   A"))
-         (match_operand:V_VLSF_ZVFHMIN 2 "vector_merge_operand"          " vu,   0,  vu,   0")))]
-  "TARGET_VECTOR && strided_load_broadcast_p ()"
-  "@
-   vlse<sew>.v\t%0,%3,zero,%1.t
-   vlse<sew>.v\t%0,%3,zero,%1.t
-   vlse<sew>.v\t%0,%3,zero
-   vlse<sew>.v\t%0,%3,zero"
-  [(set_attr "type" "vlds,vlds,vlds,vlds")
-   (set_attr "mode" "<MODE>")])
-
(define_insn "*pred_broadcast<mode>_extended_scalar"
  [(set (match_operand:V_VLSI_D 0 "register_operand"               "=vr, vr, vr, vr")
        (if_then_else:V_VLSI_D
@@ -2398,6 +2327,119 @@ (define_insn "*pred_broadcast<mode>_imm"
  [(set_attr "type" "vimov,vimov")
   (set_attr "mode" "<MODE>")])

+(define_expand "@pred_strided_broadcast<mode>"
+  [(set (match_operand:V_VLS 0 "register_operand")
+       (if_then_else:V_VLS
+         (unspec:<VM>
+           [(match_operand:<VM> 1 "strided_broadcast_mask_operand")
+            (match_operand 4 "vector_length_operand")
+            (match_operand 5 "const_int_operand")
+            (match_operand 6 "const_int_operand")
+            (match_operand 7 "const_int_operand")
+            (reg:SI VL_REGNUM)
+            (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
+         (vec_duplicate:V_VLS
+           (match_operand:<VEL> 3 "strided_broadcast_operand"))
+         (match_operand:V_VLS 2 "vector_merge_operand")))]
+  "TARGET_VECTOR"
+{
+  if (satisfies_constraint_Wb1 (operands[1]))
+    {
+      /* If we're asked to set a single element (like vmv.s.x but we
+        need to go via memory here) and the tail policy is agnostic
+        we are allowed to overwrite all elements and pretend to do
+        a vmv.v.x but via memory.
+        Thus, set the mask to broadcast.  */
+      operands[1] = CONSTM1_RTX (<VM>mode);
+      if (!satisfies_constraint_vu (operands[2])
+         && GET_MODE_SIZE (<VEL>mode) > UNITS_PER_WORD)
+       {
+         /* vmv.s.x (TU, x == memory) ==>
+            vl = 0 or 1; + vlse.v (TU) on RV32.
+            Here we must not overwrite the residual elements,
+            so set the vector length to 0/1.  */
+         operands[4] = riscv_vector::gen_avl_for_scalar_move (operands[4]);
+       }
+    }
+})
+
+(define_insn_and_split "*pred_strided_broadcast<mode>"
+  [(set (match_operand:V_VLSI 0 "register_operand"                  "=vd, vd, vr, vr")
+       (if_then_else:V_VLSI
+         (unspec:<VM>
+           [(match_operand:<VM> 1 "strided_broadcast_mask_operand" " vm, vm,Wc1,Wc1")
+            (match_operand 4 "vector_length_operand"               "rvl,rvl,rvl,rvl")
+            (match_operand 5 "const_int_operand"                   "  i,  i,  i,  i")
+            (match_operand 6 "const_int_operand"                   "  i,  i,  i,  i")
+            (match_operand 7 "const_int_operand"                   "  i,  i,  i,  i")
+            (reg:SI VL_REGNUM)
+            (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
+         (vec_duplicate:V_VLSI
+           (match_operand:<VEL> 3 "strided_broadcast_operand"      "  A,  A,  A,  A"))
+         (match_operand:V_VLSI 2 "vector_merge_operand"            " vu,  0, vu,  0")))]
+  "TARGET_VECTOR"
+  "@
+   vlse<sew>.v\t%0,%3,zero,%1.t
+   vlse<sew>.v\t%0,%3,zero,%1.t
+   vlse<sew>.v\t%0,%3,zero
+   vlse<sew>.v\t%0,%3,zero"
+  "&& !strided_load_broadcast_p () && can_create_pseudo_p ()"
+  [(const_int 0)]
+  {
+    rtx tmp = gen_reg_rtx (V1DImode);
+    emit_move_insn (tmp, gen_lowpart (V1DImode, operands[3]));
+    tmp = lowpart_subreg (<MODE>mode, tmp, V1DImode);
+
+    emit_insn
+      (gen_pred_gather<mode>_scalar
+       (operands[0], operands[1], operands[2], tmp, CONST0_RTX (Pmode),
+       operands[4], operands[5], operands[6], operands[7]));
+    DONE;
+  }
+  [(set_attr "type" "vlds,vlds,vlds,vlds")
+   (set_attr "mode" "<MODE>")])
+
+(define_insn_and_split "*pred_strided_broadcast<mode>_zvfhmin"
+  [(set (match_operand:V_VLSF_ZVFHMIN   0 "register_operand"               "=vr,  vr,  vr,  vr")
+       (if_then_else:V_VLSF_ZVFHMIN
+         (unspec:<VM>
+           [(match_operand:<VM>        1 "strided_broadcast_mask_operand" " vm,  vm, Wc1, Wc1")
+            (match_operand             4 "vector_length_operand"          "rvl, rvl, rvl, rvl")
+            (match_operand             5 "const_int_operand"              "  i,   i,   i,   i")
+            (match_operand             6 "const_int_operand"              "  i,   i,   i,   i")
+            (match_operand             7 "const_int_operand"              "  i,   i,   i,   i")
+            (reg:SI VL_REGNUM)
+            (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
+         (vec_duplicate:V_VLSF_ZVFHMIN
+           (match_operand:<VEL>        3 "strided_broadcast_operand"      "  A,   A,   A,   A"))
+         (match_operand:V_VLSF_ZVFHMIN 2 "vector_merge_operand"           " vu,   0,  vu,   0")))]
+  "TARGET_VECTOR"
+  "@
+   vlse<sew>.v\t%0,%3,zero,%1.t
+   vlse<sew>.v\t%0,%3,zero,%1.t
+   vlse<sew>.v\t%0,%3,zero
+   vlse<sew>.v\t%0,%3,zero"
+  "&& !strided_load_broadcast_p ()
+   && <VEL>mode == HFmode
+   && can_create_pseudo_p ()"
+  [(const_int 0)]
+  {
+    poly_uint64 nunits = GET_MODE_NUNITS (<MODE>mode);
+    machine_mode vmodehi
+      = riscv_vector::get_vector_mode (HImode, nunits).require ();
+    rtx ops[] = {gen_lowpart (vmodehi, operands[0]),
+                gen_lowpart (HImode, operands[3])};
+    riscv_vector::emit_avltype_insn (code_for_pred_broadcast (vmodehi),
+                                    riscv_vector::UNARY_OP, ops,
+                                    (riscv_vector::avl_type) INTVAL (operands[7]),
+                                    operands[4]);
+    DONE;
+  }
+  [(set_attr "type" "vlds,vlds,vlds,vlds")
+   (set_attr "mode" "<MODE>")])
+
+
;; -------------------------------------------------------------------------------
;; ---- Predicated Strided loads/stores
;; -------------------------------------------------------------------------------
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-6.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-6.c
index 4dc5703d894..0fa1ea049b1 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-6.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-6.c
@@ -72,7 +72,7 @@ f_vnx128qi (int8_t *out)
  *(vnx128qi *) out = v;
}

-/* { dg-final { scan-assembler-times {vmv.v.x\tv[0-9]+,\s*[a-x0-9]+} 6 } } */
+/* { dg-final { scan-assembler-times {vmv.v.x\tv[0-9]+,\s*[a-x0-9]+} 7 } } */
/* { dg-final { scan-assembler-times {slli\t[a-x0-9]+,\s*[a-x0-9]+,\s*8} 6 } } */
/* { dg-final { scan-assembler-times {or\t[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+} 6 } } */
/* { dg-final { scan-assembler-times {vslide1down\.vx\tv[0-9]+,\s*v[0-9]+,\s*[a-x0-9]+} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-5.c b/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-5.c
index 04dec7bc8dc..4f6785ace68 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-5.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-5.c
@@ -6,9 +6,9 @@

/*
** foo:
-** addi\t[a-x0-9]+,\s*[a-x0-9]+,100
+** ...
** vsetvli\tzero,a2,e64,m2,t[au],m[au]
-** vlse64.v\tv[0-9]+,0\([a-x0-9]+\),zero
+** vmv.s.x\tv[0-9]+.*
** vs2r.v\tv[0-9]+,0\([a-x0-9]+\)
** ret
*/
@@ -23,7 +23,7 @@ void foo (void *base, void *out, size_t vl)
** foo2:
** fld\tfa[0-9]+,\s*100\(a0\)
** vsetvli\tzero,a2,e64,m2,t[au],m[au]
-** vfmv\.v\.f\tv[0-9]+,\s*fa[0-9]+
+** vfmv\.s\.f\tv[0-9]+,\s*fa[0-9]+
** vs2r.v\tv[0-9]+,0\([a-x0-9]+\)
** ret
*/
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-6.c b/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-6.c
index 0ebb92eda42..a8c9263c4df 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-6.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-6.c
@@ -23,7 +23,7 @@ void foo (void *base, void *out, size_t vl)
** foo2:
** fld\tfa[0-9]+,\s*100\(a0\)
** vsetvli\tzero,a2,e64,m2,t[au],m[au]
-** vfmv\.v\.f\tv[0-9]+,\s*fa[0-9]+
+** vfmv\.s\.f\tv[0-9]+,\s*fa[0-9]+
** vs2r.v\tv[0-9]+,0\([a-x0-9]+\)
** ret
*/
@@ -52,7 +52,7 @@ void foo3 (void *base, void *out, size_t vl)
/*
** foo4:
** ...
-** vfmv\.v\.f\tv[0-9]+,\s*fa[0-9]+
+** vfmv\.s\.f\tv[0-9]+,\s*fa[0-9]+
** ...
** ret
*/
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-7.c b/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-7.c
index 512fa62858a..cf53aca5c62 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-7.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-7.c
@@ -6,9 +6,9 @@

/*
** foo:
-** addi\t[a-x0-9]+,\s*[a-x0-9]+,100
+** ...
** vsetvli\tzero,a2,e64,m2,t[au],m[au]
-** vlse64.v\tv[0-9]+,0\([a-x0-9]+\),zero
+** vmv\.v\.x\tv[0-9]+,\s*a[0-9]+
** vs2r.v\tv[0-9]+,0\([a-x0-9]+\)
** ret
*/
@@ -37,7 +37,7 @@ void foo2 (void *base, void *out, size_t vl)
/*
** foo3:
** ...
-** vlse64.v\tv[0-9]+,0\([a-x0-9]+\),zero
+** vmv\.v\.x\tv[0-9]+,\s*a[0-9]+
** ...
** ret
*/
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-8.c b/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-8.c
index d9d10f3702a..fd3b7c57510 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-8.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-8.c
@@ -175,9 +175,8 @@ void foo12 (void *base, void *out, size_t vl)
/*
** foo13:
** ...
-** vmv.v.x\tv[0-9]+,\s*[a-x0-9]+
+** vlse64.v\tv[0-9]+,0\([a-x0-9]+\),zero
** ...
-** ret
*/
void foo13 (void *base, void *out, size_t vl)
{
@@ -189,7 +188,7 @@ void foo13 (void *base, void *out, size_t vl)
/*
** foo14:
** ...
-** vmv.v.x\tv[0-9]+,\s*[a-x0-9]+
+** vlse64.v\tv[0-9]+,0\([a-x0-9]+\),zero
** ...
*/
void foo14 (void *base, void *out, size_t vl)
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-9.c b/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-9.c
index 80ee1b5f0c9..64c22dd39e6 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-9.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/scalar_move-9.c
@@ -23,4 +23,3 @@ vuint64m2_t f3(vuint64m2_t var_17, uint64_t var_60, size_t vl)

/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*0,\s*e64,\s*m2,\s*t[au],\s*m[au]} 1 } } */
/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*1,\s*e64,\s*m2,\s*t[au],\s*m[au]} 1 } } */
-/* { dg-final { scan-assembler-times {sgtu} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/pr121073.c b/gcc/testsuite/gcc.target/riscv/rvv/pr121073.c
new file mode 100644
index 00000000000..e095c271c5d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/pr121073.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -O3 -mrvv-vector-bits=zvl -fsigned-char -fno-strict-aliasing -fwrapv -Wno-stringop-overflow -Wno-aggressive-loop-optimizations" } */
+
+int a;
+unsigned char p[1][21];
+void init() {
+  for (int s = 0; s < 21; ++s)
+    for (int t = 0; t < 21; ++t)
+      p[s][t] = 39;
+  for (short t = 0; t < 9; t += -5077966496202321318LL + 28071)
+    a = p[3][t] && p[2][t];
+}
--
2.50.0
