Jennifer Schmitz <jschm...@nvidia.com> writes:
> SVE loads/stores using predicates that select the bottom 8, 16, 32, 64,
> or 128 bits of a register can be folded to ASIMD LDR/STR, thus avoiding the
> predicate.
> For example,
> svuint8_t foo (uint8_t *x) {
>   return svld1 (svwhilelt_b8 (0, 16), x);
> }
> was previously compiled to:
> foo:
>       ptrue   p3.b, vl16
>       ld1b    z0.b, p3/z, [x0]
>       ret
>
> and is now compiled to:
> foo:
>       ldr     q0, [x0]
>       ret
>
> The optimization is applied during the expand pass and was implemented
> by making the following changes to maskload<mode><vpred> and
> maskstore<mode><vpred>:
> - the existing define_insns were renamed and new define_expands for maskloads
>   and maskstores were added, whose operand predicates for the SVE governing
>   predicate match both register operands and constant-vector operands.
> - if the SVE predicate is a constant vector and contains a pattern as
>   described above, an ASIMD load/store is emitted instead of the SVE
>   load/store.
>
> The patch implements the optimization for LD1 and ST1, for 8-bit, 16-bit,
> 32-bit, 64-bit, and 128-bit moves, for all full SVE data vector modes.
> Note that VNx8HFmode and VNx2BFmode with a VL2 pattern were excluded, because
> there are no move patterns for V2HFmode and V2BFmode (yet).
>
> Follow-up patches for LD2/3/4 and ST2/3/4 and potentially partial SVE vector
> modes are planned.
>
> The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>
> gcc/
>       PR target/117978
>       * config/aarch64/aarch64-protos.h: Declare
>       aarch64_simd_container_mode, aarch64_sve_full_data_mode_p,
>       aarch64_count_pred_pat_128, aarch64_emit_load_store_through_mode.
>       * config/aarch64/aarch64-sve.md
>       (maskload<mode><vpred>): New define_expand folding maskloads with
>       certain predicate patterns to ASIMD loads.
>       (*aarch64_maskload<mode><vpred>): Renamed from maskload<mode><vpred>.
>       (maskstore<mode><vpred>): New define_expand folding maskstores with
>       certain predicate patterns to ASIMD stores.
>       (*aarch64_maskstore<mode><vpred>): Renamed from maskstore<mode><vpred>.
>       * config/aarch64/aarch64.cc
>       (aarch64_sve_full_data_mode_p): New function returning true if a given
>       mode is a full SVE data vector mode.
>       (aarch64_emit_load_store_through_mode): New function emitting a
>       load/store through subregs of a given mode.
>       (aarch64_emit_sve_pred_move): Refactor to use
>       aarch64_emit_load_store_through_mode.
>       (aarch64_v8_mode): New function returning an 8-bit mode.
>       (aarch64_v16_mode): New function returning a 16-bit mode.
>       (aarch64_v32_mode): New function returning a 32-bit mode.
>       (aarch64_simd_container_mode): Make public and extend to find
>       8-bit, 16-bit, and 32-bit container modes.
>       (aarch64_count_pred_pat_128): New function to find SVE predicates
>       with VL1, VL2, VL4, VL8, or VL16 patterns.
>       * config/aarch64/iterators.md (elem_bits): Extend to cover partial
>       SVE vector modes.
>       * config/aarch64/predicates.md (aarch64_sve_reg_or_const_pred): New
>       predicate matching register operands or constant-vector operands.
>
> gcc/testsuite/
>       PR target/117978
>       * gcc.target/aarch64/sve/acle/general/whilelt_5.c: Adjust expected
>       outcome.
>       * gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c: New test.
>       * gcc.target/aarch64/sve/while_7.c: Adjust expected outcome.
>       * gcc.target/aarch64/sve/while_9.c: Adjust expected outcome.
> ---
>  gcc/config/aarch64/aarch64-protos.h           |   4 +
>  gcc/config/aarch64/aarch64-sve.md             |  62 ++++++++-
>  gcc/config/aarch64/aarch64.cc                 | 128 +++++++++++++++---
>  gcc/config/aarch64/iterators.md               |  19 ++-
>  gcc/config/aarch64/predicates.md              |   4 +
>  .../aarch64/sve/acle/general/whilelt_5.c      |  24 +++-
>  .../aarch64/sve/ldst_ptrue_pat_128_to_neon.c  |  81 +++++++++++
>  .../gcc.target/aarch64/sve/while_7.c          |   4 +-
>  .../gcc.target/aarch64/sve/while_9.c          |   2 +-
>  9 files changed, 296 insertions(+), 32 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 1ca86c9d175..a03f091fe3a 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -857,6 +857,7 @@ enum aarch64_symbol_type aarch64_classify_symbolic_expression (rtx);
>  bool aarch64_advsimd_struct_mode_p (machine_mode mode);
>  opt_machine_mode aarch64_v64_mode (scalar_mode);
>  opt_machine_mode aarch64_v128_mode (scalar_mode);
> +machine_mode aarch64_simd_container_mode (scalar_mode, poly_int64);
>  opt_machine_mode aarch64_full_sve_mode (scalar_mode);
>  bool aarch64_can_const_movi_rtx_p (rtx x, machine_mode mode);
>  bool aarch64_valid_fp_move (rtx, rtx, machine_mode);
> @@ -903,8 +904,10 @@ opt_machine_mode aarch64_advsimd_vector_array_mode (machine_mode,
>                                                   unsigned HOST_WIDE_INT);
>  opt_machine_mode aarch64_sve_data_mode (scalar_mode, poly_uint64);
>  bool aarch64_sve_mode_p (machine_mode);
> +bool aarch64_sve_full_data_mode_p (machine_mode);
>  HOST_WIDE_INT aarch64_fold_sve_cnt_pat (aarch64_svpattern, unsigned int);
>  bool aarch64_sve_cnt_immediate_p (rtx);
> +int aarch64_count_pred_pat_128 (rtx, machine_mode);
>  bool aarch64_sve_scalar_inc_dec_immediate_p (rtx);
>  bool aarch64_sve_rdvl_immediate_p (rtx);
>  bool aarch64_sve_addvl_addpl_immediate_p (rtx);
> @@ -1026,6 +1029,7 @@ rtx aarch64_ptrue_reg (machine_mode, unsigned int);
>  rtx aarch64_ptrue_reg (machine_mode, machine_mode);
>  rtx aarch64_pfalse_reg (machine_mode);
>  bool aarch64_sve_same_pred_for_ptest_p (rtx *, rtx *);
> +void aarch64_emit_load_store_through_mode (rtx, rtx, machine_mode);
>  void aarch64_emit_sve_pred_move (rtx, rtx, rtx);
>  void aarch64_expand_sve_mem_move (rtx, rtx, machine_mode);
>  bool aarch64_maybe_expand_sve_subreg_move (rtx, rtx);
> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> index d4af3706294..d9392e3611a 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -1286,7 +1286,36 @@
>  ;; -------------------------------------------------------------------------
>  
>  ;; Predicated LD1 (single).
> -(define_insn "maskload<mode><vpred>"
> +(define_expand "maskload<mode><vpred>"
> +  [(set (match_operand:SVE_ALL 0 "register_operand")
> +     (unspec:SVE_ALL
> +       [(match_operand:<VPRED> 2 "aarch64_sve_reg_or_const_pred")

I don't think we need a new predicate for this.  nonmemory_operand
should be good enough.
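I.e. just:

  (match_operand:<VPRED> 2 "nonmemory_operand")

with the expander still forcing anything that isn't already a register
into a register before falling through to the define_insn.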

> +        (match_operand:SVE_ALL 1 "memory_operand")
> +        (match_operand:SVE_ALL 3 "aarch64_maskload_else_operand")]
> +       UNSPEC_LD1_SVE))]
> +  "TARGET_SVE"
> +  {
> +    int pat_cnt = aarch64_count_pred_pat_128 (operands[2], <MODE>mode);
> +    int width = <elem_bits> * pat_cnt;
> +    if (aarch64_sve_full_data_mode_p (<MODE>mode)
> +     && pat_cnt && (pat_cnt == 1 || !BYTES_BIG_ENDIAN)
> +     && known_le (width, 128))
> +      {
> +     machine_mode mode = aarch64_simd_container_mode (<VEL>mode, width);

I'm not sure this needs to be so complex.  It should be enough to pick
an arbitrary 128-bit vector mode (say V16QI) for width==128,
an arbitrary 64-bit vector mode (say V8QI) for width==64,
and an integer mode for smaller widths
(int_mode_for_size (...).require ()).
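
Something like this (untested), reusing the names from the patch:

  machine_mode new_mode;
  if (width == 128)
    new_mode = V16QImode;
  else if (width == 64)
    new_mode = V8QImode;
  else
    new_mode = int_mode_for_size (width, 0).require ();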

Experimenting locally, I assume you did the above to avoid ICEs while
forming the subregs.  But that's something we should fix in
aarch64_emit_load_store_through_mode.  Writing it as:

/* Emit a load/store from a subreg of SRC to a subreg of DEST.
   The subregs have mode NEW_MODE. Use only for reg<->mem moves.  */
void
aarch64_emit_load_store_through_mode (rtx dest, rtx src, machine_mode new_mode)
{
  gcc_assert ((REG_P (src) && MEM_P (dest))
              || (REG_P (dest) && MEM_P (src)));
  auto mode = GET_MODE (dest);
  auto int_mode = aarch64_sve_int_mode (mode);
  if (MEM_P (src))
    {
      rtx tmp = force_reg (new_mode, adjust_address (src, new_mode, 0));
      tmp = force_lowpart_subreg (int_mode, tmp, new_mode);
      emit_move_insn (dest, force_lowpart_subreg (mode, tmp, int_mode));
    }
  else
    {
      src = force_lowpart_subreg (int_mode, src, mode);
      emit_move_insn (adjust_address (dest, new_mode, 0),
                      force_lowpart_subreg (new_mode, src, int_mode));
    }
}

should cope with the general case, and means that the new test has
no SVE loads or stores.

> +     if (mode != VOIDmode)
> +     {
> +       aarch64_emit_load_store_through_mode (operands[0],
> +                                             operands[1], mode);
> +       DONE;
> +     }
> +      }
> +    if (!REG_P (operands[2]))

Probably better to use if (CONSTANT_P (...)) or if (!register_operand (...)),
so that we don't create a new register if operands[2] is a subreg.

> +      operands[2] = force_reg (<VPRED>mode, operands[2]);
> +  }
> +)
> +
> +;; Predicated LD1 (single).
> +(define_insn "*aarch64_maskload<mode><vpred>"
>    [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
>       (unspec:SVE_ALL
>         [(match_operand:<VPRED> 2 "register_operand" "Upl")
> @@ -2287,7 +2316,36 @@
>  ;; -------------------------------------------------------------------------
>  
>  ;; Predicated ST1 (single).
> -(define_insn "maskstore<mode><vpred>"
> +(define_expand "maskstore<mode><vpred>"
> +  [(set (match_operand:SVE_ALL 0 "memory_operand")
> +     (unspec:SVE_ALL
> +       [(match_operand:<VPRED> 2 "aarch64_sve_reg_or_const_pred")
> +        (match_operand:SVE_ALL 1 "register_operand")
> +        (match_dup 0)]
> +       UNSPEC_ST1_SVE))]
> +  "TARGET_SVE"
> +  {
> +    int pat_cnt = aarch64_count_pred_pat_128 (operands[2], <MODE>mode);
> +    int width = <elem_bits> * pat_cnt;
> +    if (aarch64_sve_full_data_mode_p (<MODE>mode)
> +     && pat_cnt && (pat_cnt == 1 || !BYTES_BIG_ENDIAN)
> +     && known_le (width, 128))
> +      {
> +     machine_mode mode = aarch64_simd_container_mode (<VEL>mode, width);
> +     if (mode != VOIDmode)
> +     {
> +       aarch64_emit_load_store_through_mode (operands[0],
> +                                             operands[1], mode);
> +       DONE;
> +     }

This seems a bit too complex for cut-&-paste.  Could you put it in a common
aarch64.cc helper?
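
Something along these lines, say (untested, and the name
aarch64_expand_maskloadstore is just invented for the sketch):

/* Try to expand the SVE maskload/maskstore described by OPERANDS
   (in the operand order of the maskload/maskstore patterns, with data
   mode MODE) as an Advanced SIMD access.  Return true on success,
   false if the caller should fall back to a predicated SVE access.  */
bool
aarch64_expand_maskloadstore (rtx *operands, machine_mode mode)
{
  int pat_cnt = aarch64_count_pred_pat_128 (operands[2], mode);
  int width = GET_MODE_UNIT_BITSIZE (mode) * pat_cnt;
  if (!aarch64_sve_full_data_mode_p (mode)
      || pat_cnt == 0
      || (pat_cnt > 1 && BYTES_BIG_ENDIAN)
      || width > 128)
    return false;

  /* Pick the move mode as suggested above: an arbitrary vector mode
     for 128-bit and 64-bit widths, an integer mode otherwise.  */
  machine_mode new_mode;
  if (width == 128)
    new_mode = V16QImode;
  else if (width == 64)
    new_mode = V8QImode;
  else
    new_mode = int_mode_for_size (width, 0).require ();
  aarch64_emit_load_store_through_mode (operands[0], operands[1], new_mode);
  return true;
}

Then each expander body reduces to something like:

  if (CONSTANT_P (operands[2]))
    {
      if (aarch64_expand_maskloadstore (operands, <MODE>mode))
        DONE;
      operands[2] = force_reg (<VPRED>mode, operands[2]);
    }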

> +      }
> +    if (!REG_P (operands[2]))
> +      operands[2] = force_reg (<VPRED>mode, operands[2]);
> +  }
> +)
> +
> +;; Predicated ST1 (single).
> +(define_insn "*aarch64_maskstore<mode><vpred>"
>    [(set (match_operand:SVE_ALL 0 "memory_operand" "+m")
>       (unspec:SVE_ALL
>         [(match_operand:<VPRED> 2 "register_operand" "Upl")
> [...]
> @@ -23526,6 +23602,26 @@ aarch64_simd_valid_imm (rtx op, simd_immediate_info *info,
>    return false;
>  }
>  
> +/* If PRED is a patterned SVE PTRUE predicate with patterns
> +   VL1, VL2, VL4, VL8, or VL16, return the number of active lanes
> +   for the mode MODE. Else return 0.  */
> +int
> +aarch64_count_pred_pat_128 (rtx pred, machine_mode mode)
> +{
> +  struct simd_immediate_info info;
> +  bool is_valid;
> +  is_valid = aarch64_simd_valid_imm (pred, &info, AARCH64_CHECK_MOV);
> +  if (!is_valid || info.insn != simd_immediate_info::PTRUE)
> +    return 0;
> +  aarch64_svpattern pattern = info.u.pattern;
> +  unsigned int cnt
> +    = aarch64_fold_sve_cnt_pat (pattern, 128 / GET_MODE_UNIT_BITSIZE (mode));
> +  if (pattern <= AARCH64_SV_VL16 && pow2p_hwi (cnt))
> +    return cnt;
> +  else
> +    return 0;
> +}
> +

It would be better to avoid the round-trip through aarch64_simd_valid_imm,
since the caller doesn't really care whether the length is a valid ptrue
operand.  How about instead:

int
aarch64_partial_ptrue_length (rtx pred)
{
  rtx_vector_builder builder;
  if (!aarch64_get_sve_pred_bits (builder, pred))
    return 0;

  auto elt_size = vector_element_size (GET_MODE_BITSIZE (GET_MODE (pred)),
                                       GET_MODE_NUNITS (GET_MODE (pred)));
  return aarch64_partial_ptrue_length (builder, elt_size);
}

This needs an extra change to aarch64_partial_ptrue_length to handle
the VL1 case (I remember the VL1 case being discussed recently in a
different context):

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index fff8d9da49d..1c4194ce883 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -3667,6 +3674,14 @@ aarch64_partial_ptrue_length (rtx_vector_builder &builder,
   if (builder.nelts_per_pattern () == 3)
     return 0;
 
+  /* It is conservatively correct to drop the element size to a lower value,
+     and we must do so if the predicate consists of a leading "foreground"
+     sequence that is smaller than the element size.  Without this,
+     we would test only one bit and so treat everything as either an
+     all-true or an all-false predicate.  */
+  if (builder.nelts_per_pattern () == 2)
+    elt_size = MIN (elt_size, builder.npatterns ());
+
   /* Skip over leading set bits.  */
   unsigned int nelts = builder.encoded_nelts ();
   unsigned int i = 0;

I haven't bootstrapped & regression-tested that yet, but I'll give it a go.

On the tests: it would be good to have a second testcase for things that
can't use the new path, such as non-power-of-2 ptrue VLs, or passing
svptrue_b16() to an 8-bit load.  (Unless that's already covered elsewhere.)
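
For instance (untested, just to illustrate the kind of thing I mean):

#include <arm_sve.h>

/* Non-power-of-2 ptrue VL: should stay a predicated SVE load.  */
svuint8_t
f1 (uint8_t *x)
{
  return svld1 (svwhilelt_b8 (0, 7), x);
}

/* 16-bit ptrue governing an 8-bit load: likewise.  */
svuint8_t
f2 (uint8_t *x)
{
  return svld1 (svptrue_b16 (), x);
}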

Thanks,
Richard
