Kyrylo Tkachov <ktkac...@nvidia.com> writes:
> Hi Richard,
>
>> On 23 Oct 2024, at 11:30, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>
>> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>> Hi all,
>>>
>>> Some vector rotate operations can be implemented in a single instruction
>>> rather than using the fallback SHL+USRA sequence.
>>> In particular, when the rotate amount is half the bitwidth of the element
>>> we can use a REV64, REV32 or REV16 instruction.
>>> This patch adds this transformation to the recently added splitter for
>>> vector rotates.
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>
>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>
>>> gcc/
>>>
>>> * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>>> Declare prototype.
>>> * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>>> * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
>>> Call the above.
>>>
>>> gcc/testsuite/
>>>
>>> * gcc.target/aarch64/simd/pr117048_2.c: New test.
>>
>> Sorry to be awkward, but I still think at least part of this should be
>> target-independent.  Any rotate by a byte amount can be expressed as a
>> vector permutation in a target-independent way.  Target-independent code
>> can then use the usual optab routines to query whether the permutation
>> is possible and/or try to generate it.
>
> Thank you for elaborating.  I had already prototyped the permute
> index-computing code in my tree but was reluctant to use it during expand,
> as I wanted the rotate RTX to be available for combining into XAR, so I
> felt a bit stuck.  Having the code in a generic place but called from the
> backend at a time of its choosing makes sense to me.
>
>>
>> I can see that it probably makes sense to leave target code to make
>> the decision about when to use the permutation strategy vs. other
>> approaches.  But the code to implement that strategy shouldn't need
>> to be target-specific.
>>
>> E.g. we could have a routine:
>>
>>  expand_rotate_as_vec_perm
>>
>> which checks whether the rotation amount is suitable and tries to
>> generate the permutation if so.
>
> I’ve implemented something like that in the attached patch.
> It seems to work on AArch64 but, as mentioned in the commit message, I’d
> like a check on the big-endian logic, and perhaps some pointers on
> how/whether it should be extended to VLA vectors.

Great!  Thanks for doing this.  Some comments on those aspects below,
but otherwise it LGTM.

>
> I’m updating the other patches in the series according to your feedback, so
> I’ll repost them once I’m done; I just wanted to get this out for further
> iteration in the meantime.
> Thanks,
> Kyrill
>
> From 6c1794a574b5525b3b495ed505621a8af029e825 Mon Sep 17 00:00:00 2001
> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> Date: Wed, 16 Oct 2024 04:10:08 -0700
> Subject: [PATCH] aarch64: Optimize vector rotates as vector permutes where
>  possible
>
> Some vector rotate operations can be implemented in a single instruction
> rather than using the fallback SHL+USRA sequence.
> In particular, when the rotate amount is half the bitwidth of the element
> we can use a REV64, REV32 or REV16 instruction.
> More generally, rotates by a byte amount can be implemented using vector
> permutes.
> This patch adds such a generic routine in expmed.cc called
> expand_rotate_as_vec_perm that calculates the required permute indices
> and uses the expand_vec_perm_const interface.
>
> On aarch64 this ends up generating the single-instruction sequences above
> where possible and can use LDR+TBL sequences too, which are a good choice.
>
> For now, I have restricted the expand_rotate_as_vec_perm to fixed-width modes
> as I don't have much experience with using it for VLA modes, but I imagine
> it's extendable there.  In any case, the only use of expand_rotate_as_vec_perm
> is in aarch64-specific code that for now only handles fixed-width modes.
>
> A runtime aarch64 test is added to ensure the permute indices are not messed
> up.
> I'd appreciate a review of the BYTES_BIG_ENDIAN logic.  I've adjusted the
> permute vector indices in RTL for it, and in the final AArch64 assembly the
> vector loaded from .LC<N> is identical for little and big-endian, which I
> *think* is the correct behaviour.  For rotates by the half-width, which
> should generate single REV64 or REV32 instructions, aarch64 does not seem
> to recognise them and falls back to an LDR+TBL for big-endian.  I'm not
> sure if that's simply missing logic in the vector permute expansion in
> aarch64 or if I should be passing a different permute vector.  If I delete
> the BYTES_BIG_ENDIAN hunk in expand_rotate_as_vec_perm the permute indices
> are reversed, but the REV* instructions are still not recognised.

Hmm, yeah, sounds like it might be missing logic, like you say.

> I don't have an easily accessible aarch64_be setup, so I'd appreciate it if
> someone could run it there; hopefully the new execute test can help uncover
> any lane index bugs.

Can't remember if the CI does aarch64_be or not.

> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> gcc/
>
>       * expmed.h (expand_rotate_as_vec_perm): Declare.
>       * expmed.cc (expand_rotate_as_vec_perm): Define.
>       * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>       Declare prototype.
>       * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>       * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
>       Call the above.
>
> gcc/testsuite/
>
>       * gcc.target/aarch64/vec-rot-exec.c: New test.
>       * gcc.target/aarch64/simd/pr117048_2.c: New test.
> ---
>  gcc/config/aarch64/aarch64-protos.h           |   1 +
>  gcc/config/aarch64/aarch64-simd.md            |   3 +
>  gcc/config/aarch64/aarch64.cc                 |  16 +++
>  gcc/expmed.cc                                 |  42 ++++++++
>  gcc/expmed.h                                  |   1 +
>  .../gcc.target/aarch64/simd/pr117048_2.c      |  66 ++++++++++++
>  .../gcc.target/aarch64/vec-rot-exec.c         | 101 ++++++++++++++++++
>  7 files changed, 230 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 75f30a52e61..fac7f9101a3 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -766,6 +766,7 @@ bool aarch64_rnd_imm_p (rtx);
>  bool aarch64_constant_address_p (rtx);
>  bool aarch64_emit_approx_div (rtx, rtx, rtx);
>  bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
> +bool aarch64_emit_opt_vec_rotate (rtx, rtx, rtx);
>  tree aarch64_vector_load_decl (tree);
>  rtx aarch64_gen_callee_cookie (aarch64_isa_mode, arm_pcs);
>  void aarch64_expand_call (rtx, rtx, rtx, bool);
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 6f69ccb6eba..b0aa6e0aded 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1313,6 +1313,9 @@
>           (match_dup 4))
>         (match_dup 3)))]
>    {
> +    if (aarch64_emit_opt_vec_rotate (operands[0], operands[1], operands[2]))
> +      DONE;
> +
>      operands[3] = reload_completed ? operands[0] : gen_reg_rtx (<MODE>mode);
>      rtx shft_amnt = unwrap_const_vec_duplicate (operands[2]);
>      int bitwidth = GET_MODE_UNIT_SIZE (<MODE>mode) * BITS_PER_UNIT;
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 7fbe3a7380c..5b64978daba 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16027,6 +16027,22 @@ aarch64_emit_approx_div (rtx quo, rtx num, rtx den)
>    return true;
>  }
>  
> +/* Emit an optimized sequence to perform a vector rotate
> +   of REG by the vector constant amount AMNT and place the result
> +   in DST.  Return true iff successful.  */
> +
> +bool
> +aarch64_emit_opt_vec_rotate (rtx dst, rtx reg, rtx amnt)
> +{
> +  machine_mode mode = GET_MODE (reg);
> +  /* Attempt to expand the rotate as a vector permute.
> +     For some rotate amounts they can be single instructions and
> +     even the general single-vector TBL permute has good throughput.  */
> +  if (expand_rotate_as_vec_perm (mode, dst, reg, amnt))
> +    return true;
> +  return false;
> +}
> +
>  /* Return the number of instructions that can be issued per cycle.  */
>  static int
>  aarch64_sched_issue_rate (void)
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> index 4498f9f38a8..1d7d6a7a914 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -6286,6 +6286,48 @@ emit_store_flag_force (rtx target, enum rtx_code code, rtx op0, rtx op1,
>    return target;
>  }
>  
> +/* Expand a vector (left) rotate of MODE of X by an immediate AMT as a vector
> +   permute operation.  Emit code to put the result in DST if successful and
> +   return it.  Otherwise return NULL.  This is intended to implement vector
> +   rotates by byte amounts using vector permutes when the target does not
> +   offer native vector rotate operations.  */
> +rtx
> +expand_rotate_as_vec_perm (machine_mode mode, rtx dst, rtx x, rtx amt)
> +{
> +  int rotamnt = INTVAL (unwrap_const_vec_duplicate (amt));

It's probably worth making this even more general by checking for
CONST_INT_P on the result of unwrap_const_vec_duplicate, to cope
gracefully with variable amounts.  (Or poly_int amounts :))
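
E.g. something like this at the top of the function (untested sketch,
and amt_unwrap is just a placeholder name):

  rtx amt_unwrap = unwrap_const_vec_duplicate (amt);
  /* Punt on variable (or poly_int) rotate amounts; the permute trick
     below only handles constant whole-byte rotates.  */
  if (!CONST_INT_P (amt_unwrap))
    return NULL_RTX;
  int rotamnt = INTVAL (amt_unwrap);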

> +  if (rotamnt % BITS_PER_UNIT != 0)
> +    return NULL_RTX;
> +  machine_mode qimode;
> +  if (!qimode_for_vec_perm (mode).exists (&qimode))
> +    return NULL_RTX;
> +
> +  vec_perm_builder builder;
> +  unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode));

simpler as GET_MODE_UNIT_SIZE

> +  unsigned total_units;
> +  /* TODO: Handle VLA vector rotates?  */
> +  if (!GET_MODE_SIZE (mode).is_constant (&total_units))
> +    return NULL_RTX;

Yeah.  I think we can do that by changing:

> +  builder.new_vector (total_units, 1, total_units);

to:

  builder.new_vector (total_units, nunits, 3);

unconditionally and making the outer loop below iterate exactly
three times (i.e. to nunits * 3).  It's ok if we generate more
indices than needed.
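
Untested, but I think the result would look something like this
(with rot_to_perm as discussed further down):

  /* Encode the selector as nunits interleaved patterns of 3 elements
     each; expand_vec_perm_const can extrapolate the rest, including
     for VLA modes, so total_units can stay as a poly_uint64.  */
  poly_uint64 total_units = GET_MODE_SIZE (mode);
  builder.new_vector (total_units, nunits, 3);
  for (unsigned j = 0; j < 3 * nunits; j += nunits)
    for (unsigned i = 0; i < nunits; i++)
      builder.quick_push ((rot_to_perm + i) % nunits + j);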

> +  int rot_to_perm = nunits - rotamnt / BITS_PER_UNIT;
> +  for (unsigned j = 0; j < total_units; j += nunits)
> +    for (unsigned i = 0; i < nunits; i++)
> +      {
> +     unsigned idx = (rot_to_perm + i) % nunits + j;
> +     if (BYTES_BIG_ENDIAN)
> +       idx = total_units - idx - 1;

I think the endian adjustment should be local to the inner loop/vector
element only.  Since this would mean undoing the "nunits - " adjustment
above, how about something like:

  unsigned rot_bytes = rotamnt / BITS_PER_UNIT;
  unsigned rot_to_perm = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes;
  ...
      builder.quick_push ((rot_to_perm + i) % nunits + j);

or whatever variation you prefer.
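
To sanity-check the little-endian side with a concrete case (my own
working, so worth double-checking): for a V4SI rotate left by 8 we get
rot_bytes == 1, nunits == 4 and rot_to_perm == 3, so the byte selector is:

  { 3, 0, 1, 2,  7, 4, 5, 6,  11, 8, 9, 10,  15, 12, 13, 14 }

i.e. byte 3 of each element becomes the new low byte, which matches
(x << 8) | (x >> 24) in each lane.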

Hope I've got that right...

OK with those changes for the target-independent parts (and the
target-specific parts LGTM too FWIW).

Thanks,
Richard

> +     builder.quick_push (idx);
> +      }
> +  rtx perm_src = lowpart_subreg (qimode, x, mode);
> +  rtx perm_dst = lowpart_subreg (qimode, dst, mode);
> +  rtx res
> +    = expand_vec_perm_const (qimode, perm_src, perm_src, builder,
> +                          qimode, perm_dst);
> +  if (!res)
> +    return NULL_RTX;
> +  emit_move_insn (dst, lowpart_subreg (mode, res, qimode));
> +  return dst;
> +}
> +
>  /* Helper function for canonicalize_cmp_for_target.  Swap between inclusive
>     and exclusive ranges in order to create an equivalent comparison.  See
>     canonicalize_cmp_for_target for the possible cases.  */
> diff --git a/gcc/expmed.h b/gcc/expmed.h
> index 0a19176b77a..2414c3cb27d 100644
> --- a/gcc/expmed.h
> +++ b/gcc/expmed.h
> @@ -726,5 +726,6 @@ extern rtx expand_mult_highpart_adjust (scalar_int_mode, rtx, rtx, rtx,
>                                       rtx, int);
>  extern rtx expmed_mult_highpart_optab (scalar_int_mode, rtx, rtx, rtx,
>                                      int, int);
> +extern rtx expand_rotate_as_vec_perm (machine_mode, rtx, rtx, rtx);
>  
>  #endif  // EXPMED_H
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c b/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
> new file mode 100644
> index 00000000000..7baf3581870
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
> @@ -0,0 +1,66 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mlittle-endian" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +typedef char __attribute__ ((vector_size (16))) v16qi;
> +typedef unsigned short __attribute__ ((vector_size (16))) v8hi;
> +typedef unsigned int __attribute__ ((vector_size (16))) v4si;
> +typedef unsigned long long __attribute__ ((vector_size (16))) v2di;
> +typedef unsigned short __attribute__ ((vector_size (8))) v4hi;
> +typedef unsigned int __attribute__ ((vector_size (8))) v2si;
> +
> +/*
> +** G1:
> +**   rev64   v0\.4s, v0\.4s
> +**   ret 
> +*/
> +v2di
> +G1 (v2di r)
> +{
> +  return (r >> 32) | (r << 32);
> +}
> +
> +/*
> +** G2:
> +**   rev32   v0\.8h, v0\.8h
> +**   ret 
> +*/
> +v4si
> +G2 (v4si r)
> +{
> +  return (r >> 16) | (r << 16);
> +}
> +
> +/*
> +** G3:
> +**   rev16   v0\.16b, v0\.16b
> +**   ret 
> +*/
> +v8hi
> +G3 (v8hi r)
> +{
> +  return (r >> 8) | (r << 8);
> +}
> +
> +/*
> +** G4:
> +**   rev32   v0\.4h, v0\.4h
> +**   ret 
> +*/
> +v2si
> +G4 (v2si r)
> +{
> +  return (r >> 16) | (r << 16);
> +}
> +
> +/*
> +** G5:
> +**   rev16   v0\.8b, v0\.8b
> +**   ret 
> +*/
> +v4hi
> +G5 (v4hi r)
> +{
> +  return (r >> 8) | (r << 8);
> +}
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c b/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
> new file mode 100644
> index 00000000000..130a9b1aa64
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
> @@ -0,0 +1,101 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2" } */
> +
> +typedef char __attribute__ ((vector_size (16))) v16qi;
> +typedef unsigned short __attribute__ ((vector_size (16))) v8hi;
> +typedef unsigned int __attribute__ ((vector_size (16))) v4si;
> +typedef unsigned long long __attribute__ ((vector_size (16))) v2di;
> +typedef char __attribute__ ((vector_size (8))) v8qi;
> +typedef unsigned short __attribute__ ((vector_size (8))) v4hi;
> +typedef unsigned int __attribute__ ((vector_size (8))) v2si;
> +#define VEC_ELTS(X) (sizeof (X) / (sizeof (X[0])))
> +
> +static const char __attribute__ ((aligned (16))) *str = "abcdefghijklmnopqrstuvwxyz";
> +
> +unsigned long long
> +__attribute__((noipa,noinline)) 
> +rot_64_one (unsigned long long x, unsigned amt)
> +{
> +  return (x << amt) | (x >> (64 - amt));
> +}
> +unsigned int
> +__attribute__((noipa,noinline)) 
> +rot_32_one (unsigned int x, unsigned amt)
> +{
> +  return (x << amt) | (x >> (32 - amt));
> +}
> +
> +unsigned short
> +__attribute__((noipa,noinline)) 
> +rot_16_one (unsigned short x, unsigned short amt)
> +{
> +  return (x << amt) | (x >> (16 - amt));
> +}
> +
> +
> +#define ROTFUNC(M,S,A)                                       \
> +M                                                    \
> +__attribute__((noipa,noinline))                      \
> +rot_##M##_##S##_##A (M x)                            \
> +{                                                    \
> +  return (x << A) | (x >> (S - A));                  \
> +}                                                    \
> +                                                     \
> +void                                                 \
> +test_rot_##M##_##S##_##A (void)                              \
> +{                                                    \
> +  M vec = *(M *)str;                                 \
> +  M res = rot_##M##_##S##_##A (vec);                 \
> +  for (__SIZE_TYPE__ i = 0; i < VEC_ELTS (vec); i++) \
> +    if (res[i] != rot_##S##_one (vec[i], A))         \
> +      __builtin_abort ();                            \
> +}
> +
> +ROTFUNC (v2di, 64, 56)
> +ROTFUNC (v2di, 64, 48)
> +ROTFUNC (v2di, 64, 40)
> +ROTFUNC (v2di, 64, 32)
> +ROTFUNC (v2di, 64, 24)
> +ROTFUNC (v2di, 64, 16)
> +ROTFUNC (v2di, 64, 8)
> +
> +ROTFUNC (v4si, 32, 24)
> +ROTFUNC (v4si, 32, 16)
> +ROTFUNC (v4si, 32, 8)
> +
> +ROTFUNC (v8hi, 16, 8)
> +
> +ROTFUNC (v2si, 32, 24)
> +ROTFUNC (v2si, 32, 16)
> +ROTFUNC (v2si, 32, 8)
> +
> +ROTFUNC (v4hi, 16, 8)
> +
> +#define CALL_TEST(M,S,A) test_rot_##M##_##S##_##A ()
> +
> +int
> +main (void)
> +{
> +  CALL_TEST (v2di, 64, 56);
> +  CALL_TEST (v2di, 64, 48);
> +  CALL_TEST (v2di, 64, 40);
> +  CALL_TEST (v2di, 64, 32);
> +  CALL_TEST (v2di, 64, 24);
> +  CALL_TEST (v2di, 64, 16);
> +  CALL_TEST (v2di, 64, 8);
> +
> +  CALL_TEST (v4si, 32, 24);
> +  CALL_TEST (v4si, 32, 16);
> +  CALL_TEST (v4si, 32, 8);
> +
> +  CALL_TEST (v8hi, 16, 8);
> +
> +  CALL_TEST (v2si, 32, 24);
> +  CALL_TEST (v2si, 32, 16);
> +  CALL_TEST (v2si, 32, 8);
> +
> +  CALL_TEST (v4hi, 16, 8);
> +
> +  return 0;
> +}
> +
