Kyrylo Tkachov <ktkac...@nvidia.com> writes: > Hi Richard, > >> On 23 Oct 2024, at 11:30, Richard Sandiford <richard.sandif...@arm.com> >> wrote: >> >> Kyrylo Tkachov <ktkac...@nvidia.com> writes: >>> Hi all, >>> >>> Some vector rotate operations can be implemented in a single instruction >>> rather than using the fallback SHL+USRA sequence. >>> In particular, when the rotate amount is half the bitwidth of the element >>> we can use a REV64, REV32 or REV16 instruction. >>> This patch adds this transformation in the recently added splitter for >>> vector >>> rotates. >>> Bootstrapped and tested on aarch64-none-linux-gnu. >>> >>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> >>> >>> gcc/ >>> >>> * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate): >>> Declare prototype. >>> * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement. >>> * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>): >>> Call the above. >>> >>> gcc/testsuite/ >>> >>> * gcc.target/aarch64/simd/pr117048_2.c: New test. >> >> Sorry to be awkward, but I still think at least part of this should be >> target-independent. Any rotate by a byte amount can be expressed as a >> vector permutation in a target-independent way. Target-independent code >> can then use the usual optab routines to query whether the permutation >> is possible and/or try to generate it. > > Thank you for elaborating. I had already prototyped the permute > index-computing code in my tree > but was reluctant to use it during expand as I wanted the rotate RTX to be > available for combining > into XAR, so I felt a bit stuck. Having the code in a generic place but called > from the backend at a > time of its choosing makes sense to me. > >> >> I can see that it probably makes sense to leave target code to make >> the decision about when to use the permutation strategy vs. other >> approaches. But the code to implement that strategy shouldn't need >> to be target-specific. >> >> E.g. we could have a routine: >> >> expand_rotate_as_vec_perm >> >> which checks whether the rotation amount is suitable and tries to >> generate the permutation if so. > > I’ve implemented something like that in the attached patch. > It seems to work on AArch64 but as mentioned in the commit message, I’d like a > check on > the big-endian logic, and perhaps some pointers on how/whether it should be > extended to > VLA vectors.
Great! Thanks for doing this. Some comments on those aspects below, but otherwise it LGTM. > > I’m updating the other patches in the series according to your feedback so > I’ll repost them once I’m done, > just wanted to get this out for further iteration in the meantime. > Thanks, > Kyrill > > > > > > > From 6c1794a574b5525b3b495ed505621a8af029e825 Mon Sep 17 00:00:00 2001 > From: Kyrylo Tkachov <ktkac...@nvidia.com> > Date: Wed, 16 Oct 2024 04:10:08 -0700 > Subject: [PATCH] aarch64: Optimize vector rotates as vector permutes where > possible > > Some vector rotate operations can be implemented in a single instruction > rather than using the fallback SHL+USRA sequence. > In particular, when the rotate amount is half the bitwidth of the element > we can use a REV64, REV32 or REV16 instruction. > More generally, rotates by a byte amount can be implemented using vector > permutes. > This patch adds such a generic routine in expmed.cc called > expand_rotate_as_vec_perm that calculates the required permute indices > and uses the expand_vec_perm_const interface. > > On aarch64 this ends up generating the single-instruction sequences above > where possible and can use LDR+TBL sequences too, which are a good choice. > > For now, I have restricted expand_rotate_as_vec_perm to fixed-width modes > as I don't have much experience with using it for VLA modes, but I imagine > it's extendable there. In any case, the only use of expand_rotate_as_vec_perm > is in aarch64-specific code that for now only handles fixed-width modes. > > A runtime aarch64 test is added to ensure the permute indices are not messed > up. > I'd appreciate a review of the BYTES_BIG_ENDIAN logic. I've adjusted the > permute vector indices in RTL for it and in the final AArch64 assembly the > final vector loaded from .LC<N> is identical for little and big-endian, which > I *think* is the correct behaviour. For rotates by the half-width that should > generate single REV64 or REV32 instructions, aarch64 does not seem to recognise > them and falls back to an LDR+TBL for big-endian. I'm not sure if that's > simply missing logic in the vector permute expansion in aarch64 or if I should > be passing a different permute vector. If I delete the BYTES_BIG_ENDIAN hunk > in expand_rotate_as_vec_perm the permute indices are reversed, but the REV* > instructions are still not recognised. Hmm, yeah, sounds like it might be missing logic, like you say. > I don't have an easily accessible aarch64_be setup so I'd appreciate it if > someone > could run it there; hopefully the new execute test can help uncover any lane > index > bugs. Can't remember if the CI does aarch64_be or not. > Bootstrapped and tested on aarch64-none-linux-gnu. > > Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> > > gcc/ > > * expmed.h (expand_rotate_as_vec_perm): Declare. > * expmed.cc (expand_rotate_as_vec_perm): Define. > * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate): > Declare prototype. > * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement. > * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>): > Call the above. > > gcc/testsuite/ > > * gcc.target/aarch64/vec-rot-exec.c: New test. > * gcc.target/aarch64/simd/pr117048_2.c: New test.
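To make the index calculation described above concrete: on little-endian, rotating each ESIZE-byte element left by R bytes is the byte permute that puts byte (ESIZE - R + i) % ESIZE of the element into result byte i. Below is a minimal standalone model of that calculation (plain C, purely illustrative -- the names and structure are mine, not taken from the patch) together with a check against a scalar rotate:

#include <stdint.h>
#include <stdio.h>

/* Little-endian byte selector for a left rotate by ROT_BYTES on ESIZE-byte
   elements of a TOTAL-byte vector: result byte I of each element comes from
   byte (ESIZE - ROT_BYTES + I) % ESIZE of the same element.  */
static void
build_selector (unsigned esize, unsigned total, unsigned rot_bytes,
		unsigned char *sel)
{
  unsigned rot_to_perm = esize - rot_bytes;
  for (unsigned j = 0; j < total; j += esize)
    for (unsigned i = 0; i < esize; i++)
      sel[j + i] = (rot_to_perm + i) % esize + j;
}

int
main (void)
{
  static const unsigned esizes[] = { 2, 4, 8 };
  unsigned char src[16], dst[16], sel[16];

  for (unsigned i = 0; i < 16; i++)
    src[i] = i + 1;

  for (unsigned e = 0; e < 3; e++)
    for (unsigned rot_bytes = 1; rot_bytes < esizes[e]; rot_bytes++)
      {
	unsigned esize = esizes[e];
	build_selector (esize, 16, rot_bytes, sel);
	for (unsigned i = 0; i < 16; i++)
	  dst[i] = src[sel[i]];		/* The byte permute.  */
	/* Compare each element against a scalar rotate.  */
	for (unsigned j = 0; j < 16; j += esize)
	  {
	    uint64_t v = 0, r = 0;
	    for (unsigned i = 0; i < esize; i++)
	      {
		v |= (uint64_t) src[j + i] << (8 * i);
		r |= (uint64_t) dst[j + i] << (8 * i);
	      }
	    unsigned bits = 8 * esize;
	    uint64_t expect = (v << (8 * rot_bytes)) | (v >> (bits - 8 * rot_bytes));
	    if (bits < 64)
	      expect &= ((uint64_t) 1 << bits) - 1;
	    if (r != expect)
	      {
		printf ("mismatch: esize %u, rotate %u bytes\n", esize, rot_bytes);
		return 1;
	      }
	  }
      }
  printf ("selectors match scalar rotates\n");
  return 0;
}

For 4-byte elements and a 16-bit rotate this yields the per-element selector { 2, 3, 0, 1 }, i.e. exactly the REV32 pattern that the pr117048_2.c test below expects.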
> --- > gcc/config/aarch64/aarch64-protos.h | 1 + > gcc/config/aarch64/aarch64-simd.md | 3 + > gcc/config/aarch64/aarch64.cc | 16 +++ > gcc/expmed.cc | 42 ++++++++ > gcc/expmed.h | 1 + > .../gcc.target/aarch64/simd/pr117048_2.c | 66 ++++++++++++ > .../gcc.target/aarch64/vec-rot-exec.c | 101 ++++++++++++++++++ > 7 files changed, 230 insertions(+) > create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c > > diff --git a/gcc/config/aarch64/aarch64-protos.h > b/gcc/config/aarch64/aarch64-protos.h > index 75f30a52e61..fac7f9101a3 100644 > --- a/gcc/config/aarch64/aarch64-protos.h > +++ b/gcc/config/aarch64/aarch64-protos.h > @@ -766,6 +766,7 @@ bool aarch64_rnd_imm_p (rtx); > bool aarch64_constant_address_p (rtx); > bool aarch64_emit_approx_div (rtx, rtx, rtx); > bool aarch64_emit_approx_sqrt (rtx, rtx, bool); > +bool aarch64_emit_opt_vec_rotate (rtx, rtx, rtx); > tree aarch64_vector_load_decl (tree); > rtx aarch64_gen_callee_cookie (aarch64_isa_mode, arm_pcs); > void aarch64_expand_call (rtx, rtx, rtx, bool); > diff --git a/gcc/config/aarch64/aarch64-simd.md > b/gcc/config/aarch64/aarch64-simd.md > index 6f69ccb6eba..b0aa6e0aded 100644 > --- a/gcc/config/aarch64/aarch64-simd.md > +++ b/gcc/config/aarch64/aarch64-simd.md > @@ -1313,6 +1313,9 @@ > (match_dup 4)) > (match_dup 3)))] > { > + if (aarch64_emit_opt_vec_rotate (operands[0], operands[1], operands[2])) > + DONE; > + > operands[3] = reload_completed ? operands[0] : gen_reg_rtx (<MODE>mode); > rtx shft_amnt = unwrap_const_vec_duplicate (operands[2]); > int bitwidth = GET_MODE_UNIT_SIZE (<MODE>mode) * BITS_PER_UNIT; > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index 7fbe3a7380c..5b64978daba 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -16027,6 +16027,22 @@ aarch64_emit_approx_div (rtx quo, rtx num, rtx den) > return true; > } > > +/* Emit an optimized sequence to perform a vector rotate > + of REG by the vector constant amount AMNT and place the result > + in DST. Return true iff successful. */ > + > +bool > +aarch64_emit_opt_vec_rotate (rtx dst, rtx reg, rtx amnt) > +{ > + machine_mode mode = GET_MODE (reg); > + /* Attempt to expand the rotate as a vector permute. > + For some rotate amounts they can be single instructions and > + even the general single-vector TBL permute has good throughput. */ > + if (expand_rotate_as_vec_perm (mode, dst, reg, amnt)) > + return true; > + return false; > +} > + > /* Return the number of instructions that can be issued per cycle. */ > static int > aarch64_sched_issue_rate (void) > diff --git a/gcc/expmed.cc b/gcc/expmed.cc > index 4498f9f38a8..1d7d6a7a914 100644 > --- a/gcc/expmed.cc > +++ b/gcc/expmed.cc > @@ -6286,6 +6286,48 @@ emit_store_flag_force (rtx target, enum rtx_code code, > rtx op0, rtx op1, > return target; > } > > +/* Expand a vector (left) rotate of MODE of X by an immediate AMT as a vector > + permute operation. Emit code to put the result in DST if successfull and > + return it. Otherwise return NULL. This is intended to implement vector > + rotates by byte amounts using vector permutes when the target does not > offer > + native vector rotate operations. 
*/ > +rtx > +expand_rotate_as_vec_perm (machine_mode mode, rtx dst, rtx x, rtx amt) > +{ > + int rotamnt = INTVAL (unwrap_const_vec_duplicate (amt)); It's probably worth making this even more general by checking for CONST_INT_P on the result of unwrap_const_vec_duplicate, to cope gracefully with variable amounts. (Or poly_int amounts :)) > + if (rotamnt % BITS_PER_UNIT != 0) > + return NULL_RTX; > + machine_mode qimode; > + if (!qimode_for_vec_perm (mode).exists (&qimode)) > + return NULL_RTX; > + > + vec_perm_builder builder; > + unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode)); simpler as GET_MODE_UNIT_SIZE > + unsigned total_units; > + /* TODO: Handle VLA vector rotates? */ > + if (!GET_MODE_SIZE (mode).is_constant (&total_units)) > + return NULL_RTX; Yeah. I think we can do that by changing: > + builder.new_vector (total_units, 1, total_units); to: builder.new_vector (total_units, nunits, 3); unconditionally and making the outer loop below iterate exactly three times (i.e. to nunits * 3). It's ok if we generate more indices than needed. > + int rot_to_perm = nunits - rotamnt / BITS_PER_UNIT; > + for (unsigned j = 0; j < total_units; j += nunits) > + for (unsigned i = 0; i < nunits; i++) > + { > + unsigned idx = (rot_to_perm + i) % nunits + j; > + if (BYTES_BIG_ENDIAN) > + idx = total_units - idx - 1; I think the endian adjustment should be local to the inner loop/vector element only. Since this would mean undoing the "nunits - " adjustment above, how about something like: unsigned rot_bytes = rotamnt / BITS_PER_UNIT; unsigned rot_to_perm = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes; ... builder.quick_push ((rot_to_perm + i) % nunits + j); or whatever variation you prefer. Hope I've got that right... OK with those changes for the target-independent parts (and the target-specific parts LGTM too FWIW). Thanks, Richard > + builder.quick_push (idx); > + } > + rtx perm_src = lowpart_subreg (qimode, x, mode); > + rtx perm_dst = lowpart_subreg (qimode, dst, mode); > + rtx res > + = expand_vec_perm_const (qimode, perm_src, perm_src, builder, > + qimode, perm_dst); > + if (!res) > + return NULL_RTX; > + emit_move_insn (dst, lowpart_subreg (mode, res, qimode)); > + return dst; > +} > + > /* Helper function for canonicalize_cmp_for_target. Swap between inclusive > and exclusive ranges in order to create an equivalent comparison. See > canonicalize_cmp_for_target for the possible cases.
*/ > diff --git a/gcc/expmed.h b/gcc/expmed.h > index 0a19176b77a..2414c3cb27d 100644 > --- a/gcc/expmed.h > +++ b/gcc/expmed.h > @@ -726,5 +726,6 @@ extern rtx expand_mult_highpart_adjust (scalar_int_mode, > rtx, rtx, rtx, > rtx, int); > extern rtx expmed_mult_highpart_optab (scalar_int_mode, rtx, rtx, rtx, > int, int); > +extern rtx expand_rotate_as_vec_perm (machine_mode, rtx, rtx, rtx); > > #endif // EXPMED_H > diff --git a/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c > b/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c > new file mode 100644 > index 00000000000..7baf3581870 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c > @@ -0,0 +1,66 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mlittle-endian" } */ > +/* { dg-final { check-function-bodies "**" "" "" } } */ > + > +typedef char __attribute__ ((vector_size (16))) v16qi; > +typedef unsigned short __attribute__ ((vector_size (16))) v8hi; > +typedef unsigned int __attribute__ ((vector_size (16))) v4si; > +typedef unsigned long long __attribute__ ((vector_size (16))) v2di; > +typedef unsigned short __attribute__ ((vector_size (8))) v4hi; > +typedef unsigned int __attribute__ ((vector_size (8))) v2si; > + > +/* > +** G1: > +** rev64 v0\.4s, v0\.4s > +** ret > +*/ > +v2di > +G1 (v2di r) > +{ > + return (r >> 32) | (r << 32); > +} > + > +/* > +** G2: > +** rev32 v0\.8h, v0\.8h > +** ret > +*/ > +v4si > +G2 (v4si r) > +{ > + return (r >> 16) | (r << 16); > +} > + > +/* > +** G3: > +** rev16 v0\.16b, v0\.16b > +** ret > +*/ > +v8hi > +G3 (v8hi r) > +{ > + return (r >> 8) | (r << 8); > +} > + > +/* > +** G4: > +** rev32 v0\.4h, v0\.4h > +** ret > +*/ > +v2si > +G4 (v2si r) > +{ > + return (r >> 16) | (r << 16); > +} > + > +/* > +** G5: > +** rev16 v0\.8b, v0\.8b > +** ret > +*/ > +v4hi > +G5 (v4hi r) > +{ > + return (r >> 8) | (r << 8); > +} > + > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c > b/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c > new file mode 100644 > index 00000000000..130a9b1aa64 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c > @@ -0,0 +1,101 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2" } */ > + > +typedef char __attribute__ ((vector_size (16))) v16qi; > +typedef unsigned short __attribute__ ((vector_size (16))) v8hi; > +typedef unsigned int __attribute__ ((vector_size (16))) v4si; > +typedef unsigned long long __attribute__ ((vector_size (16))) v2di; > +typedef char __attribute__ ((vector_size (8))) v8qi; > +typedef unsigned short __attribute__ ((vector_size (8))) v4hi; > +typedef unsigned int __attribute__ ((vector_size (8))) v2si; > +#define VEC_ELTS(X) (sizeof (X) / (sizeof (X[0]))) > + > +static const char __attribute__ ((aligned (16))) *str = > "abcdefghijklmnopqrstuvwxyz"; > + > +unsigned long long > +__attribute__((noipa,noinline)) > +rot_64_one (unsigned long long x, unsigned amt) > +{ > + return (x << amt) | (x >> (64 - amt)); > +} > +unsigned int > +__attribute__((noipa,noinline)) > +rot_32_one (unsigned int x, unsigned amt) > +{ > + return (x << amt) | (x >> (32 - amt)); > +} > + > +unsigned short > +__attribute__((noipa,noinline)) > +rot_16_one (unsigned short x, unsigned short amt) > +{ > + return (x << amt) | (x >> (16 - amt)); > +} > + > + > +#define ROTFUNC(M,S,A) \ > +M \ > +__attribute__((noipa,noinline)) \ > +rot_##M##_##S##_##A (M x) \ > +{ \ > + return (x << A) | (x >> (S - A)); \ > +} \ > + \ > +void \ > +test_rot_##M##_##S##_##A (void) \ > +{ \ > + M vec = *(M *)str; \ > + M res = rot_##M##_##S##_##A (vec); \ > + 
for (__SIZE_TYPE__ i = 0; i < VEC_ELTS (vec); i++) \ > + if (res[i] != rot_##S##_one (vec[i], A)) \ > + __builtin_abort (); \ > +} > + > +ROTFUNC (v2di, 64, 56) > +ROTFUNC (v2di, 64, 48) > +ROTFUNC (v2di, 64, 40) > +ROTFUNC (v2di, 64, 32) > +ROTFUNC (v2di, 64, 24) > +ROTFUNC (v2di, 64, 16) > +ROTFUNC (v2di, 64, 8) > + > +ROTFUNC (v4si, 32, 24) > +ROTFUNC (v4si, 32, 16) > +ROTFUNC (v4si, 32, 8) > + > +ROTFUNC (v8hi, 16, 8) > + > +ROTFUNC (v2si, 32, 24) > +ROTFUNC (v2si, 32, 16) > +ROTFUNC (v2si, 32, 8) > + > +ROTFUNC (v4hi, 16, 8) > + > +#define CALL_TEST(M,S,A) test_rot_##M##_##S##_##A () > + > +int > +main (void) > +{ > + CALL_TEST (v2di, 64, 56); > + CALL_TEST (v2di, 64, 48); > + CALL_TEST (v2di, 64, 40); > + CALL_TEST (v2di, 64, 32); > + CALL_TEST (v2di, 64, 24); > + CALL_TEST (v2di, 64, 16); > + CALL_TEST (v2di, 64, 8); > + > + CALL_TEST (v4si, 32, 24); > + CALL_TEST (v4si, 32, 16); > + CALL_TEST (v4si, 32, 8); > + > + CALL_TEST (v8hi, 16, 8); > + > + CALL_TEST (v2si, 32, 24); > + CALL_TEST (v2si, 32, 16); > + CALL_TEST (v2si, 32, 8); > + > + CALL_TEST (v4hi, 16, 8); > + > + return 0; > +} > +
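Coming back to the new_vector encoding suggested above for the VLA case (nunits patterns of three stepped elements each): here is a small standalone model, again purely illustrative rather than GCC code, of how the 3 * nunits pushed indices expand back to the full byte selector for any vector length, which is what should make the routine usable for VLA modes. The expansion rule in the comment follows the vector_builder encoding as I understand it (the first element of each pattern stands on its own, the rest form an arithmetic series whose step comes from the last two encoded elements):

#include <stdio.h>

#define ESIZE 4			/* Bytes per element, e.g. V4SI/VNx4SI.  */
#define ROT_TO_PERM 2		/* Little-endian, rotate left by 16 bits.  */

/* Directly computed selector element K, as in the patch's inner loop.  */
static unsigned
direct_index (unsigned k)
{
  unsigned i = k % ESIZE, j = k - i;
  return (ROT_TO_PERM + i) % ESIZE + j;
}

int
main (void)
{
  unsigned encoded[3 * ESIZE];

  /* The 3 * nunits indices the revised loop would quick_push.  */
  for (unsigned j = 0; j < 3 * ESIZE; j += ESIZE)
    for (unsigned i = 0; i < ESIZE; i++)
      encoded[j + i] = (ROT_TO_PERM + i) % ESIZE + j;

  /* Expand the encoding for increasingly long vectors: element K belongs
     to pattern K % ESIZE; from the second element of each pattern onwards
     the values form an arithmetic series.  */
  for (unsigned total = ESIZE; total <= 64; total += ESIZE)
    for (unsigned k = 0; k < total; k++)
      {
	unsigned i = k % ESIZE, t = k / ESIZE, value;
	if (t == 0)
	  value = encoded[i];
	else
	  value = encoded[ESIZE + i]
		  + (t - 1) * (encoded[2 * ESIZE + i] - encoded[ESIZE + i]);
	if (value != direct_index (k))
	  {
	    printf ("mismatch at length %u, element %u\n", total, k);
	    return 1;
	  }
      }
  printf ("encoding reproduces the full selector for all lengths tested\n");
  return 0;
}

Since the expansion matches the directly computed selector for every length here, generating "more indices than needed" in the fixed-width case does indeed look harmless.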