Thank you for the suggestions! I’m trying them out now.

> On 24 Oct 2024, at 21:11, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Hi Richard,
>> 
>>> On 23 Oct 2024, at 11:30, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>> 
>>> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>>> Hi all,
>>>> 
>>>> Some vector rotate operations can be implemented in a single instruction
>>>> rather than using the fallback SHL+USRA sequence.
>>>> In particular, when the rotate amount is half the bitwidth of the element
>>>> we can use a REV64, REV32 or REV16 instruction.
>>>> This patch adds this transformation in the recently added splitter for
>>>> vector rotates.
>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>> 
>>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>> 
>>>> gcc/
>>>> 
>>>> 	* config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>>>> 	Declare prototype.
>>>> 	* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>>>> 	* config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
>>>> 	Call the above.
>>>> 
>>>> gcc/testsuite/
>>>> 
>>>> 	* gcc.target/aarch64/simd/pr117048_2.c: New test.
>>> 
>>> Sorry to be awkward, but I still think at least part of this should be
>>> target-independent.  Any rotate by a byte amount can be expressed as a
>>> vector permutation in a target-independent way.  Target-independent code
>>> can then use the usual optab routines to query whether the permutation
>>> is possible and/or try to generate it.
>> 
>> Thank you for elaborating.  I had already prototyped the permute
>> index-computing code in my tree but was reluctant to use it during expand,
>> as I wanted the rotate RTX to be available for combining into XAR, so I
>> felt a bit stuck.  Having the code in a generic place but called from the
>> backend at a time of its choosing makes sense to me.
>> 
>>> I can see that it probably makes sense to leave target code to make
>>> the decision about when to use the permutation strategy vs. other
>>> approaches.  But the code to implement that strategy shouldn't need
>>> to be target-specific.
>>> 
>>> E.g. we could have a routine:
>>> 
>>>   expand_rotate_as_vec_perm
>>> 
>>> which checks whether the rotation amount is suitable and tries to
>>> generate the permutation if so.
>> 
>> I’ve implemented something like that in the attached patch.
>> It seems to work on AArch64 but, as mentioned in the commit message, I’d
>> like a check on the big-endian logic, and perhaps some pointers on
>> how/whether it should be extended to VLA vectors.
> 
> Great!  Thanks for doing this.  Some comments on those aspects below,
> but otherwise it LGTM.
> 
>> I’m updating the other patches in the series according to your feedback,
>> so I’ll repost them once I’m done; just wanted to get this out for further
>> iteration in the meantime.
>> Thanks,
>> Kyrill
>> 
>> 
>> From 6c1794a574b5525b3b495ed505621a8af029e825 Mon Sep 17 00:00:00 2001
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Date: Wed, 16 Oct 2024 04:10:08 -0700
>> Subject: [PATCH] aarch64: Optimize vector rotates as vector permutes where
>>  possible
>> 
>> Some vector rotate operations can be implemented in a single instruction
>> rather than using the fallback SHL+USRA sequence.
>> In particular, when the rotate amount is half the bitwidth of the element
>> we can use a REV64, REV32 or REV16 instruction.
>> More generally, rotates by a byte amount can be implemented using vector
>> permutes.
>> This patch adds such a generic routine in expmed.cc called
>> expand_rotate_as_vec_perm that calculates the required permute indices
>> and uses the expand_vec_perm_const interface.
>> 
>> On aarch64 this ends up generating the single-instruction sequences above
>> where possible and can use LDR+TBL sequences too, which are a good choice.
>> 
>> For now, I have restricted expand_rotate_as_vec_perm to fixed-width modes,
>> as I don't have much experience with using it for VLA modes, but I imagine
>> it's extendable there.  In any case, the only use of expand_rotate_as_vec_perm
>> is in aarch64-specific code that for now only handles fixed-width modes.
>> 
>> A runtime aarch64 test is added to ensure the permute indices are not
>> messed up.
>> I'd appreciate a review of the BYTES_BIG_ENDIAN logic.  I've adjusted the
>> permute vector indices in RTL for it, and in the final AArch64 assembly the
>> final vector loaded from .LC<N> is identical for little and big-endian, which
>> I *think* is the correct behaviour.  For rotates by the half-width that should
>> generate single REV64 or REV32 instructions, aarch64 does not seem to recognise
>> them and falls back to an LDR+TBL for big-endian.  I'm not sure if that's
>> simply missing logic in the vector permute expansion in aarch64 or if I should
>> be passing a different permute vector.  If I delete the BYTES_BIG_ENDIAN hunk
>> in expand_rotate_as_vec_perm the permute indices are reversed, but the REV*
>> instructions are still not recognised.
> 
> Hmm, yeah, sounds like it might be missing logic, like you say.
> 
>> I don't have an easily accessible aarch64_be setup, so I'd appreciate it if
>> someone could run it there; hopefully the new execute test can help uncover
>> any lane index bugs.
> 
> Can't remember if the CI does aarch64_be or not.
> 
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> 
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>> 
>> gcc/
>> 
>> 	* expmed.h (expand_rotate_as_vec_perm): Declare.
>> 	* expmed.cc (expand_rotate_as_vec_perm): Define.
>> 	* config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>> 	Declare prototype.
>> 	* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>> 	* config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
>> 	Call the above.
>> 
>> gcc/testsuite/
>> 
>> 	* gcc.target/aarch64/vec-rot-exec.c: New test.
>> 	* gcc.target/aarch64/simd/pr117048_2.c: New test.
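To make the mapping easier to review, here is a small standalone sketch of the
little-endian index calculation the new routine is meant to perform.  The names
are made up for illustration and this is not the code in the patch; the
big-endian adjustment, which is the part I'm unsure about, is deliberately
left out:

  /* Illustrative sketch only: compute the byte-permute selector that
     implements a left rotate by ROT_BITS of a vector with NELTS elements
     of ELT_BYTES bytes each, using little-endian byte numbering.  */
  void
  rotate_sel_sketch (unsigned elt_bytes, unsigned nelts, unsigned rot_bits,
		     unsigned char *sel)
  {
    unsigned rot_bytes = rot_bits / 8;
    /* Source byte within each element that ends up in byte 0 of the
       corresponding destination element for a left rotate.  */
    unsigned first = elt_bytes - rot_bytes;
    for (unsigned j = 0; j < nelts * elt_bytes; j += elt_bytes)
      for (unsigned i = 0; i < elt_bytes; i++)
	sel[j + i] = (first + i) % elt_bytes + j;
  }

For V2DImode rotated left by 32 this gives { 4,5,6,7,0,1,2,3, 12,13,14,15,8,9,10,11 },
i.e. the REV64 .4s byte pattern, and for V4SImode rotated by 16 it gives the
2,3,0,1, 6,7,4,5, a,b,8,9, e,f,c,d sequence discussed further down, i.e. REV32 .8h.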
>> ---
>>  gcc/config/aarch64/aarch64-protos.h           |   1 +
>>  gcc/config/aarch64/aarch64-simd.md            |   3 +
>>  gcc/config/aarch64/aarch64.cc                 |  16 +++
>>  gcc/expmed.cc                                 |  42 ++++++++
>>  gcc/expmed.h                                  |   1 +
>>  .../gcc.target/aarch64/simd/pr117048_2.c      |  66 ++++++++++++
>>  .../gcc.target/aarch64/vec-rot-exec.c         | 101 ++++++++++++++++++
>>  7 files changed, 230 insertions(+)
>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
>> 
>> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
>> index 75f30a52e61..fac7f9101a3 100644
>> --- a/gcc/config/aarch64/aarch64-protos.h
>> +++ b/gcc/config/aarch64/aarch64-protos.h
>> @@ -766,6 +766,7 @@ bool aarch64_rnd_imm_p (rtx);
>>  bool aarch64_constant_address_p (rtx);
>>  bool aarch64_emit_approx_div (rtx, rtx, rtx);
>>  bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
>> +bool aarch64_emit_opt_vec_rotate (rtx, rtx, rtx);
>>  tree aarch64_vector_load_decl (tree);
>>  rtx aarch64_gen_callee_cookie (aarch64_isa_mode, arm_pcs);
>>  void aarch64_expand_call (rtx, rtx, rtx, bool);
>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>> index 6f69ccb6eba..b0aa6e0aded 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -1313,6 +1313,9 @@
>>  	    (match_dup 4))
>>  	  (match_dup 3)))]
>>    {
>> +    if (aarch64_emit_opt_vec_rotate (operands[0], operands[1], operands[2]))
>> +      DONE;
>> +
>>      operands[3] = reload_completed ? operands[0] : gen_reg_rtx (<MODE>mode);
>>      rtx shft_amnt = unwrap_const_vec_duplicate (operands[2]);
>>      int bitwidth = GET_MODE_UNIT_SIZE (<MODE>mode) * BITS_PER_UNIT;
>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> index 7fbe3a7380c..5b64978daba 100644
>> --- a/gcc/config/aarch64/aarch64.cc
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -16027,6 +16027,22 @@ aarch64_emit_approx_div (rtx quo, rtx num, rtx den)
>>    return true;
>>  }
>>  
>> +/* Emit an optimized sequence to perform a vector rotate
>> +   of REG by the vector constant amount AMNT and place the result
>> +   in DST.  Return true iff successful.  */
>> +
>> +bool
>> +aarch64_emit_opt_vec_rotate (rtx dst, rtx reg, rtx amnt)
>> +{
>> +  machine_mode mode = GET_MODE (reg);
>> +  /* Attempt to expand the rotate as a vector permute.
>> +     For some rotate amounts they can be single instructions and
>> +     even the general single-vector TBL permute has good throughput.  */
>> +  if (expand_rotate_as_vec_perm (mode, dst, reg, amnt))
>> +    return true;
>> +  return false;
>> +}
>> +
>>  /* Return the number of instructions that can be issued per cycle.  */
>>  static int
>>  aarch64_sched_issue_rate (void)
>> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
>> index 4498f9f38a8..1d7d6a7a914 100644
>> --- a/gcc/expmed.cc
>> +++ b/gcc/expmed.cc
>> @@ -6286,6 +6286,48 @@ emit_store_flag_force (rtx target, enum rtx_code code, rtx op0, rtx op1,
>>    return target;
>>  }
>>  
>> +/* Expand a vector (left) rotate of MODE of X by an immediate AMT as a vector
>> +   permute operation.  Emit code to put the result in DST if successful and
>> +   return it.  Otherwise return NULL.  This is intended to implement vector
>> +   rotates by byte amounts using vector permutes when the target does not offer
>> +   native vector rotate operations.  */
>> +rtx
>> +expand_rotate_as_vec_perm (machine_mode mode, rtx dst, rtx x, rtx amt)
>> +{
>> +  int rotamnt = INTVAL (unwrap_const_vec_duplicate (amt));
> 
> It's probably worth making this even more general by checking for
> CONST_INT_P on the result of unwrap_const_vec_duplicate, to cope
> gracefully with variable amounts.  (Or poly_int amounts :))
> 
>> +  if (rotamnt % BITS_PER_UNIT != 0)
>> +    return NULL_RTX;
>> +  machine_mode qimode;
>> +  if (!qimode_for_vec_perm (mode).exists (&qimode))
>> +    return NULL_RTX;
>> +
>> +  vec_perm_builder builder;
>> +  unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode));
> 
> simpler as GET_MODE_UNIT_SIZE
> 
>> +  unsigned total_units;
>> +  /* TODO: Handle VLA vector rotates?  */
>> +  if (!GET_MODE_SIZE (mode).is_constant (&total_units))
>> +    return NULL_RTX;
> 
> Yeah.  I think we can do that by changing:
> 
>> +  builder.new_vector (total_units, 1, total_units);
> 
> to:
> 
>   builder.new_vector (total_units, 3, units);
I think units here is the size in units of the fixed-width component of the
mode?  So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VNx4HI?
What is the recommended API for getting that number out of the poly_uint64
mode size?  Is it just accessing coeffs[0]?

> 
> unconditionally and making the outer loop below iterate exactly
> three times (i.e. to nunits * 3).  It's ok if we generate more
> indices than needed.
> 
>> +  int rot_to_perm = nunits - rotamnt / BITS_PER_UNIT;
>> +  for (unsigned j = 0; j < total_units; j += nunits)
>> +    for (unsigned i = 0; i < nunits; i++)
>> +      {
>> +	unsigned idx = (rot_to_perm + i) % nunits + j;
>> +	if (BYTES_BIG_ENDIAN)
>> +	  idx = total_units - idx - 1;
> 
> I think the endian adjustment should be local to the inner loop/vector
> element only.  Since this would mean undoing the "nunits - " adjustment
> above, how about something like:
> 
>   unsigned rot_bytes = rotamnt / BITS_PER_UNIT;
>   unsigned rot_to_perm = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes;
>   ...
>   builder.quick_push ((rot_to_perm + i) % nunits + j);
> 
> or whatever variation you prefer.
> 
> Hope I've got that right...

Hmm, I’m getting some test failures and wrong indices when I try this.
I think I can work out the indices and the loops for them, but I’d like to
work through an example.  So say we are rotating a V4SImode vector by 16
(a REV32 instruction).  The indices pushed into the byte permute vector with
my original patch are:
[2,3,0,1, 6,7,4,5, a,b,8,9, e,f,c,d]
What sequence do we want to push for V4SImode now that we have 3 patterns in
vector_builder?  Is it repeating the above 3 times or is it interleaving each
SImode entry somehow?  (I've sketched my current reading of the encoding at
the very end of this mail.)
Thanks,
Kyrill

> 
> OK with those changes for the target-independent parts (and the
> target-specific parts LGTM too FWIW).
> 
> Thanks,
> Richard
> 
>> +	builder.quick_push (idx);
>> +      }
>> +  rtx perm_src = lowpart_subreg (qimode, x, mode);
>> +  rtx perm_dst = lowpart_subreg (qimode, dst, mode);
>> +  rtx res
>> +    = expand_vec_perm_const (qimode, perm_src, perm_src, builder,
>> +			     qimode, perm_dst);
>> +  if (!res)
>> +    return NULL_RTX;
>> +  emit_move_insn (dst, lowpart_subreg (mode, res, qimode));
>> +  return dst;
>> +}
>> +
>>  /* Helper function for canonicalize_cmp_for_target.  Swap between inclusive
>>     and exclusive ranges in order to create an equivalent comparison.  See
>>     canonicalize_cmp_for_target for the possible cases.  */
>> diff --git a/gcc/expmed.h b/gcc/expmed.h
>> index 0a19176b77a..2414c3cb27d 100644
>> --- a/gcc/expmed.h
>> +++ b/gcc/expmed.h
>> @@ -726,5 +726,6 @@ extern rtx expand_mult_highpart_adjust (scalar_int_mode, rtx, rtx, rtx,
>>  					rtx, int);
>>  extern rtx expmed_mult_highpart_optab (scalar_int_mode, rtx, rtx, rtx,
>>  				       int, int);
>> +extern rtx expand_rotate_as_vec_perm (machine_mode, rtx, rtx, rtx);
>>  
>>  #endif // EXPMED_H
>> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c b/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
>> new file mode 100644
>> index 00000000000..7baf3581870
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
>> @@ -0,0 +1,66 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -mlittle-endian" } */
>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>> +
>> +typedef char __attribute__ ((vector_size (16))) v16qi;
>> +typedef unsigned short __attribute__ ((vector_size (16))) v8hi;
>> +typedef unsigned int __attribute__ ((vector_size (16))) v4si;
>> +typedef unsigned long long __attribute__ ((vector_size (16))) v2di;
>> +typedef unsigned short __attribute__ ((vector_size (8))) v4hi;
>> +typedef unsigned int __attribute__ ((vector_size (8))) v2si;
>> +
>> +/*
>> +** G1:
>> +**	rev64	v0\.4s, v0\.4s
>> +**	ret
>> +*/
>> +v2di
>> +G1 (v2di r)
>> +{
>> +  return (r >> 32) | (r << 32);
>> +}
>> +
>> +/*
>> +** G2:
>> +**	rev32	v0\.8h, v0\.8h
>> +**	ret
>> +*/
>> +v4si
>> +G2 (v4si r)
>> +{
>> +  return (r >> 16) | (r << 16);
>> +}
>> +
>> +/*
>> +** G3:
>> +**	rev16	v0\.16b, v0\.16b
>> +**	ret
>> +*/
>> +v8hi
>> +G3 (v8hi r)
>> +{
>> +  return (r >> 8) | (r << 8);
>> +}
>> +
>> +/*
>> +** G4:
>> +**	rev32	v0\.4h, v0\.4h
>> +**	ret
>> +*/
>> +v2si
>> +G4 (v2si r)
>> +{
>> +  return (r >> 16) | (r << 16);
>> +}
>> +
>> +/*
>> +** G5:
>> +**	rev16	v0\.8b, v0\.8b
>> +**	ret
>> +*/
>> +v4hi
>> +G5 (v4hi r)
>> +{
>> +  return (r >> 8) | (r << 8);
>> +}
>> +
>> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c b/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
>> new file mode 100644
>> index 00000000000..130a9b1aa64
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
>> @@ -0,0 +1,101 @@
>> +/* { dg-do run } */
>> +/* { dg-options "-O2" } */
>> +
>> +typedef char __attribute__ ((vector_size (16))) v16qi;
>> +typedef unsigned short __attribute__ ((vector_size (16))) v8hi;
>> +typedef unsigned int __attribute__ ((vector_size (16))) v4si;
>> +typedef unsigned long long __attribute__ ((vector_size (16))) v2di;
>> +typedef char __attribute__ ((vector_size (8))) v8qi;
>> +typedef unsigned short __attribute__ ((vector_size (8))) v4hi;
>> +typedef unsigned int __attribute__ ((vector_size (8))) v2si;
>> +#define VEC_ELTS(X) (sizeof (X) / (sizeof (X[0])))
>> +
>> +static const char __attribute__ ((aligned (16))) *str = "abcdefghijklmnopqrstuvwxyz";
>> +
>> +unsigned long long
>> +__attribute__((noipa,noinline))
>> +rot_64_one (unsigned long long x, unsigned amt)
>> +{
>> +  return (x << amt) | (x >> (64 - amt));
>> +}
>> +unsigned int
>> +__attribute__((noipa,noinline))
>> +rot_32_one (unsigned int x, unsigned amt)
>> +{
>> +  return (x << amt) | (x >> (32 - amt));
>> +}
>> +
>> +unsigned short
>> +__attribute__((noipa,noinline))
>> +rot_16_one (unsigned short x, unsigned short amt)
>> +{
>> +  return (x << amt) | (x >> (16 - amt));
>> +}
>> +
>> +
>> +#define ROTFUNC(M,S,A)					\
>> +M							\
>> +__attribute__((noipa,noinline))				\
>> +rot_##M##_##S##_##A (M x)				\
>> +{							\
>> +  return (x << A) | (x >> (S - A));			\
>> +}							\
>> +							\
>> +void							\
>> +test_rot_##M##_##S##_##A (void)				\
>> +{							\
>> +  M vec = *(M *)str;					\
>> +  M res = rot_##M##_##S##_##A (vec);			\
>> +  for (__SIZE_TYPE__ i = 0; i < VEC_ELTS (vec); i++)	\
>> +    if (res[i] != rot_##S##_one (vec[i], A))		\
>> +      __builtin_abort ();				\
>> +}
>> +
>> +ROTFUNC (v2di, 64, 56)
>> +ROTFUNC (v2di, 64, 48)
>> +ROTFUNC (v2di, 64, 40)
>> +ROTFUNC (v2di, 64, 32)
>> +ROTFUNC (v2di, 64, 24)
>> +ROTFUNC (v2di, 64, 16)
>> +ROTFUNC (v2di, 64, 8)
>> +
>> +ROTFUNC (v4si, 32, 24)
>> +ROTFUNC (v4si, 32, 16)
>> +ROTFUNC (v4si, 32, 8)
>> +
>> +ROTFUNC (v8hi, 16, 8)
>> +
>> +ROTFUNC (v2si, 32, 24)
>> +ROTFUNC (v2si, 32, 16)
>> +ROTFUNC (v2si, 32, 8)
>> +
>> +ROTFUNC (v4hi, 16, 8)
>> +
>> +#define CALL_TEST(M,S,A) test_rot_##M##_##S##_##A ()
>> +
>> +int
>> +main (void)
>> +{
>> +  CALL_TEST (v2di, 64, 56);
>> +  CALL_TEST (v2di, 64, 48);
>> +  CALL_TEST (v2di, 64, 40);
>> +  CALL_TEST (v2di, 64, 32);
>> +  CALL_TEST (v2di, 64, 24);
>> +  CALL_TEST (v2di, 64, 16);
>> +  CALL_TEST (v2di, 64, 8);
>> +
>> +  CALL_TEST (v4si, 32, 24);
>> +  CALL_TEST (v4si, 32, 16);
>> +  CALL_TEST (v4si, 32, 8);
>> +
>> +  CALL_TEST (v8hi, 16, 8);
>> +
>> +  CALL_TEST (v2si, 32, 24);
>> +  CALL_TEST (v2si, 32, 16);
>> +  CALL_TEST (v2si, 32, 8);
>> +
>> +  CALL_TEST (v4hi, 16, 8);
>> +
>> +  return 0;
>> +}
>> +
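P.S. To try to make progress on the V4SImode-by-16 example above, here is how
I currently read the stepped encoding, written out so it is easier to point at
where I'm going wrong.  This sketch assumes npatterns is the element size in
bytes (4 here) with three elements per pattern, which may well not be what you
meant by (total_units, 3, units), and it is not the code in the patch:

  /* Hypothetical sketch for the V4SImode rotate by 16 only: one pattern per
     byte position within the element, three elements per pattern, and let
     the builder extrapolate the remaining indices.  */
  vec_perm_builder builder;
  unsigned nunits = 4;		/* GET_MODE_UNIT_SIZE (V4SImode).  */
  poly_uint64 total_units = 16;	/* GET_MODE_SIZE (V4SImode).  */
  builder.new_vector (total_units, nunits, 3);
  /* Push only the first three element-sized groups of the full selector,
     i.e. 2,3,0,1  6,7,4,5  a,b,8,9; the fourth group e,f,c,d would then be
     implied by the per-pattern step of 4.  */
  for (unsigned j = 0; j < 3 * nunits; j += nunits)
    for (unsigned i = 0; i < nunits; i++)
      builder.quick_push ((2 + i) % nunits + j);

If that reading is right, the answer to my question would be "push the first
three element-sized groups and nothing more" rather than interleaving, but
please correct me if I have the npatterns/nelts_per_pattern roles mixed up.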