> On 27 Oct 2024, at 20:42, Jeff Law <jeffreya...@gmail.com> wrote: > > > > On 10/24/24 12:24 AM, Kyrylo Tkachov wrote: >>> On 24 Oct 2024, at 07:36, Jeff Law <jeffreya...@gmail.com> wrote: >>> >>> >>> >>> On 10/22/24 2:26 PM, Kyrylo Tkachov wrote: >>>> Hi all, >>>> With recent patch to improve detection of vector rotates at RTL level >>>> combine now tries matching a V8HImode rotate by 8 in the example in the >>>> testcase. We can teach AArch64 to emit a REV16 instruction for such a >>>> rotate >>>> but really this operation corresponds to the RTL code BSWAP, for which we >>>> already have the right patterns. BSWAP is arguably a simpler >>>> representation >>>> than ROTATE here because it has only one operand, so let's teach >>>> simplify-rtx >>>> to generate it. >>>> With this patch the testcase now generates the simplest form: >>>> .L2: >>>> ldr q31, [x1, x0] >>>> rev16 v31.16b, v31.16b >>>> str q31, [x0, x2] >>>> add x0, x0, 16 >>>> cmp x0, 2048 >>>> bne .L2 >>>> instead of the previous: >>>> .L2: >>>> ldr q31, [x1, x0] >>>> shl v30.8h, v31.8h, 8 >>>> usra v30.8h, v31.8h, 8 >>>> str q30, [x0, x2] >>>> add x0, x0, 16 >>>> cmp x0, 2048 >>>> bne .L2 >>>> IMO ideally the bswap detection would have been done during vectorisation >>>> time and used the expanders for that, but teaching simplify-rtx to do this >>>> transformation is fairly straightforward and, unlike at tree level, we have >>>> the native RTL BSWAP code. This change is not enough to generate the >>>> equivalent sequence in SVE, but that is something that should be tackled >>>> separately. >>>> Bootstrapped and tested on aarch64-none-linux-gnu. >>>> Signed-off-by: Kyrylo Tkachov<ktkac...@nvidia.com> >>>> gcc/ >>>> * simplify-rtx.cc (simplify_context::simplify_binary_operation_1): >>>> Simplify (rotate:HI x:HI, 8) -> (bswap:HI x:HI). >>>> gcc/testsuite/ >>>> * gcc.target/aarch64/rot_to_bswap.c: New test. >>>> v2-0006-simplify-rtx-Simplify-ROTATE-HI-X-HI-8-into-BSWAP-HI.patch >>>> From 79e6dcf698361eae46d0e99f851077199a8ce43a Mon Sep 17 00:00:00 2001 >>>> From: Kyrylo Tkachov<ktkac...@nvidia.com> >>>> Date: Thu, 17 Oct 2024 06:39:57 -0700 >>>> Subject: [PATCH 6/6] simplify-rtx: Simplify ROTATE:HI (X:HI, 8) into >>>> BSWAP:HI >>>> (X) >>>> With recent patch to improve detection of vector rotates at RTL level >>>> combine now tries matching a V8HImode rotate by 8 in the example in the >>>> testcase. We can teach AArch64 to emit a REV16 instruction for such a >>>> rotate >>>> but really this operation corresponds to the RTL code BSWAP, for which we >>>> already have the right patterns. BSWAP is arguably a simpler >>>> representation >>>> than ROTATE here because it has only one operand, so let's teach >>>> simplify-rtx >>>> to generate it. >>>> With this patch the testcase now generates the simplest form: >>>> .L2: >>>> ldr q31, [x1, x0] >>>> rev16 v31.16b, v31.16b >>>> str q31, [x0, x2] >>>> add x0, x0, 16 >>>> cmp x0, 2048 >>>> bne .L2 >>>> instead of the previous: >>>> .L2: >>>> ldr q31, [x1, x0] >>>> shl v30.8h, v31.8h, 8 >>>> usra v30.8h, v31.8h, 8 >>>> str q30, [x0, x2] >>>> add x0, x0, 16 >>>> cmp x0, 2048 >>>> bne .L2 >>>> IMO ideally the bswap detection would have been done during vectorisation >>>> time and used the expanders for that, but teaching simplify-rtx to do this >>>> transformation is fairly straightforward and, unlike at tree level, we have >>>> the native RTL BSWAP code. This change is not enough to generate the >>>> equivalent sequence in SVE, but that is something that should be tackled >>>> separately. >>>> Bootstrapped and tested on aarch64-none-linux-gnu. >>>> Signed-off-by: Kyrylo Tkachov<ktkac...@nvidia.com> >>>> gcc/ >>>> * simplify-rtx.cc (simplify_context::simplify_binary_operation_1): >>>> Simplify (rotate:HI x:HI, 8) -> (bswap:HI x:HI). >>>> gcc/testsuite/ >>>> * gcc.target/aarch64/rot_to_bswap.c: New test. >>>> --- >>>> gcc/simplify-rtx.cc | 6 +++++ >>>> .../gcc.target/aarch64/rot_to_bswap.c | 23 +++++++++++++++++++ >>>> 2 files changed, 29 insertions(+) >>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c >>>> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc >>>> index 089e03c2a7a..205a251f005 100644 >>>> --- a/gcc/simplify-rtx.cc >>>> +++ b/gcc/simplify-rtx.cc >>>> @@ -4328,6 +4328,12 @@ simplify_context::simplify_binary_operation_1 >>>> (rtx_code code, >>>> mode, op0, new_amount_rtx); >>>> } >>>> #endif >>>> + /* ROTATE/ROTATERT:HI (X:HI, 8) is BSWAP:HI (X). */ >>>> + tem = unwrap_const_vec_duplicate (trueop1); >>>> + if (GET_MODE_UNIT_BITSIZE (mode) == (2 * BITS_PER_UNIT) >>>> + && CONST_INT_P (tem) && INTVAL (tem) == BITS_PER_UNIT) >>>> + return simplify_gen_unary (BSWAP, mode, op0, mode); >>> So what about other modes? I haven't really pondered this, but isn't there >>> something similar for ROTATE:SI (X:SI, 16)? I guess the basic question is >>> whether or not this really need to be limited to HImode. >>> >> A (ROTATE:SI (X:SI, 16)) would represent a half-word swap, rather than a >> byte-swap. For example, 0x12345678 rotated by 16 gives 0x56781234 whereas a >> bswap would give 0x78563412. >> AArch64 does have native operations that perform these half-word (and word) >> swaps, but they are not RTL BSWAP operations unfortunately. >> So this pattern effectively only works for HI and vector HI modes. > Thanks for clarifying. This probably deserves expanding the comment a bit. > Perhaps something along these lines: > > /* ROTATE/ROTATERT:HI (X:HI, 8) is an RTL BSWAP:HI (X). Other > combinations such as SImode with a count of 16 do not > correspond to RTL BSWAP semantics. */ > > I was also slightly worried about targets were BITS_PER_UNIT isn't 8; I would > have swore we had one, but I don't see one. > > > OK with the comment addition. >
Thanks, I’ll extend the comment when I commit the series. Would you be able to help with the review of the first one in the series by any chance? https://gcc.gnu.org/pipermail/gcc-patches/2024-October/666605.html Thanks again, Kyrill > jeff