On 10/24/24 12:24 AM, Kyrylo Tkachov wrote:


On 24 Oct 2024, at 07:36, Jeff Law <jeffreya...@gmail.com> wrote:



On 10/22/24 2:26 PM, Kyrylo Tkachov wrote:
Hi all,
With recent patch to improve detection of vector rotates at RTL level
combine now tries matching a V8HImode rotate by 8 in the example in the
testcase.  We can teach AArch64 to emit a REV16 instruction for such a rotate
but really this operation corresponds to the RTL code BSWAP, for which we
already have the right patterns.  BSWAP is arguably a simpler representation
than ROTATE here because it has only one operand, so let's teach simplify-rtx
to generate it.
With this patch the testcase now generates the simplest form:
.L2:
         ldr     q31, [x1, x0]
         rev16   v31.16b, v31.16b
         str     q31, [x0, x2]
         add     x0, x0, 16
         cmp     x0, 2048
         bne     .L2
instead of the previous:
.L2:
         ldr     q31, [x1, x0]
         shl     v30.8h, v31.8h, 8
         usra    v30.8h, v31.8h, 8
         str     q30, [x0, x2]
         add     x0, x0, 16
         cmp     x0, 2048
         bne     .L2
IMO ideally the bswap detection would have been done during vectorisation
time and used the expanders for that, but teaching simplify-rtx to do this
transformation is fairly straightforward and, unlike at tree level, we have
the native RTL BSWAP code.  This change is not enough to generate the
equivalent sequence in SVE, but that is something that should be tackled
separately.
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov<ktkac...@nvidia.com>
gcc/
* simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
Simplify (rotate:HI x:HI, 8) -> (bswap:HI x:HI).
gcc/testsuite/
* gcc.target/aarch64/rot_to_bswap.c: New test.
v2-0006-simplify-rtx-Simplify-ROTATE-HI-X-HI-8-into-BSWAP-HI.patch
 From 79e6dcf698361eae46d0e99f851077199a8ce43a Mon Sep 17 00:00:00 2001
From: Kyrylo Tkachov<ktkac...@nvidia.com>
Date: Thu, 17 Oct 2024 06:39:57 -0700
Subject: [PATCH 6/6] simplify-rtx: Simplify ROTATE:HI (X:HI, 8) into BSWAP:HI
  (X)
With recent patch to improve detection of vector rotates at RTL level
combine now tries matching a V8HImode rotate by 8 in the example in the
testcase.  We can teach AArch64 to emit a REV16 instruction for such a rotate
but really this operation corresponds to the RTL code BSWAP, for which we
already have the right patterns.  BSWAP is arguably a simpler representation
than ROTATE here because it has only one operand, so let's teach simplify-rtx
to generate it.
With this patch the testcase now generates the simplest form:
.L2:
         ldr     q31, [x1, x0]
         rev16   v31.16b, v31.16b
         str     q31, [x0, x2]
         add     x0, x0, 16
         cmp     x0, 2048
         bne     .L2
instead of the previous:
.L2:
         ldr     q31, [x1, x0]
         shl     v30.8h, v31.8h, 8
         usra    v30.8h, v31.8h, 8
         str     q30, [x0, x2]
         add     x0, x0, 16
         cmp     x0, 2048
         bne     .L2
IMO ideally the bswap detection would have been done during vectorisation
time and used the expanders for that, but teaching simplify-rtx to do this
transformation is fairly straightforward and, unlike at tree level, we have
the native RTL BSWAP code.  This change is not enough to generate the
equivalent sequence in SVE, but that is something that should be tackled
separately.
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov<ktkac...@nvidia.com>
gcc/
* simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
Simplify (rotate:HI x:HI, 8) -> (bswap:HI x:HI).
gcc/testsuite/
* gcc.target/aarch64/rot_to_bswap.c: New test.
---
  gcc/simplify-rtx.cc                           |  6 +++++
  .../gcc.target/aarch64/rot_to_bswap.c         | 23 +++++++++++++++++++
  2 files changed, 29 insertions(+)
  create mode 100644 gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 089e03c2a7a..205a251f005 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -4328,6 +4328,12 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
        mode, op0, new_amount_rtx);
   }
  #endif
+      /* ROTATE/ROTATERT:HI (X:HI, 8) is BSWAP:HI (X).  */
+      tem = unwrap_const_vec_duplicate (trueop1);
+      if (GET_MODE_UNIT_BITSIZE (mode) == (2 * BITS_PER_UNIT)
+  && CONST_INT_P (tem) && INTVAL (tem) == BITS_PER_UNIT)
+ return simplify_gen_unary (BSWAP, mode, op0, mode);
So what about other modes?  I haven't really pondered this, but isn't there 
something similar for ROTATE:SI (X:SI, 16)?  I guess the basic question is 
whether or not this really need to be limited to HImode.


A (ROTATE:SI (X:SI, 16)) would represent a half-word swap, rather than a 
byte-swap. For example, 0x12345678 rotated by 16 gives 0x56781234 whereas a 
bswap would give 0x78563412.
AArch64 does have native operations that perform these half-word (and word) 
swaps, but they are not RTL BSWAP operations unfortunately.
So this pattern effectively only works for HI and vector HI modes.
Thanks for clarifying. This probably deserves expanding the comment a bit. Perhaps something along these lines:

/* ROTATE/ROTATERT:HI (X:HI, 8) is an RTL BSWAP:HI (X).  Other
   combinations such as SImode with a count of 16 do not
   correspond to RTL BSWAP semantics.  */

I was also slightly worried about targets were BITS_PER_UNIT isn't 8; I would have swore we had one, but I don't see one.


OK with the comment addition.

jeff

Reply via email to