Re: [PATCH 6/6] simplify-rtx: Simplify ROTATE:HI (X:HI, 8) into BSWAP:HI (X)

Kyrylo Tkachov Tue, 29 Oct 2024 03:15:57 -0700


> On 27 Oct 2024, at 20:42, Jeff Law <jeffreya...@gmail.com> wrote:
> 
> 
> 
> On 10/24/24 12:24 AM, Kyrylo Tkachov wrote:
>>> On 24 Oct 2024, at 07:36, Jeff Law <jeffreya...@gmail.com> wrote:
>>> 
>>> 
>>> 
>>> On 10/22/24 2:26 PM, Kyrylo Tkachov wrote:
>>>> Hi all,
>>>> With recent patch to improve detection of vector rotates at RTL level
>>>> combine now tries matching a V8HImode rotate by 8 in the example in the
>>>> testcase.  We can teach AArch64 to emit a REV16 instruction for such a 
>>>> rotate
>>>> but really this operation corresponds to the RTL code BSWAP, for which we
>>>> already have the right patterns.  BSWAP is arguably a simpler 
>>>> representation
>>>> than ROTATE here because it has only one operand, so let's teach 
>>>> simplify-rtx
>>>> to generate it.
>>>> With this patch the testcase now generates the simplest form:
>>>> .L2:
>>>>         ldr     q31, [x1, x0]
>>>>         rev16   v31.16b, v31.16b
>>>>         str     q31, [x0, x2]
>>>>         add     x0, x0, 16
>>>>         cmp     x0, 2048
>>>>         bne     .L2
>>>> instead of the previous:
>>>> .L2:
>>>>         ldr     q31, [x1, x0]
>>>>         shl     v30.8h, v31.8h, 8
>>>>         usra    v30.8h, v31.8h, 8
>>>>         str     q30, [x0, x2]
>>>>         add     x0, x0, 16
>>>>         cmp     x0, 2048
>>>>         bne     .L2
>>>> IMO ideally the bswap detection would have been done during vectorisation
>>>> time and used the expanders for that, but teaching simplify-rtx to do this
>>>> transformation is fairly straightforward and, unlike at tree level, we have
>>>> the native RTL BSWAP code.  This change is not enough to generate the
>>>> equivalent sequence in SVE, but that is something that should be tackled
>>>> separately.
>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>> Signed-off-by: Kyrylo Tkachov<ktkac...@nvidia.com>
>>>> gcc/
>>>> * simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
>>>> Simplify (rotate:HI x:HI, 8) -> (bswap:HI x:HI).
>>>> gcc/testsuite/
>>>> * gcc.target/aarch64/rot_to_bswap.c: New test.
>>>> v2-0006-simplify-rtx-Simplify-ROTATE-HI-X-HI-8-into-BSWAP-HI.patch
>>>> From 79e6dcf698361eae46d0e99f851077199a8ce43a Mon Sep 17 00:00:00 2001
>>>> From: Kyrylo Tkachov<ktkac...@nvidia.com>
>>>> Date: Thu, 17 Oct 2024 06:39:57 -0700
>>>> Subject: [PATCH 6/6] simplify-rtx: Simplify ROTATE:HI (X:HI, 8) into 
>>>> BSWAP:HI
>>>>  (X)
>>>> With recent patch to improve detection of vector rotates at RTL level
>>>> combine now tries matching a V8HImode rotate by 8 in the example in the
>>>> testcase.  We can teach AArch64 to emit a REV16 instruction for such a 
>>>> rotate
>>>> but really this operation corresponds to the RTL code BSWAP, for which we
>>>> already have the right patterns.  BSWAP is arguably a simpler 
>>>> representation
>>>> than ROTATE here because it has only one operand, so let's teach 
>>>> simplify-rtx
>>>> to generate it.
>>>> With this patch the testcase now generates the simplest form:
>>>> .L2:
>>>>         ldr     q31, [x1, x0]
>>>>         rev16   v31.16b, v31.16b
>>>>         str     q31, [x0, x2]
>>>>         add     x0, x0, 16
>>>>         cmp     x0, 2048
>>>>         bne     .L2
>>>> instead of the previous:
>>>> .L2:
>>>>         ldr     q31, [x1, x0]
>>>>         shl     v30.8h, v31.8h, 8
>>>>         usra    v30.8h, v31.8h, 8
>>>>         str     q30, [x0, x2]
>>>>         add     x0, x0, 16
>>>>         cmp     x0, 2048
>>>>         bne     .L2
>>>> IMO ideally the bswap detection would have been done during vectorisation
>>>> time and used the expanders for that, but teaching simplify-rtx to do this
>>>> transformation is fairly straightforward and, unlike at tree level, we have
>>>> the native RTL BSWAP code.  This change is not enough to generate the
>>>> equivalent sequence in SVE, but that is something that should be tackled
>>>> separately.
>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>> Signed-off-by: Kyrylo Tkachov<ktkac...@nvidia.com>
>>>> gcc/
>>>> * simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
>>>> Simplify (rotate:HI x:HI, 8) -> (bswap:HI x:HI).
>>>> gcc/testsuite/
>>>> * gcc.target/aarch64/rot_to_bswap.c: New test.
>>>> ---
>>>>  gcc/simplify-rtx.cc                           |  6 +++++
>>>>  .../gcc.target/aarch64/rot_to_bswap.c         | 23 +++++++++++++++++++
>>>>  2 files changed, 29 insertions(+)
>>>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c
>>>> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
>>>> index 089e03c2a7a..205a251f005 100644
>>>> --- a/gcc/simplify-rtx.cc
>>>> +++ b/gcc/simplify-rtx.cc
>>>> @@ -4328,6 +4328,12 @@ simplify_context::simplify_binary_operation_1 
>>>> (rtx_code code,
>>>>        mode, op0, new_amount_rtx);
>>>>   }
>>>>  #endif
>>>> +      /* ROTATE/ROTATERT:HI (X:HI, 8) is BSWAP:HI (X).  */
>>>> +      tem = unwrap_const_vec_duplicate (trueop1);
>>>> +      if (GET_MODE_UNIT_BITSIZE (mode) == (2 * BITS_PER_UNIT)
>>>> +  && CONST_INT_P (tem) && INTVAL (tem) == BITS_PER_UNIT)
>>>> + return simplify_gen_unary (BSWAP, mode, op0, mode);
>>> So what about other modes?  I haven't really pondered this, but isn't there 
>>> something similar for ROTATE:SI (X:SI, 16)?  I guess the basic question is 
>>> whether or not this really need to be limited to HImode.
>>> 
>> A (ROTATE:SI (X:SI, 16)) would represent a half-word swap, rather than a 
>> byte-swap. For example, 0x12345678 rotated by 16 gives 0x56781234 whereas a 
>> bswap would give 0x78563412.
>> AArch64 does have native operations that perform these half-word (and word) 
>> swaps, but they are not RTL BSWAP operations unfortunately.
>> So this pattern effectively only works for HI and vector HI modes.
> Thanks for clarifying.  This probably deserves  expanding the comment a bit.  
> Perhaps something along these lines:
> 
> /* ROTATE/ROTATERT:HI (X:HI, 8) is an RTL BSWAP:HI (X).  Other
>   combinations such as SImode with a count of 16 do not
>   correspond to RTL BSWAP semantics.  */
> 
> I was also slightly worried about targets were BITS_PER_UNIT isn't 8; I would 
> have swore we had one, but I don't see one.
> 
> 
> OK with the comment addition.
>


Thanks, I’ll extend the comment when I commit the series. Would you be able to 
help with the review of the first one in the series by any chance?
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/666605.html

Thanks again,
Kyrill


> jeff

Re: [PATCH 6/6] simplify-rtx: Simplify ROTATE:HI (X:HI, 8) into BSWAP:HI (X)

Reply via email to