> On 24 Oct 2024, at 07:36, Jeff Law <jeffreya...@gmail.com> wrote:
> 
> 
> 
> On 10/22/24 2:26 PM, Kyrylo Tkachov wrote:
>> Hi all,
>> With recent patch to improve detection of vector rotates at RTL level
>> combine now tries matching a V8HImode rotate by 8 in the example in the
>> testcase.  We can teach AArch64 to emit a REV16 instruction for such a rotate
>> but really this operation corresponds to the RTL code BSWAP, for which we
>> already have the right patterns.  BSWAP is arguably a simpler representation
>> than ROTATE here because it has only one operand, so let's teach simplify-rtx
>> to generate it.
>> With this patch the testcase now generates the simplest form:
>> .L2:
>>         ldr     q31, [x1, x0]
>>         rev16   v31.16b, v31.16b
>>         str     q31, [x0, x2]
>>         add     x0, x0, 16
>>         cmp     x0, 2048
>>         bne     .L2
>> instead of the previous:
>> .L2:
>>         ldr     q31, [x1, x0]
>>         shl     v30.8h, v31.8h, 8
>>         usra    v30.8h, v31.8h, 8
>>         str     q30, [x0, x2]
>>         add     x0, x0, 16
>>         cmp     x0, 2048
>>         bne     .L2
>> IMO ideally the bswap detection would have been done during vectorisation
>> time and used the expanders for that, but teaching simplify-rtx to do this
>> transformation is fairly straightforward and, unlike at tree level, we have
>> the native RTL BSWAP code.  This change is not enough to generate the
>> equivalent sequence in SVE, but that is something that should be tackled
>> separately.
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Signed-off-by: Kyrylo Tkachov<ktkac...@nvidia.com>
>> gcc/
>> * simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
>> Simplify (rotate:HI x:HI, 8) -> (bswap:HI x:HI).
>> gcc/testsuite/
>> * gcc.target/aarch64/rot_to_bswap.c: New test.
>> v2-0006-simplify-rtx-Simplify-ROTATE-HI-X-HI-8-into-BSWAP-HI.patch
>> From 79e6dcf698361eae46d0e99f851077199a8ce43a Mon Sep 17 00:00:00 2001
>> From: Kyrylo Tkachov<ktkac...@nvidia.com>
>> Date: Thu, 17 Oct 2024 06:39:57 -0700
>> Subject: [PATCH 6/6] simplify-rtx: Simplify ROTATE:HI (X:HI, 8) into BSWAP:HI
>>  (X)
>> With recent patch to improve detection of vector rotates at RTL level
>> combine now tries matching a V8HImode rotate by 8 in the example in the
>> testcase.  We can teach AArch64 to emit a REV16 instruction for such a rotate
>> but really this operation corresponds to the RTL code BSWAP, for which we
>> already have the right patterns.  BSWAP is arguably a simpler representation
>> than ROTATE here because it has only one operand, so let's teach simplify-rtx
>> to generate it.
>> With this patch the testcase now generates the simplest form:
>> .L2:
>>         ldr     q31, [x1, x0]
>>         rev16   v31.16b, v31.16b
>>         str     q31, [x0, x2]
>>         add     x0, x0, 16
>>         cmp     x0, 2048
>>         bne     .L2
>> instead of the previous:
>> .L2:
>>         ldr     q31, [x1, x0]
>>         shl     v30.8h, v31.8h, 8
>>         usra    v30.8h, v31.8h, 8
>>         str     q30, [x0, x2]
>>         add     x0, x0, 16
>>         cmp     x0, 2048
>>         bne     .L2
>> IMO ideally the bswap detection would have been done during vectorisation
>> time and used the expanders for that, but teaching simplify-rtx to do this
>> transformation is fairly straightforward and, unlike at tree level, we have
>> the native RTL BSWAP code.  This change is not enough to generate the
>> equivalent sequence in SVE, but that is something that should be tackled
>> separately.
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Signed-off-by: Kyrylo Tkachov<ktkac...@nvidia.com>
>> gcc/
>> * simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
>> Simplify (rotate:HI x:HI, 8) -> (bswap:HI x:HI).
>> gcc/testsuite/
>> * gcc.target/aarch64/rot_to_bswap.c: New test.
>> ---
>>  gcc/simplify-rtx.cc                           |  6 +++++
>>  .../gcc.target/aarch64/rot_to_bswap.c         | 23 +++++++++++++++++++
>>  2 files changed, 29 insertions(+)
>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c
>> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
>> index 089e03c2a7a..205a251f005 100644
>> --- a/gcc/simplify-rtx.cc
>> +++ b/gcc/simplify-rtx.cc
>> @@ -4328,6 +4328,12 @@ simplify_context::simplify_binary_operation_1 
>> (rtx_code code,
>>        mode, op0, new_amount_rtx);
>>   }
>>  #endif
>> +      /* ROTATE/ROTATERT:HI (X:HI, 8) is BSWAP:HI (X).  */
>> +      tem = unwrap_const_vec_duplicate (trueop1);
>> +      if (GET_MODE_UNIT_BITSIZE (mode) == (2 * BITS_PER_UNIT)
>> +  && CONST_INT_P (tem) && INTVAL (tem) == BITS_PER_UNIT)
>> + return simplify_gen_unary (BSWAP, mode, op0, mode);
> So what about other modes?  I haven't really pondered this, but isn't there 
> something similar for ROTATE:SI (X:SI, 16)?  I guess the basic question is 
> whether or not this really need to be limited to HImode.
> 

A (ROTATE:SI (X:SI, 16)) would represent a half-word swap, rather than a 
byte-swap. For example, 0x12345678 rotated by 16 gives 0x56781234 whereas a 
bswap would give 0x78563412.
AArch64 does have native operations that perform these half-word (and word) 
swaps, but they are not RTL BSWAP operations unfortunately.
So this pattern effectively only works for HI and vector HI modes.
Thanks,
Kyrill

> Jeff

Reply via email to