Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>> The pattern for the Advanced SIMD XAR instruction isn't very
>>> optimization-friendly at the moment.
>>> In the testcase from the PR, once simplify-rtx has done its work, it
>>> generates the RTL:
>>> (set (reg:V2DI 119 [ _14 ])
>>>      (rotate:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
>>>                             (reg:V2DI 116 [ *m1_01_8(D) ]))
>>>                   (const_vector:V2DI [
>>>                       (const_int 32 [0x20]) repeated x2
>>>                   ])))
>>>
>>> which fails to match our XAR pattern because the pattern expects:
>>> 1) A ROTATERT instead of the ROTATE.  However, according to the RTL ops
>>> documentation the preferred form of rotate-by-immediate is ROTATE, which
>>> I take to mean it's the canonical form.
>>> ROTATE (x, C) <-> ROTATERT (x, MODE_WIDTH - C), so it's better to match just
>>> one canonical representation.
>>> 2) A CONST_INT shift amount whereas the midend asks for a repeated vector
>>> constant.
>>
>> Following on from the 1/2 review, I'm surprised that the middle end
>> requires a vector.  I would have expected a scalar shift to work.
>>
>> I agree it should be rotate rather than rotatert though.  Out of curiosity,
>> where do things go wrong if we just fix that, but keep the scalar shift
>> amount?
>
> The vector constant comes out of the test case using intrinsics such as
> vshlq_u64 that take a vector as a shift amount.
> Our pattern for vector shift by immediate
> aarch64_simd_imm_shl<mode><vczle><vczbe> expresses the shift amount
> as a vector, so I suppose it all comes down to that.
> The standard ashl<mode>3 expander does take a scalar shift amount but
> explicitly creates a vector constant for the RTL passes.
> So it seems that we are de facto standardized on using vectors.
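> To illustrate, an open-coded per-lane rotate written with the intrinsics
> carries vector shift counts whether we want them or not; a minimal sketch
> along the lines of the testcase (the function name here is made up):
>
>   #include <arm_neon.h>
>
>   /* Rotate each 64-bit lane of X right by 32.  vshlq_u64 shifts left for
>      positive counts and right for negative ones, and its shift amount is
>      itself a vector, so both counts reach the RTL level as repeated
>      vector constants rather than CONST_INTs.  */
>   static inline uint64x2_t
>   rotr32_each_lane (uint64x2_t x)
>   {
>     return vorrq_u64 (vshlq_u64 (x, vdupq_n_s64 (-32)),
>                       vshlq_u64 (x, vdupq_n_s64 (32)));
>   }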
OK, thanks, makes sense.

Richard

> Naively, I’d hope recog would try both forms and save us the trouble of
> worrying about it, but I think we’ve been reluctant to complicate recog that
> way in the past.
>
>>
>> No objection to switching to vectors in principle though, especially if it
>> matches what we do elsewhere.
>
> Thanks, I’ll adjust patch 1/2 in the meantime
> Kyrill
>
>>
>> Thanks,
>> Richard
>>
>>>
>>> These issues are fixed by introducing a dedicated expander for the
>>> aarch64_xarqv2di name, needed by the arm_neon.h intrinsic, that translates
>>> the intrinsic-level CONST_INT immediate (the right-rotate amount) into
>>> a repeated vector constant holding the corresponding left-rotate amount
>>> (64 minus the immediate), which is fed to the new representation for the
>>> XAR define_insn that uses the ROTATE RTL code.  This is a similar approach
>>> to how we handle the discrepancy between intrinsic-level and RTL-level
>>> vector lane numbers for big-endian.
>>>
>>> With this patch and [1/2] the arithmetic parts of the testcase now simplify
>>> to just one XAR instruction.
>>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>> I’ll push it after patch approval of [1/2], leaving some time for comments.
>>>
>>> I’ll note that the SVE2 patterns for XAR should also be improved in a
>>> similar way, but that is a separate patch.
>>>
>>> Thanks,
>>> Kyrill
>>>
>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>
>>> gcc/
>>> 	PR target/117048
>>> 	* config/aarch64/aarch64-simd.md (aarch64_xarqv2di): Redefine into a
>>> 	define_expand.
>>> 	(*aarch64_xarqv2di_insn): Define.
>>>
>>> gcc/testsuite/
>>> 	PR target/117048
>>> 	* g++.target/aarch64/pr117048.C: New test.
>>>
>>> From 4f699bf239a563a05e88da5958c44a643718852c Mon Sep 17 00:00:00 2001
>>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>>> Date: Wed, 9 Oct 2024 09:40:33 -0700
>>> Subject: [PATCH 2/2] PR target/117048 aarch64: Use more canonical and
>>>  optimization-friendly representation for XAR instruction
>>>
>>> The pattern for the Advanced SIMD XAR instruction isn't very
>>> optimization-friendly at the moment.
>>> In the testcase from the PR, once simplify-rtx has done its work, it
>>> generates the RTL:
>>> (set (reg:V2DI 119 [ _14 ])
>>>      (rotate:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
>>>                             (reg:V2DI 116 [ *m1_01_8(D) ]))
>>>                   (const_vector:V2DI [
>>>                       (const_int 32 [0x20]) repeated x2
>>>                   ])))
>>>
>>> which fails to match our XAR pattern because the pattern expects:
>>> 1) A ROTATERT instead of the ROTATE.  However, according to the RTL ops
>>> documentation the preferred form of rotate-by-immediate is ROTATE, which
>>> I take to mean it's the canonical form.
>>> ROTATE (x, C) <-> ROTATERT (x, MODE_WIDTH - C), so it's better to match just
>>> one canonical representation.
>>> 2) A CONST_INT shift amount whereas the midend asks for a repeated vector
>>> constant.
>>>
>>> These issues are fixed by introducing a dedicated expander for the
>>> aarch64_xarqv2di name, needed by the arm_neon.h intrinsic, that translates
>>> the intrinsic-level CONST_INT immediate (the right-rotate amount) into
>>> a repeated vector constant holding the corresponding left-rotate amount
>>> (64 minus the immediate), which is fed to the new representation for the
>>> XAR define_insn that uses the ROTATE RTL code.  This is a similar approach
>>> to how we handle the discrepancy between intrinsic-level and RTL-level
>>> vector lane numbers for big-endian.
>>>
>>> With this patch and [1/2] the arithmetic parts of the testcase now simplify
>>> to just one XAR instruction.
>>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>
>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>
>>> gcc/
>>> 	PR target/117048
>>> 	* config/aarch64/aarch64-simd.md (aarch64_xarqv2di): Redefine into a
>>> 	define_expand.
>>> 	(*aarch64_xarqv2di_insn): Define.
>>>
>>> gcc/testsuite/
>>> 	PR target/117048
>>> 	* g++.target/aarch64/pr117048.C: New test.
>>> ---
>>>  gcc/config/aarch64/aarch64-simd.md          | 33 +++++++++++++++++---
>>>  gcc/testsuite/g++.target/aarch64/pr117048.C | 34 +++++++++++++++++++++
>>>  2 files changed, 63 insertions(+), 4 deletions(-)
>>>  create mode 100644 gcc/testsuite/g++.target/aarch64/pr117048.C
>>>
>>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>>> index 11d405ed640..bf272bc0b4e 100644
>>> --- a/gcc/config/aarch64/aarch64-simd.md
>>> +++ b/gcc/config/aarch64/aarch64-simd.md
>>> @@ -9046,18 +9046,43 @@
>>>    [(set_attr "type" "crypto_sha3")]
>>>  )
>>>
>>> -(define_insn "aarch64_xarqv2di"
>>> +(define_insn "*aarch64_xarqv2di_insn"
>>>    [(set (match_operand:V2DI 0 "register_operand" "=w")
>>> -        (rotatert:V2DI
>>> +        (rotate:V2DI
>>>           (xor:V2DI
>>>            (match_operand:V2DI 1 "register_operand" "%w")
>>>            (match_operand:V2DI 2 "register_operand" "w"))
>>> -         (match_operand:SI 3 "aarch64_simd_shift_imm_di" "Usd")))]
>>> +         (match_operand:V2DI 3 "aarch64_simd_lshift_imm" "Dl")))]
>>>    "TARGET_SHA3"
>>> -  "xar\\t%0.2d, %1.2d, %2.2d, %3"
>>> +  {
>>> +    operands[3]
>>> +      = GEN_INT (64 - INTVAL (unwrap_const_vec_duplicate (operands[3])));
>>> +    return "xar\\t%0.2d, %1.2d, %2.2d, %3";
>>> +  }
>>>   [(set_attr "type" "crypto_sha3")]
>>>  )
>>>
>>> +;; The semantics of the vxarq_u64 intrinsics treat the immediate argument as a
>>> +;; right-rotate amount but the recommended representation of rotates by a
>>> +;; constant in RTL is with the left ROTATE code.  Translate between the
>>> +;; intrinsic-provided amount and the RTL operands in the expander here.
>>> +;; The define_insn for XAR will translate back to instruction semantics in its
>>> +;; output logic.
>>> +(define_expand "aarch64_xarqv2di"
>>> +  [(set (match_operand:V2DI 0 "register_operand")
>>> +        (rotate:V2DI
>>> +         (xor:V2DI
>>> +          (match_operand:V2DI 1 "register_operand")
>>> +          (match_operand:V2DI 2 "register_operand"))
>>> +         (match_operand:SI 3 "aarch64_simd_shift_imm_di")))]
>>> +  "TARGET_SHA3"
>>> +  {
>>> +    operands[3]
>>> +      = aarch64_simd_gen_const_vector_dup (V2DImode,
>>> +                                           64 - INTVAL (operands[3]));
>>> +  }
>>> +)
>>> +
>>>  (define_insn "bcaxq<mode>4"
>>>    [(set (match_operand:VQ_I 0 "register_operand" "=w")
>>>         (xor:VQ_I
>>> diff --git a/gcc/testsuite/g++.target/aarch64/pr117048.C b/gcc/testsuite/g++.target/aarch64/pr117048.C
>>> new file mode 100644
>>> index 00000000000..ae46e5875e4
>>> --- /dev/null
>>> +++ b/gcc/testsuite/g++.target/aarch64/pr117048.C
>>> @@ -0,0 +1,34 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2" } */
>>> +
>>> +#include <arm_neon.h>
>>> +
>>> +#pragma GCC target "+sha3"
>>> +
>>> +static inline uint64x2_t
>>> +rotr64_vec(uint64x2_t x, const int b)
>>> +{
>>> +  int64x2_t neg_b = vdupq_n_s64(-b);
>>> +  int64x2_t left_shift = vsubq_s64(vdupq_n_s64(64), vdupq_n_s64(b));
>>> +
>>> +  uint64x2_t right_shifted = vshlq_u64(x, neg_b);
>>> +  uint64x2_t left_shifted = vshlq_u64(x, left_shift);
>>> +
>>> +  return vorrq_u64(right_shifted, left_shifted);
>>> +}
>>> +
>>> +void G(
>>> +  int64_t* v,
>>> +  int64x2_t& m1_01,
>>> +  int64x2_t& m1_23,
>>> +  int64x2_t& m2_01,
>>> +  int64x2_t& m2_23
>>> +) {
>>> +  int64x2_t vd01 = {v[12],v[13]};
>>> +  vd01 = veorq_s64(vd01, m1_01);
>>> +  vd01 = vreinterpretq_s64_u64(rotr64_vec( vreinterpretq_u64_s64 (vd01), 32));
>>> +  v[12] = vgetq_lane_s64(vd01, 0);
>>> +}
>>> +
>>> +/* { dg-final { scan-assembler {\txar\tv[0-9]+\.2d, v[0-9]+\.2d, v[0-9]+\.2d, 32\n} } } */
>>> +
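FWIW the translation done in the new expander just relies on the usual
identity that rotating a 64-bit value right by C is the same as rotating it
left by 64 - C, which is also why matching only the canonical ROTATE form
loses nothing.  A quick standalone sanity check of that identity (plain C,
nothing target-specific, purely illustrative):

  #include <assert.h>
  #include <stdint.h>

  /* Branch-free 64-bit rotates; c is kept in the range 1..63 below so the
     shifts stay well defined.  */
  static uint64_t ror64 (uint64_t x, unsigned c) { return (x >> c) | (x << (64 - c)); }
  static uint64_t rol64 (uint64_t x, unsigned c) { return (x << c) | (x >> (64 - c)); }

  int main (void)
  {
    uint64_t a = 0x0123456789abcdefULL, b = 0xfedcba9876543210ULL;
    for (unsigned c = 1; c < 64; c++)
      /* XAR is "xor, then rotate right by c"; the new insn pattern encodes
         the same operation as a left rotate of the xor by 64 - c.  */
      assert (ror64 (a ^ b, c) == rol64 (a ^ b, 64 - c));
    return 0;
  }

The same translation covers a direct vxarq_u64 call from arm_neon.h: the
intrinsic's immediate is the right-rotate amount, the aarch64_xarqv2di
expander turns it into the corresponding left-rotate vector constant, and the
define_insn's output code converts it back when printing the XAR immediate.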