> On 11 Oct 2024, at 12:36, Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> External email: Use caution opening links or attachments
>
>
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> The pattern for the Advanced SIMD XAR instruction isn't very
>> optimization-friendly at the moment.
>> In the testcase from the PR, once simplify-rtx has done its work, it
>> generates the RTL:
>> (set (reg:V2DI 119 [ _14 ])
>>     (rotate:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
>>             (reg:V2DI 116 [ *m1_01_8(D) ]))
>>         (const_vector:V2DI [
>>                 (const_int 32 [0x20]) repeated x2
>>             ])))
>>
>> which fails to match our XAR pattern because the pattern expects:
>> 1) A ROTATERT instead of the ROTATE. However, according to the RTL ops
>> documentation the preferred form of rotate-by-immediate is ROTATE, which
>> I take to mean it's the canonical form.
>> ROTATE (x, C) <-> ROTATERT (x, MODE_WIDTH - C), so it's better to match just
>> one canonical representation.
>> 2) A CONST_INT shift amount whereas the midend asks for a repeated vector
>> constant.
>
> Following on from the 1/2 review, I'm surprised that the middle end
> requires a vector. I would have expected a scalar shift to work.
>
> I agree it should be rotate rather than rotatert though. Out of curiosity,
> where do things go wrong if we just fix that, but keep the scalar shift
> amount?
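
On the rotate form: the identity the patch relies on is simply that rotating
right by C is the same as rotating left by MODE_WIDTH - C. A tiny
self-contained sketch for the 64-bit lanes in question (the helper names are
made up for illustration, not taken from the patch):

#include <stdint.h>
#include <assert.h>

/* Illustration-only helpers for 64-bit rotates.  */
static uint64_t rotl64 (uint64_t x, unsigned c) { return (x << c) | (x >> (64 - c)); }
static uint64_t rotr64 (uint64_t x, unsigned c) { return (x >> c) | (x << (64 - c)); }

int main (void)
{
  uint64_t x = 0x0123456789abcdefULL;
  /* ROTATE (x, 64 - C) == ROTATERT (x, C) for 0 < C < 64.  */
  for (unsigned c = 1; c < 64; c++)
    assert (rotl64 (x, 64 - c) == rotr64 (x, c));
  return 0;
}

The expander in the patch applies that same 64 - C translation to the
intrinsic's immediate (and additionally duplicates it into a V2DI vector).
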
The vector constant comes from the test case's use of intrinsics such as
vshlq_u64 that take a vector as the shift amount. Our pattern for vector
shift by immediate, aarch64_simd_imm_shl<mode><vczle><vczbe>, expresses the
shift amount as a vector, so I suppose it all comes down to that. The
standard ashl<mode>3 expander does take a scalar shift amount but explicitly
creates a vector constant for the RTL passes, so it seems that we are de
facto standardized on using vectors. Naively, I’d hope recog would try both
forms and save us the trouble of worrying about it, but I think we’ve been
reluctant to complicate recog that way in the past.
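
To make that concrete, this is roughly the shape of the rotate in the
testcase at the intrinsic level (a trimmed sketch rather than the exact code
in pr117048.C; the function name is made up): both shift amounts end up as
vector duplicates because vshlq_u64 only accepts a vector count, so by the
time the rotate is formed the amount is already a repeated vector constant.

#include <arm_neon.h>

/* Sketch only: rotate each 64-bit lane right by 32 using the same
   intrinsics as the testcase.  */
uint64x2_t
rotr32_sketch (uint64x2_t x)
{
  /* vshlq_u64 shifts right for negative counts and left for positive ones.  */
  uint64x2_t hi = vshlq_u64 (x, vdupq_n_s64 (-32));
  uint64x2_t lo = vshlq_u64 (x, vdupq_n_s64 (32));
  return vorrq_u64 (hi, lo);
}
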
>
> No objection to switching to vectors in principle though, especially if it
> matches what we do elsewhere.

Thanks, I’ll adjust patch 1/2 in the meantime.
Kyrill

>
> Thanks,
> Richard
>
>>
>> These issues are fixed by introducing a dedicated expander for the
>> aarch64_xarqv2di name, needed by the arm_neon.h intrinsic, that translates
>> the intrinsic-level CONST_INT immediate (the right-rotate amount) into
>> a repeated vector constant subtracted from 64 to give the corresponding
>> left-rotate amount that is fed to the new representation for the XAR
>> define_insn that uses the ROTATE RTL code. This is a similar approach
>> to how we handle the discrepancy between intrinsic-level and RTL-level
>> vector lane numbers for big-endian.
>>
>> With this patch and [1/2] the arithmetic parts of the testcase now simplify
>> to just one XAR instruction.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> I’ll push it after patch approval of [1/2], leaving some time for comments.
>>
>> I’ll note that the SVE2 patterns for XAR should also be improved in a
>> similar way, but that is a separate patch.
>>
>> Thanks,
>> Kyrill
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>> PR target/117048
>> * config/aarch64/aarch64-simd.md (aarch64_xarqv2di): Redefine into a
>> define_expand.
>> (*aarch64_xarqv2di_insn): Define.
>>
>> gcc/testsuite/
>> PR target/117048
>> * g++.target/aarch64/pr117048.C: New test.
>>
>> From 4f699bf239a563a05e88da5958c44a643718852c Mon Sep 17 00:00:00 2001
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Date: Wed, 9 Oct 2024 09:40:33 -0700
>> Subject: [PATCH 2/2] PR target/117048 aarch64: Use more canonical and
>>  optimization-friendly representation for XAR instruction
>>
>> The pattern for the Advanced SIMD XAR instruction isn't very
>> optimization-friendly at the moment.
>> In the testcase from the PR, once simplify-rtx has done its work, it
>> generates the RTL:
>> (set (reg:V2DI 119 [ _14 ])
>>     (rotate:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
>>             (reg:V2DI 116 [ *m1_01_8(D) ]))
>>         (const_vector:V2DI [
>>                 (const_int 32 [0x20]) repeated x2
>>             ])))
>>
>> which fails to match our XAR pattern because the pattern expects:
>> 1) A ROTATERT instead of the ROTATE. However, according to the RTL ops
>> documentation the preferred form of rotate-by-immediate is ROTATE, which
>> I take to mean it's the canonical form.
>> ROTATE (x, C) <-> ROTATERT (x, MODE_WIDTH - C), so it's better to match just
>> one canonical representation.
>> 2) A CONST_INT shift amount whereas the midend asks for a repeated vector
>> constant.
>>
>> These issues are fixed by introducing a dedicated expander for the
>> aarch64_xarqv2di name, needed by the arm_neon.h intrinsic, that translates
>> the intrinsic-level CONST_INT immediate (the right-rotate amount) into
>> a repeated vector constant subtracted from 64 to give the corresponding
>> left-rotate amount that is fed to the new representation for the XAR
>> define_insn that uses the ROTATE RTL code. This is a similar approach
>> to how we handle the discrepancy between intrinsic-level and RTL-level
>> vector lane numbers for big-endian.
>>
>> With this patch and [1/2] the arithmetic parts of the testcase now simplify
>> to just one XAR instruction.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>> PR target/117048
>> * config/aarch64/aarch64-simd.md (aarch64_xarqv2di): Redefine into a
>> define_expand.
>> (*aarch64_xarqv2di_insn): Define.
>>
>> gcc/testsuite/
>> PR target/117048
>> * g++.target/aarch64/pr117048.C: New test.
>> ---
>>  gcc/config/aarch64/aarch64-simd.md          | 33 +++++++++++++++++---
>>  gcc/testsuite/g++.target/aarch64/pr117048.C | 34 +++++++++++++++++++++
>>  2 files changed, 63 insertions(+), 4 deletions(-)
>>  create mode 100644 gcc/testsuite/g++.target/aarch64/pr117048.C
>>
>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>> index 11d405ed640..bf272bc0b4e 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -9046,18 +9046,43 @@
>>    [(set_attr "type" "crypto_sha3")]
>>  )
>>
>> -(define_insn "aarch64_xarqv2di"
>> +(define_insn "*aarch64_xarqv2di_insn"
>>    [(set (match_operand:V2DI 0 "register_operand" "=w")
>> -        (rotatert:V2DI
>> +        (rotate:V2DI
>>           (xor:V2DI
>>            (match_operand:V2DI 1 "register_operand" "%w")
>>            (match_operand:V2DI 2 "register_operand" "w"))
>> -         (match_operand:SI 3 "aarch64_simd_shift_imm_di" "Usd")))]
>> +         (match_operand:V2DI 3 "aarch64_simd_lshift_imm" "Dl")))]
>>    "TARGET_SHA3"
>> -  "xar\\t%0.2d, %1.2d, %2.2d, %3"
>> +  {
>> +    operands[3]
>> +      = GEN_INT (64 - INTVAL (unwrap_const_vec_duplicate (operands[3])));
>> +    return "xar\\t%0.2d, %1.2d, %2.2d, %3";
>> +  }
>>    [(set_attr "type" "crypto_sha3")]
>>  )
>>
>> +;; The semantics of the vxarq_u64 intrinsics treat the immediate argument as a
>> +;; right-rotate amount but the recommended representation of rotates by a
>> +;; constant in RTL is with the left ROTATE code. Translate between the
>> +;; intrinsic-provided amount and the RTL operands in the expander here.
>> +;; The define_insn for XAR will translate back to instruction semantics in its
>> +;; output logic.
>> +(define_expand "aarch64_xarqv2di"
>> +  [(set (match_operand:V2DI 0 "register_operand")
>> +        (rotate:V2DI
>> +         (xor:V2DI
>> +          (match_operand:V2DI 1 "register_operand")
>> +          (match_operand:V2DI 2 "register_operand"))
>> +         (match_operand:SI 3 "aarch64_simd_shift_imm_di")))]
>> +  "TARGET_SHA3"
>> +  {
>> +    operands[3]
>> +      = aarch64_simd_gen_const_vector_dup (V2DImode,
>> +                                           64 - INTVAL (operands[3]));
>> +  }
>> +)
>> +
>>  (define_insn "bcaxq<mode>4"
>>    [(set (match_operand:VQ_I 0 "register_operand" "=w")
>>          (xor:VQ_I
>> diff --git a/gcc/testsuite/g++.target/aarch64/pr117048.C b/gcc/testsuite/g++.target/aarch64/pr117048.C
>> new file mode 100644
>> index 00000000000..ae46e5875e4
>> --- /dev/null
>> +++ b/gcc/testsuite/g++.target/aarch64/pr117048.C
>> @@ -0,0 +1,34 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2" } */
>> +
>> +#include <arm_neon.h>
>> +
>> +#pragma GCC target "+sha3"
>> +
>> +static inline uint64x2_t
>> +rotr64_vec(uint64x2_t x, const int b)
>> +{
>> +  int64x2_t neg_b = vdupq_n_s64(-b);
>> +  int64x2_t left_shift = vsubq_s64(vdupq_n_s64(64), vdupq_n_s64(b));
>> +
>> +  uint64x2_t right_shifted = vshlq_u64(x, neg_b);
>> +  uint64x2_t left_shifted = vshlq_u64(x, left_shift);
>> +
>> +  return vorrq_u64(right_shifted, left_shifted);
>> +}
>> +
>> +void G(
>> +  int64_t* v,
>> +  int64x2_t& m1_01,
>> +  int64x2_t& m1_23,
>> +  int64x2_t& m2_01,
>> +  int64x2_t& m2_23
>> +) {
>> +  int64x2_t vd01 = {v[12],v[13]};
>> +  vd01 = veorq_s64(vd01, m1_01);
>> +  vd01 = vreinterpretq_s64_u64(rotr64_vec( vreinterpretq_u64_s64 (vd01), 32));
>> +  v[12] = vgetq_lane_s64(vd01, 0);
>> +}
>> +
>> +/* { dg-final { scan-assembler {\txar\tv[0-9]+\.2d, v[0-9]+\.2d, v[0-9]+\.2d, 32\n} } } */
>> +