https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117048

            Bug ID: 117048
           Summary: Failure to combine into XAR instruction
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

A testcase derived from a hashing algorithm:
#include <stdint.h>
#include <string.h>
#include <arm_neon.h>

static inline uint64x2_t
rotr64_vec(uint64x2_t x, const int b)
{
    int64x2_t neg_b = vdupq_n_s64(-b);
    int64x2_t left_shift = vsubq_s64(vdupq_n_s64(64), vdupq_n_s64(b));

    uint64x2_t right_shifted = vshlq_u64(x, neg_b);
    uint64x2_t left_shifted = vshlq_u64(x, left_shift);

    return vorrq_u64(right_shifted, left_shifted);
}

void G(
    int64_t* v,
    int64x2_t& m1_01, 
    int64x2_t& m1_23, 
    int64x2_t& m2_01, 
    int64x2_t& m2_23   
) {
    int64x2_t vd01 = {v[12],v[13]};
    vd01 = veorq_s64(vd01, m1_01);
    vd01 = vreinterpretq_s64_u64(rotr64_vec(vreinterpretq_u64_s64(vd01), 32));
    v[12] = vgetq_lane_s64(vd01, 0);
}

When compiling with, say, -march=armv9-a+sha3, GCC should generate the XAR
instruction like LLVM does:
G(long*, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&):
        ldr     q0, [x0, #96]
        ldr     q1, [x1]
        xar     v0.2d, v0.2d, v1.2d, #32
        str     d0, [x0, #96]
        ret

But GCC generates the less efficient:
G(long*, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&):
        ldr     q30, [x1]
        ldr     q0, [x0, 96]
        eor     v30.16b, v0.16b, v30.16b
        ushr    v31.2d, v30.2d, 32
        shl     v30.2d, v30.2d, 32
        orr     v30.16b, v31.16b, v30.16b
        str     d30, [x0, 96]
        ret

We do have an RTL pattern for XAR expressed as a rotate of a XOR. I see combine
trying and failing to match:
(set (reg:V2DI 119 [ _14 ])
    (ior:V2DI (ashift:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
                (reg:V2DI 116 [ *m1_01_8(D) ]))
            (const_vector:V2DI [
                    (const_int 32 [0x20]) repeated x2
                ]))
        (lshiftrt:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
                (reg:V2DI 116 [ *m1_01_8(D) ]))
            (const_vector:V2DI [
                    (const_int 32 [0x20]) repeated x2
                ]))))

Should this have been simplified to a rotate or do we need more backend
patterns to match it?
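
For reference, a minimal scalar sketch of the identity involved, modelling a
single 64-bit lane. The helper names (rotr64, xar_lane, ior_of_shifts) are
hypothetical and only illustrate the shapes above; they are not GCC internals.
It assumes XAR rotates each 64-bit lane of the XOR result right by the
immediate:

```cpp
#include <cstdint>

// Scalar model of one 64-bit lane. All names here are hypothetical.

// Rotate right by n for 0 <= n <= 63; the masking avoids a shift by 64.
constexpr std::uint64_t rotr64(std::uint64_t x, unsigned n) {
    return (x >> (n & 63)) | (x << ((64 - n) & 63));
}

// Per-lane effect assumed for XAR: rotate right of (a XOR b) by an immediate.
constexpr std::uint64_t xar_lane(std::uint64_t a, std::uint64_t b, unsigned n) {
    return rotr64(a ^ b, n);
}

// The shape combine fails to match:
// (ior (ashift X 32) (lshiftrt X 32)) with X = (xor a b).
constexpr std::uint64_t ior_of_shifts(std::uint64_t a, std::uint64_t b) {
    return ((a ^ b) << 32) | ((a ^ b) >> 32);
}

// The two shift amounts sum to the lane width, so the IOR is a rotate by 32.
static_assert(ior_of_shifts(0x0123456789abcdefULL, 0x0ULL)
                  == xar_lane(0x0123456789abcdefULL, 0x0ULL, 32),
              "ior of complementary shifts equals rotate of the xor");
```

The same reasoning applies per lane to the V2DI expression, which is why a
rotate form looks like the natural canonicalization here.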
