https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117048
            Bug ID: 117048
           Summary: Failure to combine into XAR instruction
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

A testcase derived from a hashing algorithm:

#include <stdint.h>
#include <string.h>
#include <arm_neon.h>

static inline uint64x2_t
rotr64_vec(uint64x2_t x, const int b)
{
    int64x2_t neg_b = vdupq_n_s64(-b);
    int64x2_t left_shift = vsubq_s64(vdupq_n_s64(64), vdupq_n_s64(b));

    uint64x2_t right_shifted = vshlq_u64(x, neg_b);
    uint64x2_t left_shifted = vshlq_u64(x, left_shift);

    return vorrq_u64(right_shifted, left_shifted);
}

void G(
        int64_t* v,
        int64x2_t& m1_01,
        int64x2_t& m1_23,
        int64x2_t& m2_01,
        int64x2_t& m2_23
) {
    int64x2_t vd01 = {v[12],v[13]};

    vd01 = veorq_s64(vd01, m1_01);
    vd01 = vreinterpretq_s64_u64(rotr64_vec( vreinterpretq_u64_s64 (vd01), 32));

    v[12] = vgetq_lane_s64(vd01, 0);
}

When compiling with, say, -march=armv9-a+sha3, GCC should generate the XAR
instruction like LLVM does:

G(long*, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&):
        ldr     q0, [x0, #96]
        ldr     q1, [x1]
        xar     v0.2d, v0.2d, v1.2d, #32
        str     d0, [x0, #96]
        ret

But GCC generates the less efficient:

G(long*, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&):
        ldr     q30, [x1]
        ldr     q0, [x0, 96]
        eor     v30.16b, v0.16b, v30.16b
        ushr    v31.2d, v30.2d, 32
        shl     v30.2d, v30.2d, 32
        orr     v30.16b, v31.16b, v30.16b
        str     d30, [x0, 96]
        ret

We do have an RTL pattern for XAR expressed as a rotate of a XOR.
I see combine trying and failing to match:

(set (reg:V2DI 119 [ _14 ])
    (ior:V2DI (ashift:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
                (reg:V2DI 116 [ *m1_01_8(D) ]))
            (const_vector:V2DI [
                    (const_int 32 [0x20]) repeated x2
                ]))
        (lshiftrt:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
                (reg:V2DI 116 [ *m1_01_8(D) ]))
            (const_vector:V2DI [
                    (const_int 32 [0x20]) repeated x2
                ]))))

Should this have been simplified to a rotate or do we need more backend
patterns to match it?