https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117093
Bug ID: 117093
Summary: Missing detection of REV64 vector permute
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
CC: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64
This testcase is reduced from some hashing code:
#include <arm_neon.h>
uint64x2_t ror32_neon_tgt_gcc_bad(uint64x2_t r) {
  uint32x4_t a = vreinterpretq_u32_u64 (r);
  uint32_t t;
  t = a[0]; a[0] = a[1]; a[1] = t;
  t = a[2]; a[2] = a[3]; a[3] = t;
  return vreinterpretq_u64_u32 (a);
}
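The two swaps just reverse the 32-bit lanes within each 64-bit half, i.e. a REV64
permute. For reference, the intrinsic spelling below (the function name is made up
for illustration) expresses the same operation directly via vrev64q_u32 and should
already compile to a single rev64:

#include <arm_neon.h>

/* Same permute written with the REV64 intrinsic: reverse the 32-bit
   lanes within each 64-bit doubleword.  */
uint64x2_t ror32_neon_tgt_rev64(uint64x2_t r) {
  uint32x4_t a = vreinterpretq_u32_u64 (r);
  return vreinterpretq_u64_u32 (vrev64q_u32 (a));
}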
LLVM is able to produce on aarch64:
ror32_neon_tgt_gcc_bad(__Uint64x2_t):
        rev64   v0.4s, v0.4s
        ret
Whereas GCC does:
ror32_neon_tgt_gcc_bad(__Uint64x2_t):
        mov     v31.16b, v0.16b
        ins     v31.s[0], v0.s[1]
        ins     v31.s[1], v0.s[0]
        ins     v31.s[2], v0.s[3]
        ins     v31.s[3], v0.s[2]
        mov     v0.16b, v31.16b
        ret
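FWIW, the aarch64 backend already handles REV64-style constant permutes, so I'd
expect the same lane swap written as an explicit constant-mask shuffle (untested
sketch below, with a made-up function name) to be recognised; the problem seems
to be that the scalar element swaps in the testcase are never turned into a
single VEC_PERM_EXPR in the first place:

#include <arm_neon.h>

/* Hypothetical variant: the same lane swap written as an explicit
   constant-mask shuffle, i.e. a VEC_PERM_EXPR with mask {1,0,3,2}.  */
uint64x2_t ror32_neon_tgt_shuffle(uint64x2_t r) {
  uint32x4_t a = vreinterpretq_u32_u64 (r);
  uint32x4_t mask = {1, 0, 3, 2};
  return vreinterpretq_u64_u32 (__builtin_shuffle (a, mask));
}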
I'm not sure which part of GCC should handle this. Is it something SLP
vectorisation would pick up and optimise through its permute logic, or something
the bswap pass could do?