https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115886
Bug ID: 115886
Summary: 4 different ways of implementing concat with stride produce different results
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Target: aarch64

Take:
```
#define vec4 __attribute__((vector_size(4)))
#define vec8 __attribute__((vector_size(8)))
#define vec16 __attribute__((vector_size(16)))

#if defined(__clang__) || __GNUC__ >= 12
vec8 unsigned char concat (unsigned char * pix1)
{
  vec4 int t = *(vec4 int*)&pix1[0];
  vec4 int t_ = *(vec4 int*)&pix1[16];
  vec8 int t_t = __builtin_shufflevector (t, t_, 0, 1);//, 2, 3, 4, 5, 6, 7);
  vec8 unsigned char t_t_ = (vec8 unsigned char)t_t;
  return t_t_;
}

vec8 unsigned char concat2 (unsigned char * pix1)
{
  vec4 char t = *(vec4 char*)&pix1[0];
  vec4 char t_ = *(vec4 char*)&pix1[16];
  vec8 char t_t = __builtin_shufflevector (t, t_, 0, 1, 2, 3, 4, 5, 6, 7);
  vec8 unsigned char t_t_ = (vec8 unsigned char)t_t;
  return t_t_;
}
#endif

vec8 unsigned char concat3 (unsigned char * pix1)
{
  int t = *(int*)&pix1[0];
  int t_ = *(int*)&pix1[16];
  vec8 int t_t = {t, t_};
  vec8 unsigned char t_t_ = (vec8 unsigned char)t_t;
  return t_t_;
}

vec8 unsigned char concat4 (unsigned char * pix1)
{
  int t = *(int*)&pix1[0];
  int t_ = *(int*)&pix1[16];
  vec8 int t1 = {t, 0};
  t1[1] = t_;
  return (vec8 unsigned char)t1;
}
```

All 4 functions should produce the same assembly.
LLVM produces the following for all 4:
```
        ldr     s0, [x0]
        add     x8, x0, #16
        ld1     { v0.s }[1], [x8]
        ret
```

GCC produces that only for concat4. For the other 2 (concat and concat3), GCC produces:
```
concat:
        ldr     s31, [x0]
        ldr     s0, [x0, 16]
        zip1    v0.2s, v31.2s, v0.2s
        ret
...
concat3:
        ldr     s0, [x0]
        ldr     s31, [x0, 16]
        uzp1    v0.2s, v0.2s, v31.2s
        ret
```

Note: ignore concat2 for now, since it requires V4QI support, which is not implemented yet.

I don't know if using ld1 or zip1 is better here though.
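For anyone who wants to confirm that the initializer form (concat3) and the element-insert form (concat4) are semantically the same before comparing the generated code, here is a minimal sketch of a runtime check. The names `load32`, `concat_init`, `concat_insert` and `concat_variants_agree` are hypothetical and not part of the testcase above, and the loads go through `memcpy` rather than the pointer casts in the original, which is an assumption made only to keep the check free of aliasing and alignment concerns. It says nothing about which instructions the compiler picks.

```c
#include <string.h>

/* GCC/Clang vector extension types: 2 x int and 8 x unsigned char, 8 bytes each. */
typedef int v2si __attribute__((vector_size(8)));
typedef unsigned char v8qi __attribute__((vector_size(8)));

/* Hypothetical helper: load 4 bytes via memcpy instead of a pointer cast. */
static int load32(const unsigned char *p)
{
    int v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Mirrors concat3: build the vector with an initializer list. */
static v8qi concat_init(const unsigned char *pix)
{
    v2si t = { load32(pix), load32(pix + 16) };
    return (v8qi)t;
}

/* Mirrors concat4: initialize lane 0, then insert lane 1. */
static v8qi concat_insert(const unsigned char *pix)
{
    v2si t = { load32(pix), 0 };
    t[1] = load32(pix + 16);
    return (v8qi)t;
}

/* Returns 1 when both variants yield pix[0..3] followed by pix[16..19]. */
int concat_variants_agree(const unsigned char *pix)
{
    v8qi a = concat_init(pix);
    v8qi b = concat_insert(pix);
    unsigned char expect[8];

    memcpy(expect, pix, 4);
    memcpy(expect + 4, pix + 16, 4);
    return memcmp(&a, expect, 8) == 0 && memcmp(&b, expect, 8) == 0;
}
```

Calling `concat_variants_agree` on any buffer of at least 20 bytes should return 1, so any difference between the variants is in instruction selection only.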
Note this shows up in SPEC (x264); see PR 115252.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)