https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82189
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note starting with GCC 14 on aarch64 we get:

        ldp     s31, s30, [x1]
        add     x1, x2, 4
        dup     v0.4s, v0.s[0]
        ld1     {v30.s}[1], [x1]
        ld1     {v31.s}[1], [x2]
        zip1    v30.4s, v31.4s, v30.4s
        fdiv    v0.4s, v30.4s, v0.4s
        str     q0, [x0]

And on the trunk we get:

        ldp     s31, s30, [x1]
        dup     v0.4s, v0.s[0]
        ldr     s29, [x2, 4]
        ld1     {v31.s}[1], [x2]
        uzp1    v30.2s, v30.2s, v29.2s
        zip1    v30.4s, v31.4s, v30.4s
        fdiv    v0.4s, v30.4s, v0.4s
        str     q0, [x0]

which is slightly worse. This all comes from:

```
  _1 = *b_9(D);
  _3 = MEM[(float *)b_9(D) + 4B];
  _5 = *c_15(D);
  _7 = MEM[(float *)c_15(D) + 4B];
  _18 = {_1, _3, _5, _7};
```

```
#define vec8 __attribute__((vector_size(8)))
#define vec16 __attribute__((vector_size(16)))

vec16 float f1(float *restrict a, float *restrict b)
{
  vec8 float t = {a[0], a[1]};
  vec8 float t1 = {b[0], b[1]};
  return __builtin_shufflevector(t, t1, 0, 1, 2, 3);
}

vec16 float f2(float *restrict a, float *restrict b)
{
  vec16 float t = {a[0], a[1], b[0], b[1]};
  return t;
}
```

We can optimize f1 but not f2.