https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82151
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
How about this for the innermost loop:

        ldr     q0, [a, index]
        ldr     q1, [b, index]
        zip1    v2.2d, v0.2d, v1.2d
        zip2    v3.2d, v0.2d, v1.2d
        str     q2, [c, index]
        str     q3, [c, index+16]

If we did not have ld2/st4, this is what would have been produced (well, if
__builtin_shuffle were producing the correct code).

Now the question is whether this is faster than ld2/st4. On some (most)
micro-architectures it will be.

The sequence above corresponds to the following C code:

        a1 = a[i*4];
        b1 = b[i*4];
        c1 = __builtin_shuffle (a1, b1, (vector int){0, 1, 4, 5});
        c2 = __builtin_shuffle (a1, b1, (vector int){2, 3, 6, 7});
        c[i*8] = c1;
        c[i*8+4] = c2;

Note that the __builtin_shuffle code generation issue is bug 82199.
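For anyone who wants to try the shuffle-based form, here is a minimal self-contained sketch using GCC's generic vector extensions. The typedef v4si, the function name interleave_pairs, the parameter n, and the use of memcpy for the unaligned vector loads and stores are all assumptions made for illustration; they are not taken from the original test case. Whether GCC actually emits the zip1/zip2 sequence for it depends on the compiler version and on bug 82199.

```c
#include <string.h>

/* Assumed typedef: a 128-bit vector of four 32-bit ints (GCC extension). */
typedef int v4si __attribute__ ((vector_size (16)));

/* Hypothetical helper, not from the original report: for each of n
   iterations, read four ints from a and four from b, and write eight
   ints to c, interleaving them in pairs (a0 a1 b0 b1 a2 a3 b2 b3).  */
void
interleave_pairs (const int *a, const int *b, int *c, int n)
{
  for (int i = 0; i < n; i++)
    {
      v4si a1, b1;

      /* memcpy avoids alignment and aliasing assumptions on a, b and c. */
      memcpy (&a1, a + i * 4, sizeof a1);
      memcpy (&b1, b + i * 4, sizeof b1);

      /* Mask indices 0-3 select from a1, 4-7 select from b1. */
      v4si c1 = __builtin_shuffle (a1, b1, (v4si){0, 1, 4, 5});
      v4si c2 = __builtin_shuffle (a1, b1, (v4si){2, 3, 6, 7});

      memcpy (c + i * 8, &c1, sizeof c1);
      memcpy (c + i * 8 + 4, &c2, sizeof c2);
    }
}
```

Compiling this with -O2 for aarch64 and inspecting the assembly should show whether the two shuffles are matched to zip1/zip2 on 64-bit lanes or expanded into something less efficient.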