https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82151

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
How about this for the inner most loop:

ldr q0, [a, index]
ldr q1, [b, index]
zip1 v2.2d, v0.2d, v1.2d
zip2 v3.2d, v0.2d, v1.2d
str q2, [c, index]
str q3, [c, index+16]

If we did not have ld2/st4, this is what would have been produced (well, if
__builtin_shuffle was producing the correct code generation).
Now the question is whether this is faster than ld2/st4.  On some (most)
micro-architectures it will be.

That corresponds to the following C code:
a1 = a[i*4];
b1 = b[i*4];
c1 = __builtin_shuffle (a1, b1, (vector int){0, 1, 4, 5});
c2 = __builtin_shuffle (a1, b1, (vector int){2, 3, 6, 7});
c[i*8] = c1;
c[i*8+4] = c2;
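
For anyone who wants to try the snippet above, here is a minimal self-contained sketch of it. The function name interleave_pairs, the v4si typedef, the int element type and the memcpy-based loads/stores are assumptions added for illustration; they are not taken from the testcase in the bug. It relies on GCC's vector extensions and __builtin_shuffle, so it is GCC-specific. The mask {0, 1, 4, 5} picks the low 64 bits of each input (what zip1 on .2d does), and {2, 3, 6, 7} picks the high 64 bits (zip2).

```c
#include <string.h>

/* Assumed vector type: four 32-bit ints in a 128-bit vector (GCC extension). */
typedef int v4si __attribute__ ((vector_size (16)));

/* Hypothetical helper: for each 4-int chunk of a and b, write the low
   halves of both to c, followed by the high halves of both.  */
static void
interleave_pairs (const int *a, const int *b, int *c, int nvec)
{
  const v4si lo = {0, 1, 4, 5};   /* low 64 bits of a1, then of b1  */
  const v4si hi = {2, 3, 6, 7};   /* high 64 bits of a1, then of b1 */

  for (int i = 0; i < nvec; i++)
    {
      v4si a1, b1;
      /* memcpy avoids alignment and aliasing assumptions on the pointers. */
      memcpy (&a1, a + i * 4, sizeof a1);
      memcpy (&b1, b + i * 4, sizeof b1);

      v4si c1 = __builtin_shuffle (a1, b1, lo);
      v4si c2 = __builtin_shuffle (a1, b1, hi);

      memcpy (c + i * 8, &c1, sizeof c1);
      memcpy (c + i * 8 + 4, &c2, sizeof c2);
    }
}
```

Whether this actually compiles down to the ldr/zip1/zip2/str sequence depends on the __builtin_shuffle code generation mentioned in this comment, so inspect the output with -O2 -S rather than assuming it.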

Note __builtin_shuffle code generation corresponds to bug 82199.
