https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
I note that the efficiency you gain comes only from a reduced number of
load/store instructions: one vld3 instead of six vldr (huh, apparently vld3
can load 16-byte vectors but vldr only 8-byte ones?).  I assume vld3 has no
penalty for the lane-split itself, so the code-size reduction is always
wanted.  Thus we'd want to always use a lane load/store, even if the
permutation is pointless, as soon as we'd otherwise issue more than one SLP
load.  Say for

void t5 (int len, int * __restrict p, int * __restrict q)
{
  for (int i = 0; i < len; i+=8)
    {
      p[i] = q[i] * 2;
      p[i+1] = q[i+1] * 2;
      p[i+2] = q[i+2] * 2;
      p[i+3] = q[i+3] * 2;
      p[i+4] = q[i+4] * 2;
      p[i+5] = q[i+5] * 2;
      p[i+6] = q[i+6] * 2;
      p[i+7] = q[i+7] * 2;
    }
}

instead of

.L4:
        vldr    d18, [r2, #-16]
        vldr    d19, [r2, #-8]
        vldr    d16, [r2, #-32]
        vldr    d17, [r2, #-24]
        vshl.i32        q9, q9, #1
        vshl.i32        q8, q8, #1
        add     r3, r3, #1
        cmp     r0, r3
        vstr    d18, [r1, #-16]
        vstr    d19, [r1, #-8]
        vstr    d16, [r1, #-32]
        vstr    d17, [r1, #-24]
        add     r2, r2, #32
        add     r1, r1, #32
        bhi     .L4

use vld2.32 / vst2.32?  Generally for SLP the implicit permute performed by
those instructions could be modeled properly (and the SLP chain permuted
accordingly).
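
For illustration, a minimal NEON-intrinsics sketch of what vld2.32/vst2.32
code generation for t5 could correspond to (not GCC output; the function
name t5_neon is made up here).  The de-interleave done by vld2q_s32 is
undone again by vst2q_s32, and the multiply-by-two applies per lane, so the
implicit permute is indeed harmless:

#include <arm_neon.h>

/* Hypothetical hand-written equivalent of t5 using interleaving
   load/store intrinsics; assumes len is a multiple of 8.  */
void t5_neon (int len, int * __restrict p, int * __restrict q)
{
  for (int i = 0; i < len; i += 8)
    {
      int32x4x2_t v = vld2q_s32 (q + i);     /* one vld2.32, loads 8 ints  */
      v.val[0] = vshlq_n_s32 (v.val[0], 1);  /* *2 as shift, as in the asm */
      v.val[1] = vshlq_n_s32 (v.val[1], 1);
      vst2q_s32 (p + i, v);                  /* one vst2.32, stores 8 ints */
    }
}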