https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the key thing to notice here is that the regular interleaving knows there are enough vectors to perform two-vector-to-one permutes within the same group, while we only have a single child for the VEC_PERM_EXPR, which for the permute in question effectively means we have to take "two" VLA vectors. The non-SLP interleaving scheme for this performs multiple VLA loads, while we'd have a contiguous load node that we'd permute later on, but we're usually not emitting multiple loads(?).

For gcc.dg/vect/slp-42.c we do end up (after re-analyzing with single-lane SLP) with store-lanes for the 4-element store, but SVE doesn't support 8-element load-lanes (we could use 4-element load-lanes with u64 elements; that is a missing feature).

I do think the VLA interleaving scheme we produce is quite inefficient (and the cost modeling agrees and would choose V4SI fixed-size regs).
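For illustration only, a loop of the general shape discussed above (a 4-element store group fed from an 8-element load group of 32-bit ints) might look like the sketch below. This is not the contents of gcc.dg/vect/slp-42.c; the names N, x, y, init_input and kernel are made up for the example.

```c
#include <stdint.h>

#define N 16               /* hypothetical iteration count */

int32_t x[4 * N];          /* written as a group of 4 per iteration */
int32_t y[8 * N];          /* read as a group of 8 per iteration */

/* Fill the input with a simple known pattern: y[i] = i.  */
void
init_input (void)
{
  for (int i = 0; i < 8 * N; ++i)
    y[i] = i;
}

/* Each iteration loads 8 contiguous elements of y and stores 4
   contiguous elements of x.  Output lane j consumes the adjacent
   input pair (2*j, 2*j + 1) of the load group.  */
void
kernel (void)
{
  for (int i = 0; i < N; ++i)
    {
      x[4 * i + 0] = y[8 * i + 0] + y[8 * i + 1];
      x[4 * i + 1] = y[8 * i + 2] + y[8 * i + 3];
      x[4 * i + 2] = y[8 * i + 4] + y[8 * i + 5];
      x[4 * i + 3] = y[8 * i + 6] + y[8 * i + 7];
    }
}
```

In this sketch each output lane uses one adjacent pair of 32-bit inputs, so the 8-element group could presumably be viewed as 4 elements of 64 bits each, which is the shape the "4-element load-lanes with u64 elements" idea would apply to. Whether the real testcase has this property is not claimed here.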