https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
The key thing to notice here is that the regular interleaving knows there
are enough vectors to perform two-vector-to-one permutes within the same
group, whereas we only have a single child for the VEC_PERM_EXPR, which for
the permute in question effectively means we have to take "two" VLA vectors.
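
For readers less familiar with the terminology, here is a rough scalar model
of a two-vector-to-one permute, in the style of VEC_PERM_EXPR <a, b, sel>.
It is only a sketch: the lane count N is fixed at 4 for illustration (a VLA
vector has no compile-time lane count), and the names permute2 and
extract_even are made up for this example. It is not what the vectorizer
emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed lane count; real VLA vectors have a runtime length.  */
enum { N = 4 };

/* Scalar model of a two-input permute: out[i] = concat (a, b)[sel[i]],
   with selector indices taken modulo 2 * N.  */
void
permute2 (const int *a, const int *b, const unsigned *sel, int *out)
{
  assert (a && b && sel && out);
  for (size_t i = 0; i < N; ++i)
    {
      unsigned idx = sel[i] % (2 * N);
      out[i] = idx < N ? a[idx] : b[idx - N];
    }
}

/* One step of deinterleaving: take the even lanes of two input vectors
   and pack them into a single output vector.  */
void
extract_even (const int *a, const int *b, int *out)
{
  static const unsigned sel[N] = { 0, 2, 4, 6 };
  permute2 (a, b, sel, out);
}
```

The point of the model is that the selector indexes into the concatenation
of two inputs, so a permute like this needs two source vectors to be
available.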

The non-SLP interleaving scheme for this performs multiple VLA loads, while
we'd have a contiguous load node that we'd permute later on, but we're usually
not emitting multiple loads(?).  For gcc.dg/vect/slp-42.c we do end up
(after re-analyzing with single-lane SLP) with store-lanes for the 4-element
store, but SVE doesn't support 8-element load-lanes (we could use 4-element
load-lanes with u64 elements, which is a missing feature).

I do think the VLA interleaving scheme we produce is quite inefficient
(and the cost modeling agrees and would choose V4SI fixed-size regs).
