https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-09-20
             Target|                            |aarch64, riscv
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another example this shows is for gcc.dg/vect/slp-42.c - we definitely can do
the interleaving scheme as non-SLP vectorization shows.

gcc.dg/vect/slp-42.c also shows we're not yet "lowering" all SLP load
permutes.  The original SLP attempt still has

   node 0x45d5050 (max_nunits=4, refcnt=2) vector([4,4]) int
   op template: _2 = q[_1];
        stmt 0 _2 = q[_1];
        stmt 1 _8 = q[_7];
        stmt 2 _14 = q[_13];
        stmt 3 _20 = q[_19];
        load permutation { 0 1 2 3 }
   node 0x45d50e8 (max_nunits=4, refcnt=2) vector([4,4]) int
   op template: _4 = q[_3];
        stmt 0 _4 = q[_3];
        stmt 1 _10 = q[_9];
        stmt 2 _16 = q[_15];
        stmt 3 _22 = q[_21];
        load permutation { 4 5 6 7 }

instead of a single contiguous load and two VEC_PERM_EXPR nodes to extract
the lo/hi parts (which is also extract even/odd, but with a larger mode
encompassing 4 elements).  I'd say for VLA operation this is one of the major
blockers for all-SLP.
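
For illustration, a minimal scalar C sketch of the lowering described above. This is not GCC code and not a copy of slp-42.c. The loop shape (groups of 8 ints read from q, split into lanes 0..3 and 4..7) is inferred from the two load permutations in the dump, and the add that consumes both halves is an assumption made only so the result can be checked. The names perm_select, add_halves_scalar and add_halves_lowered are hypothetical.

```c
#include <stddef.h>

enum { GROUP = 8, LANES = 4 };

/* Scalar model of a single-input permute: out[i] = in[sel[i]].
   Stands in for a VEC_PERM_EXPR; hypothetical helper, not a GCC API. */
static void perm_select(const int *in, const size_t *sel, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[sel[i]];
}

/* Reference form: two separate 4-lane loads per group, corresponding to
   load permutations { 0 1 2 3 } and { 4 5 6 7 }. */
void add_halves_scalar(const int *q, int *p, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < LANES; ++k)
            p[i * LANES + k] = q[i * GROUP + k] + q[i * GROUP + LANES + k];
}

/* Lowered form: one contiguous load of the whole group, then two
   permutes that extract the lo and hi halves. */
void add_halves_lowered(const int *q, int *p, size_t n)
{
    static const size_t lo_sel[LANES] = { 0, 1, 2, 3 };
    static const size_t hi_sel[LANES] = { 4, 5, 6, 7 };

    for (size_t i = 0; i < n; ++i) {
        int group[GROUP], lo[LANES], hi[LANES];

        /* Single contiguous load covering all 8 group members. */
        for (size_t k = 0; k < GROUP; ++k)
            group[k] = q[i * GROUP + k];

        perm_select(group, lo_sel, lo, LANES);  /* lo part */
        perm_select(group, hi_sel, hi, LANES);  /* hi part */

        for (size_t k = 0; k < LANES; ++k)
            p[i * LANES + k] = lo[k] + hi[k];
    }
}
```

Both functions compute the same result; only the way the loads are expressed differs. If each group is viewed as two elements that are 4 ints wide, selecting lanes { 0 1 2 3 } picks the even element and { 4 5 6 7 } the odd one, which is presumably the "extract even/odd with a larger mode" equivalence the comment refers to.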