https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-09-20
             Target|                            |aarch64, riscv
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another example showing this is gcc.dg/vect/slp-42.c - we can definitely
do the interleaving scheme, as non-SLP vectorization shows.

gcc.dg/vect/slp-42.c also shows we're not yet "lowering" all SLP load permutes.
The original SLP attempt still has

   node 0x45d5050 (max_nunits=4, refcnt=2) vector([4,4]) int
   op template: _2 = q[_1];
        stmt 0 _2 = q[_1];
        stmt 1 _8 = q[_7];
        stmt 2 _14 = q[_13];
        stmt 3 _20 = q[_19];
        load permutation { 0 1 2 3 }
   node 0x45d50e8 (max_nunits=4, refcnt=2) vector([4,4]) int
   op template: _4 = q[_3];
        stmt 0 _4 = q[_3];
        stmt 1 _10 = q[_9];
        stmt 2 _16 = q[_15];
        stmt 3 _22 = q[_21];
        load permutation { 4 5 6 7 }

instead of a single contiguous load and two VEC_PERM_EXPR nodes to extract
the lo/hi parts (which is also extract even/odd, but with a larger mode
encompassing 4 elements).
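
As a rough illustration of that lowering, here is a scalar C model (not
GCC's vectorizer code). The helper names vec_perm, extract_even_odd_wide
and lo_hi_matches_even_odd are made up for this sketch, and the fixed group
size of 8 ints is an assumption taken from the two load permutations in the
dump. It models one contiguous 8-int load followed by two single-input
permutes, and checks that the lo/hi selection is the same as extracting the
even/odd elements when each element is 4 ints wide.

```c
#include <assert.h>
#include <string.h>

#define GROUP 8  /* ints covered by one contiguous load (assumed from the dump) */
#define LANES 4  /* lanes per SLP node */

/* Scalar model of a single-input VEC_PERM_EXPR: dst[i] = src[sel[i]].  */
static void
vec_perm (const int *src, const int *sel, int *dst, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[sel[i]];
}

/* Treat SRC as NWIDE "wide" elements of WIDTH ints each and copy the even
   (ODD == 0) or odd (ODD == 1) wide elements to DST.  */
static void
extract_even_odd_wide (const int *src, int nwide, int width, int odd, int *dst)
{
  assert (nwide % 2 == 0 && (odd == 0 || odd == 1));
  int out = 0;
  for (int w = odd; w < nwide; w += 2)
    for (int k = 0; k < width; k++)
      dst[out++] = src[w * width + k];
}

/* Return 1 if the lo/hi permutes of one contiguous group Q[0..7] give the
   same result as even/odd extraction with 4-int wide elements.  */
static int
lo_hi_matches_even_odd (const int *q)
{
  static const int lo_sel[LANES] = { 0, 1, 2, 3 };  /* load permutation of node 1 */
  static const int hi_sel[LANES] = { 4, 5, 6, 7 };  /* load permutation of node 2 */
  int lo[LANES], hi[LANES], even[LANES], odd[LANES];

  vec_perm (q, lo_sel, lo, LANES);
  vec_perm (q, hi_sel, hi, LANES);
  extract_even_odd_wide (q, GROUP / LANES, LANES, 0, even);
  extract_even_odd_wide (q, GROUP / LANES, LANES, 1, odd);

  return memcmp (lo, even, sizeof lo) == 0
         && memcmp (hi, odd, sizeof hi) == 0;
}
```

Calling lo_hi_matches_even_odd on any 8-int array should return 1, since
with only two wide elements per group the "even" one is the low half and
the "odd" one is the high half.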

I'd say for VLA operation this is one of the major blockers for all-SLP.
