https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116083
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we have a large store group which we analyze, then split away the first
two lanes where discovery succeeds.  We then re-analyze the remaining N - 2
lanes, which fails, split away two more lanes where discovery succeeds, etc.
We eventually run into the SLP discovery limit, but in this situation that
takes too much time - the limit counts the overall number of SLP nodes built
and is set to the number of scalar stmts in the block.

The limiting works quite well, but we do not backtrack quickly enough; at
the least we are adding excess failed nodes, and that is something we can
avoid.  For BB vectorization we are also building nodes that construct
operands from scalars, and those are not accounted for.  But even after
fixing this, the constant amount of work we perform per allowed "work item"
is large, in particular for work items with a large number of lanes.  With
the group reduction scheme outlined above we are looking at work that is
quadratic in the number of scalar stmts N.

Limiting the number of created SLP nodes to the number of scalar stmts is
quite some overcommit for the basic-block case (of course we count every
"failed" SLP node as well, rightfully so).  The correct thing to do might be
to instead assign a lane budget and thus not cost all SLP nodes the same.
With that the compile time is down to

  tree slp vectorization : 0.02 ( 2%)

Note this would disallow splitting groups and re-analyzing them, and would
thus for example make gcc.dg/vect/bb-slp-subgroups-3.c FAIL.

To recap, the current limiting is ineffective when the size of a store group
is on the order of the number of stmts in the block, which is usually
unlikely.
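
To illustrate the difference between the two accounting schemes, here is a
minimal standalone C++ sketch.  This is not GCC code - the type and the two
helpers (slp_node, account_node_count, account_lane_budget) are made-up
stand-ins for illustration of the idea only.

/* Sketch only: hypothetical stand-in for an SLP node, carrying just the
   number of lanes (scalar stmts) it covers.  */

#include <cstdio>

struct slp_node
{
  unsigned lanes;
};

/* Current scheme, roughly: every node built, successful or failed, costs
   one unit against a limit equal to the number of scalar stmts in the
   block.  A failed wide node is as cheap as a 2-lane node.  */
static bool
account_node_count (unsigned *nodes_built, unsigned max_nodes)
{
  if (*nodes_built >= max_nodes)
    return false;		/* discovery limit reached */
  ++*nodes_built;
  return true;
}

/* Proposed scheme, roughly: charge each node by its number of lanes, so
   wide nodes, which imply more constant work per node, drain the budget
   faster and discovery backs off sooner.  */
static bool
account_lane_budget (unsigned *budget, const slp_node &node)
{
  if (node.lanes > *budget)
    return false;		/* lane budget exhausted */
  *budget -= node.lanes;
  return true;
}

int
main ()
{
  unsigned n_scalar_stmts = 1000;
  slp_node wide = { 998 };

  /* Under the node-count limit, repeatedly failing on a wide node and
     retrying on a slightly smaller one stays cheap per item, which is
     where the quadratic behaviour comes from.  */
  unsigned nodes_built = 0;
  printf ("node-count scheme admits wide node: %d\n",
	  account_node_count (&nodes_built, n_scalar_stmts));

  /* Under a lane budget, the same wide node consumes most of the budget
     at once, cutting the split-and-retry pattern off early.  */
  unsigned budget = n_scalar_stmts;
  printf ("lane-budget scheme admits wide node: %d\n",
	  account_lane_budget (&budget, wide));
  printf ("remaining lane budget: %u\n", budget);
  return 0;
}

The point of the sketch is only the costing difference; where exactly such a
budget would be checked in the discovery code is left open here.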