https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116083

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we have a large store group which we analyze, then split away the
first two lanes, for which discovery succeeds.  We then re-analyze the
remaining N - 2 lanes, which fails, split away two more lanes for which
discovery succeeds, and so on.

We eventually run into the SLP discovery limit, but in this situation it
takes too much time to get there - the limit counts the overall number of
SLP nodes built, and the budget is the number of scalar stmts in the block.
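
To illustrate, a minimal sketch of that kind of limiting, with made-up names
and structure (the actual vectorizer code is organized differently): the
budget starts at the number of scalar stmts in the block, every SLP node
built - also for attempts that later fail - is charged one unit, and
discovery gives up once the budget is exhausted.

  /* Hypothetical sketch of a node-count based discovery limit;
     names and structure are made up for illustration.  */
  #include <cstdio>

  static unsigned node_budget;

  /* Charge one unit for an SLP node about to be built; return false
     once the discovery limit is exhausted.  */
  static bool
  charge_slp_node ()
  {
    if (node_budget == 0)
      return false;
    --node_budget;
    return true;
  }

  int
  main ()
  {
    unsigned n_scalar_stmts = 64;
    node_budget = n_scalar_stmts;
    unsigned nodes = 0;
    while (charge_slp_node ())
      nodes++;
    printf ("gave up after building %u nodes\n", nodes);
    return 0;
  }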

The limiting works quite well, but we do not always backtrack quickly
enough - at the very least we are adding excess failed nodes, which is
something we can avoid.  For BB vectorization we are also building nodes
that construct operands from scalars, and those are not accounted for.

But even with that fixed, the amount of work we perform per allowed
"work item" is large, in particular for work items with a large number
of lanes.  With the splitting scheme outlined above we are looking at
work quadratic in the number of scalar stmts N.  Limiting the number of
created SLP nodes to the number of scalar stmts is quite an overcommit
for the basic-block case (of course we rightfully count every "failed"
SLP node as well).
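
As a back-of-the-envelope model of that quadratic behavior (a toy
simulation with a made-up cost model, not the real discovery code): each
round re-analyzes the remaining group, which examines all of its lanes but
only charges a couple of nodes, and then splits off two lanes.  Summing
N + (N - 2) + (N - 4) + ... gives roughly N*N/4 lanes of work while the
node budget shrinks only linearly.

  /* Toy simulation of the split-two-lanes-at-a-time scheme under a
     node-count budget.  Costs are made up for illustration: each
     re-analysis of a k-lane group is charged as one (failed) node
     but examines k lanes; each successful two-lane split is charged
     as one more node.  */
  #include <cstdio>

  int
  main ()
  {
    unsigned n = 1000;                /* lanes in the store group */
    unsigned node_budget = n;         /* ~ number of scalar stmts */
    unsigned long lanes_examined = 0;
    unsigned nodes_built = 0;

    unsigned remaining = n;
    while (remaining > 2 && node_budget >= 2)
      {
        /* Failed re-analysis of the whole remaining group.  */
        lanes_examined += remaining;
        ++nodes_built;
        --node_budget;
        /* Successful discovery for the two split-off lanes.  */
        lanes_examined += 2;
        ++nodes_built;
        --node_budget;
        remaining -= 2;
      }

    printf ("nodes built: %u, lanes examined: %lu (~ n*n/4 = %u)\n",
            nodes_built, lanes_examined, n * n / 4);
    return 0;
  }

With n = 1000 this stays within the node budget (998 nodes built) yet
examines about a quarter of a million lanes.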

The correct thing to do might be to instead assign a lane budget and
thus not cost all SLP nodes the same.  With that the compile-time is down to

 tree slp vectorization             :   0.02 (  2%)

Note this would disallow splitting groups and re-analyzing them, and would
thus for example make gcc.dg/vect/bb-slp-subgroups-3.c FAIL.
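
A minimal sketch of the lane-budget idea under the same toy cost model
(again with made-up names, not what was actually implemented): each node is
charged its number of lanes instead of a flat unit, so a failed re-analysis
of a large remaining group consumes budget in proportion to the work it
causes.

  /* Hypothetical lane-budget variant: every node is charged its
     number of lanes against a budget equal to the number of scalar
     stmts, so large (failed) re-analyses are no longer almost free.  */
  #include <cstdio>

  static unsigned lane_budget;

  /* Charge a node covering NLANES lanes; return false once the
     budget is exhausted.  */
  static bool
  charge_slp_node (unsigned nlanes)
  {
    if (lane_budget < nlanes)
      return false;
    lane_budget -= nlanes;
    return true;
  }

  int
  main ()
  {
    unsigned n = 1000;                /* lanes in the store group */
    lane_budget = n;                  /* ~ number of scalar stmts */
    unsigned long lanes_examined = 0;

    unsigned remaining = n;
    while (remaining > 2)
      {
        /* A failed re-analysis of the whole remaining group now
           costs its full lane count.  */
        if (!charge_slp_node (remaining))
          break;
        lanes_examined += remaining;
        /* Successful two-lane split.  */
        if (!charge_slp_node (2))
          break;
        lanes_examined += 2;
        remaining -= 2;
      }

    printf ("stopped with %u lanes left after examining %lu lanes\n",
            remaining, lanes_examined);
    return 0;
  }

In this model the very first failed re-analysis already consumes the whole
budget, which keeps the total work linear but also means sub-groups never
get re-analyzed - consistent with the bb-slp-subgroups-3.c regression noted
above.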

To recap, the current limiting is ineffective when the size of a store
group is on the order of the number of stmts in the block, which is
usually unlikely.
