https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with this change, and with a prototype that scales costs by loop depth,
assuming very conservatively (or optimistically, depending on the view) that
all loops iterate exactly twice (so cost scaling by 1 << loop_depth works),
this still isn't enough to make the vectorization appear unprofitable.
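
As a rough illustration of that scaling (a minimal sketch, not the actual
prototype): the struct and function names below are hypothetical, and the
separate loop_depth field is an assumption made for clarity, since the real
cost vector entries only carry a 'count'.

```c
#include <assert.h>

/* Hypothetical cost entry, for illustration only.  */
struct cost_entry
{
  unsigned count;      /* number of stmts emitted */
  unsigned unit_cost;  /* cost of one such stmt */
  unsigned loop_depth; /* nesting depth of the containing loop, 0 = none */
};

/* Assume every enclosing loop iterates exactly twice, so a stmt at
   depth D executes 2^D times and its cost is scaled by 1 << D.  */
unsigned long
scaled_cost (const struct cost_entry *e)
{
  assert (e->loop_depth < 32); /* keep the shift well defined */
  return ((unsigned long) e->count * e->unit_cost) << e->loop_depth;
}
```

With these assumptions, two stmts of cost 4 at depth 3 are costed as
(2 * 4) << 3 = 64.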

Technically scaling by loop depth is going to be needed, but it has some
impact throughout the code.  In particular, the cost vector entries do not
have a scale but only a 'count', which I abused for the prototype.  That
count should also tell the backend how many stmts we emit, so we shouldn't
overload it with scaling.

Another idea would be to constrain inner loop "parts" to be profitable
on their own so that the vectorization is profitable independent of the
number of iterations.  For a start that means at least annotating the
cost entries with the loop number, for example.

ARM gets away cost-wise for this testcase because it seems to have a uniform
scalar cost of one, and with a lower VF (we can only use NEON) this is enough
to have the vectorization rejected in this case.

On x86 we have the target specific "issue" that load/store costs dominate
everything since stmt cost (scalar and vector) is generally 4 while
load/store cost (scalar and vector, aligned and unaligned) is 12.
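
To show how quickly memory operations dominate with these numbers, here is a
small sketch.  Only the costs 4 and 12 come from the figures above; the
function name and the stmt mix in the note below are made up for
illustration.

```c
/* Per-stmt costs as quoted above for x86.  */
enum { X86_STMT_COST = 4, X86_MEM_COST = 12 };

/* Total cost of a block with N_STMTS ordinary stmts and N_MEM_OPS
   loads/stores (scalar or vector, aligned or unaligned).  */
unsigned
block_cost (unsigned n_stmts, unsigned n_mem_ops)
{
  return n_stmts * X86_STMT_COST + n_mem_ops * X86_MEM_COST;
}
```

For a hypothetical block with 4 arithmetic stmts and 8 loads/stores this
gives 4 * 4 + 8 * 12 = 112, of which 96 is load/store cost.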


So design-wise I'm leaning towards requiring that BB vectorization be
profitable independent of the number of iterations of a loop.  That would
mean we need to check profitability for each loop level we cover, from
inner to outer loop (and including inner loops), but without applying any
scaling.  So, assuming we annotated each loop in the loop tree with the cost
of the stmts belonging directly to it, do

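 FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
   {
     /* Costs here cover this loop plus all loops nested inside it.  */
     if (!profitable_p (loop->scalar_cost, loop->vector_cost))
       reject SLP;
     /* Accumulate into the parent before it gets visited.  */
     loop_outer (loop)->scalar_cost += loop->scalar_cost;
     loop_outer (loop)->vector_cost += loop->vector_cost;
   }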

(in the end we should do sth more efficient)
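
A self-contained sketch of the same walk, outside of GCC, could look like the
following.  Everything here is a stand-in: struct loop_cost, the array of
loops ordered innermost first, and the rule used in profitable_p (vector
strictly cheaper than scalar) are assumptions for illustration, not GCC's
actual data structures or cost model.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a loop tree node annotated with the cost of
   the stmts belonging directly to it.  */
struct loop_cost
{
  struct loop_cost *outer; /* enclosing loop, NULL at the top */
  unsigned scalar_cost;
  unsigned vector_cost;
};

/* Assumed rule: vectorizing must be strictly cheaper.  */
bool
profitable_p (unsigned scalar_cost, unsigned vector_cost)
{
  return vector_cost < scalar_cost;
}

/* LOOPS must be ordered innermost first (children before parents).
   Returns false as soon as one loop level is unprofitable; otherwise
   accumulates each loop's costs into its parent, without scaling.  */
bool
slp_profitable_at_every_level (struct loop_cost **loops, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      struct loop_cost *loop = loops[i];
      if (!profitable_p (loop->scalar_cost, loop->vector_cost))
        return false; /* reject SLP */
      if (loop->outer)
        {
          loop->outer->scalar_cost += loop->scalar_cost;
          loop->outer->vector_cost += loop->vector_cost;
        }
    }
  return true;
}
```

Under this rule an inner loop with scalar cost 8 and vector cost 12 causes
rejection even when the enclosing level (say scalar 10, vector 4) would make
the unscaled total look profitable, which is the point of checking each level
separately.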

Thoughts?
