https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979

--- Comment #22 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #21)
> (In reply to Richard Biener from comment #20)
> > (In reply to Jakub Jelinek from comment #17)
> > > Not marking as fixed for GCC 15 yet, as there is the scalar cost 
> > > computation
> > > issue unresolved.
> > 
> > I will look into this part.
> 
> So this is because all load lanes and the FMADDSUB lanes are "live" and we
> try to be conservative with respect to costing.  Given we cannot be sure
> to be able to use lane-extraction to code-generate "live" (because of
> scheduling issues), the intent was to simply not cost the scalar stmts
> for the whole "live" subgraph.  The latter doesn't seem to work because
> of the check in vect_bb_slp_scalar_cost:
> 
>       /* If there is a non-vectorized use of the defs then the scalar
>          stmt is kept live in which case we do not account it or any
>          required defs in the SLP children in the scalar cost.  This
>          way we make the vectorization more costly when compared to
>          the scalar cost.  */
>       if (!STMT_VINFO_LIVE_P (stmt_info))
> 
> which keeps live[i] == false if STMT_VINFO_LIVE_P (stmt_info).  Fixing this
> would make the issue only worse - what we fail to realize is that
> 
> note: node 0x4dc9ac0 (max_nunits=2, refcnt=1) vector(2) float
> note: op template: _8 = REALPART_EXPR <a_5(D)->f>;
> note:   [l] stmt 0 _8 = REALPART_EXPR <a_5(D)->f>;
> note:   [l] stmt 1 _8 = REALPART_EXPR <a_5(D)->f>;
> note:   load permutation { 0 0 }
> 
> should be better handled as extern { _8, _8 } rather than costed as
> vector load + permute (cost of 4, compared to 12 + 4).  But then the
> conservative handling of "live" would cause a scalar cost of zero.
> 
> The vector construction from the __mulsc3 result also fails to consume
> the REAL/IMAGPART_EXPRs, but I guess that is harder to fix.
> 
> That said, the proper solution is to make "live" handling less conservative
> and compute scheduling constraints so we know whether extracted lanes can
> substitute for the original scalar ones (uses are not after the vector def).
> 
> Fixing the above conservativeness without further changes results in the
> following
> 
> gcc.target/i386/pr116979.c:15:10: note: Cost model analysis:
> _21 1 times scalar_stmt costs 12 in body
> _22 1 times scalar_stmt costs 12 in body
> _21 = PHI <_16(5), _19(3)> 1 times scalar_stmt costs 12 in body
> _22 = PHI <_17(5), _20(3)> 1 times scalar_stmt costs 12 in body
> REALPART_EXPR <a_5(D)->f> 1 times vec_perm costs 4 in body
> REALPART_EXPR <a_5(D)->f> 1 times unaligned_load (misalign -1) costs 12 in
> body
> REALPART_EXPR <b_6(D)->f> 1 times unaligned_load (misalign -1) costs 12 in
> body
> IMAGPART_EXPR <a_5(D)->f> 1 times vec_perm costs 4 in body
> IMAGPART_EXPR <a_5(D)->f> 1 times unaligned_load (misalign -1) costs 12 in
> body
> IMAGPART_EXPR <b_6(D)->f> 1 times vec_perm costs 4 in body
> IMAGPART_EXPR <b_6(D)->f> 1 times unaligned_load (misalign -1) costs 12 in
> body
> _9 * _11 1 times vector_stmt costs 16 in body
> .VEC_FMADDSUB (_8, _10, _13) 1 times vector_stmt costs 12 in body
> _21 = PHI <_16(5), _19(3)> 1 times vector_stmt costs 12 in body
> node 0x4dc9d00 1 times vec_construct costs 4 in prologue
> _21 1 times vector_store costs 12 in body
> _12 - _13 1 times vec_to_scalar costs 4 in epilogue
> _14 + _15 1 times vec_to_scalar costs 4 in epilogue
> REALPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue
> REALPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue
> REALPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue
> IMAGPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue
> IMAGPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue
> IMAGPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue
> IMAGPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue
> REALPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue
> gcc.target/i386/pr116979.c:15:10: note: Cost model analysis for part in loop
> 0:
>   Vector cost: 156
>   Scalar cost: 48
> 
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 9ad95104ec7..e7ef943d743 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -8740,6 +8740,11 @@ next_lane:
>           if ((*life)[i])
>             continue;
>         }
> +      else
> +       {
> +         (*life)[i] = true;
> +         continue;
> +       }
>  
>        /* Count scalar stmts only once.  */
>        if (gimple_visited_p (orig_stmt))

Hmm, no - it seems to work as intended (and we're not that conservative
anymore).  Instead the issue seems to be we're not finding all scalar
stmts to cost because we only consider SLP_TREE_SCALAR_STMTS.  We consider
intermediate pattern covered scalar stmts for finding whether any of those
are still live after vectorization (and not covered by live lane extraction)
but do not consider them for scalar costing.  We have corresponding code
in vect_bb_slp_mark_live_stmts (vec_slp_has_scalar_use) that does this
for the purpose of STMT_VINFO_LIVE_P computation, but costing lacks this.

Reply via email to