15 regression] fma not always used in complex product

rguenth at gcc dot gnu.org via Gcc-bugs Sun, 15 Dec 2024 03:10:58 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979


--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #20)
> (In reply to Jakub Jelinek from comment #17)
> > Not marking as fixed for GCC 15 yet, as there is the scalar cost computation
> > issue unresolved.
> 
> I will look into this part.

So this is because all load lanes and the FMADDSUB lanes are "live" and we
try to be conservative with respect to costing.  Given we cannot be sure
to be able to use lane-extraction to code-generate "live" (because of
scheduling issues), the intent was to simply not cost the scalar stmts
for the whole "live" subgraph.  The latter doesn't seem to work because
of the check in vect_bb_slp_scalar_cost:

      /* If there is a non-vectorized use of the defs then the scalar
         stmt is kept live in which case we do not account it or any
         required defs in the SLP children in the scalar cost.  This
         way we make the vectorization more costly when compared to
         the scalar cost.  */
      if (!STMT_VINFO_LIVE_P (stmt_info))

which keeps live[i] == false if STMT_VINFO_LIVE_P (stmt_info).  Fixing this
would make the issue only worse - what we fail to realize is that

note: node 0x4dc9ac0 (max_nunits=2, refcnt=1) vector(2) float
note: op template: _8 = REALPART_EXPR <a_5(D)->f>;
note:   [l] stmt 0 _8 = REALPART_EXPR <a_5(D)->f>;
note:   [l] stmt 1 _8 = REALPART_EXPR <a_5(D)->f>;
note:   load permutation { 0 0 }

should be better handled as extern { _8, _8 } rather than costed as
vector load + permute (cost of 4, compared to 12 + 4).  But then the
conservative handling of "live" would cause a scalar cost of zero.

The vector construction from the __mulsc3 result also fails to consume
the REAL/IMAGPART_EXPRs, but I guess that is harder to fix.

That said, the proper solution is to make "live" handling less conservative
and compute scheduling constraints so we know whether extracted lanes can
substitute for the original scalar ones (uses are not after the vector def).

Fixing the above conservativeness without further changes results in the
following

gcc.target/i386/pr116979.c:15:10: note: Cost model analysis:
_21 1 times scalar_stmt costs 12 in body
_22 1 times scalar_stmt costs 12 in body
_21 = PHI <_16(5), _19(3)> 1 times scalar_stmt costs 12 in body
_22 = PHI <_17(5), _20(3)> 1 times scalar_stmt costs 12 in body
REALPART_EXPR <a_5(D)->f> 1 times vec_perm costs 4 in body
REALPART_EXPR <a_5(D)->f> 1 times unaligned_load (misalign -1) costs 12 in body
REALPART_EXPR <b_6(D)->f> 1 times unaligned_load (misalign -1) costs 12 in body
IMAGPART_EXPR <a_5(D)->f> 1 times vec_perm costs 4 in body
IMAGPART_EXPR <a_5(D)->f> 1 times unaligned_load (misalign -1) costs 12 in body
IMAGPART_EXPR <b_6(D)->f> 1 times vec_perm costs 4 in body
IMAGPART_EXPR <b_6(D)->f> 1 times unaligned_load (misalign -1) costs 12 in body
_9 * _11 1 times vector_stmt costs 16 in body
.VEC_FMADDSUB (_8, _10, _13) 1 times vector_stmt costs 12 in body
_21 = PHI <_16(5), _19(3)> 1 times vector_stmt costs 12 in body
node 0x4dc9d00 1 times vec_construct costs 4 in prologue
_21 1 times vector_store costs 12 in body
_12 - _13 1 times vec_to_scalar costs 4 in epilogue
_14 + _15 1 times vec_to_scalar costs 4 in epilogue
REALPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue
REALPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue
REALPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue
IMAGPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue
IMAGPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue
IMAGPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue
IMAGPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue
REALPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue
gcc.target/i386/pr116979.c:15:10: note: Cost model analysis for part in loop 0:
  Vector cost: 156
  Scalar cost: 48

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 9ad95104ec7..e7ef943d743 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -8740,6 +8740,11 @@ next_lane:
          if ((*life)[i])
            continue;
        }
+      else
+       {
+         (*life)[i] = true;
+         continue;
+       }

       /* Count scalar stmts only once.  */
       if (gimple_visited_p (orig_stmt))

[Bug target/116979] [12/13/14/15 regression] fma not always used in complex product

Reply via email to