https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979
--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> --- (In reply to Richard Biener from comment #20) > (In reply to Jakub Jelinek from comment #17) > > Not marking as fixed for GCC 15 yet, as there is the scalar cost computation > > issue unresolved. > > I will look into this part. So this is because all load lanes and the FMADDSUB lanes are "live" and we try to be conservative with respect to costing. Given we cannot be sure to be able to use lane-extraction to code-generate "live" (because of scheduling issues), the intent was to simply not cost the scalar stmts for the whole "live" subgraph. The latter doesn't seem to work because of the check in vect_bb_slp_scalar_cost: /* If there is a non-vectorized use of the defs then the scalar stmt is kept live in which case we do not account it or any required defs in the SLP children in the scalar cost. This way we make the vectorization more costly when compared to the scalar cost. */ if (!STMT_VINFO_LIVE_P (stmt_info)) which keeps live[i] == false if STMT_VINFO_LIVE_P (stmt_info). Fixing this would make the issue only worse - what we fail to realize is that note: node 0x4dc9ac0 (max_nunits=2, refcnt=1) vector(2) float note: op template: _8 = REALPART_EXPR <a_5(D)->f>; note: [l] stmt 0 _8 = REALPART_EXPR <a_5(D)->f>; note: [l] stmt 1 _8 = REALPART_EXPR <a_5(D)->f>; note: load permutation { 0 0 } should be better handled as extern { _8, _8 } rather than costed as vector load + permute (cost of 4, compared to 12 + 4). But then the conservative handling of "live" would cause a scalar cost of zero. The vector construction from the __mulsc3 result also fails to consume the REAL/IMAGPART_EXPRs, but I guess that is harder to fix. That said, the proper solution is to make "live" handling less conservative and compute scheduling constraints so we know whether extracted lanes can substitute for the original scalar ones (uses are not after the vector def). Fixing the above conservativeness without further changes results in the following gcc.target/i386/pr116979.c:15:10: note: Cost model analysis: _21 1 times scalar_stmt costs 12 in body _22 1 times scalar_stmt costs 12 in body _21 = PHI <_16(5), _19(3)> 1 times scalar_stmt costs 12 in body _22 = PHI <_17(5), _20(3)> 1 times scalar_stmt costs 12 in body REALPART_EXPR <a_5(D)->f> 1 times vec_perm costs 4 in body REALPART_EXPR <a_5(D)->f> 1 times unaligned_load (misalign -1) costs 12 in body REALPART_EXPR <b_6(D)->f> 1 times unaligned_load (misalign -1) costs 12 in body IMAGPART_EXPR <a_5(D)->f> 1 times vec_perm costs 4 in body IMAGPART_EXPR <a_5(D)->f> 1 times unaligned_load (misalign -1) costs 12 in body IMAGPART_EXPR <b_6(D)->f> 1 times vec_perm costs 4 in body IMAGPART_EXPR <b_6(D)->f> 1 times unaligned_load (misalign -1) costs 12 in body _9 * _11 1 times vector_stmt costs 16 in body .VEC_FMADDSUB (_8, _10, _13) 1 times vector_stmt costs 12 in body _21 = PHI <_16(5), _19(3)> 1 times vector_stmt costs 12 in body node 0x4dc9d00 1 times vec_construct costs 4 in prologue _21 1 times vector_store costs 12 in body _12 - _13 1 times vec_to_scalar costs 4 in epilogue _14 + _15 1 times vec_to_scalar costs 4 in epilogue REALPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue REALPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue REALPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue IMAGPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue IMAGPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue IMAGPART_EXPR <a_5(D)->f> 1 times vec_to_scalar costs 4 in epilogue IMAGPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue REALPART_EXPR <b_6(D)->f> 1 times vec_to_scalar costs 4 in epilogue gcc.target/i386/pr116979.c:15:10: note: Cost model analysis for part in loop 0: Vector cost: 156 Scalar cost: 48 diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 9ad95104ec7..e7ef943d743 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -8740,6 +8740,11 @@ next_lane: if ((*life)[i]) continue; } + else + { + (*life)[i] = true; + continue; + } /* Count scalar stmts only once. */ if (gimple_visited_p (orig_stmt))