Richard Biener <rguent...@suse.de> writes: > On Thu, 14 Apr 2022, Richard Sandiford wrote: > >> Richard Biener <rguent...@suse.de> writes: >> > When doing BB vectorization the scalar cost compute is derailed >> > by patterns, causing lanes to be considered live and thus not >> > costed on the scalar side. For the testcase in PR104010 this >> > prevents vectorization which was done by GCC 11. PR103941 >> > shows similar cases of missed optimizations that are fixed by >> > this patch. >> > >> > Bootstrapped and tested on x86_64-unknown-linux-gnu. >> > >> > I'm only considering this now because PR104010 is identified >> > as regression on arm - Richards, what do you think? I do think >> > this will enable vectorization of more stuff now which might >> > be good or bad - who knowns, but at least it needs to involve >> > patterns. >> > >> > Thanks, >> > Richard. >> > >> > 2022-04-13 Richard Biener <rguent...@suse.de> >> > >> > PR tree-optimization/104010 >> > PR tree-optimization/103941 >> > * tree-vect-slp.cc (vect_bb_slp_scalar_cost): When >> > we run into stmts in patterns continue walking those >> > for uses outside of the vectorized region instead of >> > marking the lane live. >> > >> > * gcc.target/i386/pr103941-1.c: New testcase. >> > * gcc.target/i386/pr103941-2.c: Likewise. >> > --- >> > gcc/testsuite/gcc.target/i386/pr103941-1.c | 14 +++++++ >> > gcc/testsuite/gcc.target/i386/pr103941-2.c | 12 ++++++ >> > gcc/tree-vect-slp.cc | 47 ++++++++++++++++------ >> > 3 files changed, 61 insertions(+), 12 deletions(-) >> > create mode 100644 gcc/testsuite/gcc.target/i386/pr103941-1.c >> > create mode 100644 gcc/testsuite/gcc.target/i386/pr103941-2.c >> > >> > diff --git a/gcc/testsuite/gcc.target/i386/pr103941-1.c >> > b/gcc/testsuite/gcc.target/i386/pr103941-1.c >> > new file mode 100644 >> > index 00000000000..524fdd0b4b1 >> > --- /dev/null >> > +++ b/gcc/testsuite/gcc.target/i386/pr103941-1.c >> > @@ -0,0 +1,14 @@ >> > +/* { dg-do compile } */ >> > +/* { dg-options "-O2 -msse2" } */ >> > + >> > +unsigned char ur[16], ua[16], ub[16]; >> > + >> > +void avgu_v2qi (void) >> > +{ >> > + int i; >> > + >> > + for (i = 0; i < 2; i++) >> > + ur[i] = (ua[i] + ub[i] + 1) >> 1; >> > +} >> > + >> > +/* { dg-final { scan-assembler "pavgb" } } */ >> > diff --git a/gcc/testsuite/gcc.target/i386/pr103941-2.c >> > b/gcc/testsuite/gcc.target/i386/pr103941-2.c >> > new file mode 100644 >> > index 00000000000..972a32be997 >> > --- /dev/null >> > +++ b/gcc/testsuite/gcc.target/i386/pr103941-2.c >> > @@ -0,0 +1,12 @@ >> > +/* { dg-do compile } */ >> > +/* { dg-options "-O2 -msse2" } */ >> > + >> > +void foo (int *c, float *x, float *y) >> > +{ >> > + c[0] = x[0] < y[0]; >> > + c[1] = x[1] < y[1]; >> > + c[2] = x[2] < y[2]; >> > + c[3] = x[3] < y[3]; >> > +} >> > + >> > +/* { dg-final { scan-assembler "cmpltps" } } */ >> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc >> > index 4ac2b70303c..c7687065374 100644 >> > --- a/gcc/tree-vect-slp.cc >> > +++ b/gcc/tree-vect-slp.cc >> > @@ -5185,22 +5185,45 @@ vect_bb_slp_scalar_cost (vec_info *vinfo, >> > the scalar cost. */ >> > if (!STMT_VINFO_LIVE_P (stmt_info)) >> > { >> > - FOR_EACH_PHI_OR_STMT_DEF (def_p, orig_stmt, op_iter, SSA_OP_DEF) >> > + auto_vec<gimple *, 8> worklist; >> > + hash_set<gimple *> *worklist_visited = NULL; >> > + worklist.quick_push (orig_stmt); >> > + do >> > { >> > - imm_use_iterator use_iter; >> > - gimple *use_stmt; >> > - FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, DEF_FROM_PTR (def_p)) >> > - if (!is_gimple_debug (use_stmt)) >> > - { >> > - stmt_vec_info use_stmt_info = vinfo->lookup_stmt (use_stmt); >> > - if (!use_stmt_info >> > - || !vectorized_scalar_stmts.contains (use_stmt_info)) >> > + gimple *work_stmt = worklist.pop (); >> > + FOR_EACH_PHI_OR_STMT_DEF (def_p, work_stmt, op_iter, SSA_OP_DEF) >> > + { >> > + imm_use_iterator use_iter; >> > + gimple *use_stmt; >> > + FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, >> > + DEF_FROM_PTR (def_p)) >> > + if (!is_gimple_debug (use_stmt)) >> > { >> > - (*life)[i] = true; >> > - break; >> > + stmt_vec_info use_stmt_info >> > + = vinfo->lookup_stmt (use_stmt); >> > + if (!use_stmt_info >> > + || !vectorized_scalar_stmts.contains >> > (use_stmt_info)) >> > + { >> > + if (STMT_VINFO_IN_PATTERN_P (use_stmt_info)) >> > + { >> >> I guess I should walk through the testcase and figure it out for myself, >> but: I assume vectorized_scalar_stmts exists because not every statement >> we've considered vectorising has made the cut. Isn't that also true >> of (original) scalar statements that would have been vectorised using >> patterns? >> >> Does vectorized_scalar_stmts record original statements or statements to >> vectorise? From its name I'd have assumed original statements, in which >> case I wouldn't have expected IN_PATTERN_P to need special handling. > > vectorized_scalar_stmts records statements to vectorize, and yes, that's > done because PURE_SLP isn't accurate when we promoted SLP nodes to > extern because they failed to vectorize.
Ah, OK. > But it's true that in case a pattern is composed from multiple scalar > stmts we only cost the scalar root of the pattern. As I said in the PR > the whole thing is up for another rewrite - we're basically trying to > compute the "scalar stmt cover" of the vectorized stmts and then > try to exclude those scalar stmts in the cover that have uses outside > of it. Currently we're split this problem in an odd way that makes > doing it correctly difficult - I hope there's a better way but I > haven't yet found the time to do it. Ideally we'd also merge it with > the copy that computes live lanes to be code generated as element > extracts. And IIRC there's another variant somewhere, I think with > the SLP graph partitioning. > > If you have good ideas or want to give it a stab be welcome ;) No, no good ideas. I just wasn't sure why, if vectorized_scalar_stmts records statements to vectorize, we couldn't just use vect_stmt_to_vectorize (use_stmt_info). But maybe that isn't reliable for SLP. Anyway, you knoew this code much better than me :-), so I've certainly no objections to the patch for GCC 12. We've done some aarch64 costing work recently but I don't remember a case in which we relied/worked around the specific behaviour being patched here. Thanks, Richard