Richard Biener <rguent...@suse.de> writes:
> On Thu, 14 Apr 2022, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> > When doing BB vectorization the scalar cost compute is derailed
>> > by patterns, causing lanes to be considered live and thus not
>> > costed on the scalar side.  For the testcase in PR104010 this
>> > prevents vectorization which was done by GCC 11.  PR103941
>> > shows similar cases of missed optimizations that are fixed by
>> > this patch.
>> >
>> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
>> >
>> > I'm only considering this now because PR104010 is identified
>> > as regression on arm - Richards, what do you think?  I do think
>> > this will enable vectorization of more stuff now which might
>> > be good or bad - who knowns, but at least it needs to involve
>> > patterns.
>> >
>> > Thanks,
>> > Richard.
>> >
>> > 2022-04-13  Richard Biener  <rguent...@suse.de>
>> >
>> >    PR tree-optimization/104010
>> >    PR tree-optimization/103941
>> >    * tree-vect-slp.cc (vect_bb_slp_scalar_cost): When
>> >    we run into stmts in patterns continue walking those
>> >    for uses outside of the vectorized region instead of
>> >    marking the lane live.
>> >
>> >    * gcc.target/i386/pr103941-1.c: New testcase.
>> >    * gcc.target/i386/pr103941-2.c: Likewise.
>> > ---
>> >  gcc/testsuite/gcc.target/i386/pr103941-1.c | 14 +++++++
>> >  gcc/testsuite/gcc.target/i386/pr103941-2.c | 12 ++++++
>> >  gcc/tree-vect-slp.cc                       | 47 ++++++++++++++++------
>> >  3 files changed, 61 insertions(+), 12 deletions(-)
>> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103941-1.c
>> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103941-2.c
>> >
>> > diff --git a/gcc/testsuite/gcc.target/i386/pr103941-1.c 
>> > b/gcc/testsuite/gcc.target/i386/pr103941-1.c
>> > new file mode 100644
>> > index 00000000000..524fdd0b4b1
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/i386/pr103941-1.c
>> > @@ -0,0 +1,14 @@
>> > +/* { dg-do compile } */
>> > +/* { dg-options "-O2 -msse2" } */
>> > +
>> > +unsigned char ur[16], ua[16], ub[16];
>> > +
>> > +void avgu_v2qi (void)
>> > +{
>> > +  int i;
>> > +
>> > +  for (i = 0; i < 2; i++)
>> > +    ur[i] = (ua[i] + ub[i] + 1) >> 1;
>> > +}
>> > +
>> > +/* { dg-final { scan-assembler "pavgb" } } */
>> > diff --git a/gcc/testsuite/gcc.target/i386/pr103941-2.c 
>> > b/gcc/testsuite/gcc.target/i386/pr103941-2.c
>> > new file mode 100644
>> > index 00000000000..972a32be997
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/i386/pr103941-2.c
>> > @@ -0,0 +1,12 @@
>> > +/* { dg-do compile } */
>> > +/* { dg-options "-O2 -msse2" } */
>> > +
>> > +void foo (int *c, float *x, float *y)
>> > +{
>> > +  c[0] = x[0] < y[0];
>> > +  c[1] = x[1] < y[1];
>> > +  c[2] = x[2] < y[2];
>> > +  c[3] = x[3] < y[3];
>> > +}
>> > +
>> > +/* { dg-final { scan-assembler "cmpltps" } } */
>> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> > index 4ac2b70303c..c7687065374 100644
>> > --- a/gcc/tree-vect-slp.cc
>> > +++ b/gcc/tree-vect-slp.cc
>> > @@ -5185,22 +5185,45 @@ vect_bb_slp_scalar_cost (vec_info *vinfo,
>> >     the scalar cost.  */
>> >        if (!STMT_VINFO_LIVE_P (stmt_info))
>> >    {
>> > -    FOR_EACH_PHI_OR_STMT_DEF (def_p, orig_stmt, op_iter, SSA_OP_DEF)
>> > +    auto_vec<gimple *, 8> worklist;
>> > +    hash_set<gimple *> *worklist_visited = NULL;
>> > +    worklist.quick_push (orig_stmt);
>> > +    do
>> >        {
>> > -        imm_use_iterator use_iter;
>> > -        gimple *use_stmt;
>> > -        FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, DEF_FROM_PTR (def_p))
>> > -          if (!is_gimple_debug (use_stmt))
>> > -            {
>> > -              stmt_vec_info use_stmt_info = vinfo->lookup_stmt (use_stmt);
>> > -              if (!use_stmt_info
>> > -                  || !vectorized_scalar_stmts.contains (use_stmt_info))
>> > +        gimple *work_stmt = worklist.pop ();
>> > +        FOR_EACH_PHI_OR_STMT_DEF (def_p, work_stmt, op_iter, SSA_OP_DEF)
>> > +          {
>> > +            imm_use_iterator use_iter;
>> > +            gimple *use_stmt;
>> > +            FOR_EACH_IMM_USE_STMT (use_stmt, use_iter,
>> > +                                   DEF_FROM_PTR (def_p))
>> > +              if (!is_gimple_debug (use_stmt))
>> >                  {
>> > -                  (*life)[i] = true;
>> > -                  break;
>> > +                  stmt_vec_info use_stmt_info
>> > +                    = vinfo->lookup_stmt (use_stmt);
>> > +                  if (!use_stmt_info
>> > +                      || !vectorized_scalar_stmts.contains 
>> > (use_stmt_info))
>> > +                    {
>> > +                      if (STMT_VINFO_IN_PATTERN_P (use_stmt_info))
>> > +                        {
>> 
>> I guess I should walk through the testcase and figure it out for myself,
>> but: I assume vectorized_scalar_stmts exists because not every statement
>> we've considered vectorising has made the cut.  Isn't that also true
>> of (original) scalar statements that would have been vectorised using
>> patterns?
>> 
>> Does vectorized_scalar_stmts record original statements or statements to
>> vectorise?  From its name I'd have assumed original statements, in which
>> case I wouldn't have expected IN_PATTERN_P to need special handling.
>
> vectorized_scalar_stmts records statements to vectorize, and yes, that's
> done because PURE_SLP isn't accurate when we promoted SLP nodes to
> extern because they failed to vectorize.

Ah, OK.

> But it's true that in case a pattern is composed from multiple scalar
> stmts we only cost the scalar root of the pattern.  As I said in the PR
> the whole thing is up for another rewrite - we're basically trying to
> compute the "scalar stmt cover" of the vectorized stmts and then
> try to exclude those scalar stmts in the cover that have uses outside
> of it.  Currently we're split this problem in an odd way that makes
> doing it correctly difficult - I hope there's a better way but I
> haven't yet found the time to do it.  Ideally we'd also merge it with
> the copy that computes live lanes to be code generated as element
> extracts.  And IIRC there's another variant somewhere, I think with
> the SLP graph partitioning.
>
> If you have good ideas or want to give it a stab be welcome ;)

No, no good ideas.  I just wasn't sure why, if vectorized_scalar_stmts
records statements to vectorize, we couldn't just use
vect_stmt_to_vectorize (use_stmt_info).  But maybe that isn't
reliable for SLP.

Anyway, you knoew this code much better than me :-), so I've certainly
no objections to the patch for GCC 12.  We've done some aarch64 costing
work recently but I don't remember a case in which we relied/worked
around the specific behaviour being patched here.

Thanks,
Richard

Reply via email to