On Tue, 4 Jul 2023, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Thu, 29 Jun 2023, Richard Biener wrote:
> >
> >> On Thu, 29 Jun 2023, Richard Sandiford wrote:
> >> 
> >> > Richard Biener <rguent...@suse.de> writes:
> >> > > With applying loop masking to epilogues on x86_64 AVX512 we see
> >> > > some significant performance regressions when evaluating SPEC CPU 2017
> >> > > that are caused by store-to-load forwarding failures across outer
> >> > > loop iterations when the inner loop does not iterate.  Consider
> >> > >
> >> > >   for (j = 0; j < m; ++j)
> >> > >     for (i = 0; i < n; ++i)
> >> > >       a[j*n + i] += b[j*n + i];
> >> > >
> >> > > with 'n' chosen so that the vectorized inner loop code is fully
> >> > > executed by the masked epilogue, with that masked epilogue
> >> > > storing O > n elements (elements >= n masked off, of course).
> >> > > Then the masked load performed for the next outer loop iteration
> >> > > will get a hit in the store queue but it obviously cannot forward
> >> > > so we have to wait for the store to retire.
> >> > >
> >> > > That causes a significant hit to performance, especially if a
> >> > > non-masked epilogue would have fully covered 'n' as well
> >> > > (say n == 4 for a V4DImode epilogue), avoiding the need for
> >> > > store-forwarding and the wait for the store to retire.
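> >> > >
> >> > > As a hand-written illustration of that situation (assuming a V8DF
> >> > > masked epilogue and n == 4; the function and the exact instruction
> >> > > sequence are made up, not the vectorizer's actual output):
> >> > >
> >> > >   #include <immintrin.h>
> >> > >
> >> > >   /* One inner loop instance, executed entirely by the masked
> >> > >      epilogue.  */
> >> > >   void row_add (double *a, const double *b)
> >> > >   {
> >> > >     __mmask8 k = 0x0f;                    /* 4 of 8 lanes active */
> >> > >     __m512d va = _mm512_maskz_loadu_pd (k, a);
> >> > >     __m512d vb = _mm512_maskz_loadu_pd (k, b);
> >> > >     _mm512_mask_storeu_pd (a, k, _mm512_add_pd (va, vb));
> >> > >     /* The masked load for the next outer iteration reads from
> >> > >        a + 4, gets a hit in the store queue for this store but
> >> > >        cannot forward from it (the upper lanes were never
> >> > >        written), so it has to wait for the store to retire.  */
> >> > >   }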
> >> > >
> >> > > The following applies a very simple heuristic, disabling
> >> > > the use of loop masking when there's a memory reference pair
> >> > > with dependence distance zero.  That resolves the issue
> >> > > (other problematic dependence distances seem to be less common
> >> > > at least).
> >> > >
> >> > > I have applied this heuristic in generic vectorizer code but
> >> > > restricted it to non-VL vector sizes.  There currently isn't
> >> > > a way for the target to request disabling of masking only,
> >> > > and while we can reject the vectorization at costing time, that
> >> > > will not re-consider the same vector mode without masking.
> >> > > It seems simply re-costing with masking disabled should be
> >> > > possible though, we'd just need an indication whether that
> >> > > should be done?  Maybe always when the current vector mode is
> >> > > of fixed size?
> >> > >
> >> > > I wonder how SVE vectorized code behaves in these situations?
> >> > > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
> >> > > 503.bwaves_r though I think both will need a hardware vector
> >> > > size covering at least 8 doubles to show the issue.  527.cam4_r
> >> > > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
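> >> > >
> >> > > (Why at least 8 doubles: with V8DF vectors an inner loop of 4 or
> >> > > 5 doubles is executed entirely by the masked epilogue, while with
> >> > > only V4DF the n == 4 case is covered exactly by one full unmasked
> >> > > vector and the issue doesn't arise.)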
> >> > >
> >> > > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
> >> > >
> >> > > Any comments?
> >> > >
> >> > > Thanks,
> >> > > Richard.
> >> > >
> >> > >        PR target/110456
> >> > >        * tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
> >> > >        * tree-vectorizer.cc (vec_info_shared::vec_info_shared):
> >> > >        Initialize has_zero_dep_dist.
> >> > >        * tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
> >> > >        Remember if we've seen a dependence distance of zero.
> >> > >        * tree-vect-stmts.cc (check_load_store_for_partial_vectors):
> >> > >        When we've seen a dependence distance of zero and the vector
> >> > >        type has constant size, disable the use of partial vectors.
> >> > > ---
> >> > >  gcc/tree-vect-data-refs.cc |  2 ++
> >> > >  gcc/tree-vect-stmts.cc     | 10 ++++++++++
> >> > >  gcc/tree-vectorizer.cc     |  1 +
> >> > >  gcc/tree-vectorizer.h      |  3 +++
> >> > >  4 files changed, 16 insertions(+)
> >> > >
> >> > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> >> > > index ebe93832b1e..40cde95c16a 100644
> >> > > --- a/gcc/tree-vect-data-refs.cc
> >> > > +++ b/gcc/tree-vect-data-refs.cc
> >> > > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
> >> > >                             "dependence distance == 0 between %T and %T\n",
> >> > >                             DR_REF (dra), DR_REF (drb));
> >> > >  
> >> > > +        loop_vinfo->shared->has_zero_dep_dist = true;
> >> > > +
> >> > >          /* When we perform grouped accesses and perform implicit CSE
> >> > >             by detecting equal accesses and doing disambiguation with
> >> > >             runtime alias tests like for
> >> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> >> > > index d642d3c257f..3bcbc000323 100644
> >> > > --- a/gcc/tree-vect-stmts.cc
> >> > > +++ b/gcc/tree-vect-stmts.cc
> >> > > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
> >> > >        using_partial_vectors_p = true;
> >> > >      }
> >> > >  
> >> > > +  if (loop_vinfo->shared->has_zero_dep_dist
> >> > > +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> >> > 
> >> > I don't think it makes sense to treat VLA and VLS differently here.
> >> > 
> >> > But RMW operations are very common, so it seems like we're giving up a
> >> > lot on the off chance that the inner loop is applied iteratively
> >> > to successive memory locations.
> >> 
> >> Yes ...
> >> 
> >> > Maybe that's still OK for AVX512, where I guess loop masking is more
> >> > of a niche use case.  But if so, then yeah, I think a hook to disable
> >> > masking might be better here.
> >> 
> >> It's a niche use case in that if you cost the main vector loop
> >> with and without masking, the non-masked case always wins.
> >> I understand with SVE it would be the same if you fix the
> >> vector size, but otherwise masking is a win as there's
> >> the chance the HW implementation uses bigger vectors than the
> >> architected minimum size.
> >> 
> >> So for AVX512 the win is with the epilogue and the case of
> >> few scalar iterations where the epilogue iterations play
> >> a significant role.  Since we only vectorize the epilogue
> >> of the main loop but not the epilogue of the epilogue loop
> >> we're leaving quite some iterations unvectorized when the
> >> main loop uses 512bit vectors and the epilogue 256bit.
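> >>
> >> To put numbers on that (just an example): with 29 scalar iterations
> >> a 512bit main loop covers 24 of them and the 256bit vector epilogue
> >> 4 more, leaving one iteration to the scalar epilogue - and with
> >> fewer than 8 scalar iterations the main vector loop doesn't run
> >> at all.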
> >> 
> >> But that case specifically is also prone to make the cross-iteration
> >> issue wrt an outer loop significant ...
> >> 
> >> I'll see to address this on the costing side somehow.
> >
> > So with us requiring both partial vector and non-partial vector variants
> > working during vectorizable_* I thought it should be possible to
> > simply reset LOOP_VINFO_USING_PARTIAL_VECTORS_P and re-cost without
> > re-analyzing in case the target deemed the loop using partial
> > vectors not profitable.
> >
> > While the following successfully hacks this in-place, the question is
> > whether the above is really true for VLA vector types?
> 
> Yeah, I think so.  We do support full-vector VLA.
> 
> > Apart from this "working", maybe the targets should get to say
> > more specifically what they'd like to change - should we make
> > the finish_cost () hook have a more elaborate return value or
> > provide more hints like we already do with the suggested unroll factor?
> >
> > I can imagine a "partial vector usage is bad" or "too high VF"?
> > But then we probably still want to re-do the final costing part
> > so we'd need a way to throw away the per-stmt costing we already
> > did (something nicer than the hack below).  Maybe make
> > 'bool costing_for_scalar' a tri-state, adding 'produce copy
> > from current vector cost in vinfo'?
> 
> Not sure I follow the tri-state thing.  If the idea is to redo
> the vector_costs::add_stmt_costs stuff, could we just export
> the cost_vec from vect_analyze_loop_operations and apply the
> costs in vect_analyze_loop_costing instead?

I think we don't need to re-do the add_stmt_costs, but yes, I guess
we could do this and basically initialize the backend cost info
only at vect_analyze_loop_costing time.  Is that what you suggest?
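
In shape I'd think of something like the following (hypothetical
names and a toy cost model, not the actual GCC interfaces) - the
per-stmt records stay on the vectorizer side and get replayed into
a fresh backend costing object once the partial-vectors decision
is final:

  #include <vector>

  /* Hypothetical sketch, not the actual GCC types.  */
  struct stmt_cost_record { int kind; int count; };

  struct backend_costs
  {
    bool masked;
    int body_cost = 0;
    explicit backend_costs (bool masked_) : masked (masked_) {}
    void add_stmt_cost (const stmt_cost_record &c)
    { body_cost += c.count * (masked ? 2 : 1); /* toy cost model */ }
    int finish () const { return body_cost; }
  };

  /* Replay the recorded costs into a fresh backend object; retrying
     with masking disabled then needs no re-analysis of the stmts.  */
  static int
  cost_body (const std::vector<stmt_cost_record> &cost_vec, bool masked)
  {
    backend_costs costs (masked);
    for (const stmt_cost_record &c : cost_vec)
      costs.add_stmt_cost (c);
    return costs.finish ();
  }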

> That seems worthwhile anyway, if we think the costs are going
> to depend on partial vs. full vectors.  As things stand, we could
> already cost with CAN_USE_PARTIAL_VECTORS_P set to 1 and then later
> set it to 0.

Yeah, at the time we feed stmts to the backend
(in vect_analyze_loop_operations) we have not yet committed to
using partial vectors or not.

What I meant by the tri-state thing is to give the vectorizer
an idea of how best to recover here without wasting too many cycles.
At least costing masking vs. not masking should involve only
re-running the costing itself, so we could unconditionally try that,
as the patch does when the masked variant is rejected by the
backend (via setting the cost to INT_MAX).  The question above was
whether with VLA vectors we could even code generate with
!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P?

> > Anyway - the takeaway is the x86 backend would like a way to disable
> > the use of partial vectors for the epilogue (or main loop) in some cases.
> > An alternative to doing this somehow via the costing hooks would be
> > to add a new hook - the specific data dependence check could be
> > a hook invoked after dependence checking, initializing
> > LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P?
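> >
> > In hook form (purely hypothetical, no such hook exists today) that
> > could look like:
> >
> >   /* Invoked after dependence checking; a false return would clear
> >      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P.  */
> >   bool (*can_use_partial_vectors_p) (loop_vec_info);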
> >
> > Any good ideas?  Anything that comes to your minds that would be
> > useful in this area for other targets?
> >
> > Thanks,
> > Richard.
> >
> > The following applies ontop of the earlier posted patch in this thread.
> >
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 8989985700a..a892f72ded3 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23976,6 +23976,10 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> >       && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
> >           > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
> >     m_costs[vect_body] = INT_MAX;
> > +
> > +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> > +     && loop_vinfo->shared->has_zero_dep_dist)
> > +   m_costs[vect_body] = INT_MAX;
> >      }
> >  
> >    vector_costs::finish_cost (scalar_costs);
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 3b46c58a8d8..85802f07443 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -2773,6 +2773,7 @@ start_over:
> >      }
> >  
> >    loop_vinfo->vector_costs = init_cost (loop_vinfo, false);
> > +  vector_costs saved_costs = *loop_vinfo->vector_costs;
> 
> The target can derive the class and add its own member variables,
> so I think we'd need some kind of clone method for this to be safe.
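>
> Something like the usual virtual clone idiom, say (sketch only,
> with a made-up member; the real vector_costs layout differs):
>
>   class vector_costs
>   {
>   public:
>     virtual ~vector_costs () = default;
>     virtual vector_costs *clone () const
>     { return new vector_costs (*this); }
>     /* common cost state ...  */
>   };
>
>   class ix86_vector_costs : public vector_costs
>   {
>     /* Target state that plain assignment through the base type
>        would slice away.  */
>     unsigned m_hypothetical_state = 0;
>   public:
>     vector_costs *clone () const override
>     { return new ix86_vector_costs (*this); }
>   };
>
> Saving the state would then be "saved_costs = costs->clone ()"
> rather than copying through the base object.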
> 
> >    /* Analyze the alignment of the data-refs in the loop.
> >       Fail if a data reference is found that cannot be vectorized.  */
> > @@ -3017,6 +3018,8 @@ start_over:
> >     LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
> >      }
> >  
> > +  saved_costs = *loop_vinfo->vector_costs;
> > +again_no_partial_vectors:
> >    /* If we're vectorizing an epilogue loop, the vectorized loop either needs
> >       to be able to handle fewer than VF scalars, or needs to have a lower VF
> >       than the main loop.  */
> > @@ -3043,6 +3046,17 @@ start_over:
> >      {
> >        ok = opt_result::failure_at (vect_location,
> >                                "Loop costings may not be worthwhile.\n");
> > +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
> > +   {
> > +     if (dump_enabled_p ())
> > +       dump_printf_loc (MSG_NOTE, vect_location,
> > +                        "trying with partial vectors disabled\n");
> > +     LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > +     LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = false;
> > +     LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = false;
> 
> I guess these last two variables should have been set after
> vect_determine_partial_vectors_and_peeling rather than before.
> Sorry for not noticing earlier.
> 
> The approach seems OK to me otherwise FWIW.

OK, I'll try postponing the target cost creation and re-feeding
the cost_vec; that should avoid the need to copy the target cost
structure as well.

Thanks,
Richard.

> Thanks,
> Richard
> 
> > +     *loop_vinfo->vector_costs = saved_costs;
> > +     goto again_no_partial_vectors;
> > +   }
> >        goto again;
> >      }
> >    if (!res)
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 08d071463fb..003af878ee7 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -1854,7 +1854,7 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
> >        using_partial_vectors_p = true;
> >      }
> >  
> > -  if (loop_vinfo->shared->has_zero_dep_dist
> > +  if (0 && loop_vinfo->shared->has_zero_dep_dist
> >        && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> >      {
> >        if (dump_enabled_p ())
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
