Richard Biener <rguent...@suse.de> writes:
> On Tue, 4 Jul 2023, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> > On Thu, 29 Jun 2023, Richard Biener wrote:
>> >
>> >> On Thu, 29 Jun 2023, Richard Sandiford wrote:
>> >>
>> >> > Richard Biener <rguent...@suse.de> writes:
>> >> > > With applying loop masking to epilogues on x86_64 AVX512 we see
>> >> > > some significant performance regressions when evaluating SPEC CPU 2017
>> >> > > that are caused by store-to-load forwarding failures across outer
>> >> > > loop iterations when the inner loop does not iterate.  Consider
>> >> > >
>> >> > >   for (j = 0; j < m; ++j)
>> >> > >     for (i = 0; i < n; ++i)
>> >> > >       a[j*n + i] += b[j*n + i];
>> >> > >
>> >> > > with 'n' chosen so that the inner loop vectorized code is fully
>> >> > > executed by the masked epilogue, with that masked epilogue
>> >> > > storing O > n elements (elements >= n masked, of course).
>> >> > > Then the masked load performed for the next outer loop iteration
>> >> > > will get a hit in the store queue but it obviously cannot forward,
>> >> > > so we have to wait for the store to retire.
>> >> > >
>> >> > > That causes a significant hit to performance, especially if 'n'
>> >> > > would have allowed a non-masked epilogue to fully cover 'n' as well
>> >> > > (say n == 4 for a V4DImode epilogue), avoiding the need for
>> >> > > store-forwarding and waiting for the retiring of the store.
>> >> > >
>> >> > > The following applies a very simple heuristic, disabling
>> >> > > the use of loop masking when there's a memory reference pair
>> >> > > with dependence distance zero.  That resolves the issue
>> >> > > (other problematic dependence distances seem to be less common
>> >> > > at least).
>> >> > >
>> >> > > I have applied this heuristic in generic vectorizer code but
>> >> > > restricted it to non-VL vector sizes.  There currently isn't
>> >> > > a way for the target to request disabling of masking only;
>> >> > > while we can reject the vectorization at costing time, that will
>> >> > > not re-consider the same vector mode but without masking.
>> >> > > It seems simply re-costing with masking disabled should be
>> >> > > possible though, we'd just need an indication whether that
>> >> > > should be done?  Maybe always when the current vector mode is
>> >> > > of fixed size?
>> >> > >
>> >> > > I wonder how SVE vectorized code behaves in these situations?
>> >> > > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
>> >> > > 503.bwaves_r, though I think both will need a hardware vector
>> >> > > size covering at least 8 doubles to show the issue.  527.cam4_r
>> >> > > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
>> >> > >
>> >> > > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
>> >> > >
>> >> > > Any comments?
>> >> > >
>> >> > > Thanks,
>> >> > > Richard.
>> >> > >
>> >> > >         PR target/110456
>> >> > >         * tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
>> >> > >         * tree-vectorizer.cc (vec_info_shared::vec_info_shared):
>> >> > >         Initialize has_zero_dep_dist.
>> >> > >         * tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
>> >> > >         Remember if we've seen a dependence distance of zero.
>> >> > >         * tree-vect-stmts.cc (check_load_store_for_partial_vectors):
>> >> > >         When we've seen a dependence distance of zero and the vector
>> >> > >         type has constant size disable the use of partial vectors.
>> >> > > ---
>> >> > >  gcc/tree-vect-data-refs.cc |  2 ++
>> >> > >  gcc/tree-vect-stmts.cc     | 10 ++++++++++
>> >> > >  gcc/tree-vectorizer.cc     |  1 +
>> >> > >  gcc/tree-vectorizer.h      |  3 +++
>> >> > >  4 files changed, 16 insertions(+)
>> >> > >
>> >> > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
>> >> > > index ebe93832b1e..40cde95c16a 100644
>> >> > > --- a/gcc/tree-vect-data-refs.cc
>> >> > > +++ b/gcc/tree-vect-data-refs.cc
>> >> > > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
>> >> > >                           "dependence distance == 0 between %T and %T\n",
>> >> > >                           DR_REF (dra), DR_REF (drb));
>> >> > >
>> >> > > +      loop_vinfo->shared->has_zero_dep_dist = true;
>> >> > > +
>> >> > >        /* When we perform grouped accesses and perform implicit CSE
>> >> > >           by detecting equal accesses and doing disambiguation with
>> >> > >           runtime alias tests like for
>> >> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> >> > > index d642d3c257f..3bcbc000323 100644
>> >> > > --- a/gcc/tree-vect-stmts.cc
>> >> > > +++ b/gcc/tree-vect-stmts.cc
>> >> > > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>> >> > >        using_partial_vectors_p = true;
>> >> > >      }
>> >> > >
>> >> > > +  if (loop_vinfo->shared->has_zero_dep_dist
>> >> > > +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
>> >> >
>> >> > I don't think it makes sense to treat VLA and VLS differently here.
>> >> >
>> >> > But RMW operations are very common, so it seems like we're giving up a
>> >> > lot on the off chance that the inner loop is applied iteratively
>> >> > to successive memory locations.
>> >>
>> >> Yes ...
>> >>
>> >> > Maybe that's still OK for AVX512, where I guess loop masking is more
>> >> > of a niche use case.  But if so, then yeah, I think a hook to disable
>> >> > masking might be better here.
>> >>
>> >> It's a niche use case in that if you'd cost the main vector loop
>> >> with and without masking, the non-masked case is always winning.
>> >> I understand with SVE it would be the same if you fix the
>> >> vector size, but otherwise it would be a win to mask as there's
>> >> the chance the HW implementation uses bigger vectors than the
>> >> architected minimum size.
>> >>
>> >> So for AVX512 the win is with the epilogue and the case of
>> >> few scalar iterations where the epilogue iterations play
>> >> a significant role.  Since we only vectorize the epilogue
>> >> of the main loop but not the epilogue of the epilogue loop,
>> >> we're leaving quite some iterations unvectorized when the
>> >> main loop uses 512-bit vectors and the epilogue 256-bit.
>> >>
>> >> But that case specifically is also prone to make the cross-iteration
>> >> issue wrt an outer loop significant ...
>> >>
>> >> I'll see to address this on the costing side somehow.
>> >
>> > So with us requiring both partial vector and non-partial vector variants
>> > working during vectorizable_*, I thought it should be possible to
>> > simply reset LOOP_VINFO_USING_PARTIAL_VECTORS_P and re-cost without
>> > re-analyzing in case the target deemed the loop using partial vectors
>> > not profitable.
>> >
>> > While the following successfully hacks this in place, the question is
>> > whether the above is really true for VLA vector types?
>>
>> Yeah, I think so.  We do support full-vector VLA.
>>
>> > Besides from this "working", maybe the targets should get to say
>> > more specifically what they'd like to change - should we make
>> > the finish_cost () hook have a more elaborate return value or
>> > provide more hints like we already do with the suggested unroll factor?
>> >
>> > I can imagine a "partial vector usage is bad" or "too high VF"?
>> > But then we probably still want to re-do the final costing part,
>> > so we'd need a way to push away the per-stmt costing we already
>> > did (sth nicer than the hack below).  Maybe make
>> > 'bool costing_for_scalar' a tri-state, adding 'produce copy
>> > from current vector cost in vinfo'?
>>
>> Not sure I follow the tri-state thing.  If the idea is to redo
>> the vector_costs::add_stmt_costs stuff, could we just export
>> the cost_vec from vect_analyze_loop_operations and apply the
>> costs in vect_analyze_loop_costing instead?
>
> I think we don't need to re-do the add_stmt_costs, but yes, I guess
> we could do this and basically initialize the backend cost info
> only at vect_analyze_loop_costing time.  Is that what you suggest?
I wasn't thinking that far ahead, but yeah, if we can do that, it
definitely sounds better.

>> That seems worthwhile anyway, if we think the costs are going
>> to depend on partial vs. full vectors.  As things stand, we could
>> already cost with CAN_USE_PARTIAL_VECTORS_P set to 1 and then later
>> set it to 0.
>
> Yeah, at the time we feed stmts to the backend
> (in vect_analyze_loop_operations) we have not yet committed to
> using partial vectors or not.
>
> What I meant with the tri-state thing is to give the vectorizer
> an idea how to best recover here without wasting too many cycles.
> At least costing masking vs. not masking should involve only
> re-running the costing itself, so we could unconditionally try that
> as the patch does when the masked variant is rejected by the
> backend (via setting cost to INT_MAX).  The question above was
> whether with VLA vectors we could even code generate with
> !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P?

Yeah, we can.  And we currently do for things like reversed loops.  E.g.:

int x[100];
int y[100];
void f() {
  for (int i = 0; i < 100; ++i)
    x[i] = y[99 - i];
}

https://godbolt.org/z/rhqc9dWxs

Admittedly that's only because we haven't added loop predication for
reversed loads yet, rather than a deliberate choice.  But
vect-partial-vector-usage=1 makes sense for VLA too, and is supported there.

Thanks,
Richard
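For concreteness, a self-contained sketch of the scenario the patch
description above is about is given below.  The loop nest is the one from
that mail; the values N == 5 (the inner trip count the description
attributes to 503.bwaves_r), M, the repeat loop, the double element type
and the noipa attribute are illustrative assumptions added to make it
compile and run, not taken from the benchmarks.  The idea: with a 512-bit
masked epilogue the store for row j covers 8 doubles (the top 3 masked),
while the load for row j + 1 starts only 5 doubles later, so it hits the
still-pending store and cannot be forwarded.

/* Hypothetical reproducer for the cross-outer-iteration
   store-to-load-forwarding stall described above.  With N == 5 a
   512-bit masked epilogue stores 8 doubles, masking the top 3; the
   masked load for the next outer iteration starts at a[(j+1)*5],
   overlaps that store, and has to wait for it to retire.  */

#define N 5
#define M 4096

static double a[M * N], b[M * N];

__attribute__ ((noipa)) void
add_rows (void)
{
  for (int j = 0; j < M; ++j)
    for (int i = 0; i < N; ++i)
      a[j * N + i] += b[j * N + i];
}

int
main (void)
{
  for (int r = 0; r < 100000; ++r)
    add_rows ();
  return 0;
}

Compiling with something like -O3 -march=znver4 (or another AVX512
target) and --param vect-partial-vector-usage=1 should let the inner
loop be handled entirely by the masked epilogue; the exact options
needed to trigger that path are an assumption, not taken from the
thread.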