Richard Biener <rguent...@suse.de> writes:
> On Tue, 4 Jul 2023, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> > On Thu, 29 Jun 2023, Richard Biener wrote:
>> >
>> >> On Thu, 29 Jun 2023, Richard Sandiford wrote:
>> >>
>> >> > Richard Biener <rguent...@suse.de> writes:
>> >> > > With applying loop masking to epilogues on x86_64 AVX512 we see
>> >> > > some significant performance regressions when evaluating SPEC CPU 2017
>> >> > > that are caused by store-to-load forwarding failures across outer
>> >> > > loop iterations when the inner loop does not iterate.  Consider
>> >> > >
>> >> > >   for (j = 0; j < m; ++j)
>> >> > >     for (i = 0; i < n; ++i)
>> >> > >       a[j*n + i] += b[j*n + i];
>> >> > >
>> >> > > with 'n' chosen so that the inner loop vectorized code is fully
>> >> > > executed by the masked epilogue, with that masked epilogue
>> >> > > storing O > n elements (elements >= n masked, of course).
>> >> > > Then the masked load performed for the next outer loop iteration
>> >> > > will get a hit in the store queue but it obviously cannot forward,
>> >> > > so we have to wait for the store to retire.
>> >> > >
>> >> > > That causes a significant hit to performance, especially if 'n'
>> >> > > would have allowed a non-masked epilogue to fully cover 'n' as well
>> >> > > (say n == 4 for a V4DImode epilogue), avoiding the need for
>> >> > > store-forwarding and waiting for the retiring of the store.
>> >> > >
>> >> > > The following applies a very simple heuristic, disabling
>> >> > > the use of loop masking when there's a memory reference pair
>> >> > > with dependence distance zero.  That resolves the issue
>> >> > > (other problematic dependence distances seem to be less common
>> >> > > at least).
>> >> > >
>> >> > > I have applied this heuristic in generic vectorizer code but
>> >> > > restricted it to non-VL vector sizes.  There currently isn't
>> >> > > a way for the target to request disabling of masking only;
>> >> > > while we can reject the vectorization at costing time, that will
>> >> > > not re-consider the same vector mode but without masking.
>> >> > > It seems simply re-costing with masking disabled should be
>> >> > > possible though, we'd just need an indication whether that
>> >> > > should be done?  Maybe always when the current vector mode is
>> >> > > of fixed size?
>> >> > >
>> >> > > I wonder how SVE vectorized code behaves in these situations?
>> >> > > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
>> >> > > 503.bwaves_r, though I think both will need a hardware vector
>> >> > > size covering at least 8 doubles to show the issue.  527.cam4_r
>> >> > > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
>> >> > >
>> >> > > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
>> >> > >
>> >> > > Any comments?
>> >> > >
>> >> > > Thanks,
>> >> > > Richard.
>> >> > >
>> >> > >         PR target/110456
>> >> > >         * tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
>> >> > >         * tree-vectorizer.cc (vec_info_shared::vec_info_shared):
>> >> > >         Initialize has_zero_dep_dist.
>> >> > >         * tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
>> >> > >         Remember if we've seen a dependence distance of zero.
>> >> > >         * tree-vect-stmts.cc (check_load_store_for_partial_vectors):
>> >> > >         When we've seen a dependence distance of zero and the vector
>> >> > >         type has constant size disable the use of partial vectors.
>> >> > > ---
>> >> > >  gcc/tree-vect-data-refs.cc |  2 ++
>> >> > >  gcc/tree-vect-stmts.cc     | 10 ++++++++++
>> >> > >  gcc/tree-vectorizer.cc     |  1 +
>> >> > >  gcc/tree-vectorizer.h      |  3 +++
>> >> > >  4 files changed, 16 insertions(+)
>> >> > >
>> >> > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
>> >> > > index ebe93832b1e..40cde95c16a 100644
>> >> > > --- a/gcc/tree-vect-data-refs.cc
>> >> > > +++ b/gcc/tree-vect-data-refs.cc
>> >> > > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
>> >> > >                           "dependence distance == 0 between %T and %T\n",
>> >> > >                           DR_REF (dra), DR_REF (drb));
>> >> > >
>> >> > > +      loop_vinfo->shared->has_zero_dep_dist = true;
>> >> > > +
>> >> > >        /* When we perform grouped accesses and perform implicit CSE
>> >> > >           by detecting equal accesses and doing disambiguation with
>> >> > >           runtime alias tests like for
>> >> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> >> > > index d642d3c257f..3bcbc000323 100644
>> >> > > --- a/gcc/tree-vect-stmts.cc
>> >> > > +++ b/gcc/tree-vect-stmts.cc
>> >> > > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>> >> > >        using_partial_vectors_p = true;
>> >> > >      }
>> >> > >
>> >> > > +  if (loop_vinfo->shared->has_zero_dep_dist
>> >> > > +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
>> >> >
>> >> > I don't think it makes sense to treat VLA and VLS differently here.
>> >> >
>> >> > But RMW operations are very common, so it seems like we're giving up a
>> >> > lot on the off chance that the inner loop is applied iteratively
>> >> > to successive memory locations.
>> >>
>> >> Yes ...
>> >>
>> >> > Maybe that's still OK for AVX512, where I guess loop masking is more
>> >> > of a niche use case.  But if so, then yeah, I think a hook to disable
>> >> > masking might be better here.
>> >>
>> >> It's a niche use case in that if you'd cost the main vector loop
>> >> with and without masking, the non-masked case is always winning.
>> >> I understand with SVE it would be the same if you fix the
>> >> vector size, but otherwise it would be a win to mask as there's
>> >> the chance the HW implementation uses bigger vectors than the
>> >> architected minimum size.
>> >>
>> >> So for AVX512 the win is with the epilogue and the case of
>> >> few scalar iterations where the epilogue iterations play
>> >> a significant role.  Since we only vectorize the epilogue
>> >> of the main loop but not the epilogue of the epilogue loop,
>> >> we're leaving quite some iterations unvectorized when the
>> >> main loop uses 512-bit vectors and the epilogue 256-bit.
>> >>
>> >> But that case specifically is also prone to make the cross-iteration
>> >> issue wrt an outer loop significant ...
>> >>
>> >> I'll see to address this on the costing side somehow.
>> >
>> > So with us requiring both partial vector and non-partial vector variants
>> > working during vectorizable_*, I thought it should be possible to
>> > simply reset LOOP_VINFO_USING_PARTIAL_VECTORS_P and re-cost without
>> > re-analyzing in case the target deemed the loop using partial vectors
>> > not profitable.
>> >
>> > While the following successfully hacks this in place, the question is
>> > whether the above is really true for VLA vector types?
>>
>> Yeah, I think so.  We do support full-vector VLA.
>>
>> > Besides from this "working", maybe the targets should get to say
>> > more specifically what they'd like to change - should we make
>> > the finish_cost () hook have a more elaborate return value or
>> > provide more hints like we already do with the suggested unroll factor?
>> >
>> > I can imagine a "partial vector usage is bad" or "too high VF"?
>> > But then we probably still want to re-do the final costing part,
>> > so we'd need a way to push away the per-stmt costing we already
>> > did (sth nicer than the hack below).  Maybe make
>> > 'bool costing_for_scalar' a tri-state, adding 'produce copy
>> > from current vector cost in vinfo'?
>>
>> Not sure I follow the tri-state thing.  If the idea is to redo
>> the vector_costs::add_stmt_costs stuff, could we just export
>> the cost_vec from vect_analyze_loop_operations and apply the
>> costs in vect_analyze_loop_costing instead?
>
> I think we don't need to re-do the add_stmt_costs, but yes, I guess
> we could do this and basically initialize the backend cost info
> only at vect_analyze_loop_costing time.  Is that what you suggest?
I wasn't thinking that far ahead, but yeah, if we can do that, it
definitely sounds better.

>> That seems worthwhile anyway, if we think the costs are going
>> to depend on partial vs. full vectors.  As things stand, we could
>> already cost with CAN_USE_PARTIAL_VECTORS_P set to 1 and then later
>> set it to 0.
>
> Yeah, at the time we feed stmts to the backend
> (in vect_analyze_loop_operations) we have not yet committed to
> using partial vectors or not.
>
> What I meant with the tri-state thing is to give the vectorizer
> an idea how to best recover here without wasting too many cycles.
> At least costing masking vs. not masking should involve only
> re-running the costing itself, so we could unconditionally try that
> as the patch does when the masked variant is rejected by the
> backend (via setting cost to INT_MAX).  The question above was
> whether with VLA vectors we could even code generate with
> !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P?

Yeah, we can.  And we currently do for things like reversed loops.  E.g.:

int x[100];
int y[100];
void f() {
  for (int i = 0; i < 100; ++i)
    x[i] = y[99 - i];
}

https://godbolt.org/z/rhqc9dWxs

Admittedly that's only because we haven't added loop predication for
reversed loads yet, rather than a deliberate choice.  But
vect-partial-vector-usage=1 makes sense for VLA too, and is supported there.

Thanks,
Richard
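For concreteness, a self-contained sketch of the scenario the patch
description above is about is given below.  The loop nest is the one from
that mail; the values N == 5 (the inner trip count the description
attributes to 503.bwaves_r), M, the repeat loop, the double element type
and the noipa attribute are illustrative assumptions added to make it
compile and run, not taken from the benchmarks.  The idea: with a 512-bit
masked epilogue the store for row j covers 8 doubles (the top 3 masked),
while the load for row j + 1 starts only 5 doubles later, so it hits the
still-pending store and cannot be forwarded.

/* Hypothetical reproducer for the cross-outer-iteration
   store-to-load-forwarding stall described above.  With N == 5 a
   512-bit masked epilogue stores 8 doubles, masking the top 3; the
   masked load for the next outer iteration starts at a[(j+1)*5],
   overlaps that store, and has to wait for it to retire.  */

#define N 5
#define M 4096

static double a[M * N], b[M * N];

__attribute__ ((noipa)) void
add_rows (void)
{
  for (int j = 0; j < M; ++j)
    for (int i = 0; i < N; ++i)
      a[j * N + i] += b[j * N + i];
}

int
main (void)
{
  for (int r = 0; r < 100000; ++r)
    add_rows ();
  return 0;
}

Compiling with something like -O3 -march=znver4 (or another AVX512
target) and --param vect-partial-vector-usage=1 should let the inner
loop be handled entirely by the masked epilogue; the exact options
needed to trigger that path are an assumption, not taken from the
thread.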