On Tue, 4 Jul 2023, Richard Sandiford wrote: > Richard Biener <rguent...@suse.de> writes: > > On Thu, 29 Jun 2023, Richard Biener wrote: > > > >> On Thu, 29 Jun 2023, Richard Sandiford wrote: > >> > >> > Richard Biener <rguent...@suse.de> writes: > >> > > With applying loop masking to epilogues on x86_64 AVX512 we see > >> > > some significant performance regressions when evaluating SPEC CPU 2017 > >> > > that are caused by store-to-load forwarding fails across outer > >> > > loop iterations when the inner loop does not iterate. Consider > >> > > > >> > > for (j = 0; j < m; ++j) > >> > > for (i = 0; i < n; ++i) > >> > > a[j*n + i] += b[j*n + i]; > >> > > > >> > > with 'n' chosen so that the inner loop vectorized code is fully > >> > > executed by the masked epilogue and that masked epilogue > >> > > storing O > n elements (with elements >= n masked of course). > >> > > Then the masked load performed for the next outer loop iteration > >> > > will get a hit in the store queue but it obviously cannot forward > >> > > so we have to wait for the store to retire. > >> > > > >> > > That causes a significant hit to performance especially if 'n' > >> > > would have made a non-masked epilogue to fully cover 'n' as well > >> > > (say n == 4 for a V4DImode epilogue), avoiding the need for > >> > > store-forwarding and waiting for the retiring of the store. > >> > > > >> > > The following applies a very simple heuristic, disabling > >> > > the use of loop masking when there's a memory reference pair > >> > > with dependence distance zero. That resolves the issue > >> > > (other problematic dependence distances seem to be less common > >> > > at least). > >> > > > >> > > I have applied this heuristic in generic vectorizer code but > >> > > restricted it to non-VL vector sizes. There currently isn't > >> > > a way for the target to request disabling of masking only, > >> > > while we can reject the vectoriztion at costing time that will > >> > > not re-consider the same vector mode but without masking. > >> > > It seems simply re-costing with masking disabled should be > >> > > possible through, we'd just need an indication whether that > >> > > should be done? Maybe always when the current vector mode is > >> > > of fixed size? > >> > > > >> > > I wonder how SVE vectorized code behaves in these situations? > >> > > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and > >> > > 503.bwaves_r though I think both will need a hardware vector > >> > > size covering at least 8 doubles to show the issue. 527.cam4_r > >> > > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC. > >> > > > >> > > Bootstrap / regtest running on x86_64-unknown-linux-gnu. > >> > > > >> > > Any comments? > >> > > > >> > > Thanks, > >> > > Richard. > >> > > > >> > > PR target/110456 > >> > > * tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New. > >> > > * tree-vectorizer.cc (vec_info_shared::vec_info_shared): > >> > > Initialize has_zero_dep_dist. > >> > > * tree-vect-data-refs.cc (vect_analyze_data_ref_dependence): > >> > > Remember if we've seen a dependence distance of zero. > >> > > * tree-vect-stmts.cc (check_load_store_for_partial_vectors): > >> > > When we've seen a dependence distance of zero and the vector > >> > > type has constant size disable the use of partial vectors. > >> > > --- > >> > > gcc/tree-vect-data-refs.cc | 2 ++ > >> > > gcc/tree-vect-stmts.cc | 10 ++++++++++ > >> > > gcc/tree-vectorizer.cc | 1 + > >> > > gcc/tree-vectorizer.h | 3 +++ > >> > > 4 files changed, 16 insertions(+) > >> > > > >> > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc > >> > > index ebe93832b1e..40cde95c16a 100644 > >> > > --- a/gcc/tree-vect-data-refs.cc > >> > > +++ b/gcc/tree-vect-data-refs.cc > >> > > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct > >> > > data_dependence_relation *ddr, > >> > > "dependence distance == 0 between %T and > >> > > %T\n", > >> > > DR_REF (dra), DR_REF (drb)); > >> > > > >> > > + loop_vinfo->shared->has_zero_dep_dist = true; > >> > > + > >> > > /* When we perform grouped accesses and perform implicit CSE > >> > > by detecting equal accesses and doing disambiguation with > >> > > runtime alias tests like for > >> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > >> > > index d642d3c257f..3bcbc000323 100644 > >> > > --- a/gcc/tree-vect-stmts.cc > >> > > +++ b/gcc/tree-vect-stmts.cc > >> > > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors > >> > > (loop_vec_info loop_vinfo, tree vectype, > >> > > using_partial_vectors_p = true; > >> > > } > >> > > > >> > > + if (loop_vinfo->shared->has_zero_dep_dist > >> > > + && TYPE_VECTOR_SUBPARTS (vectype).is_constant ()) > >> > > >> > I don't think it makes sense to treat VLA and VLS differently here. > >> > > >> > But RMW operations are very common, so it seems like we're giving up a > >> > lot on the off chance that the inner loop is applied iteratively > >> > to successive memory locations. > >> > >> Yes ... > >> > >> > Maybe that's still OK for AVX512, where I guess loop masking is more > >> > of a niche use case. But if so, then yeah, I think a hook to disable > >> > masking might be better here. > >> > >> It's a nice use case in that if you'd cost the main vector loop > >> with and without masking the not masked case is always winning. > >> I understand with SVE it would be the same if you fix the > >> vector size but otherwise it would be a win to mask as there's > >> the chance the HW implementation uses bigger vectors than the > >> architected minimum size. > >> > >> So for AVX512 the win is with the epilogue and the case of > >> few scalar iterations where the epilogue iterations play > >> a significant role. Since we only vectorize the epilogue > >> of the main loop but not the epilogue of the epilogue loop > >> we're leaving quite some iterations unvectorized when the > >> main loop uses 512bit vectors and the epilogue 256bit. > >> > >> But that case specifically is also prone to make the cross-iteration > >> issue wrt an outer loop significiant ... > >> > >> I'll see to address this on the costing side somehow. > > > > So with us requiring both partial vector and non-partial vector variants > > working during vectorizable_* I thought it should be possible to > > simply reset LOOP_VINFO_USING_PARTIAL_VECTORS_P and re-cost without > > re-analyzing in case the target deemed the partial vector using loop > > not profitable. > > > > While the following successfully hacks this in-place the question is > > whether the above is really true for VLA vector types? > > Yeah, I think so. We do support full-vector VLA. > > > Besides from this "working", maybe the targets should get to say > > more specifically what they'd like to change - should we make > > the finish_cost () hook have a more elaborate return value or > > provide more hints like we already do with the suggested unroll factor? > > > > I can imagine a "partial vector usage is bad" or "too high VF"? > > But then we probably still want to re-do the final costing part > > so we'd need a way to push away the per-stmt costing we already > > did (sth nicer than the hack below). Maybe make > > 'bool costing_for_scalar' a tri-state, adding 'produce copy > > from current vector cost in vinfo'? > > Not sure I follow the tri-state thing. If the idea is to redo > the vector_costs::add_stmt_costs stuff, could we just export > the cost_vec from vect_analyze_loop_operations and apply the > costs in vect_analyze_loop_costing instead?
I think we don't need to re-do the add_stmt_costs, but yes, I guess we could do this and basically initialize the backend cost info only at vect_analyze_loop_costing time. Is that what you suggest? > That seems worthwhile anyway, if we think the costs are going > to depend on partial vs. full vectors. As things stand, we could > already cost with CAN_USE_PARTIAL_VECTORS_P set to 1 and then later > set it to 0. Yeah, at the time we feed stmts to the backed (in vect_analyze_loop_operations) we have not yet committed to using partial vectors or not. What I meant with the tri-state thing is to give the vectorizer an idea how to best recover here without wasting too many cycles. At least costing masking vs. not masking should involve only re-running the costing itself so we could unconditionally try that as the patch does when the masked variant is rejected by the backend (via setting cost to INT_MAX). The question above was whether with VLA vectors we could even code generate with !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P? > > Anyway - the takeaway is the x86 backend likes a way to disable > > the use of partial vectors for the epilog (or main loop) in some cases. > > An alternative to doing this somehow via the costing hooks would be > > to add a new hook - the specific data dependence check could be > > a hook invoked after dependence checking, initializing > > LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P? > > > > Any good ideas? Anything that comes to your minds that would be > > useful in this area for other targets? > > > > Thanks, > > Richard. > > > > The following applies ontop of the earlier posted patch in this thread. > > > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > index 8989985700a..a892f72ded3 100644 > > --- a/gcc/config/i386/i386.cc > > +++ b/gcc/config/i386/i386.cc > > @@ -23976,6 +23976,10 @@ ix86_vector_costs::finish_cost (const vector_costs > > *scalar_costs) > > && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ()) > > > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo)))) > > m_costs[vect_body] = INT_MAX; > > + > > + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) > > + && loop_vinfo->shared->has_zero_dep_dist) > > + m_costs[vect_body] = INT_MAX; > > } > > > > vector_costs::finish_cost (scalar_costs); > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > index 3b46c58a8d8..85802f07443 100644 > > --- a/gcc/tree-vect-loop.cc > > +++ b/gcc/tree-vect-loop.cc > > @@ -2773,6 +2773,7 @@ start_over: > > } > > > > loop_vinfo->vector_costs = init_cost (loop_vinfo, false); > > + vector_costs saved_costs = *loop_vinfo->vector_costs; > > The target can derive the class and add its own member variables, > so I think we'd need some kind of clone method for this to be safe. > > > /* Analyze the alignment of the data-refs in the loop. > > Fail if a data reference is found that cannot be vectorized. */ > > @@ -3017,6 +3018,8 @@ start_over: > > LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true; > > } > > > > + saved_costs = *loop_vinfo->vector_costs; > > +again_no_partial_vectors: > > /* If we're vectorizing an epilogue loop, the vectorized loop either > > needs > > to be able to handle fewer than VF scalars, or needs to have a lower > > VF > > than the main loop. */ > > @@ -3043,6 +3046,17 @@ start_over: > > { > > ok = opt_result::failure_at (vect_location, > > "Loop costings may not be worthwhile.\n"); > > + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) > > + { > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_NOTE, vect_location, > > + "trying with partial vectors disabled\n"); > > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > > + LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = false; > > + LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = false; > > I guess these last two variables should have been set after > vect_determine_partial_vectors_and_peeling rather than before. > Sorry for not noticing earlier. > > The approach seems OK to me otherwise FWIW. OK, will try post-poning the target cost creation and re-feed the cost_vect, that should avoid the need to copy the target cost structure as well. Thanks, Richard. > Thanks, > Richard > > > + *loop_vinfo->vector_costs = saved_costs; > > + goto again_no_partial_vectors; > > + } > > goto again; > > } > > if (!res) > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > > index 08d071463fb..003af878ee7 100644 > > --- a/gcc/tree-vect-stmts.cc > > +++ b/gcc/tree-vect-stmts.cc > > @@ -1854,7 +1854,7 @@ check_load_store_for_partial_vectors (loop_vec_info > > loop_vinfo, tree vectype, > > using_partial_vectors_p = true; > > } > > > > - if (loop_vinfo->shared->has_zero_dep_dist > > + if (0 && loop_vinfo->shared->has_zero_dep_dist > > && TYPE_VECTOR_SUBPARTS (vectype).is_constant ()) > > { > > if (dump_enabled_p ()) > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman; HRB 36809 (AG Nuernberg)