On Tue, 4 Jul 2023, Richard Sandiford wrote: > Richard Biener <rguent...@suse.de> writes: > > On Tue, 4 Jul 2023, Richard Biener wrote: > > > >> On Mon, 3 Jul 2023, Richard Sandiford wrote: > >> > >> > Richard Biener <rguent...@suse.de> writes: > >> > > The following removes late deciding to elide vectorized epilogues to > >> > > the analysis phase and also avoids altering the epilogues niter. > >> > > The costing part from vect_determine_partial_vectors_and_peeling is > >> > > moved to vect_analyze_loop_costing where we use the main loop > >> > > analysis to constrain the epilogue scalar iterations. > >> > > > >> > > I have not tried to integrate this with > >> > > vect_known_niters_smaller_than_vf. > >> > > > >> > > It seems the for_epilogue_p parameter in > >> > > vect_determine_partial_vectors_and_peeling is largely useless and > >> > > we could compute that in the function itself. > >> > > > >> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK? > >> > > > >> > > I suppose testing on aarch64 would be nice-to-have - any takers? > >> > > >> > Sorry, ran this earlier today and then forgot about it. And yeah, > >> > it passes bootstrap & regtest on aarch64-linux-gnu (all languages). > >> > > >> > LGTM FWIW, except: > >> > > >> > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > >> > > index 0a03f56aae7..f39a1ecb306 100644 > >> > > --- a/gcc/tree-vect-loop.cc > >> > > +++ b/gcc/tree-vect-loop.cc > >> > > @@ -2144,14 +2144,76 @@ vect_analyze_loop_costing (loop_vec_info > >> > > loop_vinfo, > >> > > > >> > > /* Only loops that can handle partially-populated vectors can have > >> > > iteration > >> > > counts less than the vectorization factor. */ > >> > > - if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) > >> > > + if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) > >> > > + && vect_known_niters_smaller_than_vf (loop_vinfo)) > >> > > { > >> > > - if (vect_known_niters_smaller_than_vf (loop_vinfo)) > >> > > + if (dump_enabled_p ()) > >> > > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > >> > > + "not vectorized: iteration count smaller than " > >> > > + "vectorization factor.\n"); > >> > > + return 0; > >> > > + } > >> > > + > >> > > + /* If we know the number of iterations we can do better, for the > >> > > + epilogue we can also decide whether the main loop leaves us > >> > > + with enough iterations, prefering a smaller vector epilog then > >> > > + also possibly used for the case we skip the vector loop. */ > >> > > + if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) > >> > > + && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) > >> > > + { > >> > > + widest_int scalar_niters > >> > > + = wi::to_widest (LOOP_VINFO_NITERSM1 (loop_vinfo)) + 1; > >> > > + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) > >> > > + { > >> > > + loop_vec_info orig_loop_vinfo > >> > > + = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo); > >> > > + unsigned lowest_vf > >> > > + = constant_lower_bound (LOOP_VINFO_VECT_FACTOR > >> > > (orig_loop_vinfo)); > >> > > + int prolog_peeling = 0; > >> > > + if (!vect_use_loop_mask_for_alignment_p (loop_vinfo)) > >> > > + prolog_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT > >> > > (orig_loop_vinfo); > >> > > + if (prolog_peeling >= 0 > >> > > + && known_eq (LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo), > >> > > + lowest_vf)) > >> > > + { > >> > > + unsigned gap > >> > > + = LOOP_VINFO_PEELING_FOR_GAPS (orig_loop_vinfo) ? 1 : 0; > >> > > + scalar_niters = ((scalar_niters - gap - prolog_peeling) > >> > > + % lowest_vf + gap); > >> > > >> > Are you sure we want this + gap? A vectorised epilogue can't handle the > >> > gap either, at least for things that use (say) the first vector of LD2 > >> > and ignore the second vector. > >> > >> I must confess I blindly copied this code from vect_do_peeling which did > >> > >> - unsigned HOST_WIDE_INT eiters > >> - = (LOOP_VINFO_INT_NITERS (loop_vinfo) > >> - - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); > >> - > >> - eiters -= prolog_peeling; > >> - eiters > >> - = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo); > >> > >> I think it's correct - the main vector loop processes > >> (scalar_niters - gap - prolog_peeling) % vf iterations and it leaves > >> that one 'gap' iteration to the epilogue. Yes, that cannot handle > >> the gap either, so we should subtract it's gap (maybe we were able > >> to elide the gap peeling with lower VF) which means altering the > >> two conditions below. That btw also holds for the main vector > >> loop. > >> > >> I think I'm going to incrementally try to fix this so we can bisect > >> issues from moving this code from transform to analysis separately > >> from changing the actual heuristics. > > > > So it looks correct even since we below check both > > > > /* Check that the loop processes at least one full vector. */ > > poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); > > if (known_lt (scalar_niters, vf)) > > { > > > > and > > > > /* If we need to peel an extra epilogue iteration to handle data > > accesses with gaps, check that there are enough scalar iterations > > available. > > > > The check above is redundant with this one when peeling for gaps, > > but the distinction is useful for diagnostics. */ > > if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > > && known_le (scalar_niters, vf)) > > > > so here we take that into consideration. > > Ah, I see. I was thinking mostly of the following: > > >> > > + if (scalar_niters == 0) > >> > > + { > >> > > + if (dump_enabled_p ()) > >> > > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, > >> > > vect_location, > >> > > + "not vectorized: loop never > >> > > entered\n"); > >> > > + return 0; > >> > > + } > > which looked odd when we'd just added "gaps" back in. The loop is never > entered when there's a single iteration left over for gaps either. > > Maybe it would be clearer to delete the == 0 check? Or alternatively, > do that check before adding gaps back in?
Yeah, I think the check is now useless - 'scalar_niters' used to be unsigned int and we'd subtract one from it. But that's all gone now and widest_int will handle possibly negative scalar_niters just fine. I'll delete that check. Thanks, Richard. > Thanks, > Richard > > > >> > > + } > >> > > + } > >> > > + > >> > > + /* Check that the loop processes at least one full vector. */ > >> > > + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); > >> > > + if (known_lt (scalar_niters, vf)) > >> > > { > >> > > if (dump_enabled_p ()) > >> > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > >> > > - "not vectorized: iteration count smaller > >> > > than " > >> > > - "vectorization factor.\n"); > >> > > + "loop does not have enough iterations " > >> > > + "to support vectorization.\n"); > >> > > + return 0; > >> > > + } > >> > > + > >> > > + /* If we need to peel an extra epilogue iteration to handle data > >> > > + accesses with gaps, check that there are enough scalar > >> > > iterations > >> > > + available. > >> > > + > >> > > + The check above is redundant with this one when peeling for > >> > > gaps, > >> > > + but the distinction is useful for diagnostics. */ > >> > > + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > >> > > + && known_le (scalar_niters, vf)) > >> > > + { > >> > > + if (dump_enabled_p ()) > >> > > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > >> > > + "loop does not have enough iterations " > >> > > + "to support peeling for gaps.\n"); > >> > > return 0; > >> > > } > >> > > } > >> > > @@ -2502,31 +2564,6 @@ vect_determine_partial_vectors_and_peeling > >> > > (loop_vec_info loop_vinfo, > >> > > LOOP_VINFO_VECT_FACTOR > >> > > (orig_loop_vinfo))); > >> > > } > >> > > > >> > > - if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > >> > > - && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) > >> > > - { > >> > > - /* Check that the loop processes at least one full vector. */ > >> > > - poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); > >> > > - tree scalar_niters = LOOP_VINFO_NITERS (loop_vinfo); > >> > > - if (known_lt (wi::to_widest (scalar_niters), vf)) > >> > > - return opt_result::failure_at (vect_location, > >> > > - "loop does not have enough > >> > > iterations" > >> > > - " to support vectorization.\n"); > >> > > - > >> > > - /* If we need to peel an extra epilogue iteration to handle data > >> > > - accesses with gaps, check that there are enough scalar > >> > > iterations > >> > > - available. > >> > > - > >> > > - The check above is redundant with this one when peeling for > >> > > gaps, > >> > > - but the distinction is useful for diagnostics. */ > >> > > - tree scalar_nitersm1 = LOOP_VINFO_NITERSM1 (loop_vinfo); > >> > > - if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > >> > > - && known_lt (wi::to_widest (scalar_nitersm1), vf)) > >> > > - return opt_result::failure_at (vect_location, > >> > > - "loop does not have enough > >> > > iterations" > >> > > - " to support peeling for > >> > > gaps.\n"); > >> > > - } > >> > > - > >> > > LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) > >> > > = (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) > >> > > && need_peeling_or_partial_vectors_p); > >> > > @@ -3002,7 +3039,8 @@ start_over: > >> > > assuming that the loop will be used as a main loop. We will redo > >> > > this analysis later if we instead decide to use the loop as an > >> > > epilogue loop. */ > >> > > - ok = vect_determine_partial_vectors_and_peeling (loop_vinfo, false); > >> > > + ok = vect_determine_partial_vectors_and_peeling > >> > > + (loop_vinfo, LOOP_VINFO_EPILOGUE_P (loop_vinfo)); > >> > > if (!ok) > >> > > return ok; > >> > > >> > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman; HRB 36809 (AG Nuernberg)