On Thu, 7 Nov 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Wednesday, November 6, 2024 2:32 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: RISC-V CI <patchworks...@rivosinc.com>; Tamar Christina
> > <tamar.christ...@arm.com>; Richard Sandiford <richard.sandif...@arm.com>
> > Subject: [PATCH 5/5] Allow multiple vectorized epilogs via --param
> > vect-epilogues-nomask=N
> >
> > The following is a prototype allowing N possible vector epilogues.
> > In the end I'd like the target to tell us a set of (or no) vector modes
> > to consider for the epilogue of the main or the current epilog analyzed loop
> > in a way similar as to how we communicate back suggested_unroll_factor.
> >
> > The main motivation is SPEC CPU 2017 525.x264_r which when doing
> > AVX512 vectorization ends up with using the scalar epilogue in
> > a hot function because the AVX2 epilogue has a too high VF.  Using
> > two vector epilogues mitigates this and also avoids regressing in
> > 527.cam4_r which has a loop iteration count exactly matching the
> > AVX2 epilogue (one of the original ideas was to always use a SSE2
> > vector epilogue, even with a AVX512 main loop).
> >
> > It turns out that two vector epilogues even create smaller code
> > in some cases since we tend to fully unroll epilogues with less
> > than 16 iterations.  So a simple (int x[])
> >
> >   for (int i = 0; i < n; ++i)
> >     x[i] *= 3;
> >
> > has a -O3 -march=znver4 code size
> >
> >   N vector epilogues    size
> >   0                     615
> >   1                     429
> >   2                     388
> >   3                     392
> >
> > I'm unsure how important/effective multiple vector epilogues are
> > for non-x86 ISAs who all seem to have only a single vector size
> > or VLA vectors.  For better target control on x86 I'd like to
> > tell the vectorizer the array of modes to consider for the
> > epilogue of the current loop plus a flag whether to consider
> > using partial vectors (x86 does not have that encoded into the mode).
> > So I'd add m_epilog_vec_modes[] and m_epilog_vec_mode_partial,
> > since currently x86 doesn't do cost compares the latter can be a
> > flag and we'd try that first when set, together with (only?) the
> > first mode?  Alternatively only hint a single mode, but this won't
> > ever scale to cost compare targets?
> >
> > So using --param vect-epilogues-nomask=N is mainly for this RFC,
> > not sure if it has to prevail.
> >
> > Note I didn't manage to get aarch64 to use more than one epilogue,
> > not even with -msve-vector-bits=512.
> >
>
> My guess is it's probably due to partial SVE vector type support not
> being as robust as full vector.  And once you say all vectors are 512bits
> to use a smaller one it needs support for partial vectors.
>
> I think this change would be useful for AArch64 as well, but I (personally)
> think the most useful mode for us is to be able to generate different
> kinds of epilogues.
>
> With that I mean, having an unpredicated SVE main loop,
> unpredicated Adv. SIMD first epilogue and predicated SVE second epilogue.
>
> For that I think this change is a good step forward :)
>
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> > built SPEC CPU 2017 with --param vect-epilogues-nomask=2 - as
> > said, I want the target to have more control, even on x86 we
> > probably only want two epilogues when doing 512bit vectorization
> > for the main loop and possibly depend on its VF.
>
> Agreed, for AArch64 we'd definitely like this as the cases we'd generate more
> than one epilogue would have a large overlap with ones where we unrolled.

OK.  For now I'll push the prerequisites (1-4/5), after fixing a compile
issue in 3/5 caused by splitting the series.  I'll then post an RFC for
the target control and the x86 implementation, skipping the --param
change for now.  That also makes it easier to iterate on the interface
between the vectorizer and the target without breaking the user
interaction.  On the x86 side we'd want to control defaults based on
-mtune= with manual control via the x86 -mtune-ctrl=; I do not expect
many heuristics on the x86 side for now.

Thanks for looking,
Richard.

> Cheers,
> Tamar
> >
> > Any comments sofar?
> >
> > Thanks,
> > Richard.
> >
> >     * doc/invoke.texi (vect-epilogues-nomask): Adjust.
> >     * params.opt (vect-epilogues-nomask): Adjust max value and
> >     documentation.
> >     * tree-vect-loop.cc (vect_analyze_loop): Hack in multiple
> >     vectorized epilogs.
> > ---
> >  gcc/doc/invoke.texi   |  3 ++-
> >  gcc/params.opt        |  2 +-
> >  gcc/tree-vect-loop.cc | 23 +++++++++++++++++------
> >  3 files changed, 20 insertions(+), 8 deletions(-)
> >
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index f2555ec83a1..73e54a47381 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -16870,7 +16870,8 @@ The maximum number of insns in loop header duplicated
> >  by the copy loop headers pass.
> >
> >  @item vect-epilogues-nomask
> > -Enable loop epilogue vectorization using smaller vector size.
> > +Enable loop epilogue vectorization using smaller vector size with up to N
> > +vector epilogue loops.
> >
> >  @item vect-partial-vector-usage
> >  Controls when the loop vectorizer considers using partial vector loads
> > diff --git a/gcc/params.opt b/gcc/params.opt
> > index 4dab7a26f9b..c77472e7ad3 100644
> > --- a/gcc/params.opt
> > +++ b/gcc/params.opt
> > @@ -1175,7 +1175,7 @@ Common Joined UInteger Var(param_use_canonical_types) Init(1) IntegerRange(0, 1)
> >  Whether to use canonical types.
> >
> >  -param=vect-epilogues-nomask=
> > -Common Joined UInteger Var(param_vect_epilogues_nomask) Init(1) IntegerRange(0, 1) Param Optimization
> > +Common Joined UInteger Var(param_vect_epilogues_nomask) Init(1) IntegerRange(0, 8) Param Optimization
> >  Enable loop epilogue vectorization using smaller vector size.
> >
> >  -param=vect-max-layout-candidates=
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 41875683595..90802675a84 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -3721,6 +3721,10 @@ vect_analyze_loop (class loop *loop, gimple *loop_vectorized_call,
> >      partial_vectors_supported_p () && param_vect_partial_vector_usage != 0;
> >    poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo);
> >
> > +  loop_vec_info orig_loop_vinfo = first_loop_vinfo;
> > +  unsigned n = param_vect_epilogues_nomask;
> > +  do
> > +    {
> >    while (1)
> >      {
> >        /* If the target does not support partial vectors we can shorten the
> > @@ -3744,7 +3748,7 @@ vect_analyze_loop (class loop *loop, gimple *loop_vectorized_call,
> >        bool fatal;
> >        opt_loop_vec_info loop_vinfo
> >          = vect_analyze_loop_1 (loop, shared, &loop_form_info,
> > -                               first_loop_vinfo,
> > +                               orig_loop_vinfo,
> >                                 vector_modes, mode_i,
> >                                 autodetected_vector_mode, fatal);
> >        if (fatal)
> > @@ -3769,17 +3773,24 @@ vect_analyze_loop (class loop *loop, gimple *loop_vectorized_call,
> >              loop_vinfo = opt_loop_vec_info::success (NULL);
> >          }
> >
> > -      /* For now only allow one epilogue loop, but allow
> > -         pick_lowest_cost_p to replace it, so commit to the
> > -         first epilogue if we have no reason to try alternatives.  */
> > +      /* If we do not pick an alternative based on cost we're done.  */
> >        if (!pick_lowest_cost_p)
> >          break;
> >      }
> >
> >        if (mode_i == vector_modes.length ())
> > -        break;
> > -
> > +        {
> > +          mode_i = 0;
> > +          break;
> > +        }
> > +    }
> > +  if (mode_i == vector_modes.length ())
> > +    break;
> > +  orig_loop_vinfo = orig_loop_vinfo->epilogue_vinfo;
> >      }
> > +  while (orig_loop_vinfo
> > +         && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (orig_loop_vinfo)
> > +         && --n != 0);
> >
> >    if (first_loop_vinfo->epilogue_vinfo)
> >      {
> > --
> > 2.43.0
>

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
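
As a rough illustration of the 525.x264_r motivation discussed in the thread, the following C sketch models how a trip count is divided between a main vector loop, a chain of vector epilogues and the final scalar epilogue.  The function name split_iterations and any VF values used with it are hypothetical and not taken from the patch; the model also ignores the vectorizer's cost model, peeling and partial vectors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not GCC code: split trip count N across a main
   vector loop and a chain of vector epilogues whose vectorization
   factors are VF[0..NVF-1] (VF[0] is the main loop).  Each loop takes
   as many full vector iterations as fit in what is left.  HANDLED[i]
   receives the number of scalar iterations covered by loop i; the
   return value is what remains for the scalar epilogue.  */
static size_t
split_iterations (size_t n, const size_t *vf, size_t nvf, size_t *handled)
{
  for (size_t i = 0; i < nvf; ++i)
    {
      /* A vectorization factor of zero would make no sense here.  */
      assert (vf[i] != 0);
      /* Largest multiple of vf[i] that does not exceed n.  */
      handled[i] = n - n % vf[i];
      /* The remainder is passed on to the next, narrower loop.  */
      n %= vf[i];
    }
  return n;
}
```

For example, assuming VFs of 64 and 32 for the main loop and a single vector epilogue, a loop running 16 iterations leaves all 16 to the scalar epilogue; adding a third entry with VF 16 lets a second vector epilogue cover them, which is the situation the two-epilogue setup is meant to address.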