RE: [PATCH 5/5] Allow multiple vectorized epilogs via --param vect-epilogues-nomask=N

Richard Biener Thu, 07 Nov 2024 10:03:46 -0800

On Thu, 7 Nov 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Wednesday, November 6, 2024 2:32 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: RISC-V CI <patchworks...@rivosinc.com>; Tamar Christina
> > <tamar.christ...@arm.com>; Richard Sandiford <richard.sandif...@arm.com>
> > Subject: [PATCH 5/5] Allow multiple vectorized epilogs via --param
> > vect-epilogues-nomask=N
> > 
> > The following is a prototype allowing N possible vector epilogues.
> > In the end I'd like the target to tell us a set of (or no) vector modes
> > to consider for the epilogue of the main loop or of the epilogue loop
> > currently being analyzed, similar to how we communicate back
> > suggested_unroll_factor.
> > 
> > The main motivation is SPEC CPU 2017 525.x264_r, which when doing
> > AVX512 vectorization ends up using the scalar epilogue in
> > a hot function because the AVX2 epilogue has too high a VF.  Using
> > two vector epilogues mitigates this and also avoids regressing
> > 527.cam4_r, which has a loop iteration count exactly matching the
> > AVX2 epilogue (one of the original ideas was to always use an SSE2
> > vector epilogue, even with an AVX512 main loop).
> > 
> > It turns out that two vector epilogues even create smaller code
> > in some cases since we tend to fully unroll epilogues with less
> > than 16 iterations.  So a simple (int x[])
> > 
> >   for (int i = 0; i < n; ++i)
> >     x[i] *= 3;
> > 
> > has a -O3 -march=znver4 code size
> > 
> > N vector epilogues   size
> > 0                    615
> > 1                    429
> > 2                    388
> > 3                    392
> > 
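The effect of the epilogue VF on short trip counts can be sketched with a small model. This is an illustration only, not GCC code: split_iterations is a hypothetical helper, and the VFs 16, 8 and 4 are assumptions for 32-bit int elements with 512-bit, 256-bit and 128-bit vectors. The real vectorizer also applies cost and threshold checks that this model ignores.

```c
/* Hypothetical helper, not part of GCC: distribute N scalar iterations
   over a main vector loop followed by a chain of vector epilogues.
   VF[0] is the main loop's vectorization factor, VF[1..NVF-1] are the
   epilogues' factors in decreasing order.  VEC_ITERS[i] receives the
   number of vector iterations executed at VF[i]; the return value is
   the number of iterations left for the scalar epilogue.  */
static unsigned
split_iterations (unsigned n, const unsigned *vf, unsigned nvf,
                  unsigned *vec_iters)
{
  for (unsigned i = 0; i < nvf; ++i)
    {
      vec_iters[i] = n / vf[i];   /* Whole vector iterations at this VF.  */
      n %= vf[i];                 /* Remainder handed to the next loop.  */
    }
  return n;                       /* Left for the scalar epilogue.  */
}
```

Under these assumptions, a trip count of 7 with only a VF 8 epilogue (vf = {16, 8}) leaves all 7 iterations to the scalar loop, while adding a VF 4 epilogue (vf = {16, 8, 4}) runs one vector iteration and leaves 3 scalar iterations.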
> > I'm unsure how important/effective multiple vector epilogues are
> > for non-x86 ISAs, which all seem to have only a single vector size
> > or VLA vectors.  For better target control on x86 I'd like to
> > tell the vectorizer the array of modes to consider for the
> > epilogue of the current loop plus a flag whether to consider
> > using partial vectors (x86 does not have that encoded into the mode).
> > So I'd add m_epilog_vec_modes[] and m_epilog_vec_mode_partial.
> > Since x86 currently doesn't do cost compares, the latter can be a
> > flag, and we'd try that first when set, together with (only?) the
> > first mode?  Alternatively only hint a single mode, but this won't
> > ever scale to cost compare targets?
> > 
> > So using --param vect-epilogues-nomask=N is mainly for this RFC,
> > not sure if it has to prevail.
> > 
> > Note I didn't manage to get aarch64 to use more than one epilogue,
> > not even with -msve-vector-bits=512.
> > 
> 
> My guess is it's probably due to partial SVE vector type support not
> being as robust as full vector support.  And once you say all vectors
> are 512 bits, using a smaller one needs support for partial vectors.
> 
> I think this change would be useful for AArch64 as well, but I (personally)
> think the most useful mode for us is to be able to generate different
> kinds of epilogues.
> 
> With that I mean, having an unpredicated SVE main loop,
> unpredicated Adv. SIMD first epilogue and predicated SVE second epilogue.
> 
> For that I think this change is a good step forward :)
> 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> > built SPEC CPU 2017 with --param vect-epilogues-nomask=2 - as
> > said, I want the target to have more control, even on x86 we
> > probably only want two epilogues when doing 512bit vectorization
> > for the main loop and possibly depend on its VF.
> 
> Agreed, for AArch64 we'd definitely like this as the cases we'd generate more
> than one epilogue would have a large overlap with ones where we unrolled.


OK.  For now I'll push the prerequisites (1-4/5), after fixing a
compile issue in 3/5 caused by splitting the series.  I'll then post
an RFC for the target control and the x86 implementation, skipping
the --param change for now.  That also makes it easier to iterate on
the interface between the vectorizer and the target without breaking
the user interaction.  On the x86 side we'd want to control defaults
based on -mtune= with manual control via the x86 -mtune-ctrl=; I
do not expect many heuristics on the x86 side for now.

Thanks for looking,
Richard.

> Cheers,
> Tamar
> 
> > 
> > Any comments so far?
> > 
> > Thanks,
> > Richard.
> > 
> >     * doc/invoke.texi (vect-epilogues-nomask): Adjust.
> >     * params.opt (vect-epilogues-nomask): Adjust max value and
> >     documentation.
> >     * tree-vect-loop.cc (vect_analyze_loop): Hack in multiple
> >     vectorized epilogs.
> > ---
> >  gcc/doc/invoke.texi   |  3 ++-
> >  gcc/params.opt        |  2 +-
> >  gcc/tree-vect-loop.cc | 23 +++++++++++++++++------
> >  3 files changed, 20 insertions(+), 8 deletions(-)
> > 
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index f2555ec83a1..73e54a47381 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -16870,7 +16870,8 @@ The maximum number of insns in loop header duplicated
> >  by the copy loop headers pass.
> > 
> >  @item vect-epilogues-nomask
> > -Enable loop epilogue vectorization using smaller vector size.
> > +Enable loop epilogue vectorization using smaller vector size with up to N
> > +vector epilogue loops.
> > 
> >  @item vect-partial-vector-usage
> >  Controls when the loop vectorizer considers using partial vector loads
> > diff --git a/gcc/params.opt b/gcc/params.opt
> > index 4dab7a26f9b..c77472e7ad3 100644
> > --- a/gcc/params.opt
> > +++ b/gcc/params.opt
> > @@ -1175,7 +1175,7 @@ Common Joined UInteger Var(param_use_canonical_types) Init(1) IntegerRange(0, 1)
> >  Whether to use canonical types.
> > 
> >  -param=vect-epilogues-nomask=
> > -Common Joined UInteger Var(param_vect_epilogues_nomask) Init(1) IntegerRange(0, 1) Param Optimization
> > +Common Joined UInteger Var(param_vect_epilogues_nomask) Init(1) IntegerRange(0, 8) Param Optimization
> >  Enable loop epilogue vectorization using smaller vector size.
> > 
> >  -param=vect-max-layout-candidates=
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 41875683595..90802675a84 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -3721,6 +3721,10 @@ vect_analyze_loop (class loop *loop, gimple *loop_vectorized_call,
> >      partial_vectors_supported_p () && param_vect_partial_vector_usage != 0;
> >    poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo);
> > 
> > +  loop_vec_info orig_loop_vinfo = first_loop_vinfo;
> > +  unsigned n = param_vect_epilogues_nomask;
> > +  do
> > +    {
> >    while (1)
> >      {
> >        /* If the target does not support partial vectors we can shorten the
> > @@ -3744,7 +3748,7 @@ vect_analyze_loop (class loop *loop, gimple *loop_vectorized_call,
> >        bool fatal;
> >        opt_loop_vec_info loop_vinfo
> >     = vect_analyze_loop_1 (loop, shared, &loop_form_info,
> > -                          first_loop_vinfo,
> > +                          orig_loop_vinfo,
> >                            vector_modes, mode_i,
> >                            autodetected_vector_mode, fatal);
> >        if (fatal)
> > @@ -3769,17 +3773,24 @@ vect_analyze_loop (class loop *loop, gimple *loop_vectorized_call,
> >           loop_vinfo = opt_loop_vec_info::success (NULL);
> >         }
> > 
> > -     /* For now only allow one epilogue loop, but allow
> > -        pick_lowest_cost_p to replace it, so commit to the
> > -        first epilogue if we have no reason to try alternatives.  */
> > +     /* If we do not pick an alternative based on cost we're done.  */
> >       if (!pick_lowest_cost_p)
> >         break;
> >     }
> > 
> >        if (mode_i == vector_modes.length ())
> > -   break;
> > -
> > +   {
> > +     mode_i = 0;
> > +     break;
> > +   }
> > +    }
> > +  if (mode_i == vector_modes.length ())
> > +    break;
> > +  orig_loop_vinfo = orig_loop_vinfo->epilogue_vinfo;
> >      }
> > +  while (orig_loop_vinfo
> > +    && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (orig_loop_vinfo)
> > +    && --n != 0);
> > 
> >    if (first_loop_vinfo->epilogue_vinfo)
> >      {
> > --
> > 2.43.0
> 

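The do/while loop the quoted patch adds to vect_analyze_loop can be pictured roughly as the toy model below. This is a simplified sketch under stated assumptions, not the GCC implementation: struct toy_vinfo and chain_epilogues are hypothetical names, the candidates are assumed to be sorted by decreasing VF, and cost comparison (pick_lowest_cost_p) is left out. It keeps only three ideas from the patch: each epilogue is analyzed against the loop it follows, at most N epilogues are chained, and the chain ends once an epilogue uses partial vectors.

```c
#include <stddef.h>

/* Toy stand-in for a loop_vec_info with only the fields this model needs.  */
struct toy_vinfo
{
  unsigned vf;           /* Vectorization factor.  */
  int partial_vectors;   /* Nonzero if the loop uses partial vectors.  */
};

/* Hypothetical model: pick at most MAX_EPILOGUES epilogues for a main
   loop with VF MAIN_VF from CANDIDATES (assumed sorted by decreasing VF).
   Each epilogue must have a smaller VF than the loop it follows, and no
   further epilogue is added after one that uses partial vectors.
   The chosen epilogues are stored in CHAIN; returns how many were chosen.  */
static unsigned
chain_epilogues (unsigned main_vf, const struct toy_vinfo *candidates,
                 size_t ncand, unsigned max_epilogues,
                 struct toy_vinfo *chain)
{
  unsigned prev_vf = main_vf;
  unsigned count = 0;
  size_t i = 0;

  while (count < max_epilogues)
    {
      /* Skip candidates that are not smaller than the preceding loop.  */
      while (i < ncand && candidates[i].vf >= prev_vf)
        ++i;
      if (i == ncand)
        break;                    /* No usable mode left.  */
      chain[count++] = candidates[i];
      if (candidates[i].partial_vectors)
        break;                    /* Stop chaining after partial vectors.  */
      prev_vf = candidates[i].vf;
    }
  return count;
}
```

With candidates of VF 16, 8 and 4 and a main loop of VF 16, a limit of 2 yields the chain 8, 4 and a limit of 1 yields just 8, which loosely mirrors the "N vector epilogues" rows in the code-size table quoted above.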
-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
