Re: [PATCH] [x86] [RFC] Prevent loop vectorization if it's in a deeply nested big loop.

Richard Biener Fri, 29 Nov 2024 00:16:25 -0800

On Fri, Nov 29, 2024 at 2:30 AM Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Thu, Nov 28, 2024 at 4:57 PM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Thu, Nov 28, 2024 at 3:04 AM Hongtao Liu <crazy...@gmail.com> wrote:
> > >
> > > On Wed, Nov 27, 2024 at 9:43 PM Richard Biener
> > > <richard.guent...@gmail.com> wrote:
> > > >
> > > > On Wed, Nov 27, 2024 at 4:26 AM liuhongt <hongtao....@intel.com> wrote:
> > > > >
> > > > > When loop requires any kind of versioning which could increase 
> > > > > register
> > > > > pressure too much, and it's in a deeply nest big loop, don't do
> > > > > vectorization.
> > > > >
> > > > > I tested the patch with both Ofast and O2 for SPEC2017, besides 
> > > > > 548.exchange_r,
> > > > > other benchmarks are same binary.
> > > > >
> > > > > Bootstrapped and regtested 0on x86_64-pc-linux-gnu{-m32,}
> > > > > Any comments?
> > > >
> > > > The vectorizer tries to version an outer loop when vectorizing a loop 
> > > > nest
> > > > and the versioning condition is invariant.  See vect_loop_versioning.  
> > > > This
> > > > tries to handle such cases.  Often the generated runtime alias checks 
> > > > are
> > > > not invariant because we do not consider the outer evolutions.  I think 
> > > > we
> > > > should instead fix this there.
> > > >
> > > > Question below ...
> > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         pr target/117088
> > > > >         * config/i386/i386.cc
> > > > >         (ix86_vector_costs::ix86_vect_in_deep_nested_loop_p): New 
> > > > > function.
> > > > >         (ix86_vector_costs::finish_cost): Prevent loop vectorization
> > > > >         if it's in a deeply nested loop and require versioning.
> > > > >         * config/i386/i386.opt (--param=vect-max-loop-depth=): New
> > > > >         param.
> > > > > ---
> > > > >  gcc/config/i386/i386.cc  | 89 
> > > > > ++++++++++++++++++++++++++++++++++++++++
> > > > >  gcc/config/i386/i386.opt |  4 ++
> > > > >  2 files changed, 93 insertions(+)
> > > > >
> > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > > index 526c9df7618..608f40413d2 100644
> > > > > --- a/gcc/config/i386/i386.cc
> > > > > +++ b/gcc/config/i386/i386.cc
> > > > > @@ -25019,6 +25019,8 @@ private:
> > > > >
> > > > >    /* Estimate register pressure of the vectorized code.  */
> > > > >    void ix86_vect_estimate_reg_pressure ();
> > > > > +  /* Check if vect_loop is in a deeply-nested loop.  */
> > > > > +  bool ix86_vect_in_deep_nested_loop_p (class loop *vect_loop);
> > > > >    /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's 
> > > > > used for
> > > > >       estimation of register pressure.
> > > > >       ??? Currently it's only used by vec_construct/scalar_to_vec
> > > > > @@ -25324,6 +25326,84 @@ 
> > > > > ix86_vector_costs::ix86_vect_estimate_reg_pressure ()
> > > > >      }
> > > > >  }
> > > > >
> > > > > +/* Return true if vect_loop is in a deeply-nested loop.
> > > > > +   .i.e vect_loop_n in below loop structure.
> > > > > +loop1
> > > > > +{
> > > > > + loop2
> > > > > + {
> > > > > +  loop3
> > > > > +  {
> > > > > +   vect_loop_1;
> > > > > +   loop4
> > > > > +   {
> > > > > +    vect_loop_2;
> > > > > +    loop5
> > > > > +    {
> > > > > +     vect_loop_3;
> > > > > +     loop6
> > > > > +     {
> > > > > +      vect_loop_4;
> > > > > +      loop7
> > > > > +      {
> > > > > +       vect_loop_5;
> > > > > +       loop8
> > > > > +       {
> > > > > +       loop9
> > > > > +       }
> > > > > +      vect_loop_6;
> > > > > +      }
> > > > > +     vect_loop_7;
> > > > > +     }
> > > > > +    }
> > > > > +   }
> > > > > + }
> > > > > + It's a big hammer to fix O2 regression for 548.exchange_r after 
> > > > > vectorization
> > > > > + is enhanced by (r15-4225-g70c3db511ba14f)  */
> > > > > +bool
> > > > > +ix86_vector_costs::ix86_vect_in_deep_nested_loop_p (class loop 
> > > > > *vect_loop)
> > > > > +{
> > > > > +  if (loop_depth (vect_loop) > (unsigned) ix86_vect_max_loop_depth)
> > > > > +    return true;
> > > > > +
> > > > > +  if (loop_depth (vect_loop) < 2)
> > > > > +    return false;
> > > > > +
> > > >
> > > > while the above two are "obvious", what you check below isn't clear to 
> > > > me.
> > > > Is this trying to compute whether 'vect_loop' is inside of a loop nest 
> > > > which
> > > > at any sibling of vect_loop (or even sibling of an outer loop of 
> > > > vect_loop,
> > > > recursively) is a sub-nest with a loop depth (relative to what?) exceeds
> > > > ix86_vect_max_loop_depth?
> > > Yes, the function tries to find if the vect_loop is in a "big outer
> > > loop" which contains an innermost loop with loop_depth >
> > > ix86_vect_max_loop_depth.
> > > If yes, then prevent vectorization for the loop if its tripcount is
> > > not constant VF-times.(requires any kind of versioning is not
> > > accurate, and yes it's a big hammer.)
> >
> > I'll note it also doesn't seem to look at register pressure at all or limit
> > the cut-off to the very-cheap cost model?
> The default parameter ix86_vect_max_loop_depth implies the register
> pressure, for each layer of loop, it generally needs 2 registers: 1 iv
> + 1 tripcount.
> ix86_vect_max_loop_depth > 8 will run out of 16 registers. The
> vectoriation for unknown tripcountl increases 1 register for the "new
> tripcount of main vectorized loop", and it causes extra spill in the
> outer loop.


Hmm, yeah - one of my long-term TODO is to re-do how the vectorizer
generates the main IV code to use a decrement to zero remaining
scalar iterations for prologue/vector/epilogue loops.  The disadvantage
is we'd get a decrement by VF and test >= VF in the vector loop
rather than decrement by one and test for zero, but I think IVOPTs
will eventually recover this IV.

My experience with exchange is that LRA spills the "wrong" regs here
and that IVOPTs only considers a loop in isolation rather than a set
of sibling loops within an outer loop for example.

At some point I thought of doing aggressive unswitching of outer loops
on 'mod(row,3)' somehow, that might reduce the complexity of the
nests considerably, enabling IRA/LRA to do a better job eventually.

But anyway, I think exchange is special enough that we shouldn't
worry about the regression too much - at least not as to invent
these kind of odd heuristics for it.

> if the tripcount of vect_loop is big, then the penalty can be
> compensated by the vectorization, but for exchange_r, the tripcount is
> small, that's why it regresses performance.
> The range info didn't get real range for the tripcount since fortran
> allows negative index for the array.
> I'm not sure if ivopt or ira could be improved for this case, so maybe
> we should live with this regression.
> Restricting the cut off with the very-cheap cost model would also be a
> good idea.
> >
> > That said, it feels like a hack specifically for 548.exchange_r, in 
> > particular
> > vectorization by itself shouldn't increase register pressure (much), but
> > exchange is known to operate on the bounds of "awful" with regard to
> > register pressure.  If you'd enable APX would exchange benefit from
> > vectorizing?
> APX would fix this(it even improves the exchange w/o vectorization)
> >
> > That said, I think we have to live with the regression, the change feels
> > odd and a strcmp (main_file_name, "exchange") would be similar.  So
> > we either need to make the pattern matching more precise, like counting
> > live IVs from the loop nest plus applying heuristics on how vectorization
> > increases register pressure (maybe it's an IVOPTs issue in the end?),
> > or defer a solution to GCC 16.
> >
> > Richard.
> >
> > > >
> > > > > +  class loop* outer_loop = loop_outer (vect_loop);
> > > > > +
> > > > > +  auto_vec<class loop*> m_loop_stack;
> > > > > +  auto_sbitmap m_visited_loops (number_of_loops (cfun));
> > > > > +
> > > > > +  /* Get all sibling loops for vect_loop.  */
> > > > > +  class loop* next_loop = outer_loop->inner;
> > > > > +  for (; next_loop; next_loop = next_loop->next)
> > > > > +    {
> > > > > +      m_loop_stack.safe_push (next_loop);
> > > > > +      bitmap_set_bit (m_visited_loops, next_loop->num);
> > > > > +    }
> > > > > +
> > > > > +  /* DFS the max depth of all sibling loop.  */
> > > > > +  while (!m_loop_stack.is_empty ())
> > > > > +    {
> > > > > +      next_loop = m_loop_stack.pop ();
> > > > > +      if (loop_depth (next_loop) > (unsigned) 
> > > > > ix86_vect_max_loop_depth)
> > > > > +       return true;
> > > > > +
> > > > > +      class loop* inner_loop = next_loop->inner;
> > > > > +      while (inner_loop)
> > > > > +       {
> > > > > +         if (!bitmap_bit_p (m_visited_loops, inner_loop->num))
> > > > > +           {
> > > > > +             m_loop_stack.safe_push (inner_loop);
> > > > > +             bitmap_set_bit (m_visited_loops, inner_loop->num);
> > > > > +           }
> > > > > +         inner_loop = inner_loop->next;
> > > > > +       }
> > > > > +    }
> > > > > +
> > > > > +  return false;
> > > > > +}
> > > > > +
> > > > >  void
> > > > >  ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > > > >  {
> > > > > @@ -25344,6 +25424,15 @@ ix86_vector_costs::finish_cost (const 
> > > > > vector_costs *scalar_costs)
> > > > >           && (exact_log2 (LOOP_VINFO_VECT_FACTOR 
> > > > > (loop_vinfo).to_constant ())
> > > > >               > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
> > > > >         m_costs[vect_body] = INT_MAX;
> > > > > +
> > > > > +      /* Prohibit vectorization when the loop requires versioning
> > > > > +        and loop_depth exceeds threshold.  */
> > > > > +      if ((LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > > > > +          || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > > > > +          || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > > > > +          || vect_apply_runtime_profitability_check_p (loop_vinfo))
> > > >
> > > > this also applies to loops not requiring versioning - it practically 
> > > > applies to
> > > > all loops that do not run for a constant VF-times times.
> > > >
> > > > > +         && ix86_vect_in_deep_nested_loop_p (LOOP_VINFO_LOOP 
> > > > > (loop_vinfo)))
> > > > > +       m_costs[vect_body] = INT_MAX;
> > > > >      }
> > > > >
> > > > >    ix86_vect_estimate_reg_pressure ();
> > > > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > > > index 99e86f545e8..c5abf83473d 100644
> > > > > --- a/gcc/config/i386/i386.opt
> > > > > +++ b/gcc/config/i386/i386.opt
> > > > > @@ -1414,3 +1414,7 @@ Support MOVRS built-in functions and code 
> > > > > generation.
> > > > >  mamx-movrs
> > > > >  Target Mask(ISA2_AMX_MOVRS) Var(ix86_isa_flags2) Save
> > > > >  Support AMX-MOVRS built-in functions and code generation.
> > > > > +
> > > > > +-param=vect-max-loop-depth=
> > > > > +Target Joined UInteger Var(ix86_vect_max_loop_depth) Init(8) Param
> > > > > +Preversion loop vectorization when it's in a deeply nested loop and 
> > > > > requires versioning, since it may increase register pressure too much.
> > > > > --
> > > > > 2.34.1
> > > > >
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
>
>
>
> --
> BR,
> Hongtao

Re: [PATCH] [x86] [RFC] Prevent loop vectorization if it's in a deeply nested big loop.

Reply via email to