unroll directly in vectorizer

Tamar Christina Mon, 02 Jun 2025 05:15:11 -0700

> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Monday, May 26, 2025 2:56 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> Subject: RE: [PATCH 1/2]middle-end: Apply loop->unroll directly in vectorizer
> 
> On Mon, 19 May 2025, Tamar Christina wrote:
> 
> > > >    /* Complete the target-specific cost calculations.  */
> > > >    loop_vinfo->vector_costs->finish_cost (loop_vinfo->scalar_costs);
> > > >    vec_prologue_cost = loop_vinfo->vector_costs->prologue_cost ();
> > > > @@ -12373,6 +12394,13 @@ vect_transform_loop (loop_vec_info
> loop_vinfo,
> > > gimple *loop_vectorized_call)
> > > >         dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling 
> > > > due to"
> > > >                          " variable-length vectorization factor\n");
> > > >      }
> > > > +
> > > > +  /* When we have unrolled the loop due to a user requested value we
> should
> > > > +     leave it up to the RTL unroll heuristics to determine if it's 
> > > > still worth
> > > > +     while to unroll more.  */
> > > > +  if (LOOP_VINFO_USER_UNROLL (loop_vinfo))
> > >
> > > What I meant with copying of LOOP_VINFO_USER_UNROLL is that I think
> > > you'll never get to this being true as you set the suggested unroll
> > > factor for the costing attempt of the not extra unrolled loop but
> > > the transform where you want to reset is is when the unrolling
> > > was actually applied?
> >
> > It was being set on every analysis of the main loop body.  Since it wasn't
> > actually cleared until we've picked a mode and did codegen the condition 
> > would
> > be true.
> >
> > However..
> >
> > >
> > > That said, it would be clearer if LOOP_VINFO_USER_UNROLL would be
> > > set in vect_analyze_loop_1 where we have
> > >
> >
> > I agree this is much nicer.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > -m32, -m64 and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >     * doc/extend.texi: Document pragma unroll interaction with vectorizer.
> >     * tree-vectorizer.h (LOOP_VINFO_USER_UNROLL): New.
> >     (class _loop_vec_info): Add user_unroll.
> >     * tree-vect-loop.cc (vect_analyze_loop_1 ): Set
> >     suggested_unroll_factor and retry.
> >     (_loop_vec_info::_loop_vec_info): Initialize user_unroll.
> >     (vect_transform_loop): Clear the loop->unroll value if the pragma was
> >     used.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     * gcc.target/aarch64/unroll-vect.c: New test.
> >
> > -- inline copy of patch --
> >
> > diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> > index
> e87a3c271f8420d8fd175823b5bb655f76c89afe..f8261d13903afc90d3341c09a
> b3fdbd0ab96ea49 100644
> > --- a/gcc/doc/extend.texi
> > +++ b/gcc/doc/extend.texi
> > @@ -10398,6 +10398,11 @@ unrolled @var{n} times regardless of any
> commandline arguments.
> >  When the option is @var{preferred} then the user is allowed to override the
> >  unroll amount through commandline options.
> >
> > +If the loop was vectorized the unroll factor specified will be used to 
> > seed the
> > +vectorizer unroll factor.  Whether the loop is unrolled or not will be
> > +determined by target costing.  The resulting vectorized loop may still be
> > +unrolled more in later passes depending on the target costing.
> > +
> >  @end table
> >
> >  @node Thread-Local
> > diff --git a/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
> b/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..3cb774ba95787ebee488fb
> e7306299ef28e6bb35
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
> > @@ -0,0 +1,20 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -march=armv8-a --param aarch64-autovec-
> preference=asimd-only -std=gnu99" } */
> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > +
> > +/*
> > +** f1:
> > +** ...
> > +** add     v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > +** add     v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > +** add     v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > +** add     v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > +** ...
> > +*/
> > +void f1 (int *restrict a, int n)
> > +{
> > +#pragma GCC unroll 16
> > +  for (int i = 0; i < n; i++)
> > +    a[i] *= 2;
> > +}
> > +
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index
> fe6f3cf188e40396b299ff9e814cc402bc2d4e2d..f215b6bc7881e7e659272cefbe
> 3d5c8892ef768c 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -1073,6 +1073,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in,
> vec_info_shared *shared)
> >      peeling_for_gaps (false),
> >      peeling_for_niter (false),
> >      early_breaks (false),
> > +    user_unroll (false),
> >      no_data_dependencies (false),
> >      has_mask_store (false),
> >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > @@ -3428,27 +3429,51 @@ vect_analyze_loop_1 (class loop *loop,
> vec_info_shared *shared,
> >                  res ? "succeeded" : "failed",
> >                  GET_MODE_NAME (loop_vinfo->vector_mode));
> >
> > -  if (res && !LOOP_VINFO_EPILOGUE_P (loop_vinfo) &&
> suggested_unroll_factor > 1)
> > +  auto user_unroll = LOOP_VINFO_LOOP (loop_vinfo)->unroll;
> > +  if (res && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> > +      /* Check to see if the user wants to unroll or if the target wants 
> > to.  */
> > +      && (suggested_unroll_factor > 1 || user_unroll > 1))
> >      {
> > -      if (dump_enabled_p ())
> > -   dump_printf_loc (MSG_NOTE, vect_location,
> > +      if (suggested_unroll_factor == 1)
> > +   {
> > +     int assumed_vf = vect_vf_for_cost (loop_vinfo);
> > +     int unroll_fact = user_unroll / assumed_vf;
> > +     suggested_unroll_factor = 1 << ceil_log2 (unroll_fact);
> 
> So with SLP we can have non-power-of-two vectorization factors, and
> certainly the extra unrolling we apply ontop of determining the VF
> has no additional restrictions.  So I woder why you go through
> 1 << ceil_log2 (unroll_fact) here instead of just using
> user_unroll / assumed_vf?  IMO a truncating division is OK, we could
> argue for rounding?  If we apply costing across modes there might
> be a mode with the unrolling exactly matching - I'm unsure if we
> want to adhere to #pragma GCC unroll irrespective of costing or not.
> 
> That said, as we simply do
> 
>   if (applying_suggested_uf)
>     LOOP_VINFO_VECT_FACTOR (loop_vinfo) *= loop_vinfo-
> >suggested_unroll_factor;
> 
> I'd say go with
> 
>    suggested_unroll_factor = user_unroll / assumed_vf;
> 
> ?
>


The reason for it was that there's an assumption in place that the final VF
is a power of two.  Which is enforced by

  if (!is_gimple_val (niters_vector))
    {
      var = create_tmp_var (type, "bnd");
      gimple_seq stmts = NULL;
      niters_vector = force_gimple_operand (niters_vector, &stmts, true, var);
      gsi_insert_seq_on_edge_immediate (pe, stmts);
      /* Peeling algorithm guarantees that vector loop bound is at least ONE,
         we set range information to make niters analyzer's life easier.
         Note the number of latch iteration value can be TYPE_MAX_VALUE so
         we have to represent the vector niter TYPE_MAX_VALUE + 1 >> log_vf.  */
      if (stmts != NULL && log_vf)

where log_vf is using exact_log2 which doesn't return null on non-power-of-2 
values,
it returns -1;

So the check is somewhat incorrect.

We'd need

      if (stmts != NULL && log_vf && ! integer_minus_onep (log_vf))
        {

But I didn't do so because we would then not set range information for 
niters_vector.

So instead I just made sure that the unfoll factor is a power of two, such that 
the multiplication
returns a power of two.

Would you like the extra !integer_minus_onep (log_vf)) instead? It does seem 
that many things
would be less optimal. E.g. in 

      /* Create: niters >> log2(vf) */
      /* If it's known that niters == number of latch executions + 1 doesn't
         overflow, we can generate niters >> log2(vf); otherwise we generate
         (niters - vf) >> log2(vf) + 1 by using the fact that we know ratio
         will be at least one.  */
      log_vf = build_int_cst (type, exact_log2 (const_vf));
      if (niters_no_overflow)
        niters_vector = fold_build2 (RSHIFT_EXPR, type, ni_minus_gap, log_vf);

even for a small non-power of two unroll factor like, 3 would then become -1, 
so niters_vector suddenly
becomes unknown.

Thanks,
Tamar

> The rest of the patch looks OK to me.
> 
> Richard.
> 
> 
> > +     if (suggested_unroll_factor > 1)
> > +       {
> > +         if (dump_enabled_p ())
> > +           dump_printf_loc (MSG_NOTE, vect_location,
> > +                    "setting unroll factor to %d based on user requested "
> > +                    "unroll factor %d and suggested vectorization "
> > +                    "factor: %d\n",
> > +                    suggested_unroll_factor, user_unroll, assumed_vf);
> > +       }
> > +   }
> > +
> > +   if (suggested_unroll_factor > 1)
> > +     {
> > +       if (dump_enabled_p ())
> > +         dump_printf_loc (MSG_NOTE, vect_location,
> >                      "***** Re-trying analysis for unrolling"
> >                      " with unroll factor %d and slp %s.\n",
> >                      suggested_unroll_factor,
> >                      slp_done_for_suggested_uf ? "on" : "off");
> > -      loop_vec_info unroll_vinfo
> > -   = vect_create_loop_vinfo (loop, shared, loop_form_info, NULL);
> > -      unroll_vinfo->vector_mode = vector_mode;
> > -      unroll_vinfo->suggested_unroll_factor = suggested_unroll_factor;
> > -      opt_result new_res = vect_analyze_loop_2 (unroll_vinfo, fatal, NULL,
> > -                                           slp_done_for_suggested_uf);
> > -      if (new_res)
> > -   {
> > -     delete loop_vinfo;
> > -     loop_vinfo = unroll_vinfo;
> > -   }
> > -      else
> > -   delete unroll_vinfo;
> > +       loop_vec_info unroll_vinfo
> > +           = vect_create_loop_vinfo (loop, shared, loop_form_info, NULL);
> > +       unroll_vinfo->vector_mode = vector_mode;
> > +       unroll_vinfo->suggested_unroll_factor = suggested_unroll_factor;
> > +       opt_result new_res
> > +           = vect_analyze_loop_2 (unroll_vinfo, fatal, NULL,
> > +                                  slp_done_for_suggested_uf);
> > +       if (new_res)
> > +         {
> > +           delete loop_vinfo;
> > +           loop_vinfo = unroll_vinfo;
> > +           LOOP_VINFO_USER_UNROLL (loop_vinfo) = user_unroll > 1;
> > +         }
> > +       else
> > +         delete unroll_vinfo;
> > +     }
> >      }
> >
> >    /* Remember the autodetected vector mode.  */
> > @@ -12373,6 +12398,13 @@ vect_transform_loop (loop_vec_info loop_vinfo,
> gimple *loop_vectorized_call)
> >     dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling due to"
> >                      " variable-length vectorization factor\n");
> >      }
> > +
> > +  /* When we have unrolled the loop due to a user requested value we should
> > +     leave it up to the RTL unroll heuristics to determine if it's still 
> > worth
> > +     while to unroll more.  */
> > +  if (LOOP_VINFO_USER_UNROLL (loop_vinfo))
> > +    loop->unroll = 0;
> > +
> >    /* Free SLP instances here because otherwise stmt reference counting
> >       won't work.  */
> >    slp_instance instance;
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index
> a2f33a5ecd60288fe7f28ee639ff8b6a77667796..8fd8c10ec64f7241d6b097491f
> 84400164893911 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -970,6 +970,10 @@ public:
> >    /* Main loop IV cond.  */
> >    gcond* loop_iv_cond;
> >
> > +  /* True if we have an unroll factor requested by the user through pragma 
> > GCC
> > +     unroll.  */
> > +  bool user_unroll;
> > +
> >    /* True if there are no loop carried data dependencies in the loop.
> >       If loop->safelen <= 1, then this is always true, either the loop
> >       didn't have any loop carried data dependencies, or the loop is being
> > @@ -1094,6 +1098,7 @@ public:
> >  #define LOOP_VINFO_CHECK_UNEQUAL_ADDRS(L)  (L)->check_unequal_addrs
> >  #define LOOP_VINFO_CHECK_NONZERO(L)        (L)->check_nonzero
> >  #define LOOP_VINFO_LOWER_BOUNDS(L)         (L)->lower_bounds
> > +#define LOOP_VINFO_USER_UNROLL(L)          (L)->user_unroll
> >  #define LOOP_VINFO_GROUPED_STORES(L)       (L)->grouped_stores
> >  #define LOOP_VINFO_SLP_INSTANCES(L)        (L)->slp_instances
> >  #define LOOP_VINFO_SLP_UNROLLING_FACTOR(L) (L)->slp_unrolling_factor
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

RE: [PATCH 1/2]middle-end: Apply loop->unroll directly in vectorizer

Reply via email to