Re: [PATCH 1v2/3][vect] Add main vectorized loop unrolling

Andre Vieira (lists) via Gcc-patches Tue, 12 Oct 2021 03:35:14 -0700

Hi Richi,

I think this is what you meant: I now hide all the unrolling cost calculations in the existing target hooks for costs. I did need to adjust 'finish_cost' to take the loop_vinfo so the target's implementations are able to set the newly renamed 'suggested_unroll_factor'.


Also added the checks for the epilogue's VF.

Is this more like what you had in mind?
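
For readers following along, here is a rough standalone sketch of how a target's finish_cost implementation could use the new first argument to record a suggestion. The types sketch_loop_vinfo and sketch_cost_data, the function sketch_finish_cost and the "cheap body" threshold are all hypothetical stand-ins made up for illustration; they are not the GCC types, and no target in this patch uses such a heuristic.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; these are NOT the GCC types.
struct sketch_loop_vinfo
{
  unsigned suggested_unroll_factor = 1;
};

struct sketch_cost_data
{
  unsigned prologue = 0;
  unsigned body = 0;
  unsigned epilogue = 0;
};

// Mirrors the shape of the extended hook: the vinfo comes first and may be
// null, as it is when the patch costs the single scalar iteration.
void
sketch_finish_cost (sketch_loop_vinfo *vinfo, const sketch_cost_data *data,
                    unsigned *prologue_cost, unsigned *body_cost,
                    unsigned *epilogue_cost)
{
  assert (data && prologue_cost && body_cost && epilogue_cost);
  *prologue_cost = data->prologue;
  *body_cost = data->body;
  *epilogue_cost = data->epilogue;

  // Made-up heuristic: suggest unrolling by 2 when the loop body is cheap.
  const unsigned cheap_body_threshold = 8;
  if (vinfo && data->body > 0 && data->body <= cheap_body_threshold)
    vinfo->suggested_unroll_factor = 2;
}
```

The point of the sketch is only that the suggestion is made at the place where the target already has the accumulated costs in hand, and that callers without a loop_vinfo can pass null.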


gcc/ChangeLog:

        * config/aarch64/aarch64.c (aarch64_finish_cost): Add class vec_info
        parameter.
        * config/i386/i386.c (ix86_finish_cost): Likewise.
        * config/rs6000/rs6000.c (rs6000_finish_cost): Likewise.
        * doc/tm.texi: Document changes to TARGET_VECTORIZE_FINISH_COST.
        * target.def: Add class vec_info parameter to finish_cost.
        * targhooks.c (default_finish_cost): Likewise.
        * targhooks.h (default_finish_cost): Likewise.
        * tree-vect-loop.c (vect_determine_vectorization_factor): Use
        suggested_unroll_factor to increase vectorization_factor if possible.
        (_loop_vec_info::_loop_vec_info): Add suggested_unroll_factor member.
        (vect_compute_single_scalar_iteration_cost): Adjust call to
        finish_cost.
        (vect_determine_partial_vectors_and_peeling): Ensure unrolled loop is
        not predicated.
        (vect_determine_unroll_factor): New.
        (vect_try_unrolling): New.
        (vect_reanalyze_as_main_loop): Also try to unroll when reanalyzing as
        main loop.
        (vect_analyze_loop): Add call to vect_try_unrolling and check to
        ensure epilogue is either a smaller VF than main loop or uses partial
        vectors and might be of equal VF.
        (vect_estimate_min_profitable_iters): Adjust call to finish_cost.
        (vectorizable_reduction): Make sure to not use single_defuse_cycle
        when unrolling.
        * tree-vect-slp.c (vect_bb_vectorization_profitable_p): Adjust call
        to finish_cost.
        * tree-vectorizer.h (finish_cost): Change to pass new class vec_info
        parameter.


On 01/10/2021 09:19, Richard Biener wrote:

On Thu, 30 Sep 2021, Andre Vieira (lists) wrote:

Hi,

That just forces trying the vector modes we've tried before. Though I might
need to revisit this now I think about it. I'm afraid it might be possible for
this to generate an epilogue with a vf that is not lower than that of the main
loop, but I'd need to think about this again.

Either way I don't think this changes the vector modes used for the epilogue.
But maybe I'm just missing your point here.

Yes, I was referring to the above which suggests that when we vectorize
the main loop with V4SF but unroll then we try vectorizing the
epilogue with V4SF as well (but not unrolled).  I think that's
premature (not sure if you try V8SF if the main loop was V4SF but
unrolled 4 times).

My main motivation for this was because I had a SVE loop that vectorized with
both VNx8HI, then V8HI which beat VNx8HI on cost, then it decided to unroll
V8HI by two and skipped using VNx8HI as a predicated epilogue which would've
been the best choice.

I see, yes - for fully predicated epilogues it makes sense to consider
the same vector mode as for the main loop anyways (independent on
whether we're unrolling or not).  One could argue that with an
unrolled V4SImode main loop a predicated V8SImode epilogue would also
be a good match (but then somehow costing favored the unrolled V4SI
over the V8SI for the main loop...).

So that is why I decided to just 'reset' the vector_mode selection. In a
scenario where you only have the traditional vector modes it might make less
sense.

Just realized I still didn't add any check to make sure the epilogue has a
lower VF than the previous loop, though I'm still not sure that could happen.
I'll go look at where to add that if you agree with this.

As said above, it only needs a lower VF in case the epilogue is not
fully masked - otherwise the same VF would be OK.
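
As a small illustration of that rule, the following standalone sketch uses plain unsigned integers instead of the poly_uint64 values the vectorizer works with, so it ignores the known_lt/maybe_eq distinction that matters for variable-length vectors. The function name sketch_epilogue_vf_ok is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: decide whether a candidate epilogue is acceptable
// relative to the main loop.  Constant VFs only; the real code would have to
// use poly_int comparisons.
bool
sketch_epilogue_vf_ok (unsigned epilogue_vf, unsigned main_vf,
                       bool epilogue_uses_partial_vectors)
{
  assert (main_vf > 0);
  // A strictly smaller VF is always fine.
  if (epilogue_vf < main_vf)
    return true;
  // The same VF is only fine when the epilogue is fully masked.
  return epilogue_uses_partial_vectors && epilogue_vf == main_vf;
}
```

For example, with a main loop VF of 8, an unpredicated epilogue with VF 8 is rejected while a predicated one with VF 8 is accepted.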

I can move it there, it would indeed remove the need for the change to
vect_update_vf_for_slp, the change to
vect_determine_partial_vectors_and_peeling would still be required I think.
It is meant to disable using partial vectors in an unrolled loop.

Why would we disable the use of partial vectors in an unrolled loop?

The motivation behind that is that the overhead caused by generating
predicates for each iteration will likely be too much for it to be profitable
to unroll. On top of that, when dealing with low iteration count loops, if
executing one predicated iteration would be enough we now still need to
execute all other unrolled predicated iterations, whereas if we keep them
unrolled we skip the unrolled loops.
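
The low-iteration-count argument can be put into numbers with a deliberately simplified model. The sketch below counts how many copies of the vector body get executed for n scalar iterations; it assumes constant VFs, ignores the cost of generating masks, and assumes the alternative is an unpredicated unrolled main loop followed by a predicated, not unrolled epilogue of the same VF. Both function names are hypothetical.

```cpp
#include <cassert>

// Copies of the vector body executed when the unrolled loop itself is
// predicated: every trip through the loop runs all UNROLL copies, even if
// most of them are fully masked off.
unsigned
sketch_copies_predicated_unrolled (unsigned n, unsigned vf, unsigned unroll)
{
  assert (vf > 0 && unroll > 0);
  unsigned step = vf * unroll;
  return ((n + step - 1) / step) * unroll;
}

// Copies executed with an unpredicated unrolled main loop plus a predicated,
// not unrolled epilogue using the same VF.
unsigned
sketch_copies_unrolled_plus_epilogue (unsigned n, unsigned vf, unsigned unroll)
{
  assert (vf > 0 && unroll > 0);
  unsigned step = vf * unroll;
  unsigned main_copies = (n / step) * unroll;
  unsigned remainder = n % step;
  return main_copies + (remainder + vf - 1) / vf;
}
```

Under this model, with n = 3, VF = 4 and an unroll factor of 4, the predicated unrolled loop executes 4 copies of the body while the other arrangement executes 1.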

OK, I guess we're not factoring in costs when deciding on predication
but go for it if it's generally enabled and possible.

With the proposed scheme we'd then cost the predicated not unrolled
loop against a not predicated unrolled loop which might be a bit
apples vs. oranges also because the target made the unroll decision
based on the data it collected for the predicated loop.

Sure but I'm suggesting you keep the not unrolled body as one way of
costed vectorization but then if the target says "try unrolling"
re-do the analysis with the same mode but a larger VF.  Just like
we iterate over vector modes you'll now iterate over pairs of
vector mode + VF (unroll factor).  It's not about re-using the costing
it's about using costing that is actually relevant and also to avoid
targets inventing two distinct separate costings - a target (powerpc)
might already compute load/store density and other stuff for the main
costing so it should have an idea whether doubling or triplicating is OK.
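
The iteration scheme described above might look roughly like the following standalone sketch. It is not the vectorizer's code: sketch_candidate, sketch_analyze_fn and sketch_pick_best are hypothetical names, the single "cost" number is a made-up metric where lower is better, and the analysis itself is supplied by the caller as a callback.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical result of analyzing one (vector mode, unroll factor) pair.
struct sketch_candidate
{
  std::string mode;
  unsigned vf;
  unsigned cost;             // Made-up metric, lower is better.
  unsigned suggested_unroll; // What the target's costing suggested.
};

using sketch_analyze_fn
  = std::function<std::optional<sketch_candidate> (const std::string &mode,
                                                   unsigned unroll)>;

// Analyze each mode not unrolled first; if the target's costing suggests
// unrolling, re-do the analysis with the same mode and a larger VF, and keep
// whichever candidate is cheapest overall.
std::optional<sketch_candidate>
sketch_pick_best (const std::vector<std::string> &modes,
                  const sketch_analyze_fn &analyze)
{
  assert (analyze != nullptr);
  std::optional<sketch_candidate> best;
  auto consider = [&best] (const std::optional<sketch_candidate> &c)
    {
      if (c && (!best || c->cost < best->cost))
        best = c;
    };

  for (const auto &mode : modes)
    {
      auto base = analyze (mode, 1);
      consider (base);
      if (base && base->suggested_unroll > 1)
        consider (analyze (mode, base->suggested_unroll));
    }
  return best;
}
```

The design point this is meant to show is that the unrolled variant is costed by actually re-running the analysis, rather than by extrapolating from the costs of the not unrolled body.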

Richard.

Sounds good! I changed the patch to determine the unrolling factor later,
after all analysis has been done and retry analysis if an unrolling factor
larger than 1 has been chosen for this loop and vector_mode.

gcc/ChangeLog:

         * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR.
         * doc/tm.texi.in: Add entries for TARGET_VECTORIZE_UNROLL_FACTOR.
         * params.opt: Add vect-unroll and vect-unroll-reductions
         parameters.

What's the reason to add the --params?  It looks like this makes
us unroll with a static number short-cutting the target.

IMHO that's never going to be a great thing - but what we could do
is look at loop->unroll and try to honor that (factoring in that
the vectorization factor is already the times we unroll).

So I'd leave those params out for now, the user would have a much
more fine-grained way to control this with the unroll pragma.

Adding a max-vect-unroll parameter would be another thing but that
would apply after the targets or pragma decision.

         * target.def: Define hook TARGET_VECTORIZE_UNROLL_FACTOR.

I still do not like the new target hook - as said I'd like you to
have the finish_cost hook allow the target to specify
a suggested unroll factor instead because that's the point where
it has all the info.

Thanks,
Richard.

         * targhooks.c (default_unroll_factor): New.
         * targhooks.h (default_unroll_factor): Likewise.
         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
         par_unrolling_factor.
         (vect_determine_partial_vectors_and_peeling): Account for
unrolling.
         (vect_determine_unroll_factor): New.
         (vect_try_unrolling): New.
         (vect_reanalyze_as_main_loop): Call vect_try_unrolling when
         retrying a loop_vinfo as a main loop.
         (vect_analyze_loop): Call vect_try_unrolling when vectorizing
main loops.
         (vect_analyze_loop): Allow for epilogue vectorization when unrolling
         and rewalk vector_mode array for the epilogues.
         (vectorizable_reduction): Disable single_defuse_cycle when
unrolling.
         * tree-vectorizer.h (vect_unroll_value): Declare par_unrolling_factor
         as a member of loop_vec_info.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 36519ccc5a58abab483c38d0a6c5f039592bfc7f..e6ccb66ba41895c4583a959d03ac3f0f173adae6 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -15972,8 +15972,9 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
 
 /* Implement TARGET_VECTORIZE_FINISH_COST.  */
 static void
-aarch64_finish_cost (void *data, unsigned *prologue_cost,
-                    unsigned *body_cost, unsigned *epilogue_cost)
+aarch64_finish_cost (class vec_info *vinfo ATTRIBUTE_UNUSED, void *data,
+                    unsigned *prologue_cost, unsigned *body_cost,
+                    unsigned *epilogue_cost)
 {
   auto *costs = static_cast<aarch64_vector_costs *> (data);
   *prologue_cost = costs->region[vect_prologue];
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index afc2674d49da370ae0f5ef277df7e9954f303b8e..de7bb9fe62fcec53ee40a4798f24c6ccd4584736 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -23048,8 +23048,9 @@ ix86_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
-ix86_finish_cost (void *data, unsigned *prologue_cost,
-                 unsigned *body_cost, unsigned *epilogue_cost)
+ix86_finish_cost (class vec_info *vinfo ATTRIBUTE_UNUSED, void *data,
+                 unsigned *prologue_cost, unsigned *body_cost,
+                 unsigned *epilogue_cost)
 {
   unsigned *cost = (unsigned *) data;
   *prologue_cost = cost[vect_prologue];
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index ad81dfb316dff00cde810d6b1edd31fa49d5c1e8..6f674b6426284dbf9b9f8fdd85515cf9702adff6 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5551,8 +5551,9 @@ rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
-rs6000_finish_cost (void *data, unsigned *prologue_cost,
-                   unsigned *body_cost, unsigned *epilogue_cost)
+rs6000_finish_cost (class vec_info *vinfo ATTRIBUTE_UNUSED, void *data,
+                   unsigned *prologue_cost, unsigned *body_cost,
+                   unsigned *epilogue_cost)
 {
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index be8148583d8571b0d035b1938db9d056bfd213a8..05ddd4c58a3711dd949b28da3e61fb49d8175257 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6276,7 +6276,7 @@ return value should be viewed as a tentative cost that may later be
 revised.
 @end deftypefn
 
-@deftypefn {Target Hook} void TARGET_VECTORIZE_FINISH_COST (void *@var{data}, unsigned *@var{prologue_cost}, unsigned *@var{body_cost}, unsigned *@var{epilogue_cost})
+@deftypefn {Target Hook} void TARGET_VECTORIZE_FINISH_COST (class vec_info *@var{vinfo}, void *@var{data}, unsigned *@var{prologue_cost}, unsigned *@var{body_cost}, unsigned *@var{epilogue_cost})
 This hook should complete calculations of the cost of vectorizing a loop
 or basic block based on @var{data}, and return the prologue, body, and
 epilogue costs as unsigned integers.  The default returns the value of
diff --git a/gcc/target.def b/gcc/target.def
index bfa819609c21bd71c0cc585c01dba42534453f47..f0be0e10a9225dd75b013535d8e42c1d1bfe8f50 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2081,8 +2081,8 @@ or basic block based on @var{data}, and return the prologue, body, and\n\
 epilogue costs as unsigned integers.  The default returns the value of\n\
 the three accumulators.",
  void,
- (void *data, unsigned *prologue_cost, unsigned *body_cost,
-  unsigned *epilogue_cost),
+ (class vec_info *vinfo, void *data, unsigned *prologue_cost,
+  unsigned *body_cost, unsigned *epilogue_cost),
  default_finish_cost)
 
 /* Function to delete target-specific cost modeling data.  */
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 92d51992e625c2497aa8496b1e2e3d916e5706fd..6fd1fade49cfe00295afd52aee7a34931bb48b92 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -123,7 +123,8 @@ extern unsigned default_add_stmt_cost (class vec_info *, void *, int,
                                       enum vect_cost_for_stmt,
                                       class _stmt_vec_info *, tree, int,
                                       enum vect_cost_model_location);
-extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
+extern void default_finish_cost (class vec_info *, void *, unsigned *,
+                                unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
 
 /* OpenACC hooks.  */
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index c9b5208853dbc15706a65d1eb335e28e0564325e..0a3ecfa76406152ce79aaf19c5a2cc8b652936ff 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1518,8 +1518,9 @@ default_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 /* By default, the cost model just returns the accumulated costs.  */
 
 void
-default_finish_cost (void *data, unsigned *prologue_cost,
-                    unsigned *body_cost, unsigned *epilogue_cost)
+default_finish_cost (class vec_info *vinfo, void *data,
+                    unsigned *prologue_cost, unsigned *body_cost,
+                    unsigned *epilogue_cost)
 {
   unsigned *cost = (unsigned *) data;
   *prologue_cost = cost[vect_prologue];
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 5a5b8da2e771a1dd204f22a6447eba96bb3b352c..50256cb6cb478246e3402162391096cbbc7fde94 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -365,6 +365,24 @@ vect_determine_vectorization_factor (loop_vec_info loop_vinfo)
   if (known_le (vectorization_factor, 1U))
     return opt_result::failure_at (vect_location,
                                   "not vectorized: unsupported data-type\n");
+  /* Apply the unrolling factor.  This was determined by
+     vect_determine_unroll_factor the first time we ran the analysis for this
+     vector mode.  */
+  if (loop_vinfo->suggested_unroll_factor > 1)
+    {
+      unsigned unrolling_factor = loop_vinfo->suggested_unroll_factor;
+      while (unrolling_factor > 1)
+       {
+         poly_uint64 candidate_factor = vectorization_factor * unrolling_factor;
+         if (estimated_poly_value (candidate_factor, POLY_VALUE_MAX)
+             <= (HOST_WIDE_INT) LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo))
+           {
+             vectorization_factor = candidate_factor;
+             break;
+           }
+         unrolling_factor /= 2;
+       }
+    }
   LOOP_VINFO_VECT_FACTOR (loop_vinfo) = vectorization_factor;
   return opt_result::success ();
 }
@@ -828,6 +846,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     skip_main_loop_edge (nullptr),
     skip_this_loop_edge (nullptr),
     reusable_accumulators (),
+    suggested_unroll_factor (1),
     max_vectorization_factor (0),
     mask_skip_niters (NULL_TREE),
     rgroup_compare_type (NULL_TREE),
@@ -1301,7 +1320,7 @@ vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
                          si->kind, si->stmt_info, si->vectype,
                          si->misalign, si->where);
   unsigned prologue_cost = 0, body_cost = 0, epilogue_cost = 0;
-  finish_cost (target_cost_data, &prologue_cost, &body_cost,
+  finish_cost (NULL, target_cost_data, &prologue_cost, &body_cost,
               &epilogue_cost);
   destroy_cost_data (target_cost_data);
   LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST (loop_vinfo)
@@ -2128,10 +2147,16 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
         vectors to the epilogue, with the main loop continuing to operate
         on full vectors.
 
+        If we are unrolling we also do not want to use partial vectors. This
+        is to avoid the overhead of generating multiple masks and also to
+        avoid having to execute entire iterations of FALSE masked instructions
+        when dealing with one or less full iterations.
+
         ??? We could then end up failing to use partial vectors if we
         decide to peel iterations into a prologue, and if the main loop
         then ends up processing fewer than VF iterations.  */
-      if (param_vect_partial_vector_usage == 1
+      if ((param_vect_partial_vector_usage == 1
+          || loop_vinfo->suggested_unroll_factor > 1)
          && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
          && !vect_known_niters_smaller_than_vf (loop_vinfo))
        LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
@@ -2879,6 +2904,121 @@ vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo,
   return true;
 }
 
+/* Determine whether we should unroll this loop and ask target how much to
+   unroll by.  */
+
+static opt_loop_vec_info
+vect_determine_unroll_factor (loop_vec_info loop_vinfo)
+{
+  stmt_vec_info stmt_info;
+  unsigned i;
+  bool seen_reduction_p = false;
+  poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+  FOR_EACH_VEC_ELT (loop_vinfo->stmt_vec_infos, i, stmt_info)
+    {
+      if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+         || !STMT_VINFO_RELEVANT_P (stmt_info)
+         || stmt_info->vectype == NULL_TREE)
+       continue;
+      /* Do not unroll loops with negative steps as it is unlikely that
+        vectorization will succeed due to the way we deal with negative steps
+        in loads and stores in 'get_load_store_type'.  */
+      if (stmt_info->dr_aux.dr
+         && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+       {
+         dr_vec_info *dr_info = STMT_VINFO_DR_INFO (stmt_info);
+         tree step = vect_dr_behavior (loop_vinfo, dr_info)->step;
+         if (TREE_CODE (step) == INTEGER_CST
+             && tree_int_cst_compare (step, size_zero_node) < 0)
+           {
+             return opt_loop_vec_info::failure_at
+               (vect_location, "could not unroll due to negative step\n");
+           }
+       }
+
+      if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def)
+       {
+         auto red_info = info_for_reduction (loop_vinfo, stmt_info);
+         if (STMT_VINFO_REDUC_TYPE (red_info) == TREE_CODE_REDUCTION)
+           seen_reduction_p = true;
+         else
+           {
+             return opt_loop_vec_info::failure_at
+               (vect_location, "could not unroll loop with reduction due to "
+                               "non TREE_CODE_REDUCTION\n");
+           }
+       }
+    }
+
+  if (known_le (vectorization_factor, 1U))
+    return opt_loop_vec_info::failure_at (vect_location,
+                                         "will not unroll loop with a VF of 1 "
+                                         "or less\n");
+
+  opt_loop_vec_info unrolled_vinfo
+    = opt_loop_vec_info::success (vect_analyze_loop_form (loop_vinfo->loop,
+                                                         loop_vinfo->shared));
+  unrolled_vinfo->vector_mode = loop_vinfo->vector_mode;
+  /* Use the suggested_unroll_factor that was set during the target's
+     TARGET_VECTORIZE_FINISH_COST hook.  */
+  unrolled_vinfo->suggested_unroll_factor = loop_vinfo->suggested_unroll_factor;
+  return unrolled_vinfo;
+}
+
+
+/* Try to unroll the current loop.  First determine the unrolling factor using
+   the analysis done for the current vector mode.  Then re-analyze the loop for
+   the given unrolling factor and the current vector mode.  */
+
+static opt_loop_vec_info
+vect_try_unrolling (opt_loop_vec_info loop_vinfo, unsigned *n_stmts)
+{
+  DUMP_VECT_SCOPE ("vect_try_unrolling");
+
+  opt_loop_vec_info unrolled_vinfo = vect_determine_unroll_factor (loop_vinfo);
+  /* Reset unrolling factor, in case we decide to not unroll.  */
+  loop_vinfo->suggested_unroll_factor = 1;
+  if (unrolled_vinfo)
+    {
+      if (unrolled_vinfo->suggested_unroll_factor > 1)
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_NOTE, vect_location,
+                            "***** unrolling factor %d chosen for vector mode %s, "
+                            "re-trying analysis...\n",
+                            unrolled_vinfo->suggested_unroll_factor,
+                            GET_MODE_NAME (unrolled_vinfo->vector_mode));
+         bool unrolling_fatal = false;
+         if (vect_analyze_loop_2 (unrolled_vinfo, unrolling_fatal, n_stmts)
+             && known_ne (loop_vinfo->vectorization_factor,
+                          unrolled_vinfo->vectorization_factor))
+           {
+
+             loop_vinfo = unrolled_vinfo;
+             if (dump_enabled_p ())
+               dump_printf_loc (MSG_NOTE, vect_location,
+                                "unrolling succeeded with factor = %d\n",
+                                loop_vinfo->suggested_unroll_factor);
+
+           }
+         else
+           {
+             if (dump_enabled_p ())
+               dump_printf_loc (MSG_NOTE, vect_location,
+                                "unrolling failed with factor = %d\n",
+                                unrolled_vinfo->suggested_unroll_factor);
+           }
+       }
+      else
+       if (dump_enabled_p ())
+         dump_printf_loc (MSG_NOTE, vect_location,
+                          "target determined unrolling is not profitable.\n");
+    }
+  loop_vinfo->loop->aux = NULL;
+  return loop_vinfo;
+}
+
 /* If LOOP_VINFO is already a main loop, return it unmodified.  Otherwise
    try to reanalyze it as a main loop.  Return the loop_vinfo on success
    and null on failure.  */
@@ -2904,6 +3044,8 @@ vect_reanalyze_as_main_loop (loop_vec_info loop_vinfo, unsigned int *n_stmts)
   bool fatal = false;
   bool res = vect_analyze_loop_2 (main_loop_vinfo, fatal, n_stmts);
   loop->aux = NULL;
+  main_loop_vinfo = vect_try_unrolling (main_loop_vinfo, n_stmts);
+
   if (!res)
     {
       if (dump_enabled_p ())
@@ -3038,6 +3180,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       if (res)
        {
+         /* Only try unrolling main loops.  */
+         if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+           loop_vinfo = vect_try_unrolling (loop_vinfo, &n_stmts);
+
          LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
          vectorized_loops++;
 
@@ -3056,13 +3202,26 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
              /* Keep trying to roll back vectorization attempts while the
                 loop_vec_infos they produced were worse than this one.  */
              vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos;
+             poly_uint64 vinfo_vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+             poly_uint64 first_vinfo_vf
+               = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo);
              while (!vinfos.is_empty ()
+                    && (known_lt (vinfo_vf, first_vinfo_vf)
+                        || (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+                            && maybe_eq (vinfo_vf, first_vinfo_vf)))
                     && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ()))
                {
                  gcc_assert (vect_epilogues);
                  delete vinfos.pop ();
                }
+             /* Check if we may want to replace the current first_loop_vinfo
+                with the new loop, but only if they have different vector
+                modes.  If they have the same vector mode this means the main
+                loop is an unrolled loop and we are trying to vectorize the
+                epilogue using the same vector mode but with a lower
+                vectorization factor.  */
              if (vinfos.is_empty ()
+                 && loop_vinfo->vector_mode != first_loop_vinfo->vector_mode
                  && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo))
                {
                  loop_vec_info main_loop_vinfo
@@ -3105,14 +3264,34 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
                   /* For now only allow one epilogue loop.  */
                   && first_loop_vinfo->epilogue_vinfos.is_empty ())
            {
-             first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
-             poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
-             gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
-                         || maybe_ne (lowest_th, 0U));
-             /* Keep track of the known smallest versioning
-                threshold.  */
-             if (ordered_p (lowest_th, th))
-               lowest_th = ordered_min (lowest_th, th);
+             /* Ensure the epilogue has a smaller VF than the main loop or
+                uses predication and has the same VF.  */
+             if (known_lt (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+                           LOOP_VINFO_VECT_FACTOR (first_loop_vinfo))
+                 || (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+                     && maybe_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+                                  LOOP_VINFO_VECT_FACTOR (first_loop_vinfo))))
+               {
+                 first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
+                 poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+                 gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+                             || maybe_ne (lowest_th, 0U));
+                 /* Keep track of the known smallest versioning
+                    threshold.  */
+                 if (ordered_p (lowest_th, th))
+                   lowest_th = ordered_min (lowest_th, th);
+               }
+             else
+               {
+                 if (dump_enabled_p ())
+                   dump_printf_loc (MSG_NOTE, vect_location,
+                                    "***** Will not use %s mode as an"
+                                    " epilogue, since it leads to a higher"
+                                    " vectorization factor than main loop\n",
+                                    GET_MODE_NAME (loop_vinfo->vector_mode));
+                 delete loop_vinfo;
+                 loop_vinfo = opt_loop_vec_info::success (NULL);
+               }
            }
          else
            {
@@ -3153,13 +3332,32 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       /* Handle the case that the original loop can use partial
         vectorization, but want to only adopt it for the epilogue.
-        The retry should be in the same mode as original.  */
+        The retry should be in the same mode as original.
+        Also handle the case where we have unrolled the main loop and want to
+        retry all vector modes again for the epilogues, since the VF is now
+        at least twice as high as the current vector mode.  */
       if (vect_epilogues
          && loop_vinfo
-         && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+         && (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo)
+             || loop_vinfo->suggested_unroll_factor > 1))
        {
-         gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+         gcc_assert ((LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+                      || loop_vinfo->suggested_unroll_factor > 1)
                      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+         /* If we are unrolling, try all VECTOR_MODES for the epilogue.  */
+         if (loop_vinfo->suggested_unroll_factor > 1)
+           {
+             next_vector_mode = vector_modes[0];
+             mode_i = 1;
+
+             if (dump_enabled_p ())
+               dump_printf_loc (MSG_NOTE, vect_location,
+                                "***** Re-trying analysis with vector mode"
+                                " %s for epilogues after unrolling.\n",
+                                GET_MODE_NAME (next_vector_mode));
+             continue;
+           }
+
          if (dump_enabled_p ())
            dump_printf_loc (MSG_NOTE, vect_location,
                             "***** Re-trying analysis with same vector mode"
@@ -4222,8 +4420,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* Complete the target-specific cost calculations.  */
-  finish_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), &vec_prologue_cost,
-              &vec_inside_cost, &vec_epilogue_cost);
+  finish_cost (loop_vinfo, LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
+              &vec_prologue_cost, &vec_inside_cost, &vec_epilogue_cost);
 
   vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost);
 
@@ -7212,7 +7410,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
    participating.  */
   if (ncopies > 1
       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
-      && reduc_chain_length == 1)
+      && reduc_chain_length == 1
+      && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;
 
   if (single_defuse_cycle || lane_reduc_code_p)
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 024a1c38a2342246d7891db1de5f1d6e6458d5dd..dce8b953d306b90185ffe75c637f1fdb998aa953 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5405,7 +5405,8 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
       while (si < li_scalar_costs.length ()
             && li_scalar_costs[si].first == sl);
       unsigned dummy;
-      finish_cost (scalar_target_cost_data, &dummy, &scalar_cost, &dummy);
+      finish_cost (bb_vinfo, scalar_target_cost_data, &dummy, &scalar_cost,
+                  &dummy);
       destroy_cost_data (scalar_target_cost_data);
 
       /* Complete the target-specific vector cost calculation.  */
@@ -5418,7 +5419,7 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
        }
       while (vi < li_vector_costs.length ()
             && li_vector_costs[vi].first == vl);
-      finish_cost (vect_target_cost_data, &vec_prologue_cost,
+      finish_cost (bb_vinfo, vect_target_cost_data, &vec_prologue_cost,
                   &vec_inside_cost, &vec_epilogue_cost);
       destroy_cost_data (vect_target_cost_data);
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index c4c5678e7f1abafc25c465319dbacf3ef50f0ae9..e91fb6691857cbcc0b1c087d6de35164a7c75e48 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -621,6 +621,13 @@ public:
      about the reductions that generated them.  */
   hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
 
+  /* The number of times that the target suggested we unroll the vector loop
+     in order to promote more ILP.  This value will be used to re-analyze the
+     loop for vectorization and if successful the value will be folded into
+     vectorization_factor (and therefore exactly divides
+     vectorization_factor).  */
+  unsigned int suggested_unroll_factor;
+
   /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR
      if there is no particular limit.  */
   unsigned HOST_WIDE_INT max_vectorization_factor;
@@ -1570,10 +1577,10 @@ add_stmt_cost (vec_info *vinfo, void *data, stmt_info_for_cost *i)
 /* Alias targetm.vectorize.finish_cost.  */
 
 static inline void
-finish_cost (void *data, unsigned *prologue_cost,
+finish_cost (class vec_info *vinfo, void *data, unsigned *prologue_cost,
             unsigned *body_cost, unsigned *epilogue_cost)
 {
-  targetm.vectorize.finish_cost (data, prologue_cost, body_cost, epilogue_cost);
+  targetm.vectorize.finish_cost (vinfo, data, prologue_cost, body_cost, epilogue_cost);
 }
 
 /* Alias targetm.vectorize.destroy_cost_data.  */
