On 22/10/2019 18:52, Richard Sandiford wrote:
Thanks for doing this. Hope this message doesn't cover too much old
ground or duplicate too much...
"Andre Vieira (lists)" <andre.simoesdiasvie...@arm.com> writes:
@@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters,
tree nitersm1,
else
niters_prolog = build_int_cst (type, 0);
+ loop_vec_info epilogue_vinfo = NULL;
+ if (vect_epilogues)
+ {
+ /* Take the next epilogue_vinfo to vectorize for. */
+ epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
+ loop_vinfo->epilogue_vinfos.ordered_remove (0);
+
+ /* Don't vectorize epilogues if this is not the innermost loop or if
+ the epilogue may need peeling for alignment, as the vectorizer doesn't
+ know how to handle these situations properly yet. */
+ if (loop->inner != NULL
+ || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo))
+ vect_epilogues = false;
+
+ }
Nit: excess blank line before "}". Sorry if this was discussed before,
but what's the reason for delaying the check for "loop->inner" to
this point, rather than doing it in vect_analyze_loop?
Done.
+
+ tree niters_vector_mult_vf;
+ unsigned int lowest_vf = constant_lower_bound (vf);
+ /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
+ on niters already ajusted for the iterations of the prologue. */
Pre-existing typo: adjusted. But...
+ if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+ && known_eq (vf, lowest_vf))
+ {
+ loop_vec_info orig_loop_vinfo;
+ if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+ orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+ else
+ orig_loop_vinfo = loop_vinfo;
+ vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);
+ unsigned next_size = 0;
+ unsigned HOST_WIDE_INT eiters
+ = (LOOP_VINFO_INT_NITERS (loop_vinfo)
+ - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+
+ if (prolog_peeling > 0)
+ eiters -= prolog_peeling;
...is that comment still true? We're now subtracting the peeling
amount here.
It is not, "adjusted" the comment ;)
Might be worth asserting prolog_peeling >= 0, just to emphasise
that we can't get here for variable peeling amounts, and then subtract
prolog_peeling unconditionally (assuming that's the right thing to do).
Can't assert, as LOOP_VINFO_NITERS_KNOWN_P can be true even with
prolog_peeling < 0: we still know the constant number of scalar
iterations, we just don't know how many vector iterations will be
performed due to the runtime peeling. I will, however, not reject
vectorizing the epilogue when we don't know how much we are peeling.
+ eiters
+ = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
+
+ unsigned int ratio;
+ while (next_size < vector_sizes.length ()
+ && !(constant_multiple_p (current_vector_size,
+ vector_sizes[next_size], &ratio)
+ && eiters >= lowest_vf / ratio))
+ next_size += 1;
+
+ if (next_size == vector_sizes.length ())
+ vect_epilogues = false;
+ }
+
/* Prolog loop may be skipped. */
bool skip_prolog = (prolog_peeling != 0);
/* Skip to epilog if scalar loop may be preferred. It's only needed
- when we peel for epilog loop and when it hasn't been checked with
- loop versioning. */
+ when we peel for epilog loop or when we loop version. */
bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
bound_prolog + bound_epilog)
- : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+ : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+ || vect_epilogues));
The comment update looks wrong here: without epilogues, we don't need
the skip when loop versioning, because loop versioning ensures that we
have at least one vector iteration.
(I think "it" was supposed to mean "skipping to the epilogue" rather
than the epilogue loop itself, in case that's the confusion.)
It'd be good to mention the epilogue condition in the comment too.
Rewrote comment, hopefully this now better reflects reality.
+
+ if (vect_epilogues)
+ {
+ epilog->aux = epilogue_vinfo;
+ LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;
+
+ loop_constraint_clear (epilog, LOOP_C_INFINITE);
+
+ /* We now must calculate the number of iterations for our epilogue. */
+ tree cond_niters, niters;
+
+ /* Depending on whether we peel for gaps we take niters or niters - 1,
+ we will refer to this as N - G, where N and G are the NITERS and
+ GAP for the original loop. */
+ niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+ ? LOOP_VINFO_NITERSM1 (loop_vinfo)
+ : LOOP_VINFO_NITERS (loop_vinfo);
+
+ /* Here we build a vector factorization mask:
+ vf_mask = ~(VF - 1), where VF is the Vectorization Factor. */
+ tree vf_mask = build_int_cst (TREE_TYPE (niters),
+ LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+ vf_mask = fold_build2 (MINUS_EXPR, TREE_TYPE (vf_mask),
+ vf_mask,
+ build_one_cst (TREE_TYPE (vf_mask)));
+ vf_mask = fold_build1 (BIT_NOT_EXPR, TREE_TYPE (niters), vf_mask);
+
+ /* Here we calculate:
+ niters = N - ((N-G) & ~(VF -1)) */
+ niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters),
+ LOOP_VINFO_NITERS (loop_vinfo),
+ fold_build2 (BIT_AND_EXPR, TREE_TYPE (niters),
+ niters,
+ vf_mask));
Might be a daft question, sorry, but why does this need to be so
complicated? Couldn't we just use the final value of the main loop's
IV to calculate how many iterations are left?
The current code wouldn't, for example, work for non-power-of-2 SVE vectors.
vect_set_loop_condition_unmasked is structured to cope with that case
(in length-agnostic mode only), even when an epilogue is needed.
Good call; as we discussed, I changed my approach here. Rather than using
a conditional expression to guard against skipping the main loop, I now
use a phi node to carry the IV. Such an IV actually already exists, so I
am duplicating it here, but I didn't know the best way to "grab" the
existing one.
+ skipped when dealing with epilogues as we assume we already checked them
+ for the main loop. So instead we always check the 'orig_loop_vinfo'. */
+ if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))
{
poly_uint64 niters_th = 0;
unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
@@ -2307,14 +2342,8 @@ again:
be vectorized. */
opt_loop_vec_info
vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
- vec_info_shared *shared)
+ vec_info_shared *shared, vector_sizes vector_sizes)
{
- auto_vector_sizes vector_sizes;
-
- /* Autodetect first vector size we try. */
- current_vector_size = 0;
- targetm.vectorize.autovectorize_vector_sizes (&vector_sizes,
- loop->simdlen != 0);
unsigned int next_size = 0;
DUMP_VECT_SCOPE ("analyze_loop_nest");
@@ -2335,6 +2364,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info
orig_loop_vinfo,
poly_uint64 autodetected_vector_size = 0;
opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
poly_uint64 first_vector_size = 0;
+ poly_uint64 lowest_th = 0;
+ unsigned vectorized_loops = 0;
+ bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
while (1)
{
/* Check the CFG characteristics of the loop (nesting, entry/exit). */
@@ -2353,24 +2385,52 @@ vect_analyze_loop (class loop *loop, loop_vec_info
orig_loop_vinfo,
if (orig_loop_vinfo)
LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+ else if (vect_epilogues && first_loop_vinfo)
+ LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
if (res)
{
LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
+ vectorized_loops++;
- if (loop->simdlen
- && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
- (unsigned HOST_WIDE_INT) loop->simdlen))
+ if ((loop->simdlen
+ && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+ (unsigned HOST_WIDE_INT) loop->simdlen))
+ || vect_epilogues)
{
if (first_loop_vinfo == NULL)
{
first_loop_vinfo = loop_vinfo;
+ lowest_th
+ = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
first_vector_size = current_vector_size;
loop->aux = NULL;
}
else
- delete loop_vinfo;
+ {
+ /* Keep track of vector sizes that we know we can vectorize
+ the epilogue with. */
+ if (vect_epilogues)
+ {
+ loop->aux = NULL;
+ first_loop_vinfo->epilogue_vsizes.reserve (1);
+ first_loop_vinfo->epilogue_vsizes.quick_push (current_vector_size);
+ first_loop_vinfo->epilogue_vinfos.reserve (1);
+ first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);
I've messed you around, sorry, but the patches I committed this weekend
mean we now store the vector size in the loop_vinfo. It'd be good to
avoid a separate epilogue_vsizes array if possible.
Rebased. Actually quite happy with that, makes for a cleaner patch on my
end :)
+
+ stmt_vinfo
+ = epilogue_vinfo->lookup_stmt (orig_stmt);
Nit: fits one line.
+
+ STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+ gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+ mapping.put (gimple_phi_result (orig_stmt),
+ gimple_phi_result (new_stmt));
Nit: indented too far.
+
+ if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+ pattern_worklist.safe_push (stmt_vinfo);
+
+ related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+ while (related_vinfo && related_vinfo != stmt_vinfo)
+ {
+ related_worklist.safe_push (related_vinfo);
+ /* Set BB such that the assert in
+ 'get_initial_def_for_reduction' is able to determine that
+ the BB of the related stmt is inside this loop. */
+ gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+ gimple_bb (new_stmt));
+ related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+ }
+ }
+
+ for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]);
+ !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))
+ {
+ orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0];
+ LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0);
+ new_stmt = gsi_stmt (epilogue_gsi);
+
+ stmt_vinfo
+ = epilogue_vinfo->lookup_stmt (orig_stmt);
Fits on one line.
+
+ STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+ gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+ if (is_gimple_assign (orig_stmt))
+ {
+ gcc_assert (is_gimple_assign (new_stmt));
+ mapping.put (gimple_assign_lhs (orig_stmt),
+ gimple_assign_lhs (new_stmt));
+ }
Why just assigns? Don't we need to handle calls too?
Maybe just use gimple_get_lhs here.
Changed.
@@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf>
*&simduid_to_vf_htab,
LOCATION_FILE (vect_location.get_location_t ()),
LOCATION_LINE (vect_location.get_location_t ()));
+ /* If this is an epilogue, we already know what vector sizes we will use for
+ vectorization, as the analysis was part of the main vectorized loop. Use
+ these instead of going through all vector sizes again. */
+ if (orig_loop_vinfo
+ && !LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo).is_empty ())
+ {
+ vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);
+ assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo);
+ current_vector_size = vector_sizes[0];
+ }
+ else
+ {
+ /* Autodetect first vector size we try. */
+ current_vector_size = 0;
+
+ targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes,
+ loop->simdlen != 0);
+ vector_sizes = auto_vector_sizes;
+ }
+
/* Try to analyze the loop, retaining an opt_problem if dump_enabled_p. */
- opt_loop_vec_info loop_vinfo
- = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
- loop->aux = loop_vinfo;
+ opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);
+ if (loop_vec_info_for_loop (loop))
+ loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop));
+ else
+ {
+ loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes);
+ loop->aux = loop_vinfo;
+ }
I don't really understand what this is doing for the epilogue case.
Do we call vect_analyze_loop again? Are vector_sizes[1:] significant
for epilogues?
The vector sizes code here is no longer needed after your patch. The
loop_vec_info code just checks whether the loop already has one set
(which is the case for epilogues) and uses that; otherwise it analyses
the loop (the case for the first vectorization). I'll add some comments.
Thanks,
Richard