This splits SLP load nodes with a load permutation into an SLP
load node (with a load "permutation" only removing gaps) and a
lane permutation node.  The first and foremost goal of this
is to be able to have a single SLP node for each load group
so we can start making decisions about how to vectorize it,
factoring in all loads of that group.  The second goal is to
eventually be able to optimize permutations by pushing them
down the graph, where they can be combined from multiple
children to the output.  We already have one very special case
of this handled by vect_attempt_slp_rearrange_stmts, which does
it all the way down for reductions that are associative.
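
Schematically (a made-up rendering, not actual dump output), the
representation changes like this:

  old:  one load node per use, its SLP_TREE_LOAD_PERMUTATION
        encoding both the group gaps and the per-use lane order
  new:  one load node shared per group (its load "permutation"
        only compacting gaps), consumed by per-use VEC_PERM_EXPR
        nodes carrying a SLP_TREE_LANE_PERMUTATION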

For example for

  l1 = a[0]; l2 = a[1];
  b[0] = l1; b[1] = l2;
  c[0] = l2; c[1] = l1;

we can avoid generating loads twice.  For

  l1 = a[0]; l2 = a[1]; l3 = a[2];
  b[0] = l1; b[1] = l2;
             c[0] = l2; c[1] = l3;

we will have an SLP load node with three lanes plus
two lane permutation nodes selecting two lanes each.  In
a loop context this will cause a VF of two with three
loads (plus two permutes) per vectorized loop iteration,
while previously we used a VF of one with two loads
and no permutation per loop iteration.  In the new
scheme the number of loads is lower, but we pay with
permutes and a higher VF.
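
To make the VF arithmetic explicit, here is a back-of-the-envelope
sketch assuming V2SI vectors (nunits = 2), mirroring the per-node
calculate_unrolling_factor logic in the patch below:

  load node, 3 lanes:    exact_div (common_multiple (2, 3), 3) = 2
  store nodes, 2 lanes:  exact_div (common_multiple (2, 2), 2) = 1
  overall:               force_common_multiple (2, 1) = 2

so the shared load node alone is what forces the VF of two.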

There is a (bad) interaction with computing the vectorization
factor, which for BB vectorization causes missed vectorizations
because the "load parts of a dataref group directly" optimization
is not (yet) reflected in the SLP graph.

There is a (bad) interaction with CTOR vectorization since we now
get confused about where to insert vectorized stmts when combining
two previously distinct SLP graphs.

My immediate focus is on the SSA verification FAILs, but the
other part points at a missing piece in this - a "pass"
that "optimizes" the SLP graph with respect to permutations
and loads, ultimately for example deciding between using
interleaving schemes, scalar loads, "SLP" + permutation,
load-lanes, etc.  This is also the thing that blocks a move
to SLP only (apart from teaching the few pieces that cannot
do SLP to do SLP).

I'm posting this mostly to make people think about how it fits
their future development and architecture features.

And of course to see whether there are already formed ideas on how
to structure such an "optimization".

Thanks,
Richard.

PS: the patch doesn't bootstrap; it ICEs during the libgfortran build.
C vect.exp FAILs are:

FAIL: gcc.dg/vect/pr59354.c (internal compiler error)
FAIL: gcc.dg/vect/pr88903-2.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-10.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-11.c (internal compiler error)
FAIL: gcc.dg/vect/vect-alias-check-12.c (internal compiler error)
FAIL: gcc.dg/vect/slp-23.c scan-tree-dump-times vect "vectorizing stmts using SLP" 2
FAIL: gcc.dg/vect/slp-43.c (internal compiler error)
FAIL: gcc.dg/vect/bb-slp-19.c scan-tree-dump-times slp2 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-29.c scan-tree-dump-times slp1 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
FAIL: gcc.dg/vect/bb-slp-subgroups-2.c (internal compiler error)
FAIL: gcc.dg/vect/bb-slp-subgroups-3.c (internal compiler error)

        * tree-vectorizer.h (vect_get_place_in_interleaving_chain):
        Add flag parameter defaulted to true.
        * tree-vect-slp.c (vect_get_place_in_interleaving_chain):
        Implement alternate mode not counting gaps.
        (vect_build_slp_tree_2): Split load nodes into load and
        lane permutation.
        (vect_attempt_slp_rearrange_stmts): Pass in the unroll
        factor to consider.
        (calculate_unrolling_factor): New function.
        (vect_analyze_slp_instance): Adjust.
        (vect_analyze_slp): Nail down the unroll factor before
        optimizing permutations in SLP reductions.
        (vect_make_slp_decision): Remove determining unroll
        factor here.
        (vect_schedule_slp_instance): Adjust where we emit vectorized
        stmts.
---
 gcc/tree-vect-slp.c   | 246 +++++++++++++++++++++++++++++++++---------
 gcc/tree-vectorizer.h |   3 +-
 2 files changed, 199 insertions(+), 50 deletions(-)

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 5c169f37022..68b0750db44 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -244,7 +244,8 @@ vect_contains_pattern_stmt_p (vec<stmt_vec_info> stmts)
 
 int
 vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info,
-                                     stmt_vec_info first_stmt_info)
+                                     stmt_vec_info first_stmt_info,
+                                     bool add_gaps)
 {
   stmt_vec_info next_stmt_info = first_stmt_info;
   int result = 0;
@@ -258,7 +259,7 @@ vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info,
        return result;
       next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
       if (next_stmt_info)
-       result += DR_GROUP_GAP (next_stmt_info);
+       result += add_gaps ? DR_GROUP_GAP (next_stmt_info) : 1;
     }
   while (next_stmt_info);
 
@@ -1243,27 +1244,104 @@ vect_build_slp_tree_2 (vec_info *vinfo,
        }
       else
        {
-         *max_nunits = this_max_nunits;
-         (*tree_size)++;
+         /* We represent a grouped load as a vector build of all
+            lanes of the load group in memory order (a load
+            with possibly a load permutation compacting a group
+            with gaps) plus a lane permutation node that selects
+            and permutes the load group lanes.  That allows
+            sharing of the vector build between different
+            SLP sub-graphs like for
+               l1 = a[0]; l2 = a[1];
+               b[0] = l1; b[1] = l2;
+               c[0] = l2; c[1] = l1;
+            where we can avoid generating loads twice.  For
+               l1 = a[0]; l2 = a[1]; l3 = a[2];
+               b[0] = l1; b[1] = l2;
+               c[0] = l2; c[1] = l3;
+            we will have a SLP load node with three lanes plus
+            two lane permutation nodes selecting two lanes each.
+
+            ???  This has ripple-down effects for vectorization
+            factor computation and bad interaction with vectorized
+            stmt placing.  */
          node = vect_create_new_slp_node (stmts, 0);
          SLP_TREE_VECTYPE (node) = vectype;
          /* And compute the load permutation.  Whether it is actually
             a permutation depends on the unrolling factor which is
             decided later.  */
          vec<unsigned> load_permutation;
+         vec<std::pair<unsigned, unsigned> > lane_permutation;
          int j;
          stmt_vec_info load_info;
+         lane_permutation.create (group_size);
          load_permutation.create (group_size);
          stmt_vec_info first_stmt_info
            = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
-         FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
+         for (load_info = first_stmt_info; load_info;
+              load_info = DR_GROUP_NEXT_ELEMENT (load_info))
            {
              int load_place = vect_get_place_in_interleaving_chain
-                 (load_info, first_stmt_info);
+                 (load_info, first_stmt_info, true);
              gcc_assert (load_place != -1);
              load_permutation.safe_push (load_place);
            }
-         SLP_TREE_LOAD_PERMUTATION (node) = load_permutation;
+         bool any_lane_permute = false;
+         FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
+           {
+             int lane_place = vect_get_place_in_interleaving_chain
+                 (load_info, first_stmt_info, false);
+             gcc_assert (lane_place != -1);
+             lane_permutation.quick_push (std::make_pair (0, lane_place));
+             if (lane_place != j)
+               any_lane_permute = true;
+           }
+         /* When there's a lane permutation or selection make a separate
+            SLP node for the full group load in lane-order (but keep handling
+            gaps with load permutations for now).  */
+         if (!any_lane_permute && load_permutation.length () == group_size)
+           {
+             lane_permutation.release ();
+             SLP_TREE_LOAD_PERMUTATION (node) = load_permutation;
+           }
+         else
+           {
+             /* Build a SLP node loading all members from the group
+                in order with gaps not represented.
+                ???  We easily end up with an odd number of lanes here
+                requiring a larger VF.  In some cases this is desirable
+                but as gaps are not explicitely represented here those
+                present an issue (see dg.exp/vect/slp-23.c for example).
+                That also can end up requiring unsupported load
+                permutations.  */
+             load_permutation.release ();
+             SLP_TREE_LANE_PERMUTATION (node) = lane_permutation;
+             SLP_TREE_CODE (node) = VEC_PERM_EXPR;
+             vec<stmt_vec_info> loads;
+             loads.create (group_size);
+             for (load_info = first_stmt_info; load_info;
+                  load_info = DR_GROUP_NEXT_ELEMENT (load_info))
+               loads.safe_push (load_info);
+             bool *lmatches = XALLOCAVEC (bool, loads.length ());
+             slp_tree load_node = vect_build_slp_tree (vinfo, loads,
+                                                       loads.length (),
+                                                       &this_max_nunits,
+                                                       lmatches, npermutes,
+                                                       &this_tree_size,
+                                                       bst_map);
+             /* ???  For BB vect the above doesn't always work.  */
+             if (!load_node)
+               {
+                 matches[0] = false;
+                 loads.release ();
+                 /* The caller still owns defs if we fail.  */
+                 SLP_TREE_SCALAR_STMTS (node) = vNULL;
+                 vect_free_slp_tree (node, false);
+                 return NULL;
+               }
+             SLP_TREE_CHILDREN (node).safe_push (load_node);
+           }
+         *tree_size += this_tree_size + 1;
+         *max_nunits = this_max_nunits;
          return node;
        }
     }
@@ -1781,7 +1859,8 @@ vect_slp_rearrange_stmts (slp_tree node, unsigned int group_size,
    otherwise return false.  */
 
 static bool
-vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
+vect_attempt_slp_rearrange_stmts (slp_instance slp_instn,
+                                 poly_uint64 unrolling_factor)
 {
   unsigned int i, j;
   unsigned int lidx;
@@ -1807,9 +1886,13 @@ vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
     }
 
   /* Check that the loads in the first sequence are different and there
-     are no gaps between them.  */
+     are no gaps between them.  Also check whether there is any permutation.  */
+  /* ???  We now would need to check lane permutations or more generally
+     apply optimization of permutations by pushing them up/down the
+     graph.  */
   auto_sbitmap load_index (group_size);
   bitmap_clear (load_index);
+  bool any = false;
   FOR_EACH_VEC_ELT (node->load_permutation, i, lidx)
     {
       if (lidx >= group_size)
@@ -1817,8 +1900,12 @@ vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
       if (bitmap_bit_p (load_index, lidx))
        return false;
 
+      if (lidx != i)
+       any = true;
       bitmap_set_bit (load_index, lidx);
     }
+  if (!any)
+    return false;
   for (i = 0; i < group_size; i++)
     if (!bitmap_bit_p (load_index, i))
       return false;
@@ -1844,7 +1931,6 @@ vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
                            node->load_permutation, visited);
 
   /* We are done, no actual permutations need to be generated.  */
-  poly_uint64 unrolling_factor = SLP_INSTANCE_UNROLLING_FACTOR (slp_instn);
   FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (slp_instn), i, node)
     {
       stmt_vec_info first_stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
@@ -1971,6 +2057,41 @@ calculate_unrolling_factor (poly_uint64 nunits, unsigned int group_size)
   return exact_div (common_multiple (nunits, group_size), group_size);
 }
 
+/* Since the group size can now change in the SLP graph the minimum
+   unroll factor needs to be decided on a node-by-node basis and a
+   common multiple found for the overall unrolling.  */
+
+static poly_uint64
+calculate_unrolling_factor (slp_tree node,
+                           hash_map<slp_tree, poly_uint64> &visited)
+{
+  bool existed_p;
+  poly_uint64 &uf = visited.get_or_insert (node, &existed_p);
+  if (existed_p)
+    return uf;
+
+  poly_uint64 this_uf
+    = calculate_unrolling_factor (node->max_nunits, SLP_TREE_LANES (node));
+  uf = this_uf;
+
+  slp_tree child;
+  unsigned i;
+  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
+    {
+      poly_uint64 child_uf = calculate_unrolling_factor (child, visited);
+      this_uf = force_common_multiple (child_uf, this_uf);
+    }
+  *visited.get (node) = this_uf;
+  return this_uf;
+}
+
+static poly_uint64
+calculate_unrolling_factor (slp_tree node)
+{
+  hash_map<slp_tree, poly_uint64> visited;
+  return calculate_unrolling_factor (node, visited);
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -2097,9 +2218,11 @@ vect_analyze_slp_instance (vec_info *vinfo,
                              &tree_size, bst_map);
   if (node != NULL)
     {
+      /* ???  Instead of walking the SLP tree here record the minimum
+        unroll factor instead of max_nunits in the SLP nodes?  */
       /* Calculate the unrolling factor based on the smallest type.  */
       poly_uint64 unrolling_factor
-       = calculate_unrolling_factor (max_nunits, group_size);
+       = calculate_unrolling_factor (node);
 
       if (maybe_ne (unrolling_factor, 1U)
          && is_a <bb_vec_info> (vinfo))
@@ -2339,8 +2462,28 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
       vect_free_slp_tree ((*it).second, false);
   delete bst_map;
 
-  /* Optimize permutations in SLP reductions.  */
   slp_instance instance;
+  /* Get the overall SLP induced unrolling factor.  */
+  poly_uint64 unrolling_factor = 1;
+  FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
+    {
+      /* All unroll factors have the form:
+
+          GET_MODE_SIZE (vinfo->vector_mode) * X
+
+        for some rational X, so they must have a common multiple.  */
+      unrolling_factor
+       = force_common_multiple (unrolling_factor,
+                                SLP_INSTANCE_UNROLLING_FACTOR (instance));
+
+    }
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+    LOOP_VINFO_SLP_UNROLLING_FACTOR (loop_vinfo) = unrolling_factor;
+  else
+    /* For BB vectorization we removed any entries that need unrolling.  */
+    gcc_assert (known_eq (unrolling_factor, 1U));
+
+  /* Optimize permutations in SLP reductions.  */
   FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
     {
       slp_tree node = SLP_INSTANCE_TREE (instance);
@@ -2349,7 +2492,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
         In reduction chain the order of the loads is not important.  */
       if (!STMT_VINFO_DATA_REF (stmt_info)
          && !REDUC_GROUP_FIRST_ELEMENT (stmt_info))
-       vect_attempt_slp_rearrange_stmts (instance);
+       vect_attempt_slp_rearrange_stmts (instance, unrolling_factor);
     }
 
   /* Gather all loads in the SLP graph.  */
@@ -2434,7 +2577,6 @@ bool
 vect_make_slp_decision (loop_vec_info loop_vinfo)
 {
   unsigned int i;
-  poly_uint64 unrolling_factor = 1;
   vec<slp_instance> slp_instances = LOOP_VINFO_SLP_INSTANCES (loop_vinfo);
   slp_instance instance;
   int decided_to_slp = 0;
@@ -2444,15 +2586,6 @@ vect_make_slp_decision (loop_vec_info loop_vinfo)
   FOR_EACH_VEC_ELT (slp_instances, i, instance)
     {
       /* FORNOW: SLP if you can.  */
-      /* All unroll factors have the form:
-
-          GET_MODE_SIZE (vinfo->vector_mode) * X
-
-        for some rational X, so they must have a common multiple.  */
-      unrolling_factor
-       = force_common_multiple (unrolling_factor,
-                                SLP_INSTANCE_UNROLLING_FACTOR (instance));
-
       /* Mark all the stmts that belong to INSTANCE as PURE_SLP stmts.  Later we
         call vect_detect_hybrid_slp () to find stmts that need hybrid SLP and
         loop-based vectorization.  Such stmts will be marked as HYBRID.  */
@@ -2460,14 +2593,12 @@ vect_make_slp_decision (loop_vec_info loop_vinfo)
       decided_to_slp++;
     }
 
-  LOOP_VINFO_SLP_UNROLLING_FACTOR (loop_vinfo) = unrolling_factor;
-
   if (decided_to_slp && dump_enabled_p ())
     {
       dump_printf_loc (MSG_NOTE, vect_location,
                       "Decided to SLP %d instances. Unrolling factor ",
                       decided_to_slp);
-      dump_dec (MSG_NOTE, unrolling_factor);
+      dump_dec (MSG_NOTE, LOOP_VINFO_SLP_UNROLLING_FACTOR (loop_vinfo));
       dump_printf (MSG_NOTE, "\n");
     }
 
@@ -4205,33 +4336,50 @@ vect_schedule_slp_instance (vec_info *vinfo,
                     "------>vectorizing SLP node starting from: %G",
                     stmt_info->stmt);
 
-  /* Vectorized stmts go before the last scalar stmt which is where
-     all uses are ready.  */
-  stmt_vec_info last_stmt_info = vect_find_last_scalar_stmt_in_slp (node);
-  if (last_stmt_info)
-    si = gsi_for_stmt (last_stmt_info->stmt);
+  /* Vectorized stmts for leafs and nodes with datarefs go before
+     the last scalar stmt.  */
+  if (SLP_TREE_CHILDREN (node).is_empty ()
+      || (SLP_TREE_CODE (node) != VEC_PERM_EXPR
+         && STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (node))))
+    {
+      stmt_vec_info last_stmt_info = vect_find_last_scalar_stmt_in_slp (node);
+      si = gsi_for_stmt (last_stmt_info->stmt);
+    }
   else
     {
-      /* Or if we do not have 1:1 matching scalar stmts emit after the
-        children vectorized defs.  */
+         /* Else emit after the children's vectorized defs.  */
       gimple *last_stmt = NULL;
       FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
-       /* ???  With only external defs the following breaks.  Note
-          external defs do not get all vector init stmts generated
-          at the same place.  */
-       if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
-         {
-           /* We are emitting all vectorized stmts in the same place and
-              the last one is the last.
-              ???  Unless we have a load permutation applied and that
-              figures to re-use an earlier generated load.  */
-           unsigned j;
-           gimple *vstmt;
-           FOR_EACH_VEC_ELT (SLP_TREE_VEC_STMTS (child), j, vstmt)
-             if (!last_stmt
-                 || vect_stmt_dominates_stmt_p (last_stmt, vstmt))
-               last_stmt = vstmt;
-         }
+       {
+         if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
+           {
+             /* We are emitting all vectorized stmts in the same place and
+                the last one is the last.
+                ???  Unless we have a load permutation applied and that
+                figures to re-use an earlier generated load.  */
+             unsigned j;
+             gimple *vstmt;
+             FOR_EACH_VEC_ELT (SLP_TREE_VEC_STMTS (child), j, vstmt)
+               if (!last_stmt
+                   || vect_stmt_dominates_stmt_p (last_stmt, vstmt))
+                 last_stmt = vstmt;
+           }
+         else
+           {
+             /* For externals we have to look at all defs since their
+                insertion place is decided per vector.  */
+             unsigned j;
+             tree vdef;
+             FOR_EACH_VEC_ELT (SLP_TREE_VEC_DEFS (child), j, vdef)
+               if (TREE_CODE (vdef) == SSA_NAME)
+                 {
+                   gimple *vstmt = SSA_NAME_DEF_STMT (vdef);
+                   if (!last_stmt
+                       || vect_stmt_dominates_stmt_p (last_stmt, vstmt))
+                     last_stmt = vstmt;
+                 }
+           }
+       }
       if (is_a <gphi *> (last_stmt))
        si = gsi_after_labels (gimple_bb (last_stmt));
       else
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 32feec3e24b..07ab8d2c4e3 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2007,7 +2007,8 @@ extern bool can_duplicate_and_interleave_p (vec_info *, unsigned int, tree,
                                            tree * = NULL, tree * = NULL);
 extern void duplicate_and_interleave (vec_info *, gimple_seq *, tree,
                                      vec<tree>, unsigned int, vec<tree> &);
-extern int vect_get_place_in_interleaving_chain (stmt_vec_info, stmt_vec_info);
+extern int vect_get_place_in_interleaving_chain (stmt_vec_info, stmt_vec_info,
+                                                bool = true);
 
 /* In tree-vect-patterns.c.  */
 /* Pattern recognition functions.
-- 
2.26.2
