On Fri, 2 Oct 2020, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > This introduces a permute optimization phase for SLP which is
> > intended to cover the existing permute eliding for SLP reductions
> > plus commonizing the easy cases.
> >
> > It currently uses graphds to compute a postorder on the reverse
> > SLP graph and it handles all cases vect_attempt_slp_rearrange_stmts
> > did (hopefully - I've adjusted most testcases that triggered it
> > a few days ago).  It restricts itself to moving around bijective
> > permutations to simplify things for now, mainly around constant nodes.
> >
> > As a prerequisite it makes the SLP graph cyclic (ugh).  It looks
> > like it would pay off to compute a PRE/POST order visit array
> > once and elide all the recursive SLP graph walks and their
> > visited hash-sets.  At least for the time where we do not change
> > the SLP graph during such a walk.
> >
> > I do not like using graphds too much, but at least I don't have to
> > re-implement yet another RPO walk, so maybe it isn't too bad.
> >
> > Comments are welcome - I do want to see vect_attempt_slp_rearrange_stmts
> > go away for GCC 11, and the permute optimization helps non-store
> > BB vectorization opportunities where we can otherwise end up with a
> > lot of useless load permutes.
> 
> Looks really nice.  Got a couple of questions that probably just show
> my misunderstanding :-)
> 
> Is this intended to compute an optimal-ish solution?
The intent was to keep it simple but compute a solution that will not
increase the number of permutes.

> It looked from
> a quick read like it tried to push permutes as far away from loads as
> possible without creating permuted and unpermuted versions of the same
> node.  But I guess there will be cases where the optimal placement is
> somewhere between the two extremes of permuting at the loads and
> permuting as far away as possible.

So what it does is push permutes away from the loads until there's a
use requiring a different permutation.  But the handling of
constants/externals as having "all" permutations causes us to push
permutes along binary ops with one constant/external argument (in
addition to pushing them along all unary operations).

I have some patches that try to unify constant/external nodes during
SLP build (we're currently _not_ sharing them, thus not computing
their cost correctly) - once that's in (not sure if it happens this
stage1) it would make sense to try not to have too many different
permutes of constants/externals (esp. externals, I guess).

Now, did you have some other sub-optimality in mind?

> Of course, whatever we do will be a heuristic.  I just wasn't sure how
> often this would be best in practice.

Yeah, so I'm not sure where in a "series" of unary ops we'd want to
push a permutation.  The argument could be to leave it at the load,
for as few changes from the current handling as possible.  That could
be done with a reverse propagation stage.  I'll see if splitting out
some predicates from the current code makes it not too much
duplication to introduce this.

> It looks like the materialisation phase changes the choices for nodes
> on the fly, is that right?  If so, how does that work for backedges?
> I'd expected the materialisation phase to treat the permutation choice
> as read-only, and simply implement what the graph already said.

The materialization phase is also the decision stage (I wanted to
avoid duplicating the loop).  When we materialize a permutation at a
node which has differing uses we have to update the graph from there.

As for backedges, I wasn't sure and indeed there may be bugs - I do
have to investigate one libgomp FAIL from the testing.  It would be
odd to require iteration in the decision stage again, but in case
we're breaking a cycle we have to re-consider the backedge permutation
as well.  Which would mean we'd better make the decision where to
materialize during the propagation stage(?)

I'm going to analyze the FAIL now.

Richard.

> Thanks,
> Richard
> 
> > 
> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> > 
> > Richard.
> > 
> > 2020-10-02  Richard Biener  <rguent...@suse.de>
> > 
> > 	* tree-vect-data-refs.c (vect_slp_analyze_instance_dependence):
> > 	Use SLP_TREE_REPRESENTATIVE.
> > 	* tree-vectorizer.h (_slp_tree::vertex): New member used
> > 	for graphds interfacing.
> > 	* tree-vect-slp.c (vect_build_slp_tree_2): Allocate space
> > 	for PHI SLP children.
> > 	(vect_analyze_slp_backedges): New function filling in SLP
> > 	node children for PHIs that correspond to backedge values.
> > 	(vect_analyze_slp): Call vect_analyze_slp_backedges for the
> > 	graph.
> > 	(vect_slp_analyze_node_operations): Deal with a cyclic graph.
> > 	(vect_schedule_slp_instance): Likewise.
> > 	(vect_schedule_slp): Likewise.
> > 	(slp_copy_subtree): Remove.
> > 	(vect_slp_rearrange_stmts): Likewise.
> > 	(vect_attempt_slp_rearrange_stmts): Likewise.
> > 	(vect_slp_build_vertices): New functions.
> > 	(vect_slp_permute): Likewise.
> > 	(vect_optimize_slp): Remove special code to elide
> > 	permutations with SLP reductions.  Implement generic
> > 	permute optimization.
> > 
> > 	* gcc.dg/vect/bb-slp-50.c: New testcase.
> > 	* gcc.dg/vect/bb-slp-51.c: Likewise.
> > ---
> >  gcc/testsuite/gcc.dg/vect/bb-slp-50.c |  20 +
> >  gcc/testsuite/gcc.dg/vect/bb-slp-51.c |  20 +
> >  gcc/tree-vect-data-refs.c             |   2 +-
> >  gcc/tree-vect-slp.c                   | 660 +++++++++++++++++---------
> >  gcc/tree-vectorizer.h                 |   2 +
> >  5 files changed, 479 insertions(+), 225 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-50.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-51.c
> > 
> > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-50.c b/gcc/testsuite/gcc.dg/vect/bb-slp-50.c
> > new file mode 100644
> > index 00000000000..80216be4ebf
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-50.c
> > @@ -0,0 +1,20 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target vect_double } */
> > +
> > +double a[2];
> > +double b[2];
> > +double c[2];
> > +double d[2];
> > +double e[2];
> > +void foo(double x)
> > +{
> > +  double tembc0 = b[1] + c[1];
> > +  double tembc1 = b[0] + c[0];
> > +  double temde0 = d[0] + e[1];
> > +  double temde1 = d[1] + e[0];
> > +  a[0] = tembc0 + temde0;
> > +  a[1] = tembc1 + temde1;
> > +}
> > +
> > +/* We should common the permutation on the tembc{0,1} operands.  */
> > +/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 2 "slp2" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-51.c b/gcc/testsuite/gcc.dg/vect/bb-slp-51.c
> > new file mode 100644
> > index 00000000000..1481018428e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-51.c
> > @@ -0,0 +1,20 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target vect_double } */
> > +
> > +double a[2];
> > +double b[2];
> > +double c[2];
> > +double e[2];
> > +void foo(double x)
> > +{
> > +  double tembc0 = b[1] + c[1];
> > +  double tembc1 = b[0] + c[0];
> > +  double temde0 = 5 + e[1];
> > +  double temde1 = 11 + e[0];
> > +  a[0] = tembc0 + temde0;
> > +  a[1] = tembc1 + temde1;
> > +}
> > +
> > +/* We should common the permutations on the tembc{0,1} and temde{0,1}
> > +   operands.  */
> > +/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\r\\n\]* VEC_PERM_EXPR" 1 "slp2" } } */
> > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> > index 5bf93e2942b..fdc1f47dded 100644
> > --- a/gcc/tree-vect-data-refs.c
> > +++ b/gcc/tree-vect-data-refs.c
> > @@ -841,7 +841,7 @@ vect_slp_analyze_instance_dependence (vec_info *vinfo, slp_instance instance)
> > 
> >    /* The stores of this instance are at the root of the SLP tree.  */
> >    slp_tree store = SLP_INSTANCE_TREE (instance);
> > -  if (! STMT_VINFO_DATA_REF (SLP_TREE_SCALAR_STMTS (store)[0]))
> > +  if (! STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (store)))
> >      store = NULL;
> > 
> >    /* Verify we can sink stores to the vectorized stmt insert location.  */
> > diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> > index 085662bf468..05e1e5abec9 100644
> > --- a/gcc/tree-vect-slp.c
> > +++ b/gcc/tree-vect-slp.c
> > @@ -1236,8 +1236,8 @@ vect_build_slp_tree_2 (vec_info *vinfo,
> >  	  if (gimple_assign_rhs_code (stmt) == COND_EXPR)
> >  	    nops++;
> >  	}
> > -      else if (is_a <gphi *> (stmt_info->stmt))
> > -	nops = 0;
> > +      else if (gphi *phi = dyn_cast <gphi *> (stmt_info->stmt))
> > +	nops = gimple_phi_num_args (phi);
> >        else
> >  	return NULL;
> > 
> > @@ -1273,7 +1273,7 @@ vect_build_slp_tree_2 (vec_info *vinfo,
> >        else
> >  	return NULL;
> >        (*tree_size)++;
> > -      node = vect_create_new_slp_node (stmts, 0);
> > +      node = vect_create_new_slp_node (stmts, nops);
> >        SLP_TREE_VECTYPE (node) = vectype;
> >        return node;
> >      }
> > 
> > @@ -1795,188 +1795,6 @@ vect_mark_slp_stmts_relevant (slp_tree node)
> >    vect_mark_slp_stmts_relevant (node, visited);
> >  }
> > 
> > -/* Copy the SLP subtree rooted at NODE.  */
> > -
> > -static slp_tree
> > -slp_copy_subtree (slp_tree node, hash_map<slp_tree, slp_tree> &map)
> > -{
> > -  unsigned i;
> > -
> > -  bool existed_p;
> > -  slp_tree &copy_ref = map.get_or_insert (node, &existed_p);
> > -  if (existed_p)
> > -    return copy_ref;
> > -
> > -  copy_ref = new _slp_tree;
> > -  slp_tree copy = copy_ref;
> > -  SLP_TREE_DEF_TYPE (copy) = SLP_TREE_DEF_TYPE (node);
> > -  SLP_TREE_VECTYPE (copy) = SLP_TREE_VECTYPE (node);
> > -  SLP_TREE_REPRESENTATIVE (copy) = SLP_TREE_REPRESENTATIVE (node);
> > -  SLP_TREE_LANES (copy) = SLP_TREE_LANES (node);
> > -  copy->max_nunits = node->max_nunits;
> > -  SLP_TREE_REF_COUNT (copy) = 0;
> > -  if (SLP_TREE_SCALAR_STMTS (node).exists ())
> > -    SLP_TREE_SCALAR_STMTS (copy) = SLP_TREE_SCALAR_STMTS (node).copy ();
> > -  if (SLP_TREE_SCALAR_OPS (node).exists ())
> > -    SLP_TREE_SCALAR_OPS (copy) = SLP_TREE_SCALAR_OPS (node).copy ();
> > -  if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> > -    SLP_TREE_LOAD_PERMUTATION (copy) = SLP_TREE_LOAD_PERMUTATION (node).copy ();
> > -  if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> > -    SLP_TREE_LANE_PERMUTATION (copy) = SLP_TREE_LANE_PERMUTATION (node).copy ();
> > -  if (SLP_TREE_CHILDREN (node).exists ())
> > -    SLP_TREE_CHILDREN (copy) = SLP_TREE_CHILDREN (node).copy ();
> > -  gcc_assert (!SLP_TREE_VEC_STMTS (node).exists ());
> > -
> > -  slp_tree child;
> > -  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (copy), i, child)
> > -    {
> > -      SLP_TREE_CHILDREN (copy)[i] = slp_copy_subtree (child, map);
> > -      SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (copy)[i])++;
> > -    }
> > -  return copy;
> > -}
> > -
> > -/* Rearrange the statements of NODE according to PERMUTATION.  */
> > -
> > -static void
> > -vect_slp_rearrange_stmts (slp_tree node, unsigned int group_size,
> > -			  vec<unsigned> permutation,
> > -			  hash_set<slp_tree> &visited)
> > -{
> > -  unsigned int i;
> > -  slp_tree child;
> > -
> > -  if (visited.add (node))
> > -    return;
> > -
> > -  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > -    vect_slp_rearrange_stmts (child, group_size, permutation, visited);
> > -
> > -  if (SLP_TREE_SCALAR_STMTS (node).exists ())
> > -    {
> > -      gcc_assert (group_size == SLP_TREE_SCALAR_STMTS (node).length ());
> > -      vec<stmt_vec_info> tmp_stmts;
> > -      tmp_stmts.create (group_size);
> > -      tmp_stmts.quick_grow (group_size);
> > -      stmt_vec_info stmt_info;
> > -      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
> > -	tmp_stmts[permutation[i]] = stmt_info;
> > -      SLP_TREE_SCALAR_STMTS (node).release ();
> > -      SLP_TREE_SCALAR_STMTS (node) = tmp_stmts;
> > -    }
> > -  if (SLP_TREE_SCALAR_OPS (node).exists ())
> > -    {
> > -      gcc_assert (group_size == SLP_TREE_SCALAR_OPS (node).length ());
> > -      vec<tree> tmp_ops;
> > -      tmp_ops.create (group_size);
> > -      tmp_ops.quick_grow (group_size);
> > -      tree op;
> > -      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > -	tmp_ops[permutation[i]] = op;
> > -      SLP_TREE_SCALAR_OPS (node).release ();
> > -      SLP_TREE_SCALAR_OPS (node) = tmp_ops;
> > -    }
> > -  if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> > -    {
> > -      gcc_assert (group_size == SLP_TREE_LANE_PERMUTATION (node).length ());
> > -      for (i = 0; i < group_size; ++i)
> > -	SLP_TREE_LANE_PERMUTATION (node)[i].second
> > -	  = permutation[SLP_TREE_LANE_PERMUTATION (node)[i].second];
> > -    }
> > -}
> > -
> > -
> > -/* Attempt to reorder stmts in a reduction chain so that we don't
> > -   require any load permutation.  Return true if that was possible,
> > -   otherwise return false.  */
> > -
> > -static bool
> > -vect_attempt_slp_rearrange_stmts (slp_instance slp_instn)
> > -{
> > -  unsigned int i, j;
> > -  unsigned int lidx;
> > -  slp_tree node, load;
> > -
> > -  if (SLP_INSTANCE_LOADS (slp_instn).is_empty ())
> > -    return false;
> > -
> > -  /* Compare all the permutation sequences to the first one.  We know
> > -     that at least one load is permuted.  */
> > -  node = SLP_INSTANCE_LOADS (slp_instn)[0];
> > -  if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> > -    return false;
> > -  unsigned int group_size = SLP_TREE_LOAD_PERMUTATION (node).length ();
> > -  for (i = 1; SLP_INSTANCE_LOADS (slp_instn).iterate (i, &load); ++i)
> > -    {
> > -      if (!SLP_TREE_LOAD_PERMUTATION (load).exists ()
> > -	  || SLP_TREE_LOAD_PERMUTATION (load).length () != group_size)
> > -	return false;
> > -      FOR_EACH_VEC_ELT (SLP_TREE_LOAD_PERMUTATION (load), j, lidx)
> > -	if (lidx != SLP_TREE_LOAD_PERMUTATION (node)[j])
> > -	  return false;
> > -    }
> > -
> > -  /* Check that the loads in the first sequence are different and there
> > -     are no gaps between them and that there is an actual permutation.  */
> > -  bool any_permute = false;
> > -  auto_sbitmap load_index (group_size);
> > -  bitmap_clear (load_index);
> > -  FOR_EACH_VEC_ELT (node->load_permutation, i, lidx)
> > -    {
> > -      if (lidx != i)
> > -	any_permute = true;
> > -      if (lidx >= group_size)
> > -	return false;
> > -      if (bitmap_bit_p (load_index, lidx))
> > -	return false;
> > -
> > -      bitmap_set_bit (load_index, lidx);
> > -    }
> > -  if (!any_permute)
> > -    return false;
> > -  for (i = 0; i < group_size; i++)
> > -    if (!bitmap_bit_p (load_index, i))
> > -      return false;
> > -
> > -  /* This permutation is valid for reduction.  Since the order of the
> > -     statements in the nodes is not important unless they are memory
> > -     accesses, we can rearrange the statements in all the nodes
> > -     according to the order of the loads.  */
> > -
> > -  /* We have to unshare the SLP tree we modify.  */
> > -  hash_map<slp_tree, slp_tree> map;
> > -  slp_tree unshared = slp_copy_subtree (SLP_INSTANCE_TREE (slp_instn), map);
> > -  vect_free_slp_tree (SLP_INSTANCE_TREE (slp_instn));
> > -  SLP_TREE_REF_COUNT (unshared)++;
> > -  SLP_INSTANCE_TREE (slp_instn) = unshared;
> > -  FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (slp_instn), i, node)
> > -    SLP_INSTANCE_LOADS (slp_instn)[i] = *map.get (node);
> > -  node = SLP_INSTANCE_LOADS (slp_instn)[0];
> > -
> > -  /* Do the actual re-arrangement.  */
> > -  hash_set<slp_tree> visited;
> > -  vect_slp_rearrange_stmts (SLP_INSTANCE_TREE (slp_instn), group_size,
> > -			    node->load_permutation, visited);
> > -
> > -  /* We are done, no actual permutations need to be generated.  */
> > -  poly_uint64 unrolling_factor = SLP_INSTANCE_UNROLLING_FACTOR (slp_instn);
> > -  FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (slp_instn), i, node)
> > -    {
> > -      stmt_vec_info first_stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
> > -      first_stmt_info = DR_GROUP_FIRST_ELEMENT (first_stmt_info);
> > -      /* But we have to keep those permutations that are required because
> > -	 of handling of gaps.  */
> > -      if (known_eq (unrolling_factor, 1U)
> > -	  || (group_size == DR_GROUP_SIZE (first_stmt_info)
> > -	      && DR_GROUP_GAP (first_stmt_info) == 0))
> > -	SLP_TREE_LOAD_PERMUTATION (node).release ();
> > -      else
> > -	for (j = 0; j < SLP_TREE_LOAD_PERMUTATION (node).length (); ++j)
> > -	  SLP_TREE_LOAD_PERMUTATION (node)[j] = j;
> > -    }
> > -
> > -  return true;
> > -}
> > 
> >  /* Gather loads in the SLP graph NODE and populate the INST loads array.  */
> > 
> > @@ -2440,6 +2258,51 @@ vect_analyze_slp_instance (vec_info *vinfo,
> >    return false;
> >  }
> > 
> > +static void
> > +vect_analyze_slp_backedges (vec_info *vinfo, slp_tree node,
> > +			    scalar_stmts_to_slp_tree_map_t *bst_map,
> > +			    hash_set<slp_tree> visited)
> > +{
> > +  if (SLP_TREE_DEF_TYPE (node) != vect_internal_def
> > +      || visited.add (node))
> > +    return;
> > +
> > +  slp_tree child;
> > +  unsigned i;
> > +  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > +    if (child)
> > +      vect_analyze_slp_backedges (vinfo, child, bst_map, visited);
> > +
> > +  if (gphi *phi = dyn_cast <gphi *> (SLP_TREE_REPRESENTATIVE (node)->stmt))
> > +    for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> > +      {
> > +	auto_vec<stmt_vec_info, 64> stmts;
> > +	unsigned j;
> > +	stmt_vec_info phi_info;
> > +	FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, phi_info)
> > +	  {
> > +	    tree def = gimple_phi_arg_def (as_a <gphi *>(phi_info->stmt), i);
> > +	    stmt_vec_info def_info = vinfo->lookup_def (def);
> > +	    if (!def_info)
> > +	      break;
> > +	    stmts.safe_push (vect_stmt_to_vectorize (def_info));
> > +	  }
> > +	if (j != SLP_TREE_LANES (node))
> > +	  continue;
> > +	slp_tree *edge_def = bst_map->get (stmts);
> > +	if (edge_def)
> > +	  {
> > +	    /* ??? We're currently not recording non-backedge children
> > +	       of PHIs like external reduction initial values.  Avoid
> > +	       NULL entries in SLP_TREE_CHILDREN for those and thus
> > +	       for now simply only record backedge defs at a
> > +	       SLP_TREE_CHILDREN index possibly not matching that of
> > +	       the corresponding PHI argument index.  */
> > +	    SLP_TREE_CHILDREN (node).quick_push (*edge_def);
> > +	    (*edge_def)->refcnt++;
> > +	  }
> > +      }
> > +}
> > 
> >  /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
> >     trees of packed scalar stmts if SLP is possible.  */
> > 
> > @@ -2491,6 +2354,13 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
> >  				     max_tree_size);
> >      }
> > 
> > +  /* Fill in backedges.  */
> > +  slp_instance instance;
> > +  hash_set<slp_tree> visited;
> > +  FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
> > +    vect_analyze_slp_backedges (vinfo, SLP_INSTANCE_TREE (instance),
> > +				bst_map, visited);
> > +
> >    /* The map keeps a reference on SLP nodes built, release that.  */
> >    for (scalar_stmts_to_slp_tree_map_t::iterator it = bst_map->begin ();
> >         it != bst_map->end (); ++it)
> > @@ -2501,34 +2371,376 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
> >    return opt_result::success ();
> >  }
> > 
> > +#if 0
> > +/* Fill the reverse SLP graph edges and collect leaf nodes in LEAFS.  */
> > +
> > +static void
> > +vect_slp_build_parents (hash_set<slp_tree> visited, slp_tree node,
> > +			vec<slp_tree> &leafs)
> > +{
> > +  if (visited.add (node))
> > +    return;
> > +
> > +  if (SLP_TREE_CHILDREN (node).is_empty ())
> > +    leafs.safe_push (node);
> > +  else
> > +    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > +      {
> > +	SLP_TREE_PARENTS (child).safe_push (node);
> > +	vect_slp_build_parents (visited, info child);
> > +      }
> > +}
> > +
> > +static void
> > +vect_slp_build_parents (vec_info *info, vec<slp_tree> &leafs)
> > +{
> > +  hash_set<slp_tree> visited;
> > +  FOR_EACH_VEC_ELT (info->slp_instances, i, instance)
> > +    vect_slp_build_parents (visited, SLP_INSTANCE_ROOT (instance), leafs);
> > +}
> > +#endif
> > +
> > +/* Fill the vertices vector with all nodes in the SLP graph.  */
> > +
> > +static void
> > +vect_slp_build_vertices (hash_set<slp_tree> visited, slp_tree node,
> > +			 vec<slp_tree> &vertices, vec<int> &leafs)
> > +{
> > +  unsigned i;
> > +  slp_tree child;
> > +
> > +  if (visited.add (node))
> > +    return;
> > +
> > +  node->vertex = vertices.length ();
> > +  vertices.safe_push (node);
> > +  if (SLP_TREE_CHILDREN (node).is_empty ())
> > +    leafs.safe_push (node->vertex);
> > +  else
> > +    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > +      vect_slp_build_vertices (visited, child, vertices, leafs);
> > +}
> > +
> > +static void
> > +vect_slp_build_vertices (vec_info *info, vec<slp_tree> &vertices,
> > +			 vec<int> &leafs)
> > +{
> > +  hash_set<slp_tree> visited;
> > +  unsigned i;
> > +  slp_instance instance;
> > +  FOR_EACH_VEC_ELT (info->slp_instances, i, instance)
> > +    vect_slp_build_vertices (visited, SLP_INSTANCE_TREE (instance), vertices,
> > +			     leafs);
> > +}
> > +
> > +/* Apply (reverse) PERM to VEC.  */
> > +
> > +template <class T>
> > +static void
> > +vect_slp_permute (vec<unsigned> perm,
> > +		  vec<T> &vec, bool reverse)
> > +{
> > +  auto_vec<T, 64> saved;
> > +  saved.create (vec.length ());
> > +  for (unsigned i = 0; i < vec.length (); ++i)
> > +    saved.quick_push (vec[i]);
> > +
> > +  if (reverse)
> > +    {
> > +      for (unsigned i = 0; i < vec.length (); ++i)
> > +	vec[perm[i]] = saved[i];
> > +      for (unsigned i = 0; i < vec.length (); ++i)
> > +	gcc_assert (vec[perm[i]] == saved[i]);
> > +    }
> > +  else
> > +    {
> > +      for (unsigned i = 0; i < vec.length (); ++i)
> > +	vec[i] = saved[perm[i]];
> > +      for (unsigned i = 0; i < vec.length (); ++i)
> > +	gcc_assert (vec[i] == saved[perm[i]]);
> > +    }
> > +}
> > +
> >  void
> >  vect_optimize_slp (vec_info *vinfo)
> >  {
> > -  /* Optimize permutations in SLP reductions.  */
> > -  slp_instance instance;
> > +  if (vinfo->slp_instances.is_empty ())
> > +    return;
> > +
> > +  slp_tree node;
> >    unsigned i;
> > -  FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
> > +  auto_vec<slp_tree> vertices;
> > +  auto_vec<int> leafs;
> > +  vect_slp_build_vertices (vinfo, vertices, leafs);
> > +
> > +  struct graph *slpg = new_graph (vertices.length ());
> > +  FOR_EACH_VEC_ELT (vertices, i, node)
> >      {
> > -      slp_tree node = SLP_INSTANCE_TREE (instance);
> > -      stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
> > -      /* Reduction (there are no data-refs in the root).
> > -	 In reduction chain the order of the loads is not important.  */
> > -      if (!STMT_VINFO_DATA_REF (stmt_info)
> > -	  && !REDUC_GROUP_FIRST_ELEMENT (stmt_info)
> > -	  && !SLP_INSTANCE_ROOT_STMT (instance))
> > -	vect_attempt_slp_rearrange_stmts (instance);
> > +      unsigned j;
> > +      slp_tree child;
> > +      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
> > +	add_edge (slpg, i, child->vertex);
> >      }
> > 
> > -  /* Gather all loads in the SLP graph.  */
> > -  auto_vec<slp_tree> slp_loads;
> > -  hash_set<slp_tree> visited;
> > -  FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
> > -    vect_gather_slp_loads (slp_loads, SLP_INSTANCE_TREE (instance),
> > -			   visited);
> > +  /* Compute (reverse) postorder on the inverted graph.  */
> > +  auto_vec<int> ipo;
> > +  graphds_dfs (slpg, &leafs[0], leafs.length (), &ipo, false, NULL, NULL);
> > 
> > -  slp_tree node;
> > -  FOR_EACH_VEC_ELT (slp_loads, i, node)
> > +  auto_sbitmap n_visited (vertices.length ());
> > +  auto_vec<int> n_perm (vertices.length ());
> > +  auto_vec<vec<unsigned> > perms;
> > +
> > +  bitmap_clear (n_visited);
> > +  n_perm.quick_grow_cleared (vertices.length ());
> > +  perms.safe_push (vNULL); /* zero is no permute */
> > +
> > +  /* Produce initial permutations.  */
> > +  for (i = 0; i < leafs.length (); ++i)
> >      {
> > +      int idx = leafs[i];
> > +      slp_tree node = vertices[idx];
> > +
> > +      /* Handle externals and constants optimistically throughout the
> > +	 iteration.  */
> > +      if (SLP_TREE_DEF_TYPE (node) == vect_external_def
> > +	  || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> > +	continue;
> > +
> > +      /* Loads are the only thing generating permutes and leafs do not
> > +	 change across iterations.  */
> > +      bitmap_set_bit (n_visited, idx);
> > +      if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> > +	continue;
> > +
> > +      /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> > +	 node unpermuted, record this permute.  */
> > +      stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
> > +      if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> > +	continue;
> > +      dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> > +      unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> > +      bool any_permute = false;
> > +      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > +	{
> > +	  unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
> > +	  imin = MIN (imin, idx);
> > +	  imax = MAX (imax, idx);
> > +	  if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> > +	    any_permute = true;
> > +	}
> > +      /* If there's no permute no need to split one out.  */
> > +      if (!any_permute)
> > +	continue;
> > +      /* If the span doesn't match we'd disrupt VF computation, avoid
> > +	 that for now.  */
> > +      if (imax - imin + 1 != SLP_TREE_LANES (node))
> > +	continue;
> > +
> > +      /* For now only handle true permutes, like
> > +	 vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> > +	 when permuting constants and invariants keeping the permute
> > +	 bijective.  */
> > +      auto_sbitmap load_index (SLP_TREE_LANES (node));
> > +      bitmap_clear (load_index);
> > +      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > +	bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
> > +      unsigned j;
> > +      for (j = 0; j < SLP_TREE_LANES (node); ++j)
> > +	if (!bitmap_bit_p (load_index, j))
> > +	  break;
> > +      if (j != SLP_TREE_LANES (node))
> > +	continue;
> > +
> > +      vec<unsigned> perm = vNULL;
> > +      perm.safe_grow (SLP_TREE_LANES (node), true);
> > +      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > +	perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> > +      perms.safe_push (perm);
> > +      n_perm[idx] = perms.length () - 1;
> > +    }
> > +
> > +  bool changed;
> > +  bool materialize = false;
> > +  do
> > +    {
> > +      changed = false;
> > +
> > +      /* Propagate permutes along the graph.  */
> > +      for (i = vertices.length (); i > 0 ; --i)
> > +	{
> > +	  int idx = ipo[i-1];
> > +	  slp_tree node = vertices[idx];
> > +	  /* For leafs there's nothing to do - we've seeded permutes
> > +	     on those above.  */
> > +	  if (SLP_TREE_DEF_TYPE (node) != vect_internal_def)
> > +	    continue;
> > +
> > +	  bitmap_set_bit (n_visited, idx);
> > +
> > +	  /* We cannot move a permute across a store.  */
> > +	  if (STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (node))
> > +	      && DR_IS_WRITE
> > +		   (STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (node))))
> > +	    continue;
> > +
> > +	  int perm = -1;
> > +	  for (graph_edge *succ = slpg->vertices[idx].succ;
> > +	       succ; succ = succ->succ_next)
> > +	    {
> > +	      int succ_idx = succ->dest;
> > +	      /* Handle unvisited nodes optimistically.  */
> > +	      /* ??? But for constants once we want to handle non-bijective
> > +		 permutes we have to verify the permute, when unifying lanes,
> > +		 will not unify different constants.  For example see
> > +		 gcc.dg/vect/bb-slp-14.c for a case that would break.  */
> > +	      if (!bitmap_bit_p (n_visited, succ_idx))
> > +		continue;
> > +	      int succ_perm = n_perm[succ_idx];
> > +	      if (perm == -1)
> > +		perm = succ_perm;
> > +	      else if (succ_perm == 0)
> > +		{
> > +		  perm = 0;
> > +		  break;
> > +		}
> > +	      else if (perm != succ_perm
> > +		       && (perms[perm].length () != perms[succ_perm].length ()
> > +			   || memcmp (&perms[perm][0], &perms[succ_perm][0],
> > +				      sizeof (unsigned) * perms[perm].length ())))
> > +		{
> > +		  perm = 0;
> > +		  break;
> > +		}
> > +	    }
> > +
> > +	  if (perm == -1)
> > +	    perm = n_perm[idx];
> > +	  else if (n_perm[idx] != perm)
> > +	    {
> > +	      n_perm[idx] = perm;
> > +	      changed = true;
> > +	    }
> > +
> > +	  if (!materialize || perm == 0)
> > +	    continue;
> > +	  /* Permute materialization phase.  Look whether there's
> > +	     a use (pred) edge that is permuted differently than us.
> > +	     In that case replace ourselves with a permute node
> > +	     and clear n_perm.  In any case apply n_perm.  */
> > +	  bool all_preds_permuted = slpg->vertices[idx].pred != NULL;
> > +	  for (graph_edge *pred = slpg->vertices[idx].pred;
> > +	       pred; pred = pred->pred_next)
> > +	    {
> > +	      int pred_perm = n_perm[pred->src];
> > +	      if (pred_perm != perm
> > +		  && (perms[perm].length () != perms[pred_perm].length ()
> > +		      || memcmp (&perms[perm][0], &perms[pred_perm][0],
> > +				 sizeof (unsigned) * perms[perm].length ())))
> > +		{
> > +		  all_preds_permuted = false;
> > +		  break;
> > +		}
> > +	    }
> > +
> > +	  /* First permute invariant/external original successors.  */
> > +	  unsigned j;
> > +	  slp_tree child;
> > +	  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
> > +	    {
> > +	      if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
> > +		continue;
> > +
> > +	      /* We can end up sharing some externals via two_operator
> > +		 handling.  Be prepared to unshare those.  */
> > +	      if (child->refcnt != 1)
> > +		{
> > +		  gcc_assert (slpg->vertices[child->vertex].pred->pred_next);
> > +		  SLP_TREE_CHILDREN (node)[j] = child
> > +		    = vect_create_new_slp_node
> > +			(SLP_TREE_SCALAR_OPS (child).copy ());
> > +		}
> > +	      vect_slp_permute (perms[perm],
> > +				SLP_TREE_SCALAR_OPS (child), true);
> > +	    }
> > +
> > +	  if (all_preds_permuted)
> > +	    {
> > +	      /* Apply the reverse permutation to our stmts.  */
> > +	      vect_slp_permute (perms[perm],
> > +				SLP_TREE_SCALAR_STMTS (node), true);
> > +	      /* And to the load permutation, which we can simply
> > +		 make regular by design.  */
> > +	      if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> > +		{
> > +		  /* ??? When we handle non-bijective permutes the idea
> > +		     is that we can force the load-permutation to be
> > +		     { min, min + 1, min + 2, ... max }.  But then the
> > +		     scalar defs might no longer match the lane content
> > +		     which means wrong-code with live lane vectorization.
> > +		     So we possibly have to have NULL entries for those.  */
> > +		  vect_slp_permute (perms[perm],
> > +				    SLP_TREE_LOAD_PERMUTATION (node), true);
> > +		}
> > +	    }
> > +	  else if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> > +	    {
> > +	      /* For loads simply drop n_perm, the load permutation
> > +		 already performs the desired permutation.  */
> > +	      n_perm[idx] = 0;
> > +	    }
> > +	  else
> > +	    {
> > +	      /* Make a copy of NODE and in-place change it to a
> > +		 VEC_PERM node to permute the lanes of the copy.  */
> > +	      slp_tree copy = new _slp_tree;
> > +	      SLP_TREE_CHILDREN (copy) = SLP_TREE_CHILDREN (node);
> > +	      SLP_TREE_CHILDREN (node) = vNULL;
> > +	      SLP_TREE_SCALAR_STMTS (copy)
> > +		= SLP_TREE_SCALAR_STMTS (node).copy ();
> > +	      vect_slp_permute (perms[perm],
> > +				SLP_TREE_SCALAR_STMTS (copy), true);
> > +	      gcc_assert (!SLP_TREE_SCALAR_OPS (node).exists ());
> > +	      SLP_TREE_REPRESENTATIVE (copy) = SLP_TREE_REPRESENTATIVE (node);
> > +	      gcc_assert (!SLP_TREE_LOAD_PERMUTATION (node).exists ());
> > +	      SLP_TREE_LANE_PERMUTATION (copy)
> > +		= SLP_TREE_LANE_PERMUTATION (node);
> > +	      SLP_TREE_LANE_PERMUTATION (node) = vNULL;
> > +	      SLP_TREE_VECTYPE (copy) = SLP_TREE_VECTYPE (node);
> > +	      copy->refcnt = 1;
> > +	      copy->max_nunits = node->max_nunits;
> > +	      SLP_TREE_DEF_TYPE (copy) = SLP_TREE_DEF_TYPE (node);
> > +	      SLP_TREE_LANES (copy) = SLP_TREE_LANES (node);
> > +	      SLP_TREE_CODE (copy) = SLP_TREE_CODE (node);
> > +
> > +	      /* Now turn NODE into a VEC_PERM.  */
> > +	      SLP_TREE_CHILDREN (node).safe_push (copy);
> > +	      SLP_TREE_LANE_PERMUTATION (node).create (SLP_TREE_LANES (node));
> > +	      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > +		SLP_TREE_LANE_PERMUTATION (node)
> > +		  .quick_push (std::make_pair (0, perms[perm][j]));
> > +	      SLP_TREE_CODE (node) = VEC_PERM_EXPR;
> > +
> > +	      /* Drop the permute for parents.  */
> > +	      n_perm[idx] = 0;
> > +	    }
> > +	}
> > +      if (materialize)
> > +	break;
> > +      if (!changed)
> > +	materialize = true;
> > +    }
> > +  while (changed || materialize);
> > +
> > +  /* Free the perms vector used for propagation.  */
> > +  while (!perms.is_empty ())
> > +    perms.pop ().release ();
> > +  free_graph (slpg);
> > +
> > +
> > +  /* Now elide load permutations that are not necessary.  */
> > +  for (i = 0; i < leafs.length (); ++i)
> > +    {
> > +      node = vertices[i];
> >        if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> >  	continue;
> > 
> > @@ -2940,12 +3152,9 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
> >      return true;
> > 
> >    /* If we already analyzed the exact same set of scalar stmts we're done.
> > -     We share the generated vector stmts for those.
> > -     The SLP graph is acyclic so not caching whether we failed or succeeded
> > -     doesn't result in any issue since we throw away the lvisited set
> > -     when we fail.  */
> > +     We share the generated vector stmts for those.  */
> >    if (visited.contains (node)
> > -      || lvisited.contains (node))
> > +      || lvisited.add (node))
> >      return true;
> > 
> >    bool res = true;
> > @@ -2958,12 +3167,10 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
> >      }
> > 
> >    if (res)
> > -    {
> > -      res = vect_slp_analyze_node_operations_1 (vinfo, node, node_instance,
> > -						cost_vec);
> > -      if (res)
> > -	lvisited.add (node);
> > -    }
> > +    res = vect_slp_analyze_node_operations_1 (vinfo, node, node_instance,
> > +					      cost_vec);
> > +  if (!res)
> > +    lvisited.remove (node);
> > 
> >    /* When the node can be vectorized cost invariant nodes it references.
> >       This is not done in DFS order to allow the refering node
> > @@ -4566,7 +4773,8 @@ vectorizable_slp_permutation (vec_info *vinfo, gimple_stmt_iterator *gsi,
> > 
> >  static void
> >  vect_schedule_slp_instance (vec_info *vinfo,
> > -			    slp_tree node, slp_instance instance)
> > +			    slp_tree node, slp_instance instance,
> > +			    hash_set<slp_tree> &visited)
> >  {
> >    gimple_stmt_iterator si;
> >    int i;
> > @@ -4593,8 +4801,12 @@ vect_schedule_slp_instance (vec_info *vinfo,
> >        return;
> >      }
> > 
> > +  /* ??? If we'd have a way to mark backedges that would be cheaper.  */
> > +  if (visited.add (node))
> > +    return;
> > +
> >    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > -    vect_schedule_slp_instance (vinfo, child, instance);
> > +    vect_schedule_slp_instance (vinfo, child, instance, visited);
> > 
> >    gcc_assert (SLP_TREE_NUMBER_OF_VEC_STMTS (node) != 0);
> >    SLP_TREE_VEC_STMTS (node).create (SLP_TREE_NUMBER_OF_VEC_STMTS (node));
> > @@ -4618,14 +4830,13 @@ vect_schedule_slp_instance (vec_info *vinfo,
> >        last_stmt_info = vect_find_last_scalar_stmt_in_slp (node);
> >        si = gsi_for_stmt (last_stmt_info->stmt);
> >      }
> > -  else if (SLP_TREE_CHILDREN (node).is_empty ())
> > +  else if ((STMT_VINFO_TYPE (SLP_TREE_REPRESENTATIVE (node))
> > +	    == cycle_phi_info_type)
> > +	   || (STMT_VINFO_TYPE (SLP_TREE_REPRESENTATIVE (node))
> > +	       == induc_vec_info_type))
> >      {
> > -      /* This happens for reduction and induction PHIs where we do not use the
> > +      /* For reduction and induction PHIs we do not use the
> >  	 insertion iterator.  */
> > -      gcc_assert (STMT_VINFO_TYPE (SLP_TREE_REPRESENTATIVE (node))
> > -		  == cycle_phi_info_type
> > -		  || (STMT_VINFO_TYPE (SLP_TREE_REPRESENTATIVE (node))
> > -		      == induc_vec_info_type));
> >        si = gsi_none ();
> >      }
> >    else
> > @@ -4837,6 +5048,7 @@ vect_schedule_slp (vec_info *vinfo, vec<slp_instance> slp_instances)
> >    slp_instance instance;
> >    unsigned int i;
> > 
> > +  hash_set<slp_tree> visited;
> >    FOR_EACH_VEC_ELT (slp_instances, i, instance)
> >      {
> >        slp_tree node = SLP_INSTANCE_TREE (instance);
> > @@ -4851,7 +5063,7 @@ vect_schedule_slp (vec_info *vinfo, vec<slp_instance> slp_instances)
> >  				     SLP_INSTANCE_TREE (instance));
> >      }
> >      /* Schedule the tree of INSTANCE.  */
> > -    vect_schedule_slp_instance (vinfo, node, instance);
> > +    vect_schedule_slp_instance (vinfo, node, instance, visited);
> > 
> >      if (SLP_INSTANCE_ROOT_STMT (instance))
> >        vectorize_slp_instance_root_stmt (node, instance);
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index 37b091558fd..93d14ef9089 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -161,6 +161,8 @@ struct _slp_tree {
> >    unsigned int lanes;
> >    /* The operation of this node.  */
> >    enum tree_code code;
> > +
> > +  int vertex;
> >  };

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imend
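[The propagation "meet" discussed in the exchange above - permutes seeded
at loads, kept while all uses agree, collapsed to "materialize here" on a
conflict, then reverse-applied to a node's lanes - can be sketched
standalone.  Below is a minimal plain-C sketch using made-up stand-in
structures (fixed two-lane nodes, permute index 0 meaning "no permute");
it is not the GCC slp_tree/graphds API, just an illustration of the
scheme the vect_optimize_slp hunk implements.]

/* Stand-in sketch of the permute propagation meet; not the GCC API.
   A permute is an index into PERMS, 0 meaning "no permute".  */

#include <stdio.h>
#include <string.h>

#define LANES 2

/* Apply PERM to VEC, or its inverse when REVERSE is set; this mirrors
   the semantics of the patch's vect_slp_permute template.  */
static void
apply_perm (const unsigned perm[LANES], int vec[LANES], int reverse)
{
  int saved[LANES];
  memcpy (saved, vec, sizeof saved);
  for (unsigned i = 0; i < LANES; ++i)
    if (reverse)
      vec[perm[i]] = saved[i];
    else
      vec[i] = saved[perm[i]];
}

/* Meet two permute choices: -1 (unknown) is optimistic, identical
   permutes survive, any conflict collapses to 0 so that a permute
   gets materialized at this node.  */
static int
meet (int a, int b, unsigned perms[][LANES])
{
  if (a == -1)
    return b;
  if (b == -1)
    return a;
  if (a == b || memcmp (perms[a], perms[b], sizeof perms[a]) == 0)
    return a;
  return 0;
}

int
main (void)
{
  unsigned perms[2][LANES] = { { 0, 1 }, { 1, 0 } }; /* identity, swap */

  /* Two loads, both permuted with perms[1], feeding one addition:
     the meet keeps the permute, so it moves past the add node.  */
  int add_perm = meet (meet (-1, 1, perms), 1, perms);
  printf ("permute propagated to the add: %d\n", add_perm);

  /* Materialization then reverse-applies the chosen permute to the
     node's lanes, as the patch does for SLP_TREE_SCALAR_STMTS.  */
  int lanes[LANES] = { 10, 20 };
  apply_perm (perms[add_perm], lanes, 1);
  printf ("lanes after reverse permute: %d %d\n", lanes[0], lanes[1]);
  return 0;
}

[In the patch the same meet runs over the graphds postorder of the
inverted SLP graph, and the materialization pass rewrites nodes into
VEC_PERM_EXPR nodes where uses disagree - the commoning that the
bb-slp-50.c/bb-slp-51.c testcases check for.]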