Hi,
On 20/05/2021 11:22, Richard Biener wrote:
On Mon, 17 May 2021, Andre Vieira (lists) wrote:
Hi,
So this is my second attempt at finding a way to improve how we generate the
vector IVs and teach the vectorizer to share them between the main loop and
epilogues. On IRC we discussed my idea to use the loop's control_iv, but that
was a terrible idea and I quickly threw it in the bin. The main problem, which
for some reason I failed to see, was that the control_iv increases by 's' and
the datarefs by 's' * NELEMENTS, where 's' is usually 1 and NELEMENTS the
number of elements we handle per iteration. That means the epilogue loops
would have to start from the last loop's IV * the last loop's NELEMENTS and
that would just cause a mess.
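To make the mismatch concrete, here is a rough sketch (i, p and NELEMENTS are
illustrative names, not the vectorizer's own):

  i = i + 1;          /* control IV, step 's' == 1                    */
  p = p + NELEMENTS;  /* dataref IV, 's' * NELEMENTS elements         */
  /* so an epilogue reusing the control IV would have to start its
     dataref from  base + i * NELEMENTS, hence the mess above.  */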
Instead I started to think about creating IVs for the datarefs, and what I
thought worked best was to create these in the scalar loop before peeling.
That way the peeling mechanism takes care of duplicating them for the vector
and scalar epilogues and also takes care of adding phi-nodes for the
skip_vector paths.
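Conceptually, a minimal sketch of that idea (dr_iv, base, step and use () are
made-up names, not what the patch emits):

  /* Dataref IV added to the scalar loop before peeling:  */
  char *dr_iv = base;
  for (int i = 0; i < n; i++)
    {
      use (*dr_iv);
      dr_iv += step;
    }
  /* After peeling, the skip_vector guard's join block ends up with a
     PHI merging 'base' (skip path) with the main vector loop's final
     dr_iv, which the epilogue can then use as its initial pointer.  */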
How does this work for if-converted loops where we use the
non-if-converted scalar loop for (scalar) peeling but the
if-converted scalar loop for vectorized epilogues? I suppose
you're only adjusting the if-converted copy.
True, I hadn't thought about this :(
These new IVs have two functions:
1) 'vect_create_data_ref_ptr' can use them to:
a) if it's the main loop: replace the 'initial' value of the main loop's IV
and the initial values in the skip_vector phi-nodes
b) update the skip_vector phi-nodes' argument for the non-skip path with
the updated vector ptr.
b) means the prologue IV will not be dead there so we actually need
to compute it? I suppose IVOPTs could be taught to replace an
IV with its final value (based on some other IV) when it's unused?
Or does it already magically do good?
It does not and ...
2) They are used for the scalar epilogue, ensuring it shares the same
datareference ptr.
There are still a variety of 'hacky' elements here and a lot of testing to be
done, but I hope to be able to clean them up. One of the main issues was that
I had to skip a couple of checks for the added phi-nodes and update statements,
as these do not have a stmt_vec_info representation.
Though I'm not sure adding this representation at their creation was much
cleaner... It is something I could play around with but I thought this was a
good moment to ask you for input. For instance, maybe we could do this
transformation before analysis?
Also be aware that because I create an IV for each dataref this leads to
regressions with SVE codegen, for instance. NEON is able to use the post-index
addressing mode to increment each dr IV at access time, but SVE can't do this.
For this, I wonder whether we could try to be smart and create shared IVs:
rather than basing them on the actual vector ptr, use a sizetype IV that can
be shared among dr IVs with the same step. Or maybe this is something for
IVOPTs?
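At the source level the difference would look roughly like this (pa, pb, a, b,
off, VL, va and vb are illustrative names only, not what the vectorizer emits):

  /* One pointer IV per dataref, as the RFC creates today; each pointer
     needs its own increment when post-index addressing is unavailable:  */
  pa[0] = va;  pa += VL;
  pb[0] = vb;  pb += VL;

  /* versus a single shared sizetype IV for DRs with the same step:  */
  a[off] = va;
  b[off] = vb;
  off += VL;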
Certainly IVOPTs could decide to use the newly created IVs in the
scalar loops for the DRs therein as well. But since IVOPTs only
considers a single loop at a time it will probably not pay too
much attention and is only influenced by the out-of-loop uses of
the final values of the IVs.
My gut feeling tells me that whatever we do we'll have to look
into improving IVOPTs to consider multiple loops.
So I redid the IV-sharing and it's looking a lot simpler and neater; however,
it only shares IVs between vectorized loops and not the scalar prologue or
epilogues. I am not certain IVOPTs will be able to deal with these, as it has
no knowledge of the number of iterations of each different loop. Take for
instance a prologue loop peeled for alignment followed by the first main
vectorization loop. To be able to reuse the IVs from the prologue in the main
vectorization loop, IVOPTs would need to know that the initial start address
+ PEELING_NITERS == the base address for the main vectorization loop.
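In other words, for each dataref it would have to prove a relation along the
lines of (PEELING_NITERS and STEP being per-DR values; names illustrative):

  main_loop_dr_base == prologue_dr_start + PEELING_NITERS * STEP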
I'll start testing this approach for correctness if there are no major
concerns. Though I suspect we will only want to turn this into a patch once
the IVOPTs work is done to a point where it at least doesn't regress codegen
because of shared IVs; eventually we can then look at how to solve the sharing
between vectorized and scalar loops.
A small nitpick on my own RFC: I will probably move the 'skip_e' out of the
map, as we only need one per loop_vinfo and not one per DR. Initially I didn't
even have this skip_e in, but was creating a dummy PHI node and then replacing
it with the real thing later. Having the skip_e made the code simpler, though,
especially when inserting the 'init' stmt_list.
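Concretely, the intent would be something like this (an untested sketch against
the RFC's own structures):

  /* Per-DR state keeps only the IV's current SSA def ...  */
  struct iv_struct
  {
    tree def;
  };

  /* ... while the skip edge lives once in the loop_vec_info:  */
  hash_map<data_reference *, iv_struct> *dr_ivs;
  edge skip_e;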
Kind regards,
Andre
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index b317df532a9a92a619de9572378437d09c632ab0..e7d0f1e657b1a0c9bec75799817242e0bc1d8282 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4909,11 +4909,27 @@ vect_create_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
offset, byte_offset);
if (new_stmt_list)
{
- if (pe)
- {
- new_bb = gsi_insert_seq_on_edge_immediate (pe, new_stmt_list);
- gcc_assert (!new_bb);
- }
+ if (loop_vinfo)
+ {
+ iv_struct *iv = loop_vinfo->dr_ivs->get (dr);
+ if (iv && iv->skip_e)
+ {
+ /* The initial value of the IV we are inserting here will be used
+ in a PHI-node that will be used as the initial IV for the next
+ vectorized epilogue, so we have to make sure we insert this
+ NEW_STMT_LIST before the SKIP_E. */
+ gimple_stmt_iterator g = gsi_last_bb (iv->skip_e->src);
+ if (gimple_code (gsi_stmt (g)) == GIMPLE_COND)
+ gsi_insert_seq_before (&g, new_stmt_list, GSI_NEW_STMT);
+ else
+ gsi_insert_seq_after (&g, new_stmt_list, GSI_NEW_STMT);
+ }
+ else
+ {
+ new_bb = gsi_insert_seq_on_edge_immediate (pe, new_stmt_list);
+ gcc_assert (!new_bb);
+ }
+ }
else
gsi_insert_seq_before (gsi, new_stmt_list, GSI_SAME_STMT);
}
@@ -4946,12 +4962,55 @@ vect_create_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+ /* If this is an epilogue, make sure to use the previous loop's IV for
+ the same DR as the init of this DR's IV. */
+ if (loop_vinfo
+ && LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+ {
+ iv_struct *iv
+ = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)->dr_ivs->get (dr);
+ gcc_assert (iv);
+ aggr_ptr_init = iv->def;
+ }
+
create_iv (aggr_ptr_init,
fold_convert (aggr_ptr_type, iv_step),
aggr_ptr, loop, &incr_gsi, insert_after,
&indx_before_incr, &indx_after_incr);
incr = gsi_stmt (incr_gsi);
+ if (loop_vinfo)
+ {
+ iv_struct &iv = loop_vinfo->dr_ivs->get_or_insert (dr);
+ tree incr_lhs = gimple_get_lhs (incr);
+ if (iv.skip_e)
+ {
+ edge e;
+ edge_iterator ei;
+ tree val;
+ /* Create a skip PHI, use INCR_LHS as the value for the not-skip
+ path(s) and AGGR_PTR_INIT as the value for the skip path. */
+ iv.def = copy_ssa_name (incr_lhs);
+ gphi *skip_phi = create_phi_node (iv.def,
+ iv.skip_e->dest);
+
+ FOR_EACH_EDGE (e, ei, skip_phi->bb->preds)
+ {
+ if (e == iv.skip_e)
+ val = aggr_ptr_init;
+ else
+ val = incr_lhs;
+ add_phi_arg (skip_phi, val, e, UNKNOWN_LOCATION);
+ }
+ }
+ else
+ {
+ /* No skip PHI has been created, so we can use INCR_LHS as the
+ IV.DEF. */
+ iv.def = incr_lhs;
+ }
+ }
+
/* Copy the points-to information if it exists. */
if (DR_PTR_INFO (dr))
{
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 012f48bd4870125c820049b4fc70db0ef0759bdf..2475ab8b8263499f3471b64aa10b956523fe56df 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2554,8 +2554,7 @@ class loop *
vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
tree *niters_vector, tree *step_vector,
tree *niters_vector_mult_vf_var, int th,
- bool check_profitability, bool niters_no_overflow,
- tree *advance)
+ bool check_profitability, bool niters_no_overflow)
{
edge e, guard_e;
tree type = TREE_TYPE (niters), guard_cond;
@@ -3059,10 +3058,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
niters = PHI_RESULT (new_phi);
}
- /* Set ADVANCE to the number of iterations performed by the previous
- loop and its prologue. */
- *advance = niters;
-
if (!vect_epilogues_updated_niters)
{
/* Subtract the number of iterations performed by the vectorized loop
@@ -3090,6 +3085,21 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
adjust_vec.release ();
free_original_copy_tables ();
+ if (vect_epilogues && skip_e)
+ {
+ /* Keep track of the SKIP_E to later construct a phi-node per
+ * datareference. */
+ unsigned int i;
+ vec<data_reference_p> datarefs = LOOP_VINFO_DATAREFS (loop_vinfo);
+ struct data_reference *dr;
+
+ FOR_EACH_VEC_ELT (datarefs, i, dr)
+ {
+ iv_struct &iv = loop_vinfo->dr_ivs->get_or_insert (dr);
+ iv.skip_e = skip_e;
+ }
+ }
+
return vect_epilogues ? epilog : NULL;
}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index ba36348b835c25bc556da71a133f81f8a6fc3745..fc2468ebc001935d86fee9b06855a02832b444ad 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -901,6 +901,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
}
epilogue_vinfos.create (6);
+ dr_ivs = new hash_map<data_reference *, iv_struct> ();
}
/* Free all levels of rgroup CONTROLS. */
@@ -927,6 +928,7 @@ _loop_vec_info::~_loop_vec_info ()
delete ivexpr_map;
delete scan_map;
epilogue_vinfos.release ();
+ delete dr_ivs;
/* When we release an epiloge vinfo that we do not intend to use
avoid clearing AUX of the main loop which should continue to
@@ -9252,7 +9254,7 @@ find_in_mapping (tree t, void *context)
prologue of the main loop. */
static void
-update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
+update_epilogue_loop_vinfo (class loop *epilogue)
{
loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
auto_vec<gimple *> stmt_worklist;
@@ -9267,11 +9269,6 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
free (LOOP_VINFO_BBS (epilogue_vinfo));
LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
- /* Advance data_reference's with the number of iterations of the previous
- loop and its prologue. */
- vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);
-
-
/* The EPILOGUE loop is a copy of the original loop so they share the same
gimple UIDs. In this loop we update the loop_vec_info of the EPILOGUE to
point to the copied statements. We also create a mapping of all LHS' in
@@ -9489,8 +9486,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
&step_vector, &niters_vector_mult_vf, th,
- check_profitability, niters_no_overflow,
- &advance);
+ check_profitability, niters_no_overflow);
if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
&& LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
@@ -9818,7 +9814,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
if (epilogue)
{
- update_epilogue_loop_vinfo (epilogue, advance);
+ update_epilogue_loop_vinfo (epilogue);
epilogue->simduid = loop->simduid;
epilogue->force_vectorize = loop->force_vectorize;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 7dcb4cd0b46b03eef90705eed776d9c3dd797101..da7e2fe72a549634d3cb197e7136de995e20c2b0 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -545,6 +545,12 @@ typedef auto_vec<rgroup_controls> vec_loop_lens;
typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
+struct iv_struct
+{
+ tree def;
+ edge skip_e;
+};
+
/*-----------------------------------------------------------------*/
/* Info on vectorized loops. */
/*-----------------------------------------------------------------*/
@@ -756,6 +762,8 @@ public:
analysis. */
vec<_loop_vec_info *> epilogue_vinfos;
+ hash_map<data_reference *, iv_struct> *dr_ivs;
+
} *loop_vec_info;
/* Access Functions. */
@@ -1791,8 +1799,7 @@ class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
class loop *, edge);
class loop *vect_loop_versioning (loop_vec_info, gimple *);
extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
- tree *, tree *, tree *, int, bool, bool,
- tree *);
+ tree *, tree *, tree *, int, bool, bool);
extern void vect_prepare_for_masked_peels (loop_vec_info);
extern dump_user_location_t find_loop_location (class loop *);
extern bool vect_can_advance_ivs_p (loop_vec_info);