Hi,

this patch modifies the loop invariant motion pass so that it can
operate on just a single requested loop and its sub-loops and ignore the
rest of the function, much like it currently ignores basic blocks that
are not in any real loop.  It then invokes the pass from within the loop
interchange pass whenever the latter successfully swaps two loops.  This
avoids the non-LTO -Ofast run-time regressions of 410.bwaves and
503.bwaves_r (which are 19% and 15% faster than current master on an AMD
Zen2 machine) while not introducing a full LIM pass into the pass
pipeline.
I have not modified the LIM data structures, which means that they still
contain vectors indexed by loop->num even though only a single loop nest
is actually processed.  I also did not replace the uses of
pre_and_rev_post_order_compute_fn with a function that would compute a
postorder only for a given loop.  I can of course do so if the approach
is otherwise deemed viable.

The patch adds one additional global variable, requested_loop, to the
pass, which then behaves differently at various places when it is set.
I considered storing the fake root loop into it for normal operation,
but since this loop often requires special handling anyway, I came to
the conclusion that the code would actually end up less straightforward.

I have bootstrapped and tested the patch on x86_64-linux and a very
similar one on aarch64-linux.  I have also tested it by modifying the
tree_ssa_lim function to run loop_invariant_motion_from_loop on each
real outermost loop in a function; this variant also passed bootstrap
and all tests, including dump scans, of all languages.  I have built the
entire SPEC 2006 FPrate suite while monitoring the activity of the LIM
pass without and with the patch (on top of commit b642fca1c31, with
which 526.blender_r and 538.imagick_r seemed to be failing), and it
examined only 0.2% more loops, 0.02% more BBs and an even smaller
percentage of statements, because it is invoked only in a rather special
circumstance.  But the patch allows for more such need-based uses at
hopefully reasonable cost.

Since I do not have much experience with loop optimizers, I expect that
there will be requests to adjust the patch during the review.  Still, it
fixes a performance regression against GCC 9, and so I hope to address
the concerns in time to get it into GCC 11.

Thanks,

Martin


gcc/ChangeLog:

2020-11-08  Martin Jambor  <mjam...@suse.cz>

	* gimple-loop-interchange.cc (pass_linterchange::execute): Call
	loop_invariant_motion_from_loop on affected loop nests.
	* tree-ssa-loop-im.c (requested_loop): New variable.
	(get_topmost_lim_loop): New function.
	(outermost_invariant_loop): Use it, cap discovered topmost loop
	at requested_loop.
	(determine_max_movement): Use get_topmost_lim_loop.
	(set_level): Assert that the selected loop is not outside of
	requested_loop.
	(compute_invariantness): Do not process loops outside of
	requested_loop, if non-NULL.
	(move_computations_worker): Likewise.
	(mark_ref_stored): Stop iteration at requested_loop, if
	non-NULL.
	(mark_ref_loaded): Likewise.
	(analyze_memory_references): If non-NULL, only process basic
	blocks and loops in requested_loop.  Compute contains_call
	bitmap.
	(do_store_motion): Only process requested_loop if non-NULL.
	(fill_always_executed_in): Likewise.  Also accept contains_call
	as a parameter rather than computing it.
	(tree_ssa_lim_initialize): New parameter which is stored into
	requested_loop.  Additional dumping.  Only initialize
	bb_loop_postorder for loops within requested_loop, if non-NULL.
	(tree_ssa_lim_finalize): Clear requested_loop, additional
	dumping.
	(loop_invariant_motion_from_loop): New function.
	(tree_ssa_lim): Move all functionality to
	loop_invariant_motion_from_loop, call it.
	* tree-ssa-loop-manip.h (loop_invariant_motion_from_loop):
	Declare.
---
 gcc/gimple-loop-interchange.cc |  30 +++++-
 gcc/tree-ssa-loop-im.c         | 176 ++++++++++++++++++++++++---------
 gcc/tree-ssa-loop-manip.h      |   2 +
 3 files changed, 156 insertions(+), 52 deletions(-)

diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
index 1656004ecf0..8c376228779 100644
--- a/gcc/gimple-loop-interchange.cc
+++ b/gcc/gimple-loop-interchange.cc
@@ -2068,6 +2068,7 @@ pass_linterchange::execute (function *fun)
     return 0;
 
   bool changed_p = false;
+  auto_vec<class loop *, 4> loops_to_lim;
   class loop *loop;
   FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
     {
@@ -2077,7 +2078,11 @@ pass_linterchange::execute (function *fun)
       if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
 	{
 	  tree_loop_interchange loop_interchange (loop_nest);
-	  changed_p |= loop_interchange.interchange (datarefs, ddrs);
+	  if (loop_interchange.interchange (datarefs, ddrs))
+	    {
+	      changed_p = true;
+	      loops_to_lim.safe_push (loop_nest[0]);
+	    }
 	}
       free_dependence_relations (ddrs);
       free_data_refs_with_aux (datarefs);
@@ -2085,8 +2090,27 @@
     }
 
   if (changed_p)
-    scev_reset ();
-  return changed_p ? (TODO_update_ssa_only_virtuals) : 0;
+    {
+      unsigned todo = TODO_update_ssa_only_virtuals;
+      unsigned len = loops_to_lim.length ();
+      for (unsigned i = len - 1; i > 0; i--)
+	for (int j = i - 1; j >= 0; j--)
+	  if (loops_to_lim[j] == loops_to_lim[i]
+	      || flow_loop_nested_p (loops_to_lim[j], loops_to_lim[i]))
+	    {
+	      loops_to_lim.unordered_remove (i);
+	      break;
+	    }
+
+      len = loops_to_lim.length ();
+      for (unsigned i = 0; i < len; i++)
+	todo |= loop_invariant_motion_from_loop (cfun, loops_to_lim[i]);
+
+      scev_reset ();
+      return todo;
+    }
+  else
+    return 0;
 }
 
 } // anon namespace
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 6bb07e133cd..24b541dfb17 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -244,6 +244,11 @@ static struct
 static bitmap_obstack lim_bitmap_obstack;
 static obstack mem_ref_obstack;
 
+/* If LIM has been requested only for a particular loop (and all of the
+   enclosed loops), this variable points to it; otherwise it points to the
+   dummy loop spanning the whole function.  */
+static class loop *requested_loop;
+
 static bool ref_indep_loop_p (class loop *, im_mem_ref *, dep_kind);
 static bool ref_always_accessed_p (class loop *, im_mem_ref *, bool);
 static bool refs_independent_p (im_mem_ref *, im_mem_ref *, bool = true);
@@ -410,6 +415,21 @@ movement_possibility (gimple *stmt)
   return ret;
 }
 
+/* Return the topmost real loop in which LOOP is nested if processing the
+   whole function, or requested_loop if LIM is restricted to a single loop.  */
+
+static class loop *
+get_topmost_lim_loop (class loop *loop)
+{
+  if (requested_loop)
+    {
+      gcc_assert (loop == requested_loop
+		  || flow_loop_nested_p (requested_loop, loop));
+      return requested_loop;
+    }
+  return superloop_at_depth (loop, 1);
+}
+
 /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
    loop to that we could move the expression using DEF if it did not have
    other operands, i.e. the outermost loop enclosing LOOP in that the value
@@ -424,18 +444,18 @@ outermost_invariant_loop (tree def, class loop *loop)
   struct lim_aux_data *lim_data;
 
   if (!def)
-    return superloop_at_depth (loop, 1);
+    return get_topmost_lim_loop (loop);
 
   if (TREE_CODE (def) != SSA_NAME)
     {
       gcc_assert (is_gimple_min_invariant (def));
-      return superloop_at_depth (loop, 1);
+      return get_topmost_lim_loop (loop);
     }
 
   def_stmt = SSA_NAME_DEF_STMT (def);
   def_bb = gimple_bb (def_stmt);
   if (!def_bb)
-    return superloop_at_depth (loop, 1);
+    return get_topmost_lim_loop (loop);
 
   max_loop = find_common_loop (loop, def_bb->loop_father);
 
@@ -443,6 +463,16 @@ outermost_invariant_loop (tree def, class loop *loop)
   if (lim_data != NULL && lim_data->max_loop != NULL)
     max_loop = find_common_loop (max_loop, loop_outer (lim_data->max_loop));
 
+  if (requested_loop)
+    {
+      class loop *req_outer = loop_outer (requested_loop);
+      if (flow_loop_nested_p (max_loop, req_outer))
+	max_loop = req_outer;
+      else
+	gcc_assert (max_loop == req_outer
+		    || flow_loop_nested_p (req_outer, max_loop));
+    }
+
   if (max_loop == loop)
     return NULL;
 
   max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
@@ -677,7 +707,7 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
   if (must_preserve_exec)
     level = ALWAYS_EXECUTED_IN (bb);
   else
-    level = superloop_at_depth (loop, 1);
+    level = get_topmost_lim_loop (loop);
   lim_data->max_loop = level;
 
   if (gphi *phi = dyn_cast <gphi *> (stmt))
@@ -813,6 +843,9 @@ set_level (gimple *stmt, class loop *orig_loop, class loop *level)
   gcc_assert (level == lim_data->max_loop
	      || flow_loop_nested_p (lim_data->max_loop, level));
 
+  gcc_assert (!requested_loop
+	      || requested_loop == lim_data->max_loop
+	      || flow_loop_nested_p (requested_loop, level));
   lim_data->tgt_loop = level;
   FOR_EACH_VEC_ELT (lim_data->depends, i, dep_stmt)
@@ -983,7 +1016,12 @@ compute_invariantness (basic_block bb)
   class loop *outermost = ALWAYS_EXECUTED_IN (bb);
   struct lim_aux_data *lim_data;
 
-  if (!loop_outer (bb->loop_father))
+  if (requested_loop)
+    {
+      if (!flow_bb_inside_loop_p (requested_loop, bb))
+	return;
+    }
+  else if (!loop_outer (bb->loop_father))
     return;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
@@ -1122,7 +1160,12 @@ move_computations_worker (basic_block bb)
   struct lim_aux_data *lim_data;
   unsigned int todo = 0;
 
-  if (!loop_outer (bb->loop_father))
+  if (requested_loop)
+    {
+      if (!flow_bb_inside_loop_p (requested_loop, bb))
+	return todo;
+    }
+  else if (!loop_outer (bb->loop_father))
     return todo;
 
   for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); )
@@ -1414,7 +1457,9 @@ set_ref_stored_in_loop (im_mem_ref *ref, class loop *loop)
 static void
 mark_ref_stored (im_mem_ref *ref, class loop *loop)
 {
-  while (loop != current_loops->tree_root
+  class loop *stop_loop = requested_loop
+    ? loop_outer (requested_loop) : current_loops->tree_root;
+  while (loop != stop_loop
	 && set_ref_stored_in_loop (ref, loop))
     loop = loop_outer (loop);
 }
@@ -1435,7 +1480,9 @@ set_ref_loaded_in_loop (im_mem_ref *ref, class loop *loop)
 static void
 mark_ref_loaded (im_mem_ref *ref, class loop *loop)
 {
-  while (loop != current_loops->tree_root
+  class loop *stop_loop = requested_loop
+    ? loop_outer (requested_loop) : current_loops->tree_root;
+  while (loop != stop_loop
	 && set_ref_loaded_in_loop (ref, loop))
     loop = loop_outer (loop);
 }
@@ -1619,24 +1666,35 @@ sort_locs_in_loop_postorder_cmp (const void *loc1_, const void *loc2_,
   return bb_loop_postorder[loop1->num] < bb_loop_postorder[loop2->num]
	 ? -1 : 1;
 }
 
-/* Gathers memory references in loops.  */
+/* Gathers memory references in loops.  Set a bit in CONTAINS_CALL
+   corresponding to index of each basic block which contains a non-pure call
+   statement.  */
 
 static void
-analyze_memory_references (void)
+analyze_memory_references (sbitmap contains_call)
 {
   gimple_stmt_iterator bsi;
   basic_block bb, *bbs;
   class loop *loop, *outer;
   unsigned i, n;
 
-  /* Collect all basic-blocks in loops and sort them after their
-     loops postorder.  */
+  /* Collect all basic-blocks in loops (either in the whole function or just in
+     the requested loop) and sort them after their loops postorder.  */
   i = 0;
-  bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
-  FOR_EACH_BB_FN (bb, cfun)
-    if (bb->loop_father != current_loops->tree_root)
-      bbs[i++] = bb;
-  n = i;
+  if (requested_loop)
+    {
+      bbs = get_loop_body (requested_loop);
+      n = requested_loop->num_nodes;
+    }
+  else
+    {
+      bbs = XNEWVEC (basic_block,
+		     n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
+      FOR_EACH_BB_FN (bb, cfun)
+	if (bb->loop_father != current_loops->tree_root)
+	  bbs[i++] = bb;
+      n = i;
+    }
 
   gcc_sort_r (bbs, n, sizeof (basic_block), sort_bbs_in_loop_postorder_cmp,
	      bb_loop_postorder);
@@ -1648,7 +1706,11 @@ analyze_memory_references (void)
     {
       basic_block bb = bbs[i];
       for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
-	gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
+	{
+	  gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
+	  if (nonpure_call_p (gsi_stmt (bsi)))
+	    bitmap_set_bit (contains_call, bb->index);
+	}
     }
 
   /* Verify the list of gathered memory references is sorted after their
@@ -1667,7 +1729,9 @@ analyze_memory_references (void)
 
   /* Propagate the information about accessed memory references up
     the loop hierarchy.  */
-  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+  class loop *stop_loop = requested_loop
+    ? loop_outer (requested_loop) : current_loops->tree_root;
+  FOR_EACH_ENCLOSED_LOOP (requested_loop, loop, LI_FROM_INNERMOST)
     {
       /* Finalize the overall touched references (including subloops).  */
       bitmap_ior_into (&memory_accesses.all_refs_stored_in_loop[loop->num],
@@ -1676,7 +1740,7 @@ analyze_memory_references (void)
       /* Propagate the information about accessed memory references up
	 the loop hierarchy.  */
       outer = loop_outer (loop);
-      if (outer == current_loops->tree_root)
+      if (outer == stop_loop)
	continue;
 
       bitmap_ior_into (&memory_accesses.all_refs_stored_in_loop[outer->num],
@@ -2895,8 +2959,11 @@ do_store_motion (void)
   class loop *loop;
   bitmap sm_executed = BITMAP_ALLOC (&lim_bitmap_obstack);
 
-  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
-    store_motion_loop (loop, sm_executed);
+  if (requested_loop)
+    store_motion_loop (requested_loop, sm_executed);
+  else
+    for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
+      store_motion_loop (loop, sm_executed);
 
   BITMAP_FREE (sm_executed);
 }
@@ -2982,27 +3049,14 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
    of its header implies execution of bb.  */
 
 static void
-fill_always_executed_in (void)
+fill_always_executed_in (sbitmap contains_call)
 {
-  basic_block bb;
   class loop *loop;
 
-  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
-  bitmap_clear (contains_call);
-  FOR_EACH_BB_FN (bb, cfun)
-    {
-      gimple_stmt_iterator gsi;
-      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
-	{
-	  if (nonpure_call_p (gsi_stmt (gsi)))
-	    break;
-	}
-
-      if (!gsi_end_p (gsi))
-	bitmap_set_bit (contains_call, bb->index);
-    }
-
-  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
+  if (requested_loop)
+    fill_always_executed_in_1 (requested_loop, contains_call);
+  else
+    for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
+      fill_always_executed_in_1 (loop, contains_call);
 }
 
@@ -3010,11 +3063,17 @@ fill_always_executed_in (void)
 /* Compute the global information needed by the loop invariant motion pass.  */
 
 static void
-tree_ssa_lim_initialize (void)
+tree_ssa_lim_initialize (class loop *req_loop)
 {
   class loop *loop;
   unsigned i;
 
+  gcc_assert (!req_loop || req_loop != current_loops->tree_root);
+  requested_loop = req_loop;
+
+  if (dump_file && (dump_flags & TDF_DETAILS) && requested_loop)
+    fprintf (dump_file, "Initializing LIM for loop %d.\n\n", req_loop->num);
+
   bitmap_obstack_initialize (&lim_bitmap_obstack);
   gcc_obstack_init (&mem_ref_obstack);
   lim_aux_data_map = new hash_map<gimple *, lim_aux_data *>;
@@ -3051,7 +3110,7 @@ tree_ssa_lim_initialize (void)
      its postorder index.  */
   i = 0;
   bb_loop_postorder = XNEWVEC (unsigned, number_of_loops (cfun));
-  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+  FOR_EACH_ENCLOSED_LOOP (req_loop, loop, LI_FROM_INNERMOST)
     bb_loop_postorder[loop->num] = i++;
 }
@@ -3086,23 +3145,32 @@ tree_ssa_lim_finalize (void)
     free_affine_expand_cache (&memory_accesses.ttae_cache);
 
   free (bb_loop_postorder);
+
+  requested_loop = NULL;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "LIM finished.\n\n");
 }
 
-/* Moves invariants from loops.  Only "expensive" invariants are moved out --
-   i.e. those that are likely to be win regardless of the register pressure.  */
+/* Moves invariants from LOOP in FUN.  If LOOP is NULL, perform the
+   transformation on all loops within FUN.  Only "expensive" invariants are
+   moved out -- i.e. those that are likely to be a win regardless of the
+   register pressure.  Return the pass TODO flags that need to be carried out
+   after the transformation.  */
 
-static unsigned int
-tree_ssa_lim (function *fun)
+unsigned int
+loop_invariant_motion_from_loop (function *fun, class loop *loop)
 {
   unsigned int todo = 0;
 
-  tree_ssa_lim_initialize ();
+  tree_ssa_lim_initialize (loop);
+
+  auto_sbitmap contains_call (last_basic_block_for_fn (fun));
+  bitmap_clear (contains_call);
 
   /* Gathers information about memory accesses in the loops.  */
-  analyze_memory_references ();
+  analyze_memory_references (contains_call);
 
   /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
-  fill_always_executed_in ();
+  fill_always_executed_in (contains_call);
 
   int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
   int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
@@ -3135,6 +3203,16 @@ tree_ssa_lim (function *fun)
   return todo;
 }
 
+/* Moves invariants from all loops in FUN.  Only "expensive" invariants are
+   moved out -- i.e. those that are likely to be a win regardless of the
+   register pressure.  */
+
+static unsigned int
+tree_ssa_lim (function *fun)
+{
+  return loop_invariant_motion_from_loop (fun, NULL);
+}
+
 /* Loop invariant motion pass.  */
 
 namespace {
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index e789e4fcb0b..c89a86aef4e 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -56,6 +56,8 @@ extern void tree_unroll_loop (class loop *, unsigned,
			      edge, class tree_niter_desc *);
 extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
 
+extern unsigned int loop_invariant_motion_from_loop (function *fun,
+						     class loop *loop);
 
 #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
-- 
2.29.2