Re: [PATCH] Loop unswitching: support gswitch statements.

Martin Liška Tue, 23 Nov 2021 07:20:37 -0800

On 11/23/21 14:58, Richard Biener wrote:

On Mon, Nov 22, 2021 at 4:07 PM Martin Liška <mli...@suse.cz> wrote:


On 11/19/21 11:00, Richard Biener wrote:

On Tue, Nov 16, 2021 at 3:40 PM Martin Liška <mli...@suse.cz> wrote:


On 11/11/21 08:15, Richard Biener wrote:

So I'd try to do no functional change first, improving the costing and
setting up the transform to simply pick up the stmts to "fold" as discovered
during analysis (as I hinted you possibly can use gimple_uid to mark
the stmts that simplify, IIRC gimple_uid is preserved during copying.
gimple_uid would also scale better than gimple_plf in case we do
the analysis for all candidates at once).


Thinking about the analysis. Am I correct that we want to properly calculate
loop size for true and false edge of a potential gcond before the actually 
unswitching?


Yes.

We can do that by finding a first gcond candidate, evaluate (symbolic + irange 
approache)
all other gcond in the loop body and use BB_REACHABLE discovery. Similarly to 
what we do now
at lines 378-446. Then tree_num_loop_insns can be adjusted for only these 
reachable blocks.
Having that, we can calculate # of insns that will live in true/false loops.


So whatever we do here we should record as "this control stmt folds to
{true,false}" (or {true,unknown},
or in future, "this control stmt will lead to edge {e,unknown}"),
recording the simplification
on the true/false loop version in a way we can apply it after the transform.

Then we can call tree_unswitch_loop and make the gcond folding as we do in the 
versioned loops.

Is it a step in good direction? Having that we can then extend it to gswitch 
statements.


One issue I see is that BB_REACHABLE is there only once but you could use
auto_bb_flag reachable_true, reachable_false to distinguish the
true/false loop version
copies.

So yes, I think that sounds reasonable.  At the point we want to
evaluate different
(first) unswitching opportunities against each other storing this only
as BB flag is
likely to hit limits.  When we want to evaluate multiple levels of
unswitching before
doing any transforms even more so (if there are 3 opportunities there'd be
many cases to be considered when going to level 3 ;)).  I _think_ that a sparse
lattice of stmt UID -> edge might do the trick if we change tree_num_loop_insns
do to a DFS walk from the loop header, ignoring not taken edges by
consulting the
lattice.  Or, for speed reason, pre-compute tree_num_loop_insns for each BB
so we just have to sum a different set of BBs rather than walking all
stmts again.

That said, the second step would definitely be to choose the "best" opportunity
on the current level.

Richard.

Cheers,
Martin


Hello.

I'm sending a new version where I changed:
1) all unswitch_predicates are find for a loop
2) context sensitive costing happens based on an unswitch_predicate and BB 
reachability
     is implemented
3) folding happens in recursive invocation once we decide to unswitch
4) the patch folds both symbolic gcond predicates and irange provided by ranger
5) debug counter was added

Patch can bootstrap on x86_64-linux-gnu and survives regression tests. Plus, I 
tested it
on SPEC2006 and SPEC2017 with -size=ref.


Meh, diff made a mess out if this ;)  Random comments, I'm walking
myself the optimizations
flow.


Sure.


tree_unswitch_single_loop:

+  unswitch_predicate *predicate = NULL;
+  if (num > param_max_unswitch_level)
+    {
+      if (dump_file
+         && (dump_flags & TDF_DETAILS))
+       fprintf (dump_file, ";; Not unswitching anymore, hit max level\n");
+      goto exit;
+    }

this looks like we can do this check before find_all_unswitching_predicates?


Makes sense.


+  for (auto pred: candidates)
+    {
+      unsigned cost
+       = evaluate_loop_insns_for_predicate (loop, bbs, ranger, pred);
...

so this searches for the first candidate that fits in
param_max_unswitch_insns, it doesn't
yet try to find the cheapest / best one.  Please add a comment to say
that.  After we
found one candidate we apply unswitching to such one candidate (and throw the
others away).  I guess that's OK - it's what the old code did - what
you did for this
intermediate step is actually gather all unswitching predicates
upfront.  Hopefully
we'll be able to share some of the work done for the recursive invocations.

+         fprintf (dump_file, ";; Unswitching loop with condition: ");

"on condition"

+         fprintf (dump_file, ";; Not unswitching condition, loop too big "
+                  "(%d insns): ", cost);

"cost too big"?  I assume 'cost' is the number of stmts we'll add,
loop-size - true-eliminated - false-eliminated?


I'm going to adjust this.


+exit:
+  for (auto predicate: candidates)
+    delete predicate;

Some refactoring should get rid of the goto ...


You don't like it? It seems to me quite logical as one does not have to repeat
a clean up code before each return statement.


+static unsigned
+evaluate_insns (class loop *loop,  basic_block *bbs,
+               unswitch_predicate *candidate, bool true_edge,
+               gimple_ranger *ranger, auto_bb_flag &reachable_flag)
+{
...
+                 unswitch_predicate *predicate
+                   = tree_may_unswitch_on (bb, loop, ranger);
+                 if (predicate != NULL)
+                   {
+                     tree folded
+                       = simplify_using_entry_checks (predicate, candidate,
+                                                      true_edge);

so just looking at the implementation it seems that if you assign UIDs
to all last_stmt() in a loop you can keep tree_may_unswitch_on ()
in a dense array and you do not have to re-compute it?  UIDs should
survive copying so even for recursive unswitchings the predicates should
be shareable (pointing to the wrong stmt then, but then just do not record
the stmt in there - it should be available from the caller somehow).


I like the idea, lemme implement it.


But then I wonder how this translates to gswitch handling where the
result is a taken edge / a set of known not taken edges.  I suggest
to try wiring in gcond support in a separate commit on top of this to


s/gcond/gswitch, right?

see whether some refactoring is needed.


Sure, so for e.g. case 1 ... 5 we would need to create a new unswitch_predicate
with 1 <= index && index <= 5 tree predicate (and the corresponding irange 
range).
Later once we unswitch on it, we should use a special unreachable_flag that will
be used for marking of dead edges (similarly how we fold gconds to 
boolean_{false/true}_node.
Does it make sense?


Looking at evaluate_insns it also seems that 'reachable_flag'
is unused - computing the sizes could be done once you process


It is used:

          if (dest->loop_father == loop
              && !(dest->flags & reachable_flag)
              && !(e->flags & flags))
            {
              worklist.safe_push (dest);
              dest->flags |= reachable_flag;
            }

Where's problem?

a block, no?  Previously I suggested to compute estimate_num_insns
for each BB of the original loop once so you can re-use that and
just need to sum up BB costs.


Yep, makes sense.



+static bool
+find_all_unswitching_predicates (class loop *loop, basic_block *bbs,
+                                bool true_edge,
+                                unswitch_predicate *parent_predicate,
+                                gimple_ranger *ranger,
+                                vec<unswitch_predicate *> &candidates)
+{
...

so you re-do the simplification step here, in fact this function seems
to not only find unswitching predicates but apply simplifications
from the previous unswitching step.  (yes, that's what the old code did ...)
The costing step already did all the work though, it should be
possible to keep a vector of stmts to simplify and the simplification
from that analysis time.


Makes sense.


You can get at the "other loop copy" stmt via

   last_stmt (get_bb_copy (gimple_bb (cond)))

if you do that before free_original_copy_tables.  I think it would be
nice to have the simplification step explicit before recursing
even if that means some duplicate walking for now
(I do hope we can share the find_all_unswitching_predicates
analysis step)


Nods.


You are applying param_max_unswitch_insns with

+      unsigned cost
+       = evaluate_loop_insns_for_predicate (loop, bbs, ranger, pred);

-      cond = simplify_using_entry_checks (loop, cond);
-      stmt = last_stmt (bbs[i]);
-      if (integer_nonzerop (cond))
+      if (cost <= (unsigned)param_max_unswitch_insns)

where cost is the sum of the simplified sizes of both loop copies after
unswitching.  By its very nature this will match for all recursive invocations
(loops only will get smaller) which is why we need param_max_unswitch_level.
The idea was to have one param_max_unswitch_insns budget for each
_original_ loop the recursive invocations will eat from, removing the need
for param_max_unswitch_level.  I'd re-interpret param_max_unswitch_insns
as the number of insns we may add, so with cost calculated as above we'd
have

     if (cost <= budget)
...

and

     budget -= cost;

before the recursion (and pass by reference so that the two recursions
share the budget).

Initialize by

   budget = original_loop_size + param_max_unswitch_insns;


Ah, ok, lemme change that.


Hope this wasn't too many comments ;)


It was, but they are all very useful.

Thanks,
Martin


Thanks,
Richard.

Thoughts?
Thanks,
Martin

Re: [PATCH] Loop unswitching: support gswitch statements.

Reply via email to