Make try_unroll_loop_completely to use loop bounds recorded

Jan Hubicka Thu, 11 Oct 2012 09:21:26 -0700

Hi,
while looking into RTL loop peeling micopmilation I found that we now do a lot 
of
RTL loop peeling for loops of the form


int a[2];
test(int c)
{
  int i;
  for (i=0;i<c;i++)
    a[i]=5;
}
this is because tree-ssa-loop-niter is able to prove that the loop iterates at
most twice.  This is better to be done on gimple level because destroying the
loop enables more propagation.

This patch started as simple attempt to make try_unroll_loop_completely
to use max_loop_iterations_int.  I had to track several issues

1) try_unroll_loop_completely handles only inner loops.  I am fine with not
   peeling loops with subloops, but if the loop is known to iterate 0 times,
   we should turn it into non-loop.
2) try_unroll_loop_completely now handles removal of the loop by making
   exit edge conditional to be true and relying on cleanup_cfg to do
   the job.  This does not work for all the cases we can handle
   by max_loop_iterations_int.  When loop has multiple possible exits
   we don't really know what exit will be taken (we know it will be one
   of them).

   I added logic handling simple case where loop has single exit and
   in generic case I unloop it by adding __builtin_unreachable on the
   loopback path.  While this seems bit involved it works well.
3) The last iteration has parts that are provably unreachable and should
   be optimized out later.  VRP however tends to report array bound warnings
   that breaks -O3 bootstrap.
   I sent separate fix for that.  Still I need to add 3 initializations
   to avoid maybe used uninitialized warning to get -O3 bootstrap working
   with this patch.

The patch got quite a lot of snowballing effect, so I decided to not include:

1) the cost model is skewed for last iteration.  It is often just duplicated
   loop header - i.e. all code dominated by the optmized out exit test should
   not be accounted. This would make cunroll-4.c testcase bellow to be unlooped
   during cunrolli rather than during complette unroling.
2) profile updating is broken for any loop containing non-trivial control flow.
   We need to teach loop duplication code to update sanely the profile when
   wont_exit is true and we need to update profile
3) we probably want to do less unrolling when result is not a single basic
   block. Current limit of 400 isnsns seems bit too large; making one basic
   block is important win. Making sequence of basic blocks with exits is
   less so.

        * gcc.dg/tree-ssa/cunroll-1.c: New testcase.
        * gcc.dg/tree-ssa/cunroll-2.c: New testcase.
        * gcc.dg/tree-ssa/cunroll-3.c: New testcase.
        * gcc.dg/tree-ssa/cunroll-4.c: New testcase.
        * gcc.dg/tree-ssa/cunroll-5.c: New testcase.

        * cfgloopmanip.c (unloop): Export.
        * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Estimate
        also with unknown exit conditoinal.
        (try_unroll_loop_completely): Use max_loop_iterations_int to unroll
        also loops with low upper bound; handle unlooping of the last loop
        even when exit conditional is not known; unloop loop that do not loop
        even if they are not innermost.
        (canonicalize_loop_induction_variables): Record niter bounds known;
        try unrolling even if number of iterations is not known;
        (canonicalize_induction_variables): Be ready for loops disappearing.
        (tree_unroll_loops_completely): Likewise.
        * cfgloop.h (unloop): Declare.

        * f95-lang.c (gfc_init_builtin_functions): Build __builtin_unreachable.

Index: testsuite/gcc.dg/tree-ssa/cunroll-1.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/cunroll-1.c       (revision 0)
+++ testsuite/gcc.dg/tree-ssa/cunroll-1.c       (revision 0)
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
+int a[2];
+test(int c)
+{ 
+  int i;
+  for (i=0;i<c;i++)
+    a[i]=5;
+}
+/* Array bounds says the loop will not roll much.  */
+/* { dg-final { scan-tree-dump-not "Unrolled loop 1 completely (duplicated 1 
times). Last iteration exit edge was proved true." "cunroll"} } */
+/* { dg-final { cleanup-tree-dump "cunroll" } } */
Index: testsuite/gcc.dg/tree-ssa/cunroll-2.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/cunroll-2.c       (revision 0)
+++ testsuite/gcc.dg/tree-ssa/cunroll-2.c       (revision 0)
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
+int a[2];
+test(int c)
+{ 
+  int i;
+  for (i=0;i<c;i++)
+    {
+      a[i]=5;
+      if (test2())
+       return;
+    }
+}
+/* We are not able to get rid of the final conditional because the loop has 
two exits.  */
+/* { dg-final { scan-tree-dump-not "Unrolled loop 1 completely (duplicated 2 
times)." "cunroll"} } */
+/* { dg-final { cleanup-tree-dump "cunroll" } } */
Index: testsuite/gcc.dg/tree-ssa/cunroll-3.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/cunroll-3.c       (revision 0)
+++ testsuite/gcc.dg/tree-ssa/cunroll-3.c       (revision 0)
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
+int a[1];
+test(int c)
+{ 
+  int i;
+  for (i=0;i<c;i++)
+    {
+      a[i]=5;
+    }
+}
+/* If we start duplicating headers prior curoll, this loop will have 0 
iterations.  */
+
+/* { dg-final { scan-tree-dump-not "Unrolled loop 1 completely (duplicated 1 
times)." "cunrolli"} } */
+/* { dg-final { cleanup-tree-dump "cunrolli" } } */
Index: testsuite/gcc.dg/tree-ssa/cunroll-4.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/cunroll-4.c       (revision 0)
+++ testsuite/gcc.dg/tree-ssa/cunroll-4.c       (revision 0)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
+int a[1];
+test(int c)
+{ 
+  int i=0,j;
+  for (i=0;i<c;i++)
+    {
+      for (j=0;j<c;j++)
+       {
+          a[i]=5;
+         test2();
+       }
+    }
+}
+
+/* We should do this as part of cunrolli, but our cost model do not take into 
account early exit
+   from the last iteration.  */
+/* { dg-final { scan-tree-dump-not "Turned loop 1 to non-loop; it never loops. 
Last iteration exit edge was proved true." "cunrolli"} } */
+/* { dg-final { cleanup-tree-dump "cunroll" } } */
Index: testsuite/gcc.dg/tree-ssa/cunroll-5.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/cunroll-5.c       (revision 0)
+++ testsuite/gcc.dg/tree-ssa/cunroll-5.c       (revision 0)
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
+int *a;
+test(int c)
+{ 
+  int i;
+  for (i=0;i<6;i++)
+    a[i]=5;
+}
+/* Basic testcase for complette unrolling.  */
+/* { dg-final { scan-tree-dump-not "Unrolled loop 1 completely (duplicated 5 
times). Exit condition of peeled iterations was eliminated. Last iteration exit 
edge was proved true." "cunroll"} } */
+/* { dg-final { cleanup-tree-dump "cunroll" } } */
Index: cfgloopmanip.c
===================================================================
--- cfgloopmanip.c      (revision 192360)
+++ cfgloopmanip.c      (working copy)
@@ -37,7 +37,6 @@ static int find_path (edge, basic_block 
 static void fix_loop_placements (struct loop *, bool *);
 static bool fix_bb_placement (basic_block);
 static void fix_bb_placements (basic_block, bool *);
-static void unloop (struct loop *, bool *);
 
 /* Checks whether basic block BB is dominated by DATA.  */
 static bool
@@ -895,7 +894,7 @@ loopify (edge latch_edge, edge header_ed
    If this may cause the information about irreducible regions to become
    invalid, IRRED_INVALIDATED is set to true.  */
 
-static void
+void
 unloop (struct loop *loop, bool *irred_invalidated)
 {
   basic_block *body;
Index: tree-ssa-loop-ivcanon.c
===================================================================
--- tree-ssa-loop-ivcanon.c     (revision 192360)
+++ tree-ssa-loop-ivcanon.c     (working copy)
@@ -231,7 +231,7 @@ tree_estimate_loop_size (struct loop *lo
          /* Look for reasons why we might optimize this stmt away. */
 
          /* Exit conditional.  */
-         if (body[i] == exit->src && stmt == last_stmt (exit->src))
+         if (exit && body[i] == exit->src && stmt == last_stmt (exit->src))
            {
              if (dump_file && (dump_flags & TDF_DETAILS))
                fprintf (dump_file, "   Exit condition will be eliminated.\n");
@@ -326,13 +326,43 @@ try_unroll_loop_completely (struct loop 
   unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
   gimple cond;
   struct loop_size size;
+  bool n_unroll_found = false;
+  HOST_WIDE_INT maxiter;
+  basic_block latch;
+  edge latch_edge;
+  location_t locus;
+  int flags;
+  bool irred_invalidated = false;
+  gimple stmt;
+  gimple_stmt_iterator gsi;
+  edge other_edge = NULL, last_exit;
+  int num = loop->num;
+  bool last_iteration_updated = false;
+
+  /* See if we proved number of iterations to be low constant.  */
+  if (host_integerp (niter, 1))
+    {
+      n_unroll = tree_low_cst (niter, 1);
+      n_unroll_found = true;
+    }
 
-  if (loop->inner)
+  /* See if we can improve our estimate by using recorded loop bounds.  */
+  maxiter = max_loop_iterations_int (loop);
+  if (maxiter >= 0
+      && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
+    {
+      n_unroll = maxiter;
+      n_unroll_found = true;
+    }
+
+  if (!n_unroll_found)
     return false;
 
-  if (!host_integerp (niter, 1))
+  /* We unroll only inner loops, because we do not consider it profitable
+     otheriwse.  We still can cancel loopback edge of not rolling loop;
+     this is always a good idea.  */
+  if (loop->inner && n_unroll)
     return false;
-  n_unroll = tree_low_cst (niter, 1);
 
   max_unroll = PARAM_VALUE (PARAM_MAX_COMPLETELY_PEEL_TIMES);
   if (n_unroll > max_unroll)
@@ -340,6 +370,10 @@ try_unroll_loop_completely (struct loop 
 
   if (n_unroll)
     {
+      sbitmap wont_exit;
+      edge e;
+      unsigned i;
+      VEC (edge, heap) *to_remove = NULL;
       if (ul == UL_SINGLE_ITER)
        return false;
 
@@ -372,14 +408,6 @@ try_unroll_loop_completely (struct loop 
            fprintf (dump_file, "Not unrolling loop %d.\n", loop->num);
          return false;
        }
-    }
-
-  if (n_unroll)
-    {
-      sbitmap wont_exit;
-      edge e;
-      unsigned i;
-      VEC (edge, heap) *to_remove = NULL;
 
       initialize_original_copy_tables ();
       wont_exit = sbitmap_alloc (n_unroll + 1);
@@ -408,24 +436,143 @@ try_unroll_loop_completely (struct loop 
       free_original_copy_tables ();
     }
 
-  cond = last_stmt (exit->src);
-  if (exit->flags & EDGE_TRUE_VALUE)
-    gimple_cond_make_true (cond);
-  else
-    gimple_cond_make_false (cond);
-  update_stmt (cond);
+  /* After complette unrolling we still may get rid of the conditional
+     on exit in the last copy even if we have no idea what it does.
+     This is quite common case for loops of form
+
+     int a[5];
+     for (i=0;i<b;i++)
+       a[i]=0;
+
+     Here we prove the loop to iterate 5 times but we do not know
+     it from induction variable.
+
+     We want to make the conditional of exit true, so we can only 
+     consider exits that are not part of the inner (irreducible) loops
+     so we know that the conditional is tested at most once per iteration.
+
+     Situation is more complicated withloops with multiple exists:
+
+     int a[5];
+     for (i=0;i<b;i++)
+       {
+        if (blah)
+          break;
+        a[i]=0;
+       }
+
+     Here we could cancel the conditional of if "(blah)". And:
+
+     int a[5];
+     for (i=0;i<b;i++)
+       {
+        a[i]=0;
+        if (blah)
+          break;
+       }
+ 
+     Here we can cancel the last i<b test.
+
+     To handle these we need to track all statements containing code that
+     can not execute on last iteration (in tree-ssa-loop-niter).
+     Then we can use control dependnecy (not computed here) to cancel all
+     the paths leading to them unless they are part of the inner loops.
+     This however seems bit like an overkill so we handle only the
+     simple case of single exit until interesting testcases appears.  */
+  last_exit = exit;
+  if (!last_exit)
+    {
+      last_exit = single_exit (loop);
+      if (!last_exit || last_exit->src->loop_father != loop
+         || !(last_exit->flags & (EDGE_TRUE_VALUE | EDGE_FALSE_VALUE)))
+       last_exit = NULL;
+      else 
+       {
+         if (last_exit->src->flags & BB_IRREDUCIBLE_LOOP)
+           last_exit = NULL;
+       }
+    }
+
+  /* If loop has only one exit edge, we can remove the conditional from
+     the last copy of the loop.  
+     TODO: We should account this update into cost model.  */
+  if (last_exit)
+    {
+      cond = last_stmt (last_exit->src);
+      if (last_exit->flags & EDGE_TRUE_VALUE)
+       gimple_cond_make_true (cond);
+      else
+       gimple_cond_make_false (cond);
+      update_stmt (cond);
+      other_edge = EDGE_SUCC (last_exit->src, 0);
+      if (other_edge == last_exit)
+       other_edge = EDGE_SUCC (last_exit->src, 1);
+      last_iteration_updated = true;
+    }
+
+  /* Now destroy the loop.  First try to do so by cancelling the
+     patch from exit conditional if we identified good candidate.  
+
+     TODO: We should update the loop profile for the fact that
+     the last iteration no longer executes.  */
+  if (!other_edge || !remove_path (other_edge))
+    {
+      /* We did not manage to cancel the loop by removing the patch.
+         The loop latch remains reachable even if it will never be reached
+         at runtime.  We must redirect it to somewhere, so create basic
+         block containg __builtin_unreachable call for this reason.  */
+      latch = loop->latch;
+      latch_edge = loop_latch_edge (loop);
+      flags = latch_edge->flags;
+      locus = latch_edge->goto_locus;
+
+      /* Unloop destroys the latch edge.  */
+      unloop (loop, &irred_invalidated);
+      if (irred_invalidated)
+       mark_irreducible_loops ();
+
+      /* Create new basic block for the latch edge destination and wire
+        it in.  */
+      stmt = gimple_build_call (builtin_decl_implicit (BUILT_IN_UNREACHABLE), 
0);
+      latch_edge = make_edge (latch, create_basic_block (NULL, NULL, latch), 
flags);
+      gsi = gsi_start_bb (latch_edge->dest);
+      gsi_insert_after (&gsi, stmt, GSI_NEW_STMT);
+      latch_edge->dest->loop_father = current_loops->tree_root;
+      set_immediate_dominator (CDI_DOMINATORS, latch_edge->dest, 
latch_edge->src);
+      latch_edge->probability = 0;
+      latch_edge->count = 0;
+      latch_edge->flags |= flags;
+      latch_edge->goto_locus = locus;
+    }
 
   if (dump_file && (dump_flags & TDF_DETAILS))
-    fprintf (dump_file, "Unrolled loop %d completely.\n", loop->num);
+    {
+      if (!n_unroll)
+        fprintf (dump_file, "Turned loop %d to non-loop; it never loops.%s\n", 
num,
+                last_iteration_updated
+                ? " Last iteration exit edge was proved true." : "");
+      else
+       {
+          fprintf (dump_file, "Unrolled loop %d completely "
+                  "(duplicated %i times).%s%s\n", num, (int)n_unroll,
+                  exit
+                  ? " Exit condition of peeled iterations was eliminated." : 
"",
+                  last_iteration_updated
+                  ? " Last iteration exit edge was proved true." : "");
+       }
+    }
 
-  return true;
+  return true;
 }
 
 /* Adds a canonical induction variable to LOOP if suitable.
    CREATE_IV is true if we may create a new iv.  UL determines
    which loops we are allowed to completely unroll.  If TRY_EVAL is true, we 
try
    to determine the number of iterations of a loop by direct evaluation.
-   Returns true if cfg is changed.  */
+   Returns true if cfg is changed.  
+
+   IRREDUCIBLE_LOOPS_FOUND is used to bookkeep if we discovered irreducible
+   loops.  This is used in a special case of the exit condition analysis.  */
 
 static bool
 canonicalize_loop_induction_variables (struct loop *loop,
@@ -455,22 +602,34 @@ canonicalize_loop_induction_variables (s
              || TREE_CODE (niter) != INTEGER_CST))
        niter = find_loop_niter_by_eval (loop, &exit);
 
-      if (chrec_contains_undetermined (niter)
-         || TREE_CODE (niter) != INTEGER_CST)
-       return false;
+      if (TREE_CODE (niter) != INTEGER_CST)
+       exit = NULL;
     }
 
-  if (dump_file && (dump_flags & TDF_DETAILS))
+  /* We work exceptionally hard here to estimate the bound
+     by find_loop_niter_by_eval.  Be sure to keep it for future.  */
+  if (niter && TREE_CODE (niter) == INTEGER_CST)
+    record_niter_bound (loop, tree_to_double_int (niter), false, true);
+
+  if (dump_file && (dump_flags & TDF_DETAILS)
+      && TREE_CODE (niter) == INTEGER_CST)
     {
       fprintf (dump_file, "Loop %d iterates ", loop->num);
       print_generic_expr (dump_file, niter, TDF_SLIM);
       fprintf (dump_file, " times.\n");
     }
+  if (dump_file && (dump_flags & TDF_DETAILS)
+      && max_loop_iterations_int (loop) >= 0)
+    {
+      fprintf (dump_file, "Loop %d iterates at most %i times.\n", loop->num,
+              (int)max_loop_iterations_int (loop));
+    }
 
   if (try_unroll_loop_completely (loop, exit, niter, ul))
     return true;
 
-  if (create_iv)
+  if (create_iv
+      && niter && !chrec_contains_undetermined (niter))
     create_canonical_iv (loop, exit, niter);
 
   return false;
@@ -483,11 +642,14 @@ unsigned int
 canonicalize_induction_variables (void)
 {
   loop_iterator li;
-  struct loop *loop;
+  struct loop *loop, *next;
   bool changed = false;
 
-  FOR_EACH_LOOP (li, loop, 0)
+  for (fel_init (&li, &next, LI_ONLY_INNERMOST); next;)
     {
+      loop = next;
+      fel_next (&li, &next);
+      
       changed |= canonicalize_loop_induction_variables (loop,
                                                        true, UL_SINGLE_ITER,
                                                        true);
@@ -591,14 +753,18 @@ tree_unroll_loops_completely (bool may_i
   bool changed;
   enum unroll_level ul;
   int iteration = 0;
+  struct loop *next;
 
   do
     {
       changed = false;
 
-      FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
+      for (fel_init (&li, &next, 0); next;)
        {
-         struct loop *loop_father = loop_outer (loop);
+         struct loop *loop_father = loop_outer (next);
+
+         loop = next;
+         fel_next (&li, &next);
 
          if (may_increase_size && optimize_loop_for_speed_p (loop)
              /* Unroll outermost loops only if asked to do so or they do
Index: fortran/f95-lang.c
===================================================================
--- fortran/f95-lang.c  (revision 192360)
+++ fortran/f95-lang.c  (working copy)
@@ -908,6 +908,11 @@ gfc_init_builtin_functions (void)
   gfc_define_builtin ("__builtin_expect", ftype, BUILT_IN_EXPECT,
                      "__builtin_expect", ATTR_CONST_NOTHROW_LEAF_LIST);
 
+  ftype = build_function_type_list (void_type_node, NULL_TREE);
+
+  gfc_define_builtin ("__builtin_unreachable", ftype, BUILT_IN_EXPECT,
+                     "__builtin_unreachable", ATTR_NORETURN_NOTHROW_LEAF_LIST);
+
   ftype = build_function_type_list (void_type_node,
                                     pvoid_type_node, NULL_TREE);
   gfc_define_builtin ("__builtin_free", ftype, BUILT_IN_FREE,
Index: cfgloop.h
===================================================================
--- cfgloop.h   (revision 192360)
+++ cfgloop.h   (working copy)
@@ -319,7 +319,8 @@ extern struct loop *loopify (edge, edge,
 struct loop * loop_version (struct loop *, void *,
                            basic_block *, unsigned, unsigned, unsigned, bool);
 extern bool remove_path (edge);
-void scale_loop_frequencies (struct loop *, int, int);
+extern void scale_loop_frequencies (struct loop *, int, int);
+extern void unloop (struct loop *, bool *);
 
 /* Induction variable analysis.  */

Make try_unroll_loop_completely to use loop bounds recorded

Reply via email to