Re: Make try_unroll_loop_completely to use loop bounds recorded

Richard Biener Fri, 12 Oct 2012 01:09:41 -0700

On Thu, 11 Oct 2012, Jan Hubicka wrote:

> Hi,
> while looking into RTL loop peeling micopmilation I found that we now do a 
> lot of
> RTL loop peeling for loops of the form
> 
> int a[2];
> test(int c)
> {
>   int i;
>   for (i=0;i<c;i++)
>     a[i]=5;
> }
> this is because tree-ssa-loop-niter is able to prove that the loop iterates at
> most twice.  This is better to be done on gimple level because destroying the
> loop enables more propagation.
> 
> This patch started as simple attempt to make try_unroll_loop_completely
> to use max_loop_iterations_int.  I had to track several issues
> 
> 1) try_unroll_loop_completely handles only inner loops.  I am fine with not
>    peeling loops with subloops, but if the loop is known to iterate 0 times,
>    we should turn it into non-loop.
> 2) try_unroll_loop_completely now handles removal of the loop by making
>    exit edge conditional to be true and relying on cleanup_cfg to do
>    the job.  This does not work for all the cases we can handle
>    by max_loop_iterations_int.  When loop has multiple possible exits
>    we don't really know what exit will be taken (we know it will be one
>    of them).
> 
>    I added logic handling simple case where loop has single exit and
>    in generic case I unloop it by adding __builtin_unreachable on the
>    loopback path.  While this seems bit involved it works well.
> 3) The last iteration has parts that are provably unreachable and should
>    be optimized out later.  VRP however tends to report array bound warnings
>    that breaks -O3 bootstrap.
>    I sent separate fix for that.  Still I need to add 3 initializations
>    to avoid maybe used uninitialized warning to get -O3 bootstrap working
>    with this patch.
> 
> The patch got quite a lot of snowballing effect, so I decided to not include:
> 
> 1) the cost model is skewed for last iteration.  It is often just duplicated
>    loop header - i.e. all code dominated by the optmized out exit test should
>    not be accounted. This would make cunroll-4.c testcase bellow to be 
> unlooped
>    during cunrolli rather than during complette unroling.
> 2) profile updating is broken for any loop containing non-trivial control 
> flow.
>    We need to teach loop duplication code to update sanely the profile when
>    wont_exit is true and we need to update profile
> 3) we probably want to do less unrolling when result is not a single basic
>    block. Current limit of 400 isnsns seems bit too large; making one basic
>    block is important win. Making sequence of basic blocks with exits is
>    less so.


Nice - this was on my TODO list as well, btw ;)

>       * gcc.dg/tree-ssa/cunroll-1.c: New testcase.
>       * gcc.dg/tree-ssa/cunroll-2.c: New testcase.
>       * gcc.dg/tree-ssa/cunroll-3.c: New testcase.
>       * gcc.dg/tree-ssa/cunroll-4.c: New testcase.
>       * gcc.dg/tree-ssa/cunroll-5.c: New testcase.
> 
>       * cfgloopmanip.c (unloop): Export.
>       * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Estimate
>       also with unknown exit conditoinal.
>       (try_unroll_loop_completely): Use max_loop_iterations_int to unroll
>       also loops with low upper bound; handle unlooping of the last loop
>       even when exit conditional is not known; unloop loop that do not loop
>       even if they are not innermost.
>       (canonicalize_loop_induction_variables): Record niter bounds known;
>       try unrolling even if number of iterations is not known;
>       (canonicalize_induction_variables): Be ready for loops disappearing.
>       (tree_unroll_loops_completely): Likewise.
>       * cfgloop.h (unloop): Declare.
> 
>       * f95-lang.c (gfc_init_builtin_functions): Build __builtin_unreachable.

I wonder if other languages need similar adjustment?

+  /* Now destroy the loop.  First try to do so by cancelling the
+     patch from exit conditional if we identified good candidate.
+

you mean 'path' here, right?  Similar in other places.

+      /* Unloop destroys the latch edge.  */
+      unloop (loop, &irred_invalidated);
+      if (irred_invalidated)
+       mark_irreducible_loops ();

this makes the whole thing quadratic in the number of loops.
Please pass down a flag and handle this in the place we
update SSA form (thus, once per function / unroll iteration).

Or provide a more optimial implementation that only considers
parent loops of 'loop' (even that is excessive though, quadratic
in the loop depth).

-  FOR_EACH_LOOP (li, loop, 0)
+  for (fel_init (&li, &next, LI_ONLY_INNERMOST); next;)

FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)

but I'm sure that IV canonicalization should happen for all loops.
So, why do you need this change?  Esp. we should get rid of
single-iteration outer loops here, too.

-      FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
+      for (fel_init (&li, &next, 0); next;)

see above - FOR_EACH_LOOP (li, loop, 0)

Otherwise looks ok.  Please update and re-post.

Thanks,
Richard.


> Index: testsuite/gcc.dg/tree-ssa/cunroll-1.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/cunroll-1.c     (revision 0)
> +++ testsuite/gcc.dg/tree-ssa/cunroll-1.c     (revision 0)
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
> +int a[2];
> +test(int c)
> +{ 
> +  int i;
> +  for (i=0;i<c;i++)
> +    a[i]=5;
> +}
> +/* Array bounds says the loop will not roll much.  */
> +/* { dg-final { scan-tree-dump-not "Unrolled loop 1 completely (duplicated 1 
> times). Last iteration exit edge was proved true." "cunroll"} } */
> +/* { dg-final { cleanup-tree-dump "cunroll" } } */
> Index: testsuite/gcc.dg/tree-ssa/cunroll-2.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/cunroll-2.c     (revision 0)
> +++ testsuite/gcc.dg/tree-ssa/cunroll-2.c     (revision 0)
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
> +int a[2];
> +test(int c)
> +{ 
> +  int i;
> +  for (i=0;i<c;i++)
> +    {
> +      a[i]=5;
> +      if (test2())
> +     return;
> +    }
> +}
> +/* We are not able to get rid of the final conditional because the loop has 
> two exits.  */
> +/* { dg-final { scan-tree-dump-not "Unrolled loop 1 completely (duplicated 2 
> times)." "cunroll"} } */
> +/* { dg-final { cleanup-tree-dump "cunroll" } } */
> Index: testsuite/gcc.dg/tree-ssa/cunroll-3.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/cunroll-3.c     (revision 0)
> +++ testsuite/gcc.dg/tree-ssa/cunroll-3.c     (revision 0)
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
> +int a[1];
> +test(int c)
> +{ 
> +  int i;
> +  for (i=0;i<c;i++)
> +    {
> +      a[i]=5;
> +    }
> +}
> +/* If we start duplicating headers prior curoll, this loop will have 0 
> iterations.  */
> +
> +/* { dg-final { scan-tree-dump-not "Unrolled loop 1 completely (duplicated 1 
> times)." "cunrolli"} } */
> +/* { dg-final { cleanup-tree-dump "cunrolli" } } */
> Index: testsuite/gcc.dg/tree-ssa/cunroll-4.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/cunroll-4.c     (revision 0)
> +++ testsuite/gcc.dg/tree-ssa/cunroll-4.c     (revision 0)
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
> +int a[1];
> +test(int c)
> +{ 
> +  int i=0,j;
> +  for (i=0;i<c;i++)
> +    {
> +      for (j=0;j<c;j++)
> +     {
> +          a[i]=5;
> +       test2();
> +     }
> +    }
> +}
> +
> +/* We should do this as part of cunrolli, but our cost model do not take 
> into account early exit
> +   from the last iteration.  */
> +/* { dg-final { scan-tree-dump-not "Turned loop 1 to non-loop; it never 
> loops. Last iteration exit edge was proved true." "cunrolli"} } */
> +/* { dg-final { cleanup-tree-dump "cunroll" } } */
> Index: testsuite/gcc.dg/tree-ssa/cunroll-5.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/cunroll-5.c     (revision 0)
> +++ testsuite/gcc.dg/tree-ssa/cunroll-5.c     (revision 0)
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
> +int *a;
> +test(int c)
> +{ 
> +  int i;
> +  for (i=0;i<6;i++)
> +    a[i]=5;
> +}
> +/* Basic testcase for complette unrolling.  */
> +/* { dg-final { scan-tree-dump-not "Unrolled loop 1 completely (duplicated 5 
> times). Exit condition of peeled iterations was eliminated. Last iteration 
> exit edge was proved true." "cunroll"} } */
> +/* { dg-final { cleanup-tree-dump "cunroll" } } */
> Index: cfgloopmanip.c
> ===================================================================
> --- cfgloopmanip.c    (revision 192360)
> +++ cfgloopmanip.c    (working copy)
> @@ -37,7 +37,6 @@ static int find_path (edge, basic_block 
>  static void fix_loop_placements (struct loop *, bool *);
>  static bool fix_bb_placement (basic_block);
>  static void fix_bb_placements (basic_block, bool *);
> -static void unloop (struct loop *, bool *);
>  
>  /* Checks whether basic block BB is dominated by DATA.  */
>  static bool
> @@ -895,7 +894,7 @@ loopify (edge latch_edge, edge header_ed
>     If this may cause the information about irreducible regions to become
>     invalid, IRRED_INVALIDATED is set to true.  */
>  
> -static void
> +void
>  unloop (struct loop *loop, bool *irred_invalidated)
>  {
>    basic_block *body;
> Index: tree-ssa-loop-ivcanon.c
> ===================================================================
> --- tree-ssa-loop-ivcanon.c   (revision 192360)
> +++ tree-ssa-loop-ivcanon.c   (working copy)
> @@ -231,7 +231,7 @@ tree_estimate_loop_size (struct loop *lo
>         /* Look for reasons why we might optimize this stmt away. */
>  
>         /* Exit conditional.  */
> -       if (body[i] == exit->src && stmt == last_stmt (exit->src))
> +       if (exit && body[i] == exit->src && stmt == last_stmt (exit->src))
>           {
>             if (dump_file && (dump_flags & TDF_DETAILS))
>               fprintf (dump_file, "   Exit condition will be eliminated.\n");
> @@ -326,13 +326,43 @@ try_unroll_loop_completely (struct loop 
>    unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
>    gimple cond;
>    struct loop_size size;
> +  bool n_unroll_found = false;
> +  HOST_WIDE_INT maxiter;
> +  basic_block latch;
> +  edge latch_edge;
> +  location_t locus;
> +  int flags;
> +  bool irred_invalidated = false;
> +  gimple stmt;
> +  gimple_stmt_iterator gsi;
> +  edge other_edge = NULL, last_exit;
> +  int num = loop->num;
> +  bool last_iteration_updated = false;
> +
> +  /* See if we proved number of iterations to be low constant.  */
> +  if (host_integerp (niter, 1))
> +    {
> +      n_unroll = tree_low_cst (niter, 1);
> +      n_unroll_found = true;
> +    }
>  
> -  if (loop->inner)
> +  /* See if we can improve our estimate by using recorded loop bounds.  */
> +  maxiter = max_loop_iterations_int (loop);
> +  if (maxiter >= 0
> +      && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
> +    {
> +      n_unroll = maxiter;
> +      n_unroll_found = true;
> +    }
> +
> +  if (!n_unroll_found)
>      return false;
>  
> -  if (!host_integerp (niter, 1))
> +  /* We unroll only inner loops, because we do not consider it profitable
> +     otheriwse.  We still can cancel loopback edge of not rolling loop;
> +     this is always a good idea.  */
> +  if (loop->inner && n_unroll)
>      return false;
> -  n_unroll = tree_low_cst (niter, 1);
>  
>    max_unroll = PARAM_VALUE (PARAM_MAX_COMPLETELY_PEEL_TIMES);
>    if (n_unroll > max_unroll)
> @@ -340,6 +370,10 @@ try_unroll_loop_completely (struct loop 
>  
>    if (n_unroll)
>      {
> +      sbitmap wont_exit;
> +      edge e;
> +      unsigned i;
> +      VEC (edge, heap) *to_remove = NULL;
>        if (ul == UL_SINGLE_ITER)
>       return false;
>  
> @@ -372,14 +408,6 @@ try_unroll_loop_completely (struct loop 
>           fprintf (dump_file, "Not unrolling loop %d.\n", loop->num);
>         return false;
>       }
> -    }
> -
> -  if (n_unroll)
> -    {
> -      sbitmap wont_exit;
> -      edge e;
> -      unsigned i;
> -      VEC (edge, heap) *to_remove = NULL;
>  
>        initialize_original_copy_tables ();
>        wont_exit = sbitmap_alloc (n_unroll + 1);
> @@ -408,24 +436,143 @@ try_unroll_loop_completely (struct loop 
>        free_original_copy_tables ();
>      }
>  
> -  cond = last_stmt (exit->src);
> -  if (exit->flags & EDGE_TRUE_VALUE)
> -    gimple_cond_make_true (cond);
> -  else
> -    gimple_cond_make_false (cond);
> -  update_stmt (cond);
> +  /* After complette unrolling we still may get rid of the conditional
> +     on exit in the last copy even if we have no idea what it does.
> +     This is quite common case for loops of form
> +
> +     int a[5];
> +     for (i=0;i<b;i++)
> +       a[i]=0;
> +
> +     Here we prove the loop to iterate 5 times but we do not know
> +     it from induction variable.
> +
> +     We want to make the conditional of exit true, so we can only 
> +     consider exits that are not part of the inner (irreducible) loops
> +     so we know that the conditional is tested at most once per iteration.
> +
> +     Situation is more complicated withloops with multiple exists:
> +
> +     int a[5];
> +     for (i=0;i<b;i++)
> +       {
> +      if (blah)
> +        break;
> +      a[i]=0;
> +       }
> +
> +     Here we could cancel the conditional of if "(blah)". And:
> +
> +     int a[5];
> +     for (i=0;i<b;i++)
> +       {
> +      a[i]=0;
> +      if (blah)
> +        break;
> +       }
> + 
> +     Here we can cancel the last i<b test.
> +
> +     To handle these we need to track all statements containing code that
> +     can not execute on last iteration (in tree-ssa-loop-niter).
> +     Then we can use control dependnecy (not computed here) to cancel all
> +     the paths leading to them unless they are part of the inner loops.
> +     This however seems bit like an overkill so we handle only the
> +     simple case of single exit until interesting testcases appears.  */
> +  last_exit = exit;
> +  if (!last_exit)
> +    {
> +      last_exit = single_exit (loop);
> +      if (!last_exit || last_exit->src->loop_father != loop
> +       || !(last_exit->flags & (EDGE_TRUE_VALUE | EDGE_FALSE_VALUE)))
> +     last_exit = NULL;
> +      else 
> +     {
> +       if (last_exit->src->flags & BB_IRREDUCIBLE_LOOP)
> +         last_exit = NULL;
> +     }
> +    }
> +
> +  /* If loop has only one exit edge, we can remove the conditional from
> +     the last copy of the loop.  
> +     TODO: We should account this update into cost model.  */
> +  if (last_exit)
> +    {
> +      cond = last_stmt (last_exit->src);
> +      if (last_exit->flags & EDGE_TRUE_VALUE)
> +     gimple_cond_make_true (cond);
> +      else
> +     gimple_cond_make_false (cond);
> +      update_stmt (cond);
> +      other_edge = EDGE_SUCC (last_exit->src, 0);
> +      if (other_edge == last_exit)
> +     other_edge = EDGE_SUCC (last_exit->src, 1);
> +      last_iteration_updated = true;
> +    }
> +
> +  /* Now destroy the loop.  First try to do so by cancelling the
> +     patch from exit conditional if we identified good candidate.  
> +
> +     TODO: We should update the loop profile for the fact that
> +     the last iteration no longer executes.  */
> +  if (!other_edge || !remove_path (other_edge))
> +    {
> +      /* We did not manage to cancel the loop by removing the patch.
> +         The loop latch remains reachable even if it will never be reached
> +         at runtime.  We must redirect it to somewhere, so create basic
> +         block containg __builtin_unreachable call for this reason.  */
> +      latch = loop->latch;
> +      latch_edge = loop_latch_edge (loop);
> +      flags = latch_edge->flags;
> +      locus = latch_edge->goto_locus;
> +
> +      /* Unloop destroys the latch edge.  */
> +      unloop (loop, &irred_invalidated);
> +      if (irred_invalidated)
> +     mark_irreducible_loops ();
> +
> +      /* Create new basic block for the latch edge destination and wire
> +      it in.  */
> +      stmt = gimple_build_call (builtin_decl_implicit 
> (BUILT_IN_UNREACHABLE), 0);
> +      latch_edge = make_edge (latch, create_basic_block (NULL, NULL, latch), 
> flags);
> +      gsi = gsi_start_bb (latch_edge->dest);
> +      gsi_insert_after (&gsi, stmt, GSI_NEW_STMT);
> +      latch_edge->dest->loop_father = current_loops->tree_root;
> +      set_immediate_dominator (CDI_DOMINATORS, latch_edge->dest, 
> latch_edge->src);
> +      latch_edge->probability = 0;
> +      latch_edge->count = 0;
> +      latch_edge->flags |= flags;
> +      latch_edge->goto_locus = locus;
> +    }
>  
>    if (dump_file && (dump_flags & TDF_DETAILS))
> -    fprintf (dump_file, "Unrolled loop %d completely.\n", loop->num);
> +    {
> +      if (!n_unroll)
> +        fprintf (dump_file, "Turned loop %d to non-loop; it never 
> loops.%s\n", num,
> +              last_iteration_updated
> +              ? " Last iteration exit edge was proved true." : "");
> +      else
> +     {
> +          fprintf (dump_file, "Unrolled loop %d completely "
> +                "(duplicated %i times).%s%s\n", num, (int)n_unroll,
> +                exit
> +                ? " Exit condition of peeled iterations was eliminated." : 
> "",
> +                last_iteration_updated
> +                ? " Last iteration exit edge was proved true." : "");
> +     }
> +    }
>  
> -  return true;
> +  return true;
>  }
>  
>  /* Adds a canonical induction variable to LOOP if suitable.
>     CREATE_IV is true if we may create a new iv.  UL determines
>     which loops we are allowed to completely unroll.  If TRY_EVAL is true, we 
> try
>     to determine the number of iterations of a loop by direct evaluation.
> -   Returns true if cfg is changed.  */
> +   Returns true if cfg is changed.  
> +
> +   IRREDUCIBLE_LOOPS_FOUND is used to bookkeep if we discovered irreducible
> +   loops.  This is used in a special case of the exit condition analysis.  */
>  
>  static bool
>  canonicalize_loop_induction_variables (struct loop *loop,
> @@ -455,22 +602,34 @@ canonicalize_loop_induction_variables (s
>             || TREE_CODE (niter) != INTEGER_CST))
>       niter = find_loop_niter_by_eval (loop, &exit);
>  
> -      if (chrec_contains_undetermined (niter)
> -       || TREE_CODE (niter) != INTEGER_CST)
> -     return false;
> +      if (TREE_CODE (niter) != INTEGER_CST)
> +     exit = NULL;
>      }
>  
> -  if (dump_file && (dump_flags & TDF_DETAILS))
> +  /* We work exceptionally hard here to estimate the bound
> +     by find_loop_niter_by_eval.  Be sure to keep it for future.  */
> +  if (niter && TREE_CODE (niter) == INTEGER_CST)
> +    record_niter_bound (loop, tree_to_double_int (niter), false, true);
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS)
> +      && TREE_CODE (niter) == INTEGER_CST)
>      {
>        fprintf (dump_file, "Loop %d iterates ", loop->num);
>        print_generic_expr (dump_file, niter, TDF_SLIM);
>        fprintf (dump_file, " times.\n");
>      }
> +  if (dump_file && (dump_flags & TDF_DETAILS)
> +      && max_loop_iterations_int (loop) >= 0)
> +    {
> +      fprintf (dump_file, "Loop %d iterates at most %i times.\n", loop->num,
> +            (int)max_loop_iterations_int (loop));
> +    }
>  
>    if (try_unroll_loop_completely (loop, exit, niter, ul))
>      return true;
>  
> -  if (create_iv)
> +  if (create_iv
> +      && niter && !chrec_contains_undetermined (niter))
>      create_canonical_iv (loop, exit, niter);
>  
>    return false;
> @@ -483,11 +642,14 @@ unsigned int
>  canonicalize_induction_variables (void)
>  {
>    loop_iterator li;
> -  struct loop *loop;
> +  struct loop *loop, *next;
>    bool changed = false;
>  
> -  FOR_EACH_LOOP (li, loop, 0)
> +  for (fel_init (&li, &next, LI_ONLY_INNERMOST); next;)
>      {
> +      loop = next;
> +      fel_next (&li, &next);
> +      
>        changed |= canonicalize_loop_induction_variables (loop,
>                                                       true, UL_SINGLE_ITER,
>                                                       true);
> @@ -591,14 +753,18 @@ tree_unroll_loops_completely (bool may_i
>    bool changed;
>    enum unroll_level ul;
>    int iteration = 0;
> +  struct loop *next;
>  
>    do
>      {
>        changed = false;
>  
> -      FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
> +      for (fel_init (&li, &next, 0); next;)
>       {
> -       struct loop *loop_father = loop_outer (loop);
> +       struct loop *loop_father = loop_outer (next);
> +
> +       loop = next;
> +       fel_next (&li, &next);
>  
>         if (may_increase_size && optimize_loop_for_speed_p (loop)
>             /* Unroll outermost loops only if asked to do so or they do
> Index: fortran/f95-lang.c
> ===================================================================
> --- fortran/f95-lang.c        (revision 192360)
> +++ fortran/f95-lang.c        (working copy)
> @@ -908,6 +908,11 @@ gfc_init_builtin_functions (void)
>    gfc_define_builtin ("__builtin_expect", ftype, BUILT_IN_EXPECT,
>                     "__builtin_expect", ATTR_CONST_NOTHROW_LEAF_LIST);
>  
> +  ftype = build_function_type_list (void_type_node, NULL_TREE);
> +
> +  gfc_define_builtin ("__builtin_unreachable", ftype, BUILT_IN_EXPECT,
> +                   "__builtin_unreachable", ATTR_NORETURN_NOTHROW_LEAF_LIST);
> +
>    ftype = build_function_type_list (void_type_node,
>                                      pvoid_type_node, NULL_TREE);
>    gfc_define_builtin ("__builtin_free", ftype, BUILT_IN_FREE,
> Index: cfgloop.h
> ===================================================================
> --- cfgloop.h (revision 192360)
> +++ cfgloop.h (working copy)
> @@ -319,7 +319,8 @@ extern struct loop *loopify (edge, edge,
>  struct loop * loop_version (struct loop *, void *,
>                           basic_block *, unsigned, unsigned, unsigned, bool);
>  extern bool remove_path (edge);
> -void scale_loop_frequencies (struct loop *, int, int);
> +extern void scale_loop_frequencies (struct loop *, int, int);
> +extern void unloop (struct loop *, bool *);
>  
>  /* Induction variable analysis.  */
>  
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend

Re: Make try_unroll_loop_completely to use loop bounds recorded

Reply via email to