Re: Add a loop versioning pass

Richard Biener Sat, 15 Dec 2018 02:27:15 -0800

On December 14, 2018 5:18:38 PM GMT+01:00, Richard Sandiford 
<richard.sandif...@arm.com> wrote:
>Richard Biener <rguent...@suse.de> writes:
>> On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford
><richard.sandif...@arm.com> wrote:
>>>Richard Biener <richard.guent...@gmail.com> writes:
>>>> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford
>>>>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
>>>>> Also repeated the performance testing (but haven't yet tried an
>>>>> LTO variant; will do that over the weekend).
>>>>
>>>> Any results?
>>>
>>>Sorry, I should've remembered that finding time to run tests is easy,
>>>finding time to analyse them is hard.
>>>
>>>Speed-wise, the impact of the patch for LTO is similar to without,
>>>with 554.roms_r being the main beneficiary for both AArch64 and
>x86_64.
>>>I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
>>>-flto=jobserver.
>>>
>>>Size-wise, there are three tests that grow by >=2% on x86_64:
>>>
>>>549.fotonik3d_r: 5.5%
>>>548.exchange2_r: 29.5%
>>>554.roms_r: 39.6%
>>
>> Uh. With LTO we might have a reasonable guessed profile and you do
>have a optimize_loop_nest_for_speed guard on the transform? 
>
>Guard now added :-)  But unfortunately it doesn't make any significant
>difference.  548.exchange2_r goes from 29.5% to 27.7%, but the other
>two are the same as before.
>
>> How does compile time fare with the above benchmarks?
>
>For 554.roms_r it's +80%(!) with -flto=1, but I think that's par for
>the course given the increase in function sizes.


:(

>For 549.fotonik3d_r it's +5% with -flto=1.
>
>For 503.bwaves_r (as an example of a benchmark whose size doesn't
>change),
>the difference is in the noise.
>
>[...]
>
>>>You mean something like:
>>>
>>>  real :: foo(:,:), bar(:)
>>>
>>>  do i=1,n
>>>    do j=1,n
>>>      foo(i,j) = ...
>>>    end do
>>>    bar(i) = ..
>>>  end do
>>>
>>>?  I can add a test if so.
>>
>> Please. 
>
>OK, I've added them to loop_versioning_4.f90.
>
>>>> There may also be some subtle issues with substitute_and_fold being
>>>> applied to non-up-to-date SSA form given it folds stmts looking at
>>>> (single-use!) SSA edges.  The single-use guard might be what saves
>>>you
>>>> here (SSA uses in the copies are not yet updated to point to the
>>>> copied DEFs).
>>>
>>>OK.  I was hoping that because we only apply substitute_and_fold
>>>to new code, there would be no problem with uses elsewhere.
>>
>> Might be, yes.
>>
>>>Would it be safer to:
>>>
>>>  - version all loops we want to version
>>>  - update SSA explicitly
>>>  - apply substitute and fold to all "new" loops
>>
>> That would be definitely less fishy. But you need to get at the
>actual
>> 'new' SSA names for the replacements as I guess they'll be rewritten?
>> Or maybe those are not.
>>
>>>?  Could we then get away with returning a 0 TODO at the end?
>>
>> Yes. 
>
>OK, the updated patch does it this way.
>
>Tested as before.

OK. 

Thanks, 
Richard. 

>Thanks,
>Richard
>
>
>2018-12-14  Richard Sandiford  <richard.sandif...@arm.com>
>           Ramana Radhakrishnan  <ramana.radhakrish...@arm.com>
>           Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>gcc/
>       * doc/invoke.texi (-fversion-loops-for-strides): Document
>       (loop-versioning-group-size, loop-versioning-max-inner-insns)
>       (loop-versioning-max-outer-insns): Document new --params.
>       * Makefile.in (OBJS): Add gimple-loop-versioning.o.
>       * common.opt (fversion-loops-for-strides): New option.
>       * opts.c (default_options_table): Enable fversion-loops-for-strides
>       at -O3.
>       * params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE)
>       (PARAM_LOOP_VERSIONING_MAX_INNER_INSNS)
>       (PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters.
>       * passes.def: Add pass_loop_versioning.
>       * timevar.def (TV_LOOP_VERSIONING): New time variable.
>       * tree-ssa-propagate.h
>       (substitute_and_fold_engine::substitute_and_fold): Add an optional
>       block parameter.
>       * tree-ssa-propagate.c
>       (substitute_and_fold_engine::substitute_and_fold): Likewise.
>       When passed, only walk blocks dominated by that block.
>       * tree-vrp.h (range_includes_p): Declare.
>       (range_includes_zero_p): Turn into an inline wrapper around
>       range_includes_p.
>       * tree-vrp.c (range_includes_p): New function, generalizing...
>       (range_includes_zero_p): ...this.
>       * tree-pass.h (make_pass_loop_versioning): Declare.
>       * gimple-loop-versioning.cc: New file.
>
>gcc/testsuite/
>       * gcc.dg/loop-versioning-1.c: New test.
>       * gcc.dg/loop-versioning-10.c: Likewise.
>       * gcc.dg/loop-versioning-11.c: Likewise.
>       * gcc.dg/loop-versioning-2.c: Likewise.
>       * gcc.dg/loop-versioning-3.c: Likewise.
>       * gcc.dg/loop-versioning-4.c: Likewise.
>       * gcc.dg/loop-versioning-5.c: Likewise.
>       * gcc.dg/loop-versioning-6.c: Likewise.
>       * gcc.dg/loop-versioning-7.c: Likewise.
>       * gcc.dg/loop-versioning-8.c: Likewise.
>       * gcc.dg/loop-versioning-9.c: Likewise.
>       * gfortran.dg/loop_versioning_1.f90: Likewise.
>       * gfortran.dg/loop_versioning_2.f90: Likewise.
>       * gfortran.dg/loop_versioning_3.f90: Likewise.
>       * gfortran.dg/loop_versioning_4.f90: Likewise.
>       * gfortran.dg/loop_versioning_5.f90: Likewise.
>       * gfortran.dg/loop_versioning_6.f90: Likewise.
>       * gfortran.dg/loop_versioning_7.f90: Likewise.
>       * gfortran.dg/loop_versioning_8.f90: Likewise.
>
>Index: gcc/doc/invoke.texi
>===================================================================
>--- gcc/doc/invoke.texi        2018-12-11 15:49:19.061544092 +0000
>+++ gcc/doc/invoke.texi        2018-12-14 15:02:53.813883464 +0000
>@@ -8220,7 +8220,8 @@ by @option{-O2} and also turns on the fo
> -ftree-partial-pre @gol
> -ftree-slp-vectorize @gol
> -funswitch-loops @gol
>--fvect-cost-model}
>+-fvect-cost-model @gol
>+-fversion-loops-for-strides}
> 
> @item -O0
> @opindex O0
>@@ -10772,6 +10773,30 @@ of the loop on both branches (modified a
> 
> Enabled by @option{-fprofile-use} and @option{-fauto-profile}.
> 
>+@item -fversion-loops-for-strides
>+@opindex fversion-loops-for-strides
>+If a loop iterates over an array with a variable stride, create
>another
>+version of the loop that assumes the stride is always one.  For
>example:
>+
>+@smallexample
>+for (int i = 0; i < n; ++i)
>+  x[i * stride] = @dots{};
>+@end smallexample
>+
>+becomes:
>+
>+@smallexample
>+if (stride == 1)
>+  for (int i = 0; i < n; ++i)
>+    x[i] = @dots{};
>+else
>+  for (int i = 0; i < n; ++i)
>+    x[i * stride] = @dots{};
>+@end smallexample
>+
>+This is particularly useful for assumed-shape arrays in Fortran where
>+(for example) it allows better vectorization assuming contiguous
>accesses.
>+
> @item -ffunction-sections
> @itemx -fdata-sections
> @opindex ffunction-sections
>@@ -11981,6 +12006,15 @@ Hardware autoprefetcher scheduler model
> Number of lookahead cycles the model looks into; at '
> ' only enable instruction sorting heuristic.
> 
>+@item loop-versioning-max-inner-insns
>+The maximum number of instructions that an inner loop can have
>+before the loop versioning pass considers it too big to copy.
>+
>+@item loop-versioning-max-outer-insns
>+The maximum number of instructions that an outer loop can have
>+before the loop versioning pass considers it too big to copy,
>+discounting any instructions in inner loops that directly benefit
>+from versioning.
> 
> @end table
> @end table
>Index: gcc/Makefile.in
>===================================================================
>--- gcc/Makefile.in    2018-12-06 18:00:29.000000000 +0000
>+++ gcc/Makefile.in    2018-12-14 15:02:53.809883498 +0000
>@@ -1320,6 +1320,7 @@ OBJS = \
>       gimple-laddress.o \
>       gimple-loop-interchange.o \
>       gimple-loop-jam.o \
>+      gimple-loop-versioning.o \
>       gimple-low.o \
>       gimple-pretty-print.o \
>       gimple-ssa-backprop.o \
>Index: gcc/common.opt
>===================================================================
>--- gcc/common.opt     2018-12-06 18:00:29.000000000 +0000
>+++ gcc/common.opt     2018-12-14 15:02:53.809883498 +0000
>@@ -2775,6 +2775,10 @@ fsplit-loops
> Common Report Var(flag_split_loops) Optimization
> Perform loop splitting.
> 
>+fversion-loops-for-strides
>+Common Report Var(flag_version_loops_for_strides) Optimization
>+Version loops based on whether indices have a stride of one.
>+
> funwind-tables
> Common Report Var(flag_unwind_tables) Optimization
> Just generate unwind tables for exception handling.
>Index: gcc/opts.c
>===================================================================
>--- gcc/opts.c 2018-12-06 18:00:29.000000000 +0000
>+++ gcc/opts.c 2018-12-14 15:02:53.813883464 +0000
>@@ -556,6 +556,7 @@ static const struct default_options defa
>     { OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
>     { OPT_LEVELS_3_PLUS, OPT_funswitch_loops, NULL, 1 },
>{ OPT_LEVELS_3_PLUS, OPT_fvect_cost_model_, NULL,
>VECT_COST_MODEL_DYNAMIC },
>+    { OPT_LEVELS_3_PLUS, OPT_fversion_loops_for_strides, NULL, 1 },
> 
>     /* -Ofast adds optimizations to -O3.  */
>     { OPT_LEVELS_FAST, OPT_ffast_math, NULL, 1 },
>Index: gcc/params.def
>===================================================================
>--- gcc/params.def     2018-12-06 18:00:29.000000000 +0000
>+++ gcc/params.def     2018-12-14 15:02:53.817883430 +0000
>@@ -1365,6 +1365,19 @@ DEFPARAM(PARAM_LOGICAL_OP_NON_SHORT_CIRC
>        "True if a non-short-circuit operation is optimal.",
>        -1, -1, 1)
> 
>+DEFPARAM(PARAM_LOOP_VERSIONING_MAX_INNER_INSNS,
>+       "loop-versioning-max-inner-insns",
>+       "The maximum number of instructions in an inner loop that is being"
>+       " considered for versioning.",
>+       200, 0, 0)
>+
>+DEFPARAM(PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS,
>+       "loop-versioning-max-outer-insns",
>+       "The maximum number of instructions in an outer loop that is being"
>+       " considered for versioning, on top of the instructions in inner"
>+       " loops.",
>+       100, 0, 0)
>+
> /*
> 
> Local variables:
>Index: gcc/passes.def
>===================================================================
>--- gcc/passes.def     2018-12-06 18:00:29.000000000 +0000
>+++ gcc/passes.def     2018-12-14 15:02:53.817883430 +0000
>@@ -265,6 +265,7 @@ along with GCC; see the file COPYING3.
>         NEXT_PASS (pass_tree_unswitch);
>         NEXT_PASS (pass_scev_cprop);
>         NEXT_PASS (pass_loop_split);
>+        NEXT_PASS (pass_loop_versioning);
>         NEXT_PASS (pass_loop_jam);
>         /* All unswitching, final value replacement and splitting can expose
>            empty loops.  Remove them now.  */
>Index: gcc/timevar.def
>===================================================================
>--- gcc/timevar.def    2018-12-06 18:00:29.000000000 +0000
>+++ gcc/timevar.def    2018-12-14 15:02:53.821883396 +0000
>@@ -234,6 +234,7 @@ DEFTIMEVAR (TV_DSE1                  , "
> DEFTIMEVAR (TV_DSE2                  , "dead store elim2")
> DEFTIMEVAR (TV_LOOP                  , "loop analysis")
> DEFTIMEVAR (TV_LOOP_INIT           , "loop init")
>+DEFTIMEVAR (TV_LOOP_VERSIONING             , "loop versioning")
> DEFTIMEVAR (TV_LOOP_MOVE_INVARIANTS  , "loop invariant motion")
> DEFTIMEVAR (TV_LOOP_UNROLL           , "loop unrolling")
> DEFTIMEVAR (TV_LOOP_DOLOOP           , "loop doloop")
>Index: gcc/tree-ssa-propagate.h
>===================================================================
>--- gcc/tree-ssa-propagate.h   2018-12-06 18:00:29.000000000 +0000
>+++ gcc/tree-ssa-propagate.h   2018-12-14 15:02:53.821883396 +0000
>@@ -104,7 +104,7 @@ extern void propagate_tree_value_into_st
>   virtual bool fold_stmt (gimple_stmt_iterator *) { return false; }
>   virtual tree get_value (tree) { return NULL_TREE; }
> 
>-  bool substitute_and_fold (void);
>+  bool substitute_and_fold (basic_block = NULL);
>   bool replace_uses_in (gimple *);
>   bool replace_phi_args_in (gphi *);
> };
>Index: gcc/tree-ssa-propagate.c
>===================================================================
>--- gcc/tree-ssa-propagate.c   2018-12-06 18:00:29.000000000 +0000
>+++ gcc/tree-ssa-propagate.c   2018-12-14 15:02:53.821883396 +0000
>@@ -1154,6 +1154,10 @@ substitute_and_fold_dom_walker::before_d
> 
> 
> /* Perform final substitution and folding of propagated values.
>+   Process the whole function if BLOCK is null, otherwise only
>+   process the blocks that BLOCK dominates.  In the latter case,
>+   it is the caller's responsibility to ensure that dominator
>+   information is available and up-to-date.
> 
>    PROP_VALUE[I] contains the single value that should be substituted
>    at every use of SSA name N_I.  If PROP_VALUE is NULL, no values are
>@@ -1170,16 +1174,24 @@ substitute_and_fold_dom_walker::before_d
>    Return TRUE when something changed.  */
> 
> bool
>-substitute_and_fold_engine::substitute_and_fold (void)
>+substitute_and_fold_engine::substitute_and_fold (basic_block block)
> {
>   if (dump_file && (dump_flags & TDF_DETAILS))
>fprintf (dump_file, "\nSubstituting values and folding
>statements\n\n");
> 
>   memset (&prop_stats, 0, sizeof (prop_stats));
> 
>-  calculate_dominance_info (CDI_DOMINATORS);
>+  /* Don't call calculate_dominance_info when iterating over a
>subgraph.
>+     Callers that are using the interface this way are likely to want
>to
>+     iterate over several disjoint subgraphs, and it would be
>expensive
>+     in enable-checking builds to revalidate the whole dominance tree
>+     each time.  */
>+  if (block)
>+    gcc_assert (dom_info_state (CDI_DOMINATORS));
>+  else
>+    calculate_dominance_info (CDI_DOMINATORS);
>   substitute_and_fold_dom_walker walker (CDI_DOMINATORS, this);
>-  walker.walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));
>+  walker.walk (block ? block : ENTRY_BLOCK_PTR_FOR_FN (cfun));
> 
>   /* We cannot remove stmts during the BB walk, especially not release
>      SSA names there as that destroys the lattice of our callers.
>Index: gcc/tree-vrp.h
>===================================================================
>--- gcc/tree-vrp.h     2018-12-06 18:00:29.000000000 +0000
>+++ gcc/tree-vrp.h     2018-12-14 15:02:53.821883396 +0000
>@@ -243,7 +243,7 @@ struct assert_info
> extern void register_edge_assert_for (tree, edge, enum tree_code,
>                                     tree, tree, vec<assert_info> &);
> extern bool stmt_interesting_for_vrp (gimple *);
>-extern bool range_includes_zero_p (const value_range_base *);
>+extern bool range_includes_p (const value_range_base *,
>HOST_WIDE_INT);
> extern bool infer_value_range (gimple *, tree, tree_code *, tree *);
> 
> extern bool vrp_bitmap_equal_p (const_bitmap, const_bitmap);
>@@ -285,4 +285,12 @@ extern tree get_single_symbol (tree, boo
> extern void maybe_set_nonzero_bits (edge, tree);
>extern value_range_kind determine_value_range (tree, wide_int *,
>wide_int *);
> 
>+/* Return TRUE if *VR includes the value zero.  */
>+
>+inline bool
>+range_includes_zero_p (const value_range_base *vr)
>+{
>+  return range_includes_p (vr, 0);
>+}
>+
> #endif /* GCC_TREE_VRP_H */
>Index: gcc/tree-vrp.c
>===================================================================
>--- gcc/tree-vrp.c     2018-12-07 14:59:04.255508495 +0000
>+++ gcc/tree-vrp.c     2018-12-14 15:02:53.821883396 +0000
>@@ -1173,15 +1173,14 @@ value_inside_range (tree val, tree min,
> }
> 
> 
>-/* Return TRUE if *VR includes the value zero.  */
>+/* Return TRUE if *VR includes the value X.  */
> 
> bool
>-range_includes_zero_p (const value_range_base *vr)
>+range_includes_p (const value_range_base *vr, HOST_WIDE_INT x)
> {
>   if (vr->varying_p () || vr->undefined_p ())
>     return true;
>-  tree zero = build_int_cst (vr->type (), 0);
>-  return vr->may_contain_p (zero);
>+  return vr->may_contain_p (build_int_cst (vr->type (), x));
> }
> 
>/* If *VR has a value range that is a single constant value return
>that,
>Index: gcc/tree-pass.h
>===================================================================
>--- gcc/tree-pass.h    2018-12-06 18:00:29.000000000 +0000
>+++ gcc/tree-pass.h    2018-12-14 15:02:53.821883396 +0000
>@@ -362,6 +362,7 @@ extern gimple_opt_pass *make_pass_fix_lo
> extern gimple_opt_pass *make_pass_tree_loop (gcc::context *ctxt);
> extern gimple_opt_pass *make_pass_tree_no_loop (gcc::context *ctxt);
> extern gimple_opt_pass *make_pass_tree_loop_init (gcc::context *ctxt);
>+extern gimple_opt_pass *make_pass_loop_versioning (gcc::context
>*ctxt);
> extern gimple_opt_pass *make_pass_lim (gcc::context *ctxt);
> extern gimple_opt_pass *make_pass_linterchange (gcc::context *ctxt);
> extern gimple_opt_pass *make_pass_tree_unswitch (gcc::context *ctxt);
>Index: gcc/gimple-loop-versioning.cc
>===================================================================
>--- /dev/null  2018-11-29 13:15:04.463550658 +0000
>+++ gcc/gimple-loop-versioning.cc      2018-12-14 15:02:53.813883464 +0000
>@@ -0,0 +1,1758 @@
>+/* Loop versioning pass.
>+   Copyright (C) 2018 Free Software Foundation, Inc.
>+
>+This file is part of GCC.
>+
>+GCC is free software; you can redistribute it and/or modify it
>+under the terms of the GNU General Public License as published by the
>+Free Software Foundation; either version 3, or (at your option) any
>+later version.
>+
>+GCC is distributed in the hope that it will be useful, but WITHOUT
>+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
>+for more details.
>+
>+You should have received a copy of the GNU General Public License
>+along with GCC; see the file COPYING3.  If not see
>+<http://www.gnu.org/licenses/>.  */
>+
>+#include "config.h"
>+#include "system.h"
>+#include "coretypes.h"
>+#include "backend.h"
>+#include "tree.h"
>+#include "gimple.h"
>+#include "gimple-iterator.h"
>+#include "tree-pass.h"
>+#include "gimplify-me.h"
>+#include "cfgloop.h"
>+#include "tree-ssa-loop.h"
>+#include "ssa.h"
>+#include "tree-scalar-evolution.h"
>+#include "tree-chrec.h"
>+#include "tree-ssa-loop-ivopts.h"
>+#include "fold-const.h"
>+#include "tree-ssa-propagate.h"
>+#include "tree-inline.h"
>+#include "domwalk.h"
>+#include "alloc-pool.h"
>+#include "vr-values.h"
>+#include "gimple-ssa-evrp-analyze.h"
>+#include "tree-vectorizer.h"
>+#include "omp-general.h"
>+#include "predict.h"
>+#include "tree-into-ssa.h"
>+#include "params.h"
>+
>+namespace {
>+
>+/* This pass looks for loops that could be simplified if certain loop
>+   invariant conditions were true.  It is effectively a form of loop
>+   splitting in which the pass produces the split conditions itself,
>+   instead of using ones that are already present in the IL.
>+
>+   Versioning for when strides are 1
>+   ---------------------------------
>+
>+   At the moment the only thing the pass looks for are memory
>references
>+   like:
>+
>+     for (auto i : ...)
>+       ...x[i * stride]...
>+
>+   It considers changing such loops to:
>+
>+     if (stride == 1)
>+       for (auto i : ...)    [A]
>+       ...x[i]...
>+     else
>+       for (auto i : ...)    [B]
>+       ...x[i * stride]...
>+
>+   This can have several benefits:
>+
>+   (1) [A] is often easier or cheaper to vectorize than [B].
>+
>+   (2) The scalar code in [A] is simpler than the scalar code in [B]
>+       (if the loops cannot be vectorized or need an epilogue loop).
>+
>+   (3) We might recognize [A] as a pattern, such as a memcpy or
>memset.
>+
>+   (4) [A] has simpler address evolutions, which can help other passes
>+       like loop interchange.
>+
>+   The optimization is particularly useful for assumed-shape arrays in
>+   Fortran, where the stride of the innermost dimension depends on the
>+   array descriptor but is often equal to 1 in practice.  For example:
>+
>+     subroutine f1(x)
>+       real :: x(:)
>+       x(:) = 100
>+     end subroutine f1
>+
>+   generates the equivalent of:
>+
>+     raw_stride = *x.dim[0].stride;
>+     stride = raw_stride != 0 ? raw_stride : 1;
>+     x_base = *x.data;
>+     ...
>+     tmp1 = stride * S;
>+     tmp2 = tmp1 - stride;
>+     *x_base[tmp2] = 1.0e+2;
>+
>+   but in the common case that stride == 1, the last three statements
>+   simplify to:
>+
>+     tmp3 = S + -1;
>+     *x_base[tmp3] = 1.0e+2;
>+
>+   The optimization is in principle very simple.  The difficult parts
>are:
>+
>+   (a) deciding which parts of a general address calculation
>correspond
>+       to the inner dimension of an array, since this usually isn't
>explicit
>+       in the IL, and for C often isn't even explicit in the source
>code
>+
>+   (b) estimating when the transformation is worthwhile
>+
>+   Structure
>+   ---------
>+
>+   The pass has four phases:
>+
>+   (1) Walk through the statements looking for and recording potential
>+       versioning opportunities.  Stop if there are none.
>+
>+   (2) Use context-sensitive range information to see whether any
>versioning
>+       conditions are impossible in practice.  Remove them if so, and
>stop
>+       if no opportunities remain.
>+
>+       (We do this only after (1) to keep compile time down when no
>+       versioning opportunities exist.)
>+
>+   (3) Apply the cost model.  Decide which versioning opportunities
>are
>+       worthwhile and at which nesting level they should be applied.
>+
>+   (4) Attempt to version all the loops selected by (3), so that:
>+
>+       for (...)
>+         ...
>+
>+       becomes:
>+
>+       if (!cond)
>+         for (...) // Original loop
>+           ...
>+       else
>+         for (...) // New loop
>+           ...
>+
>+       Use the version condition COND to simplify the new loop.  */
>+
>+/* Enumerates the likelihood that a particular value indexes the inner
>+   dimension of an array.  */
>+enum inner_likelihood {
>+  INNER_UNLIKELY,
>+  INNER_DONT_KNOW,
>+  INNER_LIKELY
>+};
>+
>+/* Information about one term of an address_info.  */
>+struct address_term_info
>+{
>+  /* The value of the term is EXPR * MULTIPLIER.  */
>+  tree expr;
>+  unsigned HOST_WIDE_INT multiplier;
>+
>+  /* The stride applied by EXPR in each iteration of some unrecorded
>loop,
>+     or null if no stride has been identified.  */
>+  tree stride;
>+
>+  /* Enumerates the likelihood that EXPR indexes the inner dimension
>+     of an array.  */
>+  enum inner_likelihood inner_likelihood;
>+
>+  /* True if STRIDE == 1 is a versioning opportunity when considered
>+     in isolation.  */
>+  bool versioning_opportunity_p;
>+};
>+
>+/* Information about an address calculation, and the range of constant
>+   offsets applied to it.  */
>+struct address_info
>+{
>+  static const unsigned int MAX_TERMS = 8;
>+
>+  /* One statement that calculates the address.  If multiple
>statements
>+     share the same address, we only record the first.  */
>+  gimple *stmt;
>+
>+  /* The loop containing STMT (cached for convenience).  If multiple
>+     statements share the same address, they all belong to this loop. 
>*/
>+  struct loop *loop;
>+
>+  /* A decomposition of the calculation into a sum of terms plus an
>+     optional base.  When BASE is provided, it is never an SSA name.
>+     Once initialization is complete, all members of TERMs are SSA
>names.  */
>+  tree base;
>+  auto_vec<address_term_info, MAX_TERMS> terms;
>+
>+  /* All bytes accessed from the address fall in the offset range
>+     [MIN_OFFSET, MAX_OFFSET).  */
>+  HOST_WIDE_INT min_offset, max_offset;
>+};
>+
>+/* Stores addresses based on their base and terms (ignoring the
>offsets).  */
>+struct address_info_hasher : nofree_ptr_hash <address_info>
>+{
>+  static hashval_t hash (const address_info *);
>+  static bool equal (const address_info *, const address_info *);
>+};
>+
>+/* Information about the versioning we'd like to apply to a loop.  */
>+struct loop_info
>+{
>+  bool worth_versioning_p () const;
>+
>+  /* True if we've decided not to version this loop.  The remaining
>+     fields are meaningless if so.  */
>+  bool rejected_p;
>+
>+  /* True if at least one subloop of this loop benefits from
>versioning.  */
>+  bool subloops_benefit_p;
>+
>+  /* An estimate of the total number of instructions in the loop,
>+     excluding those in subloops that benefit from versioning.  */
>+  unsigned int num_insns;
>+
>+  /* The outermost loop that can handle all the version checks
>+     described below.  */
>+  struct loop *outermost;
>+
>+  /* The first entry in the list of blocks that belong to this loop
>+     (and not to subloops).  m_next_block_in_loop provides the chain
>+     pointers for the list.  */
>+  basic_block block_list;
>+
>+  /* We'd like to version the loop for the case in which these SSA
>names
>+     (keyed off their SSA_NAME_VERSION) are all equal to 1 at runtime.
> */
>+  bitmap_head unity_names;
>+
>+  /* If versioning succeeds, this points the version of the loop that
>+     assumes the version conditions holds.  */
>+  struct loop *optimized_loop;
>+};
>+
>+/* The main pass structure.  */
>+class loop_versioning
>+{
>+public:
>+  loop_versioning (function *);
>+  ~loop_versioning ();
>+  unsigned int run ();
>+
>+private:
>+  /* Used to walk the dominator tree to find loop versioning
>conditions
>+     that are always false.  */
>+  class lv_dom_walker : public dom_walker
>+  {
>+  public:
>+    lv_dom_walker (loop_versioning &);
>+
>+    edge before_dom_children (basic_block) FINAL OVERRIDE;
>+    void after_dom_children (basic_block) FINAL OVERRIDE;
>+
>+  private:
>+    /* The parent pass.  */
>+    loop_versioning &m_lv;
>+
>+    /* Used to build context-dependent range information.  */
>+    evrp_range_analyzer m_range_analyzer;
>+  };
>+
>+  /* Used to simplify statements based on conditions that are
>established
>+     by the version checks.  */
>+  class name_prop : public substitute_and_fold_engine
>+  {
>+  public:
>+    name_prop (loop_info &li) : m_li (li) {}
>+    tree get_value (tree) FINAL OVERRIDE;
>+
>+  private:
>+    /* Information about the versioning we've performed on the loop. 
>*/
>+    loop_info &m_li;
>+  };
>+
>+  loop_info &get_loop_info (struct loop *loop) { return
>m_loops[loop->num]; }
>+
>+  unsigned int max_insns_for_loop (struct loop *);
>+  bool expensive_stmt_p (gimple *);
>+
>+  void version_for_unity (gimple *, tree);
>+  bool acceptable_multiplier_p (tree, unsigned HOST_WIDE_INT,
>+                              unsigned HOST_WIDE_INT * = 0);
>+  bool acceptable_type_p (tree, unsigned HOST_WIDE_INT *);
>+  bool multiply_term_by (address_term_info &, tree);
>+  inner_likelihood get_inner_likelihood (tree, unsigned
>HOST_WIDE_INT);
>+  void analyze_stride (address_info &, address_term_info &,
>+                     tree, struct loop *);
>+  bool find_per_loop_multiplication (address_info &, address_term_info
>&);
>+  void analyze_term_using_scevs (address_info &, address_term_info &);
>+  void analyze_address_fragment (address_info &);
>+  void record_address_fragment (gimple *, unsigned HOST_WIDE_INT,
>+                              tree, unsigned HOST_WIDE_INT, HOST_WIDE_INT);
>+  void analyze_expr (gimple *, tree);
>+  bool analyze_block (basic_block);
>+  bool analyze_blocks ();
>+
>+  void prune_loop_conditions (struct loop *, vr_values *);
>+  bool prune_conditions ();
>+
>+  void merge_loop_info (struct loop *, struct loop *);
>+  void add_loop_to_queue (struct loop *);
>+  bool decide_whether_loop_is_versionable (struct loop *);
>+  bool make_versioning_decisions ();
>+
>+  bool version_loop (struct loop *);
>+  void implement_versioning_decisions ();
>+
>+  /* The function we're optimizing.  */
>+  function *m_fn;
>+
>+  /* The obstack to use for all pass-specific bitmaps.  */
>+  bitmap_obstack m_bitmap_obstack;
>+
>+  /* An obstack to use for general allocation.  */
>+  obstack m_obstack;
>+
>+  /* The number of loops in the function.  */
>+  unsigned int m_nloops;
>+
>+  /* The total number of loop version conditions we've found.  */
>+  unsigned int m_num_conditions;
>+
>+  /* Assume that an address fragment of the form i * stride * scale
>+     (for variable stride and constant scale) will not benefit from
>+     versioning for stride == 1 when scale is greater than this value.
> */
>+  unsigned HOST_WIDE_INT m_maximum_scale;
>+
>+  /* Information about each loop.  */
>+  auto_vec<loop_info> m_loops;
>+
>+  /* Used to form a linked list of blocks that belong to a loop,
>+     started by loop_info::block_list.  */
>+  auto_vec<basic_block> m_next_block_in_loop;
>+
>+  /* The list of loops that we've decided to version.  */
>+  auto_vec<struct loop *> m_loops_to_version;
>+
>+  /* A table of addresses in the current loop, keyed off their values
>+     but not their offsets.  */
>+  hash_table <address_info_hasher> m_address_table;
>+
>+  /* A list of all addresses in M_ADDRESS_TABLE, in a predictable
>order.  */
>+  auto_vec <address_info *, 32> m_address_list;
>+};
>+
>+/* If EXPR is an SSA name and not a default definition, return the
>+   defining statement, otherwise return null.  */
>+
>+static gimple *
>+maybe_get_stmt (tree expr)
>+{
>+  if (TREE_CODE (expr) == SSA_NAME && !SSA_NAME_IS_DEFAULT_DEF (expr))
>+    return SSA_NAME_DEF_STMT (expr);
>+  return NULL;
>+}
>+
>+/* Like maybe_get_stmt, but also return null if the defining
>+   statement isn't an assignment.  */
>+
>+static gassign *
>+maybe_get_assign (tree expr)
>+{
>+  return safe_dyn_cast <gassign *> (maybe_get_stmt (expr));
>+}
>+
>+/* Return true if this pass should look through a cast of expression
>FROM
>+   to type TYPE when analyzing pieces of an address.  */
>+
>+static bool
>+look_through_cast_p (tree type, tree from)
>+{
>+  return (INTEGRAL_TYPE_P (TREE_TYPE (from)) == INTEGRAL_TYPE_P (type)
>+        && POINTER_TYPE_P (TREE_TYPE (from)) == POINTER_TYPE_P (type));
>+}
>+
>+/* Strip all conversions of integers or pointers from EXPR, regardless
>+   of whether the conversions are nops.  This is useful in the context
>+   of this pass because we're not trying to fold or simulate the
>+   expression; we just want to see how it's structured.  */
>+
>+static tree
>+strip_casts (tree expr)
>+{
>+  const unsigned int MAX_NITERS = 4;
>+
>+  tree type = TREE_TYPE (expr);
>+  while (CONVERT_EXPR_P (expr)
>+       && look_through_cast_p (type, TREE_OPERAND (expr, 0)))
>+    expr = TREE_OPERAND (expr, 0);
>+
>+  for (unsigned int niters = 0; niters < MAX_NITERS; ++niters)
>+    {
>+      gassign *assign = maybe_get_assign (expr);
>+      if (assign
>+        && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign))
>+        && look_through_cast_p (type, gimple_assign_rhs1 (assign)))
>+      expr = gimple_assign_rhs1 (assign);
>+      else
>+      break;
>+    }
>+  return expr;
>+}
>+
>+/* Compare two address_term_infos in the same address_info.  */
>+
>+static int
>+compare_address_terms (const void *a_uncast, const void *b_uncast)
>+{
>+  const address_term_info *a = (const address_term_info *) a_uncast;
>+  const address_term_info *b = (const address_term_info *) b_uncast;
>+
>+  if (a->expr != b->expr)
>+    return SSA_NAME_VERSION (a->expr) < SSA_NAME_VERSION (b->expr) ?
>-1 : 1;
>+
>+  if (a->multiplier != b->multiplier)
>+    return a->multiplier < b->multiplier ? -1 : 1;
>+
>+  return 0;
>+}
>+
>+/* Dump ADDRESS using flags FLAGS.  */
>+
>+static void
>+dump_address_info (dump_flags_t flags, address_info &address)
>+{
>+  if (address.base)
>+    dump_printf (flags, "%T + ", address.base);
>+  for (unsigned int i = 0; i < address.terms.length (); ++i)
>+    {
>+      if (i != 0)
>+      dump_printf (flags, " + ");
>+      dump_printf (flags, "%T", address.terms[i].expr);
>+      if (address.terms[i].multiplier != 1)
>+      dump_printf (flags, " * %wd", address.terms[i].multiplier);
>+    }
>+  dump_printf (flags, " + [%wd, %wd]",
>+             address.min_offset, address.max_offset - 1);
>+}
>+
>+/* Hash an address_info based on its base and terms.  */
>+
>+hashval_t
>+address_info_hasher::hash (const address_info *info)
>+{
>+  inchash::hash hash;
>+  hash.add_int (info->base ? TREE_CODE (info->base) : 0);
>+  hash.add_int (info->terms.length ());
>+  for (unsigned int i = 0; i < info->terms.length (); ++i)
>+    {
>+      hash.add_int (SSA_NAME_VERSION (info->terms[i].expr));
>+      hash.add_hwi (info->terms[i].multiplier);
>+    }
>+  return hash.end ();
>+}
>+
>+/* Return true if two address_infos have equal bases and terms.  Other
>+   properties might be different (such as the statement or constant
>+   offset range).  */
>+
>+bool
>+address_info_hasher::equal (const address_info *a, const address_info
>*b)
>+{
>+  if (a->base != b->base
>+      && (!a->base || !b->base || !operand_equal_p (a->base, b->base,
>0)))
>+    return false;
>+
>+  if (a->terms.length () != b->terms.length ())
>+    return false;
>+
>+  for (unsigned int i = 0; i < a->terms.length (); ++i)
>+    if (a->terms[i].expr != b->terms[i].expr
>+      || a->terms[i].multiplier != b->terms[i].multiplier)
>+      return false;
>+
>+  return true;
>+}
>+
>+/* Return true if we want to version the loop, i.e. if we have a
>+   specific reason for doing so and no specific reason not to.  */
>+
>+bool
>+loop_info::worth_versioning_p () const
>+{
>+  return (!rejected_p
>+        && (!bitmap_empty_p (&unity_names) || subloops_benefit_p));
>+}
>+
>+loop_versioning::lv_dom_walker::lv_dom_walker (loop_versioning &lv)
>+  : dom_walker (CDI_DOMINATORS), m_lv (lv), m_range_analyzer (false)
>+{
>+}
>+
>+/* Process BB before processing the blocks it dominates.  */
>+
>+edge
>+loop_versioning::lv_dom_walker::before_dom

Re: Add a loop versioning pass

Reply via email to