[PATCH] Add vector cost model density heuristic

William J. Schmidt Fri, 08 Jun 2012 08:24:55 -0700

This patch adds a heuristic to the vectorizer when estimating the
minimum profitable number of iterations.  The heuristic is
target-dependent, and is currently disabled for all targets except
PowerPC.  However, the intent is to make it general enough to be useful
for other targets that want to opt in.


A previous patch addressed some PowerPC SPEC degradations by modifying
the vector cost model values for vec_perm and vec_promote_demote.  The
values were set a little higher than their natural values because the
natural values were not sufficient to prevent a poor vectorization
choice.  However, this is not the right long-term solution, since it can
unnecessarily constrain other vectorization choices involving permute
instructions.

Analysis of the badly vectorized loop (in sphinx3) showed that the
problem was overcommitment of vector resources -- too many vector
instructions issued without enough non-vector instructions available to
cover the delays.  The vector cost model assumes that instructions
always have a constant cost, and doesn't have a way of judging this kind
of "density" of vector instructions.
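
As an illustration of what "density" means here, the following is a minimal
sketch and not code from the patch.  The type `stmt_sketch` and the function
`density_pct_sketch` are hypothetical stand-ins for the vectorizer's
per-statement data.  The sketch follows the patch's assumption that each
non-vectorized statement costs one unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a loop statement; this is not a GCC type.  */
struct stmt_sketch
{
  int vectorized;   /* Nonzero if the statement will be vectorized.  */
  int inside_cost;  /* Modeled inside-loop cost when vectorized.  */
};

/* Return the percentage of inside-loop cost attributable to vectorized
   statements.  Non-vectorized statements are assumed to cost one unit
   each.  */
int
density_pct_sketch (const struct stmt_sketch *stmts, size_t n)
{
  int rel_cost = 0, irrel_cost = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      if (stmts[i].vectorized)
        rel_cost += stmts[i].inside_cost;
      else
        irrel_cost++;
    }

  /* An empty loop body has no meaningful density.  */
  assert (rel_cost + irrel_cost > 0);
  return (rel_cost * 100) / (rel_cost + irrel_cost);
}
```

A loop with three vectorized statements of cost 3, 4 and 2 and one scalar
statement has a density of 90% under this model.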

The present patch adds a heuristic to recognize when a loop is likely to
overcommit resources, and adds a small penalty to the inside-loop cost
to account for the expected stalls.  The heuristic is parameterized with
three target-specific values:

 * Density threshold: The heuristic will apply only when the
   percentage of inside-loop cost attributable to vectorized
   instructions exceeds this value.

 * Size threshold: The heuristic will apply only when the
   inside-loop cost exceeds this value.

 * Penalty: The inside-loop cost will be increased by this
   percentage value when the heuristic applies.

Thus only reasonably large loop bodies that are mostly vectorized
instructions will be affected.
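
The way the three parameters interact can be sketched as follows.  This is a
simplified illustration, not the patch itself: the struct `density_params`
and the function `apply_density_penalty` are hypothetical names, and the
relevant and irrelevant costs are taken as inputs rather than gathered from
the loop body.  The values 85, 70 and 10 used in the usage note are the
PowerPC settings from the patch below.

```c
#include <assert.h>

/* Hypothetical bundle of the three target-specific values.  */
struct density_params
{
  int pct_threshold;   /* Density threshold, in percent.  */
  int size_threshold;  /* Minimum total inside-loop cost.  */
  int penalty_pct;     /* Penalty, in percent.  */
};

/* Return VEC_INSIDE_COST, increased by the penalty if the loop is both
   dense enough and large enough.  REL_COST is the cost of vectorized
   statements; IRREL_COST is the cost of the remaining statements.  */
int
apply_density_penalty (int vec_inside_cost, int rel_cost, int irrel_cost,
                       const struct density_params *p)
{
  int total = rel_cost + irrel_cost;
  int density_pct;

  assert (p->penalty_pct >= 0);

  /* A threshold of 100 disables the test; an empty body has no density.  */
  if (p->pct_threshold >= 100 || total <= 0)
    return vec_inside_cost;

  density_pct = (rel_cost * 100) / total;

  if (density_pct > p->pct_threshold && total > p->size_threshold)
    return vec_inside_cost * (100 + p->penalty_pct) / 100;

  return vec_inside_cost;
}
```

With the PowerPC values, a loop with relevant cost 90 and irrelevant cost 10
(density 90%, size 100) has an inside-loop cost of 80 raised to 88.  A loop
with the same density but a total size of 50 is left alone.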

By applying only a small percentage bump to the inside-loop cost, the
heuristic will only turn off vectorization for loops that were
considered "barely profitable" to begin with (such as the sphinx3 loop).
So the heuristic is quite conservative and should not affect the vast
majority of vectorization decisions.
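
To see why only "barely profitable" loops are affected, consider a
deliberately simplified model.  The sketch below assumes that outside-loop
costs and peeling are ignored, so that the vector loop body is profitable
whenever the scalar cost of VF iterations exceeds the vector inside-loop
cost.  The vectorizer's real computation estimates a minimum iteration count
and includes more terms.  The function names are hypothetical.

```c
#include <assert.h>

/* Simplified profitability test: compare the scalar cost of VF iterations
   against the cost of one vector iteration.  Outside-loop costs and
   peeling are ignored in this sketch.  */
int
vector_body_profitable (int scalar_iter_cost, int vec_inside_cost, int vf)
{
  assert (vf > 0);
  return scalar_iter_cost * vf > vec_inside_cost;
}

/* Apply a percentage penalty to the vector inside-loop cost, using the
   same integer arithmetic as the patch.  */
int
penalized_cost (int vec_inside_cost, int penalty_pct)
{
  return vec_inside_cost * (100 + penalty_pct) / 100;
}
```

With a scalar iteration cost of 20 and a vectorization factor of 4, a vector
body costing 76 is profitable before the 10% penalty (76 < 80) but not after
it (83 > 80).  A vector body costing 40 stays profitable either way
(44 < 80).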

Together with the new heuristic, this patch reduces the vec_perm and
vec_promote_demote costs for PowerPC to their natural values.

I've bootstrapped and regression-tested this with no regressions on
powerpc64-unknown-linux-gnu and verified that no performance regressions
occur on SPEC CPU2006.  Is this OK for trunk?

Thanks,
Bill


2012-06-08  Bill Schmidt  <wschm...@linux.ibm.com>

        * doc/tm.texi.in: Add vectorization density hooks.
        * doc/tm.texi: Regenerate.
        * targhooks.c (default_density_pct_threshold): New.
        (default_density_size_threshold): New.
        (default_density_penalty): New.
        * targhooks.h: New decls for new targhooks.c functions.
        * target.def (density_pct_threshold): New DEFHOOK.
        (density_size_threshold): Likewise.
        (density_penalty): Likewise.
        * tree-vect-loop.c (accum_stmt_cost): New.
        (vect_estimate_min_profitable_iters): Perform density test.
        * config/rs6000/rs6000.c (TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD):
        New macro definition.
        (TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD): Likewise.
        (TARGET_VECTORIZE_DENSITY_PENALTY): Likewise.
        (rs6000_builtin_vectorization_cost): Reduce costs of vec_perm and
        vec_promote_demote to correct values.
        (rs6000_density_pct_threshold): New.
        (rs6000_density_size_threshold): New.
        (rs6000_density_penalty): New.


Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi     (revision 188305)
+++ gcc/doc/tm.texi     (working copy)
@@ -5798,6 +5798,27 @@ The default is @code{NULL_TREE} which means to not
 loads.
 @end deftypefn
 
+@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD (void)
+This hook should return the maximum density, expressed in percent, for
+which autovectorization of loops with large bodies should be constrained.
+See also @code{TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD}.  The default
+is to return 100, which disables the density test.
+@end deftypefn
+
+@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD (void)
+This hook should return the minimum estimated size of a vectorized
+loop body for which the density test should apply.  See also
+@code{TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD}.  The default is set
+to the unreasonable value of 1000000, which effectively disables 
+the density test.
+@end deftypefn
+
+@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_PENALTY (void)
+This hook should return the penalty, expressed in percent, to be applied
+to the inside-of-loop vectorization costs for a loop failing the density
+test.  The default is 10.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in  (revision 188305)
+++ gcc/doc/tm.texi.in  (working copy)
@@ -5730,6 +5730,27 @@ The default is @code{NULL_TREE} which means to not
 loads.
 @end deftypefn
 
+@hook TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD
+This hook should return the maximum density, expressed in percent, for
+which autovectorization of loops with large bodies should be constrained.
+See also @code{TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD}.  The default
+is to return 100, which disables the density test.
+@end deftypefn
+
+@hook TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD
+This hook should return the minimum estimated size of a vectorized
+loop body for which the density test should apply.  See also
+@code{TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD}.  The default is set
+to the unreasonable value of 1000000, which effectively disables 
+the density test.
+@end deftypefn
+
+@hook TARGET_VECTORIZE_DENSITY_PENALTY
+This hook should return the penalty, expressed in percent, to be applied
+to the inside-of-loop vectorization costs for a loop failing the density
+test.  The default is 10.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
Index: gcc/targhooks.c
===================================================================
--- gcc/targhooks.c     (revision 188305)
+++ gcc/targhooks.c     (working copy)
@@ -990,6 +990,33 @@ default_autovectorize_vector_sizes (void)
   return 0;
 }
 
+/* By default, the density test for autovectorization is disabled by
+   setting the minimum percentage to 100.  */
+
+int
+default_density_pct_threshold (void)
+{
+  return 100;
+}
+
+/* By default, the density size threshold for autovectorization is
+   meaningless since the density test is disabled.  An unreasonably
+   large number is used to further inhibit the density test.  */
+
+int
+default_density_size_threshold (void)
+{
+  return 1000000;
+}
+
+/* By default, the density penalty for autovectorization is set to 10%.  */
+
+int
+default_density_penalty (void)
+{
+  return 10;
+}
+
 /* Determine whether or not a pointer mode is valid. Assume defaults
    of ptr_mode or Pmode - can be overridden.  */
 bool
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h     (revision 188305)
+++ gcc/targhooks.h     (working copy)
@@ -90,6 +90,9 @@ default_builtin_support_vector_misalignment (enum
                                             int, bool);
 extern enum machine_mode default_preferred_simd_mode (enum machine_mode mode);
 extern unsigned int default_autovectorize_vector_sizes (void);
+extern int default_density_pct_threshold (void);
+extern int default_density_size_threshold (void);
+extern int default_density_penalty (void);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
Index: gcc/target.def
===================================================================
--- gcc/target.def      (revision 188305)
+++ gcc/target.def      (working copy)
@@ -1054,6 +1054,32 @@ DEFHOOK
  (const_tree mem_vectype, const_tree index_type, int scale),
  NULL)
 
+/* Return the maximum density in percent for loop vectorization.  */
+DEFHOOK
+(density_pct_threshold,
+"",
+int,
+(void),
+default_density_pct_threshold)
+
+/* Return the minimum size of a loop iteration for applying the density
+   test for loop vectorization.  */
+DEFHOOK
+(density_size_threshold,
+"",
+int,
+(void),
+default_density_size_threshold)
+
+/* Return the penalty in percent for vectorizing a loop failing the
+   density test.  */
+DEFHOOK
+(density_penalty,
+"",
+int,
+(void),
+default_density_penalty)
+
 HOOK_VECTOR_END (vectorize)
 
 #undef HOOK_PREFIX
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c        (revision 188305)
+++ gcc/tree-vect-loop.c        (working copy)
@@ -2485,6 +2485,58 @@ vect_get_known_peeling_cost (loop_vec_info loop_vi
            + peel_guard_costs;
 }
 
+/* Add the inside-loop cost of STMT to either *REL_COST or *IRREL_COST,
+   depending on whether or not STMT will be vectorized.  For vectorized
+   statements, the inside-loop cost is as already computed.  For other
+   statements, assume a cost of one.  */
+
+static void
+accum_stmt_cost (gimple stmt, int *rel_cost, int *irrel_cost)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  gimple pattern_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
+  gimple_seq pattern_def_seq;
+
+  /* If the statement is irrelevant, but it has a related pattern
+     statement that is relevant, process just the related statement.
+     If the statement is relevant and it has a related pattern
+     statement that is also relevant, process them both.  */
+  if (!STMT_VINFO_RELEVANT_P (stmt_info)
+      && !STMT_VINFO_LIVE_P (stmt_info))
+    {
+      if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+         && pattern_stmt
+         && (STMT_VINFO_RELEVANT_P (vinfo_for_stmt (pattern_stmt))
+             || STMT_VINFO_LIVE_P (vinfo_for_stmt (pattern_stmt))))
+       accum_stmt_cost (pattern_stmt, rel_cost, irrel_cost);
+      else
+       (*irrel_cost)++;
+    }
+  else if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+          && pattern_stmt
+          && (STMT_VINFO_RELEVANT_P (vinfo_for_stmt (pattern_stmt))
+              || STMT_VINFO_LIVE_P (vinfo_for_stmt (pattern_stmt))))
+    accum_stmt_cost (pattern_stmt, rel_cost, irrel_cost);
+
+  /* If we're looking at a pattern that has additional statements,
+     count them as well.  */
+  if (is_pattern_stmt_p (stmt_info)
+      && (pattern_def_seq = STMT_VINFO_PATTERN_DEF_SEQ (stmt_info)))
+    {
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start (pattern_def_seq); !gsi_end_p (gsi); gsi_next (&gsi))
+       {
+         gimple pattern_def_stmt = gsi_stmt (gsi);
+         if (STMT_VINFO_RELEVANT_P (vinfo_for_stmt (pattern_def_stmt))
+             || STMT_VINFO_LIVE_P (vinfo_for_stmt (pattern_def_stmt)))
+           accum_stmt_cost (pattern_def_stmt, rel_cost, irrel_cost);
+       }
+    }
+
+  /* Accumulate the inside-loop cost of this vectorizable statement.  */
+  *rel_cost += STMT_VINFO_INSIDE_OF_LOOP_COST (stmt_info);
+}
+
 /* Function vect_estimate_min_profitable_iters
 
    Return the number of iterations required for the vector version of the
@@ -2743,6 +2795,45 @@ vect_estimate_min_profitable_iters (loop_vec_info
       vec_inside_cost += SLP_INSTANCE_INSIDE_OF_LOOP_COST (instance);
     }
 
+  /* Test for likely overcommitment of vector hardware resources.  If a
+     loop iteration is relatively large, and too large a percentage of
+     instructions in the loop are vectorized, the cost model may not
+     adequately reflect delays from unavailable vector resources.
+     Penalize vec_inside_cost for this case, using target-specific
+     parameters.  */
+  if (targetm.vectorize.density_pct_threshold () < 100)
+    {
+      int rel_cost = 0, irrel_cost = 0;
+      int density_pct;
+
+      for (i = 0; i < nbbs; i++)
+       {
+         basic_block bb = bbs[i];
+         gimple_stmt_iterator gsi;
+
+         for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+           {
+             gimple stmt = gsi_stmt (gsi);
+             accum_stmt_cost (stmt, &rel_cost, &irrel_cost);
+           }
+       }
+
+      density_pct = (rel_cost * 100) / (rel_cost + irrel_cost);
+
+      if (density_pct > targetm.vectorize.density_pct_threshold ()
+         && (rel_cost + irrel_cost
+             > targetm.vectorize.density_size_threshold ()))
+       {
+         int penalty = targetm.vectorize.density_penalty ();
+         vec_inside_cost = vec_inside_cost * (100 + penalty) / 100;
+         if (vect_print_dump_info (REPORT_DETAILS))
+           fprintf (vect_dump,
+                    "density %d%%, cost %d exceeds threshold"
+                    ", penalizing inside-loop cost by %d%%.",
+                    density_pct, rel_cost + irrel_cost, penalty);
+       }
+    }
+
   /* Calculate number of iterations required to make the vector version
      profitable, relative to the loop bodies only.  The following condition
      must hold true:
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c  (revision 188305)
+++ gcc/config/rs6000/rs6000.c  (working copy)
@@ -1289,6 +1289,15 @@ static const struct attribute_spec rs6000_attribut
 #undef TARGET_VECTORIZE_PREFERRED_SIMD_MODE
 #define TARGET_VECTORIZE_PREFERRED_SIMD_MODE \
   rs6000_preferred_simd_mode
+#undef TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD
+#define TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD \
+  rs6000_density_pct_threshold
+#undef TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD
+#define TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD \
+  rs6000_density_size_threshold
+#undef TARGET_VECTORIZE_DENSITY_PENALTY
+#define TARGET_VECTORIZE_DENSITY_PENALTY \
+  rs6000_density_penalty
 
 #undef TARGET_INIT_BUILTINS
 #define TARGET_INIT_BUILTINS rs6000_init_builtins
@@ -3421,13 +3430,13 @@ rs6000_builtin_vectorization_cost (enum vect_cost_
 
       case vec_perm:
        if (TARGET_VSX)
-         return 4;
+         return 3;
        else
          return 1;
 
       case vec_promote_demote:
         if (TARGET_VSX)
-          return 5;
+          return 4;
         else
           return 1;
 
@@ -3551,6 +3560,30 @@ rs6000_preferred_simd_mode (enum machine_mode mode
   return word_mode;
 }
 
+/* Implement targetm.vectorize.density_pct_threshold.  */
+
+static int
+rs6000_density_pct_threshold (void)
+{
+  return 85;
+}
+
+/* Implement targetm.vectorize.density_size_threshold.  */
+
+static int
+rs6000_density_size_threshold (void)
+{
+  return 70;
+}
+
+/* Implement targetm.vectorize.density_penalty.  */
+
+static int
+rs6000_density_penalty (void)
+{
+  return 10;
+}
+
 /* Handler for the Mathematical Acceleration Subsystem (mass) interface to a
    library with vectorized intrinsics.  */
