Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Mon, Nov 16, 2020 at 10:58 AM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>> > Does the patch also vectorize with SVE loops that have
>> > unknown loop bound?  The documentation isn't entirely
>> > conclusive there.
>>
>> Yeah, for SVE it vectorises.  How about changing:
>>
>>   For example, if each iteration of a vectorized loop would handle
>>   exactly four iterations, …
>>
>> to:
>>
>>   For example, if each iteration of a vectorized loop could only
>>   handle exactly four iterations of the original scalar loop, …
>>
>> ?
>
> Yeah, guess that's better.
>
>>
>> > Iff the iteration count is a multiple of two and the target can
>> > vectorize the loop with both VF 2 and VF 4 but VF 4 would be better if
>> > we'd use the 'cheap' cost model, does 'very-cheap' not vectorize the
>> > loop or does it choose VF 2?
>>
>> It would choose VF 2, if that's still a win over scalar code.
>
> OK, that's what I expected.  The VF iteration is one source of
> compile-time that we might want to avoid somehow ... on
> x86_64 knowing the precise number of constant iterations
> should allow to only pick a subset of vector modes based on
> largest_pow2_factor or so?  Or maybe just use the preferred
> SIMD mode for cheap/very-cheap?  (maybe pass down
> the cost model kind to the target hook so targets can decide
> for themselves here)

On the preferred simd mode thing: TBH, I'd prefer to get rid
of that hook one day and just rely on autovectorize_vector_modes.

The difficulty with adding an early check is that we don't know ahead
of time which types of scalar element a loop operates on: we only find
that out on the fly during the first analysis of the loop.  The check
would also depend on SLP grouping: we can use a vector of 4 ints to
handle 2 iterations of the scalar loop if the ints are in an SLP
group of size 2.

So I agree it would be nice to have early-outs, but I think we'd have
to restructure things first.  E.g. maybe we could do some “cheap” initial
analysis that checks for basic vectorisability, records which scalar
elements are used by the loop, and records how big the containing SLP
groups might be (based on optimistic assumptions).  Then we can use
that to prefilter the modes we try (perhaps all the way down to no modes).
I guess that's conceptually similar to building an SLP graph though.

Does the attached look OK?  I've included a version of the updated
wording above.  I also changed this condition to use “>” rather
than “>=”:

  /* If the vector loop needs multiple iterations to be beneficial then
     things are probably too close to call, and the conservative thing
     would be to stick with the scalar code.  */
  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
      && min_profitable_estimate > (int) vect_vf_for_cost (loop_vinfo))

since when min_profitable_estimate == min_profitable_iters
we'll have done:

  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
      && min_profitable_iters < (assumed_vf + peel_iters_prologue))
    /* We want the vectorized loop to execute at least once.  */
    min_profitable_iters = assumed_vf + peel_iters_prologue;

I also tried to make vect-cost-model-4.c more resilient on targets
that require alignment.

Thanks,
Richard


gcc/
        * doc/invoke.texi (-fvect-cost-model): Add a very-cheap model.
        * common.opt (fvect-cost-model=): Add very-cheap as a possible option.
        (fsimd-cost-model=): Likewise.
        (vect_cost_model): Add very-cheap.
        * flag-types.h (vect_cost_model): Add VECT_COST_MODEL_VERY_CHEAP.
        Put the values in order of increasing aggressiveness.
        * tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Use
        range checks when comparing against VECT_COST_MODEL_CHEAP.
        (vect_prune_runtime_alias_test_list): Do not allow any alias
        checks for the very-cheap cost model.
        * tree-vect-loop.c (vect_analyze_loop_costing): Do not allow
        any peeling for the very-cheap cost model.  Also require one
        iteration of the vector loop to pay for itself.

gcc/testsuite/
        * gcc.dg/vect/vect-cost-model-1.c: New test.
        * gcc.dg/vect/vect-cost-model-2.c: Likewise.
        * gcc.dg/vect/vect-cost-model-3.c: Likewise.
        * gcc.dg/vect/vect-cost-model-4.c: Likewise.
        * gcc.dg/vect/vect-cost-model-5.c: Likewise.
        * gcc.dg/vect/vect-cost-model-6.c: Likewise.
---
 gcc/common.opt                                |  7 +++--
 gcc/doc/invoke.texi                           | 12 +++++++--
 gcc/flag-types.h                              | 10 ++++---
 gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c | 13 +++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c | 12 +++++++++
 gcc/tree-vect-data-refs.c                     |  8 ++++--
 gcc/tree-vect-loop.c                          | 27 +++++++++++++++++++
 11 files changed, 123 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c

diff --git a/gcc/common.opt b/gcc/common.opt
index fe39b3dee9f..ca8a2690799 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3020,11 +3020,11 @@ Enable basic block vectorization (SLP) on trees.
 
 fvect-cost-model=
 Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) 
Init(VECT_COST_MODEL_DEFAULT) Optimization
--fvect-cost-model=[unlimited|dynamic|cheap]    Specifies the cost model for 
vectorization.
+-fvect-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the cost 
model for vectorization.
 
 fsimd-cost-model=
 Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) 
Init(VECT_COST_MODEL_UNLIMITED) Optimization
--fsimd-cost-model=[unlimited|dynamic|cheap]    Specifies the vectorization 
cost model for code marked with a simd directive.
+-fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the 
vectorization cost model for code marked with a simd directive.
 
 Enum
 Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown 
vectorizer cost model %qs)
@@ -3038,6 +3038,9 @@ Enum(vect_cost_model) String(dynamic) 
Value(VECT_COST_MODEL_DYNAMIC)
 EnumValue
 Enum(vect_cost_model) String(cheap) Value(VECT_COST_MODEL_CHEAP)
 
+EnumValue
+Enum(vect_cost_model) String(very-cheap) Value(VECT_COST_MODEL_VERY_CHEAP)
+
 fvect-cost-model
 Common Alias(fvect-cost-model=,dynamic,unlimited)
 Enables the dynamic vectorizer cost model.  Preserved for backward 
compatibility.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 3510a54c6c4..07232c6b33d 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -11440,7 +11440,8 @@ and @option{-fauto-profile}.
 @item -fvect-cost-model=@var{model}
 @opindex fvect-cost-model
 Alter the cost model used for vectorization.  The @var{model} argument
-should be one of @samp{unlimited}, @samp{dynamic} or @samp{cheap}.
+should be one of @samp{unlimited}, @samp{dynamic}, @samp{cheap} or
+@samp{very-cheap}.
 With the @samp{unlimited} model the vectorized code-path is assumed
 to be profitable while with the @samp{dynamic} model a runtime check
 guards the vectorized code-path to enable it only for iteration
@@ -11448,7 +11449,14 @@ counts that will likely execute faster than when 
executing the original
 scalar loop.  The @samp{cheap} model disables vectorization of
 loops where doing so would be cost prohibitive for example due to
 required runtime checks for data dependence or alignment but otherwise
-is equal to the @samp{dynamic} model.
+is equal to the @samp{dynamic} model.  The @samp{very-cheap} model only
+allows vectorization if the vector code would entirely replace the
+scalar code that is being vectorized.  For example, if each iteration
+of a vectorized loop would only be able to handle exactly four iterations
+of the scalar loop, the @samp{very-cheap} model would only allow
+vectorization if the scalar iteration count is known to be a multiple
+of four.
+
 The default cost model depends on other optimization flags and is
 either @samp{dynamic} or @samp{cheap}.
 
diff --git a/gcc/flag-types.h b/gcc/flag-types.h
index 648ed096e30..0dbab19943c 100644
--- a/gcc/flag-types.h
+++ b/gcc/flag-types.h
@@ -232,12 +232,14 @@ enum scalar_storage_order_kind {
   SSO_LITTLE_ENDIAN
 };
 
-/* Vectorizer cost-model.  */
+/* Vectorizer cost-model.  Except for DEFAULT, the values are ordered from
+   the most conservative to the least conservative.  */
 enum vect_cost_model {
+  VECT_COST_MODEL_VERY_CHEAP = -3,
+  VECT_COST_MODEL_CHEAP = -2,
+  VECT_COST_MODEL_DYNAMIC = -1,
   VECT_COST_MODEL_UNLIMITED = 0,
-  VECT_COST_MODEL_CHEAP = 1,
-  VECT_COST_MODEL_DYNAMIC = 2,
-  VECT_COST_MODEL_DEFAULT = 3
+  VECT_COST_MODEL_DEFAULT = 1
 };
 
 /* Different instrumentation modes.  */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c 
b/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
new file mode 100644
index 00000000000..0737da5d671
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
+
+void
+f (int *x, int *y)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } 
*/
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c 
b/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
new file mode 100644
index 00000000000..fa9bdb607b2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" 
} */
+
+void
+f (int *x, int *y)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump-not {LOOP VECTORIZED} vect { target vect_int } 
} } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c 
b/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
new file mode 100644
index 00000000000..d7c6cfd2049
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
+
+void
+f (int *restrict x, int *restrict y)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } 
*/
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c 
b/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
new file mode 100644
index 00000000000..bb018ad99fe
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" 
} */
+
+int x[1024], y[1024];
+
+void
+f (void)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } 
*/
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c 
b/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
new file mode 100644
index 00000000000..536ec0a3cda
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
+
+void
+f (int *restrict x, int *restrict y)
+{
+  for (unsigned int i = 0; i < 1023; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } 
*/
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c 
b/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
new file mode 100644
index 00000000000..552febb5fee
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" 
} */
+
+void
+f (int *restrict x, int *restrict y)
+{
+  for (unsigned int i = 0; i < 1023; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target { vect_int && 
vect_partial_vectors_usage_2 } } } } */
+/* { dg-final { scan-tree-dump-not {LOOP VECTORIZED} vect { target { vect_int 
&& { ! vect_partial_vectors_usage_2 } } } } } */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 0efab495407..18e36c89d14 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -2161,7 +2161,7 @@ vect_enhance_data_refs_alignment (loop_vec_info 
loop_vinfo)
         {
           unsigned max_allowed_peel
            = param_vect_max_peeling_for_alignment;
-         if (flag_vect_cost_model == VECT_COST_MODEL_CHEAP)
+         if (flag_vect_cost_model <= VECT_COST_MODEL_CHEAP)
            max_allowed_peel = 0;
           if (max_allowed_peel != (unsigned)-1)
             {
@@ -2259,7 +2259,7 @@ vect_enhance_data_refs_alignment (loop_vec_info 
loop_vinfo)
   do_versioning
     = (optimize_loop_nest_for_speed_p (loop)
        && !loop->inner /* FORNOW */
-       && flag_vect_cost_model != VECT_COST_MODEL_CHEAP);
+       && flag_vect_cost_model > VECT_COST_MODEL_CHEAP);
 
   if (do_versioning)
     {
@@ -3682,6 +3682,10 @@ vect_prune_runtime_alias_test_list (loop_vec_info 
loop_vinfo)
   unsigned int count = (comp_alias_ddrs.length ()
                        + check_unequal_addrs.length ());
 
+  if (count && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP)
+    return opt_result::failure_at
+      (vect_location, "would need a runtime alias check\n");
+
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
                     "improved number of alias checks from %d to %d\n",
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 856bbfebf7c..48dfb4df00e 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1827,6 +1827,19 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
        }
     }
 
+  /* If using the "very cheap" model. reject cases in which we'd keep
+     a copy of the scalar code (even if we might be able to vectorize it).  */
+  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
+      && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
+         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
+    {
+      if (dump_enabled_p ())
+       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                        "some scalar iterations would need to be peeled\n");
+      return 0;
+    }
+
   int min_profitable_iters, min_profitable_estimate;
   vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
                                      &min_profitable_estimate);
@@ -1885,6 +1898,20 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
       min_profitable_estimate = min_profitable_iters;
     }
 
+  /* If the vector loop needs multiple iterations to be beneficial then
+     things are probably too close to call, and the conservative thing
+     would be to stick with the scalar code.  */
+  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
+      && min_profitable_estimate > (int) vect_vf_for_cost (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                        "one iteration of the vector loop would be"
+                        " more expensive than the equivalent number of"
+                        " iterations of the scalar loop\n");
+      return 0;
+    }
+
   HOST_WIDE_INT estimated_niter;
 
   /* If we are vectorizing an epilogue then we know the maximum number of

Reply via email to