Hi Bill,

Thanks for the review comments!

On 2021/9/3 11:57 PM, Bill Schmidt wrote:
> Hi Kewen,
> 
> Sorry that we lost track of this patch!  The heuristic approach looks good.  
> It is limited in scope and won't kick in often, and the case you're trying to 
> account for is important.
> 
> At the time you submitted this, I think reliable P10 testing wasn't possible. 
>  Now that it is, could you please do a quick sniff test to make sure there 
> aren't any adjustments that need to be made for P10?  I doubt it, but worth 
> checking.
> 

Good point, thanks for the reminder!  I did one full SPEC2017 run on Power10 
with -Ofast and unrolling; this patch was neutral there.  I also did one 
SPEC2017 run at -O2 with vectorization enabled (cheap cost model) to verify 
whether the bwaves_r degradation existed and whether this patch fixes it.  
The result shows the degradation did exist and was fixed by this patch; 
moreover, we got an extra 3.93% speedup against -O2, and another benchmark, 
554.roms_r, got a 3.24% speedup.

In short, the Power10 evaluation result shows this patch is positive.
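
For context, the problematic pattern is a loop whose vectorization has to 
gather many strided scalar loads into vectors.  A minimal illustrative C 
loop (an invented stand-in for this discussion, not actual bwaves_r code) 
looks like:

```c
/* Strided loads feeding vector construction: vectorizing this loop
   needs one scalar load per element gathered into a vector
   (VMAT_ELEMENTWISE), which is the case the heuristic prices.  */
void
strided_sum (double *restrict out, const double *restrict in,
	     int n, int stride)
{
  for (int i = 0; i < n; i++)
    out[i] = in[i * stride] + 1.0;
}
```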

> Otherwise I have one comment below...
> 
> On 7/28/21 12:22 AM, Kewen.Lin wrote:
>> Hi,
>>
>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>
>> This v3 addressed William's review comments in
>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>>
>> It's mainly to deal with the bwaves_r degradation due to vector
>> construction fed by strided loads.
>>
>> As Richi's comments [1], this follows the similar idea to over
>> price the vector construction fed by VMAT_ELEMENTWISE or
>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>> construction costing immediately, it firstly records how many
>> loads and vectorized statements in the given loop, later in
>> rs6000_density_test (called by finish_cost) it computes the
>> load density ratio against all vectorized stmts, and check
>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>> if both thresholds are exceeded.
>>
>> Note that this new load density heuristic check is based on
>> some fields in target cost which are updated as needed when
>> scanning each add_stmt_cost entry, it's independent of the
>> current function rs6000_density_test which requires to scan
>> non_vect stmts.  Since it's checking the load stmts count
>> vs. all vectorized stmts, it's kind of density, so I put
>> it in function rs6000_density_test.  With the same reason to
>> keep it independent, I didn't put it as an else arm of the
>> current existing density threshold check hunk or before this
>> hunk.
>>
>> In the investigation of -1.04% degradation from 526.blender_r
>> on Power8, I noticed that the extra penalized cost 320 on one
>> single vector construction with type V16QI is much exaggerated,
>> which makes the final body cost unreliable, so this patch adds
>> one maximum bound for the extra penalized cost for each vector
>> construction statement.
>>
>> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>>
>> Full SPEC2017 performance evaluation on Power8/Power9 with
>> option combinations (with v2, as v3 is NFC against v2):
>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>   * {-O3, -Ofast} {,-funroll-loops}
>>
>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>> remarkable was observed.
>>

...

>>+  /* Gather some information when we are costing the vectorized instruction
>>+     for the statements located in a loop body.  */
>>+  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>+    {
>>+      data->nstmts += orig_count;
>>+
>>+      if (kind == scalar_load || kind == vector_load
>>+       || kind == unaligned_load || kind == vector_gather_load)
>>+     data->nloads += orig_count;
>>+
>>+      /* If we have strided or elementwise loads into a vector, it's
>>+      possible to be bounded by latency and execution resources for
>>+      many scalar loads.  Try to account for this by scaling the
>>+      construction cost by the number of elements involved, when
>>+      handling each matching statement we record the possible extra
>>+      penalized cost into target cost, in the end of costing for
>>+      the whole loop, we do the actual penalization once some load
>>+      density heuristics are satisfied.  */
> 
> The above comment is quite hard to read.  Can you please break up the last
> sentence into at least two sentences?
> 

How about the below:

+      /* If we have strided or elementwise loads into a vector, it's
+        possible to be bound by latency and execution resources for
+        many scalar loads.  Try to account for this by scaling the
+        construction cost by the number of elements involved.  For
+        each matching statement, we record the possible extra
+        penalized cost into the relevant field in the target cost.
+        When finalizing the whole loop costing, we will check whether
+        those related load density heuristics are satisfied, and add
+        this accumulated penalized cost if so.  */
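
As a side note, the interaction between the capped per-construction penalty 
and the final density check can be modeled standalone like this (a 
simplified sketch using the constants from the patch; the helper names are 
mine, not GCC's):

```c
/* Thresholds and cap, as in the patch.  */
enum { DENSITY_LOAD_PCT_THRESHOLD = 45,
       DENSITY_LOAD_NUM_THRESHOLD = 20,
       MAX_PENALIZED_COST_FOR_CTOR = 12 };

/* Per-construction extra cost: nunits * stmt_cost, bounded above so
   cheap-unit types like V16QI don't blow up the body cost.  */
static unsigned
ctor_extra_cost (unsigned nunits, unsigned stmt_cost)
{
  unsigned extra = nunits * stmt_cost;
  return extra > MAX_PENALIZED_COST_FOR_CTOR
	 ? MAX_PENALIZED_COST_FOR_CTOR : extra;
}

/* Final check: apply the accumulated penalty only if both the load
   count and the load density exceed their thresholds.  */
static unsigned
density_penalty (unsigned nloads, unsigned nstmts,
		 unsigned extra_ctor_cost)
{
  unsigned load_pct = nloads * 100 / nstmts;
  if (nloads > DENSITY_LOAD_NUM_THRESHOLD
      && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
    return extra_ctor_cost;
  return 0;
}
```

E.g. V16QI on Power8 with stmt_cost 20 and nunits 16 would give 320 
uncapped; the cap keeps it at 12, matching the 526.blender_r discussion 
above.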


> Otherwise this looks good to me, and I recommend maintainers approve with
> that clarified.
> 

Thanks again!

The whole updated patch is attached; it also addresses Segher's comments on 
formatting.

Is it ok for trunk?

BR,
Kewen
-----
gcc/ChangeLog:

        * config/rs6000/rs6000.c (struct rs6000_cost_data): New members
        nstmts, nloads and extra_ctor_cost.
        (rs6000_density_test): Add load density related heuristics and the
        checks, do extra costing on vector construction statements if needed.
        (rs6000_init_cost): Init new members.
        (rs6000_update_target_cost_per_stmt): New function.
        (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
        rs6000_update_target_cost_per_stmt and call it.

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index e073b26b430..a8872cd6d8d 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
 {
   struct loop *loop_info;
   unsigned cost[3];
+  /* Total number of vectorized stmts (loop only).  */
+  unsigned nstmts;
+  /* Total number of loads (loop only).  */
+  unsigned nloads;
+  /* Possible extra penalized cost on vector construction (loop only).  */
+  unsigned extra_ctor_cost;
   /* For each vectorized loop, this var holds TRUE iff a non-memory vector
      instruction is needed by the vectorization.  */
   bool vect_nonmem;
@@ -5323,9 +5329,45 @@ rs6000_density_test (rs6000_cost_data *data)
       if (dump_enabled_p ())
        dump_printf_loc (MSG_NOTE, vect_location,
                         "density %d%%, cost %d exceeds threshold, penalizing "
-                        "loop body cost by %d%%", density_pct,
+                        "loop body cost by %d%%\n", density_pct,
                         vec_cost + not_vec_cost, DENSITY_PENALTY);
     }
+
+  /* Check if we need to penalize the body cost for latency and
+     execution resources bound from strided or elementwise loads
+     into a vector.  */
+  if (data->extra_ctor_cost > 0)
+    {
+      /* Threshold for load stmts percentage in all vectorized stmts.  */
+      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
+      /* Threshold for total number of load stmts.  */
+      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
+
+      gcc_assert (data->nloads <= data->nstmts);
+      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
+
+      /* It's likely to be bounded by latency and execution resources
+        from many scalar loads which are strided or elementwise loads
+        into a vector if both conditions below are found:
+          1. there are many loads, it's easy to result in a long wait
+             for load units;
+          2. load has a big proportion of all vectorized statements,
+             it's not easy to schedule other statements to spread among
+             the loads.
+        One typical case is the innermost loop of the hotspot of SPEC2017
+        503.bwaves_r without loop interchange.  */
+      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
+         && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
+       {
+         data->cost[vect_body] += data->extra_ctor_cost;
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_NOTE, vect_location,
+                            "Found %u loads and load pct. %u%% exceed "
+                            "the threshold, penalizing loop body "
+                            "cost by extra cost %u for ctor.\n",
+                            data->nloads, load_pct, data->extra_ctor_cost);
+       }
+    }
 }
 
 /* Implement targetm.vectorize.init_cost.  */
@@ -5339,6 +5381,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
   data->cost[vect_body]     = 0;
   data->cost[vect_epilogue] = 0;
   data->vect_nonmem = false;
+  data->nstmts = 0;
+  data->nloads = 0;
+  data->extra_ctor_cost = 0;
   data->costing_for_scalar = costing_for_scalar;
   return data;
 }
@@ -5366,6 +5411,70 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
   return 0;
 }
 
+/* Helper function for add_stmt_cost.  Check each statement cost
+   entry, gather information and update the target_cost fields
+   accordingly.  */
+static void
+rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
+                                   enum vect_cost_for_stmt kind,
+                                   struct _stmt_vec_info *stmt_info,
+                                   enum vect_cost_model_location where,
+                                   int stmt_cost, unsigned int orig_count)
+{
+
+  /* Check whether we're doing something other than just a copy loop.
+     Not all such loops may be profitably vectorized; see
+     rs6000_finish_cost.  */
+  if (kind == vec_to_scalar
+      || kind == vec_perm
+      || kind == vec_promote_demote
+      || kind == vec_construct
+      || kind == scalar_to_vec
+      || (where == vect_body && kind == vector_stmt))
+    data->vect_nonmem = true;
+
+  /* Gather some information when we are costing the vectorized instruction
+     for the statements located in a loop body.  */
+  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
+    {
+      data->nstmts += orig_count;
+
+      if (kind == scalar_load || kind == vector_load
+         || kind == unaligned_load || kind == vector_gather_load)
+       data->nloads += orig_count;
+
+      /* If we have strided or elementwise loads into a vector, it's
+        possible to be bound by latency and execution resources for
+        many scalar loads.  Try to account for this by scaling the
+        construction cost by the number of elements involved.  For
+        each matching statement, we record the possible extra
+        penalized cost into the relevant field in the target cost.
+        When finalizing the whole loop costing, we will check whether
+        those related load density heuristics are satisfied, and add
+        this accumulated penalized cost if so.  */
+      if (kind == vec_construct && stmt_info
+         && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+         && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
+             || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
+       {
+         tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+         unsigned int nunits = vect_nunits_for_cost (vectype);
+         unsigned int extra_cost = nunits * stmt_cost;
+         /* As function rs6000_builtin_vectorization_cost shows, we have
+            priced much on V16QI/V8HI vector construction as their units,
+            if we penalize them with nunits * stmt_cost, it can result in
+            an unreliable body cost, eg: for V16QI on Power8, stmt_cost
+            is 20 and nunits is 16, the extra cost is 320 which looks
+            much exaggerated.  So let's use one maximum bound for the
+            extra penalized cost for vector construction here.  */
+         const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
+         if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
+           extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
+         data->extra_ctor_cost += extra_cost;
+       }
+    }
+}
+
 /* Implement targetm.vectorize.add_stmt_cost.  */
 
 static unsigned
@@ -5385,6 +5494,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       /* Statements in an inner loop relative to the loop being
         vectorized are weighted more heavily.  The value here is
         arbitrary and could potentially be improved with analysis.  */
+      unsigned int orig_count = count;
       if (where == vect_body && stmt_info
          && stmt_in_inner_loop_p (vinfo, stmt_info))
        {
@@ -5396,14 +5506,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       retval = (unsigned) (count * stmt_cost);
       cost_data->cost[where] += retval;
 
-      /* Check whether we're doing something other than just a copy loop.
-        Not all such loops may be profitably vectorized; see
-        rs6000_finish_cost.  */
-      if ((kind == vec_to_scalar || kind == vec_perm
-          || kind == vec_promote_demote || kind == vec_construct
-          || kind == scalar_to_vec)
-         || (where == vect_body && kind == vector_stmt))
-       cost_data->vect_nonmem = true;
+      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
+                                         stmt_cost, orig_count);
     }
 
   return retval;
