Hi,

Gentle ping on this:
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen

>> on 2021/9/28 4:16 PM, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> This patch follows the discussions here[1][2], where Segher
>>> pointed out that the existing way of guarding the extra
>>> penalized cost for strided/elementwise loads with a magic
>>> bound does not scale.
>>>
>>> Computing the penalty as nunits * stmt_cost can give a much
>>> exaggerated penalized cost; for example, for V16QI on P8 it's
>>> 16 * 20 = 320, which is why we needed a bound.  To make it
>>> better and more readable, the penalized cost is simplified
>>> to:
>>>
>>>   unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>>>   unsigned extra_cost = nunits * adjusted_cost;
>>>
>>> For V2DI/V2DF, it uses a penalized cost of 2 for each scalar
>>> load, while for the other modes it uses 1.  This is mainly
>>> concluded from the performance evaluations.  One possibly
>>> related point: constructing a vector with more units takes
>>> more instructions, which gives more chances to schedule them
>>> well (even to run them in parallel when enough units are
>>> available at that time), so it seems reasonable not to
>>> penalize them more.
>>>
>>> The SPEC2017 evaluations on Power8/Power9/Power10 at option
>>> sets O2-vect and Ofast-unroll show this change is neutral.
>>>
>>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>>
>>> Is it ok for trunk?
>>>
>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html
>>>
>>> BR,
>>> Kewen
>>> -----
>>> gcc/ChangeLog:
>>>
>>>     * config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>>>     the way to compute extra penalized cost.  Remove useless parameter.
>>>     (rs6000_add_stmt_cost): Adjust the call to function
>>>     rs6000_update_target_cost_per_stmt.
>>>
>>>
>>> ---
>>>  gcc/config/rs6000/rs6000.c | 31 ++++++++++++++++++-------------
>>>  1 file changed, 18 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>>> index dd42b0964f1..8200e1152c2 100644
>>> --- a/gcc/config/rs6000/rs6000.c
>>> +++ b/gcc/config/rs6000/rs6000.c
>>> @@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>>                                      enum vect_cost_for_stmt kind,
>>>                                      struct _stmt_vec_info *stmt_info,
>>>                                      enum vect_cost_model_location where,
>>> -                                    int stmt_cost,
>>>                                      unsigned int orig_count)
>>>  {
>>>
>>> @@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>>         {
>>>           tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>>           unsigned int nunits = vect_nunits_for_cost (vectype);
>>> -         unsigned int extra_cost = nunits * stmt_cost;
>>> -         /* As function rs6000_builtin_vectorization_cost shows, we have
>>> -            priced much on V16QI/V8HI vector construction as their units,
>>> -            if we penalize them with nunits * stmt_cost, it can result in
>>> -            an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>>> -            is 20 and nunits is 16, the extra cost is 320 which looks
>>> -            much exaggerated.  So let's use one maximum bound for the
>>> -            extra penalized cost for vector construction here.  */
>>> -         const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>>> -         if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>>> -           extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>>> +         /* Don't expect strided/elementwise loads for just 1 nunit.  */
>>> +         gcc_assert (nunits > 1);
>>> +         /* i386 port adopts nunits * stmt_cost as the penalized cost
>>> +            for this kind of penalization, we used to follow it but
>>> +            found it could result in an unreliable body cost especially
>>> +            for V16QI/V8HI modes.  To make it better, we choose this
>>> +            new heuristic: for each scalar load, we use 2 as penalized
>>> +            cost for the case with 2 nunits and use 1 for the other
>>> +            cases.  It's without much supporting theory, mainly
>>> +            concluded from the broad performance evaluations on Power8,
>>> +            Power9 and Power10.  One possibly related point is that:
>>> +            vector construction for more units would use more insns,
>>> +            it has more chances to schedule them better (even run in
>>> +            parallelly when enough available units at that time), so
>>> +            it seems reasonable not to penalize that much for them.  */
>>> +         unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
>>> +         unsigned int extra_cost = nunits * adjusted_cost;
>>>           data->extra_ctor_cost += extra_cost;
>>>         }
>>>      }
>>> @@ -5510,7 +5515,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>>        cost_data->cost[where] += retval;
>>>
>>>        rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>>> -                                         stmt_cost, orig_count);
>>> +                                         orig_count);
>>>      }
>>>
>>>    return retval;
>>> --
>>> 2.27.0
>>>
>
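
For anyone who wants to compare the two penalties without building GCC,
here is a minimal standalone sketch of the arithmetic in the quoted hunk.
It is not part of the patch: old_extra_cost, new_extra_cost and
print_new_costs are hypothetical names made up for illustration, a plain
assert stands in for gcc_assert, and the unit counts 2/4/8/16 are simply
assumed to stand for V2DI/V2DF, V4SI/V4SF, V8HI and V16QI.

```c
#include <assert.h>
#include <stdio.h>

/* Old scheme (the removed lines): nunits * stmt_cost, clamped to the
   magic bound of 12 (MAX_PENALIZED_COST_FOR_CTOR in the patch).  */
unsigned int
old_extra_cost (unsigned int nunits, unsigned int stmt_cost)
{
  const unsigned int max_penalized_cost_for_ctor = 12;
  unsigned int extra_cost = nunits * stmt_cost;
  if (extra_cost > max_penalized_cost_for_ctor)
    extra_cost = max_penalized_cost_for_ctor;
  return extra_cost;
}

/* New scheme (the added lines): a per-scalar-load penalty of 2 for
   2-unit vectors and 1 for all other unit counts.  */
unsigned int
new_extra_cost (unsigned int nunits)
{
  /* Mirrors gcc_assert (nunits > 1) in the patch.  */
  assert (nunits > 1);
  unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
  return nunits * adjusted_cost;
}

/* Hypothetical helper: print the new penalty for the assumed unit
   counts 2, 4, 8 and 16.  */
void
print_new_costs (void)
{
  const unsigned int nunits[] = { 2, 4, 8, 16 };
  for (size_t i = 0; i < sizeof nunits / sizeof nunits[0]; i++)
    printf ("nunits=%u new_extra_cost=%u\n",
            nunits[i], new_extra_cost (nunits[i]));
}
```

With these definitions, new_extra_cost gives 4, 4, 8 and 16 for 2, 4, 8
and 16 units, while old_extra_cost (16, 20) clamps the 320 from the
V16QI-on-Power8 example down to 12.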