Hi William, Thanks for the review comments!
on 2021/7/28 上午6:25, will schmidt wrote: > On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote: >> Hi, >> > > > Hi, > > >> This is the updated version of patch to deal with the bwaves_r >> degradation due to vector construction fed by strided loads. >> >> As Richi's comments [1], this follows the similar idea to over >> price the vector construction fed by VMAT_ELEMENTWISE or >> VMAT_STRIDED_SLP. Instead of adding the extra cost on vector >> construction costing immediately, it firstly records how many >> loads and vectorized statements in the given loop, later in >> rs6000_density_test (called by finish_cost) it computes the >> load density ratio against all vectorized stmts, and check >> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD >> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing >> if both thresholds are exceeded. > > ok > >> >> Note that this new load density heuristic check is based on >> some fields in target cost which are updated as needed when >> scanning each add_stmt_cost entry, it's independent of the >> current function rs6000_density_test which requires to scan >> non_vect stmts. Since it's checking the load stmts count >> vs. all vectorized stmts, it's kind of density, so I put >> it in function rs6000_density_test. With the same reason to >> keep it independent, I didn't put it as an else arm of the >> current existing density threshold check hunk or before this >> hunk. > > ok > >> >> In the investigation of -1.04% degradation from 526.blender_r >> on Power8, I noticed that the extra penalized cost 320 on one >> single vector construction with type V16QI is much exaggerated, >> which makes the final body cost unreliable, so this patch adds >> one maximum bound for the extra penalized cost for each vector >> construction statement. > > ok > >> >> Bootstrapped/regtested on powerpc64le-linux-gnu P9. >> >> Full SPEC2017 performance evaluation on Power8/Power9 with >> option combinations: >> * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math} >> * {-O3, -Ofast} {,-funroll-loops} >> >> bwaves_r degradations on P8/P9 have been fixed, nothing else >> remarkable was observed. > > So, this fixes the "-1.04% degradation from 526.blender_r on Power8" > degredation with no additional regressions. that sounds good. > Sorry for the confusion, the original intention of this patch is to fix the -8% degradation at -O2 -ftree-vectorize vs. -O2 on bwaves_r. (From the last evaluation based on r12-1442, P8 is about -10% while P9 is about -9%). The mentioned -1.04% degradation from 526.blender_r on P8 was expected to be a reason why we need the bound for the extra penalized cost adjustment. >> >> Is it ok for trunk? >> >> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html >> >> BR, >> Kewen >> ----- >> gcc/ChangeLog: >> >> * config/rs6000/rs6000.c (struct rs6000_cost_data): New members >> nstmts, nloads and extra_ctor_cost. >> (rs6000_density_test): Add load density related heuristics and the >> checks, do extra costing on vector construction statements if need. >> (rs6000_init_cost): Init new members. >> (rs6000_update_target_cost_per_stmt): New function. >> (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function >> rs6000_update_target_cost_per_stmt and call it. >> > >> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c >> index 83d29cbfac1..806c3335cbc 100644 >> --- a/gcc/config/rs6000/rs6000.c >> +++ b/gcc/config/rs6000/rs6000.c >> >> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data >> { >> struct loop *loop_info; >> unsigned cost[3]; >> + /* Total number of vectorized stmts (loop only). */ >> + unsigned nstmts; >> + /* Total number of loads (loop only). */ >> + unsigned nloads; >> + /* Possible extra penalized cost on vector construction (loop only). */ >> + unsigned extra_ctor_cost; >> >> /* For each vectorized loop, this var holds TRUE iff a non-memory vector >> instruction is needed by the vectorization. */ >> bool vect_nonmem; >> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data) >> if (dump_enabled_p ()) >> dump_printf_loc (MSG_NOTE, vect_location, >> "density %d%%, cost %d exceeds threshold, penalizing " >> - "loop body cost by %d%%", density_pct, >> + "loop body cost by %d%%\n", density_pct, >> vec_cost + not_vec_cost, DENSITY_PENALTY); >> } >> + >> + /* Check if we need to penalize the body cost for latency and >> + execution resources bound from strided or elementwise loads >> + into a vector. */ >> + if (data->extra_ctor_cost > 0) >> + { >> + /* Threshold for load stmts percentage in all vectorized stmts. */ >> + const int DENSITY_LOAD_PCT_THRESHOLD = 45; >> + /* Threshold for total number of load stmts. */ >> + const int DENSITY_LOAD_NUM_THRESHOLD = 20; >> + >> + gcc_assert (data->nloads <= data->nstmts); >> + unsigned int load_pct = (data->nloads * 100) / (data->nstmts); >> + >> + /* It's likely to be bounded by latency and execution resources >> + from many scalar loads which are strided or elementwise loads >> + into a vector if both conditions below are found: >> + 1. there are many loads, it's easy to result in a long wait >> + for load units; >> + 2. load has a big proportion of all vectorized statements, >> + it's not easy to schedule other statements to spread among >> + the loads. >> + One typical case is the innermost loop of the hotspot of SPEC2017 >> + 503.bwaves_r without loop interchange. */ >> + if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD >> + && load_pct > DENSITY_LOAD_PCT_THRESHOLD) >> + { >> + data->cost[vect_body] += data->extra_ctor_cost; >> + if (dump_enabled_p ()) >> + dump_printf_loc (MSG_NOTE, vect_location, >> + "Found %u loads and load pct. %u%% exceed " >> + "the threshold, penalizing loop body " >> + "cost by extra cost %u for ctor.\n", >> + data->nloads, load_pct, data->extra_ctor_cost); >> + } >> + } >> >> } >> >> /* Implement targetm.vectorize.init_cost. */ >> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool >> costing_for_scalar) >> data->cost[vect_body] = 0; >> data->cost[vect_epilogue] = 0; >> data->vect_nonmem = false; >> + data->nstmts = 0; >> + data->nloads = 0; >> + data->extra_ctor_cost = 0; >> >> data->costing_for_scalar = costing_for_scalar; >> return data; >> } >> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum >> vect_cost_for_stmt kind, >> return 0; >> } >> >> +/* As a visitor function for each statement cost entry handled in >> + function add_stmt_cost, gather some information and update its >> + relevant fields in target cost accordingly. */ > > I got lost trying to parse that.. (could be just me :-) > > Possibly instead something like > /* Helper function for add_stmt_cost ; gather information and update > the target_cost fields accordingly. */ > > OK, will update. I was thinking for each entry handled in function add_stmt_cost, this helper acts like a visitor, trying to visit each entry and take some actions if some conditions are satisifed. > >> +static void >> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data, >> + enum vect_cost_for_stmt kind, >> + struct _stmt_vec_info *stmt_info, >> + enum vect_cost_model_location where, >> + int stmt_cost, unsigned int orig_count) >> +{ >> + >> + /* Check whether we're doing something other than just a copy loop. >> + Not all such loops may be profitably vectorized; see >> + rs6000_finish_cost. */ >> + if ((kind == vec_to_scalar || kind == vec_perm || kind == >> vec_promote_demote >> + || kind == vec_construct || kind == scalar_to_vec) >> + || (where == vect_body && kind == vector_stmt)) > > I thought I saw an alignment issue, then noticed the "(" that was > hiding on me.. :-) > > Maybe clearer to read if you rearrange slightly and flatten it ? I > defer to others on that.. > > if ((kind == vec_to_scalar > || kind == vec_perm > || kind == vec_promote_demote > || kind == > vec_construct > || kind == scalar_to_vec) > || (kind == vector_stmt && where == vect_body) > > This hunk is factored out from function rs6000_add_stmt_cost, maybe I can keep the original formatting? The formatting tool isn't so smart, and sometimes rearrange things to become unexpected (although it meets the basic rule, not so elegant), sigh. >> + data->vect_nonmem = true; >> + >> + /* Gather some information when we are costing the vector version for >> + the statements locate in a loop body. */ > s/version/instruction? operation?/ ? ? > s/locate/located/ > Will fix. >> + if (!data->costing_for_scalar && data->loop_info && where == vect_body) >> + { >> + data->nstmts += orig_count; >> + >> + if (kind == scalar_load || kind == vector_load || kind == >> unaligned_load >> + || kind == vector_gather_load) > > Cosmetically, I'd move the second "||" to the next line to balance > those two lines a bit more. > Will fix. >> + data->nloads += orig_count; >> + >> + /* If we have strided or elementwise loads into a vector, it's >> + possible to be bounded by latency and execution resources for >> + many scalar loads. Try to account for this by scaling the >> + construction cost by the number of elements involved, when >> + handling each matching statement we record the possible extra >> + penalized cost into target cost, in the end of costing for >> + the whole loop, we do the actual penalization once some load >> + density heuristics are satisfied. */ >> + if (kind == vec_construct && stmt_info >> + && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type >> + && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE >> + || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP)) >> + { >> + tree vectype = STMT_VINFO_VECTYPE (stmt_info); >> + unsigned int nunits = vect_nunits_for_cost (vectype); >> + unsigned int extra_cost = nunits * stmt_cost; >> + /* As function rs6000_builtin_vectorization_cost shows, we have >> + priced much on V16QI/V8HI vector construction as their units, >> + if we penalize them with nunits * stmt_cost, it can result in >> + a unreliable body cost, eg: for V16QI on Power8, stmt_cost is > s/a/an/ Will fix. >> + 20 and nunits is 16, the extra cost is 320 which looks much >> + exaggerated. So let's use one maximum bound for the extra >> + penalized cost for vector construction here. */ >> + const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12; >> + if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR) >> + extra_cost = MAX_PENALIZED_COST_FOR_CTOR; >> + data->extra_ctor_cost += extra_cost; >> + } >> + } >> +} > ok > >> + >> >> /* Implement targetm.vectorize.add_stmt_cost. */ >> >> static unsigned >> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void >> *data, int count, >> /* Statements in an inner loop relative to the loop being >> vectorized are weighted more heavily. The value here is >> arbitrary and could potentially be improved with analysis. */ >> + unsigned int orig_count = count; > > I don't see any other assignments to orig_count. Does 'count' get > updated elsewhere in rs6000_add_stmt_cost() that the new orig_count > variable is necessary? > Yeah, the 'count' could get updated but the default "git diff" doesn't show it, the codes omitted look as below: if (where == vect_body && stmt_info && stmt_in_inner_loop_p (vinfo, stmt_info)) { loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo); gcc_assert (loop_vinfo); count *= LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo); /* FIXME. */ } BR, Kewen >> >> if (where == vect_body && stmt_info >> && stmt_in_inner_loop_p (vinfo, stmt_info)) >> { >> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void >> *data, int count, >> retval = (unsigned) (count * stmt_cost); >> cost_data->cost[where] += retval; >> >> - /* Check whether we're doing something other than just a copy loop. >> - Not all such loops may be profitably vectorized; see >> - rs6000_finish_cost. */ >> - if ((kind == vec_to_scalar || kind == vec_perm >> - || kind == vec_promote_demote || kind == vec_construct >> - || kind == scalar_to_vec) >> - || (where == vect_body && kind == vector_stmt)) >> - cost_data->vect_nonmem = true; >> + rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where, >> + stmt_cost, orig_count); >> >> } >> >> return retval; >> > >