On 2021/9/8 2:57 PM, Kewen.Lin via Gcc-patches wrote:
> Hi Bill,
>
> Thanks for the review comments!
>
> On 2021/9/3 11:57 PM, Bill Schmidt wrote:
>> Hi Kewen,
>>
>> Sorry that we lost track of this patch!  The heuristic approach looks good.
>> It is limited in scope and won't kick in often, and the case you're trying
>> to account for is important.
>>
>> At the time you submitted this, I think reliable P10 testing wasn't
>> possible.  Now that it is, could you please do a quick sniff test to make
>> sure there aren't any adjustments that need to be made for P10?  I doubt
>> it, but worth checking.
>>
>
> Good point, thanks for the reminder!  I did one full SPEC2017 run on
> Power10 with Ofast unroll, where this patch is neutral, and one SPEC2017
> run at O2 vectorization (cheap cost) to verify whether the bwaves_r
> degradation still existed and whether it can be fixed by this patch.  The
> result shows the degradation did exist and is fixed by this patch;
> besides, bwaves_r gained an extra 3.93% speedup against O2, and another
> benchmark, 554.roms_r, got a 3.24% speedup.
>
Hmm, sorry, but the improvement on 554.roms_r turns out not to be reliable.
I just ran it with 10 iterations both with and without the patch; both
suffer from run-to-run jitter, and their best scores are close.  Note that
the bwaves_r scores are quite stable, so that result is reliable.

BR,
Kewen

> In short, the Power10 evaluation result shows this patch is positive.
>
>> Otherwise I have one comment below...
>>
>> On 7/28/21 12:22 AM, Kewen.Lin wrote:
>>> Hi,
>>>
>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>>
>>> This v3 addressed William's review comments in
>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>>>
>>> It's mainly to deal with the bwaves_r degradation due to vector
>>> construction fed by strided loads.
>>>
>>> Following Richi's comments [1], this takes a similar approach of
>>> over-pricing vector construction fed by VMAT_ELEMENTWISE or
>>> VMAT_STRIDED_SLP loads.  Instead of adding the extra cost immediately
>>> when costing the vector construction, it first records how many loads
>>> and vectorized statements there are in the given loop.  Later, in
>>> rs6000_density_test (called by finish_cost), it computes the load
>>> density ratio against all vectorized stmts and checks it against the
>>> corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD and
>>> DENSITY_LOAD_PCT_THRESHOLD, doing the actual extra pricing only if
>>> both thresholds are exceeded.
>>>
>>> Note that this new load density heuristic check is based on some
>>> fields in the target cost structure which are updated as needed while
>>> scanning each add_stmt_cost entry; it's independent of the existing
>>> checks in rs6000_density_test, which require scanning non_vect stmts.
>>> Since it checks the load stmt count vs. all vectorized stmts, it is a
>>> kind of density, so I put it in function rs6000_density_test.  For
>>> the same reason, to keep it independent, I didn't put it as an else
>>> arm of the existing density threshold check hunk, nor before that
>>> hunk.
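To make the two-phase scheme in the quoted description more concrete (record load/stmt counts during per-statement costing, then penalize once at finish time), here is a rough sketch in C. The struct layout, function name, and threshold values are illustrative stand-ins, not the patch's exact definitions.

```c
/* Illustrative sketch of the load-density check described above.
   The threshold values below are placeholders, not necessarily the
   ones chosen by the rs6000 patch.  */

#define DENSITY_LOAD_NUM_THRESHOLD 20
#define DENSITY_LOAD_PCT_THRESHOLD 90

struct cost_data
{
  unsigned int nstmts;          /* All vectorized stmts in the loop body.  */
  unsigned int nloads;          /* Load stmts among them.  */
  unsigned int extra_ctor_cost; /* Accumulated vector-construction penalty.  */
};

/* Return the extra body cost to add at finish_cost time: the
   accumulated construction penalty is applied only when both the
   absolute load count and the load percentage exceed their
   thresholds; otherwise no penalization happens.  */
static unsigned int
density_load_penalty (const struct cost_data *d)
{
  if (d->nstmts == 0)
    return 0;
  unsigned int load_pct = (d->nloads * 100) / d->nstmts;
  if (d->nloads > DENSITY_LOAD_NUM_THRESHOLD
      && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
    return d->extra_ctor_cost;
  return 0;
}
```

For example, a loop body with 25 loads out of 27 vectorized stmts (92% density) would trip both thresholds and have the accumulated penalty added to its body cost, while a loop with only 10 loads would not, regardless of its percentage.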
>>>
>>> While investigating the -1.04% degradation of 526.blender_r on
>>> Power8, I noticed that the extra penalized cost 320 on one single
>>> vector construction with type V16QI is much exaggerated, which makes
>>> the final body cost unreliable, so this patch adds one maximum bound
>>> for the extra penalized cost of each vector construction statement.
>>>
>>> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>>>
>>> Full SPEC2017 performance evaluation on Power8/Power9 with
>>> option combinations (with v2, as v3 is NFC against v2):
>>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>>   * {-O3, -Ofast} {,-funroll-loops}
>>>
>>> bwaves_r degradations on P8/P9 have been fixed; nothing else
>>> remarkable was observed.
>>>
>
> ...
>
>>> +  /* Gather some information when we are costing the vectorized instruction
>>> +     for the statements located in a loop body.  */
>>> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>> +    {
>>> +      data->nstmts += orig_count;
>>> +
>>> +      if (kind == scalar_load || kind == vector_load
>>> +          || kind == unaligned_load || kind == vector_gather_load)
>>> +        data->nloads += orig_count;
>>> +
>>> +      /* If we have strided or elementwise loads into a vector, it's
>>> +         possible to be bounded by latency and execution resources for
>>> +         many scalar loads.  Try to account for this by scaling the
>>> +         construction cost by the number of elements involved, when
>>> +         handling each matching statement we record the possible extra
>>> +         penalized cost into target cost, in the end of costing for
>>> +         the whole loop, we do the actual penalization once some load
>>> +         density heuristics are satisfied.  */
>>
>> The above comment is quite hard to read.  Can you please break up the
>> last sentence into at least two sentences?
>>
>
> How about the below:
>
> +      /* If we have strided or elementwise loads into a vector, it's
> +         possible to be bounded by latency and execution resources for
> +         many scalar loads.  Try to account for this by scaling the
> +         construction cost by the number of elements involved.  For
> +         each matching statement, we record the possible extra
> +         penalized cost into the relevant field in target cost.  When
> +         we want to finalize the whole loop costing, we will check if
> +         those related load density heuristics are satisfied, and add
> +         this accumulated penalized cost if yes.  */
>
>
>> Otherwise this looks good to me, and I recommend maintainers approve
>> with that clarified.
>>
>
> Thanks again!
>
> The whole updated patch is attached; it also addresses Segher's comments
> on formatting.
>
> Is it ok for trunk?
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if needed.
> 	(rs6000_init_cost): Init new members.
> 	(rs6000_update_target_cost_per_stmt): New function.
> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> 	rs6000_update_target_cost_per_stmt and call it.
>
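As a footnote to the 526.blender_r discussion in the thread, the capped per-statement penalty could look roughly like the sketch below. The function name and the cap value are hypothetical, chosen only to mirror the "320 is much exaggerated" example; the patch's actual bound may differ.

```c
/* Rough sketch of the bounded per-construction penalty described in
   the thread.  CTOR_PENALTY_CAP is a hypothetical value, not taken
   from the patch.  */
enum { CTOR_PENALTY_CAP = 12 };

/* Scale the construction cost by the element count, but clamp the
   result so a wide type like V16QI (nunits 16, stmt_cost 20 on
   Power8, i.e. a raw penalty of 320) cannot dominate the final body
   cost and make it unreliable.  */
static unsigned int
vec_ctor_extra_cost (unsigned int nunits, unsigned int stmt_cost)
{
  unsigned int extra = nunits * stmt_cost;
  return extra > CTOR_PENALTY_CAP ? CTOR_PENALTY_CAP : extra;
}
```

Each matching vec_construct statement would accumulate this bounded value into the target cost's extra_ctor_cost field, to be applied only if the load density thresholds are later found to be exceeded.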