Hi Segher and Bill,

Thanks a lot for your reviews and help!
On 2021/9/10 1:19 AM, Bill Schmidt wrote:
> On 9/9/21 11:11 AM, Segher Boessenkool wrote:
>> Hi!
>>
>> On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
>>>>> +  /* If we have strided or elementwise loads into a vector, it's
>>>>> +     possible to be bounded by latency and execution resources for
>>>>> +     many scalar loads.  Try to account for this by scaling the
>>>>> +     construction cost by the number of elements involved, when
>>>>> +     handling each matching statement we record the possible extra
>>>>> +     penalized cost into target cost, in the end of costing for
>>>>> +     the whole loop, we do the actual penalization once some load
>>>>> +     density heuristics are satisfied.  */
>>>> The above comment is quite hard to read.  Can you please break up the
>>>> last sentence into at least two sentences?
>>> How about the below:
>>>
>>> +  /* If we have strided or elementwise loads into a vector, it's
>> "strided" is not a word: it properly is "stridden", which does not read
>> very well either.  "Have loads by stride, or by element, ..."?  Is that
>> good English, and easier to understand?
>
> No, this is OK.  "Strided loads" is a term of art used by the vectorizer;
> whether or not it was the Queen's English, it's what we have...  (And I
> think you might only find "bestridden" in some 18th or 19th century
> English poetry... :-)
>>
>>> +     possible to be bounded by latency and execution resources for
>>> +     many scalar loads.  Try to account for this by scaling the
>>> +     construction cost by the number of elements involved.  For
>>> +     each matching statement, we record the possible extra
>>> +     penalized cost into the relevant field in target cost.  When
>>> +     we want to finalize the whole loop costing, we will check if
>>> +     those related load density heuristics are satisfied, and add
>>> +     this accumulated penalized cost if yes.  */
>>>
>>>> Otherwise this looks good to me, and I recommend maintainers approve
>>>> with that clarified.
>> Does that text look good to you now Bill?  It is still kinda complex,
>> maybe you see a way to make it simpler.
>
> I think it's OK now.  The complexity at least matches the code now
> instead of exceeding it. :-P  j/k...
>

I just noticed Bill helped to revise it; I will use that nice paragraph.
(Thanks, Bill!)

>>
>>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>>> 	nstmts, nloads and extra_ctor_cost.
>>> 	(rs6000_density_test): Add load density related heuristics and the
>>> 	checks, do extra costing on vector construction statements if need.
>> "and the checks"?  Oh, "and checks"?  It is probably fine to just leave
>> out this whole phrase part :-)
>>
>> Don't use commas like this in changelogs.  s/, do/.  Do/  Yes this is a
>> bit boring text that way, but that is the purpose: it makes it simpler
>> to read (and read quickly, even merely scan).
>>

Thanks for the explanation, will fix.

>>> @@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
>> [ Btw, you can get rid of the typedef now, just have a struct with the
>> non-underscore name, we have C++ now.  Such a mechanical change (as
>> separate patch!) is pre-approved. ]
>>
>>> +  /* Check if we need to penalize the body cost for latency and
>>> +     execution resources bound from strided or elementwise loads
>>> +     into a vector.  */
>> Bill, is that clear enough?  I'm sure something nicer would help here,
>> but it's hard for me to write anything :-)
>
> Perhaps: "Check whether we need to penalize the body cost to account for
> excess strided or elementwise loads."

Thanks, will update.
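To double-check my understanding of the agreed flow, the per-statement
recording and the deferred penalization amount to something like the below
standalone sketch.  All names here (cost_data, record_ctor_penalty,
finish_body_cost) are made up for illustration and are not the exact ones
in the patch; only the threshold values are taken from it:

#include <stdio.h>

/* Illustrative per-loop bookkeeping, mimicking the fields the patch
   adds to rs6000_cost_data.  */
struct cost_data
{
  unsigned int nstmts;          /* All vectorized statements.  */
  unsigned int nloads;          /* Vectorized load statements.  */
  unsigned int extra_ctor_cost; /* Accumulated ctor penalty.  */
  unsigned int body_cost;
};

/* Phase 1: while costing each matching construction statement, only
   record the would-be penalty; nothing is applied yet.  */
static void
record_ctor_penalty (cost_data *data, unsigned int penalty)
{
  data->extra_ctor_cost += penalty;
}

/* Phase 2: once at the end of costing the whole loop, apply the
   accumulated penalty only if the load density heuristics hold.  */
static void
finish_body_cost (cost_data *data)
{
  const unsigned int DENSITY_LOAD_PCT_THRESHOLD = 45;
  const unsigned int DENSITY_LOAD_NUM_THRESHOLD = 20;
  unsigned int load_pct = data->nloads * 100 / data->nstmts;
  if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
      && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
    data->body_cost += data->extra_ctor_cost;
}

int
main ()
{
  cost_data data = {50, 25, 0, 100};
  record_ctor_penalty (&data, 12);
  record_ctor_penalty (&data, 12);
  finish_body_cost (&data);
  /* 25 loads out of 50 stmts is 50%, above both thresholds, so the
     body cost becomes 100 + 24 = 124.  */
  printf ("body cost: %u\n", data.body_cost);
  return 0;
}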
>>
>>> +      if (data->extra_ctor_cost > 0)
>>> +	{
>>> +	  /* Threshold for load stmts percentage in all vectorized stmts.  */
>>> +	  const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>> Threshold for what?
>>
>> 45% is awfully exact.  Can you make this a param?
>>
>>> +	  /* Threshold for total number of load stmts.  */
>>> +	  const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>> Same.
>
>
> We have similar magic constants in here already.  Parameterizing is
> possible, but I'm more interested in making sure the numbers are
> appropriate for each processor.  Given that Kewen reports they work well
> for both P9 and P10, I'm pretty happy with what we have here.  (Kewen,
> thanks for running the P10 experiments!)
>
> Perhaps a follow-up patch to add params for the magic constants would be
> reasonable, but I'd personally consider it pretty low priority.
>

As Bill mentioned, there are some other thresholds which can be
parameterized together.  Segher, if you don't mind, I will parameterize
all of them in a follow-up patch. :)

>>
>>> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>> No parens around the last thing please.  The other pair of parens is
>> unneeded as well, but perhaps it is easier to read like that.
>>

Will fix.

>>> +	  if (dump_enabled_p ())
>>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>>> +			     "Found %u loads and load pct. %u%% exceed "
>>> +			     "the threshold, penalizing loop body "
>>> +			     "cost by extra cost %u for ctor.\n",
>>> +			     data->nloads, load_pct, data->extra_ctor_cost);
>> That line does not fit.  Make it more lines?
>>

Will fix.

>> It is a pity that using these interfaces at all takes up 45 chars
>> of noise already.
>>
>>> +/* Helper function for add_stmt_cost.  Check each statement cost
>>> +   entry, gather information and update the target_cost fields
>>> +   accordingly.  */
>>> +static void
>>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>> +				    enum vect_cost_for_stmt kind,
>>> +				    struct _stmt_vec_info *stmt_info,
>>> +				    enum vect_cost_model_location where,
>>> +				    int stmt_cost, unsigned int orig_count)
>> Please put those last two on separate lines as well?
>>

Will fix.

>>> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
>>> +	     priced much on V16QI/V8HI vector construction as their units,
>>> +	     if we penalize them with nunits * stmt_cost, it can result in
>>> +	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>>> +	     is 20 and nunits is 16, the extra cost is 320 which looks
>>> +	     much exaggerated.  So let's use one maximum bound for the
>>> +	     extra penalized cost for vector construction here.  */
>>> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>>> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>>> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> That is a pretty gross hack.  Can you think of any saner way to not have
>> those out of scale costs in the first place?
>
> In Kewen's defense, the whole business of "finish_cost" for these
> vectorized loops is to tweak things that don't work quite right with the
> hooks currently provided to the vectorizer to add costs on a per-stmt
> basis without looking at the overall set of statements.  It gives the
> back end a chance to massage things and exercise veto power over
> otherwise bad decisions.  By nature, that's going to be very much a
> heuristic exercise.  Personally I think the heuristics used here are
> pretty reasonable, and importantly they are designed to only be employed
> in pretty rare circumstances.  It doesn't look easy to me to avoid the
> need for a cap here without making the rest of the heuristics harder to
> understand.  But sure, he can try! :)
>
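Just to make the hack under discussion concrete, the capping boils down to
the below sketch (illustrative only; ctor_penalty is a made-up name, while
the raw nunits * stmt_cost and the bound of 12 follow the quoted code):

/* Illustrative sketch of the cap: for V16QI on Power8, stmt_cost 20
   and nunits 16 give a raw penalty of 320, which the bound clamps
   down to 12.  */
static unsigned int
ctor_penalty (unsigned int nunits, unsigned int stmt_cost)
{
  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
  unsigned int extra_cost = nunits * stmt_cost;
  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
  return extra_cost;
}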
Thanks for the explanation, Bill!  I agree the current hack isn't good.
One alternative that avoids this bound check is to stop using
nunits * stmt_cost, which was borrowed from the existing practice for a
similar case, and use log2(nunits) instead: then V16QI => 4, V8HI => 3
..., and the cost will not be exaggerated nearly as much as before (a
rough sketch is in the P.S. below).  I'm not sure whether it will work as
expected, so I'll re-evaluate this idea on Power8/9/10 at Ofast unroll and
O2 vect (at least) and see how it goes.

>>
>> Okay for trunk with such tweaks.  Thanks!  (And please consult with Bill
>> for the wordsmithing :-) )
>>

Thanks again.

BR,
Kewen
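P.S. A rough sketch of the log2(nunits) idea above.  Illustrative only and
still to be benchmarked; ctor_penalty_log2 is a made-up name, and in the
tree this would presumably just use exact_log2:

/* Scale the ctor penalty by log2 (nunits) instead of nunits, so
   V16QI => 4 and V8HI => 3.  With stmt_cost 20 on Power8, V16QI
   then contributes 80 instead of 320, hopefully small enough that
   no artificial cap is needed.  */
static unsigned int
ctor_penalty_log2 (unsigned int nunits, unsigned int stmt_cost)
{
  unsigned int scale = 0;
  while (nunits > 1)
    {
      nunits >>= 1;
      scale++;
    }
  return scale * stmt_cost;
}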