Hi Segher and Bill,

Thanks a lot for your reviews and help!
On 2021/9/10 1:19 AM, Bill Schmidt wrote:
> On 9/9/21 11:11 AM, Segher Boessenkool wrote:
>> Hi!
>>
>> On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
>>>>> +  /* If we have strided or elementwise loads into a vector, it's
>>>>> +     possible to be bounded by latency and execution resources for
>>>>> +     many scalar loads.  Try to account for this by scaling the
>>>>> +     construction cost by the number of elements involved, when
>>>>> +     handling each matching statement we record the possible extra
>>>>> +     penalized cost into target cost, in the end of costing for
>>>>> +     the whole loop, we do the actual penalization once some load
>>>>> +     density heuristics are satisfied.  */
>>>> The above comment is quite hard to read.  Can you please break up the
>>>> last sentence into at least two sentences?
>>> How about the below:
>>>
>>> +  /* If we have strided or elementwise loads into a vector, it's
>> "strided" is not a word: it properly is "stridden", which does not read
>> very well either.  "Have loads by stride, or by element, ..."?  Is that
>> good English, and easier to understand?
>
> No, this is OK.  "Strided loads" is a term of art used by the vectorizer;
> whether or not it was the Queen's English, it's what we have...  (And I
> think you might only find "bestridden" in some 18th or 19th century
> English poetry... :-)
>>
>>> +     possible to be bounded by latency and execution resources for
>>> +     many scalar loads.  Try to account for this by scaling the
>>> +     construction cost by the number of elements involved.  For
>>> +     each matching statement, we record the possible extra
>>> +     penalized cost into the relevant field in target cost.  When
>>> +     we want to finalize the whole loop costing, we will check if
>>> +     those related load density heuristics are satisfied, and add
>>> +     this accumulated penalized cost if yes.  */
>>>
>>>> Otherwise this looks good to me, and I recommend maintainers approve
>>>> with that clarified.
>> Does that text look good to you now Bill?  It is still kinda complex,
>> maybe you see a way to make it simpler.
>
> I think it's OK now.  The complexity at least matches the code now
> instead of exceeding it. :-P  j/k...
>

I just noticed Bill helped to revise it; I will use that nice paragraph.
(Thanks, Bill!)

>>
>>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>>> 	nstmts, nloads and extra_ctor_cost.
>>> 	(rs6000_density_test): Add load density related heuristics and the
>>> 	checks, do extra costing on vector construction statements if need.
>> "and the checks"?  Oh, "and checks"?  It is probably fine to just leave
>> out this whole phrase part :-)
>>
>> Don't use commas like this in changelogs.  s/, do/.  Do/  Yes this is a
>> bit boring text that way, but that is the purpose: it makes it simpler
>> to read (and read quickly, even merely scan).
>>

Thanks for the explanation, will fix.

>>> @@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
>> [ Btw, you can get rid of the typedef now, just have a struct with the
>> non-underscore name, we have C++ now.  Such a mechanical change (as
>> separate patch!) is pre-approved. ]
>>
>>> +  /* Check if we need to penalize the body cost for latency and
>>> +     execution resources bound from strided or elementwise loads
>>> +     into a vector.  */
>> Bill, is that clear enough?  I'm sure something nicer would help here,
>> but it's hard for me to write anything :-)
>
> Perhaps: "Check whether we need to penalize the body cost to account for
> excess strided or elementwise loads."

Thanks, will update.
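To double-check my understanding of the agreed flow, the per-statement
recording and the deferred penalization amount to something like the below
standalone sketch.  All names here (cost_data, record_ctor_penalty,
finish_body_cost) are made up for illustration and are not the exact ones
in the patch; only the threshold values are taken from it:

#include <stdio.h>

/* Illustrative per-loop bookkeeping, mimicking the fields the patch
   adds to rs6000_cost_data.  */
struct cost_data
{
  unsigned int nstmts;          /* All vectorized statements.  */
  unsigned int nloads;          /* Vectorized load statements.  */
  unsigned int extra_ctor_cost; /* Accumulated ctor penalty.  */
  unsigned int body_cost;
};

/* Phase 1: while costing each matching construction statement, only
   record the would-be penalty; nothing is applied yet.  */
static void
record_ctor_penalty (cost_data *data, unsigned int penalty)
{
  data->extra_ctor_cost += penalty;
}

/* Phase 2: once at the end of costing the whole loop, apply the
   accumulated penalty only if the load density heuristics hold.  */
static void
finish_body_cost (cost_data *data)
{
  const unsigned int DENSITY_LOAD_PCT_THRESHOLD = 45;
  const unsigned int DENSITY_LOAD_NUM_THRESHOLD = 20;
  unsigned int load_pct = data->nloads * 100 / data->nstmts;
  if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
      && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
    data->body_cost += data->extra_ctor_cost;
}

int
main ()
{
  cost_data data = {50, 25, 0, 100};
  record_ctor_penalty (&data, 12);
  record_ctor_penalty (&data, 12);
  finish_body_cost (&data);
  /* 25 loads out of 50 stmts is 50%, above both thresholds, so the
     body cost becomes 100 + 24 = 124.  */
  printf ("body cost: %u\n", data.body_cost);
  return 0;
}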
>>
>>> +      if (data->extra_ctor_cost > 0)
>>> +	{
>>> +	  /* Threshold for load stmts percentage in all vectorized stmts.  */
>>> +	  const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>> Threshold for what?
>>
>> 45% is awfully exact.  Can you make this a param?
>>
>>> +	  /* Threshold for total number of load stmts.  */
>>> +	  const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>> Same.
>
>
> We have similar magic constants in here already.  Parameterizing is
> possible, but I'm more interested in making sure the numbers are
> appropriate for each processor.  Given that Kewen reports they work well
> for both P9 and P10, I'm pretty happy with what we have here.  (Kewen,
> thanks for running the P10 experiments!)
>
> Perhaps a follow-up patch to add params for the magic constants would be
> reasonable, but I'd personally consider it pretty low priority.
>

As Bill mentioned, there are some other thresholds which can be
parameterized together.  Segher, if you don't mind, I will parameterize
all of them in a follow-up patch. :)

>>
>>> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>> No parens around the last thing please.  The other pair of parens is
>> unneeded as well, but perhaps it is easier to read like that.
>>

Will fix.

>>> +	  if (dump_enabled_p ())
>>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>>> +			     "Found %u loads and load pct. %u%% exceed "
>>> +			     "the threshold, penalizing loop body "
>>> +			     "cost by extra cost %u for ctor.\n",
>>> +			     data->nloads, load_pct, data->extra_ctor_cost);
>> That line does not fit.  Make it more lines?
>>

Will fix.

>> It is a pity that using these interfaces at all takes up 45 chars
>> of noise already.
>>
>>> +/* Helper function for add_stmt_cost.  Check each statement cost
>>> +   entry, gather information and update the target_cost fields
>>> +   accordingly.  */
>>> +static void
>>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>> +				    enum vect_cost_for_stmt kind,
>>> +				    struct _stmt_vec_info *stmt_info,
>>> +				    enum vect_cost_model_location where,
>>> +				    int stmt_cost, unsigned int orig_count)
>> Please put those last two on separate lines as well?
>>

Will fix.

>>> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
>>> +	     priced much on V16QI/V8HI vector construction as their units,
>>> +	     if we penalize them with nunits * stmt_cost, it can result in
>>> +	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>>> +	     is 20 and nunits is 16, the extra cost is 320 which looks
>>> +	     much exaggerated.  So let's use one maximum bound for the
>>> +	     extra penalized cost for vector construction here.  */
>>> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>>> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>>> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> That is a pretty gross hack.  Can you think of any saner way to not have
>> those out of scale costs in the first place?
>
> In Kewen's defense, the whole business of "finish_cost" for these
> vectorized loops is to tweak things that don't work quite right with the
> hooks currently provided to the vectorizer to add costs on a per-stmt
> basis without looking at the overall set of statements.  It gives the
> back end a chance to massage things and exercise veto power over
> otherwise bad decisions.  By nature, that's going to be very much a
> heuristic exercise.  Personally I think the heuristics used here are
> pretty reasonable, and importantly they are designed to only be employed
> in pretty rare circumstances.  It doesn't look easy to me to avoid the
> need for a cap here without making the rest of the heuristics harder to
> understand.  But sure, he can try! :)
>
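Just to make the hack under discussion concrete, the capping boils down to
the below sketch (illustrative only; ctor_penalty is a made-up name, while
the raw nunits * stmt_cost and the bound of 12 follow the quoted code):

/* Illustrative sketch of the cap: for V16QI on Power8, stmt_cost 20
   and nunits 16 give a raw penalty of 320, which the bound clamps
   down to 12.  */
static unsigned int
ctor_penalty (unsigned int nunits, unsigned int stmt_cost)
{
  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
  unsigned int extra_cost = nunits * stmt_cost;
  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
  return extra_cost;
}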
Thanks for the explanation, Bill!  I agree the current hack isn't good.
One alternative that avoids this bound check is to stop using
nunits * stmt_cost, which was borrowed from the existing practice for a
similar case, and use log2(nunits) instead: then V16QI => 4, V8HI => 3
..., and the cost will not be exaggerated nearly as much as before (a
rough sketch is in the P.S. below).  I'm not sure whether it will work as
expected, so I'll re-evaluate this idea on Power8/9/10 at Ofast unroll and
O2 vect (at least) and see how it goes.

>>
>> Okay for trunk with such tweaks.  Thanks!  (And please consult with Bill
>> for the wordsmithing :-) )
>>

Thanks again.

BR,
Kewen
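P.S. A rough sketch of the log2(nunits) idea above.  Illustrative only and
still to be benchmarked; ctor_penalty_log2 is a made-up name, and in the
tree this would presumably just use exact_log2:

/* Scale the ctor penalty by log2 (nunits) instead of nunits, so
   V16QI => 4 and V8HI => 3.  With stmt_cost 20 on Power8, V16QI
   then contributes 80 instead of 320, hopefully small enough that
   no artificial cap is needed.  */
static unsigned int
ctor_penalty_log2 (unsigned int nunits, unsigned int stmt_cost)
{
  unsigned int scale = 0;
  while (nunits > 1)
    {
      nunits >>= 1;
      scale++;
    }
  return scale * stmt_cost;
}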