on 2021/9/8 下午2:57, Kewen.Lin via Gcc-patches wrote:
> Hi Bill,
> 
> Thanks for the review comments!
> 
> on 2021/9/3 下午11:57, Bill Schmidt wrote:
>> Hi Kewen,
>>
>> Sorry that we lost track of this patch!  The heuristic approach looks good.  
>> It is limited in scope and won't kick in often, and the case you're trying 
>> to account for is important.
>>
>> At the time you submitted this, I think reliable P10 testing wasn't 
>> possible.  Now that it is, could you please do a quick sniff test to make 
>> sure there aren't any adjustments that need to be made for P10?  I doubt it, 
>> but worth checking.
>>
> 
> Good point, thanks for the reminder!  I did one SPEC2017 full run on Power10 
> with Ofast unroll, this patch is neutral,
> one SPEC2017 run at O2 vectorization (cheap cost) to verify bwaves_r 
> degradation existed or not and if it can fixed by
> this patch.  The result shows the degradation did exist and got fixed by this 
> patch, besides got extra 3.93% speedup
> against O2 and another bmk 554.roms_r got 3.24% speed up.
> 

hmm, sorry that this improvement on 554.roms_r looks not reliable, I just ran 
it with 10 iterations for both w/ and w/o
the patch, both suffer the jitters and the best scores of them are close.  But 
note that bwaves_r scores are quite stable
so it's reliable.

BR,
Kewen

> In short, the Power10 evaluation result shows this patch is positive.
> 
>> Otherwise I have one comment below...
>>
>> On 7/28/21 12:22 AM, Kewen.Lin wrote:
>>> Hi,
>>>
>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>>
>>> This v3 addressed William's review comments in
>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>>>
>>> It's mainly to deal with the bwaves_r degradation due to vector
>>> construction fed by strided loads.
>>>
>>> As Richi's comments [1], this follows the similar idea to over
>>> price the vector construction fed by VMAT_ELEMENTWISE or
>>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>>> construction costing immediately, it firstly records how many
>>> loads and vectorized statements in the given loop, later in
>>> rs6000_density_test (called by finish_cost) it computes the
>>> load density ratio against all vectorized stmts, and check
>>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>>> if both thresholds are exceeded.
>>>
>>> Note that this new load density heuristic check is based on
>>> some fields in target cost which are updated as needed when
>>> scanning each add_stmt_cost entry, it's independent of the
>>> current function rs6000_density_test which requires to scan
>>> non_vect stmts.  Since it's checking the load stmts count
>>> vs. all vectorized stmts, it's kind of density, so I put
>>> it in function rs6000_density_test.  With the same reason to
>>> keep it independent, I didn't put it as an else arm of the
>>> current existing density threshold check hunk or before this
>>> hunk.
>>>
>>> In the investigation of -1.04% degradation from 526.blender_r
>>> on Power8, I noticed that the extra penalized cost 320 on one
>>> single vector construction with type V16QI is much exaggerated,
>>> which makes the final body cost unreliable, so this patch adds
>>> one maximum bound for the extra penalized cost for each vector
>>> construction statement.
>>>
>>> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>>>
>>> Full SPEC2017 performance evaluation on Power8/Power9 with
>>> option combinations (with v2, as v3 is NFC against v2):
>>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>>   * {-O3, -Ofast} {,-funroll-loops}
>>>
>>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>>> remarkable was observed.
>>>
> 
> ...
> 
>>> +  /* Gather some information when we are costing the vectorized instruction
>>> +     for the statements located in a loop body.  */
>>> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>> +    {
>>> +      data->nstmts += orig_count;
>>> +
>>> +      if (kind == scalar_load || kind == vector_load
>>> +     || kind == unaligned_load || kind == vector_gather_load)
>>> +   data->nloads += orig_count;
>>> +
>>> +      /* If we have strided or elementwise loads into a vector, it's
>>> +    possible to be bounded by latency and execution resources for
>>> +    many scalar loads.  Try to account for this by scaling the
>>> +    construction cost by the number of elements involved, when
>>> +    handling each matching statement we record the possible extra
>>> +    penalized cost into target cost, in the end of costing for
>>> +    the whole loop, we do the actual penalization once some load
>>> +    density heuristics are satisfied.  */
>>
>> The above comment is quite hard to read.  Can you please break up the last
>> sentence into at least two sentences?
>>
> 
> How about the below:
> 
> +      /* If we have strided or elementwise loads into a vector, it's
> +        possible to be bounded by latency and execution resources for
> +        many scalar loads.  Try to account for this by scaling the
> +        construction cost by the number of elements involved.  For
> +        each matching statement, we record the possible extra
> +        penalized cost into the relevant field in target cost.  When
> +        we want to finalize the whole loop costing, we will check if
> +        those related load density heuristics are satisfied, and add
> +        this accumulated penalized cost if yes.  */
> 
> 
>> Otherwise this looks good to me, and I recommend maintainers approve with
>> that clarified.
>>
> 
> Thanks again!
> 
> The whole updated patch is attached, it also addressed Segher's comments on 
> formatting.
> 
> Is it ok for trunk?
> 
> BR,
> Kewen
> -----
> gcc/ChangeLog:
> 
>       * config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>       nstmts, nloads and extra_ctor_cost.
>       (rs6000_density_test): Add load density related heuristics and the
>       checks, do extra costing on vector construction statements if need.
>       (rs6000_init_cost): Init new members.
>       (rs6000_update_target_cost_per_stmt): New function.
>       (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
>       rs6000_update_target_cost_per_stmt and call it.
> 

Reply via email to