Re: [Patch, tree-optimization, predcom] Improve unroll factor for predictive commoning

Ajit Agarwal Sat, 13 Jul 2024 07:36:22 -0700

Hello Richard:

On 12/07/24 6:20 pm, Richard Biener wrote:
> On Fri, Jul 12, 2024 at 12:09 PM Ajit Agarwal <aagar...@linux.ibm.com> wrote:
>>
>> Hello Richard:
>>
>> On 11/07/24 2:21 pm, Richard Biener wrote:
>>> On Thu, Jul 11, 2024 at 10:30 AM Ajit Agarwal <aagar...@linux.ibm.com> 
>>> wrote:
>>>>
>>>> Hello All:
>>>>
>>>> Unroll factor is determined with max distance across loop iterations.
>>>> The logic for determining the loop unroll factor is based on
>>>> data dependency across loop iterations.
>>>>
>>>> The max distance across loop iterations is the unrolling factor
>>>> that helps in predictive commoning.
>>>
>>> The old comment in the code says
>>>
>>>> -      /* The best unroll factor for this chain is equal to the number of
>>>> -        temporary variables that we create for it.  */
>>>
>>> why is that wrong and why is the max dependence distance more correct?
>>>
>>> Do you have a testcase that shows how this makes a (positive) difference?
>>>
>>
>> There is nothing wrong in the existing implementation of unroll
>> factor for predictive commoning.
>>
>> But with max dependence distance we get performance improvement
>> with spec 2017 benchmarks (INT) of 0.01% (Geomean) with and without
>> changes. Improvement in benchmarks with max dependence distance
>> changes.
>>
>> I have used the following flags:
>> -O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning 
>> -fno-tree-pre
>>
>> With above flags I ran with and without changes.
> 
> A 0.01% geomean improvement is noise.  Why did you disable PRE?
>


I have changed the flags used. Now I change the flags to -O3 and
-fpredictive-commoning -funroll-loops.
I am measuring the performance  with the above flags for spec 2017
benchmarks and would let you know the performance by Monday.

>> There is no degradation with spec 2017 (FP benchmarks).
>>
>> Because in predictive commoning we reuse values computed in
>> earlier iterations of a loop in the later ones, max distance is the
>> better choice.
> 
> The re-use distance is the same though.  So your change merely increases
> the unroll factor?  Or can you explain why there is more re-use with
> your change.
>

With -O3 flag and -fpredictive-commoning -funroll-loops in spec 2017 
benchmarks many benchmarks increases the unroll factor with my changes
in predictive commoning determine_unroll_factor.

I have traversed the data dependence relation vector and get the distance
and the max data dependence distance is the unroll factor.

> Richard.
>

Thanks & Regards
Ajit
 
>>> Richard.
>>>
>>
>> Thanks & Regards
>> Ajit
>>
>>>> Bootstrapped and regtested on powerpc64-linux-gnu.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>
>>>> tree-optimization, predcom: Improve unroll factor for predictive commoning
>>>>
>>>> Unroll factor is determined with max distance across loop iterations.
>>>> The logic for determining the loop unroll factor is based on
>>>> data dependency across loop iterations.
>>>>
>>>> The max distance across loop iterations is the unrolling factor
>>>> that helps in predictive commoning.
>>>>
>>>> 2024-07-11  Ajit Kumar Agarwal  <aagar...@linux.ibm.com>
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>         * tree-predcom.cc: Change in determining unroll factor with
>>>>         data dependence across loop iterations.
>>>> ---
>>>>  gcc/tree-predcom.cc | 51 ++++++++++++++++++++++++++++++++++-----------
>>>>  1 file changed, 39 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/gcc/tree-predcom.cc b/gcc/tree-predcom.cc
>>>> index 9844fee1e97..029b02f5990 100644
>>>> --- a/gcc/tree-predcom.cc
>>>> +++ b/gcc/tree-predcom.cc
>>>> @@ -409,6 +409,7 @@ public:
>>>>    /* Perform the predictive commoning optimization for chains, make this
>>>>       public for being called in callback execute_pred_commoning_cbck.  */
>>>>    void execute_pred_commoning (bitmap tmp_vars);
>>>> +  unsigned determine_unroll_factor (const vec<chain_p> &chains);
>>>>
>>>>  private:
>>>>    /* The pointer to the given loop.  */
>>>> @@ -2400,13 +2401,46 @@ pcom_worker::execute_pred_commoning_chain (chain_p 
>>>> chain,
>>>>     copies as possible.  CHAINS is the list of chains that will be
>>>>     optimized.  */
>>>>
>>>> -static unsigned
>>>> -determine_unroll_factor (const vec<chain_p> &chains)
>>>> +unsigned
>>>> +pcom_worker::determine_unroll_factor (const vec<chain_p> &chains)
>>>>  {
>>>>    chain_p chain;
>>>> -  unsigned factor = 1, af, nfactor, i;
>>>> +  unsigned factor = 1, i;
>>>>    unsigned max = param_max_unroll_times;
>>>> +  struct data_dependence_relation *ddr;
>>>> +  unsigned nfactor = 0;
>>>> +  int nzfactor = 0;
>>>> +
>>>> +  /* Best unroll factor is the maximum distance across loop
>>>> +     iterations.  */
>>>> +  FOR_EACH_VEC_ELT (m_dependences, i, ddr)
>>>> +    {
>>>> +      for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); j++)
>>>> +       {
>>>> +         lambda_vector vec = DDR_DIST_VECT (ddr, j);
>>>> +         widest_int distance = vec[j];
>>>> +         unsigned offset = distance.to_uhwi ();
>>>> +         if (offset == 0)
>>>> +           continue;
>>>> +
>>>> +         int dist = offset - nzfactor;
>>>> +         if (dist  == 0)
>>>> +           continue;
>>>>
>>>> +         if (nfactor == 0)
>>>> +           {
>>>> +             nfactor = offset;
>>>> +             nzfactor = offset;
>>>> +           }
>>>> +         else if (dist <= nzfactor)
>>>> +           nfactor = offset;
>>>> +
>>>> +         if (nfactor > 0 && nfactor <= max)
>>>> +           factor = nfactor;
>>>> +       }
>>>> +    }
>>>> +
>>>> +  int max_use = 0;
>>>>    FOR_EACH_VEC_ELT (chains, i, chain)
>>>>      {
>>>>        if (chain->type == CT_INVARIANT)
>>>> @@ -2427,17 +2461,10 @@ determine_unroll_factor (const vec<chain_p> 
>>>> &chains)
>>>>           continue;
>>>>         }
>>>>
>>>> -      /* The best unroll factor for this chain is equal to the number of
>>>> -        temporary variables that we create for it.  */
>>>> -      af = chain->length;
>>>>        if (chain->has_max_use_after)
>>>> -       af++;
>>>> -
>>>> -      nfactor = factor * af / gcd (factor, af);
>>>> -      if (nfactor <= max)
>>>> -       factor = nfactor;
>>>> +       max_use++;
>>>>      }
>>>> -
>>>> +  factor += max_use;
>>>>    return factor;
>>>>  }
>>>>
>>>> --
>>>> 2.43.5
>>>>

Re: [Patch, tree-optimization, predcom] Improve unroll factor for predictive commoning

Reply via email to