Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check

Richard Biener Mon, 03 Jul 2017 04:50:40 -0700

On Fri, Jun 30, 2017 at 12:43 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.ch...@gmail.com> 
>>>>>>> wrote:
>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.ch...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>>>>>>> Hi,
>>>>>>>> Rebased V3 for changes in previous patches.  Bootstap and test on
>>>>>>>> x86_64 and aarch64.
>>>>>>>
>>>>>>> why is ldist-12.c no longer distributed?  your comment says it doesn't 
>>>>>>> expose
>>>>>>> more "parallelism" but the point is to reduce memory bandwith 
>>>>>>> requirements
>>>>>>> which it clearly does.
>>>>>>>
>>>>>>> Likewise for -13.c, -14.c.  -4.c may be a questionable case but the 
>>>>>>> wording
>>>>>>> of "parallelism" still confuses me.
>>>>>>>
>>>>>>> Can you elaborate on that.  Now onto the patch:
>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>>> major criteria for distribution.  BTW, I think a good memory stream
>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>> etc., appropriate for distribution.
>>>>>
>>>>> True.  But what means "parallel" here?  ldist-13.c if partitioned into 
>>>>> two loops
>>>>> can be executed "in parallel"
>>>> So if a loop by itself can be vectorized (or so called can be executed
>>>> paralleled), we tend to no distribute it into small ones.  But there
>>>> is one exception here, if the distributed small loops are recognized
>>>> as builtin functions, we still distribute it.  I assume it's generally
>>>> better to call builtin memory functions than vectorize it by GCC?
>>>
>>> Yes.
>>>
>>>>>
>>>>>>>
>>>>>>> +   Loop distribution is the dual of loop fusion.  It separates 
>>>>>>> statements
>>>>>>> +   of a loop (or loop nest) into multiple loops (or loop nests) with 
>>>>>>> the
>>>>>>> +   same loop header.  The major goal is to separate statements which 
>>>>>>> may
>>>>>>> +   be vectorized from those that can't.  This pass implements 
>>>>>>> distribution
>>>>>>> +   in the following steps:
>>>>>>>
>>>>>>> misses the goal of being a memory stream optimization, not only a 
>>>>>>> vectorization
>>>>>>> enabler.  distributing a loop can also reduce register pressure.
>>>>>> I will revise the comment, but as explained, enabling more
>>>>>> vectorization is the major criteria for distribution to some extend
>>>>>> now.
>>>>>
>>>>> Yes, I agree -- originally it was written to optimize the stream 
>>>>> benchmark IIRC.
>>>> Let's see if any performance drop will be reported against this patch.
>>>> Let's see if we can create a cost model for it.
>>>
>>> Fine.
>> I will run some benchmarks to see if there is breakage.
>>>
>>>>>
>>>>>>>
>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>>>> didn't look
>>>>>>> into yet).  If you don't use that please introduce it separately.
>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>>
>>>>>>>
>>>>>>> +             /* Be conservative.  If data references are not well 
>>>>>>> analyzed,
>>>>>>> +                or the two data references have the same base address 
>>>>>>> and
>>>>>>> +                offset, add dependence and consider it alias to each 
>>>>>>> other.
>>>>>>> +                In other words, the dependence can not be resolved by
>>>>>>> +                runtime alias check.  */
>>>>>>> +             if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>> +                 || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>> +                 || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>> +                 || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>> +                 || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>> +                 || res == 0)
>>>>>>>
>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check 
>>>>>>> for
>>>>>>> a specific case?
>>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>>>  Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>>>> here straightforwardly.  Shall I try to further generalize the
>>>>>> interface as independence patch to this one?
>>>>>
>>>>> That would be nice.
>>>>>
>>>>>>>
>>>>>>> +  /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
>>>>>>> +  if (flag_tree_loop_vectorize)
>>>>>>> +    {
>>>>>>>
>>>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>>>> on flag_tree_loop_vectorize.  You seem to support versioning w/o
>>>>>>> that here but in other places disable versioning w/o 
>>>>>>> flag_tree_loop_vectorize.
>>>>>>> That looks somewhat inconsistent...
>>>>>> It is a bit complicated.  In function version_for_distribution_p, we have
>>>>>> +
>>>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>>>> +  if (alias_ddrs->length () > 0)
>>>>>> +    return true;
>>>>>> +
>>>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>> +     vectorizer is not enable because no other pass can fold it.  */
>>>>>> +  if (!flag_tree_loop_vectorize)
>>>>>> +    return false;
>>>>>> +
>>>>>>
>>>>>> It means we also versioning loops even if runtime alias check is
>>>>>> unnecessary.  I did this because we lack cost model and current
>>>>>> distribution may result in too many distribution?  If that's the case,
>>>>>> at least vectorizer will remove distributed version loop and fall back
>>>>>> to the original one.  Hmm, shall I change it into below code:
>>>>>
>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>> builtin recognition.  So what happens if the vectorizer can vectorize
>>>>> both the distributed and non-distributed loops?
>>>> Hmm, there is no such cases.  Under condition no builtins is
>>>> recognized, we wouldn't distribute loop if by itself can be
>>>> vectorized.  Does this make sense?  Of course, cost model for memory
>>>> behavior can change this behavior and is wanted.
>>>
>>> So which cases _do_ we split loops then?  "more parallelism" -- but what
>>> does that mean exactly?  Is there any testcase that shows the desired
>>> splits for vectorization?
>> At least one of distributed loop can be executed paralleled while the
>> original loop can't.
>> Not many, but ldist-26.c is added by one of patch.
>>>
>>>>>
>>>>>> +
>>>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>>>> +  if (alias_ddrs->length () == 0)
>>>>>> +    return false;
>>>>>> +
>>>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>> +     vectorizer is not enable because no other pass can fold it.  */
>>>>>> +  if (!flag_tree_loop_vectorize)
>>>>>> +    return false;
>>>>>> +
>>>>>>
>>>>>> Then I can remove the check you mentioned in function
>>>>>> version_loop_by_alias_check?
>>>>>
>>>>> Yeah, I guess that would be easier to understand.  Need to update
>>>>> the comment above the alias_ddrs check though.
>>>> Yes the logic here is complicated.  On one hand, I want to be
>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>>>> can undo all "unnecessary" distribution before memory behavior is
>>>> modeled.
>>>>
>>>>
>>>>>
>>>>>>>
>>>>>>> +  /* Don't version loop if any partition is builtin.  */
>>>>>>> +  for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>> +    {
>>>>>>> +      if (partition->kind != PKIND_NORMAL)
>>>>>>> +       break;
>>>>>>> +    }
>>>>>>>
>>>>>>> why's that?  Do you handle the case where only a subset of partitions
>>>>>> One reason is I generally consider distributed builtin functions as a
>>>>>> win, thus distribution won't be canceled later in vectorizer.  Another
>>>>>> reason is if all distributed loops are recognized as builtins, we
>>>>>> can't really version with current logic because the
>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>>>
>>>>> Ah, ok.  I guess the maze was too twisted for me to see what
>>>>> version_for_distribution_p
>>>>> does ;)
>>>>>
>>>>>>> require a runtime alias check to be generated?  Thus from a loop
>>>>>>> generate three loops, only two of them being versioned?  The
>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>> versions would be "transformed".  Or do we rely on recursively
>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>> and recurse on those?
>>>>>> No, this is not precisely handled now, the pass will version the whole
>>>>>> loop once.  Though I think it's not very difficult to do two stages
>>>>>> distribution, I am not sure how useful it is.
>>>>>
>>>>> Not sure either.
>>>>>
>>>>> So to understand you're looking at loop-distribution as vectorization 
>>>>> enabler
>>>>> and pattern detector.  I think that is reasonable without a better cost 
>>>>> model.
>>>>>
>>>>> Note that I think the state before your patches had the sensible 
>>>>> cost-model
>>>>> that it fused very conservatively and just produced the maximum 
>>>>> distribution
>>>>> with the idea that the looping overhead itself is cheap.  Note that with
>>>>> a more "maximum" distribution the vectorizer also gets the chance to
>>>>> do "partial vectorization" in case profitability is different.  Of course 
>>>>> the
>>>>> setup cost may offset that in the case all loops end up vectorized...
>>>> Ideally, we have cost model for memory behavior in distribution.  If
>>>> we know distribution is beneficial in loop distribution, we can simply
>>>> distribute it; otherwise we pass distribution cost (including memory
>>>> cost as well as runtime alias check cost as an argument to
>>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>>>> together with vectorization.
>>>
>>> Yes.  The old cost model wasn't really one thus loop distribution was never
>>> enabled by default.
>>>
>
> Hi,
> This is updated patch.  It makes below changes as well as renaming
> ldist_alias_id to orig_loop_num.
> 1) Simplifies relation with flag_tree_loop_vectorization.  Now it only
> versions loop if runtime alias check is needed.
> 2) Record the new versioned loop as original loop in order to simplify
> dominance working routine.  It also makes sense since versioned loop
> is the one same with the original loop.
>
> No change for ChangeLog entry.  Bootstrap and test.  Is it OK?


Ok.

The profile stuff might need updating now, please double-check.

Thanks,
Richard.

> Thanks,
> bin

Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check

Reply via email to