On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.ch...@gmail.com> 
>>>>>> wrote:
>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>>>>> Hi,
>>>>>> Rebased V3 for changes in previous patches.  Bootstap and test on
>>>>>> x86_64 and aarch64.
>>>>>
>>>>> why is ldist-12.c no longer distributed?  your comment says it doesn't 
>>>>> expose
>>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>>> which it clearly does.
>>>>>
>>>>> Likewise for -13.c, -14.c.  -4.c may be a questionable case but the 
>>>>> wording
>>>>> of "parallelism" still confuses me.
>>>>>
>>>>> Can you elaborate on that.  Now onto the patch:
>>>> Given we don't model data locality or memory bandwidth, whether
>>>> distribution enables loops that can be executed paralleled becomes the
>>>> major criteria for distribution.  BTW, I think a good memory stream
>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>> etc., appropriate for distribution.
>>>
>>> True.  But what means "parallel" here?  ldist-13.c if partitioned into two 
>>> loops
>>> can be executed "in parallel"
>> So if a loop by itself can be vectorized (or so called can be executed
>> paralleled), we tend to no distribute it into small ones.  But there
>> is one exception here, if the distributed small loops are recognized
>> as builtin functions, we still distribute it.  I assume it's generally
>> better to call builtin memory functions than vectorize it by GCC?
>
> Yes.
>
>>>
>>>>>
>>>>> +   Loop distribution is the dual of loop fusion.  It separates statements
>>>>> +   of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>> +   same loop header.  The major goal is to separate statements which may
>>>>> +   be vectorized from those that can't.  This pass implements 
>>>>> distribution
>>>>> +   in the following steps:
>>>>>
>>>>> misses the goal of being a memory stream optimization, not only a 
>>>>> vectorization
>>>>> enabler.  distributing a loop can also reduce register pressure.
>>>> I will revise the comment, but as explained, enabling more
>>>> vectorization is the major criteria for distribution to some extend
>>>> now.
>>>
>>> Yes, I agree -- originally it was written to optimize the stream benchmark 
>>> IIRC.
>> Let's see if any performance drop will be reported against this patch.
>> Let's see if we can create a cost model for it.
>
> Fine.
I will run some benchmarks to see if there is breakage.
>
>>>
>>>>>
>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>> didn't look
>>>>> into yet).  If you don't use that please introduce it separately.
>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>
>>>>>
>>>>> +             /* Be conservative.  If data references are not well 
>>>>> analyzed,
>>>>> +                or the two data references have the same base address and
>>>>> +                offset, add dependence and consider it alias to each 
>>>>> other.
>>>>> +                In other words, the dependence can not be resolved by
>>>>> +                runtime alias check.  */
>>>>> +             if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>> +                 || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>> +                 || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>> +                 || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>> +                 || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>> +                 || res == 0)
>>>>>
>>>>> ISTR a helper that computes whether we can handle a runtime alias check 
>>>>> for
>>>>> a specific case?
>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>  Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>> here straightforwardly.  Shall I try to further generalize the
>>>> interface as independence patch to this one?
>>>
>>> That would be nice.
>>>
>>>>>
>>>>> +  /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
>>>>> +  if (flag_tree_loop_vectorize)
>>>>> +    {
>>>>>
>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>> on flag_tree_loop_vectorize.  You seem to support versioning w/o
>>>>> that here but in other places disable versioning w/o 
>>>>> flag_tree_loop_vectorize.
>>>>> That looks somewhat inconsistent...
>>>> It is a bit complicated.  In function version_for_distribution_p, we have
>>>> +
>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>> +  if (alias_ddrs->length () > 0)
>>>> +    return true;
>>>> +
>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>> +     vectorizer is not enable because no other pass can fold it.  */
>>>> +  if (!flag_tree_loop_vectorize)
>>>> +    return false;
>>>> +
>>>>
>>>> It means we also versioning loops even if runtime alias check is
>>>> unnecessary.  I did this because we lack cost model and current
>>>> distribution may result in too many distribution?  If that's the case,
>>>> at least vectorizer will remove distributed version loop and fall back
>>>> to the original one.  Hmm, shall I change it into below code:
>>>
>>> Hmm, ok - that really ties things to vectorization apart from the
>>> builtin recognition.  So what happens if the vectorizer can vectorize
>>> both the distributed and non-distributed loops?
>> Hmm, there is no such cases.  Under condition no builtins is
>> recognized, we wouldn't distribute loop if by itself can be
>> vectorized.  Does this make sense?  Of course, cost model for memory
>> behavior can change this behavior and is wanted.
>
> So which cases _do_ we split loops then?  "more parallelism" -- but what
> does that mean exactly?  Is there any testcase that shows the desired
> splits for vectorization?
At least one of distributed loop can be executed paralleled while the
original loop can't.
Not many, but ldist-26.c is added by one of patch.
>
>>>
>>>> +
>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>> +  if (alias_ddrs->length () == 0)
>>>> +    return false;
>>>> +
>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>> +     vectorizer is not enable because no other pass can fold it.  */
>>>> +  if (!flag_tree_loop_vectorize)
>>>> +    return false;
>>>> +
>>>>
>>>> Then I can remove the check you mentioned in function
>>>> version_loop_by_alias_check?
>>>
>>> Yeah, I guess that would be easier to understand.  Need to update
>>> the comment above the alias_ddrs check though.
>> Yes the logic here is complicated.  On one hand, I want to be
>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>> can undo all "unnecessary" distribution before memory behavior is
>> modeled.
>>
>>
>>>
>>>>>
>>>>> +  /* Don't version loop if any partition is builtin.  */
>>>>> +  for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>> +    {
>>>>> +      if (partition->kind != PKIND_NORMAL)
>>>>> +       break;
>>>>> +    }
>>>>>
>>>>> why's that?  Do you handle the case where only a subset of partitions
>>>> One reason is I generally consider distributed builtin functions as a
>>>> win, thus distribution won't be canceled later in vectorizer.  Another
>>>> reason is if all distributed loops are recognized as builtins, we
>>>> can't really version with current logic because the
>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>
>>> Ah, ok.  I guess the maze was too twisted for me to see what
>>> version_for_distribution_p
>>> does ;)
>>>
>>>>> require a runtime alias check to be generated?  Thus from a loop
>>>>> generate three loops, only two of them being versioned?  The
>>>>> complication would be that both the runtime alias and non-alias
>>>>> versions would be "transformed".  Or do we rely on recursively
>>>>> distributing the distribution result, thus if we have partitions that
>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>> and recurse on those?
>>>> No, this is not precisely handled now, the pass will version the whole
>>>> loop once.  Though I think it's not very difficult to do two stages
>>>> distribution, I am not sure how useful it is.
>>>
>>> Not sure either.
>>>
>>> So to understand you're looking at loop-distribution as vectorization 
>>> enabler
>>> and pattern detector.  I think that is reasonable without a better cost 
>>> model.
>>>
>>> Note that I think the state before your patches had the sensible cost-model
>>> that it fused very conservatively and just produced the maximum distribution
>>> with the idea that the looping overhead itself is cheap.  Note that with
>>> a more "maximum" distribution the vectorizer also gets the chance to
>>> do "partial vectorization" in case profitability is different.  Of course 
>>> the
>>> setup cost may offset that in the case all loops end up vectorized...
>> Ideally, we have cost model for memory behavior in distribution.  If
>> we know distribution is beneficial in loop distribution, we can simply
>> distribute it; otherwise we pass distribution cost (including memory
>> cost as well as runtime alias check cost as an argument to
>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>> together with vectorization.
>
> Yes.  The old cost model wasn't really one thus loop distribution was never
> enabled by default.
>
>>>
>>> So in reality the vectorizer itself should "distribute away"
>>> non-vectorizable parts
>>> of the loop ... :/
>> Do we really want to put more and more things into vectorizer? :)
>> Michael's opinion (IIUC, about introducing different loop
>> optimizations as building blocks for loop optimization infrastructure)
>> is quite inspiring to me.  In this way, we can at least make different
>> optimization conservative.  In general, it's much worse to do
>> aggressive (unnecessary) transformation than simply miss the
>> opportunity?
>
> Heh, but with increasing number of "different loop optimizations" we
> miss an overall goal.  Like either each pass knows what the other
> would do (vectorize or predcom for example) or we need to have
> a meta-engine driving the building blocks and the building blocks
> divided into analysis + costing and a transform phase (and if ever
> two transforms shall apply their analysis may not be invalidated
> by an earlier one).  Integrating the simple building blocks into one
> framework is what is sexy about GRAPHITE ...
>
> For example predcom might be more beneficial than vectorization
> with v2df vectors for example.
We need to be careful/conservative for transformations with
complicated impact.  But..., as talk is easy, I think I better to stop
here. :)

Thanks,
bin
>
> Richard.
>
>>
>> Thanks,
>> bin
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> bin

Reply via email to