On 17 July 2017 at 12:06, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon
> <christophe.l...@linaro.org> wrote:
>> Hi Bin,
>>
>> On 30 June 2017 at 12:43, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.ch...@gmail.com> 
>>>>>>> wrote:
>>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.ch...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.ch...@gmail.com> 
>>>>>>>>>> wrote:
>>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <bin.ch...@arm.com> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>> Rebased V3 for changes in previous patches.  Bootstrapped and tested on
>>>>>>>>>> x86_64 and aarch64.
>>>>>>>>>
>>>>>>>>> Why is ldist-12.c no longer distributed?  Your comment says it doesn't
>>>>>>>>> expose more "parallelism", but the point is to reduce memory bandwidth
>>>>>>>>> requirements, which it clearly does.
>>>>>>>>>
>>>>>>>>> Likewise for -13.c, -14.c.  -4.c may be a questionable case, but the
>>>>>>>>> wording of "parallelism" still confuses me.
>>>>>>>>>
>>>>>>>>> Can you elaborate on that.  Now onto the patch:
>>>>>>>> Given that we don't model data locality or memory bandwidth, whether
>>>>>>>> distribution enables loops that can be executed in parallel becomes the
>>>>>>>> major criterion for distribution.  BTW, I think a good memory stream
>>>>>>>> optimization model shouldn't consider small loops like the one in
>>>>>>>> ldist-12.c, etc., appropriate for distribution.
>>>>>>>
>>>>>>> True.  But what does "parallel" mean here?  ldist-13.c, if partitioned
>>>>>>> into two loops, can be executed "in parallel".
>>>>>> So if a loop by itself can be vectorized (or, so to speak, can be executed
>>>>>> in parallel), we tend not to distribute it into small ones.  But there is
>>>>>> one exception here: if the distributed small loops are recognized as
>>>>>> builtin functions, we still distribute.  I assume it's generally better to
>>>>>> call builtin memory functions than to have GCC vectorize the loop?
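>>>>>> As a made-up illustration (not one of the testcases), a loop of this
>>>>>> shape is still split even though it vectorizes as a whole, because the
>>>>>> first partition becomes a builtin:
>>>>>>
>>>>>>   void
>>>>>>   f (int n, int *a, int *b, int *c)
>>>>>>   {
>>>>>>     for (int i = 0; i < n; i++)
>>>>>>       {
>>>>>>         a[i] = 0;         /* partition recognized as memset  */
>>>>>>         b[i] = c[i] + 1;  /* remaining loop, still vectorizable  */
>>>>>>       }
>>>>>>   }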
>>>>>
>>>>> Yes.
>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> +   Loop distribution is the dual of loop fusion.  It separates statements
>>>>>>>>> +   of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>>>>> +   same loop header.  The major goal is to separate statements which may
>>>>>>>>> +   be vectorized from those that can't.  This pass implements distribution
>>>>>>>>> +   in the following steps:
>>>>>>>>>
>>>>>>>>> This misses the goal of being a memory stream optimization, not only a
>>>>>>>>> vectorization enabler.  Distributing a loop can also reduce register
>>>>>>>>> pressure.
>>>>>>>> I will revise the comment, but as explained, enabling more vectorization
>>>>>>>> is, to some extent, the major criterion for distribution now.
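>>>>>>>> For instance (a made-up example, not from the testsuite; foo is an
>>>>>>>> arbitrary external function), a loop of this shape is distributed so
>>>>>>>> that the first statement can be vectorized on its own:
>>>>>>>>
>>>>>>>>   extern int foo (int);
>>>>>>>>
>>>>>>>>   void
>>>>>>>>   f (int n, int *a, int *b, int *c, int *d)
>>>>>>>>   {
>>>>>>>>     for (int i = 0; i < n; i++)
>>>>>>>>       {
>>>>>>>>         a[i] = b[i] + c[i];  /* vectorizable once split out  */
>>>>>>>>         d[i] = foo (d[i]);   /* the call blocks vectorization here  */
>>>>>>>>       }
>>>>>>>>   }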
>>>>>>>
>>>>>>> Yes, I agree -- originally it was written to optimize the stream 
>>>>>>> benchmark IIRC.
>>>>>> Let's see if any performance drop will be reported against this patch.
>>>>>> Let's see if we can create a cost model for it.
>>>>>
>>>>> Fine.
>>>> I will run some benchmarks to see if there is breakage.
>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n, which I
>>>>>>>>> didn't look into yet).  If you don't use it there, please introduce it
>>>>>>>>> separately.
>>>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> +             /* Be conservative.  If data references are not well analyzed,
>>>>>>>>> +                or the two data references have the same base address and
>>>>>>>>> +                offset, add a dependence and consider them aliases of each
>>>>>>>>> +                other.  In other words, the dependence cannot be resolved
>>>>>>>>> +                by a runtime alias check.  */
>>>>>>>>> +             if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>>>> +                 || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>>>> +                 || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>>>> +                 || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>>>> +                 || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>>>> +                 || res == 0)
>>>>>>>>>
>>>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check
>>>>>>>>> for a specific case?
>>>>>>>> I guess you mean runtime_alias_check_p, which I factored out previously?
>>>>>>>> Unfortunately, it was factored out of the vectorizer's usage and doesn't
>>>>>>>> fit here straightforwardly.  Shall I try to further generalize the
>>>>>>>> interface as a patch independent of this one?
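>>>>>>>> (For reference, the condition such a check boils down to is roughly of
>>>>>>>> this shape -- a simplified sketch with made-up names, not the actual
>>>>>>>> code:
>>>>>>>>
>>>>>>>>   /* The two accessed ranges must not overlap at runtime.  */
>>>>>>>>   no_alias = (addr1 + size1 <= addr2) || (addr2 + size2 <= addr1);
>>>>>>>>
>>>>>>>> which is why base, offset, init and step of both references need to be
>>>>>>>> known.)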
>>>>>>>
>>>>>>> That would be nice.
>>>>>>>
>>>>>>>>>
>>>>>>>>> +  /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
>>>>>>>>> +  if (flag_tree_loop_vectorize)
>>>>>>>>> +    {
>>>>>>>>>
>>>>>>>>> So at this point I'd condition the whole runtime-alias check generation
>>>>>>>>> on flag_tree_loop_vectorize.  You seem to support versioning without it
>>>>>>>>> here but in other places disable versioning without
>>>>>>>>> flag_tree_loop_vectorize.  That looks somewhat inconsistent...
>>>>>>>> It is a bit complicated.  In function version_for_distribution_p, we have
>>>>>>>> +
>>>>>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>>>>>> +  if (alias_ddrs->length () > 0)
>>>>>>>> +    return true;
>>>>>>>> +
>>>>>>>> +  /* Don't version the loop with a call to IFN_LOOP_DIST_ALIAS if the
>>>>>>>> +     vectorizer is not enabled, because no other pass can fold it.  */
>>>>>>>> +  if (!flag_tree_loop_vectorize)
>>>>>>>> +    return false;
>>>>>>>> +
>>>>>>>>
>>>>>>>> It means we also version loops even if a runtime alias check is
>>>>>>>> unnecessary.  I did this because we lack a cost model, and the current
>>>>>>>> distribution may result in too many distributions.  If that's the case,
>>>>>>>> at least the vectorizer will remove the distributed version of the loop
>>>>>>>> and fall back to the original one.  Hmm, shall I change it into the code
>>>>>>>> below:
>>>>>>>
>>>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>>>> builtin recognition.  So what happens if the vectorizer can vectorize
>>>>>>> both the distributed and non-distributed loops?
>>>>>> Hmm, there are no such cases.  Provided no builtin is recognized, we
>>>>>> wouldn't distribute a loop that can be vectorized by itself.  Does this
>>>>>> make sense?  Of course, a cost model for memory behavior could change
>>>>>> this behavior, and one is wanted.
>>>>>
>>>>> So in which cases _do_ we split loops then?  "More parallelism" -- but what
>>>>> does that mean exactly?  Is there any testcase that shows the desired
>>>>> splits for vectorization?
>>>> At least one of the distributed loops can be executed in parallel while the
>>>> original loop can't.
>>>> Not many, but ldist-26.c is added by one of the patches.
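>>>> As a sketch (a made-up loop, not the actual ldist-26.c), the kind of case
>>>> meant here is:
>>>>
>>>>   void
>>>>   f (int n, int *a, int *b, int *c, int *s)
>>>>   {
>>>>     for (int i = 1; i < n; i++)
>>>>       {
>>>>         a[i] = b[i] + c[i];      /* parallel once split into its own loop  */
>>>>         s[i] = s[i - 1] + a[i];  /* carried dependence keeps the original
>>>>                                     loop (and this partition) serial  */
>>>>       }
>>>>   }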
>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> +  /* Only need to version the loop if a runtime alias check is
>>>>>>>> +     necessary.  */
>>>>>>>> +  if (alias_ddrs->length () == 0)
>>>>>>>> +    return false;
>>>>>>>> +
>>>>>>>> +  /* Don't version the loop with a call to IFN_LOOP_DIST_ALIAS if the
>>>>>>>> +     vectorizer is not enabled, because no other pass can fold it.  */
>>>>>>>> +  if (!flag_tree_loop_vectorize)
>>>>>>>> +    return false;
>>>>>>>> +
>>>>>>>>
>>>>>>>> Then I can remove the check you mentioned in function
>>>>>>>> version_loop_by_alias_check?
>>>>>>>
>>>>>>> Yeah, I guess that would be easier to understand.  Need to update
>>>>>>> the comment above the alias_ddrs check though.
>>>>>> Yes, the logic here is complicated.  On one hand, I want to be
>>>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that the vectorizer
>>>>>> can undo all "unnecessary" distributions before memory behavior is
>>>>>> modeled.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> +  /* Don't version loop if any partition is builtin.  */
>>>>>>>>> +  for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>>>> +    {
>>>>>>>>> +      if (partition->kind != PKIND_NORMAL)
>>>>>>>>> +       break;
>>>>>>>>> +    }
>>>>>>>>>
>>>>>>>>> why's that?  Do you handle the case where only a subset of partitions
>>>>>>>> One reason is that I generally consider distributed builtin functions a
>>>>>>>> win, thus the distribution won't be canceled later in the vectorizer.
>>>>>>>> Another reason is that if all distributed loops are recognized as
>>>>>>>> builtins, we can't really version with the current logic, because the
>>>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in the vectorizer.
>>>>>>>
>>>>>>> Ah, ok.  I guess the maze was too twisted for me to see what
>>>>>>> version_for_distribution_p
>>>>>>> does ;)
>>>>>>>
>>>>>>>>> require a runtime alias check to be generated?  Thus from a loop
>>>>>>>>> generate three loops, only two of them being versioned?  The
>>>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>>>> versions would be "transformed".  Or do we rely on recursively
>>>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>>>> and recurse on those?
>>>>>>>> No, this is not handled precisely now; the pass will version the whole
>>>>>>>> loop once.  Though I think it's not very difficult to do two-stage
>>>>>>>> distribution, I am not sure how useful it would be.
>>>>>>>
>>>>>>> Not sure either.
>>>>>>>
>>>>>>> So, to make sure I understand: you're looking at loop distribution as a
>>>>>>> vectorization enabler and pattern detector.  I think that is reasonable
>>>>>>> without a better cost model.
>>>>>>>
>>>>>>> Note that I think the state before your patches had the sensible cost
>>>>>>> model that it fused very conservatively and just produced the maximum
>>>>>>> distribution, with the idea that the looping overhead itself is cheap.
>>>>>>> Note that with a more "maximum" distribution the vectorizer also gets the
>>>>>>> chance to do "partial vectorization" in case profitability differs.  Of
>>>>>>> course the setup cost may offset that in the case all loops end up
>>>>>>> vectorized...
>>>>>> Ideally, we would have a cost model for memory behavior in distribution.
>>>>>> If we know distribution is beneficial on its own, we can simply distribute
>>>>>> the loop; otherwise we pass the distribution cost (including memory cost
>>>>>> as well as runtime alias check cost) as an argument to
>>>>>> IFN_LOOP_DIST_ALIAS, so the vectorizer can weigh the cost/benefit together
>>>>>> with vectorization.
>>>>>
>>>>> Yes.  The old cost model wasn't really one, thus loop distribution was
>>>>> never enabled by default.
>>>>>
>>>
>>> Hi,
>>> This is the updated patch.  It makes the changes below as well as renaming
>>> ldist_alias_id to orig_loop_num.
>>> 1) Simplifies the relation with flag_tree_loop_vectorize.  Now it only
>>> versions the loop if a runtime alias check is needed.
>>> 2) Records the new versioned loop as the original loop in order to simplify
>>> the routine working on dominance info.  It also makes sense since the
>>> versioned loop is the same as the original loop.  A sketch of the versioned
>>> form is below.
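>>>
>>> Roughly, the versioned form looks like this (pseudo-code, simplified, not
>>> the exact GIMPLE):
>>>
>>>   if (IFN_LOOP_DIST_ALIAS (orig_loop_num, <runtime alias check>))
>>>     {
>>>       /* Distributed loops.  */
>>>     }
>>>   else
>>>     {
>>>       /* Original loop, kept as the fallback version.  */
>>>     }
>>>
>>> We rely on the vectorizer to fold the IFN_LOOP_DIST_ALIAS call later, so
>>> that either the distributed loops or the original loop survives.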
>>>
>>> No change to the ChangeLog entry.  Bootstrapped and tested.  Is it OK?
>>>
>>
>> I've noticed that this patch introduces regressions on armeb:
>> FAIL:    gcc.dg/torture/pr52028.c   -O3 -fomit-frame-pointer
>> -funroll-loops -fpeel-loops -ftracer -finline-functions  execution
>> test
>> FAIL:    gcc.dg/torture/pr52028.c   -O3 -g  execution test
>>
>> For instance for
>> armeb-none-linux-gnueabihf
>> --with-cpu=cortex-a9
>> --with-fpu=neon-fp16
> Hi Christophe,
> I am having difficulty reproducing this one.  According to your test
> results at
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/report-build-info.html
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/armeb-none-linux-gnueabihf/fail-gcc-rh60-armeb-none-linux-gnueabihf-arm-cortex-a9-neon-fp16.txt
>
> It seems it started at least from r248356?  Also I didn't see any difference
> in the generated assembly between r248356 and r248342 with my local
> toolchain.  Maybe that's because I used different configuration options?
>

Hi Bin,

Sorry I should have shared more details.
I noticed the regression when you committed this patch (r249994):
http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/249994/report-build-info.html

Checking my validation logs, it was also present between r248356 and r249990
(your patch 9/13, "Simply cost model merges partitions with the same
references"), and has been failing since r249994 (this patch).

Was it "fixed" by accident by r249990 and latent until r249994?

HTH

Thanks,

Christophe

> Thanks,
> bin
>>
>> Christophe
>>
>>> Thanks,
>>> bin
