On Fri, Jun 30, 2017 at 12:43 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener >> <richard.guent...@gmail.com> wrote: >>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener >>>> <richard.guent...@gmail.com> wrote: >>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener >>>>>> <richard.guent...@gmail.com> wrote: >>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.ch...@gmail.com> >>>>>>> wrote: >>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.ch...@gmail.com> >>>>>>>> wrote: >>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>>>>>>>>> Hi, >>>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on >>>>>>>> x86_64 and aarch64. >>>>>>> >>>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't >>>>>>> expose >>>>>>> more "parallelism" but the point is to reduce memory bandwith >>>>>>> requirements >>>>>>> which it clearly does. >>>>>>> >>>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the >>>>>>> wording >>>>>>> of "parallelism" still confuses me. >>>>>>> >>>>>>> Can you elaborate on that. Now onto the patch: >>>>>> Given we don't model data locality or memory bandwidth, whether >>>>>> distribution enables loops that can be executed paralleled becomes the >>>>>> major criteria for distribution. BTW, I think a good memory stream >>>>>> optimization model shouldn't consider small loops as in ldist-12.c, >>>>>> etc., appropriate for distribution. >>>>> >>>>> True. But what means "parallel" here? ldist-13.c if partitioned into >>>>> two loops >>>>> can be executed "in parallel" >>>> So if a loop by itself can be vectorized (or so called can be executed >>>> paralleled), we tend to no distribute it into small ones. But there >>>> is one exception here, if the distributed small loops are recognized >>>> as builtin functions, we still distribute it. I assume it's generally >>>> better to call builtin memory functions than vectorize it by GCC? >>> >>> Yes. >>> >>>>> >>>>>>> >>>>>>> + Loop distribution is the dual of loop fusion. It separates >>>>>>> statements >>>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with >>>>>>> the >>>>>>> + same loop header. The major goal is to separate statements which >>>>>>> may >>>>>>> + be vectorized from those that can't. This pass implements >>>>>>> distribution >>>>>>> + in the following steps: >>>>>>> >>>>>>> misses the goal of being a memory stream optimization, not only a >>>>>>> vectorization >>>>>>> enabler. distributing a loop can also reduce register pressure. >>>>>> I will revise the comment, but as explained, enabling more >>>>>> vectorization is the major criteria for distribution to some extend >>>>>> now. >>>>> >>>>> Yes, I agree -- originally it was written to optimize the stream >>>>> benchmark IIRC. >>>> Let's see if any performance drop will be reported against this patch. >>>> Let's see if we can create a cost model for it. >>> >>> Fine. >> I will run some benchmarks to see if there is breakage. >>> >>>>> >>>>>>> >>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I >>>>>>> didn't look >>>>>>> into yet). If you don't use that please introduce it separately. >>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch. >>>>>> >>>>>>> >>>>>>> + /* Be conservative. If data references are not well >>>>>>> analyzed, >>>>>>> + or the two data references have the same base address >>>>>>> and >>>>>>> + offset, add dependence and consider it alias to each >>>>>>> other. >>>>>>> + In other words, the dependence can not be resolved by >>>>>>> + runtime alias check. */ >>>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2) >>>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2) >>>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2) >>>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1)) >>>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2)) >>>>>>> + || res == 0) >>>>>>> >>>>>>> ISTR a helper that computes whether we can handle a runtime alias check >>>>>>> for >>>>>>> a specific case? >>>>>> I guess you mean runtime_alias_check_p that I factored out previously? >>>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit >>>>>> here straightforwardly. Shall I try to further generalize the >>>>>> interface as independence patch to this one? >>>>> >>>>> That would be nice. >>>>> >>>>>>> >>>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */ >>>>>>> + if (flag_tree_loop_vectorize) >>>>>>> + { >>>>>>> >>>>>>> so at this point I'd condition the whole runtime-alias check generating >>>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o >>>>>>> that here but in other places disable versioning w/o >>>>>>> flag_tree_loop_vectorize. >>>>>>> That looks somewhat inconsistent... >>>>>> It is a bit complicated. In function version_for_distribution_p, we have >>>>>> + >>>>>> + /* Need to version loop if runtime alias check is necessary. */ >>>>>> + if (alias_ddrs->length () > 0) >>>>>> + return true; >>>>>> + >>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if >>>>>> + vectorizer is not enable because no other pass can fold it. */ >>>>>> + if (!flag_tree_loop_vectorize) >>>>>> + return false; >>>>>> + >>>>>> >>>>>> It means we also versioning loops even if runtime alias check is >>>>>> unnecessary. I did this because we lack cost model and current >>>>>> distribution may result in too many distribution? If that's the case, >>>>>> at least vectorizer will remove distributed version loop and fall back >>>>>> to the original one. Hmm, shall I change it into below code: >>>>> >>>>> Hmm, ok - that really ties things to vectorization apart from the >>>>> builtin recognition. So what happens if the vectorizer can vectorize >>>>> both the distributed and non-distributed loops? >>>> Hmm, there is no such cases. Under condition no builtins is >>>> recognized, we wouldn't distribute loop if by itself can be >>>> vectorized. Does this make sense? Of course, cost model for memory >>>> behavior can change this behavior and is wanted. >>> >>> So which cases _do_ we split loops then? "more parallelism" -- but what >>> does that mean exactly? Is there any testcase that shows the desired >>> splits for vectorization? >> At least one of distributed loop can be executed paralleled while the >> original loop can't. >> Not many, but ldist-26.c is added by one of patch. >>> >>>>> >>>>>> + >>>>>> + /* Need to version loop if runtime alias check is necessary. */ >>>>>> + if (alias_ddrs->length () == 0) >>>>>> + return false; >>>>>> + >>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if >>>>>> + vectorizer is not enable because no other pass can fold it. */ >>>>>> + if (!flag_tree_loop_vectorize) >>>>>> + return false; >>>>>> + >>>>>> >>>>>> Then I can remove the check you mentioned in function >>>>>> version_loop_by_alias_check? >>>>> >>>>> Yeah, I guess that would be easier to understand. Need to update >>>>> the comment above the alias_ddrs check though. >>>> Yes the logic here is complicated. On one hand, I want to be >>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer >>>> can undo all "unnecessary" distribution before memory behavior is >>>> modeled. >>>> >>>> >>>>> >>>>>>> >>>>>>> + /* Don't version loop if any partition is builtin. */ >>>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i) >>>>>>> + { >>>>>>> + if (partition->kind != PKIND_NORMAL) >>>>>>> + break; >>>>>>> + } >>>>>>> >>>>>>> why's that? Do you handle the case where only a subset of partitions >>>>>> One reason is I generally consider distributed builtin functions as a >>>>>> win, thus distribution won't be canceled later in vectorizer. Another >>>>>> reason is if all distributed loops are recognized as builtins, we >>>>>> can't really version with current logic because the >>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer. >>>>> >>>>> Ah, ok. I guess the maze was too twisted for me to see what >>>>> version_for_distribution_p >>>>> does ;) >>>>> >>>>>>> require a runtime alias check to be generated? Thus from a loop >>>>>>> generate three loops, only two of them being versioned? The >>>>>>> complication would be that both the runtime alias and non-alias >>>>>>> versions would be "transformed". Or do we rely on recursively >>>>>>> distributing the distribution result, thus if we have partitions that >>>>>>> can be handled w/o runtime alias check fuse the remaining partitions >>>>>>> and recurse on those? >>>>>> No, this is not precisely handled now, the pass will version the whole >>>>>> loop once. Though I think it's not very difficult to do two stages >>>>>> distribution, I am not sure how useful it is. >>>>> >>>>> Not sure either. >>>>> >>>>> So to understand you're looking at loop-distribution as vectorization >>>>> enabler >>>>> and pattern detector. I think that is reasonable without a better cost >>>>> model. >>>>> >>>>> Note that I think the state before your patches had the sensible >>>>> cost-model >>>>> that it fused very conservatively and just produced the maximum >>>>> distribution >>>>> with the idea that the looping overhead itself is cheap. Note that with >>>>> a more "maximum" distribution the vectorizer also gets the chance to >>>>> do "partial vectorization" in case profitability is different. Of course >>>>> the >>>>> setup cost may offset that in the case all loops end up vectorized... >>>> Ideally, we have cost model for memory behavior in distribution. If >>>> we know distribution is beneficial in loop distribution, we can simply >>>> distribute it; otherwise we pass distribution cost (including memory >>>> cost as well as runtime alias check cost as an argument to >>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit >>>> together with vectorization. >>> >>> Yes. The old cost model wasn't really one thus loop distribution was never >>> enabled by default. >>> > > Hi, > This is updated patch. It makes below changes as well as renaming > ldist_alias_id to orig_loop_num. > 1) Simplifies relation with flag_tree_loop_vectorization. Now it only > versions loop if runtime alias check is needed. > 2) Record the new versioned loop as original loop in order to simplify > dominance working routine. It also makes sense since versioned loop > is the one same with the original loop. > > No change for ChangeLog entry. Bootstrap and test. Is it OK?
Ok. The profile stuff might need updating now, please double-check. Thanks, Richard. > Thanks, > bin