On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon <christophe.l...@linaro.org> wrote: > Hi Bin, > > On 30 June 2017 at 12:43, Bin.Cheng <amker.ch...@gmail.com> wrote: >> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener >>> <richard.guent...@gmail.com> wrote: >>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener >>>>> <richard.guent...@gmail.com> wrote: >>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener >>>>>>> <richard.guent...@gmail.com> wrote: >>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.ch...@gmail.com> >>>>>>>> wrote: >>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.ch...@gmail.com> >>>>>>>>> wrote: >>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>>>>>>>>>> Hi, >>>>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on >>>>>>>>> x86_64 and aarch64. >>>>>>>> >>>>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't >>>>>>>> expose >>>>>>>> more "parallelism" but the point is to reduce memory bandwith >>>>>>>> requirements >>>>>>>> which it clearly does. >>>>>>>> >>>>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the >>>>>>>> wording >>>>>>>> of "parallelism" still confuses me. >>>>>>>> >>>>>>>> Can you elaborate on that. Now onto the patch: >>>>>>> Given we don't model data locality or memory bandwidth, whether >>>>>>> distribution enables loops that can be executed paralleled becomes the >>>>>>> major criteria for distribution. BTW, I think a good memory stream >>>>>>> optimization model shouldn't consider small loops as in ldist-12.c, >>>>>>> etc., appropriate for distribution. >>>>>> >>>>>> True. But what means "parallel" here? ldist-13.c if partitioned into >>>>>> two loops >>>>>> can be executed "in parallel" >>>>> So if a loop by itself can be vectorized (or so called can be executed >>>>> paralleled), we tend to no distribute it into small ones. But there >>>>> is one exception here, if the distributed small loops are recognized >>>>> as builtin functions, we still distribute it. I assume it's generally >>>>> better to call builtin memory functions than vectorize it by GCC? >>>> >>>> Yes. >>>> >>>>>> >>>>>>>> >>>>>>>> + Loop distribution is the dual of loop fusion. It separates >>>>>>>> statements >>>>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with >>>>>>>> the >>>>>>>> + same loop header. The major goal is to separate statements which >>>>>>>> may >>>>>>>> + be vectorized from those that can't. This pass implements >>>>>>>> distribution >>>>>>>> + in the following steps: >>>>>>>> >>>>>>>> misses the goal of being a memory stream optimization, not only a >>>>>>>> vectorization >>>>>>>> enabler. distributing a loop can also reduce register pressure. >>>>>>> I will revise the comment, but as explained, enabling more >>>>>>> vectorization is the major criteria for distribution to some extend >>>>>>> now. >>>>>> >>>>>> Yes, I agree -- originally it was written to optimize the stream >>>>>> benchmark IIRC. >>>>> Let's see if any performance drop will be reported against this patch. >>>>> Let's see if we can create a cost model for it. >>>> >>>> Fine. >>> I will run some benchmarks to see if there is breakage. >>>> >>>>>> >>>>>>>> >>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I >>>>>>>> didn't look >>>>>>>> into yet). If you don't use that please introduce it separately. >>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch. >>>>>>> >>>>>>>> >>>>>>>> + /* Be conservative. If data references are not well >>>>>>>> analyzed, >>>>>>>> + or the two data references have the same base address >>>>>>>> and >>>>>>>> + offset, add dependence and consider it alias to each >>>>>>>> other. >>>>>>>> + In other words, the dependence can not be resolved by >>>>>>>> + runtime alias check. */ >>>>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2) >>>>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2) >>>>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2) >>>>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP >>>>>>>> (dr1)) >>>>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP >>>>>>>> (dr2)) >>>>>>>> + || res == 0) >>>>>>>> >>>>>>>> ISTR a helper that computes whether we can handle a runtime alias >>>>>>>> check for >>>>>>>> a specific case? >>>>>>> I guess you mean runtime_alias_check_p that I factored out previously? >>>>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit >>>>>>> here straightforwardly. Shall I try to further generalize the >>>>>>> interface as independence patch to this one? >>>>>> >>>>>> That would be nice. >>>>>> >>>>>>>> >>>>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */ >>>>>>>> + if (flag_tree_loop_vectorize) >>>>>>>> + { >>>>>>>> >>>>>>>> so at this point I'd condition the whole runtime-alias check generating >>>>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o >>>>>>>> that here but in other places disable versioning w/o >>>>>>>> flag_tree_loop_vectorize. >>>>>>>> That looks somewhat inconsistent... >>>>>>> It is a bit complicated. In function version_for_distribution_p, we >>>>>>> have >>>>>>> + >>>>>>> + /* Need to version loop if runtime alias check is necessary. */ >>>>>>> + if (alias_ddrs->length () > 0) >>>>>>> + return true; >>>>>>> + >>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if >>>>>>> + vectorizer is not enable because no other pass can fold it. */ >>>>>>> + if (!flag_tree_loop_vectorize) >>>>>>> + return false; >>>>>>> + >>>>>>> >>>>>>> It means we also versioning loops even if runtime alias check is >>>>>>> unnecessary. I did this because we lack cost model and current >>>>>>> distribution may result in too many distribution? If that's the case, >>>>>>> at least vectorizer will remove distributed version loop and fall back >>>>>>> to the original one. Hmm, shall I change it into below code: >>>>>> >>>>>> Hmm, ok - that really ties things to vectorization apart from the >>>>>> builtin recognition. So what happens if the vectorizer can vectorize >>>>>> both the distributed and non-distributed loops? >>>>> Hmm, there is no such cases. Under condition no builtins is >>>>> recognized, we wouldn't distribute loop if by itself can be >>>>> vectorized. Does this make sense? Of course, cost model for memory >>>>> behavior can change this behavior and is wanted. >>>> >>>> So which cases _do_ we split loops then? "more parallelism" -- but what >>>> does that mean exactly? Is there any testcase that shows the desired >>>> splits for vectorization? >>> At least one of distributed loop can be executed paralleled while the >>> original loop can't. >>> Not many, but ldist-26.c is added by one of patch. >>>> >>>>>> >>>>>>> + >>>>>>> + /* Need to version loop if runtime alias check is necessary. */ >>>>>>> + if (alias_ddrs->length () == 0) >>>>>>> + return false; >>>>>>> + >>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if >>>>>>> + vectorizer is not enable because no other pass can fold it. */ >>>>>>> + if (!flag_tree_loop_vectorize) >>>>>>> + return false; >>>>>>> + >>>>>>> >>>>>>> Then I can remove the check you mentioned in function >>>>>>> version_loop_by_alias_check? >>>>>> >>>>>> Yeah, I guess that would be easier to understand. Need to update >>>>>> the comment above the alias_ddrs check though. >>>>> Yes the logic here is complicated. On one hand, I want to be >>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer >>>>> can undo all "unnecessary" distribution before memory behavior is >>>>> modeled. >>>>> >>>>> >>>>>> >>>>>>>> >>>>>>>> + /* Don't version loop if any partition is builtin. */ >>>>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i) >>>>>>>> + { >>>>>>>> + if (partition->kind != PKIND_NORMAL) >>>>>>>> + break; >>>>>>>> + } >>>>>>>> >>>>>>>> why's that? Do you handle the case where only a subset of partitions >>>>>>> One reason is I generally consider distributed builtin functions as a >>>>>>> win, thus distribution won't be canceled later in vectorizer. Another >>>>>>> reason is if all distributed loops are recognized as builtins, we >>>>>>> can't really version with current logic because the >>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer. >>>>>> >>>>>> Ah, ok. I guess the maze was too twisted for me to see what >>>>>> version_for_distribution_p >>>>>> does ;) >>>>>> >>>>>>>> require a runtime alias check to be generated? Thus from a loop >>>>>>>> generate three loops, only two of them being versioned? The >>>>>>>> complication would be that both the runtime alias and non-alias >>>>>>>> versions would be "transformed". Or do we rely on recursively >>>>>>>> distributing the distribution result, thus if we have partitions that >>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions >>>>>>>> and recurse on those? >>>>>>> No, this is not precisely handled now, the pass will version the whole >>>>>>> loop once. Though I think it's not very difficult to do two stages >>>>>>> distribution, I am not sure how useful it is. >>>>>> >>>>>> Not sure either. >>>>>> >>>>>> So to understand you're looking at loop-distribution as vectorization >>>>>> enabler >>>>>> and pattern detector. I think that is reasonable without a better cost >>>>>> model. >>>>>> >>>>>> Note that I think the state before your patches had the sensible >>>>>> cost-model >>>>>> that it fused very conservatively and just produced the maximum >>>>>> distribution >>>>>> with the idea that the looping overhead itself is cheap. Note that with >>>>>> a more "maximum" distribution the vectorizer also gets the chance to >>>>>> do "partial vectorization" in case profitability is different. Of >>>>>> course the >>>>>> setup cost may offset that in the case all loops end up vectorized... >>>>> Ideally, we have cost model for memory behavior in distribution. If >>>>> we know distribution is beneficial in loop distribution, we can simply >>>>> distribute it; otherwise we pass distribution cost (including memory >>>>> cost as well as runtime alias check cost as an argument to >>>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit >>>>> together with vectorization. >>>> >>>> Yes. The old cost model wasn't really one thus loop distribution was never >>>> enabled by default. >>>> >> >> Hi, >> This is updated patch. It makes below changes as well as renaming >> ldist_alias_id to orig_loop_num. >> 1) Simplifies relation with flag_tree_loop_vectorization. Now it only >> versions loop if runtime alias check is needed. >> 2) Record the new versioned loop as original loop in order to simplify >> dominance working routine. It also makes sense since versioned loop >> is the one same with the original loop. >> >> No change for ChangeLog entry. Bootstrap and test. Is it OK? >> > > I've noticed that this patch introduces regressions on armeb: > FAIL: gcc.dg/torture/pr52028.c -O3 -fomit-frame-pointer > -funroll-loops -fpeel-loops -ftracer -finline-functions execution > test > FAIL: gcc.dg/torture/pr52028.c -O3 -g execution test Thanks for reporting this, I will create a PR and investigate.
Thanks, bin > > For instance for > armeb-none-linux-gnueabihf > --with-cpu=cortex-a9 > --with-fpu=neon-fp16 > > Christophe > >> Thanks, >> bin