On Mon, Jul 17, 2017 at 1:09 PM, Christophe Lyon
<christophe.l...@linaro.org> wrote:
> On 17 July 2017 at 12:06, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon
>> <christophe.l...@linaro.org> wrote:
>>> Hi Bin,
>>>
>>> On 30 June 2017 at 12:43, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>>>>> <richard.guent...@gmail.com> wrote:
>>>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>> Rebased V3 for changes in previous patches.  Bootstrap and test on
>>>>>>>>>>> x86_64 and aarch64.
>>>>>>>>>>
>>>>>>>>>> Why is ldist-12.c no longer distributed?  Your comment says it
>>>>>>>>>> doesn't expose more "parallelism", but the point is to reduce memory
>>>>>>>>>> bandwidth requirements, which it clearly does.
>>>>>>>>>>
>>>>>>>>>> Likewise for -13.c, -14.c.  -4.c may be a questionable case, but the
>>>>>>>>>> wording of "parallelism" still confuses me.
>>>>>>>>>>
>>>>>>>>>> Can you elaborate on that?  Now onto the patch:
>>>>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>>>>> distribution enables loops that can be executed in parallel becomes
>>>>>>>>> the major criterion for distribution.
>>>>>>>>> BTW, I think a good memory stream optimization model shouldn't
>>>>>>>>> consider small loops as in ldist-12.c, etc., appropriate for
>>>>>>>>> distribution.
>>>>>>>>
>>>>>>>> True.  But what does "parallel" mean here?  ldist-13.c, if partitioned
>>>>>>>> into two loops, can be executed "in parallel".
>>>>>>> So if a loop by itself can be vectorized (or, so to speak, executed
>>>>>>> in parallel), we tend not to distribute it into small ones.  But there
>>>>>>> is one exception: if the distributed small loops are recognized
>>>>>>> as builtin functions, we still distribute them.  I assume it's generally
>>>>>>> better to call builtin memory functions than to have GCC vectorize it?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>>>>
>>>>>>>>>> +   Loop distribution is the dual of loop fusion.  It separates statements
>>>>>>>>>> +   of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>>>>>> +   same loop header.  The major goal is to separate statements which may
>>>>>>>>>> +   be vectorized from those that can't.  This pass implements distribution
>>>>>>>>>> +   in the following steps:
>>>>>>>>>>
>>>>>>>>>> This misses the goal of being a memory stream optimization, not only
>>>>>>>>>> a vectorization enabler.  Distributing a loop can also reduce
>>>>>>>>>> register pressure.
>>>>>>>>> I will revise the comment, but as explained, enabling more
>>>>>>>>> vectorization is the major criterion for distribution to some extent
>>>>>>>>> now.
>>>>>>>>
>>>>>>>> Yes, I agree -- originally it was written to optimize the stream
>>>>>>>> benchmark IIRC.
>>>>>>> Let's see if any performance drop will be reported against this patch.
>>>>>>> Let's see if we can create a cost model for it.
>>>>>>
>>>>>> Fine.
>>>>> I will run some benchmarks to see if there is breakage.
>>>>>>
>>>>>>>>>>
>>>>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n, which
>>>>>>>>>> I didn't look into yet).
>>>>>>>>>> If you don't use that, please introduce it separately.
>>>>>>>>> Hmm, yes, it is introduced in patch [01/n] and set in this patch.
>>>>>>>>>
>>>>>>>>>> +      /* Be conservative.  If data references are not well analyzed,
>>>>>>>>>> +         or the two data references have the same base address and
>>>>>>>>>> +         offset, add dependence and consider it alias to each other.
>>>>>>>>>> +         In other words, the dependence can not be resolved by
>>>>>>>>>> +         runtime alias check.  */
>>>>>>>>>> +      if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>>>>> +          || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>>>>> +          || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>>>>> +          || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>>>>> +          || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>>>>> +          || res == 0)
>>>>>>>>>>
>>>>>>>>>> ISTR a helper that computes whether we can handle a runtime alias
>>>>>>>>>> check for a specific case?
>>>>>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>>>>>> Unfortunately, it's factored out of the vectorizer's usage and doesn't
>>>>>>>>> fit here straightforwardly.  Shall I try to further generalize the
>>>>>>>>> interface as an independent patch to this one?
>>>>>>>>
>>>>>>>> That would be nice.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +  /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
>>>>>>>>>> +  if (flag_tree_loop_vectorize)
>>>>>>>>>> +    {
>>>>>>>>>>
>>>>>>>>>> So at this point I'd condition the whole runtime-alias check
>>>>>>>>>> generation on flag_tree_loop_vectorize.  You seem to support
>>>>>>>>>> versioning w/o that here, but in other places disable versioning w/o
>>>>>>>>>> flag_tree_loop_vectorize.  That looks somewhat inconsistent...
>>>>>>>>> It is a bit complicated.  In function version_for_distribution_p, we
>>>>>>>>> have
>>>>>>>>> +
>>>>>>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>>>>>>> +  if (alias_ddrs->length () > 0)
>>>>>>>>> +    return true;
>>>>>>>>> +
>>>>>>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>>>> +     vectorizer is not enabled because no other pass can fold it.  */
>>>>>>>>> +  if (!flag_tree_loop_vectorize)
>>>>>>>>> +    return false;
>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> It means we also version loops even if a runtime alias check is
>>>>>>>>> unnecessary.  I did this because we lack a cost model, and current
>>>>>>>>> distribution may result in too many distributions.  If that's the
>>>>>>>>> case, at least the vectorizer will remove the distributed version
>>>>>>>>> loop and fall back to the original one.  Hmm, shall I change it into
>>>>>>>>> the below code:
>>>>>>>>
>>>>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>>>>> builtin recognition.  So what happens if the vectorizer can vectorize
>>>>>>>> both the distributed and non-distributed loops?
>>>>>>> Hmm, there are no such cases.  Under the condition that no builtins
>>>>>>> are recognized, we wouldn't distribute a loop if it by itself can be
>>>>>>> vectorized.  Does this make sense?  Of course, a cost model for memory
>>>>>>> behavior can change this behavior and is wanted.
>>>>>>
>>>>>> So in which cases _do_ we split loops then?  "More parallelism" -- but
>>>>>> what does that mean exactly?  Is there any testcase that shows the
>>>>>> desired splits for vectorization?
>>>>> At least one of the distributed loops can be executed in parallel while
>>>>> the original loop can't.
>>>>> Not many, but ldist-26.c is added by one of the patches.
>>>>>>
>>>>>>>>
>>>>>>>>> +
>>>>>>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>>>>>>> +  if (alias_ddrs->length () == 0)
>>>>>>>>> +    return false;
>>>>>>>>> +
>>>>>>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>>>> +     vectorizer is not enabled because no other pass can fold it.  */
>>>>>>>>> +  if (!flag_tree_loop_vectorize)
>>>>>>>>> +    return false;
>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> Then I can remove the check you mentioned in function
>>>>>>>>> version_loop_by_alias_check?
>>>>>>>>
>>>>>>>> Yeah, I guess that would be easier to understand.  Need to update
>>>>>>>> the comment above the alias_ddrs check though.
>>>>>>> Yes, the logic here is complicated.  On one hand, I want to be
>>>>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that the
>>>>>>> vectorizer can undo all "unnecessary" distribution before memory
>>>>>>> behavior is modeled.
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +  /* Don't version loop if any partition is builtin.  */
>>>>>>>>>> +  for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>>>>> +    {
>>>>>>>>>> +      if (partition->kind != PKIND_NORMAL)
>>>>>>>>>> +        break;
>>>>>>>>>> +    }
>>>>>>>>>>
>>>>>>>>>> Why's that?  Do you handle the case where only a subset of partitions
>>>>>>>>> One reason is I generally consider distributed builtin functions as a
>>>>>>>>> win, thus distribution won't be canceled later in the vectorizer.
>>>>>>>>> Another reason is if all distributed loops are recognized as
>>>>>>>>> builtins, we can't really version with the current logic because the
>>>>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in the vectorizer.
>>>>>>>>
>>>>>>>> Ah, ok.  I guess the maze was too twisted for me to see what
>>>>>>>> version_for_distribution_p does ;)
>>>>>>>>
>>>>>>>>>> require a runtime alias check to be generated?  Thus from a loop
>>>>>>>>>> generate three loops, only two of them being versioned?  The
>>>>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>>>>> versions would be "transformed".  Or do we rely on recursively
>>>>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>>>>> and recurse on those?
>>>>>>>>> No, this is not precisely handled now; the pass will version the
>>>>>>>>> whole loop once.  Though I think it's not very difficult to do
>>>>>>>>> two-stage distribution, I am not sure how useful it is.
>>>>>>>>
>>>>>>>> Not sure either.
>>>>>>>>
>>>>>>>> So to understand, you're looking at loop distribution as a
>>>>>>>> vectorization enabler and pattern detector.  I think that is
>>>>>>>> reasonable without a better cost model.
>>>>>>>>
>>>>>>>> Note that I think the state before your patches had the sensible
>>>>>>>> cost model that it fused very conservatively and just produced the
>>>>>>>> maximum distribution, with the idea that the looping overhead itself
>>>>>>>> is cheap.  Note that with a more "maximum" distribution the
>>>>>>>> vectorizer also gets the chance to do "partial vectorization" in
>>>>>>>> case profitability is different.  Of course the setup cost may
>>>>>>>> offset that in the case all loops end up vectorized...
>>>>>>> Ideally, we would have a cost model for memory behavior in
>>>>>>> distribution.  If we know distribution is beneficial in loop
>>>>>>> distribution, we can simply distribute it; otherwise we pass the
>>>>>>> distribution cost (including memory cost as well as runtime alias
>>>>>>> check cost) as an argument to IFN_LOOP_DIST_ALIAS, so the vectorizer
>>>>>>> can measure the cost/benefit together with vectorization.
>>>>>>
>>>>>> Yes.  The old cost model wasn't really one, thus loop distribution
>>>>>> was never enabled by default.
>>>>>>
>>>>
>>>> Hi,
>>>> This is the updated patch.  It makes the below changes as well as
>>>> renaming ldist_alias_id to orig_loop_num.
>>>> 1) Simplifies the relation with flag_tree_loop_vectorization.  Now it
>>>>    only versions a loop if a runtime alias check is needed.
>>>> 2) Records the new versioned loop as the original loop in order to
>>>>    simplify the dominance working routine.  It also makes sense since
>>>>    the versioned loop is the same as the original loop.
>>>>
>>>> No change for the ChangeLog entry.  Bootstrap and test.  Is it OK?
>>>>
>>>
>>> I've noticed that this patch introduces regressions on armeb:
>>> FAIL: gcc.dg/torture/pr52028.c   -O3 -fomit-frame-pointer
>>> -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
>>> FAIL: gcc.dg/torture/pr52028.c   -O3 -g  execution test
>>>
>>> For instance for
>>> armeb-none-linux-gnueabihf
>>> --with-cpu=cortex-a9
>>> --with-fpu=neon-fp16
>> Hi Christophe,
>> I am having difficulty in reproducing this one.  According to your test
>> results at
>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/report-build-info.html
>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/armeb-none-linux-gnueabihf/fail-gcc-rh60-armeb-none-linux-gnueabihf-arm-cortex-a9-neon-fp16.txt
>>
>> it started at least from r248356?  Also, I didn't get any difference in
>> the generated assembly between r248356/r248342 for my local toolchain.
>> Maybe because I used different configuration options?
>>
>
> Hi Bin,
>
> Sorry, I should have shared more details.
> I noticed the regression when you committed this patch (r249994):
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/249994/report-build-info.html
>
> Checking in my validation logs, it was also present
> between r248356 and r249990 (your patch 9/13, "Simply cost model merges
> partitions with the same references")
> and has been failing since r249994 (this patch).
>
> Was it "fixed" by accident by r249990 and latent until r249994?
Ah, thanks very much for the information.  Most likely it was latent
because of the temporarily conservative loop distribution.  I filed
PR81472 for tracking.
Thanks,
bin