On Tue, Nov 13, 2018 at 3:33 PM Richard Biener <richard.guent...@gmail.com> wrote: > > On Tue, Nov 13, 2018 at 10:48 AM Kyrill Tkachov > <kyrylo.tkac...@foss.arm.com> wrote: > > > > > > On 13/11/18 09:28, Richard Biener wrote: > > > On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov > > > <kyrylo.tkac...@foss.arm.com> wrote: > > >> Hi Richard, > > >> > > >> On 13/11/18 08:24, Richard Biener wrote: > > >>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov > > >>> <kyrylo.tkac...@foss.arm.com> wrote: > > >>>> On 12/11/18 14:10, Richard Biener wrote: > > >>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov > > >>>>> <kyrylo.tkac...@foss.arm.com> wrote: > > >>>>>> On 09/11/18 12:18, Richard Biener wrote: > > >>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov > > >>>>>>> <kyrylo.tkac...@foss.arm.com> wrote: > > >>>>>>>> Hi all, > > >>>>>>>> > > >>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be > > >>>>>>>> due to unrolling: > > >>>>>>>> > > >>>>>>>> fully_peel_me: > > >>>>>>>> mov x1, 5 > > >>>>>>>> ptrue p1.d, all > > >>>>>>>> whilelo p0.d, xzr, x1 > > >>>>>>>> ld1d z0.d, p0/z, [x0] > > >>>>>>>> fadd z0.d, z0.d, z0.d > > >>>>>>>> st1d z0.d, p0, [x0] > > >>>>>>>> cntd x2 > > >>>>>>>> addvl x3, x0, #1 > > >>>>>>>> whilelo p0.d, x2, x1 > > >>>>>>>> beq .L1 > > >>>>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl] > > >>>>>>>> fadd z0.d, z0.d, z0.d > > >>>>>>>> st1d z0.d, p0, [x3] > > >>>>>>>> cntw x2 > > >>>>>>>> incb x0, all, mul #2 > > >>>>>>>> whilelo p0.d, x2, x1 > > >>>>>>>> beq .L1 > > >>>>>>>> ld1d z0.d, p0/z, [x0] > > >>>>>>>> fadd z0.d, z0.d, z0.d > > >>>>>>>> st1d z0.d, p0, [x0] > > >>>>>>>> .L1: > > >>>>>>>> ret > > >>>>>>>> > > >>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the > > >>>>>>>> compiler doesn't know the loop iteration count. > > >>>>>>>> For such loops we don't want to unroll if we don't end up > > >>>>>>>> eliminating branches as this just bloats code size > > >>>>>>>> and hurts icache performance. > > >>>>>>>> > > >>>>>>>> This patch introduces a new unroll-known-loop-iterations-only > > >>>>>>>> param that disables cunroll when the loop iteration > > >>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more > > >>>>>>>> often for SVE VLA code, but it does help some > > >>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration > > >>>>>>>> count are not unrolled when it doesn't eliminate > > >>>>>>>> the branches. > > >>>>>>>> > > >>>>>>>> So for the above testcase we generate now: > > >>>>>>>> fully_peel_me: > > >>>>>>>> mov x2, 5 > > >>>>>>>> mov x3, x2 > > >>>>>>>> mov x1, 0 > > >>>>>>>> whilelo p0.d, xzr, x2 > > >>>>>>>> ptrue p1.d, all > > >>>>>>>> .L2: > > >>>>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3] > > >>>>>>>> fadd z0.d, z0.d, z0.d > > >>>>>>>> st1d z0.d, p0, [x0, x1, lsl 3] > > >>>>>>>> incd x1 > > >>>>>>>> whilelo p0.d, x1, x3 > > >>>>>>>> bne .L2 > > >>>>>>>> ret > > >>>>>>>> > > >>>>>>>> Not perfect still, but it's preferable to the original code. > > >>>>>>>> The new param is enabled by default on aarch64 but disabled for > > >>>>>>>> other targets, leaving their behaviour unchanged > > >>>>>>>> (until other target people experiment with it and set it, if > > >>>>>>>> appropriate). > > >>>>>>>> > > >>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu. > > >>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no > > >>>>>>>> differences in performance. > > >>>>>>>> > > >>>>>>>> Ok for trunk? > > >>>>>>> Hum. Why introduce a new --param and not simply key on > > >>>>>>> flag_peel_loops instead? That is > > >>>>>>> enabled by default at -O3 and with FDO but you of course can control > > >>>>>>> that in your targets > > >>>>>>> post-option-processing hook. > > >>>>>> You mean like this? > > >>>>>> It's certainly a simpler patch, but I was just a bit hesitant of > > >>>>>> making this change for all targets :) > > >>>>>> But I suppose it's a reasonable change. > > >>>>> No, that change is backward. What I said is that peeling is already > > >>>>> conditional on > > >>>>> flag_peel_loops and that is enabled by -O3. So you want to disable > > >>>>> flag_peel_loops for > > >>>>> SVE instead in the target. > > >>>> Sorry, I got confused by the similarly named functions. > > >>>> I'm talking about try_unroll_loop_completely when run as part of > > >>>> canonicalize_induction_variables i.e. the "ivcanon" pass > > >>>> (sorry about blaming cunroll here). This doesn't get called through > > >>>> the try_unroll_loops_completely path. > > >>> Well, peeling gets disabled. From your patch I see you want to > > >>> disable "unrolling" when > > >>> the number of loop iteration is not constant. That is called peeling > > >>> where we need to > > >>> emit the loop exit test N times. > > >>> > > >>> Did you check your testcases with -fno-peel-loops? > > >> -fno-peel-loops doesn't help in the testcases. The code that does this > > >> peeling (try_unroll_loop_completely) > > >> can be called through two paths, only one of which is gated on > > >> flag_peel_loops. > > > I don't see the obvious here so I have to either sit down with a > > > non-SVE specific testcase > > > showing this, or I am misunderstanding the actual transform that you > > > want to avoid. > > > allow_peel is false when called from canonicalize_induction_variables. > > > There's the slight > > > chance that UL_NO_GROWTH lets through cases - is your case one of > > > that? That is, > > > does the following help? > > > > > > Index: gcc/tree-ssa-loop-ivcanon.c > > > =================================================================== > > > --- gcc/tree-ssa-loop-ivcanon.c (revision 266056) > > > +++ gcc/tree-ssa-loop-ivcanon.c (working copy) > > > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop > > > exit = NULL; > > > > > > /* See if we can improve our estimate by using recorded loop bounds. > > > */ > > > - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) > > > + if ((allow_peel || maxiter == 0) > > > && maxiter >= 0 > > > && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < > > > n_unroll)) > > > { > > > > > > IIRC I allowed that case when adding allow_peel simply because it avoided > > > some > > > testsuite regressions. This means you eventually want to work on the > > > size estimate > > > of SVE style loops? > > > > This doesn't help. > > > > Sorry if we're talking over each other here, I'm not very familiar with > > this area :( > > For this loop: > > for (int i = 0; i < 5; i++) > > x[i] = x[i] * 2; > > > > For normal vectorisation (e.g. AArch64 NEON) we know the exact number of > > executions of the loop latch. > > This gets fully unrolled as: > > ldp q2, q1, [x0] > > ldr d0, [x0, 32] > > fadd v2.2d, v2.2d, v2.2d > > fadd v1.2d, v1.2d, v1.2d > > fadd d0, d0, d0 > > stp q2, q1, [x0] > > str d0, [x0, 32] > > > > For vector length-agnostic SVE vectorisation we don't as we don't know the > > number of elements we process > > with each loop iteration. So the NEON unrolling becomes SVE peeling I guess. > > Note that the number of iterations in SVE is still "constant", just not > > known at compile-time. > > In this case peeling doesn't eliminate any branches and only serves to > > bloat code size. > > So I sat down with a cross and indeed for the late unrolling pass we > simply pass in allow_peel == true > given that try_unroll_loop_completely also performs peeling (and not > just try_peel_loops which guards > itself with flag_peel_loops). That means instead of the above the > below fixes this (with -fno-peel-loops): > > Index: gcc/tree-ssa-loop-ivcanon.c > =================================================================== > --- gcc/tree-ssa-loop-ivcanon.c (revision 266072) > +++ gcc/tree-ssa-loop-ivcanon.c (working copy) > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop > exit = NULL; > > /* See if we can improve our estimate by using recorded loop bounds. */ > - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) > + if (((allow_peel && flag_peel_loops) || maxiter == 0 || ul == UL_NO_GROWTH) > && maxiter >= 0 > && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) > { > > there might be quite some testsuite fallout since flag_peel_loops is > only enabled at -O3+, > but one has to double-check. As said, a per-loop target control > whether loop peeling is > desirable would be an improvement I guess (apart from generally > disabling peeling for aarch64). > I suppose you can benchmark that together with the above fix.
Oh - and I completely forgot about loop->unroll which the vectorizer could set (for SVE loops) to 1 which disables any unrolling. Richard. > Richard. > > > Kyrill > > > > > Richard. > > > > > >> Thanks, > > >> Kyrill > > >> > > >> > > >>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops > > >>>> or -fno-unroll-loops. > > >>>> Maybe disabling peeling inside try_unroll_loop_completely itself when > > >>>> !flag_peel_loops is viable? > > >>>> > > >>>> Thanks, > > >>>> Kyrill > > >>>> > > >>>>>>> It might also make sense to have more fine-grained control for this > > >>>>>>> and allow a target > > >>>>>>> to say whether it wants to peel a specific loop or not when the > > >>>>>>> middle-end thinks that > > >>>>>>> would be profitable. > > >>>>>> Can be worth looking at as a follow-up. Do you envisage the target > > >>>>>> analysing > > >>>>>> the gimple statements of the loop to figure out its cost? > > >>>>> Kind-of. Sth like > > >>>>> > > >>>>> bool targetm.peel_loop (struct loop *); > > >>>>> > > >>>>> I have no idea whether you can easily detect a SVE vectorized loop > > >>>>> though. > > >>>>> Maybe there's always a special IV or so (the mask?) > > >>>>> > > >>>>> Richard. > > >>>>> > > >>>>>> Thanks, > > >>>>>> Kyrill > > >>>>>> > > >>>>>> > > >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > >>>>>> > > >>>>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): > > >>>>>> Do not unroll > > >>>>>> loop when number of iterations is not known and > > >>>>>> flag_peel_loops is in > > >>>>>> effect. > > >>>>>> > > >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > >>>>>> > > >>>>>> * gcc.target/aarch64/sve/unroll-1.c: New test. > > >>>>>> > >