On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote: > > Hi Richard, > > On 13/11/18 08:24, Richard Biener wrote: > > On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov > > <kyrylo.tkac...@foss.arm.com> wrote: > >> > >> On 12/11/18 14:10, Richard Biener wrote: > >>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov > >>> <kyrylo.tkac...@foss.arm.com> wrote: > >>>> On 09/11/18 12:18, Richard Biener wrote: > >>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov > >>>>> <kyrylo.tkac...@foss.arm.com> wrote: > >>>>>> Hi all, > >>>>>> > >>>>>> In this testcase the codegen for VLA SVE is worse than it could be due > >>>>>> to unrolling: > >>>>>> > >>>>>> fully_peel_me: > >>>>>> mov x1, 5 > >>>>>> ptrue p1.d, all > >>>>>> whilelo p0.d, xzr, x1 > >>>>>> ld1d z0.d, p0/z, [x0] > >>>>>> fadd z0.d, z0.d, z0.d > >>>>>> st1d z0.d, p0, [x0] > >>>>>> cntd x2 > >>>>>> addvl x3, x0, #1 > >>>>>> whilelo p0.d, x2, x1 > >>>>>> beq .L1 > >>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl] > >>>>>> fadd z0.d, z0.d, z0.d > >>>>>> st1d z0.d, p0, [x3] > >>>>>> cntw x2 > >>>>>> incb x0, all, mul #2 > >>>>>> whilelo p0.d, x2, x1 > >>>>>> beq .L1 > >>>>>> ld1d z0.d, p0/z, [x0] > >>>>>> fadd z0.d, z0.d, z0.d > >>>>>> st1d z0.d, p0, [x0] > >>>>>> .L1: > >>>>>> ret > >>>>>> > >>>>>> In this case, due to the vector-length-agnostic nature of SVE the > >>>>>> compiler doesn't know the loop iteration count. > >>>>>> For such loops we don't want to unroll if we don't end up eliminating > >>>>>> branches as this just bloats code size > >>>>>> and hurts icache performance. > >>>>>> > >>>>>> This patch introduces a new unroll-known-loop-iterations-only param > >>>>>> that disables cunroll when the loop iteration > >>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often > >>>>>> for SVE VLA code, but it does help some > >>>>>> Advanced SIMD cases as well where loops with an unknown iteration > >>>>>> count are not unrolled when it doesn't eliminate > >>>>>> the branches. > >>>>>> > >>>>>> So for the above testcase we generate now: > >>>>>> fully_peel_me: > >>>>>> mov x2, 5 > >>>>>> mov x3, x2 > >>>>>> mov x1, 0 > >>>>>> whilelo p0.d, xzr, x2 > >>>>>> ptrue p1.d, all > >>>>>> .L2: > >>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3] > >>>>>> fadd z0.d, z0.d, z0.d > >>>>>> st1d z0.d, p0, [x0, x1, lsl 3] > >>>>>> incd x1 > >>>>>> whilelo p0.d, x1, x3 > >>>>>> bne .L2 > >>>>>> ret > >>>>>> > >>>>>> Not perfect still, but it's preferable to the original code. > >>>>>> The new param is enabled by default on aarch64 but disabled for other > >>>>>> targets, leaving their behaviour unchanged > >>>>>> (until other target people experiment with it and set it, if > >>>>>> appropriate). > >>>>>> > >>>>>> Bootstrapped and tested on aarch64-none-linux-gnu. > >>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences > >>>>>> in performance. > >>>>>> > >>>>>> Ok for trunk? > >>>>> Hum. Why introduce a new --param and not simply key on > >>>>> flag_peel_loops instead? That is > >>>>> enabled by default at -O3 and with FDO but you of course can control > >>>>> that in your targets > >>>>> post-option-processing hook. > >>>> You mean like this? > >>>> It's certainly a simpler patch, but I was just a bit hesitant of making > >>>> this change for all targets :) > >>>> But I suppose it's a reasonable change. > >>> No, that change is backward. What I said is that peeling is already > >>> conditional on > >>> flag_peel_loops and that is enabled by -O3. So you want to disable > >>> flag_peel_loops for > >>> SVE instead in the target. > >> Sorry, I got confused by the similarly named functions. > >> I'm talking about try_unroll_loop_completely when run as part of > >> canonicalize_induction_variables i.e. the "ivcanon" pass > >> (sorry about blaming cunroll here). This doesn't get called through the > >> try_unroll_loops_completely path. > > Well, peeling gets disabled. From your patch I see you want to > > disable "unrolling" when > > the number of loop iteration is not constant. That is called peeling > > where we need to > > emit the loop exit test N times. > > > > Did you check your testcases with -fno-peel-loops? > > -fno-peel-loops doesn't help in the testcases. The code that does this > peeling (try_unroll_loop_completely) > can be called through two paths, only one of which is gated on > flag_peel_loops.
I don't see the obvious here so I have to either sit down with a non-SVE specific testcase showing this, or I am misunderstanding the actual transform that you want to avoid. allow_peel is false when called from canonicalize_induction_variables. There's the slight chance that UL_NO_GROWTH lets through cases - is your case one of that? That is, does the following help? Index: gcc/tree-ssa-loop-ivcanon.c =================================================================== --- gcc/tree-ssa-loop-ivcanon.c (revision 266056) +++ gcc/tree-ssa-loop-ivcanon.c (working copy) @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop exit = NULL; /* See if we can improve our estimate by using recorded loop bounds. */ - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) + if ((allow_peel || maxiter == 0) && maxiter >= 0 && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) { IIRC I allowed that case when adding allow_peel simply because it avoided some testsuite regressions. This means you eventually want to work on the size estimate of SVE style loops? Richard. > Thanks, > Kyrill > > > >> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or > >> -fno-unroll-loops. > >> Maybe disabling peeling inside try_unroll_loop_completely itself when > >> !flag_peel_loops is viable? > >> > >> Thanks, > >> Kyrill > >> > >>>>> It might also make sense to have more fine-grained control for this > >>>>> and allow a target > >>>>> to say whether it wants to peel a specific loop or not when the > >>>>> middle-end thinks that > >>>>> would be profitable. > >>>> Can be worth looking at as a follow-up. Do you envisage the target > >>>> analysing > >>>> the gimple statements of the loop to figure out its cost? > >>> Kind-of. Sth like > >>> > >>> bool targetm.peel_loop (struct loop *); > >>> > >>> I have no idea whether you can easily detect a SVE vectorized loop though. > >>> Maybe there's always a special IV or so (the mask?) > >>> > >>> Richard. > >>> > >>>> Thanks, > >>>> Kyrill > >>>> > >>>> > >>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > >>>> > >>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not > >>>> unroll > >>>> loop when number of iterations is not known and > >>>> flag_peel_loops is in > >>>> effect. > >>>> > >>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > >>>> > >>>> * gcc.target/aarch64/sve/unroll-1.c: New test. > >>>> >