On Tue, Nov 13, 2018 at 3:33 PM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Tue, Nov 13, 2018 at 10:48 AM Kyrill Tkachov
> <kyrylo.tkac...@foss.arm.com> wrote:
> >
> >
> > On 13/11/18 09:28, Richard Biener wrote:
> > > On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov
> > > <kyrylo.tkac...@foss.arm.com> wrote:
> > >> Hi Richard,
> > >>
> > >> On 13/11/18 08:24, Richard Biener wrote:
> > >>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov
> > >>> <kyrylo.tkac...@foss.arm.com> wrote:
> > >>>> On 12/11/18 14:10, Richard Biener wrote:
> > >>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
> > >>>>> <kyrylo.tkac...@foss.arm.com> wrote:
> > >>>>>> On 09/11/18 12:18, Richard Biener wrote:
> > >>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
> > >>>>>>> <kyrylo.tkac...@foss.arm.com> wrote:
> > >>>>>>>> Hi all,
> > >>>>>>>>
> > >>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be 
> > >>>>>>>> due to unrolling:
> > >>>>>>>>
> > >>>>>>>> fully_peel_me:
> > >>>>>>>>             mov     x1, 5
> > >>>>>>>>             ptrue   p1.d, all
> > >>>>>>>>             whilelo p0.d, xzr, x1
> > >>>>>>>>             ld1d    z0.d, p0/z, [x0]
> > >>>>>>>>             fadd    z0.d, z0.d, z0.d
> > >>>>>>>>             st1d    z0.d, p0, [x0]
> > >>>>>>>>             cntd    x2
> > >>>>>>>>             addvl   x3, x0, #1
> > >>>>>>>>             whilelo p0.d, x2, x1
> > >>>>>>>>             beq     .L1
> > >>>>>>>>             ld1d    z0.d, p0/z, [x0, #1, mul vl]
> > >>>>>>>>             fadd    z0.d, z0.d, z0.d
> > >>>>>>>>             st1d    z0.d, p0, [x3]
> > >>>>>>>>             cntw    x2
> > >>>>>>>>             incb    x0, all, mul #2
> > >>>>>>>>             whilelo p0.d, x2, x1
> > >>>>>>>>             beq     .L1
> > >>>>>>>>             ld1d    z0.d, p0/z, [x0]
> > >>>>>>>>             fadd    z0.d, z0.d, z0.d
> > >>>>>>>>             st1d    z0.d, p0, [x0]
> > >>>>>>>> .L1:
> > >>>>>>>>             ret
> > >>>>>>>>
> > >>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the 
> > >>>>>>>> compiler doesn't know the loop iteration count.
> > >>>>>>>> For such loops we don't want to unroll if we don't end up 
> > >>>>>>>> eliminating branches as this just bloats code size
> > >>>>>>>> and hurts icache performance.
> > >>>>>>>>
> > >>>>>>>> This patch introduces a new unroll-known-loop-iterations-only 
> > >>>>>>>> param that disables cunroll when the loop iteration
> > >>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more 
> > >>>>>>>> often for SVE VLA code, but it does help some
> > >>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration 
> > >>>>>>>> count are not unrolled when it doesn't eliminate
> > >>>>>>>> the branches.
> > >>>>>>>>
> > >>>>>>>> So for the above testcase we generate now:
> > >>>>>>>> fully_peel_me:
> > >>>>>>>>             mov     x2, 5
> > >>>>>>>>             mov     x3, x2
> > >>>>>>>>             mov     x1, 0
> > >>>>>>>>             whilelo p0.d, xzr, x2
> > >>>>>>>>             ptrue   p1.d, all
> > >>>>>>>> .L2:
> > >>>>>>>>             ld1d    z0.d, p0/z, [x0, x1, lsl 3]
> > >>>>>>>>             fadd    z0.d, z0.d, z0.d
> > >>>>>>>>             st1d    z0.d, p0, [x0, x1, lsl 3]
> > >>>>>>>>             incd    x1
> > >>>>>>>>             whilelo p0.d, x1, x3
> > >>>>>>>>             bne     .L2
> > >>>>>>>>             ret
> > >>>>>>>>
> > >>>>>>>> Not perfect still, but it's preferable to the original code.
> > >>>>>>>> The new param is enabled by default on aarch64 but disabled for 
> > >>>>>>>> other targets, leaving their behaviour unchanged
> > >>>>>>>> (until other target people experiment with it and set it, if 
> > >>>>>>>> appropriate).
> > >>>>>>>>
> > >>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
> > >>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no 
> > >>>>>>>> differences in performance.
> > >>>>>>>>
> > >>>>>>>> Ok for trunk?
> > >>>>>>> Hum.  Why introduce a new --param and not simply key on
> > >>>>>>> flag_peel_loops instead?  That is
> > >>>>>>> enabled by default at -O3 and with FDO but you of course can control
> > >>>>>>> that in your targets
> > >>>>>>> post-option-processing hook.
> > >>>>>> You mean like this?
> > >>>>>> It's certainly a simpler patch, but I was just a bit hesitant of 
> > >>>>>> making this change for all targets :)
> > >>>>>> But I suppose it's a reasonable change.
> > >>>>> No, that change is backward.  What I said is that peeling is already
> > >>>>> conditional on
> > >>>>> flag_peel_loops and that is enabled by -O3.  So you want to disable
> > >>>>> flag_peel_loops for
> > >>>>> SVE instead in the target.
> > >>>> Sorry, I got confused by the similarly named functions.
> > >>>> I'm talking about try_unroll_loop_completely when run as part of 
> > >>>> canonicalize_induction_variables i.e. the "ivcanon" pass
> > >>>> (sorry about blaming cunroll here). This doesn't get called through 
> > >>>> the try_unroll_loops_completely path.
> > >>> Well, peeling gets disabled.  From your patch I see you want to
> > >>> disable "unrolling" when
> > >>> the number of loop iteration is not constant.  That is called peeling
> > >>> where we need to
> > >>> emit the loop exit test N times.
> > >>>
> > >>> Did you check your testcases with -fno-peel-loops?
> > >> -fno-peel-loops doesn't help in the testcases. The code that does this 
> > >> peeling (try_unroll_loop_completely)
> > >> can be called through two paths, only one of which is gated on 
> > >> flag_peel_loops.
> > > I don't see the obvious here so I have to either sit down with a
> > > non-SVE specific testcase
> > > showing this, or I am misunderstanding the actual transform that you
> > > want to avoid.
> > > allow_peel is false when called from canonicalize_induction_variables.
> > > There's the slight
> > > chance that UL_NO_GROWTH lets through cases - is your case one of
> > > that?  That is,
> > > does the following help?
> > >
> > > Index: gcc/tree-ssa-loop-ivcanon.c
> > > ===================================================================
> > > --- gcc/tree-ssa-loop-ivcanon.c (revision 266056)
> > > +++ gcc/tree-ssa-loop-ivcanon.c (working copy)
> > > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
> > >       exit = NULL;
> > >
> > >     /* See if we can improve our estimate by using recorded loop bounds.  
> > > */
> > > -  if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
> > > +  if ((allow_peel || maxiter == 0)
> > >         && maxiter >= 0
> > >         && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < 
> > > n_unroll))
> > >       {
> > >
> > > IIRC I allowed that case when adding allow_peel simply because it avoided 
> > > some
> > > testsuite regressions.  This means you eventually want to work on the
> > > size estimate
> > > of SVE style loops?
> >
> > This doesn't help.
> >
> > Sorry if we're talking over each other here, I'm not very familiar with 
> > this area :(
> > For this loop:
> >    for (int i = 0; i < 5; i++)
> >      x[i] = x[i] * 2;
> >
> > For normal vectorisation (e.g. AArch64 NEON) we know the exact number of 
> > executions of the loop latch.
> > This gets fully unrolled as:
> >          ldp     q2, q1, [x0]
> >          ldr     d0, [x0, 32]
> >          fadd    v2.2d, v2.2d, v2.2d
> >          fadd    v1.2d, v1.2d, v1.2d
> >          fadd    d0, d0, d0
> >          stp     q2, q1, [x0]
> >          str     d0, [x0, 32]
> >
> > For vector length-agnostic SVE vectorisation we don't as we don't know the 
> > number of elements we process
> > with each loop iteration. So the NEON unrolling becomes SVE peeling I guess.
> > Note that the number of iterations in SVE is still "constant", just not 
> > known at compile-time.
> > In this case peeling doesn't eliminate any branches and only serves to 
> > bloat code size.
>
> So I sat down with a cross and indeed for the late unrolling pass we
> simply pass in allow_peel == true
> given that try_unroll_loop_completely also performs peeling (and not
> just try_peel_loops which guards
> itself with flag_peel_loops).  That means instead of the above the
> below fixes this (with -fno-peel-loops):
>
> Index: gcc/tree-ssa-loop-ivcanon.c
> ===================================================================
> --- gcc/tree-ssa-loop-ivcanon.c (revision 266072)
> +++ gcc/tree-ssa-loop-ivcanon.c (working copy)
> @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
>      exit = NULL;
>
>    /* See if we can improve our estimate by using recorded loop bounds.  */
> -  if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
> +  if (((allow_peel && flag_peel_loops) || maxiter == 0 || ul == UL_NO_GROWTH)
>        && maxiter >= 0
>        && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
>      {
>
> there might be quite some testsuite fallout since flag_peel_loops is
> only enabled at -O3+,
> but one has to double-check.  As said, a per-loop target control
> whether loop peeling is
> desirable would be an improvement I guess (apart from generally
> disabling peeling for aarch64).
> I suppose you can benchmark that together with the above fix.

Oh - and I completely forgot about loop->unroll which the vectorizer
could set (for SVE loops) to 1 which disables any unrolling.

Richard.

> Richard.
>
> > Kyrill
> >
> > > Richard.
> > >
> > >> Thanks,
> > >> Kyrill
> > >>
> > >>
> > >>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops 
> > >>>> or -fno-unroll-loops.
> > >>>> Maybe disabling peeling inside try_unroll_loop_completely itself when 
> > >>>> !flag_peel_loops is viable?
> > >>>>
> > >>>> Thanks,
> > >>>> Kyrill
> > >>>>
> > >>>>>>> It might also make sense to have more fine-grained control for this
> > >>>>>>> and allow a target
> > >>>>>>> to say whether it wants to peel a specific loop or not when the
> > >>>>>>> middle-end thinks that
> > >>>>>>> would be profitable.
> > >>>>>> Can be worth looking at as a follow-up. Do you envisage the target 
> > >>>>>> analysing
> > >>>>>> the gimple statements of the loop to figure out its cost?
> > >>>>> Kind-of.  Sth like
> > >>>>>
> > >>>>>      bool targetm.peel_loop (struct loop *);
> > >>>>>
> > >>>>> I have no idea whether you can easily detect a SVE vectorized loop 
> > >>>>> though.
> > >>>>> Maybe there's always a special IV or so (the mask?)
> > >>>>>
> > >>>>> Richard.
> > >>>>>
> > >>>>>> Thanks,
> > >>>>>> Kyrill
> > >>>>>>
> > >>>>>>
> > >>>>>> 2018-11-09  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
> > >>>>>>
> > >>>>>>            * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): 
> > >>>>>> Do not unroll
> > >>>>>>            loop when number of iterations is not known and 
> > >>>>>> flag_peel_loops is in
> > >>>>>>            effect.
> > >>>>>>
> > >>>>>> 2018-11-09  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
> > >>>>>>
> > >>>>>>            * gcc.target/aarch64/sve/unroll-1.c: New test.
> > >>>>>>
> >

Reply via email to