Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64

Kyrill Tkachov Tue, 13 Nov 2018 01:16:36 -0800

Hi Richard,

On 13/11/18 08:24, Richard Biener wrote:

On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:


On 12/11/18 14:10, Richard Biener wrote:

On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:

On 09/11/18 12:18, Richard Biener wrote:

On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:

Hi all,

In this testcase the codegen for VLA SVE is worse than it could be due to 
unrolling:

fully_peel_me:
           mov     x1, 5
           ptrue   p1.d, all
           whilelo p0.d, xzr, x1
           ld1d    z0.d, p0/z, [x0]
           fadd    z0.d, z0.d, z0.d
           st1d    z0.d, p0, [x0]
           cntd    x2
           addvl   x3, x0, #1
           whilelo p0.d, x2, x1
           beq     .L1
           ld1d    z0.d, p0/z, [x0, #1, mul vl]
           fadd    z0.d, z0.d, z0.d
           st1d    z0.d, p0, [x3]
           cntw    x2
           incb    x0, all, mul #2
           whilelo p0.d, x2, x1
           beq     .L1
           ld1d    z0.d, p0/z, [x0]
           fadd    z0.d, z0.d, z0.d
           st1d    z0.d, p0, [x0]
.L1:
           ret

In this case, due to the vector-length-agnostic nature of SVE the compiler 
doesn't know the loop iteration count.
For such loops we don't want to unroll if we don't end up eliminating branches 
as this just bloats code size
and hurts icache performance.

This patch introduces a new unroll-known-loop-iterations-only param that 
disables cunroll when the loop iteration
count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA 
code, but it does help some
Advanced SIMD cases as well where loops with an unknown iteration count are not 
unrolled when it doesn't eliminate
the branches.

So for the above testcase we generate now:
fully_peel_me:
           mov     x2, 5
           mov     x3, x2
           mov     x1, 0
           whilelo p0.d, xzr, x2
           ptrue   p1.d, all
.L2:
           ld1d    z0.d, p0/z, [x0, x1, lsl 3]
           fadd    z0.d, z0.d, z0.d
           st1d    z0.d, p0, [x0, x1, lsl 3]
           incd    x1
           whilelo p0.d, x1, x3
           bne     .L2
           ret

Not perfect still, but it's preferable to the original code.
The new param is enabled by default on aarch64 but disabled for other targets, 
leaving their behaviour unchanged
(until other target people experiment with it and set it, if appropriate).

Bootstrapped and tested on aarch64-none-linux-gnu.
Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in 
performance.

Ok for trunk?

Hum.  Why introduce a new --param and not simply key on
flag_peel_loops instead?  That is
enabled by default at -O3 and with FDO but you of course can control
that in your targets
post-option-processing hook.

You mean like this?
It's certainly a simpler patch, but I was just a bit hesitant of making this 
change for all targets :)
But I suppose it's a reasonable change.

No, that change is backward.  What I said is that peeling is already
conditional on
flag_peel_loops and that is enabled by -O3.  So you want to disable
flag_peel_loops for
SVE instead in the target.

Sorry, I got confused by the similarly named functions.
I'm talking about try_unroll_loop_completely when run as part of 
canonicalize_induction_variables i.e. the "ivcanon" pass
(sorry about blaming cunroll here). This doesn't get called through the 
try_unroll_loops_completely path.

Well, peeling gets disabled.  From your patch I see you want to
disable "unrolling" when
the number of loop iteration is not constant.  That is called peeling
where we need to
emit the loop exit test N times.

Did you check your testcases with -fno-peel-loops?


-fno-peel-loops doesn't help in the testcases. The code that does this peeling 
(try_unroll_loop_completely)
can be called through two paths, only one of which is gated on flag_peel_loops.

Thanks,
Kyrill

try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or 
-fno-unroll-loops.
Maybe disabling peeling inside try_unroll_loop_completely itself when 
!flag_peel_loops is viable?

Thanks,
Kyrill

It might also make sense to have more fine-grained control for this
and allow a target
to say whether it wants to peel a specific loop or not when the
middle-end thinks that
would be profitable.

Can be worth looking at as a follow-up. Do you envisage the target analysing
the gimple statements of the loop to figure out its cost?

Kind-of.  Sth like

    bool targetm.peel_loop (struct loop *);

I have no idea whether you can easily detect a SVE vectorized loop though.
Maybe there's always a special IV or so (the mask?)

Richard.

Thanks,
Kyrill


2018-11-09  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

          * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
          loop when number of iterations is not known and flag_peel_loops is in
          effect.

2018-11-09  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

          * gcc.target/aarch64/sve/unroll-1.c: New test.

Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64

Reply via email to