On 13/11/18 09:28, Richard Biener wrote:
On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote:Hi Richard, On 13/11/18 08:24, Richard Biener wrote:On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote:On 12/11/18 14:10, Richard Biener wrote:On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote:On 09/11/18 12:18, Richard Biener wrote:On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote:Hi all, In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: fully_peel_me: mov x1, 5 ptrue p1.d, all whilelo p0.d, xzr, x1 ld1d z0.d, p0/z, [x0] fadd z0.d, z0.d, z0.d st1d z0.d, p0, [x0] cntd x2 addvl x3, x0, #1 whilelo p0.d, x2, x1 beq .L1 ld1d z0.d, p0/z, [x0, #1, mul vl] fadd z0.d, z0.d, z0.d st1d z0.d, p0, [x3] cntw x2 incb x0, all, mul #2 whilelo p0.d, x2, x1 beq .L1 ld1d z0.d, p0/z, [x0] fadd z0.d, z0.d, z0.d st1d z0.d, p0, [x0] .L1: ret In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size and hurts icache performance. This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate the branches. So for the above testcase we generate now: fully_peel_me: mov x2, 5 mov x3, x2 mov x1, 0 whilelo p0.d, xzr, x2 ptrue p1.d, all .L2: ld1d z0.d, p0/z, [x0, x1, lsl 3] fadd z0.d, z0.d, z0.d st1d z0.d, p0, [x0, x1, lsl 3] incd x1 whilelo p0.d, x1, x3 bne .L2 ret Not perfect still, but it's preferable to the original code. The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged (until other target people experiment with it and set it, if appropriate). Bootstrapped and tested on aarch64-none-linux-gnu. Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. Ok for trunk?Hum. Why introduce a new --param and not simply key on flag_peel_loops instead? That is enabled by default at -O3 and with FDO but you of course can control that in your targets post-option-processing hook.You mean like this? It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) But I suppose it's a reasonable change.No, that change is backward. What I said is that peeling is already conditional on flag_peel_loops and that is enabled by -O3. So you want to disable flag_peel_loops for SVE instead in the target.Sorry, I got confused by the similarly named functions. I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path.Well, peeling gets disabled. From your patch I see you want to disable "unrolling" when the number of loop iteration is not constant. That is called peeling where we need to emit the loop exit test N times. Did you check your testcases with -fno-peel-loops?-fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely) can be called through two paths, only one of which is gated on flag_peel_loops.I don't see the obvious here so I have to either sit down with a non-SVE specific testcase showing this, or I am misunderstanding the actual transform that you want to avoid. allow_peel is false when called from canonicalize_induction_variables. There's the slight chance that UL_NO_GROWTH lets through cases - is your case one of that? That is, does the following help? Index: gcc/tree-ssa-loop-ivcanon.c =================================================================== --- gcc/tree-ssa-loop-ivcanon.c (revision 266056) +++ gcc/tree-ssa-loop-ivcanon.c (working copy) @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop exit = NULL; /* See if we can improve our estimate by using recorded loop bounds. */ - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) + if ((allow_peel || maxiter == 0) && maxiter >= 0 && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) { IIRC I allowed that case when adding allow_peel simply because it avoided some testsuite regressions. This means you eventually want to work on the size estimate of SVE style loops?
This doesn't help. Sorry if we're talking over each other here, I'm not very familiar with this area :( For this loop: for (int i = 0; i < 5; i++) x[i] = x[i] * 2; For normal vectorisation (e.g. AArch64 NEON) we know the exact number of executions of the loop latch. This gets fully unrolled as: ldp q2, q1, [x0] ldr d0, [x0, 32] fadd v2.2d, v2.2d, v2.2d fadd v1.2d, v1.2d, v1.2d fadd d0, d0, d0 stp q2, q1, [x0] str d0, [x0, 32] For vector length-agnostic SVE vectorisation we don't as we don't know the number of elements we process with each loop iteration. So the NEON unrolling becomes SVE peeling I guess. Note that the number of iterations in SVE is still "constant", just not known at compile-time. In this case peeling doesn't eliminate any branches and only serves to bloat code size. Kyrill
Richard.Thanks, Kyrilltry_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops. Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable? Thanks, KyrillIt might also make sense to have more fine-grained control for this and allow a target to say whether it wants to peel a specific loop or not when the middle-end thinks that would be profitable.Can be worth looking at as a follow-up. Do you envisage the target analysing the gimple statements of the loop to figure out its cost?Kind-of. Sth like bool targetm.peel_loop (struct loop *); I have no idea whether you can easily detect a SVE vectorized loop though. Maybe there's always a special IV or so (the mask?) Richard.Thanks, Kyrill 2018-11-09 Kyrylo Tkachov <kyrylo.tkac...@arm.com> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll loop when number of iterations is not known and flag_peel_loops is in effect. 2018-11-09 Kyrylo Tkachov <kyrylo.tkac...@arm.com> * gcc.target/aarch64/sve/unroll-1.c: New test.