On 09/11/18 12:18, Richard Biener wrote:
On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:
Hi all,
In this testcase the codegen for VLA SVE is worse than it could be due to
unrolling:
fully_peel_me:
mov x1, 5
ptrue p1.d, all
whilelo p0.d, xzr, x1
ld1d z0.d, p0/z, [x0]
fadd z0.d, z0.d, z0.d
st1d z0.d, p0, [x0]
cntd x2
addvl x3, x0, #1
whilelo p0.d, x2, x1
beq .L1
ld1d z0.d, p0/z, [x0, #1, mul vl]
fadd z0.d, z0.d, z0.d
st1d z0.d, p0, [x3]
cntw x2
incb x0, all, mul #2
whilelo p0.d, x2, x1
beq .L1
ld1d z0.d, p0/z, [x0]
fadd z0.d, z0.d, z0.d
st1d z0.d, p0, [x0]
.L1:
ret
In this case, due to the vector-length-agnostic nature of SVE the compiler
doesn't know the loop iteration count.
For such loops we don't want to unroll if we don't end up eliminating branches
as this just bloats code size
and hurts icache performance.
This patch introduces a new unroll-known-loop-iterations-only param that
disables cunroll when the loop iteration
count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA
code, but it does help some
Advanced SIMD cases as well where loops with an unknown iteration count are not
unrolled when it doesn't eliminate
the branches.
So for the above testcase we generate now:
fully_peel_me:
mov x2, 5
mov x3, x2
mov x1, 0
whilelo p0.d, xzr, x2
ptrue p1.d, all
.L2:
ld1d z0.d, p0/z, [x0, x1, lsl 3]
fadd z0.d, z0.d, z0.d
st1d z0.d, p0, [x0, x1, lsl 3]
incd x1
whilelo p0.d, x1, x3
bne .L2
ret
Not perfect still, but it's preferable to the original code.
The new param is enabled by default on aarch64 but disabled for other targets,
leaving their behaviour unchanged
(until other target people experiment with it and set it, if appropriate).
Bootstrapped and tested on aarch64-none-linux-gnu.
Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in
performance.
Ok for trunk?
Hum. Why introduce a new --param and not simply key on
flag_peel_loops instead? That is
enabled by default at -O3 and with FDO but you of course can control
that in your targets
post-option-processing hook.
You mean like this?
It's certainly a simpler patch, but I was just a bit hesitant of making this
change for all targets :)
But I suppose it's a reasonable change.
It might also make sense to have more fine-grained control for this
and allow a target
to say whether it wants to peel a specific loop or not when the
middle-end thinks that
would be profitable.
Can be worth looking at as a follow-up. Do you envisage the target analysing
the gimple statements of the loop to figure out its cost?
Thanks,
Kyrill
2018-11-09 Kyrylo Tkachov <kyrylo.tkac...@arm.com>
* tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
loop when number of iterations is not known and flag_peel_loops is in
effect.
2018-11-09 Kyrylo Tkachov <kyrylo.tkac...@arm.com>
* gcc.target/aarch64/sve/unroll-1.c: New test.
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..7f53d20cbf8e18a4389b86c037f56f024bac22a5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+/* Check that simple loop is not fully unrolled. */
+
+void
+fully_peel_me (double *x)
+{
+ for (int i = 0; i < 5; i++)
+ x[i] = x[i] * 2;
+}
+
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[.+]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[.+\]\n} 1 } } */
diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index c2953059fb9218d4bc4cf12fe9277a552b4a04bd..daeddb384254775d6482ed5580e8262e4b26a87f 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -883,6 +883,16 @@ try_unroll_loop_completely (struct loop *loop,
loop->num);
return false;
}
+ else if (TREE_CODE (niter) == SCEV_NOT_KNOWN
+ && flag_peel_loops)
+ {
+ if (dump_enabled_p ())
+ dump_printf (MSG_NOTE, "Not unrolling loop %d: "
+ "exact number of iterations not known "
+ "(-fpeel-loops).\n",
+ loop->num);
+ return false;
+ }
}
initialize_original_copy_tables ();