GCC already supports fully-predicated vectorisation for loops, both
using "traditional" loop vectorisation and loop-aware SLP
(superword-level parallelism). For example, GCC can vectorise:

  void
  foo (char *x)
  {
    for (int i = 0; i < 6; i += 2)
      {
        x[i] += 1;
        x[i + 1] += 2;
      }
  }

from which it generates the following assembly code (with -O2
-ftree-vectorize -march=armv9-a+sve -msve-vector-bits=scalable):

  foo:
          ptrue   p7.b, vl6
          mov     w1, 513
          ld1b    z31.b, p7/z, [x0]
          mov     z30.h, w1
          add     z30.b, z31.b, z30.b
          st1b    z30.b, p7, [x0]
          ret

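The constant 513 in the listing above is 0x0201: broadcasting it to every
halfword lane and then adding byte-wise gives the alternating addends 1
and 2. The following is a minimal scalar sketch of that arithmetic, not
compiler output; the helper names are hypothetical, and it assumes the
low byte of each halfword lines up with the even-indexed element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack the addends for an even/odd pair of byte
   elements into one 16-bit value (even element in the low byte).  */
static uint16_t
pack_addends (uint8_t even, uint8_t odd)
{
  return (uint16_t) (even | (odd << 8));
}

/* Hypothetical helper: model "mov z30.h, w1" followed by a byte-wise
   view of the register, writing 2 * n_halfwords bytes to OUT.  */
static void
broadcast_halfword_as_bytes (uint16_t value, uint8_t *out,
                             size_t n_halfwords)
{
  assert (out != NULL);
  for (size_t i = 0; i < n_halfwords; i++)
    {
      out[2 * i] = (uint8_t) (value & 0xff);  /* even byte lane */
      out[2 * i + 1] = (uint8_t) (value >> 8); /* odd byte lane */
    }
}
```

Calling pack_addends (1, 2) yields 513, and broadcasting that value
reproduces the byte pattern 1, 2, 1, 2, ... used by the add.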
However, GCC cannot yet vectorise the unrolled form of the same
function, even though it is semantically equivalent:

  void
  foo (char *x)
  {
    x[0] += 1;
    x[1] += 2;
    x[2] += 1;
    x[3] += 2;
    x[4] += 1;
    x[5] += 2;
  }

These patches add support for vectorising the unrolled form of the
above function by allowing basic block SLP to use a predicate mask or
a length limit. With the patches applied, GCC vectorises the unrolled
function to the following assembly code (using the same options as
above):

  foo:
          ptrue   p7.b, vl6
          ptrue   p6.b, all
          ld1b    z31.b, p7/z, [x0]
          adrp    x1, .LC0
          add     x1, x1, :lo12:.LC0
          ld1rqb  z30.b, p6/z, [x1]
          add     z30.b, z31.b, z30.b
          st1b    z30.b, p7, [x0]
          ret

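As a rough illustration of what the predicated sequence above computes,
here is a scalar model of a masked load, add and store in which only the
first six byte lanes touch memory. It is a sketch under assumptions: the
lane count is fixed at 16 (a single 128-bit vector, whereas the real
code is vector-length agnostic), the contents of .LC0 are not shown in
this letter and are assumed to be the repeating pattern 1, 2, and
predicated_add is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 16 }; /* Assumption: one 128-bit vector of byte lanes.  */

/* Scalar model of "ld1b ..., p7/z", "add" and "st1b ..., p7" with a
   predicate that enables only the first ACTIVE_LANES lanes.  ADDENDS
   must point to LANES bytes; X needs only ACTIVE_LANES bytes.  */
static void
predicated_add (uint8_t *x, const uint8_t *addends, size_t active_lanes)
{
  assert (active_lanes <= LANES);
  uint8_t vec[LANES];

  /* Zeroing predicated load: inactive lanes read as 0, not memory.  */
  for (size_t lane = 0; lane < LANES; lane++)
    {
      bool active = lane < active_lanes;
      vec[lane] = active ? x[lane] : 0;
    }

  /* Unpredicated add over every lane.  */
  for (size_t lane = 0; lane < LANES; lane++)
    vec[lane] = (uint8_t) (vec[lane] + addends[lane]);

  /* Predicated store: inactive lanes leave memory untouched.  */
  for (size_t lane = 0; lane < active_lanes; lane++)
    x[lane] = vec[lane];
}
```

With active_lanes set to 6, bytes beyond x[5] are neither read nor
written, which is what makes the full-width vector operation safe for a
six-element group.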
Predication is only used for groups whose size is not neatly divisible
into vectors of lengths that can be supported directly by the target.
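One possible reading of that condition can be sketched as follows. This
is not the logic from the patches, which is not described in this
letter and may well differ; needs_predicated_tail and the list of
supported lane counts are hypothetical. The sketch treats a group as
"neatly divisible" if some supported lane count divides the group size
exactly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check, not taken from the patches: return true if a
   group of GROUP_SIZE lanes cannot be split evenly into whole vectors
   of any lane count in SUPPORTED, so that a predicated (or
   length-limited) tail would be needed.  */
static bool
needs_predicated_tail (size_t group_size, const size_t *supported,
                       size_t n_supported)
{
  assert (group_size > 0);
  for (size_t i = 0; i < n_supported; i++)
    if (supported[i] != 0 && group_size % supported[i] == 0)
      return false; /* Whole vectors of this length cover the group.  */
  return true;
}
```

Under this reading, and assuming byte vectors of 8 and 16 lanes are
supported, the six-element group from the example needs a predicated
tail, whereas a 16-element group does not.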
Bootstrapped and tested on aarch64-linux-gnu.
The following tests regress with this change and still need
investigation. Some are just tests that need updating; others are ICEs
(internal compiler errors).
gfortran.dg/vect/pr99746.f90
c-c++-common/hwasan/handles-poly_int-marked-vars.c
gcc.dg/pr95713.c
gcc.dg/vect/bb-slp-17.c
gcc.dg/vect/bb-slp-4.c
gcc.dg/vect/bb-slp-pr95839-v8.c
gcc.dg/vect/no-scevccp-outer-10.c
gcc.dg/vect/pr46052.c
gcc.dg/vect/pr68305.c
gcc.dg/vect/slp-2.c
gcc.dg/vect/tsvc/vect-tsvc-s351.c
gcc.dg/vect/tsvc/vect-tsvc-s353.c
gcc.dg/vect/vect-over-widen-10.c
gcc.dg/vect/vect-over-widen-13.c
gcc.dg/vect/vect-over-widen-14.c
gcc.dg/vect/vect-over-widen-19.c
gcc.dg/vect/vect-over-widen-5.c
gcc.dg/vect/vect-over-widen-6.c
gcc.dg/vect/vect-over-widen-7.c
gcc.dg/vect/vect-over-widen-8.c
gcc.dg/vect/vect-over-widen-9.c
gcc.dg/vect/vect-shift-5.c
gcc.dg/vect/vect-strided-u8-i8-gap2-big-array.c
gcc.misc-tests/gcov-25.c
gcc.misc-tests/gcov-26.c
gcc.misc-tests/gcov-27.c
gcc.misc-tests/gcov-28.c
gcc.target/aarch64/popcnt-sve.c
gcc.target/aarch64/simd/faminmax-codegen-no-flag.c
gcc.target/aarch64/simd/faminmax-codegen.c
gcc.target/aarch64/sve/reduc_14.c
gcc.target/aarch64/sve/slp_6.c
gcc.target/aarch64/sve/slp_7_costly.c
gcc.target/aarch64/sve/truncated_concatenation_1.c
gcc.target/aarch64/vect_mixed_sizes_14.c
gcc.target/aarch64/vect_unary_1.c
c-c++-common/hwasan/handles-poly_int-marked-vars.c
g++.dg/opt/pr95528.C
g++.dg/vect/simd-complex-num-null-node.cc
Christopher Bazley (9):
Track the minimum and maximum number of lanes for BB SLP
Preparation to support predicated vector tails for BB SLP
Implement recording/getting of mask/length for BB SLP
Conditionally dump info on creation and destruction of SLP nodes
Conditionally dump info on pushing vectorized defs
Fix vexed ownership of stmts passed to vect_build_slp_instance
Update constant creation for BB SLP with predicated tails
Extend BB SLP vectorization to use predicated tails
AArch64/SVE: Tests for use of predicated vector tails for BB SLP
gcc/gimple-fold.cc | 2 +-
.../gcc.target/aarch64/sve/slp_pred_1.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_1_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_2.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_4.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_5.c | 36 ++
.../gcc.target/aarch64/sve/slp_pred_6.c | 39 ++
.../gcc.target/aarch64/sve/slp_pred_6_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_7.c | 38 ++
.../gcc.target/aarch64/sve/slp_pred_harness.h | 28 +
gcc/tree-vect-loop.cc | 10 +
gcc/tree-vect-slp.cc | 568 ++++++++++++-----
gcc/tree-vect-stmts.cc | 600 +++++++++++-------
gcc/tree-vectorizer.h | 154 ++++-
16 files changed, 1220 insertions(+), 405 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_5.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_harness.h
--
2.43.0