GCC already supports fully-predicated vectorisation for loops, using
both "traditional" loop vectorisation and loop-aware SLP
(superword-level parallelism). For example, GCC can vectorise:

void
foo (char *x)
{
  for (int i = 0; i < 6; i += 2)
    {
      x[i] += 1;
      x[i + 1] += 2;
    }
}

from which it generates the following assembly code (with -O2
-ftree-vectorize -march=armv9-a+sve -msve-vector-bits=scalable):

foo:
        ptrue   p7.b, vl6
        mov     w1, 513
        ld1b    z31.b, p7/z, [x0]
        mov     z30.h, w1
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret
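
The constant 513 in the listing above is 0x0201. Broadcasting it to
every 16-bit lane (mov z30.h, w1) and then treating the register as
bytes yields the repeating addend pattern 1, 2, 1, 2, ... on a
little-endian target such as aarch64-linux-gnu. A minimal sketch of
that arithmetic follows; the helper name splat_addend is hypothetical
and exists only for illustration:

```c
#include <stdint.h>

/* Hypothetical helper: returns the value that ends up in byte lane
   LANE when the 16-bit constant SPLAT is broadcast to every halfword
   of a vector and the vector is then viewed as bytes (little-endian
   lane order assumed).  */
static uint8_t
splat_addend (uint16_t splat, unsigned int lane)
{
  /* Even byte lanes hold the low byte, odd byte lanes the high byte.  */
  return (lane % 2 == 0) ? (uint8_t) (splat & 0xff)
			 : (uint8_t) (splat >> 8);
}
```

So splat_addend (513, 0) is the addend for x[0] and
splat_addend (513, 1) is the addend for x[1], matching the scalar
source.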

However, GCC cannot yet vectorise the unrolled form of the same
function, even though it is semantically equivalent:

void
foo (char *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}
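
The equivalence of the two forms can be checked with a small
stand-alone sketch such as the one below. The names foo_loop,
foo_unrolled and forms_agree are hypothetical (the functions are
renamed only so that both can live in one file), and the input values
are arbitrary:

```c
#include <string.h>

/* Loop form from the first example.  */
static void
foo_loop (char *x)
{
  for (int i = 0; i < 6; i += 2)
    {
      x[i] += 1;
      x[i + 1] += 2;
    }
}

/* Unrolled form from the second example.  */
static void
foo_unrolled (char *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}

/* Returns 1 if both forms update the first six bytes identically and
   leave the two trailing bytes untouched, 0 otherwise.  */
static int
forms_agree (void)
{
  char a[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
  char b[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };

  foo_loop (a);
  foo_unrolled (b);
  return memcmp (a, b, sizeof a) == 0 && a[6] == 70 && a[7] == 80;
}
```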

These patches add support for vectorising the unrolled form of the
above function by allowing basic block SLP to use a predicate mask or
length limit. With the patches applied, the function is vectorised to
the following assembly code (using the same options as above):

foo:
        ptrue   p7.b, vl6
        ptrue   p6.b, all
        ld1b    z31.b, p7/z, [x0]
        adrp    x1, .LC0
        add     x1, x1, :lo12:.LC0
        ld1rqb  z30.b, p6/z, [x1]
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret

Predication is used only for groups whose size cannot be divided
exactly into vectors of a length that the target supports directly.
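
As a rough scalar model of what a predicated tail means for the
six-element group above (this is not the GCC implementation), the
sketch below mimics a load, add and store governed by a predicate
whose first ACTIVE lanes are set, as with ptrue p7.b, vl6. The vector
length of 16 bytes and the names VL_BYTES, predicated_add and
predicated_add_demo are assumptions made purely for illustration:

```c
#include <stddef.h>

enum { VL_BYTES = 16 };   /* Assumed vector length; SVE allows larger.  */

/* Scalar model of a predicated load, full-width add and predicated
   store.  ADDEND must provide VL_BYTES elements; only the first
   ACTIVE elements of X are read or written.  */
static void
predicated_add (char *x, const char *addend, size_t active)
{
  char vec[VL_BYTES] = { 0 };

  /* Predicated load: inactive lanes are zeroed and their memory is
     never accessed.  */
  for (size_t lane = 0; lane < VL_BYTES; lane++)
    if (lane < active)
      vec[lane] = x[lane];

  /* Unpredicated add: results in inactive lanes are simply unused.  */
  for (size_t lane = 0; lane < VL_BYTES; lane++)
    vec[lane] += addend[lane];

  /* Predicated store: only active lanes are written back.  */
  for (size_t lane = 0; lane < VL_BYTES; lane++)
    if (lane < active)
      x[lane] = vec[lane];
}

/* Usage example: returns 1 if six lanes were updated and the bytes
   beyond the group were left alone, 0 otherwise.  */
static int
predicated_add_demo (void)
{
  char x[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
  const char addend[VL_BYTES] = { 1, 2, 1, 2, 1, 2, 1, 2,
				  1, 2, 1, 2, 1, 2, 1, 2 };

  predicated_add (x, addend, 6);
  return x[0] == 11 && x[1] == 22 && x[4] == 51 && x[5] == 62
	 && x[6] == 70 && x[7] == 80;
}
```

The point of the model is that the bytes past the end of the group are
neither read nor written, which is what makes it safe to use a vector
wider than the group.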

Bootstrapped and tested on aarch64-linux-gnu.

The following test regressions are caused by this change and still
need investigation. Some are tests that simply need updating; others
are ICEs (internal compiler errors).

gfortran.dg/vect/pr99746.f90
c-c++-common/hwasan/handles-poly_int-marked-vars.c
gcc.dg/pr95713.c
gcc.dg/vect/bb-slp-17.c
gcc.dg/vect/bb-slp-4.c
gcc.dg/vect/bb-slp-pr95839-v8.c
gcc.dg/vect/no-scevccp-outer-10.c
gcc.dg/vect/pr46052.c
gcc.dg/vect/pr68305.c
gcc.dg/vect/slp-2.c
gcc.dg/vect/tsvc/vect-tsvc-s351.c
gcc.dg/vect/tsvc/vect-tsvc-s353.c
gcc.dg/vect/vect-over-widen-10.c
gcc.dg/vect/vect-over-widen-13.c
gcc.dg/vect/vect-over-widen-14.c
gcc.dg/vect/vect-over-widen-19.c
gcc.dg/vect/vect-over-widen-5.c
gcc.dg/vect/vect-over-widen-6.c
gcc.dg/vect/vect-over-widen-7.c
gcc.dg/vect/vect-over-widen-8.c
gcc.dg/vect/vect-over-widen-9.c
gcc.dg/vect/vect-shift-5.c
gcc.dg/vect/vect-strided-u8-i8-gap2-big-array.c
gcc.misc-tests/gcov-25.c
gcc.misc-tests/gcov-26.c
gcc.misc-tests/gcov-27.c
gcc.misc-tests/gcov-28.c
gcc.target/aarch64/popcnt-sve.c
gcc.target/aarch64/simd/faminmax-codegen-no-flag.c
gcc.target/aarch64/simd/faminmax-codegen.c
gcc.target/aarch64/sve/reduc_14.c
gcc.target/aarch64/sve/slp_6.c
gcc.target/aarch64/sve/slp_7_costly.c
gcc.target/aarch64/sve/truncated_concatenation_1.c
gcc.target/aarch64/vect_mixed_sizes_14.c
gcc.target/aarch64/vect_unary_1.c
c-c++-common/hwasan/handles-poly_int-marked-vars.c
g++.dg/opt/pr95528.C
g++.dg/vect/simd-complex-num-null-node.cc

Christopher Bazley (9):
  Track the minimum and maximum number of lanes for BB SLP
  Preparation to support predicated vector tails for BB SLP
  Implement recording/getting of mask/length for BB SLP
  Conditionally dump info on creation and destruction of SLP nodes
  Conditionally dump info on pushing vectorized defs
  Fix vexed ownership of stmts passed to vect_build_slp_instance
  Update constant creation for BB SLP with predicated tails
  Extend BB SLP vectorization to use predicated tails
  AArch64/SVE: Tests for use of predicated vector tails for BB SLP

 gcc/gimple-fold.cc                            |   2 +-
 .../gcc.target/aarch64/sve/slp_pred_1.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_1_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_2.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_3.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_3_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_4.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_5.c       |  36 ++
 .../gcc.target/aarch64/sve/slp_pred_6.c       |  39 ++
 .../gcc.target/aarch64/sve/slp_pred_6_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_7.c       |  38 ++
 .../gcc.target/aarch64/sve/slp_pred_harness.h |  28 +
 gcc/tree-vect-loop.cc                         |  10 +
 gcc/tree-vect-slp.cc                          | 568 ++++++++++++-----
 gcc/tree-vect-stmts.cc                        | 600 +++++++++++-------
 gcc/tree-vectorizer.h                         | 154 ++++-
 16 files changed, 1220 insertions(+), 405 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_5.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_harness.h

-- 
2.43.0
