Jennifer Schmitz <jschm...@nvidia.com> writes:
> If an SVE intrinsic has predicate pfalse, we can fold the call to
> a simplified assignment statement: For _m predication, the LHS can be
> assigned the operand for inactive values, and for _z, we can assign a
> zero vector.  For _x, the returned values can be arbitrary and, as
> suggested by Richard Sandiford, we fold to a zero vector.
>
> For example,
>   svint32_t foo (svint32_t op1, svint32_t op2)
>   {
>     return svadd_s32_m (svpfalse_b (), op1, op2);
>   }
> can be folded to lhs = op1, such that foo is compiled to just a RET.
>
> For implicit predication, a case distinction is necessary:
> Intrinsics that read from memory can be folded to a zero vector.
> Intrinsics that write to memory or prefetch can be folded to a no-op.
> Other intrinsics need case-by-case implementation, which we added in
> the corresponding svxxx_impl::fold.
>
> We implemented this optimization during gimple folding by calling a new
> method gimple_folder::fold_pfalse from gimple_folder::fold, which covers
> the generic cases described above.
>
> We tested the new behavior for each intrinsic with all supported
> predications and data types and checked the produced assembly.  There is
> a test file for each shape subclass with scan-assembler-times tests that
> look for the simplified instruction sequences, such as individual RET
> instructions or zeroing moves.  There is an additional directive
> counting the total number of functions in the test, which must be the
> sum of the counts of all other directives.  This checks that all tested
> intrinsics were optimized.
>
> A few intrinsics were not covered by this patch:
> - svlasta and svlastb already have an implementation that covers a
>   pfalse predicate.  No changes were made to them.
> - svld1/2/3/4 return aggregate types and were excluded from the case
>   that folds calls with implicit predication to lhs = {0, ...}.
> - svst1/2/3/4 already have an implementation in svstx_impl that precedes
>   our optimization, such that it is not triggered.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no
> regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
Thanks, looks really good.

> [...]
> @@ -2222,6 +2435,14 @@ class svpfirst_svpnext_impl : public function_base
> {
> public:
>   CONSTEXPR svpfirst_svpnext_impl (int unspec) : m_unspec (unspec) {}
> +  gimple *
> +  fold (gimple_folder &f) const override
> +  {
> +    tree pg = gimple_call_arg (f.call, 0);
> +    if (is_pfalse (pg))
> +      return f.fold_call_to (pg);
> +    return NULL;
> +  }

I think this should instead return the second argument for UNSPEC_PFIRST.

> [...]
> @@ -556,6 +576,24 @@ public:
>   }
> };
>
> +class svqshlu_impl : public unspec_based_function
> +{
> +public:
> +  CONSTEXPR svqshlu_impl ()
> +    : unspec_based_function (UNSPEC_SQSHLU, -1, -1) {}
> +  gimple *
> +  fold (gimple_folder &f) const override
> +  {
> +    if (is_pfalse (gimple_call_arg (f.call, 0)))
> +      {
> +        tree rhs = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (f.lhs),
> +                           gimple_call_arg (f.call, 1));
> +        return gimple_build_assign (f.lhs, VIEW_CONVERT_EXPR, rhs);
> +      }
> +    return NULL;
> +  }
> +};
> +

Could we handle this in the generic code instead?  (More below.)

> [...]
> @@ -3622,6 +3631,51 @@ gimple_folder::redirect_pred_x ()
>   return redirect_call (instance);
> }
>
> +/* Fold calls with predicate pfalse:
> +   _m predication: lhs = op1.
> +   _x or _z: lhs = {0, ...}.
> +   Implicit predication that reads from memory: lhs = {0, ...}.
> +   Implicit predication that writes to memory or prefetches: no-op.
> +   Return the new gimple statement on success, else NULL.  */
> +gimple *
> +gimple_folder::fold_pfalse ()
> +{
> +  if (pred == PRED_none)
> +    return nullptr;
> +  tree arg0 = gimple_call_arg (call, 0);
> +  if (pred == PRED_m)
> +    {
> +      /* Unary function shapes with _m predication are folded to the
> +         inactive vector (arg0), while other function shapes are folded
> +         to op1 (arg1).
> +         In some intrinsics, e.g. svqshlu, lhs and op1 have different
> +         types, such that folding to op1 is not appropriate.  */
> +      tree arg1 = gimple_call_arg (call, 1);
> +      tree t;
> +      if (is_pfalse (arg1))
> +        t = arg0;
> +      else if (is_pfalse (arg0))
> +        t = arg1;
> +      else return nullptr;

Formatting nit: the return should be on its own line.

> +      if (TREE_TYPE (t) == TREE_TYPE (lhs))
> +        return fold_call_to (t);

Going back to the question above, is there a case that this type check
has to reject for correctness reasons?  (I should try to work this out
for myself, sorry, but I'm guessing you already have an example.)

> +    }
> +  if ((pred == PRED_x || pred == PRED_z) && is_pfalse (arg0))
> +    return fold_call_to (build_zero_cst (TREE_TYPE (lhs)));
> +  if (pred == PRED_implicit && is_pfalse (arg0))
> +    {
> +      unsigned int flags = call_properties ();
> +      /* Folding to lhs = {0, ...} is not appropriate for intrinsics
> +         with AGGREGATE types as lhs.  */

We could probably fold to a __builtin_memset call instead, but yeah,
that's probably best left as future work.

> +      if ((flags & CP_READ_MEMORY)
> +          && !AGGREGATE_TYPE_P (TREE_TYPE (lhs)))
> +        return fold_call_to (build_zero_cst (TREE_TYPE (lhs)));
> +      if (flags & (CP_WRITE_MEMORY | CP_PREFETCH_MEMORY))
> +        return fold_to_stmt_vops (gimple_build_nop ());
> +    }
> +  return nullptr;
> +}
> +
> /* Fold the call to constant VAL.  */
> gimple *
> gimple_folder::fold_to_cstu (poly_uint64 val)
> [...]
> diff --git a/gcc/testsuite/gcc.target/aarch64/pfalse-binary_0.c
> b/gcc/testsuite/gcc.target/aarch64/pfalse-binary_0.c
> new file mode 100644
> index 00000000000..72fb5b6ac6d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pfalse-binary_0.c

It looks like this and pfalse-unary_0.c are more like header files than
standalone tests.  It would probably be better to turn them into .h
files if so.
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pfalse-binary.c
> b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-binary.c
> new file mode 100644
> index 00000000000..c1779818fc5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-binary.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#include "../pfalse-binary_0.c"
> +
> +B (brkn, Zv)
> +B (brkpa, Zv)
> +B (brkpb, Zv)
> +ALL_DATA (splice, IMPLICITv)
> +
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\tpfalse\tp0\.b\n\tret\n} 3 } } */
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\tmov\tz0\.d, z1\.d\n\tret\n} 12 } } */
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n} 15 } } */

Nice approach!  However, I think ".cfi_startproc" is specific to ELF and
so will fail on Darwin.  Rather than try to work with both, it would
probably be easier to add:

/* { dg-require-effective-target elf } */

after the dg-do line.  Same for the other tests.

> [...]
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pfalse-reduction.c
> b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-reduction.c
> new file mode 100644
> index 00000000000..9e74b0d1a35
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-reduction.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +#include "../pfalse-unary_0.c"
> +
> +ALL_INTEGER_SCALAR (andv, IMPLICIT)
> +ALL_INTEGER_SCALAR (eorv, IMPLICIT)
> +ALL_ARITH_SCALAR (maxv, IMPLICIT)
> +ALL_ARITH_SCALAR (minv, IMPLICIT)
> +ALL_INTEGER_SCALAR (orv, IMPLICIT)
> +ALL_FLOAT_SCALAR (maxnmv, IMPLICIT)
> +ALL_FLOAT_SCALAR (minnmv, IMPLICIT)
> +
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\tmov\t[wx]0, 0\n\tret\n} 20 } } */
> +/* { dg-final { scan-tree-dump-times "return Nan" 6 "optimized" } } */
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\tmov\t[wxz]0(?:\.[sd])?, #?-?[1-9]+[0-9]*\n\tret\n} 27 } } */

This is quite a liberal
regexp.  Could we add one that matches specifically -1, for the ANDV
and UMINV cases?  It could be in addition to this one, to avoid
complicating things by excluding -1 above.

> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\t(movi|mvni)\tv0\.(2s|4h), 0x[0-9a-f]+, [a-z]+ [0-9]+\n\tret\n} 5 } } */
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n} 52 } } */
> [...]
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pfalse-store.c
> b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-store.c
> new file mode 100644
> index 00000000000..3da28712988
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-store.c
> @@ -0,0 +1,50 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#include <arm_sve.h>
> +
> +#define T(F, TY1, TY2) \
> +  void F##_f (TY1 *base, sv##TY2 data) \
> +  { \
> +    return sv##F (svpfalse_b (), base, data); \
> +  }
> +
> +#define D_INTEGER(F, TY) \
> +  T (F##_s64, TY, int64_t) \
> +  T (F##_u64, TY, uint64_t)
> +
> +#define SD_INTEGER(F, TY) \
> +  D_INTEGER (F, TY) \
> +  T (F##_s32, TY, int32_t) \
> +  T (F##_u32, TY, uint32_t)
> +
> +#define HSD_INTEGER(F, TY) \
> +  SD_INTEGER (F, TY) \
> +  T (F##_s16, TY, int16_t) \
> +  T (F##_u16, TY, uint16_t)
> +
> +#define ALL_DATA(F, A) \
> +  T (F##_bf16, bfloat16_t, bfloat16##A) \
> +  T (F##_f16, float16_t, float16##A) \
> +  T (F##_f32, float32_t, float32##A) \
> +  T (F##_f64, float64_t, float64##A) \
> +  T (F##_s8, int8_t, int8##A) \
> +  T (F##_s16, int16_t, int16##A) \
> +  T (F##_s32, int32_t, int32##A) \
> +  T (F##_s64, int64_t, int64##A) \
> +  T (F##_u8, uint8_t, uint8##A) \
> +  T (F##_u16, uint16_t, uint16##A) \
> +  T (F##_u32, uint32_t, uint32##A) \
> +  T (F##_u64, uint64_t, uint64##A) \
> +
> +HSD_INTEGER (st1b, int8_t)
> +SD_INTEGER (st1h, int16_t)
> +D_INTEGER (st1w, int32_t)
> +ALL_DATA (st1, _t)
> +ALL_DATA (st2, x2_t)
> +ALL_DATA (st3, x3_t)
> +ALL_DATA (st4, x4_t)
> +
> +
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\tret\n} 12 } } */
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n} 60 } } */

What happens for the other 48?  I assume they're for ST[234]?  The test
above is OK if we can't optimise those properly yet, but it would be
good to add a FIXME comment above the first regexp to explain that it
ought to be 60 eventually.

Thanks,
Richard