Jennifer Schmitz <jschm...@nvidia.com> writes:
> If an SVE intrinsic is called with a pfalse predicate, we can fold the call
> to a simplified assignment statement: for _m predication, the LHS can be
> assigned the operand for inactive values, and for _z, we can assign a zero
> vector. For _x, the returned values may be arbitrary; as suggested by
> Richard Sandiford, we fold to a zero vector.
>
> For example,
> svint32_t foo (svint32_t op1, svint32_t op2)
> {
>   return svadd_s32_m (svpfalse_b (), op1, op2);
> }
> can be folded to lhs = op1, such that foo is compiled to just a RET.
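>
> Similarly, with _z predication,
> svint32_t bar (svint32_t op1, svint32_t op2)
> {
>   return svadd_s32_z (svpfalse_b (), op1, op2);
> }
> is folded to lhs = {0, ...}, so that bar compiles to a single zeroing
> move followed by a RET.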
>
> For implicit predication, a case distinction is necessary:
> Intrinsics that read from memory can be folded to a zero vector.
> Intrinsics that write to memory or prefetch can be folded to a no-op.
> Other intrinsics need case-by-case implementation, which we added in
> the corresponding svxxx_impl::fold.
>
> We implemented this optimization during gimple folding by calling a new method
> gimple_folder::fold_pfalse from gimple_folder::fold, which covers the generic
> cases described above.
>
> We tested the new behavior for each intrinsic with all supported predications
> and data types and checked the produced assembly. There is a test file
> for each shape subclass with scan-assembler-times tests that look for
> the simplified instruction sequences, such as individual RET instructions
> or zeroing moves. An additional directive counts the total number of
> functions in the test, which must equal the sum of the counts of all
> other directives; this checks that all tested intrinsics were optimized.
>
> A few intrinsics were not covered by this patch:
> - svlasta and svlastb already have an implementation to cover a pfalse
> predicate. No changes were made to them.
> - svld1/2/3/4 return aggregate types and were excluded from the case
> that folds calls with implicit predication to lhs = {0, ...}.
> - svst1/2/3/4 already have an implementation in svstx_impl that precedes
> our optimization, so it is not triggered.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu with no regressions.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>

Thanks, looks really good.

> [...]
> @@ -2222,6 +2435,14 @@ class svpfirst_svpnext_impl : public function_base
>  {
>  public:
>    CONSTEXPR svpfirst_svpnext_impl (int unspec) : m_unspec (unspec) {}
> +  gimple *
> +  fold (gimple_folder &f) const override
> +  {
> +    tree pg = gimple_call_arg (f.call, 0);
> +    if (is_pfalse (pg))
> +      return f.fold_call_to (pg);
> +    return NULL;
> +  }

I think this should instead return the second argument for UNSPEC_PFIRST.
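Untested, but I'm thinking of something along these lines, since with an
all-false governing predicate, PFIRST leaves its second operand unchanged:

  gimple *
  fold (gimple_folder &f) const override
  {
    if (is_pfalse (gimple_call_arg (f.call, 0)))
      return f.fold_call_to (m_unspec == UNSPEC_PFIRST
			     ? gimple_call_arg (f.call, 1)
			     : gimple_call_arg (f.call, 0));
    return NULL;
  }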

> [...]
> @@ -556,6 +576,24 @@ public:
>    }
>  };
>  
> +class svqshlu_impl : public unspec_based_function
> +{
> +public:
> +  CONSTEXPR svqshlu_impl ()
> +    : unspec_based_function (UNSPEC_SQSHLU, -1, -1) {}
> +  gimple *
> +  fold (gimple_folder &f) const override
> +  {
> +    if (is_pfalse (gimple_call_arg (f.call, 0)))
> +      {
> +     tree rhs = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (f.lhs),
> +                        gimple_call_arg (f.call, 1));
> +     return gimple_build_assign (f.lhs, VIEW_CONVERT_EXPR, rhs);
> +      }
> +    return NULL;
> +  }
> +};
> +

Could we handle this in the generic code instead?  (More below.)

> [...]
> @@ -3622,6 +3631,51 @@ gimple_folder::redirect_pred_x ()
>    return redirect_call (instance);
>  }
>  
> +/* Fold calls with predicate pfalse:
> +   _m predication: lhs = op1.
> +   _x or _z: lhs = {0, ...}.
> +   Implicit predication that reads from memory: lhs = {0, ...}.
> +   Implicit predication that writes to memory or prefetches: no-op.
> +   Return the new gimple statement on success, else NULL.  */
> +gimple *
> +gimple_folder::fold_pfalse ()
> +{
> +  if (pred == PRED_none)
> +    return nullptr;
> +  tree arg0 = gimple_call_arg (call, 0);
> +  if (pred == PRED_m)
> +    {
> +      /* Unary function shapes with _m predication are folded to the
> +      inactive vector (arg0), while other function shapes are folded
> +      to op1 (arg1).
> +      In some intrinsics, e.g. svqshlu, lhs and op1 have different types,
> +      such that folding to op1 is not appropriate.  */
> +      tree arg1 = gimple_call_arg (call, 1);
> +      tree t;
> +      if (is_pfalse (arg1))
> +     t = arg0;
> +      else if (is_pfalse (arg0))
> +     t = arg1;
> +      else return nullptr;

Formatting nit: the return should be on its own line.
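
i.e.:

      else
	return nullptr;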

> +      if (TREE_TYPE (t) == TREE_TYPE (lhs))
> +     return fold_call_to (t);

Going back to the question above, is there a case that this type check
has to reject for correctness reasons?  (I should try to work this out
for myself, sorry, but I'm guessing you already have an example.)
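
If the only mismatch is the signed/unsigned difference in svqshlu, then
(completely untested) the generic code could perhaps absorb that case
with a view-convert, along the lines of:

      if (TREE_TYPE (t) == TREE_TYPE (lhs))
	return fold_call_to (t);
      tree rhs = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), t);
      return fold_to_stmt_vops (gimple_build_assign (lhs, VIEW_CONVERT_EXPR,
						     rhs));

which would only be valid if the two types are guaranteed to have the
same size here.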

> +    }
> +  if ((pred == PRED_x || pred == PRED_z) && is_pfalse (arg0))
> +    return fold_call_to (build_zero_cst (TREE_TYPE (lhs)));
> +  if (pred == PRED_implicit && is_pfalse (arg0))
> +    {
> +      unsigned int flags = call_properties ();
> +      /* Folding to lhs = {0, ...} is not appropriate for intrinsics with
> +      AGGREGATE types as lhs.  */

We could probably fold to a __builtin_memset call instead, but yeah,
that's probably best left as future work.
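
If anyone does pick that up, I'd expect the fold to look something like
this (completely untested, and only valid if the lhs is addressable):

      tree fn = builtin_decl_implicit (BUILT_IN_MEMSET);
      tree addr = build_fold_addr_expr (lhs);
      return fold_to_stmt_vops
	(gimple_build_call (fn, 3, addr, integer_zero_node,
			    TYPE_SIZE_UNIT (TREE_TYPE (lhs))));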

> +      if ((flags & CP_READ_MEMORY)
> +       && !AGGREGATE_TYPE_P (TREE_TYPE (lhs)))
> +     return fold_call_to (build_zero_cst (TREE_TYPE (lhs)));
> +      if (flags & (CP_WRITE_MEMORY | CP_PREFETCH_MEMORY))
> +     return fold_to_stmt_vops (gimple_build_nop ());
> +    }
> +  return nullptr;
> +}
> +
>  /* Fold the call to constant VAL.  */
>  gimple *
>  gimple_folder::fold_to_cstu (poly_uint64 val)
> [...]
> diff --git a/gcc/testsuite/gcc.target/aarch64/pfalse-binary_0.c b/gcc/testsuite/gcc.target/aarch64/pfalse-binary_0.c
> new file mode 100644
> index 00000000000..72fb5b6ac6d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pfalse-binary_0.c

It looks like this and pfalse-unary_0.c are more like header files
than standalone tests.  It would probably be better to turn them
into .h files if so.

> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pfalse-binary.c b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-binary.c
> new file mode 100644
> index 00000000000..c1779818fc5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-binary.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#include "../pfalse-binary_0.c"
> +
> +B (brkn, Zv)
> +B (brkpa, Zv)
> +B (brkpb, Zv)
> +ALL_DATA (splice, IMPLICITv)
> +
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\tpfalse\tp0\.b\n\tret\n} 3 } } */
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\tmov\tz0\.d, z1\.d\n\tret\n} 12 } } */
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n} 15 } } */

Nice approach!  However, I think ".cfi_startproc" is specific to ELF
and so will fail on Darwin.  Rather than try to work with both, it would
probably be easier to add:

/* { dg-require-effective-target elf } */

after the dg-do line.  Same for the other tests.

> [...]
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pfalse-reduction.c b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-reduction.c
> new file mode 100644
> index 00000000000..9e74b0d1a35
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-reduction.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +#include "../pfalse-unary_0.c"
> +
> +ALL_INTEGER_SCALAR (andv, IMPLICIT)
> +ALL_INTEGER_SCALAR (eorv, IMPLICIT)
> +ALL_ARITH_SCALAR (maxv, IMPLICIT)
> +ALL_ARITH_SCALAR (minv, IMPLICIT)
> +ALL_INTEGER_SCALAR (orv, IMPLICIT)
> +ALL_FLOAT_SCALAR (maxnmv, IMPLICIT)
> +ALL_FLOAT_SCALAR (minnmv, IMPLICIT)
> +
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\tmov\t[wx]0, 0\n\tret\n} 20 } } */
> +/* { dg-final { scan-tree-dump-times "return  Nan" 6 "optimized" } } */
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\tmov\t[wxz]0(?:\.[sd])?, #?-?[1-9]+[0-9]*\n\tret\n} 27 } } */

This is quite a liberal regexp.  Could we add one that matches specifically
-1, for the ANDV and UMINV cases?  It could be in addition to this one,
to avoid complicating things by excluding -1 above.
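
Something along these lines, with the count filled in:

/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\tmov\t[wx]0, #?-1\n\tret\n} <count> } } */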

> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\t(movi|mvni)\tv0\.(2s|4h), 0x[0-9a-f]+, [a-z]+ [0-9]+\n\tret\n} 5 } } */
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n} 52 } } */
> [...]
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pfalse-store.c b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-store.c
> new file mode 100644
> index 00000000000..3da28712988
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pfalse-store.c
> @@ -0,0 +1,50 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#include <arm_sve.h>
> +
> +#define T(F, TY1, TY2)                                       \
> +  void F##_f (TY1 *base, sv##TY2 data)                       \
> +  {                                                  \
> +    return sv##F (svpfalse_b (), base, data);                \
> +  }
> +
> +#define D_INTEGER(F, TY)                             \
> +  T (F##_s64, TY, int64_t)                           \
> +  T (F##_u64, TY, uint64_t)
> +
> +#define SD_INTEGER(F, TY)                            \
> +  D_INTEGER (F, TY)                                  \
> +  T (F##_s32, TY, int32_t)                           \
> +  T (F##_u32, TY, uint32_t)
> +
> +#define HSD_INTEGER(F, TY)                           \
> +  SD_INTEGER (F, TY)                                 \
> +  T (F##_s16, TY, int16_t)                           \
> +  T (F##_u16, TY, uint16_t)
> +
> +#define ALL_DATA(F, A)                                       \
> +  T (F##_bf16, bfloat16_t, bfloat16##A)                      \
> +  T (F##_f16, float16_t, float16##A)                 \
> +  T (F##_f32, float32_t, float32##A)                 \
> +  T (F##_f64, float64_t, float64##A)                 \
> +  T (F##_s8, int8_t, int8##A)                                \
> +  T (F##_s16, int16_t, int16##A)                     \
> +  T (F##_s32, int32_t, int32##A)                     \
> +  T (F##_s64, int64_t, int64##A)                     \
> +  T (F##_u8, uint8_t, uint8##A)                              \
> +  T (F##_u16, uint16_t, uint16##A)                   \
> +  T (F##_u32, uint32_t, uint32##A)                   \
> +  T (F##_u64, uint64_t, uint64##A)                   \
> +
> +HSD_INTEGER (st1b, int8_t)
> +SD_INTEGER (st1h, int16_t)
> +D_INTEGER (st1w, int32_t)
> +ALL_DATA (st1, _t)
> +ALL_DATA (st2, x2_t)
> +ALL_DATA (st3, x3_t)
> +ALL_DATA (st4, x4_t)
> +
> +
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n\tret\n} 12 } } */
> +/* { dg-final { scan-assembler-times {\t.cfi_startproc\n} 60 } } */

What happens for the other 48?  I assume they're for ST[234]?

The test above is OK if we can't optimise those properly yet, but it would
be good to add a FIXME comment above the first regexp to explain that
it ought to be 60 eventually.
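
E.g. something like:

/* FIXME: This should be 60 once ST[234] with a pfalse predicate are
   also folded to no-ops.  */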

Thanks,
Richard
