Jennifer Schmitz <jschm...@nvidia.com> writes:
>> On 22 Oct 2024, at 18:21, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> A common idiom in intrinsics loops is to have accumulator intrinsics
>>> in an unrolled loop with an accumulator initialized to zero at the
>>> beginning.
>>> Propagating the initial zero accumulator into the first iteration
>>> of the loop and simplifying the first accumulate instruction is a
>>> desirable transformation that we should teach GCC.
>>> Therefore, this patch folds svsra to svlsr/svasr if op1 is all zeros,
>>> producing the lower-latency instructions LSR/ASR instead of USRA/SSRA.
>>> We implemented this optimization in svsra_impl::fold.
>>> Because svlsr/svasr are predicated intrinsics, we added a ptrue
>>> predicate.  Additionally, the width of the shift amount (imm3) was
>>> adjusted to fit the function type.
>>> In order to create the ptrue predicate, a new helper function
>>> build_ptrue was added.  We also refactored gimple_folder::fold_to_ptrue
>>> to use the new helper function.
>>>
>>> Tests were added to check the produced assembly for use of LSR/ASR.
>>>
>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no
>>> regression.
>>> OK for mainline?
>>>
>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>
>>> gcc/
>>> 	* config/aarch64/aarch64-sve-builtins-sve2.cc
>>> 	(svsra_impl::fold): Fold svsra to svlsr/svasr if op1 is all zeros.
>>> 	* config/aarch64/aarch64-sve-builtins.cc (build_ptrue): New
>>> 	function that returns a ptrue tree.
>>> 	(gimple_folder::fold_to_ptrue): Refactor to use build_ptrue.
>>> 	* config/aarch64/aarch64-sve-builtins.h: Declare build_ptrue.
>>>
>>> gcc/testsuite/
>>> 	* gcc.target/aarch64/sve2/acle/asm/sra_s32.c: New test.
>>> 	* gcc.target/aarch64/sve2/acle/asm/sra_s64.c: Likewise.
>>> 	* gcc.target/aarch64/sve2/acle/asm/sra_u32.c: Likewise.
>>> 	* gcc.target/aarch64/sve2/acle/asm/sra_u64.c: Likewise.
>>> ---
>>>  .../aarch64/aarch64-sve-builtins-sve2.cc   | 29 +++++++++++++++++++
>>>  gcc/config/aarch64/aarch64-sve-builtins.cc | 28 +++++++++++-------
>>>  gcc/config/aarch64/aarch64-sve-builtins.h  |  1 +
>>>  .../aarch64/sve2/acle/asm/sra_s32.c        |  9 ++++++
>>>  .../aarch64/sve2/acle/asm/sra_s64.c        |  9 ++++++
>>>  .../aarch64/sve2/acle/asm/sra_u32.c        |  9 ++++++
>>>  .../aarch64/sve2/acle/asm/sra_u64.c        |  9 ++++++
>>>  7 files changed, 83 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>> index 6a20a613f83..0990918cc45 100644
>>> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>> @@ -417,6 +417,35 @@ public:
>>>
>>>  class svsra_impl : public function_base
>>>  {
>>> +public:
>>> +  gimple *
>>> +  fold (gimple_folder &f) const override
>>> +  {
>>> +    /* Fold to svlsr/svasr if op1 is all zeros.  */
>>> +    tree op1 = gimple_call_arg (f.call, 0);
>>> +    if (!integer_zerop (op1))
>>> +      return NULL;
>>> +    function_instance instance ("svlsr", functions::svlsr,
>>> +				shapes::binary_uint_opt_n, MODE_n,
>>> +				f.type_suffix_ids, GROUP_none, PRED_x);
>>> +    if (!f.type_suffix (0).unsigned_p)
>>> +      {
>>> +	instance.base_name = "svasr";
>>> +	instance.base = functions::svasr;
>>> +      }
>>> +    gcall *call = f.redirect_call (instance);
>>> +    unsigned int element_bytes = f.type_suffix (0).element_bytes;
>>> +    /* Add a ptrue as predicate, because unlike svsra, svlsr/svasr are
>>> +       predicated intrinsics.  */
>>> +    gimple_call_set_arg (call, 0, build_ptrue (element_bytes));
>>
>> Maybe it would be simpler to use build_all_ones_cst (f.gp_type ()).
>> Unlike for fold_to_ptrue (which produces output predicates),
>> we don't need the upper bits of each predicate element to be zero.
>>
>>> +    /* For svsra, the shift amount (imm3) is uint64_t for all function types,
>>> +       but for svlsr/svasr, imm3 has the same width as the function type.  */
>>> +    tree imm3 = gimple_call_arg (f.call, 2);
>>> +    tree imm3_prec = wide_int_to_tree (scalar_types[f.type_suffix (0).vector_type],
>>
>> Nit: long line.  The easiest way of avoiding it would be to use
>> f.scalar_type (0) instead.
>>
>>> +				       wi::to_wide (imm3, element_bytes));
>>
>> This works correctly, but FWIW, it's a little simpler to use
>> wi::to_widest (imm3) instead.  No need to change though.
> Thanks for the review.
> I made the following changes (updated patch below):
> - replaced the call to build_ptrue in svsra_impl::fold with build_all_ones_cst
> - removed build_ptrue and its declaration; reverted fold_to_ptrue
> - refactored the type conversion of the shift amount to be more compact
> Thanks,
> Jennifer
>
> A common idiom in intrinsics loops is to have accumulator intrinsics
> in an unrolled loop with an accumulator initialized to zero at the beginning.
> Propagating the initial zero accumulator into the first iteration
> of the loop and simplifying the first accumulate instruction is a
> desirable transformation that we should teach GCC.
> Therefore, this patch folds svsra to svlsr/svasr if op1 is all zeros,
> producing the lower-latency instructions LSR/ASR instead of USRA/SSRA.
> We implemented this optimization in svsra_impl::fold.
>
> Tests were added to check the produced assembly for use of LSR/ASR.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
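FWIW, the kind of loop being described is roughly the sketch below (purely
illustrative; the function and variable names are invented and none of this
comes from the patch or the testsuite; it assumes SVE2 for svsra).  Once the
loop is unrolled and the zero initializer is propagated into the first
iteration, the first svsra call sees an all-zeros op1, which is exactly the
integer_zerop case that the new fold handles:

  #include <arm_sve.h>

  /* Accumulate (src[i] >> 2) per lane.  acc starts as all zeros, so after
     unrolling the first svsra in the loop body is really just a shift.  */
  svuint32_t
  accumulate_shifted (const uint32_t *src, int n)
  {
    svuint32_t acc = svdup_u32 (0);
    for (int i = 0; i < n; i += svcntw ())
      {
        svbool_t pg = svwhilelt_b32 (i, n);
        svuint32_t x = svld1 (pg, src + i);
        acc = svsra (acc, x, 2);
      }
    return acc;
  }

Whether the unrolling and constant propagation actually happen depends on the
usual heuristics, of course; the fold itself only triggers once a call with a
literal all-zeros accumulator is exposed.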

OK, thanks.

Richard

> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>
> gcc/
> 	* config/aarch64/aarch64-sve-builtins-sve2.cc
> 	(svsra_impl::fold): Fold svsra to svlsr/svasr if op1 is all zeros.
>
> gcc/testsuite/
> 	* gcc.target/aarch64/sve2/acle/asm/sra_s32.c: New test.
> 	* gcc.target/aarch64/sve2/acle/asm/sra_s64.c: Likewise.
> 	* gcc.target/aarch64/sve2/acle/asm/sra_u32.c: Likewise.
> 	* gcc.target/aarch64/sve2/acle/asm/sra_u64.c: Likewise.
> ---
>  .../aarch64/aarch64-sve-builtins-sve2.cc | 28 +++++++++++++++++++
>  .../aarch64/sve2/acle/asm/sra_s32.c      |  9 ++++++
>  .../aarch64/sve2/acle/asm/sra_s64.c      |  9 ++++++
>  .../aarch64/sve2/acle/asm/sra_u32.c      |  9 ++++++
>  .../aarch64/sve2/acle/asm/sra_u64.c      |  9 ++++++
>  5 files changed, 64 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> index 6a20a613f83..ddd6e466ee3 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> @@ -417,6 +417,34 @@ public:
>
>  class svsra_impl : public function_base
>  {
> +public:
> +  gimple *
> +  fold (gimple_folder &f) const override
> +  {
> +    /* Fold to svlsr/svasr if op1 is all zeros.  */
> +    tree op1 = gimple_call_arg (f.call, 0);
> +    if (!integer_zerop (op1))
> +      return NULL;
> +    function_instance instance ("svlsr", functions::svlsr,
> +				shapes::binary_uint_opt_n, MODE_n,
> +				f.type_suffix_ids, GROUP_none, PRED_x);
> +    if (!f.type_suffix (0).unsigned_p)
> +      {
> +	instance.base_name = "svasr";
> +	instance.base = functions::svasr;
> +      }
> +    gcall *call = f.redirect_call (instance);
> +    /* Add a ptrue as predicate, because unlike svsra, svlsr/svasr are
> +       predicated intrinsics.  */
> +    gimple_call_set_arg (call, 0, build_all_ones_cst (f.gp_type ()));
> +    /* For svsra, the shift amount (imm3) is uint64_t for all function types,
> +       but for svlsr/svasr, imm3 has the same width as the function type.  */
> +    tree imm3 = gimple_call_arg (f.call, 2);
> +    tree imm3_prec = wide_int_to_tree (f.scalar_type (0),
> +				       wi::to_widest (imm3));
> +    gimple_call_set_arg (call, 2, imm3_prec);
> +    return call;
> +  }
>  public:
>    rtx
>    expand (function_expander &e) const override
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c
> index ac992dc7b1c..86cf4bd8137 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c
> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_32_s32_tied2, svint32_t,
>  TEST_UNIFORM_Z (sra_32_s32_untied, svint32_t,
> 		z0 = svsra_n_s32 (z1, z2, 32),
> 		z0 = svsra (z1, z2, 32))
> +
> +/*
> +** sra_2_s32_zeroop1:
> +**	asr	z0\.s, z1\.s, #2
> +**	ret
> +*/
> +TEST_UNIFORM_Z (sra_2_s32_zeroop1, svint32_t,
> +		z0 = svsra_n_s32 (svdup_s32 (0), z1, 2),
> +		z0 = svsra (svdup_s32 (0), z1, 2))
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c
> index 9ea5657ab88..7b39798ba1d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c
> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_64_s64_tied2, svint64_t,
>  TEST_UNIFORM_Z (sra_64_s64_untied, svint64_t,
> 		z0 = svsra_n_s64 (z1, z2, 64),
> 		z0 = svsra (z1, z2, 64))
> +
> +/*
> +** sra_2_s64_zeroop1:
> +**	asr	z0\.d, z1\.d, #2
> +**	ret
> +*/
> +TEST_UNIFORM_Z (sra_2_s64_zeroop1, svint64_t,
> +		z0 = svsra_n_s64 (svdup_s64 (0), z1, 2),
> +		z0 = svsra (svdup_s64 (0), z1, 2))
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c
> index 090245153f7..001e09ca78d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c
> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_32_u32_tied2, svuint32_t,
>  TEST_UNIFORM_Z (sra_32_u32_untied, svuint32_t,
> 		z0 = svsra_n_u32 (z1, z2, 32),
> 		z0 = svsra (z1, z2, 32))
> +
> +/*
> +** sra_2_u32_zeroop1:
> +**	lsr	z0\.s, z1\.s, #2
> +**	ret
> +*/
> +TEST_UNIFORM_Z (sra_2_u32_zeroop1, svuint32_t,
> +		z0 = svsra_n_u32 (svdup_u32 (0), z1, 2),
> +		z0 = svsra (svdup_u32 (0), z1, 2))
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c
> index ff21c368b72..780cf7a7ff6 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c
> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_64_u64_tied2, svuint64_t,
>  TEST_UNIFORM_Z (sra_64_u64_untied, svuint64_t,
> 		z0 = svsra_n_u64 (z1, z2, 64),
> 		z0 = svsra (z1, z2, 64))
> +
> +/*
> +** sra_2_u64_zeroop1:
> +**	lsr	z0\.d, z1\.d, #2
> +**	ret
> +*/
> +TEST_UNIFORM_Z (sra_2_u64_zeroop1, svuint64_t,
> +		z0 = svsra_n_u64 (svdup_u64 (0), z1, 2),
> +		z0 = svsra (svdup_u64 (0), z1, 2))
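
To spell out the effect at the source level (an informal before/after sketch,
not the gimple the folder actually produces; the function names are invented):
with an all-zeros accumulator, svsra degenerates into a plain shift, and the
fold effectively rewrites it to the predicated shift intrinsic with an
all-true governing predicate and the shift amount converted to the element
type:

  #include <arm_sve.h>

  svuint32_t
  before (svuint32_t x)
  {
    /* Accumulate into an all-zeros accumulator; without the fold this
       would go through USRA.  */
    return svsra (svdup_u32 (0), x, 2);
  }

  svuint32_t
  after (svuint32_t x)
  {
    /* Roughly what the fold produces: a predicated logical shift with an
       all-true predicate, emitted as a single LSR (the signed variants
       map to svasr_x/ASR instead).  */
    return svlsr_x (svptrue_b32 (), x, 2);
  }

With the patch applied, both should compile to a single LSR, which is what
the new *_zeroop1 tests above check (ASR for the signed types).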