Jennifer Schmitz <jschm...@nvidia.com> writes:
>> On 22 Oct 2024, at 18:21, Richard Sandiford <richard.sandif...@arm.com> wrote:
>> 
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> A common idiom in intrinsics loops is to have accumulator intrinsics
>>> in an unrolled loop with an accumulator initialized to zero at the 
>>> beginning.
>>> Propagating the initial zero accumulator into the first iteration
>>> of the loop and simplifying the first accumulate instruction is a
>>> desirable transformation that we should teach GCC.
>>> Therefore, this patch folds svsra to svlsr/svasr if op1 is all zeros,
>>> producing the lower latency instructions LSR/ASR instead of USRA/SSRA.
>>> We implemented this optimization in svsra_impl::fold.
>>> Because svlsr/svasr are predicated intrinsics, we added a ptrue
>>> predicate. Additionally, the width of the shift amount (imm3) was
>>> adjusted to fit the function type.
>>> In order to create the ptrue predicate, a new helper function
>>> build_ptrue was added. We also refactored gimple_folder::fold_to_ptrue
>>> to use the new helper function.
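
For reference, a minimal sketch of the idiom being described might look like
the following.  It is purely illustrative and not taken from the patch or its
testsuite: the names accumulate_shifted, src and n are invented, and it
assumes -march=armv8-a+sve2 with n a multiple of the vector length.

#include <arm_sve.h>

uint64_t
accumulate_shifted (const uint32_t *src, int n)
{
  svbool_t pg = svptrue_b32 ();
  svuint32_t acc = svdup_u32 (0);        /* zero-initialized accumulator  */
  for (int i = 0; i < n; i += svcntw ())
    {
      svuint32_t x = svld1 (pg, src + i);
      acc = svsra (acc, x, 2);           /* first iteration sees op1 == 0  */
    }
  return svaddv (pg, acc);               /* svaddv on u32 returns uint64_t  */
}

Once unrolling propagates the zero accumulator into the first svsra call, that
call is equivalent to svlsr_x (pg, x, 2), which is the form the fold below
produces.
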
>>> 
>>> Tests were added to check the produced assembly for use of LSR/ASR.
>>> 
>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>> regression.
>>> OK for mainline?
>>> 
>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>> 
>>> gcc/
>>>      * config/aarch64/aarch64-sve-builtins-sve2.cc
>>>      (svsra_impl::fold): Fold svsra to svlsr/svasr if op1 is all zeros.
>>>      * config/aarch64/aarch64-sve-builtins.cc (build_ptrue): New
>>>      function that returns a ptrue tree.
>>>      (gimple_folder::fold_to_ptrue): Refactor to use build_ptrue.
>>>      * config/aarch64/aarch64-sve-builtins.h: Declare build_ptrue.
>>> 
>>> gcc/testsuite/
>>>      * gcc.target/aarch64/sve2/acle/asm/sra_s32.c: New test.
>>>      * gcc.target/aarch64/sve2/acle/asm/sra_s64.c: Likewise.
>>>      * gcc.target/aarch64/sve2/acle/asm/sra_u32.c: Likewise.
>>>      * gcc.target/aarch64/sve2/acle/asm/sra_u64.c: Likewise.
>>> ---
>>> .../aarch64/aarch64-sve-builtins-sve2.cc      | 29 +++++++++++++++++++
>>> gcc/config/aarch64/aarch64-sve-builtins.cc    | 28 +++++++++++-------
>>> gcc/config/aarch64/aarch64-sve-builtins.h     |  1 +
>>> .../aarch64/sve2/acle/asm/sra_s32.c           |  9 ++++++
>>> .../aarch64/sve2/acle/asm/sra_s64.c           |  9 ++++++
>>> .../aarch64/sve2/acle/asm/sra_u32.c           |  9 ++++++
>>> .../aarch64/sve2/acle/asm/sra_u64.c           |  9 ++++++
>>> 7 files changed, 83 insertions(+), 11 deletions(-)
>>> 
>>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>> index 6a20a613f83..0990918cc45 100644
>>> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>> @@ -417,6 +417,35 @@ public:
>>> 
>>> class svsra_impl : public function_base
>>> {
>>> +public:
>>> +  gimple *
>>> +  fold (gimple_folder &f) const override
>>> +  {
>>> +    /* Fold to svlsr/svasr if op1 is all zeros.  */
>>> +    tree op1 = gimple_call_arg (f.call, 0);
>>> +    if (!integer_zerop (op1))
>>> +      return NULL;
>>> +    function_instance instance ("svlsr", functions::svlsr,
>>> +                             shapes::binary_uint_opt_n, MODE_n,
>>> +                             f.type_suffix_ids, GROUP_none, PRED_x);
>>> +    if (!f.type_suffix (0).unsigned_p)
>>> +      {
>>> +     instance.base_name = "svasr";
>>> +     instance.base = functions::svasr;
>>> +      }
>>> +    gcall *call = f.redirect_call (instance);
>>> +    unsigned int element_bytes = f.type_suffix (0).element_bytes;
>>> +    /* Add a ptrue as predicate, because unlike svsra, svlsr/svasr are
>>> +       predicated intrinsics.  */
>>> +    gimple_call_set_arg (call, 0, build_ptrue (element_bytes));
>> 
>> Maybe it would be simpler to use build_all_ones_cst (f.gp_type ()).
>> Unlike for fold_to_ptrue (which produces output predicates),
>> we don't need the upper bits of each predicate element to be zero.
>> 
>>> +    /* For svsra, the shift amount (imm3) is uint64_t for all function types,
>>> +       but for svlsr/svasr, imm3 has the same width as the function type.  */
>>> +    tree imm3 = gimple_call_arg (f.call, 2);
>>> +    tree imm3_prec = wide_int_to_tree (scalar_types[f.type_suffix (0).vector_type],
>> 
>> Nit: long line.  The easiest way of avoiding it would be to use
>> f.scalar_type (0) instead.
>> 
>>> +                                    wi::to_wide (imm3, element_bytes));
>> 
>> This works correctly, but FWIW, it's a little simpler to use
>> wi::to_widest (imm3) instead.  No need to change though.
> Thanks for the review. 
> I made the following changes (updated patch below):
> - exchanged the call to build_ptrue in svsra_impl::fold by build_all_ones_cst
> - removed build_ptrue and its declaration; reverted fold_to_ptrue
> - refactored the type conversion of the shift amount to be more compact
> Thanks,
> Jennifer
>
> A common idiom in intrinsics loops is to have accumulator intrinsics
> in an unrolled loop with an accumulator initialized to zero at the beginning.
> Propagating the initial zero accumulator into the first iteration
> of the loop and simplifying the first accumulate instruction is a
> desirable transformation that we should teach GCC.
> Therefore, this patch folds svsra to svlsr/svasr if op1 is all zeros,
> producing the lower latency instructions LSR/ASR instead of USRA/SSRA.
> We implemented this optimization in svsra_impl::fold.
>
> Tests were added to check the produced assembly for use of LSR/ASR.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?

OK, thanks.

Richard

> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>
> gcc/
>       * config/aarch64/aarch64-sve-builtins-sve2.cc
>       (svsra_impl::fold): Fold svsra to svlsr/svasr if op1 is all zeros.
>
> gcc/testsuite/
>       * gcc.target/aarch64/sve2/acle/asm/sra_s32.c: New test.
>       * gcc.target/aarch64/sve2/acle/asm/sra_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/sra_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/sra_u64.c: Likewise.
> ---
>  .../aarch64/aarch64-sve-builtins-sve2.cc      | 28 +++++++++++++++++++
>  .../aarch64/sve2/acle/asm/sra_s32.c           |  9 ++++++
>  .../aarch64/sve2/acle/asm/sra_s64.c           |  9 ++++++
>  .../aarch64/sve2/acle/asm/sra_u32.c           |  9 ++++++
>  .../aarch64/sve2/acle/asm/sra_u64.c           |  9 ++++++
>  5 files changed, 64 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> index 6a20a613f83..ddd6e466ee3 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> @@ -417,6 +417,34 @@ public:
>  
>  class svsra_impl : public function_base
>  {
> +public:
> +  gimple *
> +  fold (gimple_folder &f) const override
> +  {
> +    /* Fold to svlsr/svasr if op1 is all zeros.  */
> +    tree op1 = gimple_call_arg (f.call, 0);
> +    if (!integer_zerop (op1))
> +      return NULL;
> +    function_instance instance ("svlsr", functions::svlsr,
> +                             shapes::binary_uint_opt_n, MODE_n,
> +                             f.type_suffix_ids, GROUP_none, PRED_x);
> +    if (!f.type_suffix (0).unsigned_p)
> +      {
> +     instance.base_name = "svasr";
> +     instance.base = functions::svasr;
> +      }
> +    gcall *call = f.redirect_call (instance);
> +    /* Add a ptrue as predicate, because unlike svsra, svlsr/svasr are
> +       predicated intrinsics.  */
> +    gimple_call_set_arg (call, 0, build_all_ones_cst (f.gp_type ()));
> +    /* For svsra, the shift amount (imm3) is uint64_t for all function types,
> +       but for svlsr/svasr, imm3 has the same width as the function type.  */
> +    tree imm3 = gimple_call_arg (f.call, 2);
> +    tree imm3_prec = wide_int_to_tree (f.scalar_type (0),
> +                                    wi::to_widest (imm3));
> +    gimple_call_set_arg (call, 2, imm3_prec);
> +    return call;
> +  }
>  public:
>    rtx
>    expand (function_expander &e) const override
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c
> index ac992dc7b1c..86cf4bd8137 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c
> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_32_s32_tied2, svint32_t,
>  TEST_UNIFORM_Z (sra_32_s32_untied, svint32_t,
>               z0 = svsra_n_s32 (z1, z2, 32),
>               z0 = svsra (z1, z2, 32))
> +
> +/*
> +** sra_2_s32_zeroop1:
> +**   asr     z0\.s, z1\.s, #2
> +**   ret
> +*/
> +TEST_UNIFORM_Z (sra_2_s32_zeroop1, svint32_t,
> +             z0 = svsra_n_s32 (svdup_s32 (0), z1, 2),
> +             z0 = svsra (svdup_s32 (0), z1, 2))
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c
> index 9ea5657ab88..7b39798ba1d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c
> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_64_s64_tied2, svint64_t,
>  TEST_UNIFORM_Z (sra_64_s64_untied, svint64_t,
>               z0 = svsra_n_s64 (z1, z2, 64),
>               z0 = svsra (z1, z2, 64))
> +
> +/*
> +** sra_2_s64_zeroop1:
> +**   asr     z0\.d, z1\.d, #2
> +**   ret
> +*/
> +TEST_UNIFORM_Z (sra_2_s64_zeroop1, svint64_t,
> +             z0 = svsra_n_s64 (svdup_s64 (0), z1, 2),
> +             z0 = svsra (svdup_s64 (0), z1, 2))
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c
> index 090245153f7..001e09ca78d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c
> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_32_u32_tied2, svuint32_t,
>  TEST_UNIFORM_Z (sra_32_u32_untied, svuint32_t,
>               z0 = svsra_n_u32 (z1, z2, 32),
>               z0 = svsra (z1, z2, 32))
> +
> +/*
> +** sra_2_u32_zeroop1:
> +**   lsr     z0\.s, z1\.s, #2
> +**   ret
> +*/
> +TEST_UNIFORM_Z (sra_2_u32_zeroop1, svuint32_t,
> +             z0 = svsra_n_u32 (svdup_u32 (0), z1, 2),
> +             z0 = svsra (svdup_u32 (0), z1, 2))
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c
> index ff21c368b72..780cf7a7ff6 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c
> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_64_u64_tied2, svuint64_t,
>  TEST_UNIFORM_Z (sra_64_u64_untied, svuint64_t,
>               z0 = svsra_n_u64 (z1, z2, 64),
>               z0 = svsra (z1, z2, 64))
> +
> +/*
> +** sra_2_u64_zeroop1:
> +**   lsr     z0\.d, z1\.d, #2
> +**   ret
> +*/
> +TEST_UNIFORM_Z (sra_2_u64_zeroop1, svuint64_t,
> +             z0 = svsra_n_u64 (svdup_u64 (0), z1, 2),
> +             z0 = svsra (svdup_u64 (0), z1, 2))
