> On 24 Oct 2024, at 11:28, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> On 22 Oct 2024, at 18:21, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>> 
>>> External email: Use caution opening links or attachments
>>> 
>>> 
>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>> A common idiom in intrinsics loops is to have accumulator intrinsics
>>>> in an unrolled loop with an accumulator initialized to zero at the beginning.
>>>> Propagating the initial zero accumulator into the first iteration
>>>> of the loop and simplifying the first accumulate instruction is a
>>>> desirable transformation that we should teach GCC.
>>>> Therefore, this patch folds svsra to svlsr/svasr if op1 is all zeros,
>>>> producing the lower-latency instructions LSR/ASR instead of USRA/SSRA.
>>>> We implemented this optimization in svsra_impl::fold.
>>>> Because svlsr/svasr are predicated intrinsics, we added a ptrue
>>>> predicate.  Additionally, the width of the shift amount (imm3) was
>>>> adjusted to fit the function type.
>>>> In order to create the ptrue predicate, a new helper function
>>>> build_ptrue was added.  We also refactored gimple_folder::fold_to_ptrue
>>>> to use the new helper function.
>>>> 
>>>> Tests were added to check the produced assembly for use of LSR/ASR.
>>>> 
>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>>>> OK for mainline?
>>>> 
>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>> 
>>>> gcc/
>>>> 	* config/aarch64/aarch64-sve-builtins-sve2.cc
>>>> 	(svsra_impl::fold): Fold svsra to svlsr/svasr if op1 is all zeros.
>>>> 	* config/aarch64/aarch64-sve-builtins.cc (build_ptrue): New
>>>> 	function that returns a ptrue tree.
>>>> 	(gimple_folder::fold_to_ptrue): Refactor to use build_ptrue.
>>>> 	* config/aarch64/aarch64-sve-builtins.h: Declare build_ptrue.
>>>> 
>>>> gcc/testsuite/
>>>> 	* gcc.target/aarch64/sve2/acle/asm/sra_s32.c: New test.
>>>> 	* gcc.target/aarch64/sve2/acle/asm/sra_s64.c: Likewise.
>>>> 	* gcc.target/aarch64/sve2/acle/asm/sra_u32.c: Likewise.
>>>> 	* gcc.target/aarch64/sve2/acle/asm/sra_u64.c: Likewise.
>>>> ---
>>>>  .../aarch64/aarch64-sve-builtins-sve2.cc   | 29 +++++++++++++++++++
>>>>  gcc/config/aarch64/aarch64-sve-builtins.cc | 28 +++++++++++-------
>>>>  gcc/config/aarch64/aarch64-sve-builtins.h  |  1 +
>>>>  .../aarch64/sve2/acle/asm/sra_s32.c        |  9 ++++++
>>>>  .../aarch64/sve2/acle/asm/sra_s64.c        |  9 ++++++
>>>>  .../aarch64/sve2/acle/asm/sra_u32.c        |  9 ++++++
>>>>  .../aarch64/sve2/acle/asm/sra_u64.c        |  9 ++++++
>>>>  7 files changed, 83 insertions(+), 11 deletions(-)
>>>> 
>>>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>>> index 6a20a613f83..0990918cc45 100644
>>>> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>>> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>>> @@ -417,6 +417,35 @@ public:
>>>> 
>>>>  class svsra_impl : public function_base
>>>>  {
>>>> +public:
>>>> +  gimple *
>>>> +  fold (gimple_folder &f) const override
>>>> +  {
>>>> +    /* Fold to svlsr/svasr if op1 is all zeros.  */
>>>> +    tree op1 = gimple_call_arg (f.call, 0);
>>>> +    if (!integer_zerop (op1))
>>>> +      return NULL;
>>>> +    function_instance instance ("svlsr", functions::svlsr,
>>>> +				shapes::binary_uint_opt_n, MODE_n,
>>>> +				f.type_suffix_ids, GROUP_none, PRED_x);
>>>> +    if (!f.type_suffix (0).unsigned_p)
>>>> +      {
>>>> +	instance.base_name = "svasr";
>>>> +	instance.base = functions::svasr;
>>>> +      }
>>>> +    gcall *call = f.redirect_call (instance);
>>>> +    unsigned int element_bytes = f.type_suffix (0).element_bytes;
>>>> +    /* Add a ptrue as predicate, because unlike svsra, svlsr/svasr are
>>>> +       predicated intrinsics.  */
>>>> +    gimple_call_set_arg (call, 0, build_ptrue (element_bytes));
>>> 
>>> Maybe it would be simpler to use build_all_ones_cst (f.gp_type ()).
>>> Unlike for fold_to_ptrue (which produces output predicates),
>>> we don't need the upper bits of each predicate element to be zero.
>>> 
>>>> +    /* For svsra, the shift amount (imm3) is uint64_t for all function types,
>>>> +       but for svlsr/svasr, imm3 has the same width as the function type.  */
>>>> +    tree imm3 = gimple_call_arg (f.call, 2);
>>>> +    tree imm3_prec = wide_int_to_tree (scalar_types[f.type_suffix (0).vector_type],
>>> 
>>> Nit: long line.  The easiest way of avoiding it would be to use
>>> f.scalar_type (0) instead.
>>> 
>>>> +				       wi::to_wide (imm3, element_bytes));
>>> 
>>> This works correctly, but FWIW, it's a little simpler to use
>>> wi::to_widest (imm3) instead.  No need to change though.
>> Thanks for the review.
>> I made the following changes (updated patch below):
>> - replaced the call to build_ptrue in svsra_impl::fold with build_all_ones_cst
>> - removed build_ptrue and its declaration; reverted fold_to_ptrue
>> - refactored the type conversion of the shift amount to be more compact
>> Thanks,
>> Jennifer
>> 
>> A common idiom in intrinsics loops is to have accumulator intrinsics
>> in an unrolled loop with an accumulator initialized to zero at the beginning.
>> Propagating the initial zero accumulator into the first iteration
>> of the loop and simplifying the first accumulate instruction is a
>> desirable transformation that we should teach GCC.
>> Therefore, this patch folds svsra to svlsr/svasr if op1 is all zeros,
>> producing the lower-latency instructions LSR/ASR instead of USRA/SSRA.
>> We implemented this optimization in svsra_impl::fold.
>> 
>> Tests were added to check the produced assembly for use of LSR/ASR.
>> 
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> OK for mainline?
> 
> OK, thanks.
Thanks, committed to trunk: f6fbc0d2422ce9bea6a23226f4a13a76ffd1784b
Jennifer
> 
> Richard
> 
>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>> 
>> gcc/
>> 	* config/aarch64/aarch64-sve-builtins-sve2.cc
>> 	(svsra_impl::fold): Fold svsra to svlsr/svasr if op1 is all zeros.
>> 
>> gcc/testsuite/
>> 	* gcc.target/aarch64/sve2/acle/asm/sra_s32.c: New test.
>> 	* gcc.target/aarch64/sve2/acle/asm/sra_s64.c: Likewise.
>> 	* gcc.target/aarch64/sve2/acle/asm/sra_u32.c: Likewise.
>> 	* gcc.target/aarch64/sve2/acle/asm/sra_u64.c: Likewise.
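(To illustrate the accumulator idiom the commit message describes, here is a
minimal C sketch; the function name, unroll factor, and shift amount are
invented for illustration and are not part of the patch.  Built with an
SVE2-enabled -march, the first svsra below sees a known-zero accumulator
after constant propagation and can be folded to a plain shift:

#include <arm_sve.h>

/* Hypothetical fully-unrolled accumulation.  ACC starts as zero, so the
   first svsra accumulates into an all-zero vector; with this fold it
   becomes a plain LSR (ASR for signed elements) instead of USRA.  */
svuint32_t
accumulate4 (svuint32_t a, svuint32_t b, svuint32_t c, svuint32_t d)
{
  svuint32_t acc = svdup_u32 (0);
  acc = svsra (acc, a, 2);   /* acc is known to be zero here: folds to svlsr.  */
  acc = svsra (acc, b, 2);
  acc = svsra (acc, c, 2);
  acc = svsra (acc, d, 2);
  return acc;
}

The remaining svsra calls keep their USRA form; only the one fed by the
constant-zero accumulator is simplified.)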
>> ---
>>  .../aarch64/aarch64-sve-builtins-sve2.cc  | 28 +++++++++++++++++++
>>  .../aarch64/sve2/acle/asm/sra_s32.c       |  9 ++++++
>>  .../aarch64/sve2/acle/asm/sra_s64.c       |  9 ++++++
>>  .../aarch64/sve2/acle/asm/sra_u32.c       |  9 ++++++
>>  .../aarch64/sve2/acle/asm/sra_u64.c       |  9 ++++++
>>  5 files changed, 64 insertions(+)
>> 
>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>> index 6a20a613f83..ddd6e466ee3 100644
>> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>> @@ -417,6 +417,34 @@ public:
>> 
>>  class svsra_impl : public function_base
>>  {
>> +public:
>> +  gimple *
>> +  fold (gimple_folder &f) const override
>> +  {
>> +    /* Fold to svlsr/svasr if op1 is all zeros.  */
>> +    tree op1 = gimple_call_arg (f.call, 0);
>> +    if (!integer_zerop (op1))
>> +      return NULL;
>> +    function_instance instance ("svlsr", functions::svlsr,
>> +				shapes::binary_uint_opt_n, MODE_n,
>> +				f.type_suffix_ids, GROUP_none, PRED_x);
>> +    if (!f.type_suffix (0).unsigned_p)
>> +      {
>> +	instance.base_name = "svasr";
>> +	instance.base = functions::svasr;
>> +      }
>> +    gcall *call = f.redirect_call (instance);
>> +    /* Add a ptrue as predicate, because unlike svsra, svlsr/svasr are
>> +       predicated intrinsics.  */
>> +    gimple_call_set_arg (call, 0, build_all_ones_cst (f.gp_type ()));
>> +    /* For svsra, the shift amount (imm3) is uint64_t for all function types,
>> +       but for svlsr/svasr, imm3 has the same width as the function type.  */
>> +    tree imm3 = gimple_call_arg (f.call, 2);
>> +    tree imm3_prec = wide_int_to_tree (f.scalar_type (0),
>> +				       wi::to_widest (imm3));
>> +    gimple_call_set_arg (call, 2, imm3_prec);
>> +    return call;
>> +  }
>>  public:
>>    rtx
>>    expand (function_expander &e) const override
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c
>> index ac992dc7b1c..86cf4bd8137 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s32.c
>> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_32_s32_tied2, svint32_t,
>>  TEST_UNIFORM_Z (sra_32_s32_untied, svint32_t,
>>  		z0 = svsra_n_s32 (z1, z2, 32),
>>  		z0 = svsra (z1, z2, 32))
>> +
>> +/*
>> +** sra_2_s32_zeroop1:
>> +**	asr	z0\.s, z1\.s, #2
>> +**	ret
>> +*/
>> +TEST_UNIFORM_Z (sra_2_s32_zeroop1, svint32_t,
>> +		z0 = svsra_n_s32 (svdup_s32 (0), z1, 2),
>> +		z0 = svsra (svdup_s32 (0), z1, 2))
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c
>> index 9ea5657ab88..7b39798ba1d 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_s64.c
>> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_64_s64_tied2, svint64_t,
>>  TEST_UNIFORM_Z (sra_64_s64_untied, svint64_t,
>>  		z0 = svsra_n_s64 (z1, z2, 64),
>>  		z0 = svsra (z1, z2, 64))
>> +
>> +/*
>> +** sra_2_s64_zeroop1:
>> +**	asr	z0\.d, z1\.d, #2
>> +**	ret
>> +*/
>> +TEST_UNIFORM_Z (sra_2_s64_zeroop1, svint64_t,
>> +		z0 = svsra_n_s64 (svdup_s64 (0), z1, 2),
>> +		z0 = svsra (svdup_s64 (0), z1, 2))
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c
>> index 090245153f7..001e09ca78d 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u32.c
>> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_32_u32_tied2, svuint32_t,
>>  TEST_UNIFORM_Z (sra_32_u32_untied, svuint32_t,
>>  		z0 = svsra_n_u32 (z1, z2, 32),
>>  		z0 = svsra (z1, z2, 32))
>> +
>> +/*
>> +** sra_2_u32_zeroop1:
>> +**	lsr	z0\.s, z1\.s, #2
>> +**	ret
>> +*/
>> +TEST_UNIFORM_Z (sra_2_u32_zeroop1, svuint32_t,
>> +		z0 = svsra_n_u32 (svdup_u32 (0), z1, 2),
>> +		z0 = svsra (svdup_u32 (0), z1, 2))
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c
>> index ff21c368b72..780cf7a7ff6 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sra_u64.c
>> @@ -91,3 +91,12 @@ TEST_UNIFORM_Z (sra_64_u64_tied2, svuint64_t,
>>  TEST_UNIFORM_Z (sra_64_u64_untied, svuint64_t,
>>  		z0 = svsra_n_u64 (z1, z2, 64),
>>  		z0 = svsra (z1, z2, 64))
>> +
>> +/*
>> +** sra_2_u64_zeroop1:
>> +**	lsr	z0\.d, z1\.d, #2
>> +**	ret
>> +*/
>> +TEST_UNIFORM_Z (sra_2_u64_zeroop1, svuint64_t,
>> +		z0 = svsra_n_u64 (svdup_u64 (0), z1, 2),
>> +		z0 = svsra (svdup_u64 (0), z1, 2))
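(At the source level, the transformation the new *_zeroop1 tests exercise can
be sketched as below.  The actual fold happens on gimple inside
svsra_impl::fold; the "after" function is only a hand-written approximation of
the folded form, using the ptrue-predicated _x shift that the fold redirects to:

#include <arm_sve.h>

/* Before the fold: accumulate into a zero accumulator, which would
   otherwise need the zero to be materialized before a USRA.  */
svuint32_t
before (svuint32_t z1)
{
  return svsra (svdup_u32 (0), z1, 2);
}

/* Roughly what the fold produces: a ptrue-predicated logical shift right,
   which assembles to a single LSR.  For signed element types the fold uses
   svasr, giving ASR.  */
svuint32_t
after (svuint32_t z1)
{
  return svlsr_x (svptrue_b32 (), z1, 2);
}
)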