Kyrylo Tkachov <ktkac...@nvidia.com> writes:
> Hi all,
>
> On many cores, including Neoverse V2, the throughput of vector ADD
> instructions is higher than vector shifts like SHL.  We can lean on that
> to emit code like:
>   add     v0.4s, v0.4s, v0.4s
> instead of:
>   shl     v0.4s, v0.4s, 1
>
> LLVM already does this trick.
> In RTL the code gets canonicalised from (plus x x) to (ashift x 1), so I
> opted to instead do this at the final assembly printing stage, similar
> to how we emit CMLT instead of SSHR elsewhere in the backend.
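As a source-level illustration (a sketch using GCC's generic vector
extensions; the function name is made up, and the expected assembly mirrors
what the cover letter and the new test case show):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  double_v4si (v4si x)
  {
    /* Each lane is shifted left by one.  The backend previously printed
       this as "shl v0.4s, v0.4s, 1"; with the new vs1 alternative it can
       be printed as "add v0.4s, v0.4s, v0.4s" instead.  */
    return x << 1;
  }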
>
> I'd like to also do this for SVE shifts, but those will have to be
> separate patches.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> I’ll leave it up for comments for a few days and commit next week if no 
> objections.

Nice.  LGTM FWIW.

(And given the recent discussion about intrinsics: replacing shift
intrinsics with addition seems to fit into the category of changing
the instruction choice in a way that will be beneficial in all realistic
scenarios.)
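For instance, a shift-by-one written with ACLE intrinsics (a sketch; the
function name is invented, and I'm assuming vshlq_n_s32 ends up going
through the same shift-immediate pattern):

  #include <arm_neon.h>

  int32x4_t
  double_elements (int32x4_t x)
  {
    /* Shift each 32-bit lane left by one.  With the patch this can be
       emitted as "add v0.4s, v0.4s, v0.4s" rather than an SHL.  */
    return vshlq_n_s32 (x, 1);
  }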

One day I should go back and remove those redundant "const"s from the vector
immediate tests :)  They're a holdover from an early version of the SVE
support.

Thanks,
Richard

> Thanks,
> Kyrill
>
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64-simd.md
>         (aarch64_simd_imm_shl<mode><vczle><vczbe>): Rewrite to new
>         syntax.  Add =w,w,vs1 alternative.
>         * config/aarch64/constraints.md (vs1): New constraint.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/advsimd_shl_add.c: New test.
>
> From 4d0c48c79d24d27724111906bcfb0b709041fa08 Mon Sep 17 00:00:00 2001
> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> Date: Mon, 5 Aug 2024 11:29:44 -0700
> Subject: [PATCH 2/2] aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for
>  Advanced SIMD
>
> On many cores, including Neoverse V2, the throughput of vector ADD
> instructions is higher than vector shifts like SHL.  We can lean on that
> to emit code like:
>   add     v0.4s, v0.4s, v0.4s
> instead of:
>   shl     v0.4s, v0.4s, 1
>
> LLVM already does this trick.
> In RTL the code gets canonicalised from (plus x x) to (ashift x 1), so I
> opted to instead do this at the final assembly printing stage, similar
> to how we emit CMLT instead of SSHR elsewhere in the backend.
>
> I'd like to also do this for SVE shifts, but those will have to be
> separate patches.
>
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> gcc/ChangeLog:
>
>       * config/aarch64/aarch64-simd.md
>       (aarch64_simd_imm_shl<mode><vczle><vczbe>): Rewrite to new
>       syntax.  Add =w,w,vs1 alternative.
>       * config/aarch64/constraints.md (vs1): New constraint.
>
> gcc/testsuite/ChangeLog:
>
>       * gcc.target/aarch64/advsimd_shl_add.c: New test.
> ---
>  gcc/config/aarch64/aarch64-simd.md            | 12 ++--
>  gcc/config/aarch64/constraints.md             |  6 ++
>  .../gcc.target/aarch64/advsimd_shl_add.c      | 64 +++++++++++++++++++
>  3 files changed, 77 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/advsimd_shl_add.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 816f499e963..dff00b68a06 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1352,12 +1352,14 @@ (define_expand "aarch64_<sra_op>rsra_n<mode>"
>  )
>  
>  (define_insn "aarch64_simd_imm_shl<mode><vczle><vczbe>"
> - [(set (match_operand:VDQ_I 0 "register_operand" "=w")
> -       (ashift:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
> -                (match_operand:VDQ_I  2 "aarch64_simd_lshift_imm" "Dl")))]
> + [(set (match_operand:VDQ_I 0 "register_operand")
> +       (ashift:VDQ_I (match_operand:VDQ_I 1 "register_operand")
> +                (match_operand:VDQ_I  2 "aarch64_simd_lshift_imm")))]
>   "TARGET_SIMD"
> -  "shl\t%0.<Vtype>, %1.<Vtype>, %2"
> -  [(set_attr "type" "neon_shift_imm<q>")]
> +  {@ [ cons: =0, 1,  2   ; attrs: type       ]
> +     [ w       , w,  vs1 ; neon_add<q>       ] add\t%0.<Vtype>, %1.<Vtype>, %1.<Vtype>
> +     [ w       , w,  Dl  ; neon_shift_imm<q> ] shl\t%0.<Vtype>, %1.<Vtype>, %2
> +  }
>  )
>  
>  (define_insn "aarch64_simd_reg_sshl<mode><vczle><vczbe>"
> diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
> index a2878f580d9..f491e4bd6a0 100644
> --- a/gcc/config/aarch64/constraints.md
> +++ b/gcc/config/aarch64/constraints.md
> @@ -667,6 +667,12 @@ (define_constraint "vsm"
>     SMAX and SMIN operations."
>   (match_operand 0 "aarch64_sve_vsm_immediate"))
>  
> +(define_constraint "vs1"
> +  "@internal
> + A constraint that matches a vector of immediate one."
> + (and (match_code "const,const_vector")
> +      (match_test "op == CONST1_RTX (GET_MODE (op))")))
> +
>  (define_constraint "vsA"
>    "@internal
>     A constraint that matches an immediate operand valid for SVE FADD
> diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd_shl_add.c b/gcc/testsuite/gcc.target/aarch64/advsimd_shl_add.c
> new file mode 100644
> index 00000000000..a161f89a3ac
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/advsimd_shl_add.c
> @@ -0,0 +1,64 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "--save-temps -O1" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +typedef __INT64_TYPE__ __attribute__ ((vector_size (16))) v2di;
> +typedef int __attribute__ ((vector_size (16))) v4si;
> +typedef short __attribute__ ((vector_size (16))) v8hi;
> +typedef char __attribute__ ((vector_size (16))) v16qi;
> +typedef short __attribute__ ((vector_size (8))) v4hi;
> +typedef char __attribute__ ((vector_size (8))) v8qi;
> +
> +#define FUNC(S) \
> +S               \
> +foo_##S (S a)   \
> +{ return a << 1; }
> +
> +/*
> +** foo_v2di:
> +**      add  v0.2d, v0.2d, v0.2d
> +**      ret
> +*/
> +
> +FUNC (v2di)
> +
> +/*
> +** foo_v4si:
> +**      add  v0.4s, v0.4s, v0.4s
> +**      ret
> +*/
> +
> +FUNC (v4si)
> +
> +/*
> +** foo_v8hi:
> +**      add  v0.8h, v0.8h, v0.8h
> +**      ret
> +*/
> +
> +FUNC (v8hi)
> +
> +/*
> +** foo_v16qi:
> +**      add  v0.16b, v0.16b, v0.16b
> +**      ret
> +*/
> +
> +FUNC (v16qi)
> +
> +/*
> +** foo_v4hi:
> +**      add  v0.4h, v0.4h, v0.4h
> +**      ret
> +*/
> +
> +FUNC (v4hi)
> +
> +/*
> +** foo_v8qi:
> +**      add  v0.8b, v0.8b, v0.8b
> +**      ret
> +*/
> +
> +FUNC (v8qi)
> +
