On Fri, 11 Oct 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Friday, October 11, 2024 7:52 AM
> > To: Richard Sandiford <richard.sandif...@arm.com>
> > Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Richard
> > Earnshaw <richard.earns...@arm.com>; Kyrylo Tkachov
> > <ktkac...@nvidia.com>; Tamar Christina <tamar.christ...@arm.com>
> > Subject: Re: [PATCH][PR113816] AArch64: Use SIMD+GPR for logical vector
> > reductions
> > 
> > On Thu, 10 Oct 2024, Richard Sandiford wrote:
> > 
> > > Jennifer Schmitz <jschm...@nvidia.com> writes:
> > > > This patch implements the optabs reduc_and_scal_<mode>,
> > > > reduc_ior_scal_<mode>, and reduc_xor_scal_<mode> for ASIMD modes V8QI,
> > > > V16QI, V4HI, and V8HI for TARGET_SIMD to improve codegen for bitwise
> > > > logical vector reduction operations.
> > > > Previously, either only vector registers or only general purpose
> > > > registers (GPRs) were used.  Now, vector registers are used for the
> > > > reduction from 128 to 64 bits; 64-bit GPRs are used for the reduction
> > > > from 64 to 32 bits; and 32-bit GPRs are used for the rest of the
> > > > reduction steps.
> > > >
> > > > For example, the test case (V8HI)
> > > > int16_t foo (int16_t *a)
> > > > {
> > > >   int16_t b = -1;
> > > >   for (int i = 0; i < 8; ++i)
> > > >     b &= a[i];
> > > >   return b;
> > > > }
> > > >
> > > > was previously compiled to (-O2):
> > > > foo:
> > > >         ldr     q0, [x0]
> > > >         movi    v30.4s, 0
> > > >         ext     v29.16b, v0.16b, v30.16b, #8
> > > >         and     v29.16b, v29.16b, v0.16b
> > > >         ext     v31.16b, v29.16b, v30.16b, #4
> > > >         and     v31.16b, v31.16b, v29.16b
> > > >         ext     v30.16b, v31.16b, v30.16b, #2
> > > >         and     v30.16b, v30.16b, v31.16b
> > > >         umov    w0, v30.h[0]
> > > >         ret
> > > >
> > > > With the patch, it is compiled to:
> > > > foo:
> > > >         ldr     q31, [x0]
> > > >         ext     v30.16b, v31.16b, v31.16b, #8
> > > >         and     v31.8b, v30.8b, v31.8b
> > > >         fmov    x0, d31
> > > >         and     x0, x0, x0, lsr 32
> > > >         and     w0, w0, w0, lsr 16
> > > >         ret
> > > >
> > > > For modes V4SI and V2DI, the pattern was not implemented, because the
> > > > current codegen (using only base instructions) is already efficient.
> > > >
> > > > Note that the PR initially suggested using SVE reduction ops.  However,
> > > > they have higher latency than the proposed sequence, which is why using
> > > > Neon and base instructions is preferable.
> > > >
> > > > Test cases were added for 8/16-bit integers for all implemented modes
> > > > and all three operations to check the produced assembly.
> > > >
> > > > We also added [istarget aarch64*-*-*] to the selector vect_logical_reduc,
> > > > because for aarch64 vector types, either the logical reduction optabs are
> > > > implemented or the codegen for reduction operations is good as it is.
> > > > This was motivated by failure of a scan-tree-dump directive in the test
> > > > cases gcc.dg/vect/vect-reduc-or_1.c and gcc.dg/vect/vect-reduc-or_2.c.
> > > >
> > > > The patch was bootstrapped and regtested on aarch64-linux-gnu, no
> > > > regression.
> > > > OK for mainline?
> > > >
> > > > Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
> > > >
> > > > gcc/
> > > >         PR target/113816
> > > >         * config/aarch64/aarch64-simd.md (reduc_<optab>_scal_<mode>):
> > > >         Implement for logical bitwise operations for VDQV_E.
> > > >
> > > > gcc/testsuite/
> > > >         PR target/113816
> > > >         * lib/target-supports.exp (vect_logical_reduc): Add aarch64*.
> > > >         * gcc.target/aarch64/simd/logical_reduc.c: New test.
> > > >         * gcc.target/aarch64/vect-reduc-or_1.c: Adjust expected outcome.
> > > > ---
> > > >  gcc/config/aarch64/aarch64-simd.md            |  55 +++++
> > > >  .../gcc.target/aarch64/simd/logical_reduc.c   | 208 ++++++++++++++++++
> > > >  .../gcc.target/aarch64/vect-reduc-or_1.c      |   2 +-
> > > >  gcc/testsuite/lib/target-supports.exp         |   4 +-
> > > >  4 files changed, 267 insertions(+), 2 deletions(-)
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/logical_reduc.c
> > > >
> > > > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> > > > index 23c03a96371..00286b8b020 100644
> > > > --- a/gcc/config/aarch64/aarch64-simd.md
> > > > +++ b/gcc/config/aarch64/aarch64-simd.md
> > > > @@ -3608,6 +3608,61 @@
> > > >    }
> > > >  )
> > > >
> > > > +;; Emit a sequence for bitwise logical reductions over vectors for V8QI,
> > > > +;; V16QI, V4HI, and V8HI modes.  The reduction is achieved by iteratively
> > > > +;; operating on the two halves of the input.
> > > > +;; If the input has 128 bits, the first operation is performed in vector
> > > > +;; registers.  From 64 bits down, the reduction steps are performed in
> > > > +;; general purpose registers.
> > > > +;; For example, for V8HI and operation AND, the intended sequence is:
> > > > +;; EXT      v1.16b, v0.16b, v0.16b, #8
> > > > +;; AND      v0.8b, v1.8b, v0.8b
> > > > +;; FMOV     x0, d0
> > > > +;; AND      x0, x0, x0, lsr 32
> > > > +;; AND      w0, w0, w0, lsr 16
> > > > +;;
> > > > +;; For V8QI and operation AND, the sequence is:
> > > > +;; AND      x0, x0, x0, lsr 32
> > > > +;; AND      w0, w0, w0, lsr 16
> > > > +;; AND      w0, w0, w0, lsr 8
> > > > +
> > > > +(define_expand "reduc_<optab>_scal_<mode>"
> > > > + [(match_operand:<VEL> 0 "register_operand")
> > > > +  (LOGICAL:VDQV_E (match_operand:VDQV_E 1 "register_operand"))]
> > > > +  "TARGET_SIMD"
> > > > +  {
> > > > +    rtx dst = operands[1];
> > > > +    rtx tdi = gen_reg_rtx (DImode);
> > > > +    rtx tsi = lowpart_subreg (SImode, tdi, DImode);
> > > > +    rtx op1_lo;
> > > > +    if (known_eq (GET_MODE_SIZE (<MODE>mode), 16))
> > > > +      {
> > > > +       rtx t0 = gen_reg_rtx (<MODE>mode);
> > > > +       rtx t1 = gen_reg_rtx (DImode);
> > > > +       rtx t2 = gen_reg_rtx (DImode);
> > > > +       rtx idx = GEN_INT (8 / GET_MODE_UNIT_SIZE (<MODE>mode));
> > > > +       emit_insn (gen_aarch64_ext<mode> (t0, dst, dst, idx));
> > > > +       op1_lo = lowpart_subreg (V2DImode, dst, <MODE>mode);
> > > > +       rtx t0_lo = lowpart_subreg (V2DImode, t0, <MODE>mode);
> > > > +       emit_insn (gen_aarch64_get_lanev2di (t1, op1_lo, GEN_INT (0)));
> > > > +       emit_insn (gen_aarch64_get_lanev2di (t2, t0_lo, GEN_INT (0)));
> > > > +       emit_insn (gen_<optab>di3 (t1, t1, t2));
> > > > +       emit_move_insn (tdi, t1);
> > > > +      }
> > > > +    else
> > > > +      {
> > > > +       op1_lo = lowpart_subreg (DImode, dst, <MODE>mode);
> > > > +       emit_move_insn (tdi, op1_lo);
> > > > +      }
> > > > +    emit_insn (gen_<optab>_lshrdi3 (tdi, tdi, GEN_INT (32), tdi));
> > > > +    emit_insn (gen_<optab>_lshrsi3 (tsi, tsi, GEN_INT (16), tsi));
> > > > +    if (known_eq (GET_MODE_UNIT_BITSIZE (<MODE>mode), 8))
> > > > +      emit_insn (gen_<optab>_lshrsi3 (tsi, tsi, GEN_INT (8), tsi));
> > > > +    emit_move_insn (operands[0], lowpart_subreg (<VEL>mode, tsi, SImode));
> > > > +    DONE;
> > > > +  }
> > > > +)
> > >
> > > Sorry to be awkward, but I've got mixed feelings about this.
> > > The sequence produced looks good.  But we're not defining this
> > > optab because the target has a special instruction for implementing
> > > the operation.  We're just implementing it to tweak the loop emitted
> > > by vect_create_epilog_for_reduction.  Maybe we should change that loop
> > > instead, perhaps with a target hook?
> > 
> > So does it currently not work because there are no v8qi, v4qi or v2qi
> > vector modes?  I'll note there already _is_
> > targetm.vectorize.split_reduction (mode), but it expects a vector
> > mode as result and it doesn't get the operation in.
> > 
> 
> The point of the reduction is that it shouldn't be done on the vector side.
> So even if we have v4qi and v2qi, the reduction shouldn't be done there, as
> the latency is higher and the throughput lower for these sub-64-bit
> operations on the vector side, so they should be done on the integer side,
> i.e. on DI and SI.
> 
> This is why I disagree with Richard that this can be done generically for all
> targets.  The decomposition takes advantage of Arm's fused shifted logical
> operations on GPRs, which makes them really cheap to do on the integer side.
> 
> Richard brought up AArch32 as another example.  I don't agree there either,
> because AArch32's register file overlap means you don't need shifts.
> The first reduction step can just be done by accessing the registers
> as even/odd pairs in a smaller mode.
> 
> To get this codegen generically, the targets must support modes that are
> tie-able between the vector and the smaller vector size, i.e. a BIT_FIELD_REF
> from the vector mode to a smaller mode must be cheap or free.
> 
> The decision of when to do the transfer from SIMD to GPR is highly
> target specific, and so is the how.
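> 
> As a rough illustration of the intended shape, here is a hedged C-level
> sketch (assuming little-endian lane order; the function and variable names
> are illustrative, the memcpy of the two halves stands in for the cheap
> subreg/BIT_FIELD_REF access mentioned above, and the shift-plus-AND steps
> map onto the fused shifted logical instructions in the patch's output):
> 
> #include <stdint.h>
> 
> typedef int16_t v8hi __attribute__ ((vector_size (16)));
> 
> int16_t and_reduce_v8hi (v8hi x)
> {
>   uint64_t lo, hi, d;
>   uint32_t w;
>   __builtin_memcpy (&lo, &x, 8);               /* low 64 bits of the vector  */
>   __builtin_memcpy (&hi, (char *) &x + 8, 8);  /* high 64 bits of the vector */
>   d = lo & hi;          /* 128 -> 64: done on the vector side (EXT + AND) */
>   d &= d >> 32;         /* and x0, x0, x0, lsr 32 */
>   w = (uint32_t) d;
>   w &= w >> 16;         /* and w0, w0, w0, lsr 16 */
>   return (int16_t) w;   /* the low 16 bits hold the reduced value */
> }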
> 
> > I would expect that we can improve on this existing machinery
> > instead of inventing a new one.
> > 
> 
> But we're not.  We're just wiring the optab up to the machinery
> that already exists.
> 
> I think we're just over-engineering a new contributor's patch.  Even if
> we think there is some merit in doing it generically, I don't see why we
> shouldn't take this patch today.

Sure.  I was just saying there is already a target-driven piece in
the generic machinery that looks like a reasonable fit if we enhance it,
so there is no need to invent something completely new here.

Richard.

> Thanks,
> Tamar
> 
> > I'll note that providing reduc_scal optabs has the side-effect of
> > enabling BB reduction vectorization since that doesn't use any
> > open-coding (yet - it was indeed supposed to be an indication of
> > whether such a reduction would be reasonably cheap).
> > 
> > Richard.
> > 
> > > Also, this seems related to the fact that:
> > >
> > > typedef unsigned char v8qi __attribute__((vector_size(8)));
> > > unsigned char foo(v8qi x) {
> > >   x &= __builtin_shufflevector (x, x, 1, 2, 3, 4, 5, 6, 7, 0);
> > >   return x[0];
> > > }
> > >
> > > generates:
> > >
> > >         ext     v31.8b, v0.8b, v0.8b, #1
> > >         and     v31.8b, v31.8b, v0.8b
> > >         umov    x0, v31.d[0]
> > >         ret
> > >
> > > instead of something like:
> > >
> > >         umov    w0, v31.h[0]
> > >         and     w0, w0, w0, lsr #8
> > >         ret
> > >
> > > So an alternative might be to do some post-vectorisation simplification,
> > > perhaps in isel?
> > >
> > > cc:ing Richi for his thoughts.
> > >
> > > Thanks,
> > > Richard
> > >
> > > > +
> > > >  (define_insn "aarch64_reduc_<optab>_internal<mode>"
> > > >   [(set (match_operand:VDQV_S 0 "register_operand" "=w")
> > > >         (unspec:VDQV_S [(match_operand:VDQV_S 1 "register_operand" "w")]
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/simd/logical_reduc.c b/gcc/testsuite/gcc.target/aarch64/simd/logical_reduc.c
> > > > new file mode 100644
> > > > index 00000000000..9508288b218
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/simd/logical_reduc.c
> > > > @@ -0,0 +1,208 @@
> > > > +/* { dg-options "-O2 -ftree-vectorize" } */
> > > > +/* { dg-final { check-function-bodies "**" "" } } */
> > > > +
> > > > +#include <stdint.h>
> > > > +
> > > > +/*
> > > > +** fv16qi_and:
> > > > +**     ldr     q([0-9]+), \[x0\]
> > > > +**     ext     v([0-9]+)\.16b, v\1\.16b, v\1\.16b, #8
> > > > +**     and     v\1\.8b, v\2\.8b, v\1\.8b
> > > > +**     fmov    x0, d\1
> > > > +**     and     x0, x0, x0, lsr 32
> > > > +**     and     w0, w0, w0, lsr 16
> > > > +**     and     w0, w0, w0, lsr 8
> > > > +**     ret
> > > > +*/
> > > > +int8_t fv16qi_and (int8_t *a)
> > > > +{
> > > > +  int8_t b = -1;
> > > > +  for (int i = 0; i < 16; ++i)
> > > > +    b &= a[i];
> > > > +  return b;
> > > > +}
> > > > +
> > > > +/*
> > > > +** fv8hi_and:
> > > > +**     ldr     q([0-9]+), \[x0\]
> > > > +**     ext     v([0-9]+)\.16b, v\1\.16b, v\1\.16b, #8
> > > > +**     and     v\1\.8b, v\2\.8b, v\1\.8b
> > > > +**     fmov    x0, d\1
> > > > +**     and     x0, x0, x0, lsr 32
> > > > +**     and     w0, w0, w0, lsr 16
> > > > +**     ret
> > > > +*/
> > > > +int16_t fv8hi_and (int16_t *a)
> > > > +{
> > > > +  int16_t b = -1;
> > > > +  for (int i = 0; i < 8; ++i)
> > > > +    b &= a[i];
> > > > +  return b;
> > > > +}
> > > > +
> > > > +/*
> > > > +** fv16qi_or:
> > > > +**     ldr     q([0-9]+), \[x0\]
> > > > +**     ext     v([0-9]+)\.16b, v\1\.16b, v\1\.16b, #8
> > > > +**     orr     v\1\.8b, v\2\.8b, v\1\.8b
> > > > +**     fmov    x0, d\1
> > > > +**     orr     x0, x0, x0, lsr 32
> > > > +**     orr     w0, w0, w0, lsr 16
> > > > +**     orr     w0, w0, w0, lsr 8
> > > > +**     ret
> > > > +*/
> > > > +int8_t fv16qi_or (int8_t *a)
> > > > +{
> > > > +  int8_t b = 0;
> > > > +  for (int i = 0; i < 16; ++i)
> > > > +    b |= a[i];
> > > > +  return b;
> > > > +}
> > > > +
> > > > +/*
> > > > +** fv8hi_or:
> > > > +**     ldr     q([0-9]+), \[x0\]
> > > > +**     ext     v([0-9]+)\.16b, v\1\.16b, v\1\.16b, #8
> > > > +**     orr     v\1\.8b, v\2\.8b, v\1\.8b
> > > > +**     fmov    x0, d\1
> > > > +**     orr     x0, x0, x0, lsr 32
> > > > +**     orr     w0, w0, w0, lsr 16
> > > > +**     ret
> > > > +*/
> > > > +int16_t fv8hi_or (int16_t *a)
> > > > +{
> > > > +  int16_t b = 0;
> > > > +  for (int i = 0; i < 8; ++i)
> > > > +    b |= a[i];
> > > > +  return b;
> > > > +}
> > > > +
> > > > +/*
> > > > +** fv16qi_xor:
> > > > +**     ldr     q([0-9]+), \[x0\]
> > > > +**     ext     v([0-9]+)\.16b, v\1\.16b, v\1\.16b, #8
> > > > +**     eor     v\1\.8b, v\2\.8b, v\1\.8b
> > > > +**     fmov    x0, d\1
> > > > +**     eor     x0, x0, x0, lsr 32
> > > > +**     eor     w0, w0, w0, lsr 16
> > > > +**     eor     w0, w0, w0, lsr 8
> > > > +**     ret
> > > > +*/
> > > > +int8_t fv16qi_xor (int8_t *a)
> > > > +{
> > > > +  int8_t b = 0;
> > > > +  for (int i = 0; i < 16; ++i)
> > > > +    b ^= a[i];
> > > > +  return b;
> > > > +}
> > > > +
> > > > +/*
> > > > +** fv8hi_xor:
> > > > +**     ldr     q([0-9]+), \[x0\]
> > > > +**     ext     v([0-9]+)\.16b, v\1\.16b, v\1\.16b, #8
> > > > +**     eor     v\1\.8b, v\2\.8b, v\1\.8b
> > > > +**     fmov    x0, d\1
> > > > +**     eor     x0, x0, x0, lsr 32
> > > > +**     eor     w0, w0, w0, lsr 16
> > > > +**     ret
> > > > +*/
> > > > +int16_t fv8hi_xor (int16_t *a)
> > > > +{
> > > > +  int16_t b = 0;
> > > > +  for (int i = 0; i < 8; ++i)
> > > > +    b ^= a[i];
> > > > +  return b;
> > > > +}
> > > > +
> > > > +/*
> > > > +** fv8qi_and:
> > > > +**     ldr     x0, \[x0\]
> > > > +**     and     x0, x0, x0, lsr 32
> > > > +**     and     w0, w0, w0, lsr 16
> > > > +**     and     w0, w0, w0, lsr 8
> > > > +**     ret
> > > > +*/
> > > > +int8_t fv8qi_and (int8_t *a)
> > > > +{
> > > > +  int8_t b = -1;
> > > > +  for (int i = 0; i < 8; ++i)
> > > > +    b &= a[i];
> > > > +  return b;
> > > > +}
> > > > +
> > > > +/*
> > > > +** fv4hi_and:
> > > > +**     ldr     x0, \[x0\]
> > > > +**     and     x0, x0, x0, lsr 32
> > > > +**     and     w0, w0, w0, lsr 16
> > > > +**     ret
> > > > +*/
> > > > +int16_t fv4hi_and (int16_t *a)
> > > > +{
> > > > +  int16_t b = -1;
> > > > +  for (int i = 0; i < 4; ++i)
> > > > +    b &= a[i];
> > > > +  return b;
> > > > +}
> > > > +
> > > > +/*
> > > > +** fv8qi_or:
> > > > +**     ldr     x0, \[x0\]
> > > > +**     orr     x0, x0, x0, lsr 32
> > > > +**     orr     w0, w0, w0, lsr 16
> > > > +**     orr     w0, w0, w0, lsr 8
> > > > +**     ret
> > > > +*/
> > > > +int8_t fv8qi_or (int8_t *a)
> > > > +{
> > > > +  int8_t b = 0;
> > > > +  for (int i = 0; i < 8; ++i)
> > > > +    b |= a[i];
> > > > +  return b;
> > > > +}
> > > > +
> > > > +/*
> > > > +** fv4hi_or:
> > > > +**     ldr     x0, \[x0\]
> > > > +**     orr     x0, x0, x0, lsr 32
> > > > +**     orr     w0, w0, w0, lsr 16
> > > > +**     ret
> > > > +*/
> > > > +int16_t fv4hi_or (int16_t *a)
> > > > +{
> > > > +  int16_t b = 0;
> > > > +  for (int i = 0; i < 4; ++i)
> > > > +    b |= a[i];
> > > > +  return b;
> > > > +}
> > > > +
> > > > +/*
> > > > +** fv8qi_xor:
> > > > +**     ldr     x0, \[x0\]
> > > > +**     eor     x0, x0, x0, lsr 32
> > > > +**     eor     w0, w0, w0, lsr 16
> > > > +**     eor     w0, w0, w0, lsr 8
> > > > +**     ret
> > > > +*/
> > > > +int8_t fv8qi_xor (int8_t *a)
> > > > +{
> > > > +  int8_t b = 0;
> > > > +  for (int i = 0; i < 8; ++i)
> > > > +    b ^= a[i];
> > > > +  return b;
> > > > +}
> > > > +
> > > > +/*
> > > > +** fv4hi_xor:
> > > > +**     ldr     x0, \[x0\]
> > > > +**     eor     x0, x0, x0, lsr 32
> > > > +**     eor     w0, w0, w0, lsr 16
> > > > +**     ret
> > > > +*/
> > > > +int16_t fv4hi_xor (int16_t *a)
> > > > +{
> > > > +  int16_t b = 0;
> > > > +  for (int i = 0; i < 4; ++i)
> > > > +    b ^= a[i];
> > > > +  return b;
> > > > +}
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-reduc-or_1.c b/gcc/testsuite/gcc.target/aarch64/vect-reduc-or_1.c
> > > > index 918822a7d00..70c4ca18094 100644
> > > > --- a/gcc/testsuite/gcc.target/aarch64/vect-reduc-or_1.c
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-reduc-or_1.c
> > > > @@ -32,4 +32,4 @@ main (unsigned char argc, char **argv)
> > > >    return 0;
> > > >  }
> > > >
> > > > -/* { dg-final { scan-tree-dump "Reduce using vector shifts" "vect" } } */
> > > > +/* { dg-final { scan-tree-dump "Reduce using direct vector reduction" "vect" } } */
> > > > diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
> > > > index 8f2afe866c7..44f737f15d0 100644
> > > > --- a/gcc/testsuite/lib/target-supports.exp
> > > > +++ b/gcc/testsuite/lib/target-supports.exp
> > > > @@ -9564,7 +9564,9 @@ proc check_effective_target_vect_logical_reduc { } {
> > > >                    || [istarget amdgcn-*-*]
> > > >                    || [check_effective_target_riscv_v]
> > > >                    || [check_effective_target_loongarch_sx]
> > > > -                  || [istarget i?86-*-*] || [istarget x86_64-*-*]}]
> > > > +                  || [istarget i?86-*-*]
> > > > +                  || [istarget x86_64-*-*]
> > > > +                  || [istarget aarch64*-*-*]}]
> > > >  }
> > > >
> > > >  # Return 1 if the target supports the fold_extract_last optab.
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
