> > Pengxuan Zheng <quic_pzh...@quicinc.com> writes:
> > > This patch optimizes certain vector permute expansion with the FMOV
> > > instruction when one of the input vectors is a vector of all zeros
> > > and the result of the vector permute is as if the upper lane of the
> > > non-zero input vector is set to zero and the lower lane remains
> > > unchanged.
> > >
> > > Note that the patch also propagates zero_op0_p and zero_op1_p during
> > > re-encode now.  They will be used by aarch64_evpc_fmov to check if
> > > the input vectors are valid candidates.
> > >
> > >   PR target/100165
> > >
> > > gcc/ChangeLog:
> > >
> > >   * config/aarch64/aarch64-simd.md (aarch64_simd_vec_set_zero_fmov<mode>):
> > >   New define_insn.
> > >   * config/aarch64/aarch64.cc (aarch64_evpc_reencode): Copy zero_op0_p and
> > >   zero_op1_p.
> > >   (aarch64_evpc_fmov): New function.
> > >   (aarch64_expand_vec_perm_const_1): Add call to aarch64_evpc_fmov.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >   * gcc.target/aarch64/vec-set-zero.c: Update test accordingly.
> > >   * gcc.target/aarch64/fmov.c: New test.
> > >   * gcc.target/aarch64/fmov-be.c: New test.
> >
> > Nice!  Thanks for doing this.  Some comments on the patch below.
> > >
> > > Signed-off-by: Pengxuan Zheng <quic_pzh...@quicinc.com>
> > > ---
> > >  gcc/config/aarch64/aarch64-simd.md            |  14 +++
> > >  gcc/config/aarch64/aarch64.cc                 |  74 +++++++++++-
> > >  gcc/testsuite/gcc.target/aarch64/fmov-be.c    |  74 ++++++++++++
> > >  gcc/testsuite/gcc.target/aarch64/fmov.c       | 110 ++++++++++++++++++
> > >  .../gcc.target/aarch64/vec-set-zero.c         |   6 +-
> > >  5 files changed, 275 insertions(+), 3 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov-be.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov.c
> > >
> > > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> > > index e456f693d2f..543126948e7 100644
> > > --- a/gcc/config/aarch64/aarch64-simd.md
> > > +++ b/gcc/config/aarch64/aarch64-simd.md
> > > @@ -1190,6 +1190,20 @@ (define_insn "aarch64_simd_vec_set<mode>"
> > >    [(set_attr "type" "neon_ins<q>, neon_from_gp<q>, neon_load1_one_lane<q>")]
> > >  )
> > >
> > > +(define_insn "aarch64_simd_vec_set_zero_fmov<mode>"
> > > +  [(set (match_operand:VP_2E 0 "register_operand" "=w")
> > > + (vec_merge:VP_2E
> > > +     (match_operand:VP_2E 1 "aarch64_simd_imm_zero" "Dz")
> > > +     (match_operand:VP_2E 3 "register_operand" "w")
> > > +     (match_operand:SI 2 "immediate_operand" "i")))]
> > > +  "TARGET_SIMD
> > > +   && (ENDIAN_LANE_N (<nunits>, exact_log2 (INTVAL (operands[2]))) == 1)"
> > > +  {
> > > +    return "fmov\\t%<Vetype>0, %<Vetype>3";
> > > +  }
> > > +  [(set_attr "type" "fmov")]
> > > +)
> > > +
> >
> > I think this shows that target-independent code is missing some
> > canonicalisation of vec_merge.  combine has:
> >
> >   unsigned n_elts = 0;
> >   if (GET_CODE (x) == VEC_MERGE
> >       && CONST_INT_P (XEXP (x, 2))
> >       && GET_MODE_NUNITS (GET_MODE (x)).is_constant (&n_elts)
> >       && (swap_commutative_operands_p (XEXP (x, 0), XEXP (x, 1))
> >       /* Two operands have same precedence, then
> >          first bit of mask select first operand.  */
> >       || (!swap_commutative_operands_p (XEXP (x, 1), XEXP (x, 0))
> >           && !(UINTVAL (XEXP (x, 2)) & 1))))
> >     {
> >       rtx temp = XEXP (x, 0);
> >       unsigned HOST_WIDE_INT sel = UINTVAL (XEXP (x, 2));
> >       unsigned HOST_WIDE_INT mask = HOST_WIDE_INT_1U;
> >       if (n_elts == HOST_BITS_PER_WIDE_INT)
> >     mask = -1;
> >       else
> >     mask = (HOST_WIDE_INT_1U << n_elts) - 1;
> >       SUBST (XEXP (x, 0), XEXP (x, 1));
> >       SUBST (XEXP (x, 1), temp);
> >       SUBST (XEXP (x, 2), GEN_INT (~sel & mask));
> >     }
> >
> > which AFAICT would prefer to put the immediate second, not first.  I
> > think we should be doing the same canonicalisation in
> > simplify_ternary_operation, and possibly elsewhere, so that the .md
> > pattern only needs to match the canonical form (i.e. register, immediate,
> > mask).
> 
> Thanks for the suggestion. I've added the canonicalization in a separate
> patch.
> https://gcc.gnu.org/pipermail/gcc-patches/2025-February/676105.html
> 
> >
> > On:
> >
> > > +   && (ENDIAN_LANE_N (<nunits>, exact_log2 (INTVAL (operands[2]))) == 1)"
> >
> > it seems dangerous to pass exact_log2 to ENDIAN_LANE_N when we haven't
> > checked whether it is a power of 2.  (0b00 or 0b11 ought to get
> > simplified, but I don't think we can ignore the possibility.)
> >
> > Rather than restrict the pattern to pairs, could we instead handle
> > VALL_F16 minus the QI elements, with the 16-bit elements restricted to
> > TARGET_F16?  E.g. we should be able to handle V4SI using an FMOV of S
> > registers if only the low element is nonzero.
> 
> Good point! I've addressed these in the latest version. Please let me know
> if I missed anything.
> https://gcc.gnu.org/pipermail/gcc-patches/2025-February/676106.html

Missed a small change. Here's the latest version:
https://gcc.gnu.org/pipermail/gcc-patches/2025-February/676125.html

Thanks,
Pengxuan
> 
> Thanks,
> Pengxuan
> 
> >
> > Part of me thinks that this should just be described as a plain old
> > AND, but I suppose that doesn't work well for FP modes.  Still,
> > handling ANDs might be an interesting follow-up :)
> >
> > Thanks,
> > Richard
> >
> > >  (define_insn "aarch64_simd_vec_set_zero<mode>"
> > >    [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> > >   (vec_merge:VALL_F16
> > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > > index a6cc00e74ab..64756920eda 100644
> > > --- a/gcc/config/aarch64/aarch64.cc
> > > +++ b/gcc/config/aarch64/aarch64.cc
> > > @@ -25950,6 +25950,8 @@ aarch64_evpc_reencode (struct expand_vec_perm_d *d)
> > >    newd.target = d->target ? gen_lowpart (new_mode, d->target) : NULL;
> > >    newd.op0 = d->op0 ? gen_lowpart (new_mode, d->op0) : NULL;
> > >    newd.op1 = d->op1 ? gen_lowpart (new_mode, d->op1) : NULL;
> > > +  newd.zero_op0_p = d->zero_op0_p;
> > > +  newd.zero_op1_p = d->zero_op1_p;
> > >    newd.testing_p = d->testing_p;
> > >    newd.one_vector_p = d->one_vector_p;
> > >
> > > @@ -26434,6 +26436,74 @@ aarch64_evpc_ins (struct expand_vec_perm_d *d)
> > >    return true;
> > >  }
> > >
> > > +/* Recognize patterns suitable for the FMOV instructions.  */
> > > +static bool
> > > +aarch64_evpc_fmov (struct expand_vec_perm_d *d)
> > > +{
> > > +  machine_mode mode = d->vmode;
> > > +  unsigned HOST_WIDE_INT nelt;
> > > +
> > > +  if (d->vec_flags != VEC_ADVSIMD)
> > > +    return false;
> > > +
> > > +  /* to_constant is safe since this routine is specific to Advanced SIMD
> > > +     vectors.  */
> > > +  nelt = d->perm.length ().to_constant ();
> > > +
> > > +  /* Either d->op0 or d->op1 should be a vector of all zeros.  */
> > > +  if (nelt != 2 || d->one_vector_p || (!d->zero_op0_p && !d->zero_op1_p))
> > > +    return false;
> > > +
> > > +  HOST_WIDE_INT elt0, elt1;
> > > +  rtx in0 = d->op0;
> > > +  rtx in1 = d->op1;
> > > +
> > > +  if (!d->perm[0].is_constant (&elt0))
> > > +    return false;
> > > +
> > > +  if (!d->perm[1].is_constant (&elt1))
> > > +    return false;
> > > +
> > > +  if (!BYTES_BIG_ENDIAN)
> > > +    {
> > > +      /* Lane 0 of the output vector should come from lane 0 of the
> > > +	 non-zero vector.  */
> > > +      if (elt0 != (d->zero_op0_p ? 2 : 0))
> > > + return false;
> > > +
> > > +      /* Lane 1 of the output vector should come from any lane of the
> > > +	 zero vector.  */
> > > +      if (elt1 != (d->zero_op0_p ? 0 : 2) && elt1 != (d->zero_op0_p ? 1 : 3))
> > > + return false;
> > > +    }
> > > +  else
> > > +    {
> > > +      /* Lane 0 of the output vector should come from any lane of the
> > > +	 zero vector.  */
> > > +      if (elt0 != (d->zero_op0_p ? 0 : 2) && elt0 != (d->zero_op0_p ? 1 : 3))
> > > + return false;
> > > +
> > > +      /* Lane 1 of the output vector should come from lane 1 of the
> > > +	 non-zero vector.  */
> > > +      if (elt1 != (d->zero_op0_p ? 3 : 1))
> > > + return false;
> > > +    }
> > > +
> > > +  if (d->testing_p)
> > > +    return true;
> > > +
> > > +  insn_code icode = code_for_aarch64_simd_vec_copy_lane (mode);
> > > +  expand_operand ops[5];
> > > +  create_output_operand (&ops[0], d->target, mode);
> > > +  create_input_operand (&ops[1], d->zero_op0_p ? in1 : in0, mode);
> > > +  create_integer_operand (&ops[2], BYTES_BIG_ENDIAN ? 1 : 2);
> > > +  create_input_operand (&ops[3], d->zero_op0_p ? in0 : in1, mode);
> > > +  create_integer_operand (&ops[4], 0);
> > > +  expand_insn (icode, 5, ops);
> > > +
> > > +  return true;
> > > +}
> > > +
> > >  static bool
> > >  aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d)
> > >  {
> > > @@ -26457,7 +26527,9 @@ aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d)
> > >      {
> > >        if (d->vmode == d->op_mode)
> > >   {
> > > -   if (aarch64_evpc_rev_local (d))
> > > +   if (aarch64_evpc_fmov (d))
> > > +     return true;
> > > +   else if (aarch64_evpc_rev_local (d))
> > >       return true;
> > >     else if (aarch64_evpc_rev_global (d))
> > >       return true;
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/fmov-be.c b/gcc/testsuite/gcc.target/aarch64/fmov-be.c
> > > new file mode 100644
> > > index 00000000000..f864d474a7a
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/fmov-be.c
> > > @@ -0,0 +1,74 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mbig-endian" } */
> > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > +
> > > +typedef int v2si __attribute__ ((vector_size (8)));
> > > +typedef long v2di __attribute__ ((vector_size (16)));
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +typedef float v4sf __attribute__ ((vector_size (16)));
> > > +
> > > +/*
> > > +** f_v2si:
> > > +**       fmov    s0, s0
> > > +**       ret
> > > +*/
> > > +v2si
> > > +f_v2si (v2si x)
> > > +{
> > > +  return __builtin_shuffle (x, (v2si){ 0, 0 }, (v2si){ 3, 1 });
> > > +}
> > > +
> > > +/*
> > > +** g_v2si:
> > > +**       fmov    s0, s0
> > > +**       ret
> > > +*/
> > > +v2si
> > > +g_v2si (v2si x)
> > > +{
> > > +  return __builtin_shuffle ((v2si){ 0, 0 }, x, (v2si){ 0, 3 });
> > > +}
> > > +
> > > +/*
> > > +** f_v2di:
> > > +**       fmov    d0, d0
> > > +**       ret
> > > +*/
> > > +v2di
> > > +f_v2di (v2di x)
> > > +{
> > > +  return __builtin_shuffle (x, (v2di){ 0, 0 }, (v2di){ 2, 1 });
> > > +}
> > > +
> > > +/*
> > > +** g_v2di:
> > > +**       fmov    d0, d0
> > > +**       ret
> > > +*/
> > > +v2di
> > > +g_v2di (v2di x)
> > > +{
> > > +  return __builtin_shuffle ((v2di){ 0, 0 }, x, (v2di){ 0, 3 });
> > > +}
> > > +
> > > +/*
> > > +** f_v4si:
> > > +**       fmov    d0, d0
> > > +**       ret
> > > +*/
> > > +v4si
> > > +f_v4si (v4si x)
> > > +{
> > > +  return __builtin_shuffle (x, (v4si){ 0, 0, 0, 0 }, (v4si){ 6, 7, 2, 3 });
> > > +}
> > > +
> > > +/*
> > > +** g_v4si:
> > > +**       fmov    d0, d0
> > > +**       ret
> > > +*/
> > > +v4si
> > > +g_v4si (v4si x)
> > > +{
> > > +  return __builtin_shuffle ((v4si){ 0, 0, 0, 0 }, x, (v4si){ 2, 3, 6, 7 });
> > > +}
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/fmov.c b/gcc/testsuite/gcc.target/aarch64/fmov.c
> > > new file mode 100644
> > > index 00000000000..f40ec56c5f8
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/fmov.c
> > > @@ -0,0 +1,110 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2" } */
> > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > +
> > > +typedef int v2si __attribute__ ((vector_size (8)));
> > > +typedef float v2sf __attribute__ ((vector_size (8)));
> > > +typedef long v2di __attribute__ ((vector_size (16)));
> > > +typedef double v2df __attribute__ ((vector_size (16)));
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +typedef float v4sf __attribute__ ((vector_size (16)));
> > > +
> > > +
> > > +/*
> > > +** f_v2si:
> > > +**       fmov    s0, s0
> > > +**       ret
> > > +*/
> > > +v2si
> > > +f_v2si (v2si x)
> > > +{
> > > +  return __builtin_shuffle (x, (v2si){ 0, 0 }, (v2si){ 0, 3 });
> > > +}
> > > +
> > > +/*
> > > +** g_v2si:
> > > +**       fmov    s0, s0
> > > +**       ret
> > > +*/
> > > +v2si
> > > +g_v2si (v2si x)
> > > +{
> > > +  return __builtin_shuffle ((v2si){ 0, 0 }, x, (v2si){ 2, 0 });
> > > +}
> > > +
> > > +/*
> > > +** f_v2sf:
> > > +**       fmov    s0, s0
> > > +**       ret
> > > +*/
> > > +v2sf
> > > +f_v2sf (v2sf x)
> > > +{
> > > +  return __builtin_shuffle (x, (v2sf){ 0, 0 }, (v2si){ 0, 2 });
> > > +}
> > > +
> > > +/*
> > > +** f_v2di:
> > > +**       fmov    d0, d0
> > > +**       ret
> > > +*/
> > > +v2di
> > > +f_v2di (v2di x)
> > > +{
> > > +  return __builtin_shuffle (x, (v2di){ 0, 0 }, (v2di){ 0, 3 });
> > > +}
> > > +
> > > +/*
> > > +** g_v2di:
> > > +**       fmov    d0, d0
> > > +**       ret
> > > +*/
> > > +v2di
> > > +g_v2di (v2di x)
> > > +{
> > > +  return __builtin_shuffle ((v2di){ 0, 0 }, x, (v2di){ 2, 1 });
> > > +}
> > > +
> > > +/*
> > > +** f_v2df:
> > > +**       fmov    d0, d0
> > > +**       ret
> > > +*/
> > > +v2df
> > > +f_v2df (v2df x)
> > > +{
> > > +  return __builtin_shuffle (x, (v2df){ 0, 0 }, (v2di){ 0, 2 });
> > > +}
> > > +
> > > +/*
> > > +** f_v4si:
> > > +**       fmov    d0, d0
> > > +**       ret
> > > +*/
> > > +v4si
> > > +f_v4si (v4si x)
> > > +{
> > > +  return __builtin_shuffle (x, (v4si){ 0, 0, 0, 0 }, (v4si){ 0, 1, 4, 5 });
> > > +}
> > > +
> > > +/*
> > > +** g_v4si:
> > > +**       fmov    d0, d0
> > > +**       ret
> > > +*/
> > > +v4si
> > > +g_v4si (v4si x)
> > > +{
> > > +  return __builtin_shuffle ((v4si){ 0, 0, 0, 0 }, x, (v4si){ 4, 5, 2, 3 });
> > > +}
> > > +
> > > +/*
> > > +** f_v4sf:
> > > +**       fmov    d0, d0
> > > +**       ret
> > > +*/
> > > +v4sf
> > > +f_v4sf (v4sf x)
> > > +{
> > > +  return __builtin_shuffle (x, (v4sf){ 0, 0, 0, 0 }, (v4si){ 0, 1, 6, 7 });
> > > +}
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-set-zero.c b/gcc/testsuite/gcc.target/aarch64/vec-set-zero.c
> > > index b34b902cf27..9040839931f 100644
> > > --- a/gcc/testsuite/gcc.target/aarch64/vec-set-zero.c
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vec-set-zero.c
> > > @@ -28,8 +28,10 @@ FOO(float64x2_t)
> > >
> > >  /* { dg-final { scan-assembler-times {ins\tv[0-9]+\.b\[1\], wzr} 2 { target aarch64_little_endian } } } */
> > >  /* { dg-final { scan-assembler-times {ins\tv[0-9]+\.h\[1\], wzr} 4 { target aarch64_little_endian } } } */
> > > -/* { dg-final { scan-assembler-times {ins\tv[0-9]+\.s\[1\], wzr} 4 { target aarch64_little_endian } } } */
> > > -/* { dg-final { scan-assembler-times {ins\tv[0-9]+\.d\[1\], xzr} 2 { target aarch64_little_endian } } } */
> > > +/* { dg-final { scan-assembler-times {ins\tv[0-9]+\.s\[1\], wzr} 2 { target aarch64_little_endian } } } */
> > > +/* { dg-final { scan-assembler-times {ins\tv[0-9]+\.d\[1\], xzr} 0 { target aarch64_little_endian } } } */
> > > +/* { dg-final { scan-assembler-times {fmov\ts0, s0} 2 { target aarch64_little_endian } } } */
> > > +/* { dg-final { scan-assembler-times {fmov\td0, d0} 2 { target aarch64_little_endian } } } */
> > >
> > >  /* { dg-final { scan-assembler-times {ins\tv[0-9]+\.b\[6\], wzr} 1 { target aarch64_big_endian } } } */
> > >  /* { dg-final { scan-assembler-times {ins\tv[0-9]+\.b\[14\], wzr} 1 { target aarch64_big_endian } } } */
