Hi esteemed reviewer!

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Tuesday, August 31, 2021 4:46 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; Marcus Shawcroft
> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH 1/5]AArch64 sve: combine inverted masks into NOTs
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> > Hi All,
> >
> > The following example
> >
> > void f10(double * restrict z, double * restrict w, double * restrict x,
> >          double * restrict y, int n)
> > {
> >     for (int i = 0; i < n; i++) {
> >         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
> >     }
> > }
> >
> > currently generates:
> >
> >         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
> >         fcmgt   p2.d, p1/z, z1.d, #0.0
> >         fcmgt   p0.d, p3/z, z1.d, #0.0
> >         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
> >         bic     p0.b, p3/z, p1.b, p0.b
> >         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
> >
> > where a BIC is generated between p1 and p0.  A NOT would be better here,
> > since it would not require the use of p3 and would open the pattern up
> > to being CSEd.
> >
> > After this patch, using a 2 -> 2 split, we generate:
> >
> >         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
> >         fcmgt   p2.d, p0/z, z1.d, #0.0
> >         not     p1.b, p0/z, p2.b
> >
> > The additional scratch is needed so that we can CSE the two operations.
> > If both statements wrote to the same register, CSE would not be able to
> > combine the values when other statements in between use the register.
> >
> > Note: This patch series works incrementally towards generating the most
> > efficient code for this and other loops, in small steps.
> >
> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64-sve.md (*mask_inv_combine): New.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/sve/pred-not-gen.c: New test.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> > index 359fe0e457096cf4042a774789a5c241420703d3..2c23c6b12bafb038d82920e7141a418e078a2c65 100644
> > --- a/gcc/config/aarch64/aarch64-sve.md
> > +++ b/gcc/config/aarch64/aarch64-sve.md
> > @@ -8126,6 +8126,42 @@ (define_insn_and_split "*fcmuo<mode>_and_combine"
> >  	  UNSPEC_COND_FCMUO))]
> >  )
> >
> > +;; Make sure that inversions of masked comparisons are always on the mask
> > +;; instead of on the operation.
> > +(define_insn_and_split "*mask_inv_combine"
>
> I think it would be better to explain why this is an optimisation
> in its own right, e.g.:
>
> ;; Similar to *fcm<cmp_op><mode>_and_combine, but for BIC rather than AND.
> ;; In this case, we still need a separate NOT/BIC operation, but predicating
> ;; the comparison on the BIC operand removes the need for a PTRUE.
>
> For the same reason, calling it *fcm<cmp_op><mode>_bic_combine might
> be more consistent with surrounding code.
>
> It would be good to have a pattern for FCMUO as well, even though that
> isn't a focus of the work.
>
> > +  [(set (match_operand:<VPRED> 0 "register_operand" "=Upa")
> > +	(and:<VPRED>
> > +	  (and:<VPRED>
> > +	    (not:<VPRED>
> > +	      (unspec:<VPRED>
> > +		[(match_operand:<VPRED> 1)
> > +		 (const_int SVE_KNOWN_PTRUE)
> > +		 (match_operand:SVE_FULL_F 2 "register_operand" "w")
> > +		 (match_operand:SVE_FULL_F 3 "aarch64_simd_reg_or_zero" "wDz")]
> > +		SVE_COND_FP_CMP_I0))
> > +	    (match_operand:<VPRED> 4 "register_operand" "Upa"))
> > +	  (match_dup:<VPRED> 1)))
>
> Indentation looks off here: this is a sibling of the inner “and”.
Fixed.

> > +   (clobber (match_scratch:<VPRED> 5 "=&Upa"))]
>
> This needs to be Upl, since it's the target of a comparison.

Fixed; I didn't realise this property of the ISA at all until now.

> > +  "TARGET_SVE"
> > +  "#"
> > +  "&& 1"
> > +  [(set (match_dup 5)
> > +	(unspec:<VPRED>
> > +	  [(match_dup 4)
> > +	   (const_int SVE_MAYBE_NOT_PTRUE)
> > +	   (match_dup 2)
> > +	   (match_dup 3)]
> > +	  SVE_COND_FP_CMP_I0))
> > +   (set (match_dup 0)
> > +	(and:<VPRED>
> > +	  (not:<VPRED>
> > +	    (match_dup 5))
> > +	  (match_dup 4)))]
> > +{
> > +  operands[5] = gen_reg_rtx (<VPRED>mode);
>
> This should be protected by:
>
>   if (can_create_pseudo_p ())
>
> since for post-reload splits we should use operand 5 unaltered.

Done.

> It would be good to test the patch with the "&& 1" changed to
> "&& reload_completed", to make sure that things still work for
> post-RA splits.  I *think* the changes above are the only ones
> needed to make that true, but nothing beats trying.

Yup, with && reload_completed it still works.  CSE misses it as expected,
but other than that it still works.

> > +}
> > +)
> > +
> >  ;; -------------------------------------------------------------------------
> >  ;; ---- [FP] Absolute comparisons
> >  ;; -------------------------------------------------------------------------
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen.c b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..a5bb616ef505a63075cf33203de8cf8e8c38b95d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen.c
> > @@ -0,0 +1,56 @@
> > +/* { dg-do assemble { target aarch64_asm_sve_ok } } */
> > +/* { dg-options "-O3 --save-temps -fno-schedule-insns -fno-schedule-insns2" } */
> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > +
> > +/*
> > +** f10:
> > +** ...
> > +**	ld1d	z1.d, p0/z, \[x1, x5, lsl 3\]
> > +**	fcmgt	p2.d, p0/z, z1.d, #0.0
> > +**	ld1d	z2.d, p2/z, \[x2, x5, lsl 3\]
> > +**	not	p1.b, p0/z, p2.b
> > +**	ld1d	z0.d, p1/z, \[x3, x5, lsl 3\]
> > +** ...
>
> This is probably too specific.  There's no reason for the RA to pick
> exactly these allocations.  There's also nothing to order the middle
> load and the NOT, or the middle load and the final load.
>
> So for this it might be better to have a separate file for f11 and
> use normal scan-assembler-times tests.

I split them off into different tests.  They have a knock-on effect on the
later patches in the series, but I'll send the updated ones after those get
reviewed.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-sve.md (*fcm<cmp_op><mode>_bic_combine,
	*fcmuo<mode>_bic_combine): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/pred-not-gen-1.c: New test.
	* gcc.target/aarch64/sve/pred-not-gen-2.c: New test.
	* gcc.target/aarch64/sve/pred-not-gen-3.c: New test.
	* gcc.target/aarch64/sve/pred-not-gen-4.c: New test.

--- inline copy of patch ---

diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 359fe0e457096cf4042a774789a5c241420703d3..21c8bc28f396d7c4c753f3b7ee45ab2f9f42da67 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -8126,6 +8126,80 @@ (define_insn_and_split "*fcmuo<mode>_and_combine"
 	  UNSPEC_COND_FCMUO))]
 )
 
+;; Similar to *fcm<cmp_op><mode>_and_combine, but for BIC rather than AND.
+;; In this case, we still need a separate NOT/BIC operation, but predicating
+;; the comparison on the BIC operand removes the need for a PTRUE.
+(define_insn_and_split "*fcm<cmp_op><mode>_bic_combine"
+  [(set (match_operand:<VPRED> 0 "register_operand" "=Upa")
+	(and:<VPRED>
+	  (and:<VPRED>
+	    (not:<VPRED>
+	      (unspec:<VPRED>
+		[(match_operand:<VPRED> 1)
+		 (const_int SVE_KNOWN_PTRUE)
+		 (match_operand:SVE_FULL_F 2 "register_operand" "w")
+		 (match_operand:SVE_FULL_F 3 "aarch64_simd_reg_or_zero" "wDz")]
+		SVE_COND_FP_CMP_I0))
+	    (match_operand:<VPRED> 4 "register_operand" "Upa"))
+	  (match_dup:<VPRED> 1)))
+   (clobber (match_scratch:<VPRED> 5 "=&Upl"))]
+  "TARGET_SVE"
+  "#"
+  "&& 1"
+  [(set (match_dup 5)
+	(unspec:<VPRED>
+	  [(match_dup 4)
+	   (const_int SVE_MAYBE_NOT_PTRUE)
+	   (match_dup 2)
+	   (match_dup 3)]
+	  SVE_COND_FP_CMP_I0))
+   (set (match_dup 0)
+	(and:<VPRED>
+	  (not:<VPRED>
+	    (match_dup 5))
+	  (match_dup 4)))]
+{
+  if (can_create_pseudo_p ())
+    operands[5] = gen_reg_rtx (<VPRED>mode);
+}
+)
+
+;; Same for unordered comparisons.
+(define_insn_and_split "*fcmuo<mode>_bic_combine"
+  [(set (match_operand:<VPRED> 0 "register_operand" "=Upa")
+	(and:<VPRED>
+	  (and:<VPRED>
+	    (not:<VPRED>
+	      (unspec:<VPRED>
+		[(match_operand:<VPRED> 1)
+		 (const_int SVE_KNOWN_PTRUE)
+		 (match_operand:SVE_FULL_F 2 "register_operand" "w")
+		 (match_operand:SVE_FULL_F 3 "aarch64_simd_reg_or_zero" "wDz")]
+		UNSPEC_COND_FCMUO))
+	    (match_operand:<VPRED> 4 "register_operand" "Upa"))
+	  (match_dup:<VPRED> 1)))
+   (clobber (match_scratch:<VPRED> 5 "=&Upl"))]
+  "TARGET_SVE"
+  "#"
+  "&& 1"
+  [(set (match_dup 5)
+	(unspec:<VPRED>
+	  [(match_dup 4)
+	   (const_int SVE_MAYBE_NOT_PTRUE)
+	   (match_dup 2)
+	   (match_dup 3)]
+	  UNSPEC_COND_FCMUO))
+   (set (match_dup 0)
+	(and:<VPRED>
+	  (not:<VPRED>
+	    (match_dup 5))
+	  (match_dup 4)))]
+{
+  if (can_create_pseudo_p ())
+    operands[5] = gen_reg_rtx (<VPRED>mode);
+}
+)
+
 ;; -------------------------------------------------------------------------
 ;; ---- [FP] Absolute comparisons
 ;; -------------------------------------------------------------------------
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-1.c b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..2c06564186c5a5e7917da475a9c201c81dfeb136
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-1.c
@@ -0,0 +1,23 @@
+/* { dg-do assemble { target aarch64_asm_sve_ok } } */
+/* { dg-options "-O3 --save-temps" } */
+
+/*
+** f10:
+** ...
+**	ld1d	z1.d, p0/z, \[x1, x5, lsl 3\]
+**	fcmgt	p2.d, p0/z, z1.d, #0.0
+**	ld1d	z2.d, p2/z, \[x2, x5, lsl 3\]
+**	not	p1.b, p0/z, p2.b
+**	ld1d	z0.d, p1/z, \[x3, x5, lsl 3\]
+** ...
+*/
+
+void f10(double * restrict z, double * restrict w, double * restrict x, double * restrict y, int n)
+{
+    for (int i = 0; i < n; i++) {
+        z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
+    }
+}
+
+/* { dg-final { scan-assembler-not {\tbic\t} } } */
+/* { dg-final { scan-assembler-times {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-2.c b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..0c3b78d4c67455c971e94fb2ffdd7be2d4884864
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-2.c
@@ -0,0 +1,23 @@
+/* { dg-do assemble { target aarch64_asm_sve_ok } } */
+/* { dg-options "-O3 --save-temps" } */
+
+/*
+** f11:
+** ...
+**	ld1d	z0.d, p0/z, \[x1, x2, lsl 3\]
+**	fcmgt	p2.d, p3/z, z0.d, #0.0
+**	fcmgt	p1.d, p0/z, z0.d, #0.0
+**	not	p1.b, p0/z, p1.b
+**	ld1d	z1.d, p1/z, \[x3, x2, lsl 3\]
+** ...
+*/
+
+void f11(double * restrict z, double * restrict w, double * restrict x, double * restrict y, int n)
+{
+    for (int i = 0; i < n; i++) {
+        z[i] = (w[i] > 0) ? w[i] : y[i];
+    }
+}
+
+/* { dg-final { scan-assembler-not {\tbic\t} } } */
+/* { dg-final { scan-assembler-times {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-3.c b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..248f8ab57191ce8a1d4c334533de8bc76aa07691
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-3.c
@@ -0,0 +1,21 @@
+/* { dg-do assemble { target aarch64_asm_sve_ok } } */
+/* { dg-options "-O3 --save-temps" } */
+
+/*
+** f12:
+** ...
+**	ld1w	z1.s, p0/z, \[x1, x2, lsl 2\]
+**	cmple	p1.s, p0/z, z1.s, #0
+**	ld1w	z0.s, p1/z, \[x3, x2, lsl 2\]
+** ...
+*/
+
+void f12(int * restrict z, int * restrict w, int * restrict x, int * restrict y, int n)
+{
+    for (int i = 0; i < n; i++) {
+        z[i] = (w[i] > 0) ? w[i] : y[i];
+    }
+}
+
+/* { dg-final { scan-assembler-not {\tbic\t} } } */
+/* { dg-final { scan-assembler-not {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..96200309880a91ad1db5801115c911cfdce06125
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble { target aarch64_asm_sve_ok } } */
+/* { dg-options "-O3 --save-temps" } */
+
+#include <math.h>
+
+void f13(double * restrict z, double * restrict w, double * restrict x, double * restrict y, int n)
+{
+    for (int i = 0; i < n; i++) {
+        z[i] = (isunordered(w[i], 0)) ? x[i] + w[i] : y[i] - w[i];
+    }
+}
+
+/* { dg-final { scan-assembler-not {\tbic\t} } } */
+/* { dg-final { scan-assembler-times {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} 1 } } */