Hi Prathamesh and Richard,

Thanks for the review and nice suggestions!

> > I guess the transform should work as long as mask is same for both
> > vectors even if it's not constant ?
>
> Yes, please change accordingly (and maybe push separately).
>

Removed the VECTOR_CST requirement on the mask for the integer ops pattern.
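
To illustrate why the mask does not need to be constant for integer ops,
here is a minimal standalone sketch in plain C++ (not GCC internals; the
names Vec4, permute, add and transform_holds are made up for illustration).
It models a single-input permute and an element-wise add, and checks that
permuting both operands and then adding gives the same result as adding
first and permuting once, for whatever mask is passed in at run time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<unsigned, 4>;

// Model of VEC_PERM_EXPR (v, v, mask): both inputs are the same vector,
// so each index is effectively taken modulo the element count.
Vec4 permute(const Vec4 &v, const Vec4 &mask) {
  Vec4 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = v[mask[i] % v.size()];
  return r;
}

// Element-wise add (unsigned, so wrap-around is well defined).
Vec4 add(const Vec4 &a, const Vec4 &b) {
  Vec4 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = a[i] + b[i];
  return r;
}

// True if perm (a) + perm (b) equals perm (a + b) for this mask.
bool transform_holds(const Vec4 &a, const Vec4 &b, const Vec4 &mask) {
  return add(permute(a, mask), permute(b, mask)) == permute(add(a, b), mask);
}

// Small usage example: a full permutation and a mask that repeats a lane.
void example() {
  assert(transform_holds({1, 2, 3, 4}, {10, 20, 30, 40}, {2, 3, 1, 0}));
  assert(transform_holds({1, 2, 3, 4}, {10, 20, 30, 40}, {0, 0, 0, 0}));
}
```

The identity holds lane by lane, which is why only "same mask on both
sides" matters for the integer pattern.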

> > If this transform is meant only for VLS vectors, I guess you should
> > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > otherwise it will crash for VLA vectors.
>
> I suppose it's difficult to create a VLA permute that covers all elements
> and that is not trivial though.  But indeed add ().is_constant to the
> VECTOR_FLOAT_TYPE_P guard.

Added.

> Meh, that's quadratic!  I suggest to check
> .encoding ().encoded_full_vector_p () (as said I can't think of a
> non-full encoding that isn't trivial but covers all elements) and then
> simply .qsort () the vector_builder (it derives from vec<>) so the scan
> is O(n log n).

The .qsort () approach needs a separate comparison function, which IMO
is not feasible to implement inside match.pd (and I suppose a lambda
would not be a good idea either).
Another option would be hash_set, but it does not work here for
int64_t or poly_int64 elements.
So I kept the simple O(n^2) code; the number of permutation indices is
usually small, so the quadratic cost should be acceptable.
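
For comparison, the coverage check can also be done in a single pass with
a "seen" bitmap, which needs neither a comparison function nor hashing.
Below is a minimal standalone sketch in plain C++ using std::vector<bool>;
it is not match.pd code, and covers_all_elements is a hypothetical helper
name. It assumes the indices are plain integers and reduces them modulo
the element count, since both permute inputs are the same vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if every element index in [0, sel.size ()) is selected by
// at least one entry of SEL.  Runs in O(n) time with O(n) extra bits.
bool covers_all_elements(const std::vector<std::uint64_t> &sel) {
  const std::size_t nelts = sel.size();
  std::vector<bool> seen(nelts, false);
  std::size_t count = 0;
  for (std::uint64_t idx : sel) {
    // Both inputs are the same vector, so wrap the index into range.
    const std::size_t i = static_cast<std::size_t>(idx % nelts);
    if (!seen[i]) {
      seen[i] = true;
      ++count;
    }
  }
  return count == nelts;
}

// Small usage example: the 4-element mask from the test, and a mask that
// never selects element 3.
void example_cover() {
  assert(covers_all_elements({2, 3, 1, 0}));
  assert(!covers_all_elements({0, 0, 1, 2}));
}
```

Whether something like this is worth it over the quadratic loop depends
on how large the vectors get in practice.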

The updated patch is attached.

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Tue, Nov 8, 2022 at 22:38:


>
> On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > Hi,
> > >
> > > This is a follow-up patch for PR98167
> > >
> > > The sequence
> > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > >      c3 = c1 op c2
> > > can be optimized to
> > >      c = a op b
> > >      c3 = VEC_PERM_EXPR (c, c, mask)
> > > for all integer vector operation, and float operation with
> > > full permutation.
> > >
> > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > >
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > >         PR target/98167
> > >         * match.pd: New perm + vector op patterns for int and fp vector.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         PR target/98167
> > >         * gcc.target/i386/pr98167.c: New test.
> > > ---
> > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> > >  2 files changed, 93 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> > >
> > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > index 194ba8f5188..b85ad34f609 100644
> > > --- a/gcc/match.pd
> > > +++ b/gcc/match.pd
> > > @@ -8189,3 +8189,52 @@ and,
> > >   (bit_and (negate @0) integer_onep@1)
> > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > >    (bit_and @0 @1)))
> > > +
> > > +/* Optimize
> > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > +   c3 = c1 op c2
> > > +   -->
> > > +   c = a op b
> > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > +   For all integer non-div operations.  */
> > > +(for op (plus minus mult bit_and bit_ior bit_xor
> > > +        lshift rshift)
> > > + (simplify
> > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > Just wondering, why should mask be CST here ?
> > I guess the transform should work as long as mask is same for both
> > vectors even if it's
> > not constant ?
>
> Yes, please change accordingly (and maybe push separately).
>
> > > +
> > > +/* Similar for float arithmetic when permutation constant covers
> > > +   all vector elements.  */
> > > +(for op (plus minus mult)
> > > + (simplify
> > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > +     (with
> > > +      {
> > > +       tree perm_cst = @2;
> > > +       vec_perm_builder builder;
> > > +       bool full_perm_p = false;
> > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > +         {
> > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > If this transform is meant only for VLS vectors, I guess you should
> > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > otherwise it will crash for VLA vectors.
>
> I suppose it's difficult to create a VLA permute that covers all elements
> and that is not trivial though.  But indeed add ().is_constant to the
> VECTOR_FLOAT_TYPE_P guard.
>
> >
> > Thanks,
> > Prathamesh
> > > +           vec_perm_indices sel (builder, 1, nelts);
> > > +
> > > +           /* Check if perm indices covers all vector elements.  */
> > > +           int count = 0, i, j;
> > > +           for (i = 0; i < nelts; i++)
> > > +             for (j = 0; j < nelts; j++)
>
> Meh, that's quadratic!  I suggest to check
> .encoding ().encoded_full_vector_p () (as said I can't think of a
> non-full encoding that isn't trivial but covers all elements) and then
> simply .qsort () the vector_builder (it derives from vec<>) so the scan
> is O(n log n).
>
> Maybe Richard has a better idea here though.
>
> Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns
> it's always that they explode the match decision tree, we'd ideally have
> a way to match those with (op ..) (op ..) first to be able to share more
> of the matching code.  That said, match.pd is a less than ideal place for
> these (but mostly because of the way we code generate *-match.cc)
>
> Richard.
>
> > > +               {
> > > +                 if (sel[j].to_constant () == i)
> > > +                   {
> > > +                     count++;
> > > +                     break;
> > > +                   }
> > > +               }
> > > +           full_perm_p = count == nelts;
> > > +         }
> > > +       }
> > > +       (if (full_perm_p)
> > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > new file mode 100644
> > > index 00000000000..40e0ac11332
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > @@ -0,0 +1,44 @@
> > > +/* PR target/98167 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mavx2" } */
> > > +
> > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > +
> > > +#define VEC_PERM_4 \
> > > +  2, 3, 1, 0
> > > +#define VEC_PERM_8 \
> > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > +#define VEC_PERM_16 \
> > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > +
> > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > > +                                             v##size##s##type b) \
> > > +  { \
> > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > +                                                  VEC_PERM_##size); \
> > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > +                                                  VEC_PERM_##size); \
> > > +    return a1 op b1; \
> > > +  }
> > > +
> > > +#define INT_PERMS(op, name) \
> > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > +
> > > +#define FP_PERMS(op, name) \
> > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > +
> > > +INT_PERMS (+, add)
> > > +INT_PERMS (-, sub)
> > > +INT_PERMS (*, mul)
> > > +INT_PERMS (|, ior)
> > > +INT_PERMS (^, xor)
> > > +INT_PERMS (&, and)
> > > +INT_PERMS (<<, shl)
> > > +INT_PERMS (>>, shr)
> > > +FP_PERMS (+, add)
> > > +FP_PERMS (-, sub)
> > > +FP_PERMS (*, mul)
> > > +
> > > --
> > > 2.18.1
> > >
From 2d0014e3b0f9fedcd75fe31cffd4f998db6db543 Mon Sep 17 00:00:00 2001
From: Hongyu Wang <hongyu.w...@intel.com>
Date: Mon, 17 Jan 2022 13:01:51 +0800
Subject: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and
 operation

The sequence
     c1 = VEC_PERM_EXPR (a, a, mask)
     c2 = VEC_PERM_EXPR (b, b, mask)
     c3 = c1 op c2
can be optimized to
     c = a op b
     c3 = VEC_PERM_EXPR (c, c, mask)
for all integer vector operations, and for float operations when the
permutation covers all vector elements.

gcc/ChangeLog:

	PR target/98167
	* match.pd: New perm + vector op patterns for int and fp vector.

gcc/testsuite/ChangeLog:

	PR target/98167
	* gcc.target/i386/pr98167.c: New test.
---
 gcc/match.pd                            | 50 +++++++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
 2 files changed, 94 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 194ba8f5188..a394d664226 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -8189,3 +8189,53 @@ and,
  (bit_and (negate @0) integer_onep@1)
  (if (!TYPE_OVERFLOW_SANITIZED (type))
   (bit_and @0 @1)))
+
+/* Optimize
+   c1 = VEC_PERM_EXPR (a, a, mask)
+   c2 = VEC_PERM_EXPR (b, b, mask)
+   c3 = c1 op c2
+   -->
+   c = a op b
+   c3 = VEC_PERM_EXPR (c, c, mask)
+   For all integer non-div operations.  */
+(for op (plus minus mult bit_and bit_ior bit_xor
+	 lshift rshift)
+ (simplify
+  (op (vec_perm @0 @0 @2) (vec_perm @1 @1 @2))
+   (if (VECTOR_INTEGER_TYPE_P (type))
+    (vec_perm (op @0 @1) (op @0 @1) @2))))
+
+/* Similar for float arithmetic when permutation constant covers
+   all vector elements.  */
+(for op (plus minus mult)
+ (simplify
+  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
+   (if (VECTOR_FLOAT_TYPE_P (type)
+	&& TYPE_VECTOR_SUBPARTS (type).is_constant ())
+    (with
+     {
+       tree perm_cst = @2;
+       vec_perm_builder builder;
+       bool full_perm_p = false;
+       if (tree_to_vec_perm_builder (&builder, perm_cst))
+	 {
+	   unsigned HOST_WIDE_INT nelts
+	     = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+	   /* Create a vec_perm_indices for the VECTOR_CST.  */
+	   vec_perm_indices sel (builder, 1, nelts);
+
+	   /* Check if perm indices covers all vector elements.  */
+	   unsigned HOST_WIDE_INT i, j, count = 0;
+
+	   for (i = 0; i < nelts; i++)
+	     for (j = 0; j < nelts; j++)
+		if (known_eq (poly_uint64 (sel[j]), i))
+		  {
+		    count++;
+		    break;
+		  }
+	   full_perm_p = known_eq (count, nelts);
+	 }
+      }
+      (if (full_perm_p)
+	(vec_perm (op @0 @1) (op @0 @1) @2))))))
diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
new file mode 100644
index 00000000000..40e0ac11332
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr98167.c
@@ -0,0 +1,44 @@
+/* PR target/98167 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2" } */
+
+/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
+/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
+
+#define VEC_PERM_4 \
+  2, 3, 1, 0
+#define VEC_PERM_8 \
+  4, 5, 6, 7, 3, 2, 1, 0
+#define VEC_PERM_16 \
+  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
+
+#define TYPE_PERM_OP(type, size, op, name) \
+  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
+  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
+					      v##size##s##type b) \
+  { \
+    v##size##s##type a1 = __builtin_shufflevector (a, a, \
+						   VEC_PERM_##size); \
+    v##size##s##type b1 = __builtin_shufflevector (b, b, \
+						   VEC_PERM_##size); \
+    return a1 op b1; \
+  }
+
+#define INT_PERMS(op, name) \
+  TYPE_PERM_OP (int, 4, op, name) \
+
+#define FP_PERMS(op, name) \
+  TYPE_PERM_OP (float, 4, op, name) \
+
+INT_PERMS (+, add)
+INT_PERMS (-, sub)
+INT_PERMS (*, mul)
+INT_PERMS (|, ior)
+INT_PERMS (^, xor)
+INT_PERMS (&, and)
+INT_PERMS (<<, shl)
+INT_PERMS (>>, shr)
+FP_PERMS (+, add)
+FP_PERMS (-, sub)
+FP_PERMS (*, mul)
+
-- 
2.18.1
