Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]

Richard Biener via Gcc-patches Wed, 16 Nov 2022 07:30:09 -0800

On Wed, Nov 16, 2022 at 4:25 PM Tamar Christina <tamar.christ...@arm.com> wrote:
>
> Hi,
>
> This patch is causing several ICEs because it changes the permutes from a 
> single register
> permute to a multi register due to the lowering of the expressions to 
> different SSA names.
>
> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717
>
> I have a prototype fix which adds a new rule to CSE this back to a single 
> register permute,
> but would this be the right solution? It seems hard to later on during expand 
> realize that
> the two operands are the same.
>
> It's probably also ok to just block this from happening after vec_lower, 
> however I'm worried that
> If it wasn't CSE'd before vec_lower it'll lower it so something much less 
> efficient.


You can use

 (vec_perm (op@7 @0 @1) @3)

to avoid this issue.

> Thanks,
> Tamar
>
> > -----Original Message-----
> > From: Gcc-patches <gcc-patches-
> > bounces+tamar.christina=arm....@gcc.gnu.org> On Behalf Of Richard
> > Biener via Gcc-patches
> > Sent: Monday, November 14, 2022 2:53 PM
> > To: Hongyu Wang <wwwhhhyyy...@gmail.com>
> > Cc: Prathamesh Kulkarni <prathamesh.kulka...@linaro.org>; Richard
> > Sandiford <richard.sandif...@arm.com>; Hongyu Wang
> > <hongyu.w...@intel.com>; hongtao....@intel.com; gcc-
> > patc...@gcc.gnu.org
> > Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> > index and operation [PR98167]
> >
> > On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang
> > <wwwhhhyyy...@gmail.com> wrote:
> > >
> > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > > think a lambda function is fine to use.  The alternative (used by
> > > > the vectorizer in some places) is to use sth like
> > > >
> > > >  auto_sbitmap seen (nelts);
> > > >  for (i = 0; i < nelts; i++)
> > > >    {
> > > >      if (!bitmap_set_bit (seen, i))
> > > >        break;
> > > >      count++;
> > > >    }
> > > >  full_perm_p = count == nelts;
> > > >
> > > > I'll note that you should still check .encoding
> > > > ().encoded_full_vector_p () and only bother to check that case, that's a
> > very simple check.
> > >
> > > Thanks for the good example! We also tried using wide_int as a bitmask
> > > but your code looks more simple and reasonable.
> > >
> > > Updated the patch accordingly.
> >
> > OK.
> >
> > Thanks,
> > Richard.
> >
> > > Richard Biener <richard.guent...@gmail.com> 于2022年11月10日周四
> > 16:56写道：
> > >
> > >
> > > >
> > > > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang
> > <wwwhhhyyy...@gmail.com> wrote:
> > > > >
> > > > > Hi Prathamesh and Richard,
> > > > >
> > > > > Thanks for the review and nice suggestions!
> > > > >
> > > > > > > I guess the transform should work as long as mask is same for
> > > > > > > both vectors even if it's not constant ?
> > > > > >
> > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > >
> > > > >
> > > > > Removed VECTOR_CST for integer ops.
> > > > >
> > > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > > otherwise it will crash for VLA vectors.
> > > > > >
> > > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > > elements and that is not trivial though.  But indeed add
> > > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > >
> > > > > Added.
> > > > >
> > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > > encoding that isn't trivial but covers all elements) and then
> > > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > > the scan is O(n log n).
> > > > >
> > > > > The .qsort () approach requires an extra cmp_func that IMO would
> > > > > not be feasible to be implemented in match.pd (I suppose lambda
> > > > > function would not be a good idea either).
> > > > > Another solution would be using hash_set but it does not work here
> > > > > for int64_t or poly_int64 type.
> > > > > So I kept current O(n^2) simple code here, and I suppose usually
> > > > > the permutation indices would be a small number even for O(n^2)
> > > > > complexity.
> > > >
> > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > > think a lambda function is fine to use.  The alternative (used by
> > > > the vectorizer in some places) is to use sth like
> > > >
> > > >  auto_sbitmap seen (nelts);
> > > >  for (i = 0; i < nelts; i++)
> > > >    {
> > > >      if (!bitmap_set_bit (seen, i))
> > > >        break;
> > > >      count++;
> > > >    }
> > > >  full_perm_p = count == nelts;
> > > >
> > > > I'll note that you should still check .encoding
> > > > ().encoded_full_vector_p () and only bother to check that case, that's a
> > very simple check.
> > > >
> > > > >
> > > > > Attached updated patch.
> > > > >
> > > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
> > > > > 于2022年11月8日周二 22:38写道：
> > > > >
> > > > >
> > > > > >
> > > > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via
> > > > > > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > > > > > >
> > > > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > This is a follow-up patch for PR98167
> > > > > > > >
> > > > > > > > The sequence
> > > > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > >      c3 = c1 op c2
> > > > > > > > can be optimized to
> > > > > > > >      c = a op b
> > > > > > > >      c3 = VEC_PERM_EXPR (c, c, mask) for all integer vector
> > > > > > > > operation, and float operation with full permutation.
> > > > > > > >
> > > > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > > > >
> > > > > > > > Ok for trunk?
> > > > > > > >
> > > > > > > > gcc/ChangeLog:
> > > > > > > >
> > > > > > > >         PR target/98167
> > > > > > > >         * match.pd: New perm + vector op patterns for int and fp
> > vector.
> > > > > > > >
> > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > >
> > > > > > > >         PR target/98167
> > > > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > > > ---
> > > > > > > >  gcc/match.pd                            | 49 
> > > > > > > > +++++++++++++++++++++++++
> > > > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44
> > > > > > > > ++++++++++++++++++++++
> > > > > > > >  2 files changed, 93 insertions(+)  create mode 100644
> > > > > > > > gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > >
> > > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > > > > 194ba8f5188..b85ad34f609 100644
> > > > > > > > --- a/gcc/match.pd
> > > > > > > > +++ b/gcc/match.pd
> > > > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > > > >    (bit_and @0 @1)))
> > > > > > > > +
> > > > > > > > +/* Optimize
> > > > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > > +   c3 = c1 op c2
> > > > > > > > +   -->
> > > > > > > > +   c = a op b
> > > > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > > > +   For all integer non-div operations.  */ (for op (plus
> > > > > > > > +minus mult bit_and bit_ior bit_xor
> > > > > > > > +        lshift rshift)
> > > > > > > > + (simplify
> > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> > VECTOR_CST@2))
> > > > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > > > Just wondering, why should mask be CST here ?
> > > > > > > I guess the transform should work as long as mask is same for
> > > > > > > both vectors even if it's not constant ?
> > > > > >
> > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > >
> > > > > > > > +
> > > > > > > > +/* Similar for float arithmetic when permutation constant 
> > > > > > > > covers
> > > > > > > > +   all vector elements.  */ (for op (plus minus mult)
> > > > > > > > +(simplify
> > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> > VECTOR_CST@2))
> > > > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > > > +     (with
> > > > > > > > +      {
> > > > > > > > +       tree perm_cst = @2;
> > > > > > > > +       vec_perm_builder builder;
> > > > > > > > +       bool full_perm_p = false;
> > > > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > > > +         {
> > > > > > > > +           /* Create a vec_perm_indices for the integer 
> > > > > > > > vector.  */
> > > > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS
> > > > > > > > +(type).to_constant ();
> > > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > > otherwise it will crash for VLA vectors.
> > > > > >
> > > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > > elements and that is not trivial though.  But indeed add
> > > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > > > +
> > > > > > > > +           /* Check if perm indices covers all vector 
> > > > > > > > elements.  */
> > > > > > > > +           int count = 0, i, j;
> > > > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > > > +             for (j = 0; j < nelts; j++)
> > > > > >
> > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > > encoding that isn't trivial but covers all elements) and then
> > > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > > the scan is O(n log n).
> > > > > >
> > > > > > Maybe Richard has a better idea here though.
> > > > > >
> > > > > > Otherwise looks OK, though with these kind of (* (op ..) (op
> > > > > > ..)) patterns it's always that they explode the match decision
> > > > > > tree, we'd ideally have a way to match those with (op ..) (op
> > > > > > ..) first to be able to share more of the matching code.  That
> > > > > > said, match.pd is a less than ideal place for these (but mostly
> > > > > > because of the way we code generate *-match.cc)
> > > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > > +               {
> > > > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > > > +                   {
> > > > > > > > +                     count++;
> > > > > > > > +                     break;
> > > > > > > > +                   }
> > > > > > > > +               }
> > > > > > > > +           full_perm_p = count == nelts;
> > > > > > > > +         }
> > > > > > > > +       }
> > > > > > > > +       (if (full_perm_p)
> > > > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > new file mode 100644
> > > > > > > > index 00000000000..40e0ac11332
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > @@ -0,0 +1,44 @@
> > > > > > > > +/* PR target/98167 */
> > > > > > > > +/* { dg-do compile } */
> > > > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > > > +
> > > > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > > > +
> > > > > > > > +#define VEC_PERM_4 \
> > > > > > > > +  2, 3, 1, 0
> > > > > > > > +#define VEC_PERM_8 \
> > > > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > > > +#define VEC_PERM_16 \
> > > > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > > > +
> > > > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > > > +  typedef type v##size##s##type __attribute__
> > > > > > > > +((vector_size(4*size))); \
> > > > > > > > +  v##size##s##type type##foo##size##i_##name
> > (v##size##s##type a, \
> > > > > > > > +
> > > > > > > > +v##size##s##type b) \
> > > > > > > > +  { \
> > > > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > > > +                                                  
> > > > > > > > VEC_PERM_##size); \
> > > > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > > > +                                                  
> > > > > > > > VEC_PERM_##size); \
> > > > > > > > +    return a1 op b1; \
> > > > > > > > +  }
> > > > > > > > +
> > > > > > > > +#define INT_PERMS(op, name) \
> > > > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > > > +
> > > > > > > > +#define FP_PERMS(op, name) \
> > > > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > > > +
> > > > > > > > +INT_PERMS (+, add)
> > > > > > > > +INT_PERMS (-, sub)
> > > > > > > > +INT_PERMS (*, mul)
> > > > > > > > +INT_PERMS (|, ior)
> > > > > > > > +INT_PERMS (^, xor)
> > > > > > > > +INT_PERMS (&, and)
> > > > > > > > +INT_PERMS (<<, shl)
> > > > > > > > +INT_PERMS (>>, shr)
> > > > > > > > +FP_PERMS (+, add)
> > > > > > > > +FP_PERMS (-, sub)
> > > > > > > > +FP_PERMS (*, mul)
> > > > > > > > +
> > > > > > > > --
> > > > > > > > 2.18.1
> > > > > > > >

Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]

Reply via email to