On Thu, Nov 14, 2024 at 3:55 PM Christoph Müllner
<christoph.muell...@vrull.eu> wrote:
>
> On Thu, Nov 14, 2024 at 3:07 PM Richard Biener <rguent...@suse.de> wrote:
> >
> > On Thu, 14 Nov 2024, Christoph Müllner wrote:
> >
> > > This extends forwprop by yet another VEC_PERM optimization:
> > > It attempts to merge two isomorphic vector sequences by exploiting
> > > the redundancy in their lane utilization.
> > > This redundancy comes from the way specific scalar statements end up
> > > being vectorized: two VEC_PERMs on top, binary operations on both of
> > > them, and a final VEC_PERM to create the result.
> > > Here is an example of this sequence:
> > >
> > >   v_in = {e0, e1, e2, e3}
> > >   v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}>
> > >   // v_1 = {e0, e2, e0, e2}
> > >   v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}>
> > >   // v_2 = {e1, e3, e1, e3}
> >
> > The above permutes are just to identify lane duplicates?
>
> Yes.
>
> > >   v_x = v_1 + v_2
> > >   // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
> > >   v_y = v_1 - v_2
> > >   // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}
> > >
> > >   v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}>
> > >   // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
> >
> > and the requirement is really that there are lanes that participate
> > in the compute but end up being unused, detected by being unused in
> > the final permute (and also not used anywhere else)?
>
> Yes.
>
> > > To remove the redundancy, lanes 2 and 3 can be freed, which allows
> > > changing the last statement into:
> > >   v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}>
> > >   // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3}
> >
> > And this isn't the actual transform but instead the idea is that
> > we have two such final permutes where we can re-use the same
> > intermediate compute with relevant lanes "blended".  Aka we
> > implicitly go through the step narrowing the v_1, v_2, v_x and v_y
> > vectors and then "vectorize" two such instances into a wider vector.
>
> Yes.
>
> > That said - I do not see that we need those two initial VEC_PERM
> > if the inputs to the intermediate computes are not used elsewhere?
> > From the final VEC_PERM we know which lanes are used and we should
> > be able to arrange the instance merging permutes from those inputs
> > (and not the inputs of the initial VEC_PERMs)?  When generalizing
> > like this the math is 2x (2x BINOP + 1x VEC_PERM) to
> > 2x VEC_PERM + 2x BINOP + 2x VEC_PERM, 2x BINOP traded for 2x VEC_PERM
> > with the hope of the added 2x VEC_PERM being mergeable with feeding
> > VEC_PERMs?
>
> Yes, in general we can just drop all four initial VEC_PERMs and
> create two new initial VEC_PERMs for the blended narrowed vectors.
>
> The patch takes a shortcut and reuses the existing initial VEC_PERMs
> of the first sequence.  The initial VEC_PERMs of the second sequence
> are left untouched, but will be eliminated later by DCE (together with
> the 2x BINOPs from the second sequence).
>
> > That said, this feels like it should be a backward live problem
> > for lanes combined maybe with a forward redundancy scheme.  The
> > whole thing feels like it fits basic-block (re-)vectorization,
> > there's also somewhat related "lane" tracking in the byteswap
> > recognition pass (which I always wanted to become one detecting
> > vector permutes as well).  For ISAs with multiple vector sizes
> > reducing the vector size without then combining two might be a
> > win as well (but we'd need to know the "real" vector register size,
> > like x86 exposes V4QImode but the vector size is at least V16QImode).
>
> I do not know how to decide if this lane-narrowing is beneficial or not.
> The v1 did this transformation unconditionally, but you mentioned that
> this might not be beneficial in all cases.  So I restricted the lane
> compression to the cases where we can blend later on.
>
> > I'm commenting on the patch independent on the above considerations
> > below.
> >
> > Richard.
> >
> > > The cost of eliminating the redundancy in the lane utilization is that
> > > lowering the VEC_PERM expression could get more expensive because of
> > > tighter packing of the lanes.  Therefore this optimization is not done
> > > alone, but only when we identify two such sequences that can be
> > > merged.
> > >
> > > Once all candidate sequences have been identified, we try to pair them
> > > up, so that we can use the freed lane for the second sequence.
> > > On success, we replace 2x (2x VEC_PERM + 2x BINOP + 1x VEC_PERM)
> > > instructions by 2x VEC_PERM + 2x BINOP + 2x VEC_PERM.
> > > 2x VEC_PERM and 2x BINOP get eliminated.
> > >
> > > This targets x264_pixel_satd_8x4, which calculates the sum of absolute
> > > transformed differences (SATD) using Hadamard transformation.
> > > We have seen 8% speedup on SPEC's x264 on a 5950X (x86-64) and 7%
> > > speedup on an AArch64 machine.
> > >
> > > Bootstrapped and reg-tested on x86-64 and AArch64 (all languages).
> > >
> > > gcc/ChangeLog:
> > >
> > >       * tree-ssa-forwprop.cc (struct _vec_perm_simplify_seq): New data
> > >       structure to store analysis results of a vec perm sequence.
> > >       (get_tree_def): New helper to get the defining statement of an
> > >       SSA_NAME.
> > >       (get_vect_selector_index_map): Helper to get an index map from
> > >       the provided vector permute selector.
> > >       (recognise_vec_perm_simplify_seq): Helper to recognise a
> > >       vec perm sequence.
> > >       (compress_vec_perm_simplify_seq): Helper to pack the lanes
> > >       more tightly.
> > >       (can_merge_vec_perm_simplify_seqs_p): Test if two vec perm 
> > > sequences can
> > >       be merged.
> > >       (merge_vec_perm_simplify_seqs): Helper to merge two vec perm 
> > > sequences.
> > >       (pass_forwprop::execute): Add calls to the vec perm sequence functions.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >       * gcc.dg/tree-ssa/satd-hadamard.c: New test.
> > >       * gcc.dg/tree-ssa/vector-10.c: New test.
> > >       * gcc.dg/tree-ssa/vector-11.c: New test.
> > >       * gcc.dg/tree-ssa/vector-8.c: New test.
> > >       * gcc.dg/tree-ssa/vector-9.c: New test.
> > >       * gcc.target/aarch64/sve/satd-hadamard.c: New test.
> > >
> > > Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
> > > ---
> > > Changes from v2:
> > > * Moved code to tree-ssa-forwprop.cc where similar VEC_PERM
> > >   optimizations are implemented.
> > > * Test operand order less strict in case of commutative operators.
> > > * Made naming more consistent.
> > > * Added a test for testing dependencies between two sequences.
> > > * Removed the instruction reordering (not necessary without dependencies).
> > > * Added tests based on __builtin_shuffle ().
> > >
> > > Changes from v1:
> > > * Moved code from tree-vect-slp.cc into a new pass (from where it could
> > >   be moved elsewhere).
> > > * Only deduplicate lanes if sequences will be merged later on.
> > > * Split functionality more strictly into analysis and transformation parts.
> > >
> > > Manolis Tsamis was the patch's initial author before I took it over.
> > >
> > >  gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c |  43 ++
> > >  gcc/testsuite/gcc.dg/tree-ssa/vector-10.c     |  32 ++
> > >  gcc/testsuite/gcc.dg/tree-ssa/vector-11.c     |  35 ++
> > >  gcc/testsuite/gcc.dg/tree-ssa/vector-8.c      |  31 +
> > >  gcc/testsuite/gcc.dg/tree-ssa/vector-9.c      |  31 +
> > >  .../gcc.target/aarch64/sve/satd-hadamard.c    |   3 +
> > >  gcc/tree-ssa-forwprop.cc                      | 538 ++++++++++++++++++
> > >  7 files changed, 713 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-11.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c 
> > > b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
> > > new file mode 100644
> > > index 00000000000..576ef01628c
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
> > > @@ -0,0 +1,43 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop4-details" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
> > > +    int t0 = s0 + s1;\
> > > +    int t1 = s0 - s1;\
> > > +    int t2 = s2 + s3;\
> > > +    int t3 = s2 - s3;\
> > > +    d0 = t0 + t2;\
> > > +    d1 = t1 + t3;\
> > > +    d2 = t0 - t2;\
> > > +    d3 = t1 - t3;\
> > > +}
> > > +
> > > +int
> > > +x264_pixel_satd_8x4_simplified (uint8_t *pix1, int i_pix1, uint8_t 
> > > *pix2, int i_pix2)
> > > +{
> > > +  uint32_t tmp[4][4];
> > > +  uint32_t a0, a1, a2, a3;
> > > +  int sum = 0;
> > > +  int i;
> > > +
> > > +  for (i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
> > > +    {
> > > +      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
> > > +      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
> > > +      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
> > > +      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
> > > +      HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, 
> > > a3);
> > > +    }
> > > +
> > > +  for (i = 0; i < 4; i++)
> > > +    {
> > > +      HADAMARD4(a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], 
> > > tmp[3][i]);
> > > +      sum += a0 + a1 + a2 + a3;
> > > +    }
> > > +
> > > +  return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" 
> > > "forwprop4" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c 
> > > b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
> > > new file mode 100644
> > > index 00000000000..510a8e814d7
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
> > > @@ -0,0 +1,32 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
> > > +
> > > +typedef int vec __attribute__((vector_size (4 * sizeof (int))));
> > > +
> > > +void h (vec *p_v_in_1, vec *p_v_in_2, vec *v_out_1, vec *v_out_2)
> > > +{
> > > +  vec sel0 = { 0, 2, 0, 2 };
> > > +  vec sel1 = { 1, 3, 1, 3 };
> > > +  vec sel = { 0, 1, 6, 7 };
> > > +  vec v_1, v_2, v_x, v_y;
> > > +  vec v_in_1 = *p_v_in_1;
> > > +  vec v_in_2 = *p_v_in_2;
> > > +
> > > +  /* First vec perm sequence.  */
> > > +  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
> > > +  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
> > > +  v_x = v_1 + v_2;
> > > +  v_y = v_1 - v_2;
> > > +  *v_out_1 = __builtin_shuffle (v_x, v_y, sel);
> > > +
> > > +  /* Second vec perm sequence.  */
> > > +  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
> > > +  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
> > > +  v_x = v_1 + v_2;
> > > +  /* Won't merge because v_2 is RHS1 here.  */
> > > +  v_y = v_2 - v_1;
> > > +  *v_out_2 = __builtin_shuffle (v_x, v_y, sel);
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-not "Vec perm simplify sequences have 
> > > been merged" "forwprop1" } } */
> > > +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" 
> > > "forwprop1" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-11.c 
> > > b/gcc/testsuite/gcc.dg/tree-ssa/vector-11.c
> > > new file mode 100644
> > > index 00000000000..e0ef92d9d56
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-11.c
> > > @@ -0,0 +1,35 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
> > > +
> > > +typedef int vec __attribute__((vector_size (4 * sizeof (int))));
> > > +
> > > +extern vec foo (vec *v);
> > > +void i (vec *p_v_in_1, vec *v_out_1, vec *v_out_2)
> > > +{
> > > +  vec sel0 = { 0, 2, 0, 2 };
> > > +  vec sel1 = { 1, 3, 1, 3 };
> > > +  vec sel = { 0, 1, 6, 7 };
> > > +  vec v_1, v_2, v_x, v_y;
> > > +  vec v_in_1 = *p_v_in_1;
> > > +  vec v_in_2;
> > > +
> > > +  /* First vec perm sequence.  */
> > > +  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
> > > +  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
> > > +  v_x = v_1 + v_2;
> > > +  v_y = v_1 - v_2;
> > > +  *v_out_1 = __builtin_shuffle (v_x, v_y, sel);
> > > +
> > > +  /* Won't merge because of dependency.  */
> > > +  v_in_2 = foo (v_out_1);
> > > +
> > > +  /* Second vec perm sequence.  */
> > > +  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
> > > +  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
> > > +  v_x = v_1 + v_2;
> > > +  v_y = v_1 - v_2;
> > > +  *v_out_2 = __builtin_shuffle (v_x, v_y, sel);
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-not "Vec perm simplify sequences have 
> > > been merged" "forwprop1" } } */
> > > +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" 
> > > "forwprop1" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c 
> > > b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
> > > new file mode 100644
> > > index 00000000000..6b7040d4047
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
> > > @@ -0,0 +1,31 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
> > > +
> > > +typedef int vec __attribute__((vector_size (4 * sizeof (int))));
> > > +
> > > +void f (vec *p_v_in_1, vec *p_v_in_2, vec *v_out_1, vec *v_out_2)
> > > +{
> > > +  vec sel0 = { 0, 2, 0, 2 };
> > > +  vec sel1 = { 1, 3, 1, 3 };
> > > +  vec sel = { 0, 1, 6, 7 };
> > > +  vec v_1, v_2, v_x, v_y;
> > > +  vec v_in_1 = *p_v_in_1;
> > > +  vec v_in_2 = *p_v_in_2;
> > > +
> > > +  /* First vec perm sequence.  */
> > > +  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
> > > +  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
> > > +  v_x = v_1 + v_2;
> > > +  v_y = v_1 - v_2;
> > > +  *v_out_1 = __builtin_shuffle (v_x, v_y, sel);
> > > +
> > > +  /* Second vec perm sequence.  */
> > > +  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
> > > +  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
> > > +  v_x = v_1 + v_2;
> > > +  v_y = v_1 - v_2;
> > > +  *v_out_2 = __builtin_shuffle (v_x, v_y, sel);
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been 
> > > merged" "forwprop1" } } */
> > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" 
> > > "forwprop1" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c 
> > > b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
> > > new file mode 100644
> > > index 00000000000..bb8e53faacb
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
> > > @@ -0,0 +1,31 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
> > > +
> > > +typedef int vec __attribute__((vector_size (4 * sizeof (int))));
> > > +
> > > +void g (vec *p_v_in_1, vec *p_v_in_2, vec *v_out_1, vec *v_out_2)
> > > +{
> > > +  vec sel0 = { 0, 2, 0, 2 };
> > > +  vec sel1 = { 1, 3, 1, 3 };
> > > +  vec sel = { 0, 1, 6, 7 };
> > > +  vec v_1, v_2, v_x, v_y;
> > > +  vec v_in_1 = *p_v_in_1;
> > > +  vec v_in_2 = *p_v_in_2;
> > > +
> > > +  /* First vec perm sequence.  */
> > > +  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
> > > +  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
> > > +  v_x = v_1 * v_2;
> > > +  v_y = v_2 - v_1;
> > > +  *v_out_1 = __builtin_shuffle (v_x, v_y, sel);
> > > +
> > > +  /* Second vec perm sequence.  */
> > > +  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
> > > +  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
> > > +  v_x = v_2 * v_1;
> > > +  v_y = v_2 - v_1;
> > > +  *v_out_2 = __builtin_shuffle (v_x, v_y, sel);
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been 
> > > merged" "forwprop1" } } */
> > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" 
> > > "forwprop1" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c 
> > > b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
> > > new file mode 100644
> > > index 00000000000..fcd140e3584
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
> > > @@ -0,0 +1,3 @@
> > > +#include "../../../gcc.dg/tree-ssa/satd-hadamard.c"
> > > +
> > > +/* { dg-final { scan-assembler-not {\ttbl\t} } } */
> > > diff --git a/gcc/tree-ssa-forwprop.cc b/gcc/tree-ssa-forwprop.cc
> > > index cb1e86c6912..465fd278628 100644
> > > --- a/gcc/tree-ssa-forwprop.cc
> > > +++ b/gcc/tree-ssa-forwprop.cc
> > > @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3.  If not see
> > >  #include "optabs-tree.h"
> > >  #include "insn-config.h"
> > >  #include "recog.h"
> > > +#include "cfgloop.h"
> > > +#include "tree-vectorizer.h"
> > >  #include "tree-vector-builder.h"
> > >  #include "vec-perm-indices.h"
> > >  #include "internal-fn.h"
> > > @@ -184,6 +186,37 @@ along with GCC; see the file COPYING3.  If not see
> > >
> > >     This will (of course) be extended as other needs arise.  */
> > >
> > > +/* Data structure that holds a lane utilization.  */
> > > +typedef hash_map<int_hash <unsigned HOST_WIDE_INT, -1U>,
> > > +                        unsigned HOST_WIDE_INT> lane_map_t;
> > > +
> > > +/* Data structure that contains simplifiable vectorized permute 
> > > sequences.
> > > +     v_in = {e0, e1, e2, e3}
> > > +     v_1 = VEC_PERM <v_in, v_in, v_1_sel>
> > > +     v_2 = VEC_PERM <v_in, v_in, v_2_sel>
> > > +     v_y = v_1 - v_2
> > > +     v_x = v_2 + v_1
> > > +     v_out = VEC_PERM <v_x, v_y, v_out_sel>
> > > +   The sequence is simplifiable because it has the potential to free up
> > > +   half of the utilized lanes.  This allows merging two sequences if
> > > +   they use similar operators.  */
> > > +
> > > +struct _vec_perm_simplify_seq
> > > +{
> > > +  /* Defining stmts of vectors.  */
> > > +  gassign *v_1_stmt;
> > > +  gassign *v_2_stmt;
> > > +  gassign *v_x_stmt;
> > > +  gassign *v_y_stmt;
> > > +  /* Final permute statement.  */
> > > +  gassign *stmt;
> > > +  /* Elements of each vector and selector.  */
> > > +  unsigned HOST_WIDE_INT nelts;
> > > +  /* New selector indices for stmt.  */
> > > +  vec<unsigned HOST_WIDE_INT> new_sel;
> > > +};
> > > +typedef struct _vec_perm_simplify_seq *vec_perm_simplify_seq;
> > > +
> > >  static bool forward_propagate_addr_expr (tree, tree, bool);
> > >
> > >  /* Set to true if we delete dead edges during the optimization.  */
> > > @@ -225,6 +258,26 @@ fwprop_invalidate_lattice (tree name)
> > >      lattice[SSA_NAME_VERSION (name)] = NULL_TREE;
> > >  }
> > >
> > > +/* If NAME is an SSA_NAME defined by an assignment, return that 
> > > assignment.
> > > +   If SINGLE_USE_ONLY is true and NAME has multiple uses, return NULL.  
> > > */
> > > +
> > > +static gassign *
> > > +get_tree_def (tree name, bool single_use_only)
> > > +{
> > > +  if (TREE_CODE (name) != SSA_NAME)
> > > +    return NULL;
> > > +
> > > +  gimple *def_stmt = SSA_NAME_DEF_STMT (name);
> > > +  gassign *assign = dyn_cast <gassign *> (def_stmt);
> > > +
> > > +  if (!assign)
> > > +    return NULL;
> > > +
> > > +  if (single_use_only && !has_single_use (name))
> > > +    return NULL;
> > > +
> > > +  return assign;
> > > +}
> > >
> > >  /* Get the statement we can propagate from into NAME skipping
> > >     trivial copies.  Returns the statement which defines the
> > > @@ -3460,6 +3513,441 @@ fwprop_ssa_val (tree name)
> > >    return name;
> > >  }
> > >
> > > +/* Get an index map from the provided vector permute selector
> > > +   and return the number of unique indices.
> > > +   E.g.: { 1, 3, 1, 3 } -> <0, 1, 0, 1>, 2
> > > +      { 0, 2, 0, 2 } -> <0, 1, 0, 1>, 2
> > > +      { 3, 2, 1, 0 } -> <0, 1, 2, 3>, 4.  */
> > > +
> > > +static unsigned HOST_WIDE_INT
> > > +get_vect_selector_index_map (tree sel, lane_map_t *index_map)
> > > +{
> > > +  gcc_assert (VECTOR_CST_NELTS (sel).is_constant ());
> > > +  unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant ();
> > > +  unsigned HOST_WIDE_INT n = 0;
> > > +
> > > +  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
> > > +    {
> > > +      /* Extract the i-th value from the selector.  */
> > > +      tree sel_cst_tree = VECTOR_CST_ELT (sel, i);
> > > +      unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree);
> > > +
> > > +      unsigned HOST_WIDE_INT j = 0;
> > > +      for (; j <= i; j++)
> > > +     {
> > > +       tree prev_sel_cst_tree = VECTOR_CST_ELT (sel, j);
> > > +       unsigned HOST_WIDE_INT prev_sel_cst
> > > +         = TREE_INT_CST_LOW (prev_sel_cst_tree);
> > > +       if (prev_sel_cst == sel_cst)
> > > +         break;
> > > +     }
> > > +      index_map->put (i, j);
> > > +      n += (i == j) ? 1 : 0;
> > > +    }
> > > +
> > > +  return n;
> > > +}
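The index-map computation above can be sketched outside GCC as follows (a
standalone model, not patch code; `selector_index_map` is a hypothetical
name): each selector lane is mapped to the first lane holding the same value,
and the number of first occurrences is the number of unique lanes.

```c
#include <stddef.h>

/* Map each selector lane to the first lane holding the same value and
   return the number of unique values, mirroring the logic of
   get_vect_selector_index_map.
   E.g. { 1, 3, 1, 3 } -> { 0, 1, 0, 1 }, 2.  */
static unsigned
selector_index_map (const unsigned *sel, unsigned nelts, unsigned *map)
{
  unsigned n = 0;
  for (unsigned i = 0; i < nelts; i++)
    {
      unsigned j = 0;
      while (sel[j] != sel[i])  /* find first occurrence of sel[i] */
        j++;
      map[i] = j;
      if (i == j)               /* first occurrence: a unique value */
        n++;
    }
  return n;
}
```

For { 3, 2, 1, 0 } every lane is its own first occurrence, so the map is
{ 0, 1, 2, 3 } and the count is 4, as in the function's doc comment.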
> > > +
> > > +/* Search for opportunities to free half of the lanes in the following 
> > > pattern:
> > > +
> > > +     v_in = {e0, e1, e2, e3}
> > > +     v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}>
> > > +     // v_1 = {e0, e2, e0, e2}
> > > +     v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}>
> > > +     // v_2 = {e1, e3, e1, e3}
> > > +
> > > +     v_x = v_1 + v_2
> > > +     // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
> > > +     v_y = v_1 - v_2
> > > +     // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}
> > > +
> > > +     v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}>
> > > +     // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
> > > +
> > > +   The last statement could be simplified to:
> > > +     v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}>
> > > +     // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3}
> > > +
> > > +   Characteristic properties:
> > > +   - v_1 and v_2 are created from the same input vector v_in and 
> > > introduce the
> > > +     lane duplication (in the selection operand) that we can eliminate.
> > > +   - v_x and v_y are results from lane-preserving operations that use 
> > > v_1 and
> > > +     v_2 as inputs.
> > > +   - v_out is created by selecting from duplicated lanes.  */
> > > +
> > > +static bool
> > > +recognise_vec_perm_simplify_seq (gassign *stmt, vec_perm_simplify_seq 
> > > *seq)
> > > +{
> > > +  gcc_checking_assert (stmt);
> > > +  gcc_checking_assert (gimple_assign_rhs_code (stmt) == VEC_PERM_EXPR);
> > > +
> > > +  /* Decompose the final vec permute statement.  */
> > > +  tree v_x = gimple_assign_rhs1 (stmt);
> > > +  tree v_y = gimple_assign_rhs2 (stmt);
> > > +  tree sel = gimple_assign_rhs3 (stmt);
> > > +
> > > +  if (!VECTOR_CST_NELTS (sel).is_constant ())
> > > +    return false;
> > > +
> > > +  unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant ();
> > > +
> > > +  /* Lookup the definition of v_x and v_y.  */
> > > +  gassign *v_x_stmt = get_tree_def (v_x, true);
> > > +  gassign *v_y_stmt = get_tree_def (v_y, true);
> > > +  if (!v_x_stmt || !v_y_stmt)
> > > +    return false;
> > > +
> > > +  /* Check the operations that define v_x and v_y.  */
> > > +  if (TREE_CODE_CLASS (gimple_assign_rhs_code (v_x_stmt)) != tcc_binary
> > > +      || TREE_CODE_CLASS (gimple_assign_rhs_code (v_y_stmt)) != 
> > > tcc_binary)
> > > +    return false;
> > > +
> > > +  tree v_x_1 = gimple_assign_rhs1 (v_x_stmt);
> > > +  tree v_x_2 = gimple_assign_rhs2 (v_x_stmt);
> > > +  tree v_y_1 = gimple_assign_rhs1 (v_y_stmt);
> > > +  tree v_y_2 = gimple_assign_rhs2 (v_y_stmt);
> > > +
> > > +  if (v_x_stmt == v_y_stmt
> > > +      || TREE_CODE (v_x_1) != SSA_NAME
> > > +      || TREE_CODE (v_x_2) != SSA_NAME
> > > +      || num_imm_uses (v_x_1) != 2
> > > +      || num_imm_uses (v_x_2) != 2)
> > > +    return false;
> > > +
> > > +  if (v_x_1 != v_y_1 || v_x_2 != v_y_2)
> > > +    {
> > > +      /* Allow operands of commutative operators to swap.  */
> > > +      if (commutative_tree_code (gimple_assign_rhs_code (v_x_stmt)))
> > > +     {
> > > +       /* Keep v_x_1 the first operand for non-commutative operators.  */
> > > +       v_x_1 = gimple_assign_rhs2 (v_x_stmt);
> > > +       v_x_2 = gimple_assign_rhs1 (v_x_stmt);
> > > +       if (v_x_1 != v_y_1 || v_x_2 != v_y_2)
> > > +         return false;
> > > +     }
> > > +      else if (commutative_tree_code (gimple_assign_rhs_code (v_y_stmt)))
> > > +     {
> > > +       if (v_x_1 != v_y_2 || v_x_2 != v_y_1)
> > > +         return false;
> > > +     }
> > > +      else
> > > +     return false;
> > > +    }
> > > +
> > > +  gassign *v_1_stmt = get_tree_def (v_x_1, false);
> > > +  gassign *v_2_stmt = get_tree_def (v_x_2, false);
> > > +  if (!v_1_stmt || !v_2_stmt)
> > > +    return false;
> > > +
> > > +  if (gimple_assign_rhs_code (v_1_stmt) != VEC_PERM_EXPR
> > > +      || gimple_assign_rhs_code (v_2_stmt) != VEC_PERM_EXPR)
> > > +    return false;
> > > +
> > > +  /* Decompose initial VEC_PERM_EXPRs.  */
> > > +  tree v_in = gimple_assign_rhs1 (v_1_stmt);
> > > +  tree v_1_sel = gimple_assign_rhs3 (v_1_stmt);
> > > +  tree v_2_sel = gimple_assign_rhs3 (v_2_stmt);
> > > +  if (v_in != gimple_assign_rhs2 (v_1_stmt)
> > > +      || v_in != gimple_assign_rhs1 (v_2_stmt)
> > > +      || v_in != gimple_assign_rhs2 (v_2_stmt))
> > > +    return false;
> > > +
> > > +  if (!VECTOR_CST_NELTS (v_1_sel).is_constant ()
> > > +      || !VECTOR_CST_NELTS (v_2_sel).is_constant ())
> > > +    return false;
> > > +
> > > +  if (nelts != VECTOR_CST_NELTS (v_1_sel).to_constant ()
> > > +      || nelts != VECTOR_CST_NELTS (v_2_sel).to_constant ())
> > > +    return false;
> > > +
> > > +  /* Now check permutation selection operands.  */
> > > +  lane_map_t v_1_lane_map, v_2_lane_map;
> > > +  unsigned HOST_WIDE_INT v_1_lanes, v_2_lanes;
> > > +  v_1_lanes = get_vect_selector_index_map (v_1_sel, &v_1_lane_map);
> > > +  v_2_lanes = get_vect_selector_index_map (v_2_sel, &v_2_lane_map);
> > > +
> > > +  /* Check if we could free up half of the lanes.  */
> > > +  if (v_1_lanes != v_2_lanes || v_1_lanes > (nelts / 2))
> > > +    return false;
> > > +
> > > +  if (dump_enabled_p ())
> > > +    {
> > > +      fprintf (dump_file, "Found vec perm simplify sequence ending with: 
> > > ");
> > > +      print_gimple_stmt (dump_file, stmt, 0);
> > > +
> > > +      if (dump_flags & TDF_DETAILS)
> > > +     fprintf (dump_file, "\tPossible new selector: { ");
> > > +     }
> > > +
> > > +  *seq = XNEW (struct _vec_perm_simplify_seq);
> > > +  (*seq)->stmt = stmt;
> > > +  (*seq)->v_1_stmt = v_1_stmt;
> > > +  (*seq)->v_2_stmt = v_2_stmt;
> > > +  (*seq)->v_x_stmt = v_x_stmt;
> > > +  (*seq)->v_y_stmt = v_y_stmt;
> > > +  (*seq)->nelts = nelts;
> > > +  (*seq)->new_sel = vNULL;
> > > +
> > > +  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
> > > +    {
> > > +      /* Extract the i-th value from the selector.  */
> > > +      tree sel_cst_tree = VECTOR_CST_ELT (sel, i);
> > > +      unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree);
> > > +
> > > +      unsigned HOST_WIDE_INT j;
> > > +      if (sel_cst < nelts)
> > > +     j = *v_1_lane_map.get (sel_cst);
> > > +      else
> > > +     j = *v_2_lane_map.get (sel_cst - nelts) + nelts;
> > > +
> > > +      (*seq)->new_sel.safe_push (j);
> > > +
> > > +      if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
> > > +     {
> > > +       fprintf (dump_file, HOST_WIDE_INT_PRINT_UNSIGNED, j);
> > > +       if (i != (nelts - 1))
> > > +         fprintf (dump_file, ", ");
> > > +     }
> > > +    }
> > > +
> > > +  if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
> > > +    fprintf (dump_file, " }\n");
> > > +
> > > +  return true;
> > > +}
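The new-selector computation at the end of the function can be modeled the
same way (standalone sketch, not patch code; `compress_selector` is a
hypothetical name): with both index maps being { 0, 1, 0, 1 }, the final
selector { 0, 1, 6, 7 } from the running example compresses to { 0, 1, 4, 5 }.

```c
#include <stddef.h>

/* Rewrite the final VEC_PERM selector through the two lane index maps,
   as recognise_vec_perm_simplify_seq does when filling new_sel: indices
   below nelts go through the v_1 map, the rest through the v_2 map
   (rebased by nelts).  */
static void
compress_selector (const unsigned *sel, unsigned nelts,
                   const unsigned *v_1_map, const unsigned *v_2_map,
                   unsigned *new_sel)
{
  for (unsigned i = 0; i < nelts; i++)
    new_sel[i] = sel[i] < nelts
                 ? v_1_map[sel[i]]
                 : v_2_map[sel[i] - nelts] + nelts;
}
```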
> > > +
> > > +/* Reduce the lane consumption of a simplifiable vec perm sequence.  */
> > > +
> > > +static void
> > > +compress_vec_perm_simplify_seq (vec_perm_simplify_seq seq)
> > > +{
> > > +  unsigned HOST_WIDE_INT nelts = seq->nelts;
> > > +  int i;
> > > +  unsigned HOST_WIDE_INT idx;
> > > +  vec_perm_builder new_sel (nelts, nelts, 1);
> > > +
> > > +  FOR_EACH_VEC_ELT (seq->new_sel, i, idx)
> > > +    new_sel.quick_push (idx);
> > > +
> > > +  gassign *stmt = seq->stmt;
> > > +  if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
> > > +    {
> > > +      fprintf (dump_file, "Updating VEC_PERM statement:\n");
> > > +      fprintf (dump_file, "Old stmt: ");
> > > +      print_gimple_stmt (dump_file, stmt, 0);
> > > +    }
> > > +
> > > +  /* Update the last VEC_PERM statement.  */
> > > +  vec_perm_indices new_indices (new_sel, 2, nelts);
> > > +  tree vectype = TREE_TYPE (gimple_assign_lhs (stmt));
> > > +  tree m = vect_gen_perm_mask_checked (vectype, new_indices);
> > > +  gimple_assign_set_rhs3 (stmt, m);
> > > +  update_stmt (stmt);
> > > +
> > > +  if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
> > > +    {
> > > +      fprintf (dump_file, "New stmt: ");
> > > +      print_gimple_stmt (dump_file, stmt, 0);
> > > +    }
> > > +}
> > > +
> > > +/* Test if we can merge two simplifiable vec permute sequences.  */
> > > +
> > > +static bool
> > > +can_merge_vec_perm_simplify_seqs_p (vec_perm_simplify_seq seq1,
> > > +                                 vec_perm_simplify_seq seq2)
> > > +{
> > > +  unsigned HOST_WIDE_INT nelts = seq1->nelts;
> > > +
> > > +  /* Number of elements must be equal.  */
> > > +  if (seq1->nelts != seq2->nelts)
> > > +    return false;
> > > +
> > > +  /* We need vectors of the same type.  */
> > > +  if (TREE_TYPE (gimple_assign_lhs (seq1->stmt))
> > > +      != TREE_TYPE (gimple_assign_lhs (seq2->stmt)))
> > > +    return false;
> > > +
> > > +  /* We require isomorphic operators.  */
> > > +  if (((gimple_assign_rhs_code (seq1->v_x_stmt)
> > > +     != gimple_assign_rhs_code (seq2->v_x_stmt))
> > > +       || (gimple_assign_rhs_code (seq1->v_y_stmt)
> > > +        != gimple_assign_rhs_code (seq2->v_y_stmt))))
> > > +    return false;
> > > +
> > > +  /* We require equal selectors for v_1_stmt and v_2_stmt.
> > > +     This ensures we don't introduce a complex permutation
> > > +     in the merged sequence.  */
> > > +  tree seq1_v_1_sel = gimple_assign_rhs3 (seq1->v_1_stmt);
> > > +  tree seq1_v_2_sel = gimple_assign_rhs3 (seq1->v_2_stmt);
> > > +  tree seq2_v_1_sel = gimple_assign_rhs3 (seq2->v_1_stmt);
> > > +  tree seq2_v_2_sel = gimple_assign_rhs3 (seq2->v_2_stmt);
> > > +  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
> > > +    {
> > > +      if ((TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_1_sel, i))
> > > +        != TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_1_sel, i)))
> > > +       || (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_2_sel, i))
> > > +           != (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_2_sel, i)))))
> > > +     return false;
> > > +    }
> > > +
> > > +  /* We cannot have any dependencies between the sequences.
> > > +     Therefore, v_in of both sequences must be defined before
> > > +     v_1_stmt and v_2_stmt.  */
> > > +  tree seq1_v_in = gimple_assign_rhs1 (seq1->v_1_stmt);
> > > +  gassign *seq1_v_in_stmt = get_tree_def (seq1_v_in, false);
> > > +  tree seq2_v_in = gimple_assign_rhs1 (seq2->v_1_stmt);
> > > +  gassign *seq2_v_in_stmt = get_tree_def (seq2_v_in, false);
> > > +
> > > +  /* Don't merge if either v_in is not created by a statement.  */
> > > +  if (!seq1_v_in_stmt || !seq2_v_in_stmt)
> > > +    return false;
> > > +
> > > +  /* Ensure that seq1_v_in is defined before seq2_v_1.  */
> > > +  gimple_stmt_iterator gsi_lookup = gsi_for_stmt (seq1_v_in_stmt);
> > > +  bool found = false;
> > > +  while (!gsi_end_p (gsi_lookup))
> > > +    {
> > > +      gimple *s = gsi_stmt (gsi_lookup);
> > > +      if (s == seq2_v_in_stmt || s == seq2->v_1_stmt)
> > > +     {
> > > +       found = true;
> > > +       break;
> > > +     }
> > > +      gsi_next (&gsi_lookup);
> >
> > Ick.  I think we should simply assign gimple_uid () during the stmt
> > walk of forwprop instead of linearly walking stmts.
>
> Will do.
>
> > I also think we want to avoid combining across any stmts with
> > side-effects (calls, asm(), I guess even stores?), which we can do ...
>
> Will do...
>
> >
> > > +    }
> > > +
> > > +  if (!found)
> > > +    return false;
> > > +
> > > +  /* Assure that seq2_v_in is defined before seq1_v_1.  */
> > > +  gsi_lookup = gsi_for_stmt (seq2_v_in_stmt);
> > > +  found = false;
> > > +  while (!gsi_end_p (gsi_lookup))
> > > +    {
> > > +      gimple *s = gsi_stmt (gsi_lookup);
> > > +      if (s == seq1_v_in_stmt || s == seq1->v_1_stmt)
> > > +     {
> > > +       found = true;
> > > +       break;
> > > +     }
> > > +      gsi_next (&gsi_lookup);
> > > +    }
> > > +
> > > +  if (!found)
> > > +    return false;
> > > +
> > > +  if (dump_enabled_p ())
> > > +    fprintf (dump_file, "Found vec perm simplify sequence pair.\n");
> > > +
> > > +  return true;
> > > +}
> > > +
> > > +/* Merge the two given simplifiable vec permute sequences.  */
> > > +
> > > +static void
> > > +merge_vec_perm_simplify_seqs (vec_perm_simplify_seq seq1,
> > > +                           vec_perm_simplify_seq seq2)
> > > +{
> > > +  unsigned HOST_WIDE_INT nelts = seq1->nelts;
> > > +
> > > +  /* Calculate the lanes that the first permute needs.  */
> > > +  tree mask1 = gimple_assign_rhs3 (seq1->stmt);
> > > +  lane_map_t mask1_lanes;
> > > +  unsigned HOST_WIDE_INT i;
> > > +  for (i = 0; i < nelts; i++)
> > > +    {
> > > +      tree val = VECTOR_CST_ELT (mask1, i);
> > > +      unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts;
> > > +      mask1_lanes.put (j, j);
> > > +    }
> > > +
> > > +  /* Calculate the lanes that the second permute needs and rewrite
> > > +     them to use the lanes that are unused from the first permute.  */
> > > +  tree mask2 = gimple_assign_rhs3 (seq2->stmt);
> > > +  lane_map_t mask2_lanes;
> > > +  unsigned HOST_WIDE_INT lane_gen = 0;
> > > +  for (i = 0; i < nelts; i++)
> > > +    {
> > > +      tree val = VECTOR_CST_ELT (mask2, i);
> > > +      unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts;
> > > +      bool existed;
> > > +      unsigned HOST_WIDE_INT &rewrite_lane
> > > +     = mask2_lanes.get_or_insert (j, &existed);
> > > +      if (existed)
> > > +     continue;
> > > +
> > > +      while (mask1_lanes.get (lane_gen))
> > > +     lane_gen++;
> > > +      rewrite_lane = lane_gen;
> > > +      lane_gen++;
> > > +    }
> > > +
> > > +  /* Ensure that the unused lanes of the first permute sufficed,
> > > +     i.e. we don't need more than one permute.  */
> > > +  gcc_assert (lane_gen <= nelts);
> > > +
> > > +  /* Calculate the permutations based on MASK1_LANES/MASK2_LANES.  */
> > > +  vec_perm_builder seq2_stmt_sel (nelts, nelts, 1);
> > > +  vec_perm_builder seq1_v_1_stmt_sel (nelts, nelts, 1);
> > > +  vec_perm_builder seq1_v_2_stmt_sel (nelts, nelts, 1);
> > > +  for (i = 0; i < nelts; i++)
> > > +    {
> > > +      /* Calculate new mask for STMT2.  */
> > > +      tree val = VECTOR_CST_ELT (mask2, i);
> > > +      unsigned HOST_WIDE_INT lane
> > > +     = TREE_INT_CST_LOW (val) & (2 * nelts - 1);
> > > +      unsigned off = (lane < nelts) ? 0 : nelts;
> > > +      unsigned HOST_WIDE_INT new_lane
> > > +     = *mask2_lanes.get (lane - off) + off;
> > > +      seq2_stmt_sel.quick_push (new_lane);
> > > +
> > > +      /* Calculate new masks for SRC1_1 and SRC1_2.  */
> > > +      bool use_src1 = mask1_lanes.get (i);
> > > +      tree mask1 = gimple_assign_rhs3 (use_src1 ? seq1->v_1_stmt
> > > +                                             : seq2->v_1_stmt);
> > > +      tree mask2 = gimple_assign_rhs3 (use_src1 ? seq1->v_2_stmt
> > > +                                             : seq2->v_2_stmt);
> > > +      unsigned HOST_WIDE_INT lane1
> > > +     = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask1, i)) % nelts;
> > > +      unsigned HOST_WIDE_INT lane2
> > > +     = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask2, i)) % nelts;
> > > +      seq1_v_1_stmt_sel.quick_push (lane1 + (use_src1 ? 0 : nelts));
> > > +      seq1_v_2_stmt_sel.quick_push (lane2 + (use_src1 ? 0 : nelts));
> > > +    }
> > > +
> > > +  /* We don't need to adjust seq1->stmt because its lanes consumption
> > > +     was already tightened before entering this function.  */
> > > +
> > > +  /* Adjust seq2->stmt: copy RHS1/RHS2 from seq1->stmt and set new sel.  */
> > > +  gimple_assign_set_rhs1 (seq2->stmt, gimple_assign_rhs1 (seq1->stmt));
> > > +  gimple_assign_set_rhs2 (seq2->stmt, gimple_assign_rhs2 (seq1->stmt));
> > > +  vec_perm_indices seq2_stmt_indices (seq2_stmt_sel, 2, nelts);
> > > +  tree vectype = TREE_TYPE (gimple_assign_lhs (seq2->stmt));
> > > +  tree mask = vect_gen_perm_mask_checked (vectype, seq2_stmt_indices);
> > > +  gimple_assign_set_rhs3 (seq2->stmt, mask);
> > > +  update_stmt (seq2->stmt);
> > > +
> > > +  /* Adjust seq1->v_1_stmt: copy RHS2 from seq2->v_1_stmt and set new sel.  */
> > > +  gimple_assign_set_rhs2 (seq1->v_1_stmt, gimple_assign_rhs1 (seq2->v_1_stmt));
> > > +  vec_perm_indices seq1_v_1_stmt_indices (seq1_v_1_stmt_sel, 2, nelts);
> > > +  mask = vect_gen_perm_mask_checked (vectype, seq1_v_1_stmt_indices);
> > > +  gimple_assign_set_rhs3 (seq1->v_1_stmt, mask);
> > > +  update_stmt (seq1->v_1_stmt);
> > > +
> > > +  /* Adjust seq1->v_2_stmt: copy RHS2 from seq2->v_2_stmt and set new sel.  */
> > > +  gimple_assign_set_rhs2 (seq1->v_2_stmt, gimple_assign_rhs1 (seq2->v_2_stmt));
> > > +  vec_perm_indices seq1_v_2_stmt_indices (seq1_v_2_stmt_sel, 2, nelts);
> > > +  mask = vect_gen_perm_mask_checked (vectype, seq1_v_2_stmt_indices);
> > > +  gimple_assign_set_rhs3 (seq1->v_2_stmt, mask);
> > > +  update_stmt (seq1->v_2_stmt);
> > > +
> > > +  /* At this point, we have four unmodified seq2 stmts, which will be
> > > +     eliminated by DCE.  */
> > > +
> > > +  if (dump_enabled_p ())
> > > +    fprintf (dump_file, "Vec perm simplify sequences have been merged.\n\n");
> > > +}
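
For reference, the lane remapping in merge_vec_perm_simplify_seqs can be
modeled stand-alone like this (plain C++ with STL containers, not GCC
internals; names like remap_mask2 are made up for illustration):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Stand-alone model of the lane remapping above: the lanes needed by
// mask2 are rewritten onto the lanes that mask1 leaves unused, so both
// final permutes can share one (narrowed) intermediate computation.
std::vector<unsigned> remap_mask2 (const std::vector<unsigned> &mask1,
                                   const std::vector<unsigned> &mask2,
                                   unsigned nelts)
{
  // Lanes that the first permute still needs (modulo nelts, since a
  // VEC_PERM selector indexes the concatenation of both operands).
  std::set<unsigned> mask1_lanes;
  for (unsigned m : mask1)
    mask1_lanes.insert (m % nelts);

  // For each distinct lane of mask2, pick the next free lane.
  std::map<unsigned, unsigned> rewrite;  // old lane -> free lane
  unsigned lane_gen = 0;
  for (unsigned m : mask2)
    {
      unsigned j = m % nelts;
      if (rewrite.count (j))
        continue;
      while (mask1_lanes.count (lane_gen))
        lane_gen++;
      rewrite[j] = lane_gen++;
    }

  // Rebuild mask2, preserving which operand each element selects.
  std::vector<unsigned> new_mask2;
  for (unsigned m : mask2)
    {
      unsigned lane = m & (2 * nelts - 1);
      unsigned off = lane < nelts ? 0 : nelts;
      new_mask2.push_back (rewrite[lane - off] + off);
    }
  return new_mask2;
}
```

E.g. with nelts = 4 and both (already narrowed) masks being {0, 1, 4, 5},
the second sequence is moved to the free lanes 2 and 3, giving {2, 3, 6, 7}.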
> > > +
> > >  /* Main entry point for the forward propagation and statement combine
> > >     optimizer.  */
> > >
> > > @@ -3532,6 +4020,7 @@ pass_forwprop::execute (function *fun)
> > >        basic_block bb = BASIC_BLOCK_FOR_FN (fun, postorder[i]);
> > >        edge_iterator ei;
> > >        edge e;
> > > +      vec<vec_perm_simplify_seq> vec_perm_simplify_seq_list = vNULL;
> > >
> > >        /* Skip processing not executable blocks.  We could improve
> > >        single_use tracking by at least unlinking uses from unreachable
> > > @@ -3901,10 +4390,59 @@ pass_forwprop::execute (function *fun)
> > >             else
> > >               gsi_next (&gsi);
> > >           }
> > > +       else if (code == VEC_PERM_EXPR)
> > > +         {
> > > +           /* Find vectorized sequences, where we can reduce
> > > +              the lane utilization.  This compression will be
> > > +              done later, if we find a matching pair of
> > > +              sequences.  */
> > > +           gassign *assign = dyn_cast <gassign *> (stmt);
> > > +           vec_perm_simplify_seq seq;
> > > +           if (recognise_vec_perm_simplify_seq (assign, &seq))
> > > +             vec_perm_simplify_seq_list.safe_push (seq);
> > > +
> > > +           gsi_next (&gsi);
> > > +       }
> > >         else
> > >           gsi_next (&gsi);
> > >       }
> >
> > ... in the above loop by combining/truncating vec_perm_simplify_seq_list
> > when we hit such a "blockage" stmt?  I'll note the pattern matching
> > in recognise_vec_perm_simplify_seq isn't protected against leaving
> > the basic-block or crossing such "blockage" stmt.
>
> This means that the merging/blending cannot be on BB level,
> but must be triggered whenever a "blockage" stmt is found
> (otherwise we would end up with multiple lists per BB).
> Thanks for this idea!
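>
> In pseudo-code (plain C++, not GIMPLE; Stmt/walk_block are made-up
> stand-ins), the per-BB walk with flush-on-blockage would look like:
>
> ```cpp
> #include <vector>
>
> // Made-up stand-in for a GIMPLE statement: a "blockage" is a stmt
> // with side-effects (call, asm, store) that merges must not cross.
> struct Stmt { bool is_blockage; bool is_candidate; int id; };
>
> int flushes;  // counts how often the candidate list is processed
>
> // Stand-in for the pairwise merge step over collected sequences.
> void process_candidates (std::vector<int> &cands)
> {
>   if (!cands.empty ())
>     flushes++;
>   cands.clear ();
> }
>
> void walk_block (const std::vector<Stmt> &bb)
> {
>   std::vector<int> cands;
>   for (const Stmt &s : bb)
>     {
>       if (s.is_blockage)
>         {
>           // Merging may not cross this stmt: handle what we have.
>           process_candidates (cands);
>           continue;
>         }
>       if (s.is_candidate)
>         cands.push_back (s.id);
>     }
>   process_candidates (cands);  // end of basic block
> }
> ```
>
> That way there is never more than one live candidate list, and no
> merge can span a stmt with side-effects.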
>
> >
> > > +      /* If vec perm simplify sequences are found, then try to merge them.  */
> > > +      if (dump_enabled_p () && !vec_perm_simplify_seq_list.is_empty ())
> > > +     {
> > > +       fprintf (dump_file, "Found %u vec perm simplify sequences.  "
> > > +                "Trying to merge them.\n\n",
> > > +                vec_perm_simplify_seq_list.length ());
> > > +     }
> > > +
> > > +      while (!vec_perm_simplify_seq_list.is_empty ())
> > > +     {
> > > +       vec_perm_simplify_seq seq1 = vec_perm_simplify_seq_list[0];
> > > +       vec_perm_simplify_seq_list.ordered_remove (0);
> >
> > ordered_remove is expensive, I wonder if this can be improved.
> > Note I think we want an upper limit on the size of the vector
> > anyway, either with a --param or a hard-coded number.  I'd simply
> > do this processing once we hit the limit and start over, even when
> > the basic-block isn't finished yet.
>
> So, using an array of pointers to vec_perm_simplify_seq is preferred.
> While it is possible to exploit this limit in an artificial test case,
> I haven't seen real-world code where this would be the case.
> I will use an array with a hard-coded maximum of 8 sequences.

I misunderstood you here yesterday.
So, instead of using the O(n) methods of vec, I now allocate the vec
with a fixed size with
  auto_vec<vec_perm_simplify_seq, 8> vec_perm_simplify_seq_list;
and then only use quick_push(), unordered_remove() and truncate(0).

Revised patch (including fixes for all mentioned items) can be found here:
  https://gcc.gnu.org/pipermail/gcc-patches/2024-November/669020.html

Sorry that this comes so late.
dump_enabled_p () wanted to teach me a lesson (that it does not replace a
test for dump_file unless one wants to debug interesting effects).

Thanks,
Christoph

>
> I will also switch the naming a bit to match your words:
> * lane compression -> narrowing
> * sequences merging -> blending
>
> Thanks!
>
> >
> > > +
> > > +       int i;
> > > +       vec_perm_simplify_seq seq2;
> > > +       FOR_EACH_VEC_ELT (vec_perm_simplify_seq_list, i, seq2)
> > > +         {
> > > +           if (can_merge_vec_perm_simplify_seqs_p (seq1, seq2))
> > > +             {
> > > +               vec_perm_simplify_seq_list.ordered_remove (i);
> > > +               compress_vec_perm_simplify_seq (seq1);
> > > +               compress_vec_perm_simplify_seq (seq2);
> > > +
> > > +               merge_vec_perm_simplify_seqs (seq1, seq2);
> > > +
> > > +               seq2->new_sel.release ();
> > > +               XDELETE (seq2);
> > > +               break;
> > > +             }
> > > +         }
> > > +
> > > +       seq1->new_sel.release ();
> > > +       XDELETE (seq1);
> > > +     }
> > > +      vec_perm_simplify_seq_list.release ();
> > > +
> > >        /* Combine stmts with the stmts defining their operands.
> > >        Note we update GSI within the loop as necessary.  */
> > >        for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > >
> >
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
