On Fri, Nov 15, 2024 at 1:30 PM Christoph Müllner <christoph.muell...@vrull.eu> wrote:
>
> This extends forwprop by yet another VEC_PERM optimization:
> It attempts to blend two isomorphic vector sequences by using the
> redundancy in the lane utilization of these sequences.
> This redundancy in lane utilization comes from the way specific
> scalar statements end up vectorized: two VEC_PERMs on top, binary
> operations on both of them, and a final VEC_PERM to create the result.
> Here is an example of this sequence:
>
> v_in = {e0, e1, e2, e3}
> v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}>
> // v_1 = {e0, e2, e0, e2}
> v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}>
> // v_2 = {e1, e3, e1, e3}
>
> v_x = v_1 + v_2
> // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
> v_y = v_1 - v_2
> // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}
>
> v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}>
> // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
>
> To remove the redundancy, lanes 2 and 3 can be freed, which allows us
> to change the last statement into:
> v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}>
> // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3}
>
> The cost of eliminating the redundancy in the lane utilization is that
> lowering the VEC_PERM expression could get more expensive because of
> tighter packing of the lanes.  Therefore this optimization is not done
> in isolation, but only in case we identify two such sequences that can
> be blended.
>
> Once all candidate sequences have been identified, we try to blend them,
> so that we can use the freed lanes for the second sequence.
> On success we convert 2x (2x VEC_PERM + 2x BINOP + 1x VEC_PERM) into
> 2x VEC_PERM + 2x BINOP + 2x VEC_PERM, i.e. 6x VEC_PERM + 4x BINOP are
> traded for 4x VEC_PERM + 2x BINOP.
>
> The implemented transformation reuses (rewrites) the statements
> of the first sequence and the last VEC_PERM of the second sequence.
> The remaining four statements of the second sequence are left untouched
> and will be eliminated by DCE later.
>
> This targets x264_pixel_satd_8x4, which calculates the sum of absolute
> transformed differences (SATD) using a Hadamard transformation.
> We have seen an 8% speedup on SPEC's x264 on a 5950X (x86-64) and a 7%
> speedup on an AArch64 machine.
>
> Bootstrapped and reg-tested on x86-64 and AArch64 (all languages).
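To make the payoff concrete (this is my reading of the transformation,
not the pass's literal output): if a second isomorphic sequence working
on another input vector w_in = {f0, f1, f2, f3} is found, the blended
form roughly becomes:

  v_1' = VEC_PERM <v_in, w_in, {0, 2, 4, 6}>
  // v_1' = {e0, e2, f0, f2}
  v_2' = VEC_PERM <v_in, w_in, {1, 3, 5, 7}>
  // v_2' = {e1, e3, f1, f3}
  v_x' = v_1' + v_2'
  // v_x' = {e0+e1, e2+e3, f0+f1, f2+f3}
  v_y' = v_1' - v_2'
  // v_y' = {e0-e1, e2-e3, f0-f1, f2-f3}
  v_out_1 = VEC_PERM <v_x', v_y', {0, 1, 4, 5}>
  // v_out_1 = {e0+e1, e2+e3, e0-e1, e2-e3}
  v_out_2 = VEC_PERM <v_x', v_y', {2, 3, 6, 7}>
  // v_out_2 = {f0+f1, f2+f3, f0-f1, f2-f3}

so a single pair of binary operations now feeds both results, which also
matches the "{ 2, 3, 6, 7 }" selector the new tests scan for.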
This is more of a style issue than a correctness issue, but:

+  if (!l->length ())
+    return;

Maybe should be:

  if (l->is_empty ())
    return;

which makes it more obvious what is going on.

Also this:

+  FOR_EACH_VEC_ELT (seq->new_sel, i, idx)
+    new_sel.quick_push (idx);

could just be:

  for (auto idx : seq->new_sel)
    new_sel.quick_push (idx);

Using a C++11 range-based for in this case makes it easier to understand
what is happening.

Thanks,
Andrew Pinski

>
> gcc/ChangeLog:
>
>         * tree-ssa-forwprop.cc (struct _vec_perm_simplify_seq): New data
>         structure to store analysis results of a vec perm simplify sequence.
>         (get_tree_def): New helper to get the defining statement of an
>         SSA_NAME.
>         (get_vect_selector_index_map): Helper to get an index map from the
>         provided vector permute selector.
>         (recognise_vec_perm_simplify_seq): Helper to recognise a
>         vec perm simplify sequence.
>         (narrow_vec_perm_simplify_seq): Helper to pack the lanes more
>         tightly.
>         (can_blend_vec_perm_simplify_seqs_p): Test if two vec perm
>         sequences can be blended.
>         (blend_vec_perm_simplify_seqs): Helper to blend two vec perm
>         simplify sequences.
>         (process_vec_perm_simplify_seq_list): Helper to process a list
>         of vec perm simplify sequences.
>         (append_vec_perm_simplify_seq_list): Helper to add a vec perm
>         simplify sequence to the list.
>         (pass_forwprop::execute): Integrate new functionality.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/tree-ssa/satd-hadamard.c: New test.
>         * gcc.dg/tree-ssa/vector-10.c: New test.
>         * gcc.dg/tree-ssa/vector-8.c: New test.
>         * gcc.dg/tree-ssa/vector-9.c: New test.
>         * gcc.target/aarch64/sve/satd-hadamard.c: New test.
>
> Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
> ---
> Changes from v3:
> * Fix the test condition for writing to the dump file.
> * Use gimple UIDs instead of expensive walks for comparing ordering.
> * Ensure we do not blend across assignments to SSA_NAMEs.
> * Restrict the list to a fixed-size vector with 8 entries.
> * Remove calls to expensive vec methods by restructuring the code.
> * Improved wording.
>
> Changes from v2:
> * Moved code to tree-ssa-forwprop.cc, where similar VEC_PERM
>   optimizations are implemented.
> * Test operand order less strictly in case of commutative operators.
> * Made naming more consistent.
> * Added a test for dependencies between two sequences.
> * Removed the instruction reordering (not necessary without dependencies).
> * Added tests based on __builtin_shuffle ().
>
> Changes from v1:
> * Moved code from tree-vect-slp.cc into a new pass (from where it could
>   be moved elsewhere).
> * Only deduplicate lanes if sequences will be merged later on.
> * Split the functionality more strictly into analysis and transformation
>   parts.
>
> Manolis Tsamis was the patch's initial author before I took it over.
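A side note for readers skimming the new helpers below:
get_vect_selector_index_map canonicalises a permute selector by mapping
each element to the position of its first occurrence and counting the
distinct elements.  A standalone sketch of that logic (my illustration,
not patch code):

  /* Map each selector element to the position of its first occurrence;
     return the number of distinct elements.  Mirrors the examples in the
     helper's comment: { 1, 3, 1, 3 } -> { 0, 1, 0, 1 }, 2.  */
  #include <stdio.h>

  static unsigned
  selector_index_map (const unsigned *sel, unsigned nelts, unsigned *map)
  {
    unsigned n = 0;
    for (unsigned i = 0; i < nelts; i++)
      {
        unsigned j = 0;
        while (j < i && sel[j] != sel[i])
          j++;
        map[i] = j;        /* First occurrence of sel[i].  */
        n += (i == j);     /* Count first-time elements.  */
      }
    return n;
  }

  int
  main (void)
  {
    unsigned sel[4] = { 1, 3, 1, 3 };
    unsigned map[4];
    unsigned n = selector_index_map (sel, 4, map);
    printf ("{ %u, %u, %u, %u }, %u distinct\n",
            map[0], map[1], map[2], map[3], n);  /* { 0, 1, 0, 1 }, 2  */
    return 0;
  }

A sequence is a simplification candidate when both input selectors use
the same number of distinct lanes and at most nelts / 2 of them, which
is exactly what this count is used to check.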
>
>  gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c |  43 ++
>  gcc/testsuite/gcc.dg/tree-ssa/vector-10.c     | 122 ++++
>  gcc/testsuite/gcc.dg/tree-ssa/vector-8.c      |  34 ++
>  gcc/testsuite/gcc.dg/tree-ssa/vector-9.c      |  34 ++
>  .../gcc.target/aarch64/sve/satd-hadamard.c    |   3 +
>  gcc/tree-ssa-forwprop.cc                      | 550 ++++++++++++++++++
>  6 files changed, 786 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
> new file mode 100644
> index 00000000000..576ef01628c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
> @@ -0,0 +1,43 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -fdump-tree-forwprop4-details" } */
> +
> +#include <stdint.h>
> +
> +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
> +  int t0 = s0 + s1;\
> +  int t1 = s0 - s1;\
> +  int t2 = s2 + s3;\
> +  int t3 = s2 - s3;\
> +  d0 = t0 + t2;\
> +  d1 = t1 + t3;\
> +  d2 = t0 - t2;\
> +  d3 = t1 - t3;\
> +}
> +
> +int
> +x264_pixel_satd_8x4_simplified (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
> +{
> +  uint32_t tmp[4][4];
> +  uint32_t a0, a1, a2, a3;
> +  int sum = 0;
> +  int i;
> +
> +  for (i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
> +    {
> +      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
> +      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
> +      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
> +      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
> +      HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3);
> +    }
> +
> +  for (i = 0; i < 4; i++)
> +    {
> +      HADAMARD4(a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]);
> +      sum += a0 + a1 + a2 + a3;
> +    }
> +
> +  return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
> +}
> +
> +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop4" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
> new file mode 100644
> index 00000000000..d5caebdf174
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
> @@ -0,0 +1,122 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
> +
> +typedef int vec __attribute__((vector_size (4 * sizeof (int))));
> +
> +void f1 (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2)
> +{
> +  vec sel0 = { 0, 2, 0, 2 };
> +  vec sel1 = { 1, 3, 1, 3 };
> +  vec sel = { 0, 1, 6, 7 };
> +  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
> +  vec v_in_1 = *p_v_in_1;
> +  vec v_in_2;
> +
> +  /* First vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
> +  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
> +  v_x = v_1 + v_2;
> +  v_y = v_1 - v_2;
> +  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
> +
> +  /* Won't blend because v_in_2 is defined after v_1 above.  */
> +  v_in_2 = *p_v_in_2;
> +  /* Second vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
> +  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
> +  v_x = v_1 + v_2;
> +  v_y = v_1 - v_2;
> +  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
> +
> +  *p_v_out_1 = v_out_1;
> +  *p_v_out_2 = v_out_2;
> +}
> +
> +void f2 (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2)
> +{
> +  vec sel0 = { 0, 2, 0, 2 };
> +  vec sel1 = { 1, 3, 1, 3 };
> +  vec sel = { 0, 1, 6, 7 };
> +  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
> +  vec v_in_1 = *p_v_in_1;
> +  vec v_in_2 = *p_v_in_2;
> +
> +  /* First vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
> +  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
> +  v_x = v_1 + v_2;
> +  v_y = v_1 - v_2;
> +  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
> +  /* Won't blend because of this store between the sequences.  */
> +  *p_v_out_1 = v_out_1;
> +
> +  /* Second vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
> +  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
> +  v_x = v_1 + v_2;
> +  v_y = v_1 - v_2;
> +  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
> +
> +  *p_v_out_2 = v_out_2;
> +}
> +
> +void f3 (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2)
> +{
> +  vec sel0 = { 0, 2, 0, 2 };
> +  vec sel1 = { 1, 3, 1, 3 };
> +  vec sel = { 0, 1, 6, 7 };
> +  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
> +  vec v_in_1 = *p_v_in_1;
> +  vec v_in_2 = *p_v_in_2;
> +
> +  /* First vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
> +  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
> +  v_x = v_1 + v_2;
> +  v_y = v_1 - v_2;
> +  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
> +
> +  /* Second vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
> +  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
> +  v_x = v_1 + v_2;
> +  /* Won't blend because v_2 is RHS1 here.  */
> +  v_y = v_2 - v_1;
> +  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
> +
> +  *p_v_out_1 = v_out_1;
> +  *p_v_out_2 = v_out_2;
> +}
> +
> +extern vec foo (vec v);
> +void f4 (vec *p_v_in_1, vec *p_v_out_1, vec *p_v_out_2)
> +{
> +  vec sel0 = { 0, 2, 0, 2 };
> +  vec sel1 = { 1, 3, 1, 3 };
> +  vec sel = { 0, 1, 6, 7 };
> +  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
> +  vec v_in_1 = *p_v_in_1;
> +  vec v_in_2;
> +
> +  /* First vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
> +  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
> +  v_x = v_1 + v_2;
> +  v_y = v_1 - v_2;
> +  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
> +
> +  /* Won't merge because of dependency.  */
> +  v_in_2 = foo (v_out_1);
> +
> +  /* Second vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
> +  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
> +  v_x = v_1 + v_2;
> +  v_y = v_1 - v_2;
> +  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
> +
> +  *p_v_out_1 = v_out_1;
> +  *p_v_out_2 = v_out_2;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Vec perm simplify sequences have been blended" "forwprop1" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
> new file mode 100644
> index 00000000000..bc2269065e4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
> @@ -0,0 +1,34 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
> +
> +typedef int vec __attribute__((vector_size (4 * sizeof (int))));
> +
> +void f (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2)
> +{
> +  vec sel0 = { 0, 2, 0, 2 };
> +  vec sel1 = { 1, 3, 1, 3 };
> +  vec sel = { 0, 1, 6, 7 };
> +  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
> +  vec v_in_1 = *p_v_in_1;
> +  vec v_in_2 = *p_v_in_2;
> +
> +  /* First vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
> +  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
> +  v_x = v_1 + v_2;
> +  v_y = v_1 - v_2;
> +  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
> +
> +  /* Second vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
> +  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
> +  v_x = v_1 + v_2;
> +  v_y = v_1 - v_2;
> +  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
> +
> +  *p_v_out_1 = v_out_1;
> +  *p_v_out_2 = v_out_2;
> +}
> +
> +/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been blended" "forwprop1" } } */
> +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop1" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
> new file mode 100644
> index 00000000000..e5f898e0281
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
> @@ -0,0 +1,34 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
> +
> +typedef int vec __attribute__((vector_size (4 * sizeof (int))));
> +
> +void f (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2)
> +{
> +  vec sel0 = { 0, 2, 0, 2 };
> +  vec sel1 = { 1, 3, 1, 3 };
> +  vec sel = { 0, 1, 6, 7 };
> +  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
> +  vec v_in_1 = *p_v_in_1;
> +  vec v_in_2 = *p_v_in_2;
> +
> +  /* First vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
> +  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
> +  v_x = v_1 * v_2;
> +  v_y = v_2 - v_1;
> +  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
> +
> +  /* Second vec perm sequence.  */
> +  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
> +  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
> +  v_x = v_2 * v_1;
> +  v_y = v_2 - v_1;
> +  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
> +
> +  *p_v_out_1 = v_out_1;
> +  *p_v_out_2 = v_out_2;
> +}
> +
> +/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been blended" "forwprop1" } } */
> +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop1" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
> new file mode 100644
> index 00000000000..fcd140e3584
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
> @@ -0,0 +1,3 @@
> +#include "../../../gcc.dg/tree-ssa/satd-hadamard.c"
> +
> +/* { dg-final { scan-assembler-not {\ttbl\t} } } */
> diff --git a/gcc/tree-ssa-forwprop.cc b/gcc/tree-ssa-forwprop.cc
> index 5c690a2b03e..1c755f89c29 100644
> --- a/gcc/tree-ssa-forwprop.cc
> +++ b/gcc/tree-ssa-forwprop.cc
> @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "optabs-tree.h"
>  #include "insn-config.h"
>  #include "recog.h"
> +#include "cfgloop.h"
> +#include "tree-vectorizer.h"
>  #include "tree-vector-builder.h"
>  #include "vec-perm-indices.h"
>  #include "internal-fn.h"
> @@ -184,6 +186,29 @@ along with GCC; see the file COPYING3.  If not see
>
>     This will (of course) be extended as other needs arise.  */
>
> +/* Data structure that holds a lane utilization.  */
> +typedef hash_map<int_hash <unsigned HOST_WIDE_INT, -1U>,
> +                 unsigned HOST_WIDE_INT> lane_map_t;
> +
> +/* Data structure that contains simplifiable vectorized permute sequences.
> +   See recognise_vec_perm_simplify_seq () for a description of the
> +   sequence.  */
> +
> +struct _vec_perm_simplify_seq
> +{
> +  /* Defining stmts of vectors in the sequence.  */
> +  gassign *v_1_stmt;
> +  gassign *v_2_stmt;
> +  gassign *v_x_stmt;
> +  gassign *v_y_stmt;
> +  /* Final permute statement.  */
> +  gassign *stmt;
> +  /* New selector indices for stmt.  */
> +  vec<unsigned HOST_WIDE_INT> new_sel;
> +  /* Elements of each vector and selector.  */
> +  unsigned HOST_WIDE_INT nelts;
> +};
> +typedef struct _vec_perm_simplify_seq *vec_perm_simplify_seq;
> +
>  static bool forward_propagate_addr_expr (tree, tree, bool);
>
>  /* Set to true if we delete dead edges during the optimization.  */
> @@ -225,6 +250,26 @@ fwprop_invalidate_lattice (tree name)
>      lattice[SSA_NAME_VERSION (name)] = NULL_TREE;
>  }
>
> +/* If NAME is an SSA_NAME defined by an assignment, return that assignment.
> +   If SINGLE_USE_ONLY is true and NAME has multiple uses, return NULL.  */
> +
> +static gassign *
> +get_tree_def (tree name, bool single_use_only)
> +{
> +  if (TREE_CODE (name) != SSA_NAME)
> +    return NULL;
> +
> +  gimple *def_stmt = SSA_NAME_DEF_STMT (name);
> +  gassign *assign = dyn_cast <gassign *> (def_stmt);
> +
> +  if (!assign)
> +    return NULL;
> +
> +  if (single_use_only && !has_single_use (name))
> +    return NULL;
> +
> +  return assign;
> +}
>
>  /* Get the statement we can propagate from into NAME skipping
>     trivial copies.  Returns the statement which defines the
> @@ -3460,6 +3505,489 @@ fwprop_ssa_val (tree name)
>    return name;
>  }
>
> +/* Get an index map from the provided vector permute selector
> +   and return the number of unique indices.
> +   E.g.: { 1, 3, 1, 3 } -> <0, 1, 0, 1>, 2
> +         { 0, 2, 0, 2 } -> <0, 1, 0, 1>, 2
> +         { 3, 2, 1, 0 } -> <0, 1, 2, 3>, 4.  */
> +
> +static unsigned HOST_WIDE_INT
> +get_vect_selector_index_map (tree sel, lane_map_t *index_map)
> +{
> +  gcc_assert (VECTOR_CST_NELTS (sel).is_constant ());
> +  unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant ();
> +  unsigned HOST_WIDE_INT n = 0;
> +
> +  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
> +    {
> +      /* Extract the i-th value from the selector.  */
> +      tree sel_cst_tree = VECTOR_CST_ELT (sel, i);
> +      unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree);
> +
> +      unsigned HOST_WIDE_INT j = 0;
> +      for (; j <= i; j++)
> +        {
> +          tree prev_sel_cst_tree = VECTOR_CST_ELT (sel, j);
> +          unsigned HOST_WIDE_INT prev_sel_cst
> +            = TREE_INT_CST_LOW (prev_sel_cst_tree);
> +          if (prev_sel_cst == sel_cst)
> +            break;
> +        }
> +      index_map->put (i, j);
> +      n += (i == j) ? 1 : 0;
> +    }
> +
> +  return n;
> +}
> +
> +/* Search for opportunities to free half of the lanes in the following
> +   pattern:
> +
> +     v_in = {e0, e1, e2, e3}
> +     v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}>
> +     // v_1 = {e0, e2, e0, e2}
> +     v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}>
> +     // v_2 = {e1, e3, e1, e3}
> +
> +     v_x = v_1 + v_2
> +     // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
> +     v_y = v_1 - v_2
> +     // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}
> +
> +     v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}>
> +     // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
> +
> +   The last statement could be simplified to:
> +     v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}>
> +     // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3}
> +
> +   Characteristic properties:
> +   - v_1 and v_2 are created from the same input vector v_in and introduce
> +     the lane duplication (in the selection operand) that we can eliminate.
> +   - v_x and v_y are results from lane-preserving operations that use v_1
> +     and v_2 as inputs.
> +   - v_out is created by selecting from duplicated lanes.  */
> +
> +static bool
> +recognise_vec_perm_simplify_seq (gassign *stmt, vec_perm_simplify_seq *seq)
> +{
> +  gcc_checking_assert (stmt);
> +  gcc_checking_assert (gimple_assign_rhs_code (stmt) == VEC_PERM_EXPR);
> +
> +  /* Decompose the final vec permute statement.  */
> +  tree v_x = gimple_assign_rhs1 (stmt);
> +  tree v_y = gimple_assign_rhs2 (stmt);
> +  tree sel = gimple_assign_rhs3 (stmt);
> +
> +  if (!VECTOR_CST_NELTS (sel).is_constant ())
> +    return false;
> +
> +  unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant ();
> +
> +  /* Lookup the definition of v_x and v_y.  */
> +  gassign *v_x_stmt = get_tree_def (v_x, true);
> +  gassign *v_y_stmt = get_tree_def (v_y, true);
> +  if (!v_x_stmt || !v_y_stmt)
> +    return false;
> +
> +  /* Check the operations that define v_x and v_y.  */
> +  if (TREE_CODE_CLASS (gimple_assign_rhs_code (v_x_stmt)) != tcc_binary
> +      || TREE_CODE_CLASS (gimple_assign_rhs_code (v_y_stmt)) != tcc_binary)
> +    return false;
> +
> +  tree v_x_1 = gimple_assign_rhs1 (v_x_stmt);
> +  tree v_x_2 = gimple_assign_rhs2 (v_x_stmt);
> +  tree v_y_1 = gimple_assign_rhs1 (v_y_stmt);
> +  tree v_y_2 = gimple_assign_rhs2 (v_y_stmt);
> +
> +  if (v_x_stmt == v_y_stmt
> +      || TREE_CODE (v_x_1) != SSA_NAME
> +      || TREE_CODE (v_x_2) != SSA_NAME
> +      || num_imm_uses (v_x_1) != 2
> +      || num_imm_uses (v_x_2) != 2)
> +    return false;
> +
> +  if (v_x_1 != v_y_1 || v_x_2 != v_y_2)
> +    {
> +      /* Allow operands of commutative operators to swap.  */
> +      if (commutative_tree_code (gimple_assign_rhs_code (v_x_stmt)))
> +        {
> +          /* Keep v_x_1 the first operand for non-commutative operators.  */
> +          v_x_1 = gimple_assign_rhs2 (v_x_stmt);
> +          v_x_2 = gimple_assign_rhs1 (v_x_stmt);
> +          if (v_x_1 != v_y_1 || v_x_2 != v_y_2)
> +            return false;
> +        }
> +      else if (commutative_tree_code (gimple_assign_rhs_code (v_y_stmt)))
> +        {
> +          if (v_x_1 != v_y_2 || v_x_2 != v_y_1)
> +            return false;
> +        }
> +      else
> +        return false;
> +    }
> +
> +  gassign *v_1_stmt = get_tree_def (v_x_1, false);
> +  gassign *v_2_stmt = get_tree_def (v_x_2, false);
> +  if (!v_1_stmt || !v_2_stmt)
> +    return false;
> +
> +  if (gimple_assign_rhs_code (v_1_stmt) != VEC_PERM_EXPR
> +      || gimple_assign_rhs_code (v_2_stmt) != VEC_PERM_EXPR)
> +    return false;
> +
> +  /* Decompose initial VEC_PERM_EXPRs.  */
> +  tree v_in = gimple_assign_rhs1 (v_1_stmt);
> +  tree v_1_sel = gimple_assign_rhs3 (v_1_stmt);
> +  tree v_2_sel = gimple_assign_rhs3 (v_2_stmt);
> +  if (v_in != gimple_assign_rhs2 (v_1_stmt)
> +      || v_in != gimple_assign_rhs1 (v_2_stmt)
> +      || v_in != gimple_assign_rhs2 (v_2_stmt))
> +    return false;
> +
> +  if (!VECTOR_CST_NELTS (v_1_sel).is_constant ()
> +      || !VECTOR_CST_NELTS (v_2_sel).is_constant ())
> +    return false;
> +
> +  if (nelts != VECTOR_CST_NELTS (v_1_sel).to_constant ()
> +      || nelts != VECTOR_CST_NELTS (v_2_sel).to_constant ())
> +    return false;
> +
> +  /* Now check permutation selection operands.  */
> +  lane_map_t v_1_lane_map, v_2_lane_map;
> +  unsigned HOST_WIDE_INT v_1_lanes, v_2_lanes;
> +  v_1_lanes = get_vect_selector_index_map (v_1_sel, &v_1_lane_map);
> +  v_2_lanes = get_vect_selector_index_map (v_2_sel, &v_2_lane_map);
> +
> +  /* Check if we could free up half of the lanes.  */
> +  if (v_1_lanes != v_2_lanes || v_1_lanes > (nelts / 2))
> +    return false;
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "Found vec perm simplify sequence ending with: ");
> +      print_gimple_stmt (dump_file, stmt, 0);
> +
> +      if (dump_flags & TDF_DETAILS)
> +        fprintf (dump_file, "\tPossible new selector: { ");
> +    }
> +
> +  *seq = XNEW (struct _vec_perm_simplify_seq);
> +  (*seq)->stmt = stmt;
> +  (*seq)->v_1_stmt = v_1_stmt;
> +  (*seq)->v_2_stmt = v_2_stmt;
> +  (*seq)->v_x_stmt = v_x_stmt;
> +  (*seq)->v_y_stmt = v_y_stmt;
> +  (*seq)->nelts = nelts;
> +  (*seq)->new_sel = vNULL;
> +
> +  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
> +    {
> +      /* Extract the i-th value from the selector.  */
> +      tree sel_cst_tree = VECTOR_CST_ELT (sel, i);
> +      unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree);
> +
> +      unsigned HOST_WIDE_INT j;
> +      if (sel_cst < nelts)
> +        j = *v_1_lane_map.get (sel_cst);
> +      else
> +        j = *v_2_lane_map.get (sel_cst - nelts) + nelts;
> +
> +      (*seq)->new_sel.safe_push (j);
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +        {
> +          fprintf (dump_file, "%lu", j);
> +          if (i != (nelts - 1))
> +            fprintf (dump_file, ", ");
> +        }
> +    }
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, " }\n");
> +
> +  return true;
> +}
> +
> +/* Reduce the lane consumption of a simplifiable vec perm sequence.  */
> +
> +static void
> +narrow_vec_perm_simplify_seq (vec_perm_simplify_seq seq)
> +{
> +  unsigned HOST_WIDE_INT nelts = seq->nelts;
> +  int i;
> +  unsigned HOST_WIDE_INT idx;
> +  vec_perm_builder new_sel (nelts, nelts, 1);
> +
> +  FOR_EACH_VEC_ELT (seq->new_sel, i, idx)
> +    new_sel.quick_push (idx);
> +
> +  gassign *stmt = seq->stmt;
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "Updating VEC_PERM statement:\n");
> +      fprintf (dump_file, "Old stmt: ");
> +      print_gimple_stmt (dump_file, stmt, 0);
> +    }
> +
> +  /* Update the last VEC_PERM statement.  */
> +  vec_perm_indices new_indices (new_sel, 2, nelts);
> +  tree vectype = TREE_TYPE (gimple_assign_lhs (stmt));
> +  tree m = vect_gen_perm_mask_checked (vectype, new_indices);
> +  gimple_assign_set_rhs3 (stmt, m);
> +  update_stmt (stmt);
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "New stmt: ");
> +      print_gimple_stmt (dump_file, stmt, 0);
> +    }
> +}
> +
> +/* Test if we can blend two simplifiable vec permute sequences.
> +   NEED_SWAP will be set if the sequences must be swapped for blending.  */
> +
> +static bool
> +can_blend_vec_perm_simplify_seqs_p (vec_perm_simplify_seq seq1,
> +                                    vec_perm_simplify_seq seq2,
> +                                    bool *need_swap)
> +{
> +  unsigned HOST_WIDE_INT nelts = seq1->nelts;
> +
> +  gcc_assert (gimple_bb (seq1->stmt) == gimple_bb (seq2->stmt));
> +
> +  /* Number of elements must be equal.  */
> +  if (seq1->nelts != seq2->nelts)
> +    return false;
> +
> +  /* We need vectors of the same type.  */
> +  if (TREE_TYPE (gimple_assign_lhs (seq1->stmt))
> +      != TREE_TYPE (gimple_assign_lhs (seq2->stmt)))
> +    return false;
> +
> +  /* We require isomorphic operators.  */
> +  if (((gimple_assign_rhs_code (seq1->v_x_stmt)
> +        != gimple_assign_rhs_code (seq2->v_x_stmt))
> +       || (gimple_assign_rhs_code (seq1->v_y_stmt)
> +           != gimple_assign_rhs_code (seq2->v_y_stmt))))
> +    return false;
> +
> +  /* We require equal selectors for v_1_stmt and v_2_stmt.
> +     This ensures we don't introduce a complex permutation in the blended
> +     sequence.  */
> +  tree seq1_v_1_sel = gimple_assign_rhs3 (seq1->v_1_stmt);
> +  tree seq1_v_2_sel = gimple_assign_rhs3 (seq1->v_2_stmt);
> +  tree seq2_v_1_sel = gimple_assign_rhs3 (seq2->v_1_stmt);
> +  tree seq2_v_2_sel = gimple_assign_rhs3 (seq2->v_2_stmt);
> +  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
> +    {
> +      if ((TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_1_sel, i))
> +           != TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_1_sel, i)))
> +          || (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_2_sel, i))
> +              != (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_2_sel, i)))))
> +        return false;
> +    }
> +
> +  /* We cannot have any dependencies between the sequences.
> +
> +     For merging, we will reuse seq1->v_1_stmt and seq1->v_2_stmt.
> +     seq1's v_in is defined before these statements, but we need
> +     to check if seq2's v_in is defined before as well.
> +
> +     Further, we will reuse seq2->stmt.  We need to ensure that
> +     seq1->v_x_stmt and seq1->v_y_stmt are before that.  */
> +
> +  tree seq2_v_in = gimple_assign_rhs1 (seq2->v_1_stmt);
> +  gassign *seq2_v_in_stmt = get_tree_def (seq2_v_in, false);
> +  if (!seq2_v_in_stmt
> +      || (gimple_uid (seq2_v_in_stmt) > gimple_uid (seq1->v_1_stmt))
> +      || (gimple_uid (seq1->v_x_stmt) > gimple_uid (seq2->stmt))
> +      || (gimple_uid (seq1->v_y_stmt) > gimple_uid (seq2->stmt)))
> +    {
> +      tree seq1_v_in = gimple_assign_rhs1 (seq1->v_1_stmt);
> +      gassign *seq1_v_in_stmt = get_tree_def (seq1_v_in, false);
> +      /* Let's try to see if we succeed when swapping the sequences.  */
> +      if (!seq1_v_in_stmt
> +          || (gimple_uid (seq1_v_in_stmt) > gimple_uid (seq2->v_1_stmt))
> +          || (gimple_uid (seq2->v_x_stmt) > gimple_uid (seq1->stmt))
> +          || (gimple_uid (seq2->v_y_stmt) > gimple_uid (seq1->stmt)))
> +        return false;
> +      *need_swap = true;
> +    }
> +  else
> +    *need_swap = false;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "Found vec perm simplify sequence pair.\n");
> +
> +  return true;
> +}
> +
> +/* Merge the two given simplifiable vec permute sequences.  */
> +
> +static void
> +blend_vec_perm_simplify_seqs (vec_perm_simplify_seq seq1,
> +                              vec_perm_simplify_seq seq2)
> +{
> +  unsigned HOST_WIDE_INT nelts = seq1->nelts;
> +
> +  /* Calculate the lanes that the first permute needs.  */
> +  tree mask1 = gimple_assign_rhs3 (seq1->stmt);
> +  lane_map_t mask1_lanes;
> +  unsigned HOST_WIDE_INT i;
> +  for (i = 0; i < nelts; i++)
> +    {
> +      tree val = VECTOR_CST_ELT (mask1, i);
> +      unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts;
> +      mask1_lanes.put (j, j);
> +    }
> +
> +  /* Calculate the lanes that the second permute needs and rewrite
> +     them to use the lanes that are unused from the first permute.  */
> +  tree mask2 = gimple_assign_rhs3 (seq2->stmt);
> +  lane_map_t mask2_lanes;
> +  unsigned HOST_WIDE_INT lane_gen = 0;
> +  for (i = 0; i < nelts; i++)
> +    {
> +      tree val = VECTOR_CST_ELT (mask2, i);
> +      unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts;
> +      bool existed;
> +      unsigned HOST_WIDE_INT &rewrite_lane
> +        = mask2_lanes.get_or_insert (j, &existed);
> +      if (existed)
> +        continue;
> +
> +      while (mask1_lanes.get (lane_gen))
> +        lane_gen++;
> +      rewrite_lane = lane_gen;
> +      lane_gen++;
> +    }
> +
> +  /* Ensure we don't need more than one permute.  */
> +  gcc_assert (i >= nelts);
> +
> +  /* Calculate the permutations based on MASK1_LANES/MASK2_LANES.  */
> +  vec_perm_builder seq2_stmt_sel (nelts, nelts, 1);
> +  vec_perm_builder seq1_v_1_stmt_sel (nelts, nelts, 1);
> +  vec_perm_builder seq1_v_2_stmt_sel (nelts, nelts, 1);
> +  for (i = 0; i < nelts; i++)
> +    {
> +      /* Calculate new mask for STMT2.  */
> +      tree val = VECTOR_CST_ELT (mask2, i);
> +      unsigned HOST_WIDE_INT lane
> +        = TREE_INT_CST_LOW (val) & (2 * nelts - 1);
> +      unsigned off = (lane < nelts) ? 0 : nelts;
> +      unsigned HOST_WIDE_INT new_lane
> +        = *mask2_lanes.get (lane - off) + off;
> +      seq2_stmt_sel.quick_push (new_lane);
> +
> +      /* Calculate new masks for SRC1_1 and SRC1_2.  */
> +      bool use_src1 = mask1_lanes.get (i);
> +      tree mask1 = gimple_assign_rhs3 (use_src1 ? seq1->v_1_stmt
> +                                                : seq2->v_1_stmt);
> +      tree mask2 = gimple_assign_rhs3 (use_src1 ? seq1->v_2_stmt
> +                                                : seq2->v_2_stmt);
> +      unsigned HOST_WIDE_INT lane1
> +        = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask1, i)) % nelts;
> +      unsigned HOST_WIDE_INT lane2
> +        = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask2, i)) % nelts;
> +      seq1_v_1_stmt_sel.quick_push (lane1 + (use_src1 ? 0 : nelts));
> +      seq1_v_2_stmt_sel.quick_push (lane2 + (use_src1 ? 0 : nelts));
> +    }
> +
> +  /* We don't need to adjust seq1->stmt because its lane consumption
> +     was already tightened before entering this function.  */
> +
> +  /* Adjust seq2->stmt: copy RHS1/RHS2 from seq1->stmt and set new sel.  */
> +  gimple_assign_set_rhs1 (seq2->stmt, gimple_assign_rhs1 (seq1->stmt));
> +  gimple_assign_set_rhs2 (seq2->stmt, gimple_assign_rhs2 (seq1->stmt));
> +  vec_perm_indices seq2_stmt_indices (seq2_stmt_sel, 2, nelts);
> +  tree vectype = TREE_TYPE (gimple_assign_lhs (seq2->stmt));
> +  tree mask = vect_gen_perm_mask_checked (vectype, seq2_stmt_indices);
> +  gimple_assign_set_rhs3 (seq2->stmt, mask);
> +  update_stmt (seq2->stmt);
> +
> +  /* Adjust seq1->v_1_stmt: copy RHS2 from seq2->v_1_stmt and set new sel.  */
> +  gimple_assign_set_rhs2 (seq1->v_1_stmt,
> +                          gimple_assign_rhs1 (seq2->v_1_stmt));
> +  vec_perm_indices seq1_v_1_stmt_indices (seq1_v_1_stmt_sel, 2, nelts);
> +  mask = vect_gen_perm_mask_checked (vectype, seq1_v_1_stmt_indices);
> +  gimple_assign_set_rhs3 (seq1->v_1_stmt, mask);
> +  update_stmt (seq1->v_1_stmt);
> +
> +  /* Adjust seq1->v_2_stmt: copy RHS2 from seq2->v_2_stmt and set new sel.  */
> +  gimple_assign_set_rhs2 (seq1->v_2_stmt,
> +                          gimple_assign_rhs1 (seq2->v_2_stmt));
> +  vec_perm_indices seq1_v_2_stmt_indices (seq1_v_2_stmt_sel, 2, nelts);
> +  mask = vect_gen_perm_mask_checked (vectype, seq1_v_2_stmt_indices);
> +  gimple_assign_set_rhs3 (seq1->v_2_stmt, mask);
> +  update_stmt (seq1->v_2_stmt);
> +
> +  /* At this point, we have four unmodified seq2 stmts, which will be
> +     eliminated by DCE.  */
> +
> +  if (dump_file)
> +    fprintf (dump_file, "Vec perm simplify sequences have been blended.\n\n");
> +}
> +
> +/* Try to blend narrowed vec_perm_simplify_seqs pairwise.
> +   The provided list will be empty after this call.  */
> +
> +static void
> +process_vec_perm_simplify_seq_list (vec<vec_perm_simplify_seq> *l)
> +{
> +  unsigned int i, j;
> +  vec_perm_simplify_seq seq1, seq2;
> +
> +  if (!l->length ())
> +    return;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "Processing %u vec perm simplify sequences.\n",
> +             l->length ());
> +
> +  FOR_EACH_VEC_ELT (*l, i, seq1)
> +    {
> +      if (i + 1 < l->length ())
> +        {
> +          FOR_EACH_VEC_ELT_FROM (*l, j, seq2, i + 1)
> +            {
> +              bool swap = false;
> +              if (can_blend_vec_perm_simplify_seqs_p (seq1, seq2, &swap))
> +                {
> +                  /* Compress lanes.  */
> +                  narrow_vec_perm_simplify_seq (seq1);
> +                  narrow_vec_perm_simplify_seq (seq2);
> +
> +                  /* Blend sequences.  */
> +                  blend_vec_perm_simplify_seqs (swap ? seq2 : seq1,
> +                                                swap ? seq1 : seq2);
> +
> +                  /* We can use unordered_remove as we break the loop below.  */
> +                  l->unordered_remove (j);
> +                  seq2->new_sel.release ();
> +                  XDELETE (seq2);
> +                  break;
> +                }
> +            }
> +        }
> +
> +      /* We don't need to call l->remove for seq1.  */
> +      seq1->new_sel.release ();
> +      XDELETE (seq1);
> +    }
> +
> +  l->truncate (0);
> +}
> +
> +/* Add the vec perm simplify sequence SEQ to the list L.  */
> +
> +static void
> +append_vec_perm_simplify_seq_list (vec<vec_perm_simplify_seq> *l,
> +                                   const vec_perm_simplify_seq &seq)
> +{
> +  /* If no space is left on the list, then process the list first.  */
> +  if (!l->space (1))
> +    process_vec_perm_simplify_seq_list (l);
> +
> +  l->quick_push (seq);
> +}
> +
>  /* Main entry point for the forward propagation and statement combine
>     optimizer.  */
>
> @@ -3526,6 +4054,7 @@ pass_forwprop::execute (function *fun)
>    auto_bitmap simple_dce_worklist;
>    auto_bitmap need_ab_cleanup;
>    to_purge = BITMAP_ALLOC (NULL);
> +  auto_vec<vec_perm_simplify_seq, 8> vec_perm_simplify_seq_list;
>    for (int i = 0; i < postorder_num; ++i)
>      {
>        gimple_stmt_iterator gsi;
> @@ -3605,14 +4134,18 @@
>
>        /* Apply forward propagation to all stmts in the basic-block.
>           Note we update GSI within the loop as necessary.  */
> +      unsigned int uid = 1;
>        for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); )
>          {
>            gimple *stmt = gsi_stmt (gsi);
>            tree lhs, rhs;
>            enum tree_code code;
>
> +          gimple_set_uid (stmt, uid++);
> +
>            if (!is_gimple_assign (stmt))
>              {
> +              process_vec_perm_simplify_seq_list
> +                (&vec_perm_simplify_seq_list);
>                gsi_next (&gsi);
>                continue;
>              }
> @@ -3620,9 +4153,11 @@
>            lhs = gimple_assign_lhs (stmt);
>            rhs = gimple_assign_rhs1 (stmt);
>            code = gimple_assign_rhs_code (stmt);
> +
>            if (TREE_CODE (lhs) != SSA_NAME
>                || has_zero_uses (lhs))
>              {
> +              process_vec_perm_simplify_seq_list
> +                (&vec_perm_simplify_seq_list);
>                gsi_next (&gsi);
>                continue;
>              }
> @@ -3901,10 +4436,25 @@
>            else
>              gsi_next (&gsi);
>          }
> +      else if (code == VEC_PERM_EXPR)
> +        {
> +          /* Find vectorized sequences where we can reduce the lane
> +             utilization.  The narrowing will be done later and only
> +             if we find a pair of sequences that can be blended.  */
> +          gassign *assign = dyn_cast <gassign *> (stmt);
> +          vec_perm_simplify_seq seq;
> +          if (recognise_vec_perm_simplify_seq (assign, &seq))
> +            append_vec_perm_simplify_seq_list (&vec_perm_simplify_seq_list,
> +                                               seq);
> +
> +          gsi_next (&gsi);
> +        }
>        else
>          gsi_next (&gsi);
>      }
>
> +  process_vec_perm_simplify_seq_list (&vec_perm_simplify_seq_list);
> +
>    /* Combine stmts with the stmts defining their operands.
>       Note we update GSI within the loop as necessary.  */
>    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> --
> 2.47.0
>
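P.S. As a standalone sanity check of the selector arithmetic (my own
test, not part of the patch), the following program compares the two
redundant sequences from vector-8.c against the narrowed/blended form
sketched further up in this mail.  Built with any recent GCC (even at
-O0, so forwprop does not interfere), it should run with all asserts
passing:

  /* Standalone check (not part of the patch): the blended selectors from
     the sketch above must reproduce the values of the two redundant
     sequences.  */
  #include <assert.h>

  typedef int vec __attribute__((vector_size (4 * sizeof (int))));

  int
  main (void)
  {
    vec v_in_1 = { 1, 2, 3, 4 };
    vec v_in_2 = { 5, 6, 7, 8 };

    /* Original, redundant form: each sequence shuffles its own input
       twice.  */
    vec sel0 = { 0, 2, 0, 2 }, sel1 = { 1, 3, 1, 3 }, sel = { 0, 1, 6, 7 };
    vec a1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
    vec a2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
    vec out1 = __builtin_shuffle (a1 + a2, a1 - a2, sel);
    vec b1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
    vec b2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
    vec out2 = __builtin_shuffle (b1 + b2, b1 - b2, sel);

    /* Blended form: both inputs packed into one statement pair.  */
    vec bsel0 = { 0, 2, 4, 6 }, bsel1 = { 1, 3, 5, 7 };
    vec bsel_out1 = { 0, 1, 4, 5 }, bsel_out2 = { 2, 3, 6, 7 };
    vec c1 = __builtin_shuffle (v_in_1, v_in_2, bsel0);
    vec c2 = __builtin_shuffle (v_in_1, v_in_2, bsel1);
    vec blended1 = __builtin_shuffle (c1 + c2, c1 - c2, bsel_out1);
    vec blended2 = __builtin_shuffle (c1 + c2, c1 - c2, bsel_out2);

    for (int i = 0; i < 4; i++)
      {
        assert (out1[i] == blended1[i]);
        assert (out2[i] == blended2[i]);
      }
    return 0;
  }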