This extends forwprop by yet another VEC_PERM optimization:
It attempts to blend two isomorphic vector sequences by exploiting
the redundancy in their lane utilization.
This redundancy in lane utilization comes from the way certain scalar
statements end up being vectorized: two VEC_PERMs on top, binary
operations on both of them, and a final VEC_PERM to create the result.
Here is an example of this sequence:

  v_in = {e0, e1, e2, e3}
  v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}>
  // v_1 = {e0, e2, e0, e2}
  v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}>
  // v_2 = {e1, e3, e1, e3}

  v_x = v_1 + v_2
  // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
  v_y = v_1 - v_2
  // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}

  v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}>
  // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}

To remove the redundancy, lanes 2 and 3 can be freed, which allows
changing the last statement into:
  v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}>
  // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3}

The cost of eliminating the redundancy in the lane utilization is that
lowering the VEC_PERM expression could become more expensive because of
the tighter packing of the lanes.  Therefore this optimization is not
performed on its own, but only if we identify two such sequences that
can be blended.

Once all candidate sequences have been identified, we try to blend them,
so that we can use the freed lanes for the second sequence.
On success we convert 2x (2x VEC_PERM + 2x BINOP + 1x VEC_PERM)
into 2x VEC_PERM + 2x BINOP + 2x VEC_PERM, i.e. we trade
6x VEC_PERM + 4x BINOP for 4x VEC_PERM + 2x BINOP.
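
For illustration, blending two such narrowed sequences over the
independent inputs v_in_1 = {a0, a1, a2, a3} and v_in_2 = {b0, b1, b2, b3}
yields code along the following lines:

  v_1 = VEC_PERM <v_in_1, v_in_2, {0, 2, 4, 6}>
  // v_1 = {a0, a2, b0, b2}
  v_2 = VEC_PERM <v_in_1, v_in_2, {1, 3, 5, 7}>
  // v_2 = {a1, a3, b1, b3}

  v_x = v_1 + v_2
  // v_x = {a0+a1, a2+a3, b0+b1, b2+b3}
  v_y = v_1 - v_2
  // v_y = {a0-a1, a2-a3, b0-b1, b2-b3}

  v_out_1 = VEC_PERM <v_x, v_y, {0, 1, 4, 5}>
  // v_out_1 = {a0+a1, a2+a3, a0-a1, a2-a3}
  v_out_2 = VEC_PERM <v_x, v_y, {2, 3, 6, 7}>
  // v_out_2 = {b0+b1, b2+b3, b0-b1, b2-b3}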

The implemented transformation reuses (rewrites) the statements
of the first sequence and the last VEC_PERM of the second sequence.
The remaining four statements of the second sequence are left untouched
and will be eliminated by DCE later.

This targets x264_pixel_satd_8x4, which calculates the sum of absolute
transformed differences (SATD) using the Hadamard transformation.
We have seen an 8% speedup on SPEC's x264 on a 5950X (x86-64) and a 7%
speedup on an AArch64 machine.

Bootstrapped and reg-tested on x86-64 and AArch64 (all languages).

gcc/ChangeLog:

        * tree-ssa-forwprop.cc (struct _vec_perm_simplify_seq): New data
        structure to store analysis results of a vec perm simplify sequence.
        (get_tree_def): New helper to get the defining statement of an
        SSA_NAME.
        (get_vect_selector_index_map): Helper to get an index map from the
        provided vector permute selector.
        (recognise_vec_perm_simplify_seq): Helper to recognise a
        vec perm simplify sequence.
        (narrow_vec_perm_simplify_seq): Helper to pack the lanes more
        tightly.
        (can_blend_vec_perm_simplify_seqs_p): Test if two vec perm
        sequences can be blended.
        (blend_vec_perm_simplify_seqs): Helper to blend two vec perm
        simplify sequences.
        (process_vec_perm_simplify_seq_list): Helper to process a list
        of vec perm simplify sequences.
        (append_vec_perm_simplify_seq_list): Helper to add a vec perm
        simplify sequence to the list.
        (pass_forwprop::execute): Integrate new functionality.

gcc/testsuite/ChangeLog:

        * gcc.dg/tree-ssa/satd-hadamard.c: New test.
        * gcc.dg/tree-ssa/vector-10.c: New test.
        * gcc.dg/tree-ssa/vector-8.c: New test.
        * gcc.dg/tree-ssa/vector-9.c: New test.
        * gcc.target/aarch64/sve/satd-hadamard.c: New test.

Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
---
Changes from v3:
* Fix test condition for writing to the dump file
* Use gimple UIDs instead of expensive walks for comparing ordering.
* Ensure we don't blend across statements other than assignments to an SSA_NAME.
* Restrict the list to a fixed-size vector with 8 entries.
* Remove calls to expensive vec methods by restructuring the code.
* Improved wording.

Changes from v2:
* Moved code to tree-ssa-forwprop.cc where similar VEC_PERM
  optimizations are implemented.
* Test operand order less strictly in the case of commutative operators.
* Made naming more consistent.
* Added a test for testing dependencies between two sequences.
* Removed the instruction reordering (not necessary without dependencies).
* Added tests based on __builtin_shuffle ().

Changes from v1:
* Moved code from tree-vect-slp.cc into a new pass (from where it could
  be moved elsewhere).
* Only deduplicate lanes if sequences will be merged later on.
* Split functionality stricter into analysis and transformation parts.

Manolis Tsamis was the patch's initial author before I took it over.

 gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c |  43 ++
 gcc/testsuite/gcc.dg/tree-ssa/vector-10.c     | 122 ++++
 gcc/testsuite/gcc.dg/tree-ssa/vector-8.c      |  34 ++
 gcc/testsuite/gcc.dg/tree-ssa/vector-9.c      |  34 ++
 .../gcc.target/aarch64/sve/satd-hadamard.c    |   3 +
 gcc/tree-ssa-forwprop.cc                      | 550 ++++++++++++++++++
 6 files changed, 786 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
new file mode 100644
index 00000000000..576ef01628c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-forwprop4-details" } */
+
+#include <stdint.h>
+
+#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
+    int t0 = s0 + s1;\
+    int t1 = s0 - s1;\
+    int t2 = s2 + s3;\
+    int t3 = s2 - s3;\
+    d0 = t0 + t2;\
+    d1 = t1 + t3;\
+    d2 = t0 - t2;\
+    d3 = t1 - t3;\
+}
+
+int
+x264_pixel_satd_8x4_simplified (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
+{
+  uint32_t tmp[4][4];
+  uint32_t a0, a1, a2, a3;
+  int sum = 0;
+  int i;
+
+  for (i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
+    {
+      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
+      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
+      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
+      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
+      HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3);
+    }
+
+  for (i = 0; i < 4; i++)
+    {
+      HADAMARD4(a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]);
+      sum += a0 + a1 + a2 + a3;
+    }
+
+  return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
+}
+
+/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop4" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
new file mode 100644
index 00000000000..d5caebdf174
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
@@ -0,0 +1,122 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
+
+typedef int vec __attribute__((vector_size (4 * sizeof (int))));
+
+void f1 (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2)
+{
+  vec sel0 = { 0, 2, 0, 2 };
+  vec sel1 = { 1, 3, 1, 3 };
+  vec sel = { 0, 1, 6, 7 };
+  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
+  vec v_in_1 = *p_v_in_1;
+  vec v_in_2;
+
+  /* First vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
+  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
+
+  /* Won't blend because v_in_2 is defined after v_1 above.  */
+  v_in_2 = *p_v_in_2;
+  /* Second vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
+  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
+
+  *p_v_out_1 = v_out_1;
+  *p_v_out_2 = v_out_2;
+}
+
+void f2 (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2)
+{
+  vec sel0 = { 0, 2, 0, 2 };
+  vec sel1 = { 1, 3, 1, 3 };
+  vec sel = { 0, 1, 6, 7 };
+  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
+  vec v_in_1 = *p_v_in_1;
+  vec v_in_2 = *p_v_in_2;
+
+  /* First vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
+  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
+  /* Won't blend because of this store between the sequences.  */
+  *p_v_out_1 = v_out_1;
+
+  /* Second vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
+  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
+
+  *p_v_out_2 = v_out_2;
+}
+
+void f3 (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2)
+{
+  vec sel0 = { 0, 2, 0, 2 };
+  vec sel1 = { 1, 3, 1, 3 };
+  vec sel = { 0, 1, 6, 7 };
+  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
+  vec v_in_1 = *p_v_in_1;
+  vec v_in_2 = *p_v_in_2;
+
+  /* First vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
+  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
+
+  /* Second vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
+  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
+  v_x = v_1 + v_2;
+  /* Won't blend because v_2 is RHS1 here.  */
+  v_y = v_2 - v_1;
+  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
+
+  *p_v_out_1 = v_out_1;
+  *p_v_out_2 = v_out_2;
+}
+
+extern vec foo (vec v);
+void f4 (vec *p_v_in_1, vec *p_v_out_1, vec *p_v_out_2)
+{
+  vec sel0 = { 0, 2, 0, 2 };
+  vec sel1 = { 1, 3, 1, 3 };
+  vec sel = { 0, 1, 6, 7 };
+  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
+  vec v_in_1 = *p_v_in_1;
+  vec v_in_2;
+
+  /* First vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
+  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
+
+  /* Won't blend because of this dependency.  */
+  v_in_2 = foo (v_out_1);
+
+  /* Second vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
+  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
+
+  *p_v_out_1 = v_out_1;
+  *p_v_out_2 = v_out_2;
+}
+
+/* { dg-final { scan-tree-dump-not "Vec perm simplify sequences have been blended" "forwprop1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
new file mode 100644
index 00000000000..bc2269065e4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
+
+typedef int vec __attribute__((vector_size (4 * sizeof (int))));
+
+void f (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2)
+{
+  vec sel0 = { 0, 2, 0, 2 };
+  vec sel1 = { 1, 3, 1, 3 };
+  vec sel = { 0, 1, 6, 7 };
+  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
+  vec v_in_1 = *p_v_in_1;
+  vec v_in_2 = *p_v_in_2;
+
+  /* First vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
+  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
+
+  /* Second vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
+  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
+
+  *p_v_out_1 = v_out_1;
+  *p_v_out_2 = v_out_2;
+}
+
+/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been blended" "forwprop1" } } */
+/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
new file mode 100644
index 00000000000..e5f898e0281
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
+
+typedef int vec __attribute__((vector_size (4 * sizeof (int))));
+
+void f (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2)
+{
+  vec sel0 = { 0, 2, 0, 2 };
+  vec sel1 = { 1, 3, 1, 3 };
+  vec sel = { 0, 1, 6, 7 };
+  vec v_1, v_2, v_x, v_y, v_out_1, v_out_2;
+  vec v_in_1 = *p_v_in_1;
+  vec v_in_2 = *p_v_in_2;
+
+  /* First vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
+  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
+  v_x = v_1 * v_2;
+  v_y = v_2 - v_1;
+  v_out_1 = __builtin_shuffle (v_x, v_y, sel);
+
+  /* Second vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
+  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
+  v_x = v_2 * v_1;
+  v_y = v_2 - v_1;
+  v_out_2 = __builtin_shuffle (v_x, v_y, sel);
+
+  *p_v_out_1 = v_out_1;
+  *p_v_out_2 = v_out_2;
+}
+
+/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been blended" "forwprop1" } } */
+/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop1" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
new file mode 100644
index 00000000000..fcd140e3584
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
@@ -0,0 +1,3 @@
+#include "../../../gcc.dg/tree-ssa/satd-hadamard.c"
+
+/* { dg-final { scan-assembler-not {\ttbl\t} } } */
diff --git a/gcc/tree-ssa-forwprop.cc b/gcc/tree-ssa-forwprop.cc
index 5c690a2b03e..1c755f89c29 100644
--- a/gcc/tree-ssa-forwprop.cc
+++ b/gcc/tree-ssa-forwprop.cc
@@ -50,6 +50,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "optabs-tree.h"
 #include "insn-config.h"
 #include "recog.h"
+#include "cfgloop.h"
+#include "tree-vectorizer.h"
 #include "tree-vector-builder.h"
 #include "vec-perm-indices.h"
 #include "internal-fn.h"
@@ -184,6 +186,29 @@ along with GCC; see the file COPYING3.  If not see
 
    This will (of course) be extended as other needs arise.  */
 
+/* Data structure that holds a lane utilization map.  */
+typedef hash_map<int_hash <unsigned HOST_WIDE_INT, -1U>,
+                          unsigned HOST_WIDE_INT> lane_map_t;
+
+/* Data structure that describes a simplifiable vectorized permute sequence.
+   See recognise_vec_perm_simplify_seq () for a description of the sequence.  */
+
+struct _vec_perm_simplify_seq
+{
+  /* Defining stmts of vectors in the sequence.  */
+  gassign *v_1_stmt;
+  gassign *v_2_stmt;
+  gassign *v_x_stmt;
+  gassign *v_y_stmt;
+  /* Final permute statement.  */
+  gassign *stmt;
+  /* New selector indices for stmt.  */
+  vec<unsigned HOST_WIDE_INT> new_sel;
+  /* Elements of each vector and selector.  */
+  unsigned HOST_WIDE_INT nelts;
+};
+typedef struct _vec_perm_simplify_seq *vec_perm_simplify_seq;
+
 static bool forward_propagate_addr_expr (tree, tree, bool);
 
 /* Set to true if we delete dead edges during the optimization.  */
@@ -225,6 +250,26 @@ fwprop_invalidate_lattice (tree name)
     lattice[SSA_NAME_VERSION (name)] = NULL_TREE;
 }
 
+/* If NAME is an SSA_NAME defined by an assignment, return that assignment.
+   If SINGLE_USE_ONLY is true and NAME has multiple uses, return NULL.  */
+
+static gassign *
+get_tree_def (tree name, bool single_use_only)
+{
+  if (TREE_CODE (name) != SSA_NAME)
+    return NULL;
+
+  gimple *def_stmt = SSA_NAME_DEF_STMT (name);
+  gassign *assign = dyn_cast <gassign *> (def_stmt);
+
+  if (!assign)
+    return NULL;
+
+  if (single_use_only && !has_single_use (name))
+    return NULL;
+
+  return assign;
+}
 
 /* Get the statement we can propagate from into NAME skipping
    trivial copies.  Returns the statement which defines the
@@ -3460,6 +3505,489 @@ fwprop_ssa_val (tree name)
   return name;
 }
 
+/* Get an index map from the provided vector permute selector
+   and return the number of unique indices.
+   E.g.: { 1, 3, 1, 3 } -> <0, 1, 0, 1>, 2
+        { 0, 2, 0, 2 } -> <0, 1, 0, 1>, 2
+        { 3, 2, 1, 0 } -> <0, 1, 2, 3>, 4.  */
+
+static unsigned HOST_WIDE_INT
+get_vect_selector_index_map (tree sel, lane_map_t *index_map)
+{
+  gcc_assert (VECTOR_CST_NELTS (sel).is_constant ());
+  unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant ();
+  unsigned HOST_WIDE_INT n = 0;
+
+  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
+    {
+      /* Extract the i-th value from the selector.  */
+      tree sel_cst_tree = VECTOR_CST_ELT (sel, i);
+      unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree);
+
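+      /* Find the first selector index that holds the same value.  */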
+      unsigned HOST_WIDE_INT j = 0;
+      for (; j <= i; j++)
+       {
+         tree prev_sel_cst_tree = VECTOR_CST_ELT (sel, j);
+         unsigned HOST_WIDE_INT prev_sel_cst
+           = TREE_INT_CST_LOW (prev_sel_cst_tree);
+         if (prev_sel_cst == sel_cst)
+           break;
+       }
+      index_map->put (i, j);
+      n += (i == j) ? 1 : 0;
+    }
+
+  return n;
+}
+
+/* Search for opportunities to free half of the lanes in the following pattern:
+
+     v_in = {e0, e1, e2, e3}
+     v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}>
+     // v_1 = {e0, e2, e0, e2}
+     v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}>
+     // v_2 = {e1, e3, e1, e3}
+
+     v_x = v_1 + v_2
+     // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
+     v_y = v_1 - v_2
+     // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}
+
+     v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}>
+     // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
+
+   The last statement could be simplified to:
+     v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}>
+     // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3}
+
+   Characteristic properties:
+   - v_1 and v_2 are created from the same input vector v_in and introduce the
+     lane duplication (in the selection operand) that we can eliminate.
+   - v_x and v_y are results from lane-preserving operations that use v_1 and
+     v_2 as inputs.
+   - v_out is created by selecting from duplicated lanes.  */
+
+static bool
+recognise_vec_perm_simplify_seq (gassign *stmt, vec_perm_simplify_seq *seq)
+{
+  gcc_checking_assert (stmt);
+  gcc_checking_assert (gimple_assign_rhs_code (stmt) == VEC_PERM_EXPR);
+
+  /* Decompose the final vec permute statement.  */
+  tree v_x = gimple_assign_rhs1 (stmt);
+  tree v_y = gimple_assign_rhs2 (stmt);
+  tree sel = gimple_assign_rhs3 (stmt);
+
+  if (!VECTOR_CST_NELTS (sel).is_constant ())
+    return false;
+
+  unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant ();
+
+  /* Lookup the definition of v_x and v_y.  */
+  gassign *v_x_stmt = get_tree_def (v_x, true);
+  gassign *v_y_stmt = get_tree_def (v_y, true);
+  if (!v_x_stmt || !v_y_stmt)
+    return false;
+
+  /* Check the operations that define v_x and v_y.  */
+  if (TREE_CODE_CLASS (gimple_assign_rhs_code (v_x_stmt)) != tcc_binary
+      || TREE_CODE_CLASS (gimple_assign_rhs_code (v_y_stmt)) != tcc_binary)
+    return false;
+
+  tree v_x_1 = gimple_assign_rhs1 (v_x_stmt);
+  tree v_x_2 = gimple_assign_rhs2 (v_x_stmt);
+  tree v_y_1 = gimple_assign_rhs1 (v_y_stmt);
+  tree v_y_2 = gimple_assign_rhs2 (v_y_stmt);
+
+  if (v_x_stmt == v_y_stmt
+      || TREE_CODE (v_x_1) != SSA_NAME
+      || TREE_CODE (v_x_2) != SSA_NAME
+      || num_imm_uses (v_x_1) != 2
+      || num_imm_uses (v_x_2) != 2)
+    return false;
+
+  if (v_x_1 != v_y_1 || v_x_2 != v_y_2)
+    {
+      /* Allow operands of commutative operators to swap.  */
+      if (commutative_tree_code (gimple_assign_rhs_code (v_x_stmt)))
+       {
+         /* Keep v_x_1 the first operand for non-commutative operators.  */
+         v_x_1 = gimple_assign_rhs2 (v_x_stmt);
+         v_x_2 = gimple_assign_rhs1 (v_x_stmt);
+         if (v_x_1 != v_y_1 || v_x_2 != v_y_2)
+           return false;
+       }
+      else if (commutative_tree_code (gimple_assign_rhs_code (v_y_stmt)))
+       {
+         if (v_x_1 != v_y_2 || v_x_2 != v_y_1)
+           return false;
+       }
+      else
+       return false;
+    }
+
+  gassign *v_1_stmt = get_tree_def (v_x_1, false);
+  gassign *v_2_stmt = get_tree_def (v_x_2, false);
+  if (!v_1_stmt || !v_2_stmt)
+    return false;
+
+  if (gimple_assign_rhs_code (v_1_stmt) != VEC_PERM_EXPR
+      || gimple_assign_rhs_code (v_2_stmt) != VEC_PERM_EXPR)
+    return false;
+
+  /* Decompose initial VEC_PERM_EXPRs.  */
+  tree v_in = gimple_assign_rhs1 (v_1_stmt);
+  tree v_1_sel = gimple_assign_rhs3 (v_1_stmt);
+  tree v_2_sel = gimple_assign_rhs3 (v_2_stmt);
+  if (v_in != gimple_assign_rhs2 (v_1_stmt)
+      || v_in != gimple_assign_rhs1 (v_2_stmt)
+      || v_in != gimple_assign_rhs2 (v_2_stmt))
+    return false;
+
+  if (!VECTOR_CST_NELTS (v_1_sel).is_constant ()
+      || !VECTOR_CST_NELTS (v_2_sel).is_constant ())
+    return false;
+
+  if (nelts != VECTOR_CST_NELTS (v_1_sel).to_constant ()
+      || nelts != VECTOR_CST_NELTS (v_2_sel).to_constant ())
+    return false;
+
+  /* Now check permutation selection operands.  */
+  lane_map_t v_1_lane_map, v_2_lane_map;
+  unsigned HOST_WIDE_INT v_1_lanes, v_2_lanes;
+  v_1_lanes = get_vect_selector_index_map (v_1_sel, &v_1_lane_map);
+  v_2_lanes = get_vect_selector_index_map (v_2_sel, &v_2_lane_map);
+
+  /* Check if we could free up half of the lanes.  */
+  if (v_1_lanes != v_2_lanes || v_1_lanes > (nelts / 2))
+    return false;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Found vec perm simplify sequence ending with: ");
+      print_gimple_stmt (dump_file, stmt, 0);
+
+      if (dump_flags & TDF_DETAILS)
+       fprintf (dump_file, "\tPossible new selector: { ");
+    }
+
+  *seq = XNEW (struct _vec_perm_simplify_seq);
+  (*seq)->stmt = stmt;
+  (*seq)->v_1_stmt = v_1_stmt;
+  (*seq)->v_2_stmt = v_2_stmt;
+  (*seq)->v_x_stmt = v_x_stmt;
+  (*seq)->v_y_stmt = v_y_stmt;
+  (*seq)->nelts = nelts;
+  (*seq)->new_sel = vNULL;
+
+  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
+    {
+      /* Extract the i-th value from the selector.  */
+      tree sel_cst_tree = VECTOR_CST_ELT (sel, i);
+      unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree);
+
+      unsigned HOST_WIDE_INT j;
+      if (sel_cst < nelts)
+       j = *v_1_lane_map.get (sel_cst);
+      else
+       j = *v_2_lane_map.get (sel_cst - nelts) + nelts;
+
+      (*seq)->new_sel.safe_push (j);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+       {
+         fprintf (dump_file, HOST_WIDE_INT_PRINT_UNSIGNED, j);
+         if (i != (nelts - 1))
+           fprintf (dump_file, ", ");
+       }
+    }
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, " }\n");
+
+  return true;
+}
+
+/* Reduce the lane consumption of a simplifiable vec perm sequence.  */
+
+static void
+narrow_vec_perm_simplify_seq (vec_perm_simplify_seq seq)
+{
+  unsigned HOST_WIDE_INT nelts = seq->nelts;
+  int i;
+  unsigned HOST_WIDE_INT idx;
+  vec_perm_builder new_sel (nelts, nelts, 1);
+
+  FOR_EACH_VEC_ELT (seq->new_sel, i, idx)
+    new_sel.quick_push (idx);
+
+  gassign *stmt = seq->stmt;
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Updating VEC_PERM statment:\n");
+      fprintf (dump_file, "Old stmt: ");
+      print_gimple_stmt (dump_file, stmt, 0);
+    }
+
+  /* Update the last VEC_PERM statement.  */
+  vec_perm_indices new_indices (new_sel, 2, nelts);
+  tree vectype = TREE_TYPE (gimple_assign_lhs (stmt));
+  tree m = vect_gen_perm_mask_checked (vectype, new_indices);
+  gimple_assign_set_rhs3 (stmt, m);
+  update_stmt (stmt);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "New stmt: ");
+      print_gimple_stmt (dump_file, stmt, 0);
+    }
+}
+
+/* Test if we can blend two simplifiable vec permute sequences.
+   NEED_SWAP will be set if the sequences must be swapped for blending.  */
+
+static bool
+can_blend_vec_perm_simplify_seqs_p (vec_perm_simplify_seq seq1,
+                                   vec_perm_simplify_seq seq2,
+                                   bool *need_swap)
+{
+  unsigned HOST_WIDE_INT nelts = seq1->nelts;
+
+  gcc_assert (gimple_bb (seq1->stmt) == gimple_bb (seq2->stmt));
+
+  /* Number of elements must be equal.  */
+  if (seq1->nelts != seq2->nelts)
+    return false;
+
+  /* We need vectors of the same type.  */
+  if (TREE_TYPE (gimple_assign_lhs (seq1->stmt))
+      != TREE_TYPE (gimple_assign_lhs (seq2->stmt)))
+    return false;
+
+  /* We require isomorphic operators.  */
+  if (((gimple_assign_rhs_code (seq1->v_x_stmt)
+       != gimple_assign_rhs_code (seq2->v_x_stmt))
+       || (gimple_assign_rhs_code (seq1->v_y_stmt)
+          != gimple_assign_rhs_code (seq2->v_y_stmt))))
+    return false;
+
+  /* We require equal selectors for v_1_stmt and v_2_stmt.
+     This ensures we don't introduce a complex permutation in the blended
+     sequence.  */
+  tree seq1_v_1_sel = gimple_assign_rhs3 (seq1->v_1_stmt);
+  tree seq1_v_2_sel = gimple_assign_rhs3 (seq1->v_2_stmt);
+  tree seq2_v_1_sel = gimple_assign_rhs3 (seq2->v_1_stmt);
+  tree seq2_v_2_sel = gimple_assign_rhs3 (seq2->v_2_stmt);
+  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
+    {
+      if ((TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_1_sel, i))
+          != TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_1_sel, i)))
+         || (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_2_sel, i))
+             != (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_2_sel, i)))))
+       return false;
+    }
+
+  /* We cannot have any dependencies between the sequences.
+
+     For merging, we will reuse seq1->v_1_stmt and seq1->v_2_stmt.
+     seq1's v_in is defined before these statements, but we need
+     to check if seq2's v_in is defined before as well.
+
+     Further, we will reuse seq2->stmt.  We need to ensure that
+     seq1->v_x_stmt and seq1->v_y_stmt are before that.  */
+
+  tree seq2_v_in = gimple_assign_rhs1 (seq2->v_1_stmt);
+  gassign *seq2_v_in_stmt = get_tree_def (seq2_v_in, false);
+  if (!seq2_v_in_stmt
+      || (gimple_uid (seq2_v_in_stmt) > gimple_uid (seq1->v_1_stmt))
+      || (gimple_uid (seq1->v_x_stmt) > gimple_uid (seq2->stmt))
+      || (gimple_uid (seq1->v_y_stmt) > gimple_uid (seq2->stmt)))
+    {
+      tree seq1_v_in = gimple_assign_rhs1 (seq1->v_1_stmt);
+      gassign *seq1_v_in_stmt = get_tree_def (seq1_v_in, false);
+      /* Let's try to see if we succeed when swapping the sequences.  */
+      if (!seq1_v_in_stmt
+         || (gimple_uid (seq1_v_in_stmt) > gimple_uid (seq2->v_1_stmt))
+         || (gimple_uid (seq2->v_x_stmt) > gimple_uid (seq1->stmt))
+         || (gimple_uid (seq2->v_y_stmt) > gimple_uid (seq1->stmt)))
+       return false;
+      *need_swap = true;
+    }
+  else
+    *need_swap = false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "Found vec perm simplify sequence pair.\n");
+
+  return true;
+}
+
+/* Merge the two given simplifiable vec permute sequences.  */
+
+static void
+blend_vec_perm_simplify_seqs (vec_perm_simplify_seq seq1,
+                             vec_perm_simplify_seq seq2)
+{
+  unsigned HOST_WIDE_INT nelts = seq1->nelts;
+
+  /* Calculate the lanes that the first permute needs.  */
+  tree mask1 = gimple_assign_rhs3 (seq1->stmt);
+  lane_map_t mask1_lanes;
+  unsigned HOST_WIDE_INT i;
+  for (i = 0; i < nelts; i++)
+    {
+      tree val = VECTOR_CST_ELT (mask1, i);
+      unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts;
+      mask1_lanes.put (j, j);
+    }
+
+  /* Calculate the lanes that the second permute needs and rewrite
+     them to use the lanes that are unused from the first permute.  */
+  tree mask2 = gimple_assign_rhs3 (seq2->stmt);
+  lane_map_t mask2_lanes;
+  unsigned HOST_WIDE_INT lane_gen = 0;
+  for (i = 0; i < nelts; i++)
+    {
+      tree val = VECTOR_CST_ELT (mask2, i);
+      unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts;
+      bool existed;
+      unsigned HOST_WIDE_INT &rewrite_lane
+       = mask2_lanes.get_or_insert (j, &existed);
+      if (existed)
+       continue;
+
+      while (mask1_lanes.get (lane_gen))
+       lane_gen++;
+      rewrite_lane = lane_gen;
+      lane_gen++;
+    }
+
+  /* Ensure that all required lanes fit into a single vector,
+     i.e. we don't need more than one permute.  */
+  gcc_assert (lane_gen <= nelts);
+
+  /* Calculate the permutations based on MASK1_LANES/MASK2_LANES.  */
+  vec_perm_builder seq2_stmt_sel (nelts, nelts, 1);
+  vec_perm_builder seq1_v_1_stmt_sel (nelts, nelts, 1);
+  vec_perm_builder seq1_v_2_stmt_sel (nelts, nelts, 1);
+  for (i = 0; i < nelts; i++)
+    {
+      /* Calculate the new mask for seq2->stmt.  */
+      tree val = VECTOR_CST_ELT (mask2, i);
+      unsigned HOST_WIDE_INT lane
+       = TREE_INT_CST_LOW (val) & (2 * nelts - 1);
+      unsigned off = (lane < nelts) ? 0 : nelts;
+      unsigned HOST_WIDE_INT new_lane
+       = *mask2_lanes.get (lane - off) + off;
+      seq2_stmt_sel.quick_push (new_lane);
+
+      /* Calculate the new masks for seq1->v_1_stmt and seq1->v_2_stmt.  */
+      bool use_src1 = mask1_lanes.get (i);
+      tree mask1 = gimple_assign_rhs3 (use_src1 ? seq1->v_1_stmt
+                                               : seq2->v_1_stmt);
+      tree mask2 = gimple_assign_rhs3 (use_src1 ? seq1->v_2_stmt
+                                               : seq2->v_2_stmt);
+      unsigned HOST_WIDE_INT lane1
+       = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask1, i)) % nelts;
+      unsigned HOST_WIDE_INT lane2
+       = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask2, i)) % nelts;
+      seq1_v_1_stmt_sel.quick_push (lane1 + (use_src1 ? 0 : nelts));
+      seq1_v_2_stmt_sel.quick_push (lane2 + (use_src1 ? 0 : nelts));
+    }
+
+  /* We don't need to adjust seq1->stmt because its lane consumption
+     was already tightened before entering this function.  */
+
+  /* Adjust seq2->stmt: copy RHS1/RHS2 from seq1->stmt and set new sel.  */
+  gimple_assign_set_rhs1 (seq2->stmt, gimple_assign_rhs1 (seq1->stmt));
+  gimple_assign_set_rhs2 (seq2->stmt, gimple_assign_rhs2 (seq1->stmt));
+  vec_perm_indices seq2_stmt_indices (seq2_stmt_sel, 2, nelts);
+  tree vectype = TREE_TYPE (gimple_assign_lhs (seq2->stmt));
+  tree mask = vect_gen_perm_mask_checked (vectype, seq2_stmt_indices);
+  gimple_assign_set_rhs3 (seq2->stmt, mask);
+  update_stmt (seq2->stmt);
+
+  /* Adjust seq1->v_1_stmt: copy RHS2 from seq2->v_1_stmt and set new sel.  */
+  gimple_assign_set_rhs2 (seq1->v_1_stmt, gimple_assign_rhs1 (seq2->v_1_stmt));
+  vec_perm_indices seq1_v_1_stmt_indices (seq1_v_1_stmt_sel, 2, nelts);
+  mask = vect_gen_perm_mask_checked (vectype, seq1_v_1_stmt_indices);
+  gimple_assign_set_rhs3 (seq1->v_1_stmt, mask);
+  update_stmt (seq1->v_1_stmt);
+
+  /* Adjust seq1->v_2_stmt: copy RHS2 from seq2->v_2_stmt and set new sel.  */
+  gimple_assign_set_rhs2 (seq1->v_2_stmt, gimple_assign_rhs1 (seq2->v_2_stmt));
+  vec_perm_indices seq1_v_2_stmt_indices (seq1_v_2_stmt_sel, 2, nelts);
+  mask = vect_gen_perm_mask_checked (vectype, seq1_v_2_stmt_indices);
+  gimple_assign_set_rhs3 (seq1->v_2_stmt, mask);
+  update_stmt (seq1->v_2_stmt);
+
+  /* At this point, we have four unmodified seq2 stmts, which will be
+     eliminated by DCE.  */
+
+  if (dump_file)
+    fprintf (dump_file, "Vec perm simplify sequences have been blended.\n\n");
+}
+
+/* Try to blend the vec perm simplify sequences in the list pairwise;
+   blendable pairs are narrowed first.
+   The provided list will be empty after this call.  */
+
+static void
+process_vec_perm_simplify_seq_list (vec<vec_perm_simplify_seq> *l)
+{
+  unsigned int i, j;
+  vec_perm_simplify_seq seq1, seq2;
+
+  if (!l->length ())
+    return;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "Processing %u vec perm simplify sequences.\n",
+            l->length ());
+
+  FOR_EACH_VEC_ELT (*l, i, seq1)
+    {
+      if (i + 1 < l->length ())
+       {
+         FOR_EACH_VEC_ELT_FROM (*l, j, seq2, i + 1)
+           {
+             bool swap = false;
+             if (can_blend_vec_perm_simplify_seqs_p (seq1, seq2, &swap))
+               {
+                 /* Compress lanes.  */
+                 narrow_vec_perm_simplify_seq (seq1);
+                 narrow_vec_perm_simplify_seq (seq2);
+
+                 /* Blend sequences.  */
+                 blend_vec_perm_simplify_seqs (swap ? seq2 : seq1,
+                                             swap ? seq1 : seq2);
+
+                 /* We can use unordered_remove as we break the loop below.  */
+                 l->unordered_remove (j);
+                 seq2->new_sel.release ();
+                 XDELETE (seq2);
+                 break;
+               }
+           }
+       }
+
+      /* We don't need to call l->remove for seq1.  */
+      seq1->new_sel.release ();
+      XDELETE (seq1);
+    }
+
+  l->truncate (0);
+}
+
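+/* Add the vec perm simplify sequence SEQ to the list L.  The list is
+   processed first in case there is no space left.  */
+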
+static void
+append_vec_perm_simplify_seq_list (vec<vec_perm_simplify_seq> *l,
+                                  const vec_perm_simplify_seq &seq)
+{
+  /* If there is no space left in the list, then process it first.  */
+  if (!l->space (1))
+    process_vec_perm_simplify_seq_list (l);
+
+  l->quick_push (seq);
+}
+
 /* Main entry point for the forward propagation and statement combine
    optimizer.  */
 
@@ -3526,6 +4054,7 @@ pass_forwprop::execute (function *fun)
   auto_bitmap simple_dce_worklist;
   auto_bitmap need_ab_cleanup;
   to_purge = BITMAP_ALLOC (NULL);
+  auto_vec<vec_perm_simplify_seq, 8> vec_perm_simplify_seq_list;
   for (int i = 0; i < postorder_num; ++i)
     {
       gimple_stmt_iterator gsi;
@@ -3605,14 +4134,18 @@ pass_forwprop::execute (function *fun)
 
       /* Apply forward propagation to all stmts in the basic-block.
         Note we update GSI within the loop as necessary.  */
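+      /* Assign stmt UIDs in order; the vec perm simplify code uses them
+         for inexpensive ordering and dependency checks.  */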
+      unsigned int uid = 1;
       for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); )
        {
          gimple *stmt = gsi_stmt (gsi);
          tree lhs, rhs;
          enum tree_code code;
 
+         gimple_set_uid (stmt, uid++);
+
          if (!is_gimple_assign (stmt))
            {
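+             /* The vec perm simplify analysis does not look across
+                statements other than SSA_NAME assignments; process the
+                candidate sequences found so far.  */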
+             process_vec_perm_simplify_seq_list (&vec_perm_simplify_seq_list);
              gsi_next (&gsi);
              continue;
            }
@@ -3620,9 +4153,11 @@ pass_forwprop::execute (function *fun)
          lhs = gimple_assign_lhs (stmt);
          rhs = gimple_assign_rhs1 (stmt);
          code = gimple_assign_rhs_code (stmt);
+
          if (TREE_CODE (lhs) != SSA_NAME
              || has_zero_uses (lhs))
            {
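+             /* Likewise for stores and assignments whose LHS has no uses.  */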
+             process_vec_perm_simplify_seq_list (&vec_perm_simplify_seq_list);
              gsi_next (&gsi);
              continue;
            }
@@ -3901,10 +4436,25 @@ pass_forwprop::execute (function *fun)
              else
                gsi_next (&gsi);
            }
+         else if (code == VEC_PERM_EXPR)
+           {
+             /* Find vectorized sequences where we can reduce the lane
+                utilization.  The narrowing will be done later, and only
+                if we find a pair of sequences that can be blended.  */
+             gassign *assign = dyn_cast <gassign *> (stmt);
+             vec_perm_simplify_seq seq;
+             if (recognise_vec_perm_simplify_seq (assign, &seq))
+               append_vec_perm_simplify_seq_list (&vec_perm_simplify_seq_list,
+                                                  seq);
+
+             gsi_next (&gsi);
+           }
          else
            gsi_next (&gsi);
        }
 
+      process_vec_perm_simplify_seq_list (&vec_perm_simplify_seq_list);
+
       /* Combine stmts with the stmts defining their operands.
         Note we update GSI within the loop as necessary.  */
       for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
-- 
2.47.0
