This extends forwprop with yet another VEC_PERM optimization:
it attempts to merge two isomorphic vector sequences by exploiting
the redundancy in the lane utilization of these sequences.
This redundancy in lane utilization comes from the way specific
scalar statements end up vectorized: two VEC_PERMs on top, binary operations
on both of them, and a final VEC_PERM to create the result.
Here is an example of this sequence:

  v_in = {e0, e1, e2, e3}
  v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}>
  // v_1 = {e0, e2, e0, e2}
  v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}>
  // v_2 = {e1, e3, e1, e3}

  v_x = v_1 + v_2
  // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
  v_y = v_1 - v_2
  // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}

  v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}>
  // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}

To remove the redundancy, lanes 2 and 3 can be freed, which allows the
last statement to be changed into:
  v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}>
  // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3}

The cost of eliminating the redundancy in the lane utilization is that
lowering the VEC_PERM expression could become more expensive because of
the tighter packing of the lanes.  Therefore this optimization is not
applied on its own, but only if we identify two such sequences that can
be merged.

Once all candidate sequences have been identified, we try to pair them
up, so that we can use the freed lane for the second sequence.
On success, we replace 2x (2x VEC_PERM + 2x BINOP + 1x VEC_PERM)
statements with 2x VEC_PERM + 2x BINOP + 2x VEC_PERM; i.e.,
2x VEC_PERM and 2x BINOP are eliminated.

This targets x264_pixel_satd_8x4, which calculates the sum of absolute
transformed differences (SATD) using Hadamard transformation.
We have seen 8% speedup on SPEC's x264 on a 5950X (x86-64) and 7%
speedup on an AArch64 machine.

Bootstrapped and reg-tested on x86-64 and AArch64 (all languages).

gcc/ChangeLog:

        * tree-ssa-forwprop.cc (struct _vec_perm_simplify_seq): New data
        structure to store analysis results of a vec perm sequence.
        (get_tree_def): New helper to get the defining statement of an
        SSA_NAME.
        (get_vect_selector_index_map): Helper to get an index map from
        the provided vector permute selector.
        (recognise_vec_perm_simplify_seq): Helper to recognise a
        vec perm sequence.
        (compress_vec_perm_simplify_seq): Helper to pack the lanes more
        tightly.
        (can_merge_vec_perm_simplify_seqs_p): Test if two vec perm sequences can
        be merged.
        (merge_vec_perm_simplify_seqs): Helper to merge two vec perm sequences.
        (pass_forwprop::execute): Add calls to the new vec perm sequence
        functions.

gcc/testsuite/ChangeLog:

        * gcc.dg/tree-ssa/satd-hadamard.c: New test.
        * gcc.dg/tree-ssa/vector-10.c: New test.
        * gcc.dg/tree-ssa/vector-11.c: New test.
        * gcc.dg/tree-ssa/vector-8.c: New test.
        * gcc.dg/tree-ssa/vector-9.c: New test.
        * gcc.target/aarch64/sve/satd-hadamard.c: New test.

Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
---
Changes from v2:
* Moved code to tree-ssa-forwprop.cc where similar VEC_PERM
  optimizations are implemented.
* Made the operand order test less strict for commutative operators.
* Made naming more consistent.
* Added a test for testing dependencies between two sequences.
* Removed the instruction reordering (not necessary without dependencies).
* Added tests based on __builtin_shuffle ().

Changes from v1:
* Moved code from tree-vect-slp.cc into a new pass (from where it could
  be moved elsewhere).
* Only deduplicate lanes if sequences will be merged later on.
* Split functionality stricter into analysis and transformation parts.

Manolis Tsamis was the patch's initial author before I took it over.

 gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c |  43 ++
 gcc/testsuite/gcc.dg/tree-ssa/vector-10.c     |  32 ++
 gcc/testsuite/gcc.dg/tree-ssa/vector-11.c     |  35 ++
 gcc/testsuite/gcc.dg/tree-ssa/vector-8.c      |  31 +
 gcc/testsuite/gcc.dg/tree-ssa/vector-9.c      |  31 +
 .../gcc.target/aarch64/sve/satd-hadamard.c    |   3 +
 gcc/tree-ssa-forwprop.cc                      | 538 ++++++++++++++++++
 7 files changed, 713 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-11.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
new file mode 100644
index 00000000000..576ef01628c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-forwprop4-details" } */
+
+#include <stdint.h>
+
+#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
+    int t0 = s0 + s1;\
+    int t1 = s0 - s1;\
+    int t2 = s2 + s3;\
+    int t3 = s2 - s3;\
+    d0 = t0 + t2;\
+    d1 = t1 + t3;\
+    d2 = t0 - t2;\
+    d3 = t1 - t3;\
+}
+
+int
+x264_pixel_satd_8x4_simplified (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
+{
+  uint32_t tmp[4][4];
+  uint32_t a0, a1, a2, a3;
+  int sum = 0;
+  int i;
+
+  for (i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
+    {
+      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
+      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
+      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
+      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
+      HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3);
+    }
+
+  for (i = 0; i < 4; i++)
+    {
+      HADAMARD4(a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]);
+      sum += a0 + a1 + a2 + a3;
+    }
+
+  return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
+}
+
+/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop4" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
new file mode 100644
index 00000000000..510a8e814d7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
+
+typedef int vec __attribute__((vector_size (4 * sizeof (int))));
+
+void h (vec *p_v_in_1, vec *p_v_in_2, vec *v_out_1, vec *v_out_2)
+{
+  vec sel0 = { 0, 2, 0, 2 };
+  vec sel1 = { 1, 3, 1, 3 };
+  vec sel = { 0, 1, 6, 7 };
+  vec v_1, v_2, v_x, v_y;
+  vec v_in_1 = *p_v_in_1;
+  vec v_in_2 = *p_v_in_2;
+
+  /* First vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
+  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  *v_out_1 = __builtin_shuffle (v_x, v_y, sel);
+
+  /* Second vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
+  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
+  v_x = v_1 + v_2;
+  /* Won't merge because v_2 is RHS1 here.  */
+  v_y = v_2 - v_1;
+  *v_out_2 = __builtin_shuffle (v_x, v_y, sel);
+}
+
+/* { dg-final { scan-tree-dump-not "Vec perm simplify sequences have been merged" "forwprop1" } } */
+/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-11.c b/gcc/testsuite/gcc.dg/tree-ssa/vector-11.c
new file mode 100644
index 00000000000..e0ef92d9d56
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-11.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
+
+typedef int vec __attribute__((vector_size (4 * sizeof (int))));
+
+extern vec foo (vec *v);
+void i (vec *p_v_in_1, vec *v_out_1, vec *v_out_2)
+{
+  vec sel0 = { 0, 2, 0, 2 };
+  vec sel1 = { 1, 3, 1, 3 };
+  vec sel = { 0, 1, 6, 7 };
+  vec v_1, v_2, v_x, v_y;
+  vec v_in_1 = *p_v_in_1;
+  vec v_in_2;
+
+  /* First vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
+  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  *v_out_1 = __builtin_shuffle (v_x, v_y, sel);
+
+  /* Won't merge because of dependency.  */
+  v_in_2 = foo (v_out_1);
+
+  /* Second vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
+  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  *v_out_2 = __builtin_shuffle (v_x, v_y, sel);
+}
+
+/* { dg-final { scan-tree-dump-not "Vec perm simplify sequences have been merged" "forwprop1" } } */
+/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
new file mode 100644
index 00000000000..6b7040d4047
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
+
+typedef int vec __attribute__((vector_size (4 * sizeof (int))));
+
+void f (vec *p_v_in_1, vec *p_v_in_2, vec *v_out_1, vec *v_out_2)
+{
+  vec sel0 = { 0, 2, 0, 2 };
+  vec sel1 = { 1, 3, 1, 3 };
+  vec sel = { 0, 1, 6, 7 };
+  vec v_1, v_2, v_x, v_y;
+  vec v_in_1 = *p_v_in_1;
+  vec v_in_2 = *p_v_in_2;
+
+  /* First vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
+  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  *v_out_1 = __builtin_shuffle (v_x, v_y, sel);
+
+  /* Second vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
+  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
+  v_x = v_1 + v_2;
+  v_y = v_1 - v_2;
+  *v_out_2 = __builtin_shuffle (v_x, v_y, sel);
+}
+
+/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been merged" "forwprop1" } } */
+/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
new file mode 100644
index 00000000000..bb8e53faacb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */
+
+typedef int vec __attribute__((vector_size (4 * sizeof (int))));
+
+void g (vec *p_v_in_1, vec *p_v_in_2, vec *v_out_1, vec *v_out_2)
+{
+  vec sel0 = { 0, 2, 0, 2 };
+  vec sel1 = { 1, 3, 1, 3 };
+  vec sel = { 0, 1, 6, 7 };
+  vec v_1, v_2, v_x, v_y;
+  vec v_in_1 = *p_v_in_1;
+  vec v_in_2 = *p_v_in_2;
+
+  /* First vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0);
+  v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1);
+  v_x = v_1 * v_2;
+  v_y = v_2 - v_1;
+  *v_out_1 = __builtin_shuffle (v_x, v_y, sel);
+
+  /* Second vec perm sequence.  */
+  v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0);
+  v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1);
+  v_x = v_2 * v_1;
+  v_y = v_2 - v_1;
+  *v_out_2 = __builtin_shuffle (v_x, v_y, sel);
+}
+
+/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been merged" "forwprop1" } } */
+/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop1" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
new file mode 100644
index 00000000000..fcd140e3584
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
@@ -0,0 +1,3 @@
+#include "../../../gcc.dg/tree-ssa/satd-hadamard.c"
+
+/* { dg-final { scan-assembler-not {\ttbl\t} } } */
diff --git a/gcc/tree-ssa-forwprop.cc b/gcc/tree-ssa-forwprop.cc
index cb1e86c6912..465fd278628 100644
--- a/gcc/tree-ssa-forwprop.cc
+++ b/gcc/tree-ssa-forwprop.cc
@@ -50,6 +50,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "optabs-tree.h"
 #include "insn-config.h"
 #include "recog.h"
+#include "cfgloop.h"
+#include "tree-vectorizer.h"
 #include "tree-vector-builder.h"
 #include "vec-perm-indices.h"
 #include "internal-fn.h"
@@ -184,6 +186,37 @@ along with GCC; see the file COPYING3.  If not see
 
    This will (of course) be extended as other needs arise.  */
 
+/* Data structure that holds a lane utilization map.  */
+typedef hash_map<int_hash <unsigned HOST_WIDE_INT, -1U>,
+                          unsigned HOST_WIDE_INT> lane_map_t;
+
+/* Data structure that contains simplifiable vectorized permute sequences.
+     v_in = {e0, e1, e2, e3}
+     v_1 = VEC_PERM <v_in, v_in, v_1_sel>
+     v_2 = VEC_PERM <v_in, v_in, v_2_sel>
+     v_y = v_1 - v_2
+     v_x = v_2 + v_1
+     v_out = VEC_PERM <v_x, v_y, v_out_sel>
+   The sequence is simplifiable because it has the potential to free up
+   half of the utilized lanes.  This allows merging two sequences if
+   they use similar operators.  */
+
+struct _vec_perm_simplify_seq
+{
+  /* Defining stmts of vectors.  */
+  gassign *v_1_stmt;
+  gassign *v_2_stmt;
+  gassign *v_x_stmt;
+  gassign *v_y_stmt;
+  /* Final permute statement.  */
+  gassign *stmt;
+  /* Number of elements in each vector and selector.  */
+  unsigned HOST_WIDE_INT nelts;
+  /* New selector indices for stmt.  */
+  vec<unsigned HOST_WIDE_INT> new_sel;
+};
+typedef struct _vec_perm_simplify_seq *vec_perm_simplify_seq;
+
 static bool forward_propagate_addr_expr (tree, tree, bool);
 
 /* Set to true if we delete dead edges during the optimization.  */
@@ -225,6 +258,26 @@ fwprop_invalidate_lattice (tree name)
     lattice[SSA_NAME_VERSION (name)] = NULL_TREE;
 }
 
+/* If NAME is an SSA_NAME defined by an assignment, return that assignment.
+   If SINGLE_USE_ONLY is true and NAME has multiple uses, return NULL.  */
+
+static gassign *
+get_tree_def (tree name, bool single_use_only)
+{
+  if (TREE_CODE (name) != SSA_NAME)
+    return NULL;
+
+  gimple *def_stmt = SSA_NAME_DEF_STMT (name);
+  gassign *assign = dyn_cast <gassign *> (def_stmt);
+
+  if (!assign)
+    return NULL;
+
+  if (single_use_only && !has_single_use (name))
+    return NULL;
+
+  return assign;
+}
 
 /* Get the statement we can propagate from into NAME skipping
    trivial copies.  Returns the statement which defines the
@@ -3460,6 +3513,441 @@ fwprop_ssa_val (tree name)
   return name;
 }
 
+/* Get an index map from the provided vector permute selector
+   and return the number of unique indices.
+   E.g.: { 1, 3, 1, 3 } -> <0, 1, 0, 1>, 2
+        { 0, 2, 0, 2 } -> <0, 1, 0, 1>, 2
+        { 3, 2, 1, 0 } -> <0, 1, 2, 3>, 4.  */
+
+static unsigned HOST_WIDE_INT
+get_vect_selector_index_map (tree sel, lane_map_t *index_map)
+{
+  gcc_assert (VECTOR_CST_NELTS (sel).is_constant ());
+  unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant ();
+  unsigned HOST_WIDE_INT n = 0;
+
+  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
+    {
+      /* Extract the i-th value from the selector.  */
+      tree sel_cst_tree = VECTOR_CST_ELT (sel, i);
+      unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree);
+
+      unsigned HOST_WIDE_INT j = 0;
+      for (; j <= i; j++)
+       {
+         tree prev_sel_cst_tree = VECTOR_CST_ELT (sel, j);
+         unsigned HOST_WIDE_INT prev_sel_cst
+           = TREE_INT_CST_LOW (prev_sel_cst_tree);
+         if (prev_sel_cst == sel_cst)
+           break;
+       }
+      index_map->put (i, j);
+      n += (i == j) ? 1 : 0;
+    }
+
+  return n;
+}
+
+/* Search for opportunities to free half of the lanes in the following pattern:
+
+     v_in = {e0, e1, e2, e3}
+     v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}>
+     // v_1 = {e0, e2, e0, e2}
+     v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}>
+     // v_2 = {e1, e3, e1, e3}
+
+     v_x = v_1 + v_2
+     // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
+     v_y = v_1 - v_2
+     // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}
+
+     v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}>
+     // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
+
+   The last statement could be simplified to:
+     v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}>
+     // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3}
+
+   Characteristic properties:
+   - v_1 and v_2 are created from the same input vector v_in and introduce the
+     lane duplication (in the selection operand) that we can eliminate.
+   - v_x and v_y are results from lane-preserving operations that use v_1 and
+     v_2 as inputs.
+   - v_out is created by selecting from duplicated lanes.  */
+
+static bool
+recognise_vec_perm_simplify_seq (gassign *stmt, vec_perm_simplify_seq *seq)
+{
+  gcc_checking_assert (stmt);
+  gcc_checking_assert (gimple_assign_rhs_code (stmt) == VEC_PERM_EXPR);
+
+  /* Decompose the final vec permute statement.  */
+  tree v_x = gimple_assign_rhs1 (stmt);
+  tree v_y = gimple_assign_rhs2 (stmt);
+  tree sel = gimple_assign_rhs3 (stmt);
+
+  if (!VECTOR_CST_NELTS (sel).is_constant ())
+    return false;
+
+  unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant ();
+
+  /* Lookup the definition of v_x and v_y.  */
+  gassign *v_x_stmt = get_tree_def (v_x, true);
+  gassign *v_y_stmt = get_tree_def (v_y, true);
+  if (!v_x_stmt || !v_y_stmt)
+    return false;
+
+  /* Check the operations that define v_x and v_y.  */
+  if (TREE_CODE_CLASS (gimple_assign_rhs_code (v_x_stmt)) != tcc_binary
+      || TREE_CODE_CLASS (gimple_assign_rhs_code (v_y_stmt)) != tcc_binary)
+    return false;
+
+  tree v_x_1 = gimple_assign_rhs1 (v_x_stmt);
+  tree v_x_2 = gimple_assign_rhs2 (v_x_stmt);
+  tree v_y_1 = gimple_assign_rhs1 (v_y_stmt);
+  tree v_y_2 = gimple_assign_rhs2 (v_y_stmt);
+
+  if (v_x_stmt == v_y_stmt
+      || TREE_CODE (v_x_1) != SSA_NAME
+      || TREE_CODE (v_x_2) != SSA_NAME
+      || num_imm_uses (v_x_1) != 2
+      || num_imm_uses (v_x_2) != 2)
+    return false;
+
+  if (v_x_1 != v_y_1 || v_x_2 != v_y_2)
+    {
+      /* Allow operands of commutative operators to swap.  */
+      if (commutative_tree_code (gimple_assign_rhs_code (v_x_stmt)))
+       {
+         /* Keep v_x_1 the first operand for non-commutative operators.  */
+         v_x_1 = gimple_assign_rhs2 (v_x_stmt);
+         v_x_2 = gimple_assign_rhs1 (v_x_stmt);
+         if (v_x_1 != v_y_1 || v_x_2 != v_y_2)
+           return false;
+       }
+      else if (commutative_tree_code (gimple_assign_rhs_code (v_y_stmt)))
+       {
+         if (v_x_1 != v_y_2 || v_x_2 != v_y_1)
+           return false;
+       }
+      else
+       return false;
+    }
+
+  gassign *v_1_stmt = get_tree_def (v_x_1, false);
+  gassign *v_2_stmt = get_tree_def (v_x_2, false);
+  if (!v_1_stmt || !v_2_stmt)
+    return false;
+
+  if (gimple_assign_rhs_code (v_1_stmt) != VEC_PERM_EXPR
+      || gimple_assign_rhs_code (v_2_stmt) != VEC_PERM_EXPR)
+    return false;
+
+  /* Decompose initial VEC_PERM_EXPRs.  */
+  tree v_in = gimple_assign_rhs1 (v_1_stmt);
+  tree v_1_sel = gimple_assign_rhs3 (v_1_stmt);
+  tree v_2_sel = gimple_assign_rhs3 (v_2_stmt);
+  if (v_in != gimple_assign_rhs2 (v_1_stmt)
+      || v_in != gimple_assign_rhs1 (v_2_stmt)
+      || v_in != gimple_assign_rhs2 (v_2_stmt))
+    return false;
+
+  if (!VECTOR_CST_NELTS (v_1_sel).is_constant ()
+      || !VECTOR_CST_NELTS (v_2_sel).is_constant ())
+    return false;
+
+  if (nelts != VECTOR_CST_NELTS (v_1_sel).to_constant ()
+      || nelts != VECTOR_CST_NELTS (v_2_sel).to_constant ())
+    return false;
+
+  /* Now check permutation selection operands.  */
+  lane_map_t v_1_lane_map, v_2_lane_map;
+  unsigned HOST_WIDE_INT v_1_lanes, v_2_lanes;
+  v_1_lanes = get_vect_selector_index_map (v_1_sel, &v_1_lane_map);
+  v_2_lanes = get_vect_selector_index_map (v_2_sel, &v_2_lane_map);
+
+  /* Check if we could free up half of the lanes.  */
+  if (v_1_lanes != v_2_lanes || v_1_lanes > (nelts / 2))
+    return false;
+
+  if (dump_enabled_p ())
+    {
+      fprintf (dump_file, "Found vec perm simplify sequence ending with: ");
+      print_gimple_stmt (dump_file, stmt, 0);
+
+      if (dump_flags & TDF_DETAILS)
+       fprintf (dump_file, "\tPossible new selector: { ");
+    }
+
+  *seq = XNEW (struct _vec_perm_simplify_seq);
+  (*seq)->stmt = stmt;
+  (*seq)->v_1_stmt = v_1_stmt;
+  (*seq)->v_2_stmt = v_2_stmt;
+  (*seq)->v_x_stmt = v_x_stmt;
+  (*seq)->v_y_stmt = v_y_stmt;
+  (*seq)->nelts = nelts;
+  (*seq)->new_sel = vNULL;
+
+  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
+    {
+      /* Extract the i-th value from the selector.  */
+      tree sel_cst_tree = VECTOR_CST_ELT (sel, i);
+      unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree);
+
+      unsigned HOST_WIDE_INT j;
+      if (sel_cst < nelts)
+       j = *v_1_lane_map.get (sel_cst);
+      else
+       j = *v_2_lane_map.get (sel_cst - nelts) + nelts;
+
+      (*seq)->new_sel.safe_push (j);
+
+      if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
+       {
+         fprintf (dump_file, HOST_WIDE_INT_PRINT_UNSIGNED, j);
+         if (i != (nelts - 1))
+           fprintf (dump_file, ", ");
+       }
+    }
+
+  if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, " }\n");
+
+  return true;
+}
+
+/* Reduce the lane consumption of a simplifiable vec perm sequence.  */
+
+static void
+compress_vec_perm_simplify_seq (vec_perm_simplify_seq seq)
+{
+  unsigned HOST_WIDE_INT nelts = seq->nelts;
+  int i;
+  unsigned HOST_WIDE_INT idx;
+  vec_perm_builder new_sel (nelts, nelts, 1);
+
+  FOR_EACH_VEC_ELT (seq->new_sel, i, idx)
+      new_sel.quick_push (idx);
+
+  gassign *stmt = seq->stmt;
+  if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Updating VEC_PERM statement:\n");
+      fprintf (dump_file, "Old stmt: ");
+      print_gimple_stmt (dump_file, stmt, 0);
+    }
+
+  /* Update the last VEC_PERM statement.  */
+  vec_perm_indices new_indices (new_sel, 2, nelts);
+  tree vectype = TREE_TYPE (gimple_assign_lhs (stmt));
+  tree m = vect_gen_perm_mask_checked (vectype, new_indices);
+  gimple_assign_set_rhs3 (stmt, m);
+  update_stmt (stmt);
+
+  if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "New stmt: ");
+      print_gimple_stmt (dump_file, stmt, 0);
+    }
+}
+
+/* Test if we can merge two simplifiable vec permute sequences.  */
+
+static bool
+can_merge_vec_perm_simplify_seqs_p (vec_perm_simplify_seq seq1,
+                                   vec_perm_simplify_seq seq2)
+{
+  unsigned HOST_WIDE_INT nelts = seq1->nelts;
+
+  /* Number of elements must be equal.  */
+  if (seq1->nelts != seq2->nelts)
+    return false;
+
+  /* We need vectors of the same type.  */
+  if (TREE_TYPE (gimple_assign_lhs (seq1->stmt))
+      != TREE_TYPE (gimple_assign_lhs (seq2->stmt)))
+    return false;
+
+  /* We require isomorphic operators.  */
+  if (((gimple_assign_rhs_code (seq1->v_x_stmt)
+       != gimple_assign_rhs_code (seq2->v_x_stmt))
+       || (gimple_assign_rhs_code (seq1->v_y_stmt)
+          != gimple_assign_rhs_code (seq2->v_y_stmt))))
+    return false;
+
+  /* We require equal selectors for v_1_stmt and v_2_stmt.
+     This ensures we don't introduce a complex permutation
+     in the merged sequence.  */
+  tree seq1_v_1_sel = gimple_assign_rhs3 (seq1->v_1_stmt);
+  tree seq1_v_2_sel = gimple_assign_rhs3 (seq1->v_2_stmt);
+  tree seq2_v_1_sel = gimple_assign_rhs3 (seq2->v_1_stmt);
+  tree seq2_v_2_sel = gimple_assign_rhs3 (seq2->v_2_stmt);
+  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
+    {
+      if ((TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_1_sel, i))
+          != TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_1_sel, i)))
+         || (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_2_sel, i))
+             != (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_2_sel, i)))))
+       return false;
+    }
+
+  /* We cannot have any dependencies between the sequences.
+     Therefore, v_in of both sequences must be defined before
+     v_1_stmt and v_2_stmt.  */
+  tree seq1_v_in = gimple_assign_rhs1 (seq1->v_1_stmt);
+  gassign *seq1_v_in_stmt = get_tree_def (seq1_v_in, false);
+  tree seq2_v_in = gimple_assign_rhs1 (seq2->v_1_stmt);
+  gassign *seq2_v_in_stmt = get_tree_def (seq2_v_in, false);
+
+  /* Don't merge if either v_in is not created by a statement.  */
+  if (!seq1_v_in_stmt || !seq2_v_in_stmt)
+    return false;
+
+  /* Ensure that seq1_v_in is defined before seq2_v_1.  */
+  gimple_stmt_iterator gsi_lookup = gsi_for_stmt (seq1_v_in_stmt);
+  bool found = false;
+  while (!gsi_end_p (gsi_lookup))
+    {
+      gimple *s = gsi_stmt (gsi_lookup);
+      if (s == seq2_v_in_stmt || s == seq2->v_1_stmt)
+       {
+         found = true;
+         break;
+       }
+      gsi_next (&gsi_lookup);
+    }
+
+  if (!found)
+    return false;
+
+  /* Ensure that seq2_v_in is defined before seq1_v_1.  */
+  gsi_lookup = gsi_for_stmt (seq2_v_in_stmt);
+  found = false;
+  while (!gsi_end_p (gsi_lookup))
+    {
+      gimple *s = gsi_stmt (gsi_lookup);
+      if (s == seq1_v_in_stmt || s == seq1->v_1_stmt)
+       {
+         found = true;
+         break;
+       }
+      gsi_next (&gsi_lookup);
+    }
+
+  if (!found)
+    return false;
+
+  if (dump_enabled_p ())
+    fprintf (dump_file, "Found vec perm simplify sequence pair.\n");
+
+  return true;
+}
+
+/* Merge the two given simplifiable vec permute sequences.  */
+
+static void
+merge_vec_perm_simplify_seqs (vec_perm_simplify_seq seq1,
+                             vec_perm_simplify_seq seq2)
+{
+  unsigned HOST_WIDE_INT nelts = seq1->nelts;
+
+  /* Calculate the lanes that the first permute needs.  */
+  tree mask1 = gimple_assign_rhs3 (seq1->stmt);
+  lane_map_t mask1_lanes;
+  unsigned HOST_WIDE_INT i;
+  for (i = 0; i < nelts; i++)
+    {
+      tree val = VECTOR_CST_ELT (mask1, i);
+      unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts;
+      mask1_lanes.put (j, j);
+    }
+
+  /* Calculate the lanes that the second permute needs and rewrite
+     them to use the lanes that are unused from the first permute.  */
+  tree mask2 = gimple_assign_rhs3 (seq2->stmt);
+  lane_map_t mask2_lanes;
+  unsigned HOST_WIDE_INT lane_gen = 0;
+  for (i = 0; i < nelts; i++)
+    {
+      tree val = VECTOR_CST_ELT (mask2, i);
+      unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts;
+      bool existed;
+      unsigned HOST_WIDE_INT &rewrite_lane
+       = mask2_lanes.get_or_insert (j, &existed);
+      if (existed)
+       continue;
+
+      while (mask1_lanes.get (lane_gen))
+       lane_gen++;
+      rewrite_lane = lane_gen;
+      lane_gen++;
+    }
+
+  /* Ensure we don't need more than one permute.  */
+  gcc_assert (i >= nelts);
+
+  /* Calculate the permutations based on MASK1_LANES/MASK2_LANES.  */
+  vec_perm_builder seq2_stmt_sel (nelts, nelts, 1);
+  vec_perm_builder seq1_v_1_stmt_sel (nelts, nelts, 1);
+  vec_perm_builder seq1_v_2_stmt_sel (nelts, nelts, 1);
+  for (i = 0; i < nelts; i++)
+    {
+      /* Calculate new mask for STMT2.  */
+      tree val = VECTOR_CST_ELT (mask2, i);
+      unsigned HOST_WIDE_INT lane
+       = TREE_INT_CST_LOW (val) & (2 * nelts - 1);
+      unsigned off = (lane < nelts) ? 0 : nelts;
+      unsigned HOST_WIDE_INT new_lane
+       = *mask2_lanes.get (lane - off) + off;
+      seq2_stmt_sel.quick_push (new_lane);
+
+      /* Calculate new masks for SRC1_1 and SRC1_2.  */
+      bool use_src1 = mask1_lanes.get (i);
+      tree mask1 = gimple_assign_rhs3 (use_src1 ? seq1->v_1_stmt
+                                               : seq2->v_1_stmt);
+      tree mask2 = gimple_assign_rhs3 (use_src1 ? seq1->v_2_stmt
+                                               : seq2->v_2_stmt);
+      unsigned HOST_WIDE_INT lane1
+       = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask1, i)) % nelts;
+      unsigned HOST_WIDE_INT lane2
+       = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask2, i)) % nelts;
+      seq1_v_1_stmt_sel.quick_push (lane1 + (use_src1 ? 0 : nelts));
+      seq1_v_2_stmt_sel.quick_push (lane2 + (use_src1 ? 0 : nelts));
+    }
+
+  /* We don't need to adjust seq1->stmt because its lane utilization
+     was already compressed before entering this function.  */
+
+  /* Adjust seq2->stmt: copy RHS1/RHS2 from seq1->stmt and set new sel.  */
+  gimple_assign_set_rhs1 (seq2->stmt, gimple_assign_rhs1 (seq1->stmt));
+  gimple_assign_set_rhs2 (seq2->stmt, gimple_assign_rhs2 (seq1->stmt));
+  vec_perm_indices seq2_stmt_indices (seq2_stmt_sel, 2, nelts);
+  tree vectype = TREE_TYPE (gimple_assign_lhs (seq2->stmt));
+  tree mask = vect_gen_perm_mask_checked (vectype, seq2_stmt_indices);
+  gimple_assign_set_rhs3 (seq2->stmt, mask);
+  update_stmt (seq2->stmt);
+
+  /* Adjust seq1->v_1_stmt: copy RHS2 from seq2->v_1_stmt and set new sel.  */
+  gimple_assign_set_rhs2 (seq1->v_1_stmt, gimple_assign_rhs1 (seq2->v_1_stmt));
+  vec_perm_indices seq1_v_1_stmt_indices (seq1_v_1_stmt_sel, 2, nelts);
+  mask = vect_gen_perm_mask_checked (vectype, seq1_v_1_stmt_indices);
+  gimple_assign_set_rhs3 (seq1->v_1_stmt, mask);
+  update_stmt (seq1->v_1_stmt);
+
+  /* Adjust seq1->v_2_stmt: copy RHS2 from seq2->v_2_stmt and set new sel.  */
+  gimple_assign_set_rhs2 (seq1->v_2_stmt, gimple_assign_rhs1 (seq2->v_2_stmt));
+  vec_perm_indices seq1_v_2_stmt_indices (seq1_v_2_stmt_sel, 2, nelts);
+  mask = vect_gen_perm_mask_checked (vectype, seq1_v_2_stmt_indices);
+  gimple_assign_set_rhs3 (seq1->v_2_stmt, mask);
+  update_stmt (seq1->v_2_stmt);
+
+  /* At this point, we have four unmodified seq2 stmts, which will be
+     eliminated by DCE.  */
+
+  if (dump_enabled_p ())
+    fprintf (dump_file, "Vec perm simplify sequences have been merged.\n\n");
+}
+
 /* Main entry point for the forward propagation and statement combine
    optimizer.  */
 
@@ -3532,6 +4020,7 @@ pass_forwprop::execute (function *fun)
       basic_block bb = BASIC_BLOCK_FOR_FN (fun, postorder[i]);
       edge_iterator ei;
       edge e;
+      vec<vec_perm_simplify_seq> vec_perm_simplify_seq_list = vNULL;
 
       /* Skip processing not executable blocks.  We could improve
         single_use tracking by at least unlinking uses from unreachable
@@ -3901,10 +4390,59 @@ pass_forwprop::execute (function *fun)
              else
                gsi_next (&gsi);
            }
+         else if (code == VEC_PERM_EXPR)
+           {
+             /* Find vectorized sequences, where we can reduce
+                the lane utilization.  This compression will be
+                done later, if we find a matching pair of
+                sequences.  */
+             gassign *assign = dyn_cast <gassign *> (stmt);
+             vec_perm_simplify_seq seq;
+             if (recognise_vec_perm_simplify_seq (assign, &seq))
+               vec_perm_simplify_seq_list.safe_push (seq);
+
+             gsi_next (&gsi);
+           }
          else
            gsi_next (&gsi);
        }
 
+      /* If vec perm simplify sequences are found, then try to merge them.  */
+      if (dump_enabled_p () && !vec_perm_simplify_seq_list.is_empty ())
+       {
+         fprintf (dump_file, "Found %u vec perm simplify sequences.  "
+                  "Trying to merge them.\n\n",
+                  vec_perm_simplify_seq_list.length ());
+       }
+
+      while (!vec_perm_simplify_seq_list.is_empty ())
+       {
+         vec_perm_simplify_seq seq1 = vec_perm_simplify_seq_list[0];
+         vec_perm_simplify_seq_list.ordered_remove (0);
+
+         int i;
+         vec_perm_simplify_seq seq2;
+         FOR_EACH_VEC_ELT (vec_perm_simplify_seq_list, i, seq2)
+           {
+             if (can_merge_vec_perm_simplify_seqs_p (seq1, seq2))
+               {
+                 vec_perm_simplify_seq_list.ordered_remove (i);
+                 compress_vec_perm_simplify_seq (seq1);
+                 compress_vec_perm_simplify_seq (seq2);
+
+                 merge_vec_perm_simplify_seqs (seq1, seq2);
+
+                 seq2->new_sel.release ();
+                 XDELETE (seq2);
+                 break;
+               }
+           }
+
+         seq1->new_sel.release ();
+         XDELETE (seq1);
+       }
+      vec_perm_simplify_seq_list.release ();
+
       /* Combine stmts with the stmts defining their operands.
         Note we update GSI within the loop as necessary.  */
       for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
-- 
2.47.0
