On Fri, Nov 15, 2024 at 10:41 PM Andrew Pinski <pins...@gmail.com> wrote: > > On Fri, Nov 15, 2024 at 1:30 PM Christoph Müllner > <christoph.muell...@vrull.eu> wrote: > > > > This extends forwprop by yet another VEC_PERM optimization: > > It attempts to blend two isomorphic vector sequences by using the > > redundancy in the lane utilization in these sequences. > > This redundancy in lane utilization comes from the way specific > > scalar statements end up being vectorized: two VEC_PERMs on top, binary operations > > on both of them, and a final VEC_PERM to create the result. > > Here is an example of this sequence: > > > > v_in = {e0, e1, e2, e3} > > v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}> > > // v_1 = {e0, e2, e0, e2} > > v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}> > > // v_2 = {e1, e3, e1, e3} > > > > v_x = v_1 + v_2 > > // v_x = {e0+e1, e2+e3, e0+e1, e2+e3} > > v_y = v_1 - v_2 > > // v_y = {e0-e1, e2-e3, e0-e1, e2-e3} > > > > v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}> > > // v_out = {e0+e1, e2+e3, e0-e1, e2-e3} > > > > To remove the redundancy, lanes 2 and 3 can be freed, which allows us to > > change the last statement into: > > v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}> > > // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3} > > > > The cost of eliminating the redundancy in the lane utilization is that > > lowering the VEC_PERM expression could get more expensive because of > > tighter packing of the lanes. Therefore, this optimization is not done > > in isolation, but only if we identify two such sequences that can be > > blended. > > > > Once all candidate sequences have been identified, we try to blend them, > > so that we can use the freed lanes for the second sequence (a sketch of > > the blended result is shown below, after the change log). > > On success we convert 2x (2x VEC_PERM + 2x BINOP + 1x VEC_PERM) into > > 2x VEC_PERM + 2x BINOP + 2x VEC_PERM, i.e. 6x VEC_PERM + 4x BINOP are > > traded for 4x VEC_PERM + 2x BINOP. > > > > The implemented transformation reuses (rewrites) the statements > > of the first sequence and the last VEC_PERM of the second sequence. > > The remaining four statements of the second sequence are left untouched > > and will be eliminated by DCE later. > > > > This targets x264_pixel_satd_8x4, which calculates the sum of absolute > > transformed differences (SATD) using a Hadamard transformation. > > We have seen an 8% speedup on SPEC's x264 on a 5950X (x86-64) and a 7% > > speedup on an AArch64 machine. > > > > Bootstrapped and reg-tested on x86-64 and AArch64 (all languages). > > This is more of a style issue rather than a correctness issue, but > + if (!l->length ()) > + return; > > Maybe it should be: > if (l->is_empty ()) > return; > > Which makes it more obvious what is going on. > > Also this: > + FOR_EACH_VEC_ELT (seq->new_sel, i, idx) > + new_sel.quick_push (idx); > > Could just be: > for (auto idx : seq->new_sel) > new_sel.quick_push (idx); > > Using C++11 range-based for in this case makes it easier to understand > what is happening.
Fixed in a v5: https://gcc.gnu.org/pipermail/gcc-patches/2024-November/669025.html Thanks, Christoph > > > > gcc/ChangeLog: > > > > * tree-ssa-forwprop.cc (struct _vec_perm_simplify_seq): New data > > structure to store analysis results of a vec perm simplify sequence. > > (get_tree_def): New helper to get the defining statement of an > > SSA_NAME. > > (get_vect_selector_index_map): Helper to get an index map from the > > provided vector permute selector. > > (recognise_vec_perm_simplify_seq): Helper to recognise a > > vec perm simplify sequence. > > (narrow_vec_perm_simplify_seq): Helper to pack the lanes more > > tightly. > > (can_blend_vec_perm_simplify_seqs_p): Test if two vec perm > > sequences can be blended. > > (blend_vec_perm_simplify_seqs): Helper to blend two vec perm > > simplify sequences. > > (process_vec_perm_simplify_seq_list): Helper to process a list > > of vec perm simplify sequences. > > (append_vec_perm_simplify_seq_list): Helper to add a vec perm > > simplify sequence to the list. > > (pass_forwprop::execute): Integrate new functionality. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.dg/tree-ssa/satd-hadamard.c: New test. > > * gcc.dg/tree-ssa/vector-10.c: New test. > > * gcc.dg/tree-ssa/vector-8.c: New test. > > * gcc.dg/tree-ssa/vector-9.c: New test. > > * gcc.target/aarch64/sve/satd-hadamard.c: New test. > > > > Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu> > > --- > > Changes from v3: > > * Fixed the test condition for writing to the dump file. > > * Used gimple UIDs instead of expensive walks to compare statement order. > > * Ensured we do not blend across assignments to SSA_NAMEs. > > * Restricted the list to a fixed-size vector with 8 entries. > > * Removed calls to expensive vec methods by restructuring the code. > > * Improved wording. > > > > Changes from v2: > > * Moved code to tree-ssa-forwprop.cc where similar VEC_PERM > > optimizations are implemented. > > * Tested operand order less strictly in the case of commutative operators. > > * Made naming more consistent. > > * Added a test for testing dependencies between two sequences. > > * Removed the instruction reordering (not necessary without dependencies). > > * Added tests based on __builtin_shuffle (). > > > > Changes from v1: > > * Moved code from tree-vect-slp.cc into a new pass (from where it could > > be moved elsewhere). > > * Only deduplicated lanes if sequences will be merged later on. > > * Split the functionality more strictly into analysis and transformation parts. > > > > Manolis Tsamis was the patch's initial author before I took it over.
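To make the blending step referenced above concrete, here is a sketch of what a successful blend of two such sequences produces. This illustration is not taken from the patch: the inputs v_inA/v_inB, the element names e*/f*, and the primed SSA names are made up, and it assumes two four-lane sequences using + and - as in the cover letter's example; the selectors follow from blend_vec_perm_simplify_seqs in the diff below:

  v_inA = {e0, e1, e2, e3}
  v_inB = {f0, f1, f2, f3}

  v_1' = VEC_PERM <v_inA, v_inB, {0, 2, 4, 6}>
  // v_1' = {e0, e2, f0, f2}
  v_2' = VEC_PERM <v_inA, v_inB, {1, 3, 5, 7}>
  // v_2' = {e1, e3, f1, f3}

  v_x' = v_1' + v_2'
  // v_x' = {e0+e1, e2+e3, f0+f1, f2+f3}
  v_y' = v_1' - v_2'
  // v_y' = {e0-e1, e2-e3, f0-f1, f2-f3}

  v_out_1 = VEC_PERM <v_x', v_y', {0, 1, 4, 5}>
  // v_out_1 = {e0+e1, e2+e3, e0-e1, e2-e3}
  v_out_2 = VEC_PERM <v_x', v_y', {2, 3, 6, 7}>
  // v_out_2 = {f0+f1, f2+f3, f0-f1, f2-f3}

The { 2, 3, 6, 7 } selector of the second output permute is exactly what the satd-hadamard.c and vector-8.c testcases below scan for.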
> > > > gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c | 43 ++ > > gcc/testsuite/gcc.dg/tree-ssa/vector-10.c | 122 ++++ > > gcc/testsuite/gcc.dg/tree-ssa/vector-8.c | 34 ++ > > gcc/testsuite/gcc.dg/tree-ssa/vector-9.c | 34 ++ > > .../gcc.target/aarch64/sve/satd-hadamard.c | 3 + > > gcc/tree-ssa-forwprop.cc | 550 ++++++++++++++++++ > > 6 files changed, 786 insertions(+) > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-10.c > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-8.c > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-9.c > > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c > > > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c > > b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c > > new file mode 100644 > > index 00000000000..576ef01628c > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c > > @@ -0,0 +1,43 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop4-details" } */ > > + > > +#include <stdint.h> > > + > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\ > > + int t0 = s0 + s1;\ > > + int t1 = s0 - s1;\ > > + int t2 = s2 + s3;\ > > + int t3 = s2 - s3;\ > > + d0 = t0 + t2;\ > > + d1 = t1 + t3;\ > > + d2 = t0 - t2;\ > > + d3 = t1 - t3;\ > > +} > > + > > +int > > +x264_pixel_satd_8x4_simplified (uint8_t *pix1, int i_pix1, uint8_t *pix2, > > int i_pix2) > > +{ > > + uint32_t tmp[4][4]; > > + uint32_t a0, a1, a2, a3; > > + int sum = 0; > > + int i; > > + > > + for (i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2) > > + { > > + a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16); > > + a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16); > > + a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16); > > + a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16); > > + HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, > > a3); > > + } > > + > > + for (i = 0; i < 4; i++) > > + { > > + HADAMARD4(a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], > > tmp[3][i]); > > + sum += a0 + a1 + a2 + a3; > > + } > > + > > + return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1; > > +} > > + > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop4" > > } } */ > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c > > b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c > > new file mode 100644 > > index 00000000000..d5caebdf174 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c > > @@ -0,0 +1,122 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */ > > + > > +typedef int vec __attribute__((vector_size (4 * sizeof (int)))); > > + > > +void f1 (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2) > > +{ > > + vec sel0 = { 0, 2, 0, 2 }; > > + vec sel1 = { 1, 3, 1, 3 }; > > + vec sel = { 0, 1, 6, 7 }; > > + vec v_1, v_2, v_x, v_y, v_out_1, v_out_2; > > + vec v_in_1 = *p_v_in_1; > > + vec v_in_2; > > + > > + /* First vec perm sequence. */ > > + v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0); > > + v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1); > > + v_x = v_1 + v_2; > > + v_y = v_1 - v_2; > > + v_out_1 = __builtin_shuffle (v_x, v_y, sel); > > + > > + /* Won't blend because v_in_2 is defined after v_1 above. */ > > + v_in_2 = *p_v_in_2; > > + /* Second vec perm sequence. 
*/ > > + v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0); > > + v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1); > > + v_x = v_1 + v_2; > > + v_y = v_1 - v_2; > > + v_out_2 = __builtin_shuffle (v_x, v_y, sel); > > + > > + *p_v_out_1 = v_out_1; > > + *p_v_out_2 = v_out_2; > > +} > > + > > +void f2 (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2) > > +{ > > + vec sel0 = { 0, 2, 0, 2 }; > > + vec sel1 = { 1, 3, 1, 3 }; > > + vec sel = { 0, 1, 6, 7 }; > > + vec v_1, v_2, v_x, v_y, v_out_1, v_out_2; > > + vec v_in_1 = *p_v_in_1; > > + vec v_in_2 = *p_v_in_2; > > + > > + /* First vec perm sequence. */ > > + v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0); > > + v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1); > > + v_x = v_1 + v_2; > > + v_y = v_1 - v_2; > > + v_out_1 = __builtin_shuffle (v_x, v_y, sel); > > + /* Won't blend because of this store between the sequences. */ > > + *p_v_out_1 = v_out_1; > > + > > + /* Second vec perm sequence. */ > > + v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0); > > + v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1); > > + v_x = v_1 + v_2; > > + v_y = v_1 - v_2; > > + v_out_2 = __builtin_shuffle (v_x, v_y, sel); > > + > > + *p_v_out_2 = v_out_2; > > +} > > + > > +void f3 (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2) > > +{ > > + vec sel0 = { 0, 2, 0, 2 }; > > + vec sel1 = { 1, 3, 1, 3 }; > > + vec sel = { 0, 1, 6, 7 }; > > + vec v_1, v_2, v_x, v_y, v_out_1, v_out_2; > > + vec v_in_1 = *p_v_in_1; > > + vec v_in_2 = *p_v_in_2; > > + > > + /* First vec perm sequence. */ > > + v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0); > > + v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1); > > + v_x = v_1 + v_2; > > + v_y = v_1 - v_2; > > + v_out_1 = __builtin_shuffle (v_x, v_y, sel); > > + > > + /* Second vec perm sequence. */ > > + v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0); > > + v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1); > > + v_x = v_1 + v_2; > > + /* Won't blend because v_2 is RHS1 here. */ > > + v_y = v_2 - v_1; > > + v_out_2 = __builtin_shuffle (v_x, v_y, sel); > > + > > + *p_v_out_1 = v_out_1; > > + *p_v_out_2 = v_out_2; > > +} > > + > > +extern vec foo (vec v); > > +void f4 (vec *p_v_in_1, vec *p_v_out_1, vec *p_v_out_2) > > +{ > > + vec sel0 = { 0, 2, 0, 2 }; > > + vec sel1 = { 1, 3, 1, 3 }; > > + vec sel = { 0, 1, 6, 7 }; > > + vec v_1, v_2, v_x, v_y, v_out_1, v_out_2; > > + vec v_in_1 = *p_v_in_1; > > + vec v_in_2; > > + > > + /* First vec perm sequence. */ > > + v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0); > > + v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1); > > + v_x = v_1 + v_2; > > + v_y = v_1 - v_2; > > + v_out_1 = __builtin_shuffle (v_x, v_y, sel); > > + > > + /* Won't merge because of dependency. */ > > + v_in_2 = foo (v_out_1); > > + > > + /* Second vec perm sequence. 
*/ > > + v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0); > > + v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1); > > + v_x = v_1 + v_2; > > + v_y = v_1 - v_2; > > + v_out_2 = __builtin_shuffle (v_x, v_y, sel); > > + > > + *p_v_out_1 = v_out_1; > > + *p_v_out_2 = v_out_2; > > +} > > + > > +/* { dg-final { scan-tree-dump-not "Vec perm simplify sequences have been > > blended" "forwprop1" } } */ > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c > > b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c > > new file mode 100644 > > index 00000000000..bc2269065e4 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c > > @@ -0,0 +1,34 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */ > > + > > +typedef int vec __attribute__((vector_size (4 * sizeof (int)))); > > + > > +void f (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2) > > +{ > > + vec sel0 = { 0, 2, 0, 2 }; > > + vec sel1 = { 1, 3, 1, 3 }; > > + vec sel = { 0, 1, 6, 7 }; > > + vec v_1, v_2, v_x, v_y, v_out_1, v_out_2; > > + vec v_in_1 = *p_v_in_1; > > + vec v_in_2 = *p_v_in_2; > > + > > + /* First vec perm sequence. */ > > + v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0); > > + v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1); > > + v_x = v_1 + v_2; > > + v_y = v_1 - v_2; > > + v_out_1 = __builtin_shuffle (v_x, v_y, sel); > > + > > + /* Second vec perm sequence. */ > > + v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0); > > + v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1); > > + v_x = v_1 + v_2; > > + v_y = v_1 - v_2; > > + v_out_2 = __builtin_shuffle (v_x, v_y, sel); > > + > > + *p_v_out_1 = v_out_1; > > + *p_v_out_2 = v_out_2; > > +} > > + > > +/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been > > blended" "forwprop1" } } */ > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop1" > > } } */ > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c > > b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c > > new file mode 100644 > > index 00000000000..e5f898e0281 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c > > @@ -0,0 +1,34 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */ > > + > > +typedef int vec __attribute__((vector_size (4 * sizeof (int)))); > > + > > +void f (vec *p_v_in_1, vec *p_v_in_2, vec *p_v_out_1, vec *p_v_out_2) > > +{ > > + vec sel0 = { 0, 2, 0, 2 }; > > + vec sel1 = { 1, 3, 1, 3 }; > > + vec sel = { 0, 1, 6, 7 }; > > + vec v_1, v_2, v_x, v_y, v_out_1, v_out_2; > > + vec v_in_1 = *p_v_in_1; > > + vec v_in_2 = *p_v_in_2; > > + > > + /* First vec perm sequence. */ > > + v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0); > > + v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1); > > + v_x = v_1 * v_2; > > + v_y = v_2 - v_1; > > + v_out_1 = __builtin_shuffle (v_x, v_y, sel); > > + > > + /* Second vec perm sequence. 
*/ > > + v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0); > > + v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1); > > + v_x = v_2 * v_1; > > + v_y = v_2 - v_1; > > + v_out_2 = __builtin_shuffle (v_x, v_y, sel); > > + > > + *p_v_out_1 = v_out_1; > > + *p_v_out_2 = v_out_2; > > +} > > + > > +/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been > > blended" "forwprop1" } } */ > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "forwprop1" > > } } */ > > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c > > b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c > > new file mode 100644 > > index 00000000000..fcd140e3584 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c > > @@ -0,0 +1,3 @@ > > +#include "../../../gcc.dg/tree-ssa/satd-hadamard.c" > > + > > +/* { dg-final { scan-assembler-not {\ttbl\t} } } */ > > diff --git a/gcc/tree-ssa-forwprop.cc b/gcc/tree-ssa-forwprop.cc > > index 5c690a2b03e..1c755f89c29 100644 > > --- a/gcc/tree-ssa-forwprop.cc > > +++ b/gcc/tree-ssa-forwprop.cc > > @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3. If not see > > #include "optabs-tree.h" > > #include "insn-config.h" > > #include "recog.h" > > +#include "cfgloop.h" > > +#include "tree-vectorizer.h" > > #include "tree-vector-builder.h" > > #include "vec-perm-indices.h" > > #include "internal-fn.h" > > @@ -184,6 +186,29 @@ along with GCC; see the file COPYING3. If not see > > > > This will (of course) be extended as other needs arise. */ > > > > +/* Data structure that holds a lane utilization. */ > > +typedef hash_map<int_hash <unsigned HOST_WIDE_INT, -1U>, > > + unsigned HOST_WIDE_INT> lane_map_t; > > + > > +/* Data structure that contains simplifiable vectorized permute sequences. > > + See recognise_vec_perm_simplify_seq () for a description of the > > sequence. */ > > + > > +struct _vec_perm_simplify_seq > > +{ > > + /* Defining stmts of vectors in the sequence. */ > > + gassign *v_1_stmt; > > + gassign *v_2_stmt; > > + gassign *v_x_stmt; > > + gassign *v_y_stmt; > > + /* Final permute statement. */ > > + gassign *stmt; > > + /* New selector indices for stmt. */ > > + vec<unsigned HOST_WIDE_INT> new_sel; > > + /* Elements of each vector and selector. */ > > + unsigned HOST_WIDE_INT nelts; > > +}; > > +typedef struct _vec_perm_simplify_seq *vec_perm_simplify_seq; > > + > > static bool forward_propagate_addr_expr (tree, tree, bool); > > > > /* Set to true if we delete dead edges during the optimization. */ > > @@ -225,6 +250,26 @@ fwprop_invalidate_lattice (tree name) > > lattice[SSA_NAME_VERSION (name)] = NULL_TREE; > > } > > > > +/* If NAME is an SSA_NAME defined by an assignment, return that assignment. > > + If SINGLE_USE_ONLY is true and NAME has multiple uses, return NULL. */ > > + > > +static gassign * > > +get_tree_def (tree name, bool single_use_only) > > +{ > > + if (TREE_CODE (name) != SSA_NAME) > > + return NULL; > > + > > + gimple *def_stmt = SSA_NAME_DEF_STMT (name); > > + gassign *assign = dyn_cast <gassign *> (def_stmt); > > + > > + if (!assign) > > + return NULL; > > + > > + if (single_use_only && !has_single_use (name)) > > + return NULL; > > + > > + return assign; > > +} > > > > /* Get the statement we can propagate from into NAME skipping > > trivial copies. Returns the statement which defines the > > @@ -3460,6 +3505,489 @@ fwprop_ssa_val (tree name) > > return name; > > } > > > > +/* Get an index map from the provided vector permute selector > > + and return the number of unique indices. 
> > + E.g.: { 1, 3, 1, 3 } -> <0, 1, 0, 1>, 2 > > + { 0, 2, 0, 2 } -> <0, 1, 0, 1>, 2 > > + { 3, 2, 1, 0 } -> <0, 1, 2, 3>, 4. */ > > + > > +static unsigned HOST_WIDE_INT > > +get_vect_selector_index_map (tree sel, lane_map_t *index_map) > > +{ > > + gcc_assert (VECTOR_CST_NELTS (sel).is_constant ()); > > + unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant (); > > + unsigned HOST_WIDE_INT n = 0; > > + > > + for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++) > > + { > > + /* Extract the i-th value from the selector. */ > > + tree sel_cst_tree = VECTOR_CST_ELT (sel, i); > > + unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree); > > + > > + unsigned HOST_WIDE_INT j = 0; > > + for (; j <= i; j++) > > + { > > + tree prev_sel_cst_tree = VECTOR_CST_ELT (sel, j); > > + unsigned HOST_WIDE_INT prev_sel_cst > > + = TREE_INT_CST_LOW (prev_sel_cst_tree); > > + if (prev_sel_cst == sel_cst) > > + break; > > + } > > + index_map->put (i, j); > > + n += (i == j) ? 1 : 0; > > + } > > + > > + return n; > > +} > > + > > +/* Search for opportunities to free half of the lanes in the following > > pattern: > > + > > + v_in = {e0, e1, e2, e3} > > + v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}> > > + // v_1 = {e0, e2, e0, e2} > > + v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}> > > + // v_2 = {e1, e3, e1, e3} > > + > > + v_x = v_1 + v_2 > > + // v_x = {e0+e1, e2+e3, e0+e1, e2+e3} > > + v_y = v_1 - v_2 > > + // v_y = {e0-e1, e2-e3, e0-e1, e2-e3} > > + > > + v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}> > > + // v_out = {e0+e1, e2+e3, e0-e1, e2-e3} > > + > > + The last statement could be simplified to: > > + v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}> > > + // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3} > > + > > + Characteristic properties: > > + - v_1 and v_2 are created from the same input vector v_in and introduce > > the > > + lane duplication (in the selection operand) that we can eliminate. > > + - v_x and v_y are results from lane-preserving operations that use v_1 > > and > > + v_2 as inputs. > > + - v_out is created by selecting from duplicated lanes. */ > > + > > +static bool > > +recognise_vec_perm_simplify_seq (gassign *stmt, vec_perm_simplify_seq *seq) > > +{ > > + gcc_checking_assert (stmt); > > + gcc_checking_assert (gimple_assign_rhs_code (stmt) == VEC_PERM_EXPR); > > + > > + /* Decompose the final vec permute statement. */ > > + tree v_x = gimple_assign_rhs1 (stmt); > > + tree v_y = gimple_assign_rhs2 (stmt); > > + tree sel = gimple_assign_rhs3 (stmt); > > + > > + if (!VECTOR_CST_NELTS (sel).is_constant ()) > > + return false; > > + > > + unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant (); > > + > > + /* Lookup the definition of v_x and v_y. */ > > + gassign *v_x_stmt = get_tree_def (v_x, true); > > + gassign *v_y_stmt = get_tree_def (v_y, true); > > + if (!v_x_stmt || !v_y_stmt) > > + return false; > > + > > + /* Check the operations that define v_x and v_y. 
*/ > > + if (TREE_CODE_CLASS (gimple_assign_rhs_code (v_x_stmt)) != tcc_binary > > + || TREE_CODE_CLASS (gimple_assign_rhs_code (v_y_stmt)) != tcc_binary) > > + return false; > > + > > + tree v_x_1 = gimple_assign_rhs1 (v_x_stmt); > > + tree v_x_2 = gimple_assign_rhs2 (v_x_stmt); > > + tree v_y_1 = gimple_assign_rhs1 (v_y_stmt); > > + tree v_y_2 = gimple_assign_rhs2 (v_y_stmt); > > + > > + if (v_x_stmt == v_y_stmt > > + || TREE_CODE (v_x_1) != SSA_NAME > > + || TREE_CODE (v_x_2) != SSA_NAME > > + || num_imm_uses (v_x_1) != 2 > > + || num_imm_uses (v_x_2) != 2) > > + return false; > > + > > + if (v_x_1 != v_y_1 || v_x_2 != v_y_2) > > + { > > + /* Allow operands of commutative operators to swap. */ > > + if (commutative_tree_code (gimple_assign_rhs_code (v_x_stmt))) > > + { > > + /* Swap the locals v_x_1/v_x_2, so that v_x_1 matches the first > > + operand of the (possibly non-commutative) v_y_stmt. */ > > + v_x_1 = gimple_assign_rhs2 (v_x_stmt); > > + v_x_2 = gimple_assign_rhs1 (v_x_stmt); > > + if (v_x_1 != v_y_1 || v_x_2 != v_y_2) > > + return false; > > + } > > + else if (commutative_tree_code (gimple_assign_rhs_code (v_y_stmt))) > > + { > > + if (v_x_1 != v_y_2 || v_x_2 != v_y_1) > > + return false; > > + } > > + else > > + return false; > > + } > > + > > + gassign *v_1_stmt = get_tree_def (v_x_1, false); > > + gassign *v_2_stmt = get_tree_def (v_x_2, false); > > + if (!v_1_stmt || !v_2_stmt) > > + return false; > > + > > + if (gimple_assign_rhs_code (v_1_stmt) != VEC_PERM_EXPR > > + || gimple_assign_rhs_code (v_2_stmt) != VEC_PERM_EXPR) > > + return false; > > + > > + /* Decompose initial VEC_PERM_EXPRs. */ > > + tree v_in = gimple_assign_rhs1 (v_1_stmt); > > + tree v_1_sel = gimple_assign_rhs3 (v_1_stmt); > > + tree v_2_sel = gimple_assign_rhs3 (v_2_stmt); > > + if (v_in != gimple_assign_rhs2 (v_1_stmt) > > + || v_in != gimple_assign_rhs1 (v_2_stmt) > > + || v_in != gimple_assign_rhs2 (v_2_stmt)) > > + return false; > > + > > + if (!VECTOR_CST_NELTS (v_1_sel).is_constant () > > + || !VECTOR_CST_NELTS (v_2_sel).is_constant ()) > > + return false; > > + > > + if (nelts != VECTOR_CST_NELTS (v_1_sel).to_constant () > > + || nelts != VECTOR_CST_NELTS (v_2_sel).to_constant ()) > > + return false; > > + > > + /* Now check permutation selection operands. */ > > + lane_map_t v_1_lane_map, v_2_lane_map; > > + unsigned HOST_WIDE_INT v_1_lanes, v_2_lanes; > > + v_1_lanes = get_vect_selector_index_map (v_1_sel, &v_1_lane_map); > > + v_2_lanes = get_vect_selector_index_map (v_2_sel, &v_2_lane_map); > > + > > + /* Check if we could free up half of the lanes. */ > > + if (v_1_lanes != v_2_lanes || v_1_lanes > (nelts / 2)) > > + return false; > > + > > + if (dump_file) > > + { > > + fprintf (dump_file, "Found vec perm simplify sequence ending with: > > "); > > + print_gimple_stmt (dump_file, stmt, 0); > > + > > + if (dump_flags & TDF_DETAILS) > > + fprintf (dump_file, "\tPossible new selector: { "); > > + } > > + > > + *seq = XNEW (struct _vec_perm_simplify_seq); > > + (*seq)->stmt = stmt; > > + (*seq)->v_1_stmt = v_1_stmt; > > + (*seq)->v_2_stmt = v_2_stmt; > > + (*seq)->v_x_stmt = v_x_stmt; > > + (*seq)->v_y_stmt = v_y_stmt; > > + (*seq)->nelts = nelts; > > + (*seq)->new_sel = vNULL; > > + > > + for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++) > > + { > > + /* Extract the i-th value from the selector. 
*/ > > + tree sel_cst_tree = VECTOR_CST_ELT (sel, i); > > + unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree); > > + > > + unsigned HOST_WIDE_INT j; > > + if (sel_cst < nelts) > > + j = *v_1_lane_map.get (sel_cst); > > + else > > + j = *v_2_lane_map.get (sel_cst - nelts) + nelts; > > + > > + (*seq)->new_sel.safe_push (j); > > + > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + { > > + fprintf (dump_file, HOST_WIDE_INT_PRINT_UNSIGNED, j); > > + if (i != (nelts - 1)) > > + fprintf (dump_file, ", "); > > + } > > + } > > + > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + fprintf (dump_file, " }\n"); > > + > > + return true; > > +} > > + > > +/* Reduce the lane consumption of a simplifiable vec perm sequence. */ > > + > > +static void > > +narrow_vec_perm_simplify_seq (vec_perm_simplify_seq seq) > > +{ > > + unsigned HOST_WIDE_INT nelts = seq->nelts; > > + int i; > > + unsigned HOST_WIDE_INT idx; > > + vec_perm_builder new_sel (nelts, nelts, 1); > > + > > + FOR_EACH_VEC_ELT (seq->new_sel, i, idx) > > + new_sel.quick_push (idx); > > + > > + gassign *stmt = seq->stmt; > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + { > > + fprintf (dump_file, "Updating VEC_PERM statement:\n"); > > + fprintf (dump_file, "Old stmt: "); > > + print_gimple_stmt (dump_file, stmt, 0); > > + } > > + > > + /* Update the last VEC_PERM statement. */ > > + vec_perm_indices new_indices (new_sel, 2, nelts); > > + tree vectype = TREE_TYPE (gimple_assign_lhs (stmt)); > > + tree m = vect_gen_perm_mask_checked (vectype, new_indices); > > + gimple_assign_set_rhs3 (stmt, m); > > + update_stmt (stmt); > > + > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + { > > + fprintf (dump_file, "New stmt: "); > > + print_gimple_stmt (dump_file, stmt, 0); > > + } > > +} > > + > > +/* Test if we can blend two simplifiable vec permute sequences. > > + NEED_SWAP will be set if the sequences must be swapped for blending. */ > > + > > +static bool > > +can_blend_vec_perm_simplify_seqs_p (vec_perm_simplify_seq seq1, > > + vec_perm_simplify_seq seq2, > > + bool *need_swap) > > +{ > > + unsigned HOST_WIDE_INT nelts = seq1->nelts; > > + > > + gcc_assert (gimple_bb (seq1->stmt) == gimple_bb (seq2->stmt)); > > + > > + /* Number of elements must be equal. */ > > + if (seq1->nelts != seq2->nelts) > > + return false; > > + > > + /* We need vectors of the same type. */ > > + if (TREE_TYPE (gimple_assign_lhs (seq1->stmt)) > > + != TREE_TYPE (gimple_assign_lhs (seq2->stmt))) > > + return false; > > + > > + /* We require isomorphic operators. */ > > + if (((gimple_assign_rhs_code (seq1->v_x_stmt) > > + != gimple_assign_rhs_code (seq2->v_x_stmt)) > > + || (gimple_assign_rhs_code (seq1->v_y_stmt) > > + != gimple_assign_rhs_code (seq2->v_y_stmt)))) > > + return false; > > + > > + /* We require equal selectors for v_1_stmt and v_2_stmt. > > + This ensures we don't introduce a complex permutation in the blended > > + sequence. 
*/ > > + tree seq1_v_1_sel = gimple_assign_rhs3 (seq1->v_1_stmt); > > + tree seq1_v_2_sel = gimple_assign_rhs3 (seq1->v_2_stmt); > > + tree seq2_v_1_sel = gimple_assign_rhs3 (seq2->v_1_stmt); > > + tree seq2_v_2_sel = gimple_assign_rhs3 (seq2->v_2_stmt); > > + for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++) > > + { > > + if ((TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_1_sel, i)) > > + != TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_1_sel, i))) > > + || (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_2_sel, i)) > > + != (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_2_sel, i))))) > > + return false; > > + } > > + > > + /* We cannot have any dependencies between the sequences. > > + > > + For merging, we will reuse seq1->v_1_stmt and seq1->v_2_stmt. > > + seq1's v_in is defined before these statements, but we need > > + to check if seq2's v_in is defined before as well. > > + > > + Further, we will reuse seq2->stmt. We need to ensure that > > + seq1->v_x_stmt and seq1->v_y_stmt are before that. */ > > + > > + tree seq2_v_in = gimple_assign_rhs1 (seq2->v_1_stmt); > > + gassign *seq2_v_in_stmt = get_tree_def (seq2_v_in, false); > > + if (!seq2_v_in_stmt > > + || (gimple_uid (seq2_v_in_stmt) > gimple_uid (seq1->v_1_stmt)) > > + || (gimple_uid (seq1->v_x_stmt) > gimple_uid (seq2->stmt)) > > + || (gimple_uid (seq1->v_y_stmt) > gimple_uid (seq2->stmt))) > > + { > > + tree seq1_v_in = gimple_assign_rhs1 (seq1->v_1_stmt); > > + gassign *seq1_v_in_stmt = get_tree_def (seq1_v_in, false); > > + /* Let's try to see if we succeed when swapping the sequences. */ > > + if (!seq1_v_in_stmt > > + || (gimple_uid (seq1_v_in_stmt) > gimple_uid (seq2->v_1_stmt)) > > + || (gimple_uid (seq2->v_x_stmt) > gimple_uid (seq1->stmt)) > > + || (gimple_uid (seq2->v_y_stmt) > gimple_uid (seq1->stmt))) > > + return false; > > + *need_swap = true; > > + } > > + else > > + *need_swap = false; > > + > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + fprintf (dump_file, "Found vec perm simplify sequence pair.\n"); > > + > > + return true; > > +} > > + > > +/* Merge the two given simplifiable vec permute sequences. */ > > + > > +static void > > +blend_vec_perm_simplify_seqs (vec_perm_simplify_seq seq1, > > + vec_perm_simplify_seq seq2) > > +{ > > + unsigned HOST_WIDE_INT nelts = seq1->nelts; > > + > > + /* Calculate the lanes that the first permute needs. */ > > + tree mask1 = gimple_assign_rhs3 (seq1->stmt); > > + lane_map_t mask1_lanes; > > + unsigned HOST_WIDE_INT i; > > + for (i = 0; i < nelts; i++) > > + { > > + tree val = VECTOR_CST_ELT (mask1, i); > > + unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts; > > + mask1_lanes.put (j, j); > > + } > > + > > + /* Calculate the lanes that the second permute needs and rewrite > > + them to use the lanes that are unused from the first permute. */ > > + tree mask2 = gimple_assign_rhs3 (seq2->stmt); > > + lane_map_t mask2_lanes; > > + unsigned HOST_WIDE_INT lane_gen = 0; > > + for (i = 0; i < nelts; i++) > > + { > > + tree val = VECTOR_CST_ELT (mask2, i); > > + unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts; > > + bool existed; > > + unsigned HOST_WIDE_INT &rewrite_lane > > + = mask2_lanes.get_or_insert (j, &existed); > > + if (existed) > > + continue; > > + > > + while (mask1_lanes.get (lane_gen)) > > + lane_gen++; > > + rewrite_lane = lane_gen; > > + lane_gen++; > > + } > > + > > + /* Ensure we don't need more than one permute. */ > > + gcc_assert (i >= nelts); > > + > > + /* Calculate the permutations based on MASK1_LANES/MASK2_LANES. 
*/ > > + vec_perm_builder seq2_stmt_sel (nelts, nelts, 1); > > + vec_perm_builder seq1_v_1_stmt_sel (nelts, nelts, 1); > > + vec_perm_builder seq1_v_2_stmt_sel (nelts, nelts, 1); > > + for (i = 0; i < nelts; i++) > > + { > > + /* Calculate the new mask for seq2->stmt. */ > > + tree val = VECTOR_CST_ELT (mask2, i); > > + unsigned HOST_WIDE_INT lane > > + = TREE_INT_CST_LOW (val) & (2 * nelts - 1); > > + unsigned off = (lane < nelts) ? 0 : nelts; > > + unsigned HOST_WIDE_INT new_lane > > + = *mask2_lanes.get (lane - off) + off; > > + seq2_stmt_sel.quick_push (new_lane); > > + > > + /* Calculate the new masks for seq1->v_1_stmt and seq1->v_2_stmt. */ > > + bool use_src1 = mask1_lanes.get (i); > > + tree mask1 = gimple_assign_rhs3 (use_src1 ? seq1->v_1_stmt > > + : seq2->v_1_stmt); > > + tree mask2 = gimple_assign_rhs3 (use_src1 ? seq1->v_2_stmt > > + : seq2->v_2_stmt); > > + unsigned HOST_WIDE_INT lane1 > > + = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask1, i)) % nelts; > > + unsigned HOST_WIDE_INT lane2 > > + = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask2, i)) % nelts; > > + seq1_v_1_stmt_sel.quick_push (lane1 + (use_src1 ? 0 : nelts)); > > + seq1_v_2_stmt_sel.quick_push (lane2 + (use_src1 ? 0 : nelts)); > > + } > > + > > + /* We don't need to adjust seq1->stmt because its lane consumption > > + was already tightened before entering this function. */ > > + > > + /* Adjust seq2->stmt: copy RHS1/RHS2 from seq1->stmt and set new sel. */ > > + gimple_assign_set_rhs1 (seq2->stmt, gimple_assign_rhs1 (seq1->stmt)); > > + gimple_assign_set_rhs2 (seq2->stmt, gimple_assign_rhs2 (seq1->stmt)); > > + vec_perm_indices seq2_stmt_indices (seq2_stmt_sel, 2, nelts); > > + tree vectype = TREE_TYPE (gimple_assign_lhs (seq2->stmt)); > > + tree mask = vect_gen_perm_mask_checked (vectype, seq2_stmt_indices); > > + gimple_assign_set_rhs3 (seq2->stmt, mask); > > + update_stmt (seq2->stmt); > > + > > + /* Adjust seq1->v_1_stmt: copy RHS2 from seq2->v_1_stmt and set new sel. > > */ > > + gimple_assign_set_rhs2 (seq1->v_1_stmt, gimple_assign_rhs1 > > (seq2->v_1_stmt)); > > + vec_perm_indices seq1_v_1_stmt_indices (seq1_v_1_stmt_sel, 2, nelts); > > + mask = vect_gen_perm_mask_checked (vectype, seq1_v_1_stmt_indices); > > + gimple_assign_set_rhs3 (seq1->v_1_stmt, mask); > > + update_stmt (seq1->v_1_stmt); > > + > > + /* Adjust seq1->v_2_stmt: copy RHS2 from seq2->v_2_stmt and set new sel. > > */ > > + gimple_assign_set_rhs2 (seq1->v_2_stmt, gimple_assign_rhs1 > > (seq2->v_2_stmt)); > > + vec_perm_indices seq1_v_2_stmt_indices (seq1_v_2_stmt_sel, 2, nelts); > > + mask = vect_gen_perm_mask_checked (vectype, seq1_v_2_stmt_indices); > > + gimple_assign_set_rhs3 (seq1->v_2_stmt, mask); > > + update_stmt (seq1->v_2_stmt); > > + > > + /* At this point, we have four unmodified seq2 stmts, which will be > > + eliminated by DCE. */ > > + > > + if (dump_file) > > + fprintf (dump_file, "Vec perm simplify sequences have been > > blended.\n\n"); > > +} > > + > > +/* Try to blend narrowed vec_perm_simplify_seqs pairwise. > > + The provided list will be empty after this call. 
*/ > > + > > +static void > > +process_vec_perm_simplify_seq_list (vec<vec_perm_simplify_seq> *l) > > +{ > > + unsigned int i, j; > > + vec_perm_simplify_seq seq1, seq2; > > + > > + if (!l->length ()) > > + return; > > + > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + fprintf (dump_file, "Processing %u vec perm simplify sequences.\n", > > + l->length ()); > > + > > + FOR_EACH_VEC_ELT (*l, i, seq1) > > + { > > + if (i + 1 < l->length ()) > > + { > > + FOR_EACH_VEC_ELT_FROM (*l, j, seq2, i + 1) > > + { > > + bool swap = false; > > + if (can_blend_vec_perm_simplify_seqs_p (seq1, seq2, &swap)) > > + { > > + /* Compress lanes. */ > > + narrow_vec_perm_simplify_seq (seq1); > > + narrow_vec_perm_simplify_seq (seq2); > > + > > + /* Blend sequences. */ > > + blend_vec_perm_simplify_seqs (swap ? seq2 : seq1, > > + swap ? seq1 : seq2); > > + > > + /* We can use unordered_remove as we break the loop > > below. */ > > + l->unordered_remove (j); > > + seq2->new_sel.release (); > > + XDELETE (seq2); > > + break; > > + } > > + } > > + } > > + > > + /* We don't need to call l->remove for seq1. */ > > + seq1->new_sel.release (); > > + XDELETE (seq1); > > + } > > + > > + l->truncate (0); > > +} > > + > > +/* Append the vec perm simplify sequence SEQ to the list L. Process > > + and empty the list first if there is no space left. */ > > + > > +static void > > +append_vec_perm_simplify_seq_list (vec<vec_perm_simplify_seq> *l, > > + const vec_perm_simplify_seq &seq) > > +{ > > + /* If there is no space left on the list, then process the list. */ > > + if (!l->space (1)) > > + process_vec_perm_simplify_seq_list (l); > > + > > + l->quick_push (seq); > > +} > > + > > /* Main entry point for the forward propagation and statement combine > > optimizer. */ > > > > @@ -3526,6 +4054,7 @@ pass_forwprop::execute (function *fun) > > auto_bitmap simple_dce_worklist; > > auto_bitmap need_ab_cleanup; > > to_purge = BITMAP_ALLOC (NULL); > > + auto_vec<vec_perm_simplify_seq, 8> vec_perm_simplify_seq_list; > > for (int i = 0; i < postorder_num; ++i) > > { > > gimple_stmt_iterator gsi; > > @@ -3605,14 +4134,18 @@ pass_forwprop::execute (function *fun) > > > > /* Apply forward propagation to all stmts in the basic-block. > > Note we update GSI within the loop as necessary. */ > > + unsigned int uid = 1; > > for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); ) > > { > > gimple *stmt = gsi_stmt (gsi); > > tree lhs, rhs; > > enum tree_code code; > > > > + gimple_set_uid (stmt, uid++); > > + > > if (!is_gimple_assign (stmt)) > > { > > + process_vec_perm_simplify_seq_list > > (&vec_perm_simplify_seq_list); > > gsi_next (&gsi); > > continue; > > } > > @@ -3620,9 +4153,11 @@ pass_forwprop::execute (function *fun) > > lhs = gimple_assign_lhs (stmt); > > rhs = gimple_assign_rhs1 (stmt); > > code = gimple_assign_rhs_code (stmt); > > + > > if (TREE_CODE (lhs) != SSA_NAME > > || has_zero_uses (lhs)) > > { > > + process_vec_perm_simplify_seq_list > > (&vec_perm_simplify_seq_list); > > gsi_next (&gsi); > > continue; > > } > > @@ -3901,10 +4436,25 @@ pass_forwprop::execute (function *fun) > > else > > gsi_next (&gsi); > > } > > + else if (code == VEC_PERM_EXPR) > > + { > > + /* Find vectorized sequences where we can reduce the lane > > + utilization. The narrowing will be done later, and only > > + if we find a pair of sequences that can be blended. 
*/ > > + gassign *assign = dyn_cast <gassign *> (stmt); > > + vec_perm_simplify_seq seq; > > + if (recognise_vec_perm_simplify_seq (assign, &seq)) > > + append_vec_perm_simplify_seq_list > > (&vec_perm_simplify_seq_list, > > + seq); > > + > > + gsi_next (&gsi); > > + } > > else > > gsi_next (&gsi); > > } > > > > + process_vec_perm_simplify_seq_list (&vec_perm_simplify_seq_list); > > + > > /* Combine stmts with the stmts defining their operands. > > Note we update GSI within the loop as necessary. */ > > for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi)) > > -- > > 2.47.0 > >
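For anyone who wants to reproduce the dump scans from the testcases locally, here is a rough sketch (not part of the patch; the flags come from the dg-options above, and the numeric prefix of the dump file name depends on the pass ordering of your build):

  $ gcc -O3 -S -fdump-tree-forwprop1-details vector-8.c
  $ grep "sequences have been blended" vector-8.c.*t.forwprop1
  $ grep "VEC_PERM_EXPR" vector-8.c.*t.forwprop1

The first grep should match the pass's success message, and the second should show the final permutes, including the { 2, 3, 6, 7 } selector of the blended second sequence.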