On Thu, Nov 14, 2024 at 3:55 PM Christoph Müllner <christoph.muell...@vrull.eu> wrote: > > On Thu, Nov 14, 2024 at 3:07 PM Richard Biener <rguent...@suse.de> wrote: > > > > On Thu, 14 Nov 2024, Christoph Müllner wrote: > > > > > This extends forwprop by yet another VEC_PERM optimization: > > > It attempts to merge two isomorphic vector sequences by exploiting the > > > redundancy in the lane utilization in these sequences. > > > This redundancy in lane utilization comes from the way specific > > > scalar statements end up being vectorized: two VEC_PERMs on top, binary > > > operations > > > on both of them, and a final VEC_PERM to create the result. > > > Here is an example of this sequence: > > > > > > v_in = {e0, e1, e2, e3} > > > v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}> > > > // v_1 = {e0, e2, e0, e2} > > > v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}> > > > // v_2 = {e1, e3, e1, e3} > > > > The above permutes are just to identify lane duplicates? > > Yes. > > > > v_x = v_1 + v_2 > > > // v_x = {e0+e1, e2+e3, e0+e1, e2+e3} > > > v_y = v_1 - v_2 > > > // v_y = {e0-e1, e2-e3, e0-e1, e2-e3} > > > > > > v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}> > > > // v_out = {e0+e1, e2+e3, e0-e1, e2-e3} > > > > and the requirement is really that there are lanes that participate > > in the compute but end up being unused, detected by being unused in > > the final permute (and also not used anywhere else)? > > Yes. > > > > To remove the redundancy, lanes 2 and 3 can be freed, which allows > > > changing the last statement into: > > > v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}> > > > // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3} > > > > And this isn't the actual transform but instead the idea is that > > we have two such final permutes where we can re-use the same > > intermediate compute with relevant lanes "blended". Aka we > > implicitly go through the step narrowing the v_1, v_2, v_x and v_y > > vectors and then "vectorize" two such instances into a wider vector. > > Yes.
> > > That said - I do not see that we need those two initial VEC_PERM > > if the inputs to the intermediate computes are not used elsewhere? > > From the final VEC_PERM we know which lanes are used and we should > > be able to arrange the instance merging permutes from those inputs > > (and not the inputs of the initial VEC_PERMs)? When generalizing > > like this the math is 2x (2x BINOP + 1x VEC_PERM) to > > 2x VEC_PERM + 2x BINOP + 2x VEC_PERM, 2x BINOP traded for 2x VEC_PERM > > with the hope of the added 2x VEC_PERM being mergeable with feeding > > VEC_PERMs? > > Yes, in general we can just drop all four initial VEC_PERMs and > create two new initial VEC_PERMs for the blended narrowed vectors. > > The patch takes a shortcut and reuses the existing initial VEC_PERMs > of the first > sequence. And the initial VEC_PERMs of the second sequence are left > untouched, > but will be eliminated later by DCE (together with the 2x BINOPs from the > second > sequence). > > > That said, this feels like it should be a backward live problem > > for lanes combined maybe with a forward redundancy scheme. The > > whole thing feels like it fits basic-block (re-)vectorization, > > there's also somewhat related "lane" tracking in the byteswap > > recognition pass (which I always wanted to become one detecting > > vector permutes as well). For ISAs with multiple vector sizes > > reducing the vector size without then combining two might be a > > win as well (but we'd need to know the "real" vector register size, > > like x86 exposes V4QImode but the vector size is at least V16QImode). > > I do not know how to decide if this lane-narrowing is beneficial or not. > The v1 did this transformation unconditionally, but you mentioned that this > might > not be beneficial in all cases. So, I restricted the lane compression to the > cases where we can blend later on. > > > I'm commenting on the patch independent of the above considerations > > below. > > > > Richard.
> > > > > The cost of eliminating the redundancy in the lane utilization is that > > > lowering the VEC_PERM expression could get more expensive because of > > > tighter packing of the lanes. Therefore this optimization is not done > > > alone, but only in case we identify two such sequences that can be > > > merged. > > > > > > Once all candidate sequences have been identified, we try to pair them > > > up, so that we can use the freed lanes for the second sequence. > > > On success, we replace 2x (2x VEC_PERM + 2x BINOP + 1x VEC_PERM) > > > instructions by 2x VEC_PERM + 2x BINOP + 2x VEC_PERM. > > > 2x VEC_PERM and 2x BINOP get eliminated. > > > > > > This targets x264_pixel_satd_8x4, which calculates the sum of absolute > > > transformed differences (SATD) using a Hadamard transformation. > > > We have seen 8% speedup on SPEC's x264 on a 5950X (x86-64) and 7% > > > speedup on an AArch64 machine. > > > > > > Bootstrapped and reg-tested on x86-64 and AArch64 (all languages). > > > > > > gcc/ChangeLog: > > > > > > * tree-ssa-forwprop.cc (struct _vec_perm_simplify_seq): New data > > > structure to store analysis results of a vec perm sequence. > > > (get_tree_def): New helper to get the defining statement of an > > > SSA_NAME. > > > (get_vect_selector_index_map): Helper to get an index map from > > > the provided vector permute selector. > > > (recognise_vec_perm_simplify_seq): Helper to recognise a > > > vec perm sequence. > > > (compress_vec_perm_simplify_seq): Helper to pack the lanes more > > > tightly. > > > (can_merge_vec_perm_simplify_seqs_p): Test if two vec perm > > > sequences can > > > be merged. > > > (merge_vec_perm_simplify_seqs): Helper to merge two vec perm > > > sequences. > > > (pass_forwprop::execute): Add calls to vec perm sequence functions. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.dg/tree-ssa/satd-hadamard.c: New test. > > > * gcc.dg/tree-ssa/vector-10.c: New test. > > > * gcc.dg/tree-ssa/vector-11.c: New test.
> > > * gcc.dg/tree-ssa/vector-8.c: New test. > > > * gcc.dg/tree-ssa/vector-9.c: New test. > > > * gcc.target/aarch64/sve/satd-hadamard.c: New test. > > > > > > Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu> > > > --- > > > Changes from v2: > > > * Moved code to tree-ssa-forwprop.cc where similar VEC_PERM > > > optimizations are implemented. > > > * Test operand order less strictly in case of commutative operators. > > > * Made naming more consistent. > > > * Added a test for testing dependencies between two sequences. > > > * Removed the instruction reordering (not necessary without dependencies). > > > * Added tests based on __builtin_shuffle (). > > > > > > Changes from v1: > > > * Moved code from tree-vect-slp.cc into a new pass (from where it could > > > be moved elsewhere). > > > * Only deduplicate lanes if sequences will be merged later on. > > > * Split functionality more strictly into analysis and transformation parts. > > > > > > Manolis Tsamis was the patch's initial author before I took it over.
> > > > > > gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c | 43 ++ > > > gcc/testsuite/gcc.dg/tree-ssa/vector-10.c | 32 ++ > > > gcc/testsuite/gcc.dg/tree-ssa/vector-11.c | 35 ++ > > > gcc/testsuite/gcc.dg/tree-ssa/vector-8.c | 31 + > > > gcc/testsuite/gcc.dg/tree-ssa/vector-9.c | 31 + > > > .../gcc.target/aarch64/sve/satd-hadamard.c | 3 + > > > gcc/tree-ssa-forwprop.cc | 538 ++++++++++++++++++ > > > 7 files changed, 713 insertions(+) > > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c > > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-10.c > > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-11.c > > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-8.c > > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vector-9.c > > > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c > > > > > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c > > > b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c > > > new file mode 100644 > > > index 00000000000..576ef01628c > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c > > > @@ -0,0 +1,43 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop4-details" } */ > > > + > > > +#include <stdint.h> > > > + > > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\ > > > + int t0 = s0 + s1;\ > > > + int t1 = s0 - s1;\ > > > + int t2 = s2 + s3;\ > > > + int t3 = s2 - s3;\ > > > + d0 = t0 + t2;\ > > > + d1 = t1 + t3;\ > > > + d2 = t0 - t2;\ > > > + d3 = t1 - t3;\ > > > +} > > > + > > > +int > > > +x264_pixel_satd_8x4_simplified (uint8_t *pix1, int i_pix1, uint8_t > > > *pix2, int i_pix2) > > > +{ > > > + uint32_t tmp[4][4]; > > > + uint32_t a0, a1, a2, a3; > > > + int sum = 0; > > > + int i; > > > + > > > + for (i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2) > > > + { > > > + a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16); > > > + a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16); > > 
> + a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16); > > > + a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16); > > > + HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, > > > a3); > > > + } > > > + > > > + for (i = 0; i < 4; i++) > > > + { > > > + HADAMARD4(a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], > > > tmp[3][i]); > > > + sum += a0 + a1 + a2 + a3; > > > + } > > > + > > > + return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1; > > > +} > > > + > > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" > > > "forwprop4" } } */ > > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c > > > b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c > > > new file mode 100644 > > > index 00000000000..510a8e814d7 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-10.c > > > @@ -0,0 +1,32 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */ > > > + > > > +typedef int vec __attribute__((vector_size (4 * sizeof (int)))); > > > + > > > +void h (vec *p_v_in_1, vec *p_v_in_2, vec *v_out_1, vec *v_out_2) > > > +{ > > > + vec sel0 = { 0, 2, 0, 2 }; > > > + vec sel1 = { 1, 3, 1, 3 }; > > > + vec sel = { 0, 1, 6, 7 }; > > > + vec v_1, v_2, v_x, v_y; > > > + vec v_in_1 = *p_v_in_1; > > > + vec v_in_2 = *p_v_in_2; > > > + > > > + /* First vec perm sequence. */ > > > + v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0); > > > + v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1); > > > + v_x = v_1 + v_2; > > > + v_y = v_1 - v_2; > > > + *v_out_1 = __builtin_shuffle (v_x, v_y, sel); > > > + > > > + /* Second vec perm sequence. */ > > > + v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0); > > > + v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1); > > > + v_x = v_1 + v_2; > > > + /* Won't merge because v_2 is RHS1 here. 
*/ > > > + v_y = v_2 - v_1; > > > + *v_out_2 = __builtin_shuffle (v_x, v_y, sel); > > > +} > > > + > > > +/* { dg-final { scan-tree-dump-not "Vec perm simplify sequences have > > > been merged" "forwprop1" } } */ > > > +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" > > > "forwprop1" } } */ > > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-11.c > > > b/gcc/testsuite/gcc.dg/tree-ssa/vector-11.c > > > new file mode 100644 > > > index 00000000000..e0ef92d9d56 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-11.c > > > @@ -0,0 +1,35 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */ > > > + > > > +typedef int vec __attribute__((vector_size (4 * sizeof (int)))); > > > + > > > +extern vec foo (vec *v); > > > +void i (vec *p_v_in_1, vec *v_out_1, vec *v_out_2) > > > +{ > > > + vec sel0 = { 0, 2, 0, 2 }; > > > + vec sel1 = { 1, 3, 1, 3 }; > > > + vec sel = { 0, 1, 6, 7 }; > > > + vec v_1, v_2, v_x, v_y; > > > + vec v_in_1 = *p_v_in_1; > > > + vec v_in_2; > > > + > > > + /* First vec perm sequence. */ > > > + v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0); > > > + v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1); > > > + v_x = v_1 + v_2; > > > + v_y = v_1 - v_2; > > > + *v_out_1 = __builtin_shuffle (v_x, v_y, sel); > > > + > > > + /* Won't merge because of dependency. */ > > > + v_in_2 = foo (v_out_1); > > > + > > > + /* Second vec perm sequence. 
*/ > > > + v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0); > > > + v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1); > > > + v_x = v_1 + v_2; > > > + v_y = v_1 - v_2; > > > + *v_out_2 = __builtin_shuffle (v_x, v_y, sel); > > > +} > > > + > > > +/* { dg-final { scan-tree-dump-not "Vec perm simplify sequences have > > > been merged" "forwprop1" } } */ > > > +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" > > > "forwprop1" } } */ > > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c > > > b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c > > > new file mode 100644 > > > index 00000000000..6b7040d4047 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-8.c > > > @@ -0,0 +1,31 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */ > > > + > > > +typedef int vec __attribute__((vector_size (4 * sizeof (int)))); > > > + > > > +void f (vec *p_v_in_1, vec *p_v_in_2, vec *v_out_1, vec *v_out_2) > > > +{ > > > + vec sel0 = { 0, 2, 0, 2 }; > > > + vec sel1 = { 1, 3, 1, 3 }; > > > + vec sel = { 0, 1, 6, 7 }; > > > + vec v_1, v_2, v_x, v_y; > > > + vec v_in_1 = *p_v_in_1; > > > + vec v_in_2 = *p_v_in_2; > > > + > > > + /* First vec perm sequence. */ > > > + v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0); > > > + v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1); > > > + v_x = v_1 + v_2; > > > + v_y = v_1 - v_2; > > > + *v_out_1 = __builtin_shuffle (v_x, v_y, sel); > > > + > > > + /* Second vec perm sequence. 
*/ > > > + v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0); > > > + v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1); > > > + v_x = v_1 + v_2; > > > + v_y = v_1 - v_2; > > > + *v_out_2 = __builtin_shuffle (v_x, v_y, sel); > > > +} > > > + > > > +/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been > > > merged" "forwprop1" } } */ > > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" > > > "forwprop1" } } */ > > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c > > > b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c > > > new file mode 100644 > > > index 00000000000..bb8e53faacb > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/vector-9.c > > > @@ -0,0 +1,31 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-additional-options "-O3 -fdump-tree-forwprop1-details" } */ > > > + > > > +typedef int vec __attribute__((vector_size (4 * sizeof (int)))); > > > + > > > +void g (vec *p_v_in_1, vec *p_v_in_2, vec *v_out_1, vec *v_out_2) > > > +{ > > > + vec sel0 = { 0, 2, 0, 2 }; > > > + vec sel1 = { 1, 3, 1, 3 }; > > > + vec sel = { 0, 1, 6, 7 }; > > > + vec v_1, v_2, v_x, v_y; > > > + vec v_in_1 = *p_v_in_1; > > > + vec v_in_2 = *p_v_in_2; > > > + > > > + /* First vec perm sequence. */ > > > + v_1 = __builtin_shuffle (v_in_1, v_in_1, sel0); > > > + v_2 = __builtin_shuffle (v_in_1, v_in_1, sel1); > > > + v_x = v_1 * v_2; > > > + v_y = v_2 - v_1; > > > + *v_out_1 = __builtin_shuffle (v_x, v_y, sel); > > > + > > > + /* Second vec perm sequence. 
*/ > > > + v_1 = __builtin_shuffle (v_in_2, v_in_2, sel0); > > > + v_2 = __builtin_shuffle (v_in_2, v_in_2, sel1); > > > + v_x = v_2 * v_1; > > > + v_y = v_2 - v_1; > > > + *v_out_2 = __builtin_shuffle (v_x, v_y, sel); > > > +} > > > + > > > +/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been > > > merged" "forwprop1" } } */ > > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" > > > "forwprop1" } } */ > > > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c > > > b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c > > > new file mode 100644 > > > index 00000000000..fcd140e3584 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c > > > @@ -0,0 +1,3 @@ > > > +#include "../../../gcc.dg/tree-ssa/satd-hadamard.c" > > > + > > > +/* { dg-final { scan-assembler-not {\ttbl\t} } } */ > > > diff --git a/gcc/tree-ssa-forwprop.cc b/gcc/tree-ssa-forwprop.cc > > > index cb1e86c6912..465fd278628 100644 > > > --- a/gcc/tree-ssa-forwprop.cc > > > +++ b/gcc/tree-ssa-forwprop.cc > > > @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3. If not see > > > #include "optabs-tree.h" > > > #include "insn-config.h" > > > #include "recog.h" > > > +#include "cfgloop.h" > > > +#include "tree-vectorizer.h" > > > #include "tree-vector-builder.h" > > > #include "vec-perm-indices.h" > > > #include "internal-fn.h" > > > @@ -184,6 +186,37 @@ along with GCC; see the file COPYING3. If not see > > > > > > This will (of course) be extended as other needs arise. */ > > > > > > +/* Data structure that holds a lane utilization. */ > > > +typedef hash_map<int_hash <unsigned HOST_WIDE_INT, -1U>, > > > + unsigned HOST_WIDE_INT> lane_map_t; > > > + > > > +/* Data structure that contains simplifiable vectorized permute > > > sequences. 
> > > + v_in = {e0, e1, e2, e3} > > > + v_1 = VEC_PERM <v_in, v_in, v_1_sel> > > > + v_2 = VEC_PERM <v_in, v_in, v_2_sel> > > > + v_y = v_1 - v_2 > > > + v_x = v_2 + v_1 > > > + v_out = VEC_PERM <v_x, v_y, v_out_sel> > > > + The sequence is simplifiable because it has the potential to free up > > > + half of the utilized lanes. This allows merging two sequences if > > > + they use similar operators. */ > > > + > > > +struct _vec_perm_simplify_seq > > > +{ > > > + /* Defining stmts of vectors. */ > > > + gassign *v_1_stmt; > > > + gassign *v_2_stmt; > > > + gassign *v_x_stmt; > > > + gassign *v_y_stmt; > > > + /* Final permute statement. */ > > > + gassign *stmt; > > > + /* Elements of each vector and selector. */ > > > + unsigned HOST_WIDE_INT nelts; > > > + /* New selector indices for stmt. */ > > > + vec<unsigned HOST_WIDE_INT> new_sel; > > > +}; > > > +typedef struct _vec_perm_simplify_seq *vec_perm_simplify_seq; > > > + > > > static bool forward_propagate_addr_expr (tree, tree, bool); > > > > > > /* Set to true if we delete dead edges during the optimization. */ > > > @@ -225,6 +258,26 @@ fwprop_invalidate_lattice (tree name) > > > lattice[SSA_NAME_VERSION (name)] = NULL_TREE; > > > } > > > > > > +/* If NAME is an SSA_NAME defined by an assignment, return that > > > assignment. > > > + If SINGLE_USE_ONLY is true and NAME has multiple uses, return NULL. > > > */ > > > + > > > +static gassign * > > > +get_tree_def (tree name, bool single_use_only) > > > +{ > > > + if (TREE_CODE (name) != SSA_NAME) > > > + return NULL; > > > + > > > + gimple *def_stmt = SSA_NAME_DEF_STMT (name); > > > + gassign *assign = dyn_cast <gassign *> (def_stmt); > > > + > > > + if (!assign) > > > + return NULL; > > > + > > > + if (single_use_only && !has_single_use (name)) > > > + return NULL; > > > + > > > + return assign; > > > +} > > > > > > /* Get the statement we can propagate from into NAME skipping > > > trivial copies.
Returns the statement which defines the > > > @@ -3460,6 +3513,441 @@ fwprop_ssa_val (tree name) > > > return name; > > > } > > > > > > +/* Get an index map from the provided vector permute selector > > > + and return the number of unique indices. > > > + E.g.: { 1, 3, 1, 3 } -> <0, 1, 0, 1>, 2 > > > + { 0, 2, 0, 2 } -> <0, 1, 0, 1>, 2 > > > + { 3, 2, 1, 0 } -> <0, 1, 2, 3>, 4. */ > > > + > > > +static unsigned HOST_WIDE_INT > > > +get_vect_selector_index_map (tree sel, lane_map_t *index_map) > > > +{ > > > + gcc_assert (VECTOR_CST_NELTS (sel).is_constant ()); > > > + unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant (); > > > + unsigned HOST_WIDE_INT n = 0; > > > + > > > + for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++) > > > + { > > > + /* Extract the i-th value from the selector. */ > > > + tree sel_cst_tree = VECTOR_CST_ELT (sel, i); > > > + unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree); > > > + > > > + unsigned HOST_WIDE_INT j = 0; > > > + for (; j <= i; j++) > > > + { > > > + tree prev_sel_cst_tree = VECTOR_CST_ELT (sel, j); > > > + unsigned HOST_WIDE_INT prev_sel_cst > > > + = TREE_INT_CST_LOW (prev_sel_cst_tree); > > > + if (prev_sel_cst == sel_cst) > > > + break; > > > + } > > > + index_map->put (i, j); > > > + n += (i == j) ? 
1 : 0; > > > + } > > > + > > > + return n; > > > +} > > > + > > > +/* Search for opportunities to free half of the lanes in the following > > > pattern: > > > + > > > + v_in = {e0, e1, e2, e3} > > > + v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}> > > > + // v_1 = {e0, e2, e0, e2} > > > + v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}> > > > + // v_2 = {e1, e3, e1, e3} > > > + > > > + v_x = v_1 + v_2 > > > + // v_x = {e0+e1, e2+e3, e0+e1, e2+e3} > > > + v_y = v_1 - v_2 > > > + // v_y = {e0-e1, e2-e3, e0-e1, e2-e3} > > > + > > > + v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}> > > > + // v_out = {e0+e1, e2+e3, e0-e1, e2-e3} > > > + > > > + The last statement could be simplified to: > > > + v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}> > > > + // v_out' = {e0+e1, e2+e3, e0-e1, e2-e3} > > > + > > > + Characteristic properties: > > > + - v_1 and v_2 are created from the same input vector v_in and > > > introduce the > > > + lane duplication (in the selection operand) that we can eliminate. > > > + - v_x and v_y are results from lane-preserving operations that use > > > v_1 and > > > + v_2 as inputs. > > > + - v_out is created by selecting from duplicated lanes. */ > > > + > > > +static bool > > > +recognise_vec_perm_simplify_seq (gassign *stmt, vec_perm_simplify_seq > > > *seq) > > > +{ > > > + gcc_checking_assert (stmt); > > > + gcc_checking_assert (gimple_assign_rhs_code (stmt) == VEC_PERM_EXPR); > > > + > > > + /* Decompose the final vec permute statement. */ > > > + tree v_x = gimple_assign_rhs1 (stmt); > > > + tree v_y = gimple_assign_rhs2 (stmt); > > > + tree sel = gimple_assign_rhs3 (stmt); > > > + > > > + if (!VECTOR_CST_NELTS (sel).is_constant ()) > > > + return false; > > > + > > > + unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant (); > > > + > > > + /* Lookup the definition of v_x and v_y. 
*/ > > > + gassign *v_x_stmt = get_tree_def (v_x, true); > > > + gassign *v_y_stmt = get_tree_def (v_y, true); > > > + if (!v_x_stmt || !v_y_stmt) > > > + return false; > > > + > > > + /* Check the operations that define v_x and v_y. */ > > > + if (TREE_CODE_CLASS (gimple_assign_rhs_code (v_x_stmt)) != tcc_binary > > > + || TREE_CODE_CLASS (gimple_assign_rhs_code (v_y_stmt)) != > > > tcc_binary) > > > + return false; > > > + > > > + tree v_x_1 = gimple_assign_rhs1 (v_x_stmt); > > > + tree v_x_2 = gimple_assign_rhs2 (v_x_stmt); > > > + tree v_y_1 = gimple_assign_rhs1 (v_y_stmt); > > > + tree v_y_2 = gimple_assign_rhs2 (v_y_stmt); > > > + > > > + if (v_x_stmt == v_y_stmt > > > + || TREE_CODE (v_x_1) != SSA_NAME > > > + || TREE_CODE (v_x_2) != SSA_NAME > > > + || num_imm_uses (v_x_1) != 2 > > > + || num_imm_uses (v_x_2) != 2) > > > + return false; > > > + > > > + if (v_x_1 != v_y_1 || v_x_2 != v_y_2) > > > + { > > > + /* Allow operands of commutative operators to swap. */ > > > + if (commutative_tree_code (gimple_assign_rhs_code (v_x_stmt))) > > > + { > > > + /* Keep v_x_1 the first operand for non-commutative operators. */ > > > + v_x_1 = gimple_assign_rhs2 (v_x_stmt); > > > + v_x_2 = gimple_assign_rhs1 (v_x_stmt); > > > + if (v_x_1 != v_y_1 || v_x_2 != v_y_2) > > > + return false; > > > + } > > > + else if (commutative_tree_code (gimple_assign_rhs_code (v_y_stmt))) > > > + { > > > + if (v_x_1 != v_y_2 || v_x_2 != v_y_1) > > > + return false; > > > + } > > > + else > > > + return false; > > > + } > > > + > > > + gassign *v_1_stmt = get_tree_def (v_x_1, false); > > > + gassign *v_2_stmt = get_tree_def (v_x_2, false); > > > + if (!v_1_stmt || !v_2_stmt) > > > + return false; > > > + > > > + if (gimple_assign_rhs_code (v_1_stmt) != VEC_PERM_EXPR > > > + || gimple_assign_rhs_code (v_2_stmt) != VEC_PERM_EXPR) > > > + return false; > > > + > > > + /* Decompose initial VEC_PERM_EXPRs. 
*/ > > > + tree v_in = gimple_assign_rhs1 (v_1_stmt); > > > + tree v_1_sel = gimple_assign_rhs3 (v_1_stmt); > > > + tree v_2_sel = gimple_assign_rhs3 (v_2_stmt); > > > + if (v_in != gimple_assign_rhs2 (v_1_stmt) > > > + || v_in != gimple_assign_rhs1 (v_2_stmt) > > > + || v_in != gimple_assign_rhs2 (v_2_stmt)) > > > + return false; > > > + > > > + if (!VECTOR_CST_NELTS (v_1_sel).is_constant () > > > + || !VECTOR_CST_NELTS (v_2_sel).is_constant ()) > > > + return false; > > > + > > > + if (nelts != VECTOR_CST_NELTS (v_1_sel).to_constant () > > > + || nelts != VECTOR_CST_NELTS (v_2_sel).to_constant ()) > > > + return false; > > > + > > > + /* Now check permutation selection operands. */ > > > + lane_map_t v_1_lane_map, v_2_lane_map; > > > + unsigned HOST_WIDE_INT v_1_lanes, v_2_lanes; > > > + v_1_lanes = get_vect_selector_index_map (v_1_sel, &v_1_lane_map); > > > + v_2_lanes = get_vect_selector_index_map (v_2_sel, &v_2_lane_map); > > > + > > > + /* Check if we could free up half of the lanes. */ > > > + if (v_1_lanes != v_2_lanes || v_1_lanes > (nelts / 2)) > > > + return false; > > > + > > > + if (dump_enabled_p ()) > > > + { > > > + fprintf (dump_file, "Found vec perm simplify sequence ending with: > > > "); > > > + print_gimple_stmt (dump_file, stmt, 0); > > > + > > > + if (dump_flags & TDF_DETAILS) > > > + fprintf (dump_file, "\tPossible new selector: { "); > > > + } > > > + > > > + *seq = XNEW (struct _vec_perm_simplify_seq); > > > + (*seq)->stmt = stmt; > > > + (*seq)->v_1_stmt = v_1_stmt; > > > + (*seq)->v_2_stmt = v_2_stmt; > > > + (*seq)->v_x_stmt = v_x_stmt; > > > + (*seq)->v_y_stmt = v_y_stmt; > > > + (*seq)->nelts = nelts; > > > + (*seq)->new_sel = vNULL; > > > + > > > + for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++) > > > + { > > > + /* Extract the i-th value from the selector. 
*/ > > > + tree sel_cst_tree = VECTOR_CST_ELT (sel, i); > > > + unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree); > > > + > > > + unsigned HOST_WIDE_INT j; > > > + if (sel_cst < nelts) > > > + j = *v_1_lane_map.get (sel_cst); > > > + else > > > + j = *v_2_lane_map.get (sel_cst - nelts) + nelts; > > > + > > > + (*seq)->new_sel.safe_push (j); > > > + > > > + if (dump_enabled_p () && (dump_flags & TDF_DETAILS)) > > > + { > > > + fprintf (dump_file, "%lu", j); > > > + if (i != (nelts - 1)) > > > + fprintf (dump_file, ", "); > > > + } > > > + } > > > + > > > + if (dump_enabled_p () && (dump_flags & TDF_DETAILS)) > > > + fprintf (dump_file, " }\n"); > > > + > > > + return true; > > > +} > > > + > > > +/* Reduce the lane consumption of a simplifiable vec perm sequence. */ > > > + > > > +static void > > > +compress_vec_perm_simplify_seq (vec_perm_simplify_seq seq) > > > +{ > > > + unsigned HOST_WIDE_INT nelts = seq->nelts; > > > + int i; > > > + unsigned HOST_WIDE_INT idx; > > > + vec_perm_builder new_sel (nelts, nelts, 1); > > > + > > > + FOR_EACH_VEC_ELT (seq->new_sel, i, idx) > > > + new_sel.quick_push (idx); > > > + > > > + gassign *stmt = seq->stmt; > > > + if (dump_enabled_p () && (dump_flags & TDF_DETAILS)) > > > + { > > > + fprintf (dump_file, "Updating VEC_PERM statement:\n"); > > > + fprintf (dump_file, "Old stmt: "); > > > + print_gimple_stmt (dump_file, stmt, 0); > > > + } > > > + > > > + /* Update the last VEC_PERM statement. */ > > > + vec_perm_indices new_indices (new_sel, 2, nelts); > > > + tree vectype = TREE_TYPE (gimple_assign_lhs (stmt)); > > > + tree m = vect_gen_perm_mask_checked (vectype, new_indices); > > > + gimple_assign_set_rhs3 (stmt, m); > > > + update_stmt (stmt); > > > + > > > + if (dump_enabled_p () && (dump_flags & TDF_DETAILS)) > > > + { > > > + fprintf (dump_file, "New stmt: "); > > > + print_gimple_stmt (dump_file, stmt, 0); > > > + } > > > +} > > > + > > > +/* Test if we can merge two simplifiable vec permute sequences.
*/ > > > + > > > +static bool > > > +can_merge_vec_perm_simplify_seqs_p (vec_perm_simplify_seq seq1, > > > + vec_perm_simplify_seq seq2) > > > +{ > > > + unsigned HOST_WIDE_INT nelts = seq1->nelts; > > > + > > > + /* Number of elements must be equal. */ > > > + if (seq1->nelts != seq2->nelts) > > > + return false; > > > + > > > + /* We need vectors of the same type. */ > > > + if (TREE_TYPE (gimple_assign_lhs (seq1->stmt)) > > > + != TREE_TYPE (gimple_assign_lhs (seq2->stmt))) > > > + return false; > > > + > > > + /* We require isomorphic operators. */ > > > + if (((gimple_assign_rhs_code (seq1->v_x_stmt) > > > + != gimple_assign_rhs_code (seq2->v_x_stmt)) > > > + || (gimple_assign_rhs_code (seq1->v_y_stmt) > > > + != gimple_assign_rhs_code (seq2->v_y_stmt)))) > > > + return false; > > > + > > > + /* We require equal selectors for v_1_stmt and v_2_stmt. > > > + This ensures we don't introduce a complex permutation > > > + in the merged sequence. */ > > > + tree seq1_v_1_sel = gimple_assign_rhs3 (seq1->v_1_stmt); > > > + tree seq1_v_2_sel = gimple_assign_rhs3 (seq1->v_2_stmt); > > > + tree seq2_v_1_sel = gimple_assign_rhs3 (seq2->v_1_stmt); > > > + tree seq2_v_2_sel = gimple_assign_rhs3 (seq2->v_2_stmt); > > > + for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++) > > > + { > > > + if ((TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_1_sel, i)) > > > + != TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_1_sel, i))) > > > + || (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1_v_2_sel, i)) > > > + != (TREE_INT_CST_LOW (VECTOR_CST_ELT (seq2_v_2_sel, i))))) > > > + return false; > > > + } > > > + > > > + /* We cannot have any dependencies between the sequences. > > > + Therefore, v_in of both sequences must be defined before > > > + v_1_stmt and v_2_stmt. 
*/ > > > + tree seq1_v_in = gimple_assign_rhs1 (seq1->v_1_stmt); > > > + gassign *seq1_v_in_stmt = get_tree_def (seq1_v_in, false); > > > + tree seq2_v_in = gimple_assign_rhs1 (seq2->v_1_stmt); > > > + gassign *seq2_v_in_stmt = get_tree_def (seq2_v_in, false); > > > + > > > + /* Don't merge if either v_in is not created by a statement. */ > > > + if (!seq1_v_in_stmt || !seq2_v_in_stmt) > > > + return false; > > > + > > > + /* Assure that seq1_v_in is defined before seq2_v_1. */ > > > + gimple_stmt_iterator gsi_lookup = gsi_for_stmt (seq1_v_in_stmt); > > > + bool found = false; > > > + while (!gsi_end_p (gsi_lookup)) > > > + { > > > + gimple *s = gsi_stmt (gsi_lookup); > > > + if (s == seq2_v_in_stmt || s == seq2->v_1_stmt) > > > + { > > > + found = true; > > > + break; > > > + } > > > + gsi_next (&gsi_lookup); > > > > Ick. I think we should simply assign gimple_uid () during the stmt > > walk of forwprop instead of linearly walking stmts. > > Will do. > > > I also think we want to avoid combining across any stmts with > > side-effects (calls, asm(), I guess even stores?), which we can do ... > > Will do... > > > > > > + } > > > + > > > + if (!found) > > > + return false; > > > + > > > + /* Assure that seq2_v_in is defined before seq1_v_1. */ > > > + gsi_lookup = gsi_for_stmt (seq2_v_in_stmt); > > > + found = false; > > > + while (!gsi_end_p (gsi_lookup)) > > > + { > > > + gimple *s = gsi_stmt (gsi_lookup); > > > + if (s == seq1_v_in_stmt || s == seq1->v_1_stmt) > > > + { > > > + found = true; > > > + break; > > > + } > > > + gsi_next (&gsi_lookup); > > > + } > > > + > > > + if (!found) > > > + return false; > > > + > > > + if (dump_enabled_p ()) > > > + fprintf (dump_file, "Found vec perm simplify sequence pair.\n"); > > > + > > > + return true; > > > +} > > > + > > > +/* Merge the two given simplifiable vec permute sequences. 
*/ > > > + > > > +static void > > > +merge_vec_perm_simplify_seqs (vec_perm_simplify_seq seq1, > > > + vec_perm_simplify_seq seq2) > > > +{ > > > + unsigned HOST_WIDE_INT nelts = seq1->nelts; > > > + > > > + /* Calculate the lanes that the first permute needs. */ > > > + tree mask1 = gimple_assign_rhs3 (seq1->stmt); > > > + lane_map_t mask1_lanes; > > > + unsigned HOST_WIDE_INT i; > > > + for (i = 0; i < nelts; i++) > > > + { > > > + tree val = VECTOR_CST_ELT (mask1, i); > > > + unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts; > > > + mask1_lanes.put (j, j); > > > + } > > > + > > > + /* Calculate the lanes that the second permute needs and rewrite > > > + them to use the lanes that are unused from the first permute. */ > > > + tree mask2 = gimple_assign_rhs3 (seq2->stmt); > > > + lane_map_t mask2_lanes; > > > + unsigned HOST_WIDE_INT lane_gen = 0; > > > + for (i = 0; i < nelts; i++) > > > + { > > > + tree val = VECTOR_CST_ELT (mask2, i); > > > + unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) % nelts; > > > + bool existed; > > > + unsigned HOST_WIDE_INT &rewrite_lane > > > + = mask2_lanes.get_or_insert (j, &existed); > > > + if (existed) > > > + continue; > > > + > > > + while (mask1_lanes.get (lane_gen)) > > > + lane_gen++; > > > + rewrite_lane = lane_gen; > > > + lane_gen++; > > > + } > > > + > > > + /* Ensure we don't need more than one permute. */ > > > + gcc_assert (i >= nelts); > > > + > > > + /* Calculate the permutations based on MASK1_LANES/MASK2_LANES. */ > > > + vec_perm_builder seq2_stmt_sel (nelts, nelts, 1); > > > + vec_perm_builder seq1_v_1_stmt_sel (nelts, nelts, 1); > > > + vec_perm_builder seq1_v_2_stmt_sel (nelts, nelts, 1); > > > + for (i = 0; i < nelts; i++) > > > + { > > > + /* Calculate new mask for STMT2. */ > > > + tree val = VECTOR_CST_ELT (mask2, i); > > > + unsigned HOST_WIDE_INT lane > > > + = TREE_INT_CST_LOW (val) & (2 * nelts - 1); > > > + unsigned off = (lane < nelts) ? 
0 : nelts; > > > + unsigned HOST_WIDE_INT new_lane > > > + = *mask2_lanes.get (lane - off) + off; > > > + seq2_stmt_sel.quick_push (new_lane); > > > + > > > + /* Calculate new masks for SRC1_1 and SRC1_2. */ > > > + bool use_src1 = mask1_lanes.get (i); > > > + tree mask1 = gimple_assign_rhs3 (use_src1 ? seq1->v_1_stmt > > > + : seq2->v_1_stmt); > > > + tree mask2 = gimple_assign_rhs3 (use_src1 ? seq1->v_2_stmt > > > + : seq2->v_2_stmt); > > > + unsigned HOST_WIDE_INT lane1 > > > + = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask1, i)) % nelts; > > > + unsigned HOST_WIDE_INT lane2 > > > + = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask2, i)) % nelts; > > > + seq1_v_1_stmt_sel.quick_push (lane1 + (use_src1 ? 0 : nelts)); > > > + seq1_v_2_stmt_sel.quick_push (lane2 + (use_src1 ? 0 : nelts)); > > > + } > > > + > > > + /* We don't need to adjust seq1->stmt because its lanes consumption > > > + was already tightened before entering this function. */ > > > + > > > + /* Adjust seq2->stmt: copy RHS1/RHS2 from seq1->stmt and set new sel. > > > */ > > > + gimple_assign_set_rhs1 (seq2->stmt, gimple_assign_rhs1 (seq1->stmt)); > > > + gimple_assign_set_rhs2 (seq2->stmt, gimple_assign_rhs2 (seq1->stmt)); > > > + vec_perm_indices seq2_stmt_indices (seq2_stmt_sel, 2, nelts); > > > + tree vectype = TREE_TYPE (gimple_assign_lhs (seq2->stmt)); > > > + tree mask = vect_gen_perm_mask_checked (vectype, seq2_stmt_indices); > > > + gimple_assign_set_rhs3 (seq2->stmt, mask); > > > + update_stmt (seq2->stmt); > > > + > > > + /* Adjust seq1->v_1_stmt: copy RHS2 from seq2->v_1_stmt and set new > > > sel. 
*/ > > > + gimple_assign_set_rhs2 (seq1->v_1_stmt, gimple_assign_rhs1 > > > (seq2->v_1_stmt)); > > > + vec_perm_indices seq1_v_1_stmt_indices (seq1_v_1_stmt_sel, 2, nelts); > > > + mask = vect_gen_perm_mask_checked (vectype, seq1_v_1_stmt_indices); > > > + gimple_assign_set_rhs3 (seq1->v_1_stmt, mask); > > > + update_stmt (seq1->v_1_stmt); > > > + > > > + /* Adjust seq1->v_2_stmt: copy RHS2 from seq2->v_2_stmt and set new > > > sel. */ > > > + gimple_assign_set_rhs2 (seq1->v_2_stmt, gimple_assign_rhs1 > > > (seq2->v_2_stmt)); > > > + vec_perm_indices seq1_v_2_stmt_indices (seq1_v_2_stmt_sel, 2, nelts); > > > + mask = vect_gen_perm_mask_checked (vectype, seq1_v_2_stmt_indices); > > > + gimple_assign_set_rhs3 (seq1->v_2_stmt, mask); > > > + update_stmt (seq1->v_2_stmt); > > > + > > > + /* At this point, we have four unmodified seq2 stmts, which will be > > > + eliminated by DCE. */ > > > + > > > + if (dump_enabled_p ()) > > > + fprintf (dump_file, "Vec perm simplify sequences have been > > > merged.\n\n"); > > > +} > > > + > > > /* Main entry point for the forward propagation and statement combine > > > optimizer. */ > > > > > > @@ -3532,6 +4020,7 @@ pass_forwprop::execute (function *fun) > > > basic_block bb = BASIC_BLOCK_FOR_FN (fun, postorder[i]); > > > edge_iterator ei; > > > edge e; > > > + vec<vec_perm_simplify_seq> vec_perm_simplify_seq_list = vNULL; > > > > > > /* Skip processing not executable blocks. We could improve > > > single_use tracking by at least unlinking uses from unreachable > > > @@ -3901,10 +4390,59 @@ pass_forwprop::execute (function *fun) > > > else > > > gsi_next (&gsi); > > > } > > > + else if (code == VEC_PERM_EXPR) > > > + { > > > + /* Find vectorized sequences, where we can reduce > > > + the lane utilization. This compression will be > > > + done later, if we find a matching pair of > > > + sequences. 
*/ > > > + gassign *assign = dyn_cast <gassign *> (stmt); > > > + vec_perm_simplify_seq seq; > > > + if (recognise_vec_perm_simplify_seq (assign, &seq)) > > > + vec_perm_simplify_seq_list.safe_push (seq); > > > + > > > + gsi_next (&gsi); > > > + } > > > else > > > gsi_next (&gsi); > > > } > > > > ... in the above loop by combining/truncating vec_perm_simplify_seq_list > > when we hit such a "blockage" stmt? I'll note the pattern matching > > in recognise_vec_perm_simplify_seq isn't protected against leaving > > the basic-block or crossing such "blockage" stmt. > > This means that the merging/blending cannot be done on BB level, > but must be triggered whenever a "blockage" stmt is found > (otherwise we would end up with multiple lists per BB). > Thanks for this idea! > > > > > > + /* If vec perm simplify sequences are found, then try to merge > > > them. */ > > > + if (dump_enabled_p () && !vec_perm_simplify_seq_list.is_empty ()) > > > + { > > > + fprintf (dump_file, "Found %u vec perm simplify sequences. " > > > + "Trying to merge them.\n\n", > > > + vec_perm_simplify_seq_list.length ()); > > > + } > > > + > > > + while (!vec_perm_simplify_seq_list.is_empty ()) > > > + { > > > + vec_perm_simplify_seq seq1 = vec_perm_simplify_seq_list[0]; > > > + vec_perm_simplify_seq_list.ordered_remove (0); > > > > ordered_remove is expensive, I wonder if this can be improved. > > Note I think we want an upper limit on the size of the vector > > anyway, either with a --param or a hard-coded number. I'd simply > > do this processing once we hit the limit and start over, even when > > the basic-block isn't finished yet. > > So, using an array of pointers to vec_perm_simplify_seq is preferred. > While it is possible to exploit this limit in an artificial test case, > I haven't seen real-world code where this would be the case. > I will use an array with a hard-coded maximum of 8 sequences.
I misunderstood you here yesterday. So, instead of using the O(n) methods of vec, I now allocate the vec with a fixed size using auto_vec<vec_perm_simplify_seq, 8> vec_perm_simplify_seq_list; and then only use quick_push(), unordered_remove() and truncate(0). Revised patch (including fixes for all mentioned items) can be found here: https://gcc.gnu.org/pipermail/gcc-patches/2024-November/669020.html Sorry that this comes so late. dump_enabled_p () wanted to teach me a lesson (namely that it does not replace a test for dump_file, unless one wants to debug interesting effects). Thanks, Christoph > > I will also switch the naming a bit to match your words: > * lane compression -> narrowing > * sequences merging -> blending > > Thanks! > > > > > > + > > > + int i; > > > + vec_perm_simplify_seq seq2; > > > + FOR_EACH_VEC_ELT (vec_perm_simplify_seq_list, i, seq2) > > > + { > > > + if (can_merge_vec_perm_simplify_seqs_p (seq1, seq2)) > > > + { > > > + vec_perm_simplify_seq_list.ordered_remove (i); > > > + compress_vec_perm_simplify_seq (seq1); > > > + compress_vec_perm_simplify_seq (seq2); > > > + > > > + merge_vec_perm_simplify_seqs (seq1, seq2); > > > + > > > + seq2->new_sel.release (); > > > + XDELETE (seq2); > > > + break; > > > + } > > > + } > > > + > > > + seq1->new_sel.release (); > > > + XDELETE (seq1); > > > + } > > > + vec_perm_simplify_seq_list.release (); > > > + > > > /* Combine stmts with the stmts defining their operands. > > > Note we update GSI within the loop as necessary. */ > > > for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi)) > > > > > > > -- > > Richard Biener <rguent...@suse.de> > > SUSE Software Solutions Germany GmbH, > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)