On Fri, 24 Jun 2022, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Thu, 23 Jun 2022, Richard Sandiford wrote:
> >> In a reduction pair like:
> >>
> >>   typedef float T;
> >>
> >>   void
> >>   f1 (T *x)
> >>   {
> >>     T res1 = 0;
> >>     T res2 = 0;
> >>     for (int i = 0; i < 100; ++i)
> >>       {
> >>         res1 += x[i * 2];
> >>         res2 += x[i * 2 + 1];
> >>       }
> >>     x[0] = res1;
> >>     x[1] = res2;
> >>   }
> >>
> >> it isn't easy to predict whether the initial reduction group will
> >> be { res1, res2 } or { res2, res1 }.  If the initial group ends up
> >> being { res2, res1 }, vect_optimize_slp will try to swap it around
> >> in order to remove an unnecessary permute on the load.  This gives:
> >>
> >> f1:
> >> .LFB0:
> >>         .cfi_startproc
> >>         movi    v0.4s, 0
> >>         mov     x1, x0
> >>         add     x2, x0, 800
> >>         .p2align 3,,7
> >> .L2:
> >>         ldr     q1, [x1], 16
> >>         fadd    v0.4s, v0.4s, v1.4s
> >>         cmp     x2, x1
> >>         bne     .L2
> >>         dup     d1, v0.d[1]
> >>         fadd    v0.2s, v0.2s, v1.2s
> >>         str     d0, [x0]
> >>         ret
> >>
> >> But there are no specific ordering constraints for non-consecutive
> >> loads.  For example, in:
> >>
> >>   void
> >>   f2 (T *x)
> >>   {
> >>     T res1 = 0;
> >>     T res2 = 0;
> >>     for (int i = 0; i < 100; ++i)
> >>       {
> >>         res1 += x[i * 4];
> >>         res2 += x[i * 4 + 2];
> >>       }
> >>     x[0] = res1;
> >>     x[1] = res2;
> >>   }
> >>
> >> the reduction group happens to be { res2, res1 }, and so we get
> >> a load permutation of { 2, 0, 6, 4 }.  On aarch64 this gives:
> >>
> >> f2:
> >> .LFB1:
> >>         .cfi_startproc
> >>         adrp    x2, .LC0
> >>         mov     x1, x0
> >>         movi    v2.4s, 0
> >>         ldr     q3, [x2, #:lo12:.LC0]
> >>         add     x2, x0, 1568
> >>         .p2align 3,,7
> >> .L6:
> >>         ldp     q0, q1, [x1]
> >>         add     x1, x1, 32
> >>         tbl     v0.16b, {v0.16b - v1.16b}, v3.16b
> >>         fadd    v2.4s, v2.4s, v0.4s
> >>         cmp     x1, x2
> >>         bne     .L6
> >>         ...untidy code, 18 insns...
> >>         ret
> >>
> >> where we have to load the permute selector from memory and use
> >> the general TBL instruction.  If the order happened to be
> >> { res1, res2 } instead we would generate the much tidier:
> >>
> >> f2:
> >> .LFB1:
> >>         .cfi_startproc
> >>         movi    v1.4s, 0
> >>         mov     x1, x0
> >>         add     x2, x0, 1568
> >>         .p2align 3,,7
> >> .L6:
> >>         ldp     q0, q2, [x1]
> >>         add     x1, x1, 32
> >>         uzp1    v0.4s, v0.4s, v2.4s
> >>         fadd    v1.4s, v1.4s, v0.4s
> >>         cmp     x1, x2
> >>         bne     .L6
> >>         ldr     d0, [x0, 1568]
> >>         ldr     d5, [x0, 1576]
> >>         ldr     s2, [x0, 1584]
> >>         dup     d3, v1.d[1]
> >>         ldr     s4, [x0, 1592]
> >>         zip1    v0.2s, v0.2s, v5.2s
> >>         ins     v2.s[1], v4.s[0]
> >>         fadd    v0.2s, v0.2s, v2.2s
> >>         fadd    v0.2s, v0.2s, v1.2s
> >>         fadd    v0.2s, v0.2s, v3.2s
> >>         str     d0, [x0]
> >>         ret
> >>
> >> This WIP patch makes vect_optimize_slp try to put load permutations
> >> into index order.  However, this new transform might be a loss if it
> >> forces permutations elsewhere.
> >> For example, given:
> >>
> >>   void
> >>   f3 (T *restrict x, T *restrict y, T *restrict z)
> >>   {
> >>     for (int i = 0; i < 100; ++i)
> >>       {
> >>         x[i * 2] = y[i * 4 + 2] + z[i * 4 + 2];
> >>         x[i * 2 + 1] = y[i * 4] + z[i * 4];
> >>       }
> >>   }
> >>
> >> we would generate:
> >>
> >> .L9:
> >>         ldr     q0, [x1, x3]
> >>         ldr     q3, [x6, x3]
> >>         ldr     q1, [x2, x3]
> >>         ldr     q2, [x5, x3]
> >>         add     x3, x3, 32
> >>         uzp1    v0.4s, v0.4s, v3.4s
> >>         uzp1    v1.4s, v1.4s, v2.4s
> >>         fadd    v0.4s, v0.4s, v1.4s
> >>         rev64   v0.4s, v0.4s
> >>         str     q0, [x4], 16
> >>         cmp     x3, 1568
> >>         bne     .L9
> >>
> >> (3 permutes) rather than:
> >>
> >> .L9:
> >>         ldr     q0, [x1, x3]
> >>         ldr     q1, [x6, x3]
> >>         ldr     q2, [x2, x3]
> >>         ldr     q3, [x5, x3]
> >>         tbl     v0.16b, {v0.16b - v1.16b}, v4.16b
> >>         add     x3, x3, 32
> >>         tbl     v2.16b, {v2.16b - v3.16b}, v4.16b
> >>         fadd    v0.4s, v0.4s, v2.4s
> >>         str     q0, [x4], 16
> >>         cmp     x3, 1568
> >>         bne     .L9
> >>
> >> (2 permutes).
> >>
> >> One simple(ish) fix would be to conditionalise the new case on
> >> whether all "roots" of the load are reduction groups.  Alternatively,
> >> we could treat the new case as a soft preference and back out if it
> >> would require any new VEC_PERM_EXPR nodes to be created.  This would
> >> require a propagation back from uses to definitions.
> >>
> >> WDYT?  Does this look like a feasible direction?  If so, any thoughts
> >> on when the new case should be enabled?
> >>
> >> The patch keeps the bijection requirement, since relaxing that seems
> >> like separate work.
> >>
> >> Tested on aarch64-linux-gnu, no regressions.
> >>
> >> Thanks,
> >> Richard
> >>
> >> ---
> >>  gcc/tree-vect-slp.cc | 41 ++++++++++++++---------------------------
> >>  1 file changed, 14 insertions(+), 27 deletions(-)
> >>
> >> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> >> index dab5daddcc5..fde35d8c954 100644
> >> --- a/gcc/tree-vect-slp.cc
> >> +++ b/gcc/tree-vect-slp.cc
> >> @@ -20,6 +20,7 @@ along with GCC; see the file COPYING3.  If not see
> >>  <http://www.gnu.org/licenses/>.  */
> >>
> >>  #include "config.h"
> >> +#define INCLUDE_ALGORITHM
> >>  #include "system.h"
> >>  #include "coretypes.h"
> >>  #include "backend.h"
> >> @@ -3698,43 +3699,29 @@ vect_optimize_slp (vec_info *vinfo)
> >>        if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> >>          continue;
> >>        dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> >> -      unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> >> -      bool any_permute = false;
> >> -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> >> -        {
> >> -          unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
> >> -          imin = MIN (imin, idx);
> >> -          imax = MAX (imax, idx);
> >> -          if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> >> -            any_permute = true;
> >> -        }
> >> -      /* If there's no permute no need to split one out.  */
> >> -      if (!any_permute)
> >> -        continue;
> >> -      /* If the span doesn't match we'd disrupt VF computation, avoid
> >> -         that for now.  */
> >> -      if (imax - imin + 1 != SLP_TREE_LANES (node))
> >> +
> >> +      auto_vec<unsigned> sorted;
> >> +      sorted.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
> >> +      std::sort (sorted.begin (), sorted.end ());
> >> +      if (std::equal (sorted.begin (), sorted.end (),
> >> +                      SLP_TREE_LOAD_PERMUTATION (node).begin ()))
> >>          continue;
> >>
> >>        /* For now only handle true permutes, like
> >>           vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> >>           when permuting constants and invariants keeping the permute
> >>           bijective.  */
> >> -      auto_sbitmap load_index (SLP_TREE_LANES (node));
> >> -      bitmap_clear (load_index);
> >> -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> >> -        bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
> >> -      unsigned j;
> >> -      for (j = 0; j < SLP_TREE_LANES (node); ++j)
> >> -        if (!bitmap_bit_p (load_index, j))
> >> -          break;
> >> -      if (j != SLP_TREE_LANES (node))
> >> +      if (std::adjacent_find (sorted.begin (), sorted.end ())
> >> +          != sorted.end ())
> >>          continue;
> >
> > So without the std:: rewrite this is the only change, where
> > previously we rejected { 0, 2, ... } because '1' was missing
> > but now we only reject duplicates?
> 
> All three changes work together really.  I thought the std:: stuff
> made things simpler, but the things that use std:: would still need
> to be rewritten if the patch didn't use std::.
> 
> For a load permute (Lo) like { 2, 0, 6, 4 } we want the lane permute (La)
> to be { 1, 0, 3, 2 }, so that the new load permute (Lo') is { 0, 2, 4, 6 }.
> 
> So the "SLP_TREE_LOAD_PERMUTATION (node)[j] - imin" below is no longer
> enough when computing La, since subtracting the minimum doesn't have
> the necessary compressive effect.  We instead need to set La[i] to the
> index of Lo[i] in Lo'.  And the easiest way of doing that seemed to be
> to compute Lo' (by sorting) and then doing a binary search for each
> element.  Having the sorted array also makes it easy to see whether
> the permute is a no-op and whether the same source element appears
> twice.
> 
> > As the comment says this
> > is what vect_attempt_slp_rearrange_stmts did so we would need
> > to adjust the comment as well.
> 
> I thought the comment meant that the permute entered into the perm
> array is a bijection on [0, SLP_TREE_LANES (node) - 1].  That's still
> true with the patch, and so for example we still don't need to worry
> about:
> 
>   /* ??? But for constants once we want to handle
>      non-bijective permutes we have to verify the permute,
>      when unifying lanes, will not unify different constants.
>      For example see gcc.dg/vect/bb-slp-14.c for a case
>      that would break.  */
> 
> (I hope).
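(To make that Lo -> La computation concrete, here is a standalone
sketch that mirrors the patch's use of std::sort, std::adjacent_find
and std::lower_bound.  It is illustrative only: the names lo, la and
sorted are made up here, and nothing below is GCC code.

  #include <algorithm>
  #include <cstdio>
  #include <vector>

  int
  main ()
  {
    /* The load permute Lo from the f2 example.  */
    std::vector<unsigned> lo = { 2, 0, 6, 4 };

    /* Lo': the loads in index order.  */
    std::vector<unsigned> sorted = lo;
    std::sort (sorted.begin (), sorted.end ());

    /* Reject duplicate sources so the permute stays bijective.  */
    if (std::adjacent_find (sorted.begin (), sorted.end ())
        != sorted.end ())
      return 1;

    /* La[i] is the position of Lo[i] in Lo'.  */
    std::vector<unsigned> la (lo.size ());
    for (unsigned i = 0; i < lo.size (); ++i)
      la[i] = (std::lower_bound (sorted.begin (), sorted.end (), lo[i])
               - sorted.begin ());

    for (unsigned x : la)
      printf ("%u ", x);        /* prints 1 0 3 2  */
    printf ("\n");
    return 0;
  }
)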
Yes.  I think the comment meant that permutes that only select a part
of the vector are not supported, but I'd have to go back to the
original vect_attempt_slp_rearrange_stmts (which was quite limited) to
check.  The idea there was that the "unpermuted" vector load would
always be a natively supported operation and thus always better than
the permuted one.  I think extending that is OK.  Ideally targets
would expose multi-step permutes here so we can better assess cost.

> > (I'm unsure about the std:: stuff but mostly because I'm not
> > familiar with it and its semantics)
> >
> > The current "propagation" isn't really designed to find
> > optimal solutions; it rather relies on the initial
> > permute "anticipation" we set up and simply propagates
> > those up as far as possible.  I think at some point we need
> > to come up with something more elaborate and rewrite this all.
> >
> > It might also be that a target handles { 2, 0 } but not
> > { 0, 2 } while we know it handles the unpermuted case always.
> 
> OK.  I guess we should reject new permutes if
> vect_transform_slp_perm_load returns false for them and true for
> the original.

Yes, for example.  Or somehow "compute" a permute that makes the
permute supported (since if it ends up unsupported we disqualify the
vectorization later).

> > Can you think of a way to express this as a dataflow problem
> > that also captures finding the optimal solution?
> 
> I think it's inevitably going to be a two-way thing: propagate what's
> best for producers as far as possible, then propagate back what's
> best for consumers as far as possible, or the other way around.

Yeah, with the additional constraint of what the target handles.

Richard.

> Thanks,
> Richard
> 
> >
> >>        vec<unsigned> perm = vNULL;
> >>        perm.safe_grow (SLP_TREE_LANES (node), true);
> >>        for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> >> -        perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> >> +        {
> >> +          auto it = std::lower_bound (sorted.begin (), sorted.end (),
> >> +                                      SLP_TREE_LOAD_PERMUTATION (node)[j]);
> >> +          perm[j] = it - sorted.begin ();
> >> +        }
> >>        perms.safe_push (perm);
> >>        vertices[idx].perm_in = perms.length () - 1;
> >>        vertices[idx].perm_out = perms.length () - 1;
> >> @@ -6647,7 +6634,7 @@ vect_transform_slp_perm_load (vec_info *vinfo,
> >>  {
> >>    stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
> >>    int vec_index = 0;
> >> -  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> >> +  tree vectype = SLP_TREE_VECTYPE (node);
> >
> > that looks like an obvious fix.
> >
> > Richard.
> >
> >>    unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
> >>    unsigned int mask_element;
> >>    machine_mode mode;
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstra
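[Postscript, for concreteness: a deliberately tiny toy model of the
two-way propagation idea discussed above.  Everything here is invented
for the sketch (the graph encoding, the integer permute ids, the
propagate helper); it is not GCC code, just an illustration of a
forward producers-to-consumers pass followed by a backward
consumers-to-producers pass.

  #include <cstdio>
  #include <vector>

  /* Propagate a permute preference through "don't care" (-1) nodes.
     Nodes are assumed to be numbered in topological order, so a
     forward walk visits producers first and a reverse walk visits
     consumers first.  */
  static void
  propagate (const std::vector<std::vector<int>> &edges,
             std::vector<int> &pref, bool reverse)
  {
    int n = (int) pref.size ();
    for (int k = 0; k < n; ++k)
      {
        int i = reverse ? n - 1 - k : k;
        if (pref[i] != -1 || edges[i].empty ())
          continue;
        int p = pref[edges[i][0]];
        for (int j : edges[i])
          if (pref[j] != p)
            p = -1;             /* Neighbours disagree: stay undecided.  */
        pref[i] = p;
      }
  }

  int
  main ()
  {
    /* Toy graph: loads 0 and 1 feed an add (2) which feeds a store (3).
       The loads prefer permute id 0 (index order); the store imposes
       permute id 1.  */
    std::vector<std::vector<int>> preds = { {}, {}, {0, 1}, {2} };
    std::vector<std::vector<int>> succs = { {2}, {2}, {3}, {} };

    std::vector<int> from_defs = { 0, 0, -1, -1 };
    std::vector<int> from_uses = { -1, -1, -1, 1 };

    propagate (preds, from_defs, false);   /* producers -> consumers  */
    propagate (succs, from_uses, true);    /* consumers -> producers  */

    /* Wherever the two directions resolve to different permutes, an
       explicit VEC_PERM_EXPR has to be materialized somewhere on that
       producer->consumer path; choosing the cheapest such place,
       subject to what the target supports, is the optimization
       problem.  */
    for (unsigned i = 0; i < from_defs.size (); ++i)
      printf ("node %u: defs want %d, uses want %d\n",
              i, from_defs[i], from_uses[i]);
    return 0;
  }
]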