On January 20, 2021 12:17:35 PM GMT+01:00, Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>duplicate_and_interleave is the main fallback way of loading
>a repeating sequence of elements into variable-length vectors.
>The code handles cases in which the number of elements in the
>sequence is potentially several times greater than the number
>of elements in a vector.
>
>Let:
>
>- NE be the (compile-time) number of elements in the sequence,
>- NR be the (compile-time) number of vector results, and
>- VE be the (run-time) number of elements in each vector.
>
>The basic approach is to duplicate each element into a
>separate vector, giving NE vectors in total, then use
>log2(NE) rows of NE permutes to generate NE results.
>
>In the worst case, when VE has no known compile-time factor
>and NR >= NE, all of these permutes are necessary.  However,
>if VE is known to be a multiple of 2**F, then each of the
>first F permute rows produces duplicate results; specifically,
>the high permute for a given pair is the same as the low permute.
>The code dealt with this by reusing the low result for the
>high result.  This part was OK.
>
>However, having duplicate results from one row meant that the
>next row did duplicate work.  The redundancies would be optimised
>away by later passes, but the code tried to avoid generating them
>in the first place.  This is the part that went wrong.
>
>Specifically, NR is typically less than NE when some permutes are
>redundant, so the code tried to use NR to reduce the amount of work
>performed.  The problem was that, although it correctly calculated
>a conservative bound on how many results were needed in each row,
>it chose the wrong results for anything other than the final row.
>
>This doesn't usually matter for fully-packed SVE vectors.  We first
>try to coalesce smaller elements into larger ones, so normally
>VE ends up being 2**VQ (where VQ is the number of 128-bit blocks
>in an SVE vector).
>In that situation we'd only apply the faulty optimisation to the
>final row, i.e. the case it handled correctly.  E.g. for things like:
>
>  void
>  f (long *x)
>  {
>    for (int i = 0; i < 100; i += 8)
>      {
>        x[i] += 1;
>        x[i + 1] += 2;
>        x[i + 2] += 3;
>        x[i + 3] += 4;
>        x[i + 4] += 5;
>        x[i + 5] += 6;
>        x[i + 6] += 7;
>        x[i + 7] += 8;
>      }
>  }
>
>(already tested by the testsuite), we'd have 3 rows of permutes
>producing 4 vector results.  The scheme produced:
>
>1st row: 8 results from 4 permutes, highs duplicates of lows
>2nd row: 8 results from 8 permutes (half of which are actually redundant)
>3rd row: 4 results from 4 permutes
>
>However, coalescing elements is trickier for unpacked vectors,
>and at the moment we don't try to do it (see the GET_MODE_SIZE
>check in can_duplicate_and_interleave_p).  Unpacked vectors
>therefore stress the code in ways that packed vectors didn't.
>
>The patch fixes this by removing the redundancies from each row,
>rather than trying to work around them later.  This also removes
>the redundant work in the second row of the example above.
>
>Tested on aarch64-linux-gnu, aarch64_be-elf and (redundantly)
>x86_64-linux-gnu.  OK for trunk and branches?
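For readers following along, the redundancy described above can be seen in a small Python model of the permute tree.  This is not the GCC code, just a sketch under simplifying assumptions: `zip_lo`/`zip_hi` stand in for the ZIP1/ZIP2-style lo/hi VEC_PERM_EXPRs, and VE is pinned to a concrete value (a multiple of NE), whereas in the compiler it is a run-time quantity:

```python
def zip_lo(a, b):
    """Interleave the low halves of a and b (ZIP1-style lo permute)."""
    h = len(a) // 2
    return [x for j in range(h) for x in (a[j], b[j])]

def zip_hi(a, b):
    """Interleave the high halves of a and b (ZIP2-style hi permute)."""
    h = len(a) // 2
    return [x for j in range(h) for x in (a[h + j], b[h + j])]

NE = 4   # elements in the repeating sequence
VE = 8   # elements per vector; a multiple of NE, as for packed SVE

# Duplicate each element of the sequence across a whole vector.
pieces = [[e] * VE for e in range(NE)]

# log2(NE) rows of NE permutes each: out[2i] and out[2i + 1] are the
# lo and hi permutes of in[i] and in[i + NE/2].
rows = NE.bit_length() - 1
for _ in range(rows):
    hi_start = NE // 2
    pieces = [(zip_lo if i % 2 == 0 else zip_hi)
              (pieces[i // 2], pieces[i // 2 + hi_start])
              for i in range(NE)]

# Every result now holds the sequence 0,1,2,3 repeated; because VE is a
# multiple of NE, each hi output merely duplicated the matching lo output.
print(pieces[0])   # [0, 1, 2, 3, 0, 1, 2, 3]
```

With VE = 8 and NE = 4 the `2 * in_repeat` divisibility condition from the patch holds in every row, so all four final vectors come out identical, which is exactly the redundancy the patch avoids generating.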
Ok.

Thanks,
Richard.

>Thanks,
>Richard
>
>
>gcc/
>	PR tree-optimization/98535
>	* tree-vect-slp.c (duplicate_and_interleave): Use quick_grow_cleared.
>	If the high and low permutes are the same, remove the high permutes
>	from the working set and only continue with the low ones.
>---
> .../gcc.target/aarch64/sve/pr98535.c          | 18 +++++++
> gcc/tree-vect-slp.c                           | 49 +++++++++++--------
> 2 files changed, 46 insertions(+), 21 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
>
>diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
>index 1787ad74268..4465cf7494e 100644
>--- a/gcc/tree-vect-slp.c
>+++ b/gcc/tree-vect-slp.c
>@@ -5063,7 +5063,7 @@ duplicate_and_interleave (vec_info *vinfo, gimple_seq *seq, tree vector_type,
>
>   tree_vector_builder partial_elts;
>   auto_vec<tree, 32> pieces (nvectors * 2);
>-  pieces.quick_grow (nvectors * 2);
>+  pieces.quick_grow_cleared (nvectors * 2);
>   for (unsigned int i = 0; i < nvectors; ++i)
>     {
>       /* (2) Replace ELTS[0:NELTS] with ELTS'[0:NELTS'], where each element of
>@@ -5082,53 +5082,60 @@ duplicate_and_interleave (vec_info *vinfo, gimple_seq *seq, tree vector_type,
>   /* (4) Use a tree of VEC_PERM_EXPRs to create a single VM with the
>      correct byte contents.
>
>-     We need to repeat the following operation log2(nvectors) times:
>+     Conceptually, we need to repeat the following operation
>+     log2(nvectors) times, where hi_start = nvectors / 2:
>
>        out[i * 2] = VEC_PERM_EXPR (in[i], in[i + hi_start], lo_permute);
>        out[i * 2 + 1] = VEC_PERM_EXPR (in[i], in[i + hi_start], hi_permute);
>
>      However, if each input repeats every N elements and the VF is
>-     a multiple of N * 2, the HI result is the same as the LO.  */
>+     a multiple of N * 2, the HI result is the same as the LO result.
>+     This will be true for the first N1 iterations of the outer loop,
>+     followed by N2 iterations for which both the LO and HI results
>+     are needed.
>+     I.e.:
>+
>+       N1 + N2 = log2(nvectors)
>+
>+     Each "N1 iteration" doubles the number of redundant vectors and the
>+     effect of the process as a whole is to have a sequence of
>+     nvectors/2**N1 vectors that repeats 2**N1 times.  Rather than
>+     generate these redundant vectors, we halve the number of vectors
>+     for each N1 iteration.  */
>   unsigned int in_start = 0;
>   unsigned int out_start = nvectors;
>-  unsigned int hi_start = nvectors / 2;
>-  /* A bound on the number of outputs needed to produce NRESULTS results
>-     in the final iteration.  */
>-  unsigned int noutputs_bound = nvectors * nresults;
>+  unsigned int new_nvectors = nvectors;
>   for (unsigned int in_repeat = 1; in_repeat < nvectors; in_repeat *= 2)
>     {
>-      noutputs_bound /= 2;
>-      unsigned int limit = MIN (noutputs_bound, nvectors);
>-      for (unsigned int i = 0; i < limit; ++i)
>+      unsigned int hi_start = new_nvectors / 2;
>+      unsigned int out_i = 0;
>+      for (unsigned int in_i = 0; in_i < new_nvectors; ++in_i)
>        {
>-          if ((i & 1) != 0
>+          if ((in_i & 1) != 0
>              && multiple_p (TYPE_VECTOR_SUBPARTS (new_vector_type),
>                             2 * in_repeat))
>-            {
>-              pieces[out_start + i] = pieces[out_start + i - 1];
>-              continue;
>-            }
>+            continue;
>
>          tree output = make_ssa_name (new_vector_type);
>-          tree input1 = pieces[in_start + (i / 2)];
>-          tree input2 = pieces[in_start + (i / 2) + hi_start];
>+          tree input1 = pieces[in_start + (in_i / 2)];
>+          tree input2 = pieces[in_start + (in_i / 2) + hi_start];
>          gassign *stmt = gimple_build_assign (output, VEC_PERM_EXPR,
>                                               input1, input2,
>-                                               permutes[i & 1]);
>+                                               permutes[in_i & 1]);
>          gimple_seq_add_stmt (seq, stmt);
>-          pieces[out_start + i] = output;
>+          pieces[out_start + out_i] = output;
>+          out_i += 1;
>        }
>       std::swap (in_start, out_start);
>+      new_nvectors = out_i;
>     }
>
>   /* (5) Use VIEW_CONVERT_EXPR to cast the final VM to the required type.
>      */
>   results.reserve (nresults);
>   for (unsigned int i = 0; i < nresults; ++i)
>-    if (i < nvectors)
>+    if (i < new_nvectors)
>       results.quick_push (gimple_build (seq, VIEW_CONVERT_EXPR, vector_type,
>                                         pieces[in_start + i]));
>     else
>-      results.quick_push (results[i - nvectors]);
>+      results.quick_push (results[i - new_nvectors]);
> }
>
>
>diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c b/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
>new file mode 100644
>index 00000000000..6873a38734d
>--- /dev/null
>+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
>@@ -0,0 +1,18 @@
>+/* { dg-options "-O3 -mtune=neoverse-v1" } */
>+
>+typedef short a;
>+
>+typedef struct {
>+  a b, c, d, e;
>+} f;
>+
>+f *g;
>+
>+long h;
>+
>+void
>+i() {
>+  f j;
>+  for (; h; h++)
>+    *g++ = j;
>+}
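As a footnote for anyone studying the patch: the effect of the fixed loop on permute counts can be modelled in a few lines of Python.  This is only an illustration, not the GCC code; the names `interleave_counts` and `ve_factor` are invented here, with `ve_factor` standing for the known compile-time factor of VE (the quantity the `multiple_p` test in the patch checks against `2 * in_repeat`):

```python
def interleave_counts(nvectors, ve_factor):
    """Model of the fixed loop: permutes emitted per row.

    A hi permute is skipped when it would duplicate the lo one,
    i.e. when the known factor of VE is a multiple of 2 * in_repeat;
    the working set (new_nvectors) then shrinks for the next row.
    """
    new_nvectors = nvectors
    in_repeat = 1
    counts = []
    while in_repeat < nvectors:
        emitted = 0
        for in_i in range(new_nvectors):
            if in_i % 2 == 1 and ve_factor % (2 * in_repeat) == 0:
                continue  # hi result would equal the lo result
            emitted += 1
        counts.append(emitted)
        new_nvectors = emitted
        in_repeat *= 2
    return counts

# The long-element example above: nvectors == 8, VE a known multiple of 2.
# Old scheme: 4 + 8 + 4 = 16 permutes; fixed scheme: 4 + 4 + 4 = 12.
print(interleave_counts(8, 2))   # [4, 4, 4]
```

When VE's known factor covers every row (e.g. `interleave_counts(8, 8)`), the working set halves each row, matching the "sequence of nvectors/2**N1 vectors that repeats 2**N1 times" description in the new comment.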