On January 20, 2021 12:17:35 PM GMT+01:00, Richard Sandiford via Gcc-patches 
<gcc-patches@gcc.gnu.org> wrote:
>duplicate_and_interleave is the main fallback way of loading
>a repeating sequence of elements into variable-length vectors.
>The code handles cases in which the number of elements in the
>sequence is potentially several times greater than the number
>of elements in a vector.
>
>Let:
>
>- NE be the (compile-time) number of elements in the sequence
>- NR be the (compile-time) number of vector results and
>- VE be the (run-time) number of elements in each vector
>
>The basic approach is to duplicate each element into a
>separate vector, giving NE vectors in total, then use
>log2(NE) rows of NE permutes to generate NE results.
>
>In the worst case — when VE has no known compile-time factor
>and NR >= NE — all of these permutes are necessary.  However,
>if VE is known to be a multiple of 2**F, then each of the
>first F permute rows produce duplicate results; specifically,
>the high permute for a given pair is the same as the low permute.
>The code dealt with this by reusing the low result for the
>high result.  This part was OK.
>
>However, having duplicate results from one row meant that the
>next row did duplicate work.  The redundancies would be optimised
>away by later passes, but the code tried to avoid generating them
>in the first place.  This is the part that went wrong.
>
>Specifically, NR is typically less than NE when some permutes are
>redundant, so the code tried to use NR to reduce the amount of work
>performed.  The problem was that, although it correctly calculated
>a conservative bound on how many results were needed in each row,
>it chose the wrong results for anything other than the final row.
>
>This doesn't usually matter for fully-packed SVE vectors.  We first
>try to coalesce smaller elements into larger ones, so normally
>VE ends up being 2**VQ (where VQ is the number of 128-bit blocks
>in an SVE vector).  In that situation we'd only apply the faulty
>optimisation to the final row, i.e. the case it handled correctly.
>E.g. for things like:
>
>  void
>  f (long *x)
>  {
>    for (int i = 0; i < 100; i += 8)
>      {
>        x[i] += 1;
>        x[i + 1] += 2;
>        x[i + 2] += 3;
>        x[i + 3] += 4;
>        x[i + 4] += 5;
>        x[i + 5] += 6;
>        x[i + 6] += 7;
>        x[i + 7] += 8;
>      }
>  }
>
>(already tested by the testsuite), we'd have 3 rows of permutes
>producing 4 vector results.  The schemne produced:
>
>1st row: 8 results from 4 permutes, highs duplicates of lows
>2nd row: 8 results from 8 permutes (half of which are actually
>redundant)
>3rd row: 4 results from 4 permutes
>
>However, coalescing elements is trickier for unpacked vectors,
>and at the moment we don't try to do it (see the GET_MODE_SIZE
>check in can_duplicate_and_interleave_p).  Unpacked vectors
>therefore stress the code in ways that packed vectors didn't.
>
>The patch fixes this by removing the redundancies from each row,
>rather than trying to work around them later.  This also removes
>the redundant work in the second row of the example above.
>
>Tested on aarch64-linux-gnu, aarch64_be-elf and (redundantly)
>x86_64-linux-gnu.  OK for trunk and branches?

Ok. 

Thanks, 
Richard. 

>Thanks,
>Richard
>
>
>gcc/
>       PR tree-optimization/98535
>       * tree-vect-slp.c (duplicate_and_interleave): Use quick_grow_cleared.
>       If the high and low permutes are the same, remove the high permutes
>       from the working set and only continue with the low ones.
>---
> .../gcc.target/aarch64/sve/pr98535.c          | 18 +++++++
> gcc/tree-vect-slp.c                           | 49 +++++++++++--------
> 2 files changed, 46 insertions(+), 21 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
>
>diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
>index 1787ad74268..4465cf7494e 100644
>--- a/gcc/tree-vect-slp.c
>+++ b/gcc/tree-vect-slp.c
>@@ -5063,7 +5063,7 @@ duplicate_and_interleave (vec_info *vinfo,
>gimple_seq *seq, tree vector_type,
> 
>   tree_vector_builder partial_elts;
>   auto_vec<tree, 32> pieces (nvectors * 2);
>-  pieces.quick_grow (nvectors * 2);
>+  pieces.quick_grow_cleared (nvectors * 2);
>   for (unsigned int i = 0; i < nvectors; ++i)
>     {
>/* (2) Replace ELTS[0:NELTS] with ELTS'[0:NELTS'], where each element
>of
>@@ -5082,53 +5082,60 @@ duplicate_and_interleave (vec_info *vinfo,
>gimple_seq *seq, tree vector_type,
>   /* (4) Use a tree of VEC_PERM_EXPRs to create a single VM with the
>        correct byte contents.
> 
>-     We need to repeat the following operation log2(nvectors) times:
>+     Conceptually, we need to repeat the following operation
>log2(nvectors)
>+     times, where hi_start = nvectors / 2:
> 
>       out[i * 2] = VEC_PERM_EXPR (in[i], in[i + hi_start], lo_permute);
>       out[i * 2 + 1] = VEC_PERM_EXPR (in[i], in[i + hi_start], hi_permute);
> 
>      However, if each input repeats every N elements and the VF is
>-     a multiple of N * 2, the HI result is the same as the LO.  */
>+     a multiple of N * 2, the HI result is the same as the LO result.
>+     This will be true for the first N1 iterations of the outer loop,
>+     followed by N2 iterations for which both the LO and HI results
>+     are needed.  I.e.:
>+
>+      N1 + N2 = log2(nvectors)
>+
>+     Each "N1 iteration" doubles the number of redundant vectors and
>the
>+     effect of the process as a whole is to have a sequence of
>nvectors/2**N1
>+     vectors that repeats 2**N1 times.  Rather than generate these
>redundant
>+     vectors, we halve the number of vectors for each N1 iteration. 
>*/
>   unsigned int in_start = 0;
>   unsigned int out_start = nvectors;
>-  unsigned int hi_start = nvectors / 2;
>-  /* A bound on the number of outputs needed to produce NRESULTS
>results
>-     in the final iteration.  */
>-  unsigned int noutputs_bound = nvectors * nresults;
>+  unsigned int new_nvectors = nvectors;
> for (unsigned int in_repeat = 1; in_repeat < nvectors; in_repeat *= 2)
>     {
>-      noutputs_bound /= 2;
>-      unsigned int limit = MIN (noutputs_bound, nvectors);
>-      for (unsigned int i = 0; i < limit; ++i)
>+      unsigned int hi_start = new_nvectors / 2;
>+      unsigned int out_i = 0;
>+      for (unsigned int in_i = 0; in_i < new_nvectors; ++in_i)
>       {
>-        if ((i & 1) != 0
>+        if ((in_i & 1) != 0
>             && multiple_p (TYPE_VECTOR_SUBPARTS (new_vector_type),
>                            2 * in_repeat))
>-          {
>-            pieces[out_start + i] = pieces[out_start + i - 1];
>-            continue;
>-          }
>+          continue;
> 
>         tree output = make_ssa_name (new_vector_type);
>-        tree input1 = pieces[in_start + (i / 2)];
>-        tree input2 = pieces[in_start + (i / 2) + hi_start];
>+        tree input1 = pieces[in_start + (in_i / 2)];
>+        tree input2 = pieces[in_start + (in_i / 2) + hi_start];
>         gassign *stmt = gimple_build_assign (output, VEC_PERM_EXPR,
>                                              input1, input2,
>-                                             permutes[i & 1]);
>+                                             permutes[in_i & 1]);
>         gimple_seq_add_stmt (seq, stmt);
>-        pieces[out_start + i] = output;
>+        pieces[out_start + out_i] = output;
>+        out_i += 1;
>       }
>       std::swap (in_start, out_start);
>+      new_nvectors = out_i;
>     }
> 
>/* (5) Use VIEW_CONVERT_EXPR to cast the final VM to the required type.
> */
>   results.reserve (nresults);
>   for (unsigned int i = 0; i < nresults; ++i)
>-    if (i < nvectors)
>+    if (i < new_nvectors)
> results.quick_push (gimple_build (seq, VIEW_CONVERT_EXPR, vector_type,
>                                       pieces[in_start + i]));
>     else
>-      results.quick_push (results[i - nvectors]);
>+      results.quick_push (results[i - new_nvectors]);
> }
> 
> 
>diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
>b/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
>new file mode 100644
>index 00000000000..6873a38734d
>--- /dev/null
>+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
>@@ -0,0 +1,18 @@
>+/* { dg-options "-O3 -mtune=neoverse-v1" } */
>+
>+typedef short a;
>+
>+typedef struct {
>+  a b, c, d, e;
>+} f;
>+
>+f *g;
>+
>+long h;
>+
>+void
>+i() {
>+  f j;
>+  for (; h; h++)
>+    *g++ = j;
>+}

Reply via email to