Hi Richi,
So I'm trying to look at what IVOPTs does right now and how it might be
able to help us. Looking at these two code examples:
#include <stddef.h>
#if 0
int foo(short * a, short * b, unsigned int n)
{
  int sum = 0;
  for (unsigned int i = 0; i < n; ++i)
    sum += a[i] + b[i];
  return sum;
}
#else
int bar (short * a, short *b, unsigned int n)
{
  int sum = 0;
  unsigned int i = 0;
  for (; i < (n / 16); i += 1)
    {
      // Accesses indices [0, 16, .., ((n / 16) - 1) * 16]
      // Example n = 127,
      // n / 16 = 7, so accesses [0, 16, 32, 48, 64, 80, 96]
      sum += a[i*16] + b[i*16];
    }
  for (size_t j = (size_t) ((n / 16) * 16); j < n; ++j)
    {
      // Iterates [(n / 16) * 16, (n / 16) * 16 + 1, .., n - 1]
      // Example n = 127,
      // j starts at (127 / 16) * 16 = 7 * 16 = 112,
      // So iterates over [112, 113, 114, 115, ..., 126]
      sum += a[j] + b[j];
    }
  return sum;
}
#endif
I compiled the bottom one (#if 0) for 'aarch64-linux-gnu' with the
following options: '-O3 -march=armv8-a -fno-tree-vectorize
-fdump-tree-ivopts-all -fno-unroll-loops'. See the godbolt link here:
https://godbolt.org/z/MEf6j6ebM
I tried to see what IVOPTs would make of this. It is able to analyze
the IVs, but it doesn't realize (I'm not even sure it tries) that one
IV's end (loop 1) could be used as the base for the other (loop 2). I
don't know if this is where you'd want such optimizations to be made.
On the one hand I think it would be great, as it would also help with
non-vectorized loops, as you alluded to.
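To make concrete what I mean by using one IV's end as the base for the
other, here is a hand-written sketch of the shape I would hope to get
for bar. The function name bar_merged and the names pa, pb, a_end and
a_main_end are made up for illustration; this is not what IVOPTs
produces today, just an assumption about what a merged form could look
like:

```c
#include <stddef.h>

/* Hypothetical hand-written variant of bar (): the pointer IVs of the
   first loop stay live at its exit and are reused as the starting
   pointers of the second loop, so no base is recomputed from n.  */
int bar_merged (short *a, short *b, unsigned int n)
{
  int sum = 0;
  short *pa = a;
  short *pb = b;
  /* One past the last element of a.  */
  short *a_end = a + n;
  /* Where loop 1 stops: a + (n / 16) * 16.  */
  short *a_main_end = a + (size_t) (n / 16) * 16;

  /* Loop 1: same accesses as a[i*16] and b[i*16], stride of 16.  */
  for (; pa != a_main_end; pa += 16, pb += 16)
    sum += *pa + *pb;

  /* Loop 2: starts from the final values of pa and pb.  */
  for (; pa != a_end; ++pa, ++pb)
    sum += *pa + *pb;

  return sum;
}
```

The second loop needs no '(size_t) ((n / 16) * 16)' computation and no
conversion of its own, since its initial pointers are simply the final
values of the first loop's IVs.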
However, if you compile the top test case (#if 1) and let the
tree-vectorizer have a go, you will see different behaviours for
different vectorization approaches.
For '-O3 -march=armv8-a', using NEON and epilogue vectorization, IVOPTs
seems to pick up only one loop.
If you use '-O3 -march=armv8-a+sve --param vect-partial-vector-usage=1'
it will detect two loops. This may well be because epilogue
vectorization 'un-loops' the epilogue, as it knows it will only have to
do one iteration of the vectorized epilogue.
vect-partial-vector-usage=1 could have done the same, but it fails to
because we are dealing with polymorphic vector modes. I have a hack
that works around this for vect-partial-vector-usage, but I think we
can probably do better and try to reason about boundaries in poly_ints
rather than integers (TBC).
Anyway, I digress. Back to the main question of this patch. How do you
suggest I go about this? Is there a way to make IVOPTs aware of the
'iterate-once' IVs in the epilogue(s) (both vector and scalar!) and then
teach it to merge IVs if one ends where the other begins?
On 04/05/2021 10:56, Richard Biener wrote:
On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:
Hi,
The aim of this RFC is to explore a way of cleaning up the codegen around
data_references. To be specific, I'd like to reuse the main-loop's updated
data_reference as the base_address for the epilogue's corresponding
data_reference, rather than use the niters. We have found this leads to
better codegen in the vectorized epilogue loops.
The approach in this RFC creates a map of iv_updates, each of which always
contains an updated pointer that is captured in vectorizable_{load,store}. An
iv_update may also contain a skip_edge in case we decide the vectorization can
be skipped in 'vect_do_peeling'. During the epilogue update this map of
iv_updates is checked to see if it contains an entry for a data_reference. If
so, that entry is used; if not, we revert to the old behavior of using the
niters to advance the data_reference.
The motivation for this work is to improve codegen for the option `--param
vect-partial-vector-usage=1` for SVE. We found that one of the main problems
for the codegen here was coming from unnecessary conversions caused by the way
we update the data_references in the epilogue.
This patch passes regression tests in aarch64-linux-gnu, but the codegen is
still not optimal in some cases, specifically those where we have a scalar
epilogue, as this does not use the data_references and relies on the
gimple scalar code, thus again constructing a memory access using the niters.
This is a limitation for which I haven't quite worked out a solution yet and
does cause some minor regressions due to unfortunate spills.
Let me know what you think and if you have ideas of how we can better achieve
this.
Hmm, so the patch adds a kludge to improve the kludge we have in place ;)
I think it might be interesting to create a C testcase mimicking the
update problem without involving the vectorizer. That way we can
see how the various components involved behave (FRE + ivopts most
specifically).
That said, a cleaner approach to dealing with this would be to
explicitly track the IVs we generate for vectorized DRs, eventually
factoring that out from vectorizable_{store,load} so we can simply
carry over the actual pointer IV final value to the epilogue as
initial value. For each DR group we'd create a single IV (we can
even do better in case we have load + store of the "same" group).
We already kind-of track things via the ivexpr_map, but I'm not sure
if this lazily populated map can be reliably re-used to "re-populate"
the epilogue one (walk the map, create epilogue IVs with the appropriate
initial value & adjusted update).
Richard.
Kind regards,
Andre Vieira