Hi Richi,

So I'm trying to look at what IVOPTs does right now and how it might be able to help us. Looking at these two code examples:
#include <stddef.h>
#if 0
int foo(short * a, short * b, unsigned int n)
{
    int sum = 0;
    for (unsigned int i = 0; i < n; ++i)
        sum += a[i] + b[i];

    return sum;
}


#else

int bar (short * a, short *b, unsigned int n)
{
    int sum = 0;
    unsigned int i = 0;
    for (; i < (n / 16); i += 1)
    {
        // The index i*16 takes the values [0, 16, ..., (n/16 - 1) * 16]
        // Example n = 127: n/16 = 7,
        // so it iterates over [0, 16, 32, 48, 64, 80, 96]
        sum += a[i*16] + b[i*16];
    }
    for (size_t j = (size_t) ((n / 16) * 16); j < n; ++j)
    {
        // j takes the values [(n/16) * 16, (n/16) * 16 + 1, ..., n - 1]
        // Example n = 127:
        // j starts at (127/16) * 16 = 7 * 16 = 112,
        // so it iterates over [112, 113, 114, 115, ..., 126]
        sum += a[j] + b[j];
    }
    return sum;
}
#endif

I compiled the bottom test case (bar, the '#else' branch) for 'aarch64-linux-gnu' with the options '-O3 -march=armv8-a -fno-tree-vectorize -fdump-tree-ivopts-all -fno-unroll-loops'. See the godbolt link here: https://godbolt.org/z/MEf6j6ebM

I tried to see what IVOPTs would make of this. It is able to analyze the IVs, but it doesn't realize (I'm not even sure it tries) that one IV's end value (loop 1) could be used as the base for the other IV (loop 2). I don't know if this is where you'd want such optimizations to be made; on the one hand I think it would be great, as it would also help with non-vectorized loops, as you alluded to.
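To make concrete what I mean by "one IV's end used as the base for the other", here is a hand-written sketch (mine, not output from any pass) of the shape we would ideally get for 'bar': the main loop's pointer IVs end exactly where the epilogue starts, so the epilogue reuses their final values as its bases instead of re-deriving a + (n/16)*16 from 'n':

```c
#include <stddef.h>

/* Sketch of the desired post-IVOPTs shape of 'bar': the epilogue's bases
   are the main loop's final pointer values, not expressions in 'n'.  */
int bar_merged (short *a, short *b, unsigned int n)
{
    int sum = 0;
    short *pa = a;
    short *pb = b;
    short *pa_end = a + (size_t) (n / 16) * 16;  /* main-loop bound */
    for (; pa < pa_end; pa += 16, pb += 16)
        sum += *pa + *pb;            /* same accesses as a[i*16] + b[i*16] */
    /* Epilogue: pa and pb already point at element (n/16)*16, so no
       re-derivation from 'n' is needed to form its base addresses.  */
    for (short *end = a + n; pa < end; ++pa, ++pb)
        sum += *pa + *pb;
    return sum;
}
```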

However, if you compile the top test case (foo, with '#if 1') and let the tree-vectorizer have a go, you will see different behaviours for different vectorization approaches. With '-O3 -march=armv8-a', using NEON and epilogue vectorization, it seems IVOPTs only picks up on one loop. If you use '-O3 -march=armv8-a+sve --param vect-partial-vector-usage=1' it will detect two loops. This may well be because epilogue vectorization in fact 'un-loops' the epilogue, since it knows it will only have to do one iteration of the vectorized epilogue. vect-partial-vector-usage=1 could have done the same, but because we are dealing with polymorphic vector modes it fails to. I have a hack that works around this for vect-partial-vector-usage, but I think we can probably do better and try to reason about boundaries in poly_ints rather than integers (TBC).

Anyway, I digress. Back to the main question of this patch: how do you suggest I go about this? Is there a way to make IVOPTs aware of the 'iterate-once' IVs in the epilogue(s) (both vector and scalar!) and then teach it to merge IVs if one ends where the other begins?

On 04/05/2021 10:56, Richard Biener wrote:
On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:

Hi,

The aim of this RFC is to explore a way of cleaning up the codegen around
data_references.  To be specific, I'd like to reuse the main-loop's updated
data_reference as the base_address for the epilogue's corresponding
data_reference, rather than use the niters.  We have found this leads to
better codegen in the vectorized epilogue loops.

The approach in this RFC creates a map of iv_updates, which always contains an
updated pointer that is captured in vectorizable_{load,store}; an iv_update may
also contain a skip_edge in case we decide the vectorization can be skipped in
'vect_do_peeling'. During the epilogue update this map of iv_updates is then
checked to see if it contains an entry for a data_reference; if so, it is used
accordingly, and if not we revert to the old behaviour of using the niters
to advance the data_reference.

The motivation for this work is to improve codegen for the option `--param
vect-partial-vector-usage=1` for SVE. We found that one of the main problems
for the codegen here was coming from unnecessary conversions caused by the way
we update the data_references in the epilogue.

This patch passes regression tests on aarch64-linux-gnu, but the codegen is
still not optimal in some cases. Specifically those where we have a scalar
epilogue, as this does not use the data_references and will rely on the
gimple scalar code, thus constructing again a memory access using the niters.
This is a limitation for which I haven't quite worked out a solution yet, and
it does cause some minor regressions due to unfortunate spills.

Let me know what you think and if you have ideas of how we can better achieve
this.
Hmm, so the patch adds a kludge to improve the kludge we have in place ;)

I think it might be interesting to create a C testcase mimicking the
update problem without involving the vectorizer.  That way we can
see how the various components involved behave (FRE + ivopts most
specifically).

That said, a cleaner approach to dealing with this would be to
explicitly track the IVs we generate for vectorized DRs, eventually
factoring that out from vectorizable_{store,load} so we can simply
carry over the actual pointer IV final value to the epilogue as
the initial value.  For each DR group we'd create a single IV (we can
even do better in case we have a load + store of the "same" group).
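[To make that concrete, a plain-C illustration (my sketch, not code from the patch or from GCC) of a single pointer IV serving both the load and the store of the same data reference, with its final value after the main loop carried over as the epilogue's initial value:]

```c
/* Sketch: one pointer IV 'p' is shared by the load and the store of the
   same data reference, and its final value after the "vectorized" main
   loop seeds the epilogue directly.  */
void scale_by_two (short *a, unsigned int n)
{
    short *p = a;
    short *main_end = a + (n / 8) * 8;   /* 8 stands in for the VF */
    for (; p < main_end; p += 8)         /* main loop: one IV per DR group */
        for (int k = 0; k < 8; ++k)
            p[k] = (short) (p[k] * 2);   /* load and store both use 'p' */
    /* Epilogue: its initial value is the main loop's final 'p'; there is
       no niters-based re-derivation of a + (n/8)*8.  */
    for (short *end = a + n; p < end; ++p)
        *p = (short) (*p * 2);
}
```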

We already kind-of track things via the ivexpr_map, but I'm not sure
if this lazily populated map can be reliably re-used to "re-populate"
the epilogue one (walk the map, create epilogue IVs with the appropriate
initial value & adjusted update).

Richard.

Kind regards,
Andre Vieira


