On Tue, Nov 10, 2015 at 1:45 PM, Richard Biener <richard.guent...@gmail.com> wrote: > On Mon, Nov 9, 2015 at 3:59 PM, Alan Lawrence <alan.lawre...@arm.com> wrote: >> On 06/11/15 12:55, Richard Biener wrote: >>> >>>> + /* GROUP_GAP of the first group now has to skip over the second group >>>> too. */ >>>> + GROUP_GAP (first_vinfo) += group2_size; >>> >>> Please add a MSG_NOTE debug printf stating that we split the group and >>> at which element. >> >> Done. >> >>> I think you want to add && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)) >>> otherwise this could be SLP reductions where there is no way the split >>> group would enable vectorization. >> >> Ah, I had thought that the (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))) >> check >> sufficed for that, as per e.g. the section above >> /* Create a node (a root of the SLP tree) for the packed grouped stores. */ >> But done. >> >>> Note that BB vectorization now also very aggressively falls back to >>> considering >>> non-matches being "external". >>> >>> Not sure why that doesn't trigger for your testcases ;) >> >> I tested against trunk r229944, on which all of my scan-tree-dump's were >> failing.... >> >>> I'm comfortable with the i < group_size half of the patch. For the other >>> piece >>> I'd like to see more compile-time / success data from, say, building >>> SPEC CPU 2006. >> >> Well, as a couple of quick data points, a compile of SPEC2000 on >> aarch64-none-linux-gnu (-Ofast -fomit-frame-pointer -mcpu=cortex-a57), I >> have: >> >> 3080 successes without patch; >> +79 successes from the "i < vectorization_factor" part of the patch (on its >> own) >> +90 successes from the (i>=vectorization_factor) && "i < group_size" part >> (on its own) >> +(79 from first) +(90 from second) + (an additional 62) from both parts >> together. >> >> And for SPEC2006, aarch64-linux-gnu (-O3 -fomit-frame-pointer >> -mcpu=cortex-a57): >> 11979 successes without patch; >> + 499 from the "i < vectorization_factor" part >> + 264 from the (i >= vectorization factor) && (i < group_size)" part >> + extra 336 if both parts combined. > > Thanks > >> I haven't done any significant measurements of compile-time yet. >> >> (snipping this bit out-of-order) >>> Hmm. This is of course pretty bad for compile-time for the non-matching >>> case as that effectively will always split into N pieces if we feed it >>> garbage (that is, without being sure that at least one pice _will_ >>> vectorize). >>> >>> OTOH with the current way of computing "matches" we'd restrict ourselves >>> to cases where the first stmt we look at (and match to) happens to be >>> the operation that in the end will form a matching group. >> ... >>> Eventually we'd want to improve the "matches" return >>> to include the affected stmts (ok, that'll be not very easy) so we can do >>> a cheap "if we split here will it fix that particular mismatch" check. >> >> Yes, I think there are a bunch of things we can do here, that would be more >> powerful than the simple approach I used here. The biggest limiting factor >> will >> probably be (lack of) permutation, i.e. if we only SLP stores to consecutive >> addresses. > > Yeah, doing anything else requires us to use masked or strided stores though. > >>> So, please split the patch and I suggest to leave the controversical part >>> for next stage1 together with some improvement on the SLP build process >>> itself? >> >> Here's a reduced version with just the second case, >> bootstrapped+check-gcc/g++ on x86_64. >> >> gcc/ChangeLog: >> >> * tree-vect-slp.c (vect_split_slp_store_group): New. >> (vect_analyze_slp_instance): During basic block SLP, recurse on >> subgroups if vect_build_slp_tree fails after 1st vector. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic. >> * gcc.dg/vect/bb-slp-subgroups-1.c: New. >> * gcc.dg/vect/bb-slp-subgroups-2.c: New. >> * gcc.dg/vect/bb-slp-subgroups-3.c: New. >> * gcc.dg/vect/bb-slp-subgroups-4.c: New. >> --- >> gcc/testsuite/gcc.dg/vect/bb-slp-7.c | 10 ++-- >> gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++++ >> gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++++ >> gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++++ >> gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++++ >> gcc/tree-vect-slp.c | 74 >> +++++++++++++++++++++++++- >> 6 files changed, 246 insertions(+), 6 deletions(-) >> create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c >> create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c >> create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c >> create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c >> >> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c >> b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c >> index ab54a48..b8bef8c 100644 >> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c >> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c >> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y) >> unsigned int *pout = &out[0]; >> unsigned int a0, a1, a2, a3; >> >> - /* Non isomorphic. */ >> + /* Non isomorphic, even 64-bit subgroups. */ >> a0 = *pin++ + 23; >> - a1 = *pin++ + 142; >> + a1 = *pin++ * 142; >> a2 = *pin++ + 2; >> a3 = *pin++ * 31; >> - >> + >> *pout++ = a0 * x; >> *pout++ = a1 * y; >> *pout++ = a2 * x; >> @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y) >> >> /* Check results. */ >> if (out[0] != (in[0] + 23) * x >> - || out[1] != (in[1] + 142) * y >> + || out[1] != (in[1] * 142) * y >> || out[2] != (in[2] + 2) * x >> || out[3] != (in[3] * 31) * y) >> abort(); >> @@ -47,4 +47,4 @@ int main (void) >> } >> >> /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } >> */ >> - >> + >> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c >> b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c >> new file mode 100644 >> index 0000000..39c23c3 >> --- /dev/null >> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c >> @@ -0,0 +1,44 @@ >> +/* { dg-require-effective-target vect_int } */ >> +/* PR tree-optimization/67682. */ >> + >> +#include "tree-vect.h" >> + >> +int __attribute__((__aligned__(8))) a[8]; >> +int __attribute__((__aligned__(8))) b[4]; >> + >> +__attribute__ ((noinline)) void >> +test () >> +{ >> + a[0] = b[0]; >> + a[1] = b[1]; >> + a[2] = b[2]; >> + a[3] = b[3]; >> + a[4] = 0; >> + a[5] = 0; >> + a[6] = 0; >> + a[7] = 0; >> +} >> + >> +int >> +main (int argc, char **argv) >> +{ >> + check_vect (); >> + >> + for (int i = 0; i < 8; i++) >> + a[i] = 1; >> + for (int i = 0; i < 4; i++) >> + b[i] = i + 4; >> + __asm__ volatile ("" : : : "memory"); >> + test (a, b); >> + __asm__ volatile ("" : : : "memory"); >> + for (int i = 0; i < 4; i++) >> + if (a[i] != i+4) >> + abort (); >> + for (int i = 4; i < 8; i++) >> + if (a[i] != 0) >> + abort (); >> + return 0; >> +} >> + >> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using >> SLP" 1 "slp2" } } */ >> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } >> */ >> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c >> b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c >> new file mode 100644 >> index 0000000..06099fd >> --- /dev/null >> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c >> @@ -0,0 +1,42 @@ >> +/* { dg-require-effective-target vect_int } */ >> +/* PR tree-optimization/67682. */ >> + >> +#include "tree-vect.h" >> + >> +int __attribute__((__aligned__(8))) a[8]; >> +int __attribute__((__aligned__(8))) b[8]; >> + >> +__attribute__ ((noinline)) void >> +test () >> +{ >> + a[0] = b[0]; >> + a[1] = b[1] + 1; >> + a[2] = b[2] * 2; >> + a[3] = b[3] + 3; >> + >> + a[4] = b[4] + 4; >> + a[5] = b[5] + 4; >> + a[6] = b[6] + 4; >> + a[7] = b[7] + 4; >> +} >> + >> +int >> +main (int argc, char **argv) >> +{ >> + check_vect (); >> + >> + for (int i = 0; i < 8; i++) >> + b[i] = i + 1; >> + __asm__ volatile ("" : : : "memory"); >> + test (a, b); >> + __asm__ volatile ("" : : : "memory"); >> + if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7)) >> + abort (); >> + for (int i = 4; i < 8; i++) >> + if (a[i] != i + 5) >> + abort (); >> + return 0; >> +} >> + >> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using >> SLP" 1 "slp2" } } */ >> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } >> */ >> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c >> b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c >> new file mode 100644 >> index 0000000..13c51f3 >> --- /dev/null >> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c >> @@ -0,0 +1,41 @@ >> +/* { dg-require-effective-target vect_int } */ >> +/* PR tree-optimization/67682. */ >> + >> +#include "tree-vect.h" >> + >> +int __attribute__((__aligned__(8))) a[8]; >> +int __attribute__((__aligned__(8))) b[4]; >> + >> +__attribute__ ((noinline)) void >> +test () >> +{ >> + a[0] = b[2] + 1; >> + a[1] = b[0] + 2; >> + a[2] = b[1] + 3; >> + a[3] = b[1] + 4; >> + a[4] = b[3] * 3; >> + a[5] = b[0] * 4; >> + a[6] = b[2] * 5; >> + a[7] = b[1] * 7; >> +} >> + >> +int >> +main (int argc, char **argv) >> +{ >> + check_vect (); >> + >> + for (int i = 0; i < 8; i++) >> + a[i] = 1; >> + for (int i = 0; i < 4; i++) >> + b[i] = i + 4; >> + __asm__ volatile ("" : : : "memory"); >> + test (a, b); >> + __asm__ volatile ("" : : : "memory"); >> + if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9) >> + || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35)) >> + abort (); >> + return 0; >> +} >> + >> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using >> SLP" 1 "slp2" } } */ >> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } >> */ >> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c >> b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c >> new file mode 100644 >> index 0000000..6ae9a89 >> --- /dev/null >> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c >> @@ -0,0 +1,41 @@ >> +/* { dg-require-effective-target vect_int } */ >> +/* PR tree-optimization/67682. */ >> + >> +#include "tree-vect.h" >> + >> +int __attribute__((__aligned__(8))) a[8]; >> +int __attribute__((__aligned__(8))) b[8]; >> + >> +__attribute__ ((noinline)) void >> +test () >> +{ >> + a[0] = b[0] + 1; >> + a[1] = b[1] + 2; >> + a[2] = b[2] + 3; >> + a[3] = b[3] + 4; >> + a[4] = b[0] * 3; >> + a[5] = b[2] * 4; >> + a[6] = b[4] * 5; >> + a[7] = b[6] * 7; >> +} >> + >> +int >> +main (int argc, char **argv) >> +{ >> + check_vect (); >> + >> + for (int i = 0; i < 8; i++) >> + a[i] = 1; >> + for (int i = 0; i < 8; i++) >> + b[i] = i + 4; >> + __asm__ volatile ("" : : : "memory"); >> + test (a, b); >> + __asm__ volatile ("" : : : "memory"); >> + if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11) >> + || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70)) >> + abort (); >> + return 0; >> +} >> + >> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using >> SLP" 1 "slp2" } } */ >> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } >> */ >> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c >> index cfdfc29..05d4b35 100644 >> --- a/gcc/tree-vect-slp.c >> +++ b/gcc/tree-vect-slp.c >> @@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void >> *data) >> body_cost_vec.release (); >> } >> >> +/* Splits a group of stores, currently beginning at FIRST_STMT, into two >> groups: >> + one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing >> + the first GROUP1_SIZE stmts, since stores are consecutive), the second >> + containing the remainder. >> + Return the first stmt in the second group. */ >> + >> +static gimple * >> +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size) >> +{ >> + stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt); >> + gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt); >> + gcc_assert (group1_size > 0); >> + int group2_size = GROUP_SIZE (first_vinfo) - group1_size; >> + gcc_assert (group2_size > 0); >> + GROUP_SIZE (first_vinfo) = group1_size; >> + >> + gimple *stmt = first_stmt; >> + for (unsigned i = group1_size; i > 1; i--) >> + { >> + stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)); >> + gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1); >> + } >> + /* STMT is now the last element of the first group. */ >> + gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)); >> + GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0; >> + >> + GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size; >> + for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt >> (stmt))) >> + { >> + GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2; >> + gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1); >> + } >> + >> + /* For the second group, the GROUP_GAP is that before the original group, >> + plus skipping over the first vector. */ >> + GROUP_GAP (vinfo_for_stmt (group2)) = >> + GROUP_GAP (first_vinfo) + group1_size; >> + >> + /* GROUP_GAP of the first group now has to skip over the second group >> too. */ >> + GROUP_GAP (first_vinfo) += group2_size; >> + >> + if (dump_enabled_p ()) >> + dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and >> %d\n", >> + group1_size, group2_size); >> + >> + return group2; >> +} >> + >> /* Analyze an SLP instance starting from a group of grouped stores. Call >> vect_build_slp_tree to build a tree of packed stmts if possible. >> Return FALSE if it's impossible to SLP any stmt in the loop. */ >> @@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo, >> tree vectype, scalar_type = NULL_TREE; >> gimple *next; >> unsigned int vectorization_factor = 0; >> - int i; >> + unsigned int i; >> unsigned int max_nunits = 0; >> vec<slp_tree> loads; >> struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt)); >> @@ -1811,6 +1859,30 @@ vect_analyze_slp_instance (vec_info *vinfo, >> vect_free_slp_tree (node); >> loads.release (); >> >> + /* For basic block SLP, try to break the group up into multiples of the >> + vectorization factor. */ >> + if (is_a <bb_vec_info> (vinfo) >> + && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) >> + && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))) >> + { >> + /* We consider breaking the group only on VF boundaries from the >> existing >> + start. */ >> + for (i = 0; i < group_size; i++) >> + if (!matches[i]) break; >> + >> + if (i >= vectorization_factor && i < group_size) >> + { >> + /* Split into two groups at the first vector boundary before i. */ >> + gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == >> 0); > > Just noticing this... if we have a vectorization factor of 4 and matches > is 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 > and > 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail? > > Thus shouldn't we split either only if i % vectorization_factor is 0 or > if not, split "twice", dropping the intermediate surely non-matching > group of vectorization_factor size? After all if we split like with the > patch then the upper half will _not_ be splitted again with the > simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again). > > So I expect that the statistics will be the same if we restrict splitting > to the i % vectorization_factor == 0 case, or rather split where we do > now but only re-analyze group2 if i % vectorization_factor == 0 holds? > > Ok with that change. Improvements on that incrementally.
Btw, it just occurs to me that the whole thing is equivalent to splitting the store-group into vector-size pieces up-front? That way we do the maximum splitting up-frond and avoid any redundant work? The patch is still ok as said, just the above may be a simple thing to explore. Thanks, Richard. > Thanks, > Richard. > >> + i &= ~(vectorization_factor - 1); >> + gimple *grp2start = vect_split_slp_store_group (stmt, i); >> + return vect_analyze_slp_instance (vinfo, stmt, max_tree_size) >> + | vect_analyze_slp_instance (vinfo, grp2start, >> max_tree_size); >> + } >> + /* Even though the first vector did not all match, we might be able >> to SLP >> + (some) of the remainder. FORNOW ignore this possibility. */ >> + } >> + >> return false; >> } >> >> -- >> 1.9.1 >>