On Tue, 20 Aug 2024, Tamar Christina wrote:

> Hi,
>
> I've been working on a prototype of moving early break to SLP.
>
> As we've discussed on IRC I've decided to first try adding the gconds as roots
> and start SLP discovery using them as roots.
>
> This works great and doesn't require any changes to build_slp, it also has the
> additional benefit in that we can easily (as a follow up) add groups of
> gconds and then try to SLP the roots together if the operations are the same
> and then decompose the tree based on the roots if not.
>
> So it looks like using the roots is the best approach.  However I've hit some
> issues that I could solve, but would require me to modify large chunks of code
> and would like your input before I start.
>
> 1. roots are currently not analyzed or code-gened through vectorizable_*.
>    this is because it looks like only things used as roots so far are things
>    that all targets support (like constructors) or that will be lowered by
>    veclower later.  This is easy to fix I can work roots into the analysis
>    part in vect_slp_analyze_node_operations and pass enough information to
>    vectorize_slp_instance_root_stmt to be able to use
>    vectorizable_early_break.
>    I have a prototype of this currently working but it's a hack and need to do
>    it properly if it's the way you'd like to go.

There is currently no "explicit" separate analysis of the root, only
vect_slp_analyze_operations doing

                                             &cost_vec)
          /* CTOR instances require vectorized defs for the SLP tree root. */
          || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor
              && (SLP_TREE_DEF_TYPE (SLP_INSTANCE_TREE (instance))
                  != vect_internal_def
                  /* Make sure we vectorized with the expected type. */
                  || !useless_type_conversion_p
                        (TREE_TYPE (TREE_TYPE (gimple_assign_rhs1
                                              (instance->root_stmts[0]->stmt))),
                         TREE_TYPE (SLP_TREE_VECTYPE
                                            (SLP_INSTANCE_TREE (instance))))))
          /* Check we can vectorize the reduction. */
          || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc
              && !vectorizable_bb_reduc_epilogue (instance, &cost_vec)))

For the transform phase we do have vectorize_slp_instance_root_stmt
(called by vect_schedule_slp).  Neither really fits the vectorizable_*
API since what the root looks like really depends on the SLP instance
kind.  So the code above is where you'd hook in the required analysis,
adding a slp_inst_kind_early_break or so.  Factoring the analysis part
into a vectorizable_slp_instance_root () function would be an
improvement of course.

> 2.
consider the loop: > > #ifndef N > #define N 800 > #endif > unsigned vect_a[N]; > unsigned vect_b[N]; > > unsigned test4(unsigned x) > { > unsigned ret = 0; > for (int i = 0; i < N; i++) > { > vect_b[i] = x + i; > if (vect_a[i]*2 != x) > break; > vect_a[i] = x; > > } > return ret; > } > > The build part looks like: > > note: === vect_analyze_slp === > note: Analyzing vectorizable control flow: if (patt_6 != 0) > note: Starting SLP discovery for > note: patt_6 = _4 != x_9(D); > note: starting SLP discovery for node 0x5141280 > note: Build SLP for patt_6 = _4 != x_9(D); > note: precomputed vectype: vector(4) <signed-boolean:32> > note: nunits = 4 > note: vect_is_simple_use: operand x_9(D), type of def: external > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, > +INF] MASK 0xfffffffe VALUE 0x0 > _3 * 2, type of def: internal > note: starting SLP discovery for node 0x51413a0 > note: Build SLP for _4 = _3 * 2; > note: precomputed vectype: vector(4) unsigned int > note: nunits = 4 > note: vect_is_simple_use: operand # VUSE <.MEM_10> > vect_aD.4416[i_15], type of def: internal > note: vect_is_simple_use: operand 2, type of def: constant > note: vect_is_simple_use: operand # VUSE <.MEM_10> > vect_aD.4416[i_15], type of def: internal > note: vect_is_simple_use: operand 2, type of def: constant > note: starting SLP discovery for node 0x5141430 > note: Build SLP for _3 = vect_a[i_15]; > note: precomputed vectype: vector(4) unsigned int > note: nunits = 4 > note: SLP discovery for node 0x5141430 succeeded > note: SLP discovery for node 0x51413a0 succeeded > note: SLP discovery for node 0x5141280 succeeded > note: SLP size 3 vs. limit 10. 
> note: Final SLP tree for instance 0x5208e30: > note: node 0x5141280 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32> > note: op template: patt_6 = _4 != x_9(D); > note: stmt 0 patt_6 = _4 != x_9(D); > note: children 0x5141310 0x51413a0 > note: node (external) 0x5141310 (max_nunits=1, refcnt=1) > note: { x_9(D) } > note: node 0x51413a0 (max_nunits=4, refcnt=2) vector(4) unsigned int > note: op template: _4 = _3 * 2; > note: stmt 0 _4 = _3 * 2; > note: children 0x5141430 0x51414c0 > note: node 0x5141430 (max_nunits=4, refcnt=2) vector(4) unsigned int > note: op template: _3 = vect_a[i_15]; > note: stmt 0 _3 = vect_a[i_15]; > note: load permutation { 0 } > note: node (constant) 0x51414c0 (max_nunits=1, refcnt=1) > note: { 2 } > > and codegen: > > note: ------>vectorizing statement: patt_6 = _4 != x_9(D); > note: transform statement. > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, > +INF] MASK 0xfffffffe VALUE 0x0 > _3 * 2, type of def: internal > note: vect_is_simple_use: vectype vector(4) unsigned int > note: vect_is_simple_use: operand x_9(D), type of def: external > note: vect_get_vec_defs_for_operand: _4 > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, > +INF] MASK 0xfffffffe VALUE 0x0 > _3 * 2, type of def: internal > note: def_stmt = _4 = _3 * 2; > note: vect_get_vec_defs_for_operand: x_9(D) > note: vect_is_simple_use: operand x_9(D), type of def: external > note: created new init_stmt: vect_cst__72 = {x_9(D), x_9(D), x_9(D), x_9(D)}; > note: add new stmt: mask_patt_6.25_73 = vect__4.24_71 != vect_cst__72; > note: ------>vectorizing statement: if (patt_6 != 0) > note: transform statement. > note: === vectorizable_early_exit === > note: vect_is_simple_use: operand _4 != x_9(D), type of def: internal > note: vect_is_simple_use: vectype vector(4) <signed-boolean:32> > note: transform early-exit. 
> note: vect_is_simple_use: operand _4 != x_9(D), type of def: internal
> note: vect_is_simple_use: vectype vector(4) <signed-boolean:32>
> note: vect_is_simple_use: operand 0, type of def: constant
> note: vect_get_vec_defs_for_operand: patt_6
> note: vect_is_simple_use: operand _4 != x_9(D), type of def: internal
> note:   def_stmt =  patt_6 = _4 != x_9(D);
> note: vect_get_vec_defs_for_operand: 0
> note: vect_is_simple_use: operand 0, type of def: constant
> note: created new init_stmt: vect_cst__74 = { 0, 0, 0, 0 };
> note: add new stmt: cmp_75 = mask_patt_6.25_73 ^ vect_cst__74;
>
> So far so good.
>
> However, things go wrong during SLP vect_detect_hybrid_slp analysis
>
> note: === vect_update_vf_for_slp ===
> note: Loop contains SLP and non-SLP stmts
> note: Updating vectorization factor to 4.
> note: vectorization_factor = 4, niters = 800
>
> This has a couple of reasons:
>
> 1. The stores are non-grouped stores and so are never considered for SLP.

Yeah, that's an unmerged part of the all-SLP migration (I _think_ I have
posted a patch to do this).

> Now I've temporarily worked around this by doing during vect_analyze_slp:
>
>   /* Find SLP sequences starting from non-grouped stores. */
>   for (auto dr : LOOP_VINFO_DATAREFS (vinfo))
>     if (DR_IS_WRITE (dr))
>       {
>         stmt_vec_info dr_info = vinfo->lookup_stmt (DR_STMT (dr));
>         if (!dr_info)
>           continue;
>
>         vect_analyze_slp_instance (vinfo, bst_map, dr_info,
>                                    slp_inst_kind_store, max_tree_size,
>                                    &limit);
>       }
>
> So it follows single lane stores.  But I'm not sure I understand why this is
> needed.  I thought that your earlier work to transition to SLP only would have
> already covered single stream stores.

Nope, only single-stream interleaved stores (single element interleaving).
I've refrained from adding the "rest" yet (but it will look similar to
what you do above).

> The above works, but I am unsure if that's the best solution, or if I'm
> missing something.

Just bad timing ;)  I keep being distracted from working on the
remaining bits for all-SLP.

> 2. The second part that goes wrong is that due to the same IV being used by
>    the early exit and the main exit, the main exit is now pulled into
>    analysis:
>
> note: === vect_detect_hybrid_slp ===
> note: Processing hybrid candidate : ivtmp_14 = ivtmp_7 - 1;
> note: Found loop_vect use: if (ivtmp_14 != 0)
> note: Processing hybrid candidate : i_12 = i_15 + 1;
> note: Marked SLP consumed stmt pure: i_12 = i_15 + 1;
> note: Processing hybrid candidate : ivtmp_7 = PHI <ivtmp_14(6), 800(2)>
> note: Found loop_vect use: ivtmp_14 = ivtmp_7 - 1;
> note: Processing hybrid candidate : if (patt_6 != 0)
> note: Found loop_vect sink: if (patt_6 != 0)
> note: marking hybrid: patt_6 = _4 != x_9(D);
> note: marking hybrid: _4 = _3 * 2;
> note: marking hybrid: _3 = vect_a[i_15];
> note: marking hybrid: i_15 = PHI <i_12(6), 0(2)>
> note: marking hybrid: i_12 = i_15 + 1;
>
> Is the solution here that I treat LOOP_VINFO_IV_EXIT as a sink as well, and
> forcibly ignore it?
>
> I think this would match what the analysis code later does:
>
> note: ==> examining statement: if (ivtmp_14 != 0)
> note: irrelevant.
>
> This is the part I'm having the most trouble with.  Today I believe we never
> analyse the main loop exit because nothing pulls it into the analysis.

Probably ivcanon ensures the IV is in its own isolated use-def cycle,
otherwise I don't see how we'd run into this for example when we have
a vectorizable induction based on the same IV and stored into a
SLP memory group?

From reading both above, hybrid detection should eventually ignore
!STMT_VINFO_RELEVANT loop_vect uses ... (luckily hybrid detection will
go away when we're only-SLP).

> 3. I believe I also need to analyse roots during VF, i.e.
>    vect_determine_vectorization_factor shows:
>
> note: ==> examining statement: if (_4 != x_9(D))
> note: skip.
> note: ==> examining pattern def stmt: patt_17 = _4 != x_9(D);
> note: precomputed vectype: vector(2) <signed-boolean:32>
> note: nunits = 2
>
> which does not seem right.

Why's that not right?

For reference below is what I have in my dev tree for the non-grouped
store SLP.

Thanks,
Richard.

From 6fea9f34bd218437fc2d08da38f3883cac59947e Mon Sep 17 00:00:00 2001
From: Richard Biener <rguent...@suse.de>
Date: Fri, 29 Sep 2023 12:54:17 +0200
Subject: [PATCH] Handle non-grouped stores as single-lane SLP
To: gcc-patches@gcc.gnu.org

The following enables single-lane loop SLP discovery for non-grouped
stores and adjusts vectorizable_store to properly handle those.

For gfortran.dg/vect/vect-8.f90 we vectorize one additional loop,
not running into the "not falling back to strided accesses" bail-out.
I have not investigated in detail.  Similar for gcc.dg/vect/slp-19c.c.

The gcc.dg/vect/O3-pr39675-2.c and gcc.dg/vect/slp-19[abc].c SLPs
depend on the load permute lowering as the single-lane store we now
want to handle is fed from a single lane from groups of size four.
I've updated the expected number of SLPs but they FAIL.

For gfortran.dg/vect/fast-math-mgrid-resid.f predictive commoning
now unrolls the loop, the vectorization factor is the same.  I think
association during SLP build might be the reason for the difference.

There is a set of i386 target assembler test FAILs,
gcc.target/i386/pr88531-2[bc].c in particular fail because the
target cannot identify SLP emulated gathers, see another mail from me.
Others need adjustment, I've adjusted one with this patch only.

	* tree-vect-slp.cc (vect_analyze_slp): Perform single-lane
	loop SLP discovery for non-grouped stores.
	* tree-vect-stmts.cc (vectorizable_store): Always set
	vec_num for SLP.

	* gcc.dg/vect/O3-pr39675-2.c: Adjust expected number of SLP.
	* gcc.dg/vect/fast-math-vect-call-1.c: Likewise.
	* gcc.dg/vect/no-scevccp-slp-31.c: Likewise.
	* gcc.dg/vect/slp-12b.c: Likewise.
	* gcc.dg/vect/slp-12c.c: Likewise.
* gcc.dg/vect/slp-19a.c: Likewise. * gcc.dg/vect/slp-19b.c: Likewise. * gcc.dg/vect/slp-19c.c: Likewise. * gcc.dg/vect/slp-4-big-array.c: Likewise. * gcc.dg/vect/slp-4.c: Likewise. * gcc.dg/vect/slp-5.c: Likewise. * gcc.dg/vect/slp-7.c: Likewise. * gcc.dg/vect/slp-perm-7.c: Likewise. * gcc.dg/vect/slp-37.c: Likewise. * gcc.dg/vect/vect-outer-slp-3.c: Disable vectorization of initialization loop. * gcc.dg/vect/slp-reduc-5.c: Likewise. * gcc.dg/vect/no-scevccp-outer-12.c: Un-XFAIL. SLP can handle inner loop inductions with multiple vector stmt copies. * gfortran.dg/vect/vect-8.f90: Adjust expected number of vectorized loops. * gfortran.dg/vect/fast-math-mgrid-resid.f: Expect predictive commoning with unrolling. * gcc.target/i386/vectorize1.c: Adjust what we scan for. --- gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c | 2 +- .../gcc.dg/vect/fast-math-vect-call-1.c | 2 +- .../gcc.dg/vect/no-scevccp-outer-12.c | 3 +-- gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c | 5 ++-- gcc/testsuite/gcc.dg/vect/slp-12b.c | 2 +- gcc/testsuite/gcc.dg/vect/slp-12c.c | 2 +- gcc/testsuite/gcc.dg/vect/slp-19a.c | 2 +- gcc/testsuite/gcc.dg/vect/slp-19b.c | 2 +- gcc/testsuite/gcc.dg/vect/slp-19c.c | 4 ++-- gcc/testsuite/gcc.dg/vect/slp-37.c | 2 +- gcc/testsuite/gcc.dg/vect/slp-4-big-array.c | 2 +- gcc/testsuite/gcc.dg/vect/slp-4.c | 2 +- gcc/testsuite/gcc.dg/vect/slp-5.c | 2 +- gcc/testsuite/gcc.dg/vect/slp-7.c | 4 ++-- gcc/testsuite/gcc.dg/vect/slp-perm-7.c | 4 ++-- gcc/testsuite/gcc.dg/vect/slp-reduc-5.c | 3 ++- gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c | 1 + gcc/testsuite/gcc.target/i386/vectorize1.c | 4 ++-- .../gfortran.dg/vect/fast-math-mgrid-resid.f | 2 +- gcc/testsuite/gfortran.dg/vect/vect-8.f90 | 2 +- gcc/tree-vect-slp.cc | 23 +++++++++++++++++++ gcc/tree-vect-stmts.cc | 11 +++++---- 22 files changed, 57 insertions(+), 29 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c b/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c index c3f0f6dc1be..ddaac56cc0b 100644 --- 
a/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c +++ b/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c @@ -27,5 +27,5 @@ foo () } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_strided4 } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_strided4 } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_strided4 } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c b/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c index ad22f6e82b3..6c9b7c37b6e 100644 --- a/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c +++ b/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c @@ -101,4 +101,4 @@ main () } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 4 "vect" { target { vect_call_copysignf && vect_call_sqrtf } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target { { vect_call_copysignf && vect_call_sqrtf } && vect_perm3_int } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target { { vect_call_copysignf && vect_call_sqrtf } && vect_perm3_int } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c index c2d3031bc0c..6ace6ad022e 100644 --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c @@ -46,5 +46,4 @@ int main (void) return 0; } -/* Until we support multiple types in the inner loop */ -/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { xfail { ! { aarch64*-*-* riscv*-*-* } } } } } */ +/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 
1 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c index 22817a57ef8..f6ac5f60298 100644 --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c @@ -53,6 +53,7 @@ int main (void) return 0; } +/* We cannot handle grouped accesses in outer loops. */ +/* { dg-final { scan-tree-dump-not "OUTER LOOP VECTORIZED" "vect" } } */ /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" } } */ - +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-12b.c b/gcc/testsuite/gcc.dg/vect/slp-12b.c index e2ea24d6c53..8e06e3bfa93 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-12b.c +++ b/gcc/testsuite/gcc.dg/vect/slp-12b.c @@ -47,6 +47,6 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided2 && vect_int_mult } } } } */ /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided2 && vect_int_mult } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_strided2 && vect_int_mult } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { vect_strided2 && vect_int_mult } } } } */ /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! { vect_strided2 && vect_int_mult } } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-12c.c b/gcc/testsuite/gcc.dg/vect/slp-12c.c index 9c48dff3bf4..a3536e3053b 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-12c.c +++ b/gcc/testsuite/gcc.dg/vect/slp-12c.c @@ -49,5 +49,5 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_int_mult } } } } */ /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! 
vect_int_mult } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_int_mult } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_int_mult } } } */ /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! vect_int_mult } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-19a.c b/gcc/testsuite/gcc.dg/vect/slp-19a.c index ca7a0a8e456..6c21416046d 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-19a.c +++ b/gcc/testsuite/gcc.dg/vect/slp-19a.c @@ -57,5 +57,5 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_strided8 } } } */ /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! vect_strided8 } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_strided8 } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_strided8 } } } */ /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! vect_strided8} } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-19b.c b/gcc/testsuite/gcc.dg/vect/slp-19b.c index 4d53ac698db..10b84aab3b5 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-19b.c +++ b/gcc/testsuite/gcc.dg/vect/slp-19b.c @@ -54,5 +54,5 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_strided4 } } } */ /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! vect_strided4 } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_strided4 } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_strided4 } } } */ /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! 
vect_strided4 } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-19c.c b/gcc/testsuite/gcc.dg/vect/slp-19c.c index 188ab37a0b6..84869cadc89 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-19c.c +++ b/gcc/testsuite/gcc.dg/vect/slp-19c.c @@ -105,5 +105,5 @@ int main (void) return 0; } -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-37.c b/gcc/testsuite/gcc.dg/vect/slp-37.c index caee2bb508f..8a430e63847 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-37.c +++ b/gcc/testsuite/gcc.dg/vect/slp-37.c @@ -60,4 +60,4 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_hw_misalign } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_hw_misalign } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_hw_misalign } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c b/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c index fcda45ff368..f738a613324 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c +++ b/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c @@ -131,5 +131,5 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 6 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-4.c b/gcc/testsuite/gcc.dg/vect/slp-4.c index 29e741df02b..1ecad7415ef 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-4.c +++ b/gcc/testsuite/gcc.dg/vect/slp-4.c @@ -125,5 +125,5 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" } } 
*/ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 6 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-5.c b/gcc/testsuite/gcc.dg/vect/slp-5.c index 6d51f6a7323..484898c2afd 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-5.c +++ b/gcc/testsuite/gcc.dg/vect/slp-5.c @@ -124,5 +124,5 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 5 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-7.c b/gcc/testsuite/gcc.dg/vect/slp-7.c index 2845a99dedf..f83fdc96d16 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-7.c +++ b/gcc/testsuite/gcc.dg/vect/slp-7.c @@ -125,6 +125,6 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target vect_short_mult } } }*/ /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { ! { vect_short_mult } } } } }*/ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_short_mult } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { ! { vect_short_mult } } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 5 "vect" { target vect_short_mult } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target { ! 
{ vect_short_mult } } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-7.c b/gcc/testsuite/gcc.dg/vect/slp-perm-7.c index df13c37bc75..c3d903e5b11 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-perm-7.c +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-7.c @@ -97,8 +97,8 @@ int main (int argc, const char* argv[]) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_perm } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_perm3_int && { ! vect_load_lanes } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { vect_perm3_int && { ! vect_load_lanes } } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_load_lanes } } } */ /* { dg-final { scan-tree-dump "Built SLP cancelled: can use load/store-lanes" "vect" { target { vect_perm3_int && vect_load_lanes } } } } */ /* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target vect_load_lanes } } } */ /* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target vect_load_lanes } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c b/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c index 11f5a7414cf..0cde79d9e49 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c +++ b/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c @@ -36,6 +36,7 @@ int main (void) check_vect (); +#pragma GCC novector for (i = 0; i < N; i++) c[i] = (i+3) * -1; @@ -44,6 +45,6 @@ int main (void) return 0; } -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { xfail vect_no_int_min_max } } } */ +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_min_max } } } */ /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_min_max } } } */ /* { dg-final { scan-tree-dump-times "VEC_PERM_EXPR" 0 "vect" } } */ 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c index 3dce51426b5..d315db5632b 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c +++ b/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c @@ -30,6 +30,7 @@ int main () { check_vect (); +#pragma GCC novector for (int i = 0; i < 40; ++i) image[i] = 1.; diff --git a/gcc/testsuite/gcc.target/i386/vectorize1.c b/gcc/testsuite/gcc.target/i386/vectorize1.c index f3b9bfba382..14a8c5f28b3 100644 --- a/gcc/testsuite/gcc.target/i386/vectorize1.c +++ b/gcc/testsuite/gcc.target/i386/vectorize1.c @@ -1,6 +1,6 @@ /* PR middle-end/28915 */ /* { dg-do compile } */ -/* { dg-options "-msse -O2 -ftree-vectorize -fdump-tree-vect" } */ +/* { dg-options "-msse -O2 -ftree-vectorize -fdump-tree-vect-optimized" } */ extern char lanip[3][40]; typedef struct @@ -17,4 +17,4 @@ int set_names (void) tt1.t[ln] = lanip[1]; } -/* { dg-final { scan-tree-dump "vect_cst" "vect" } } */ +/* { dg-final { scan-tree-dump "optimized: loop vectorized" "vect" } } */ diff --git a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f index 2e548748296..9dda5087551 100644 --- a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f +++ b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f @@ -43,5 +43,5 @@ C ! vectorized loop. If vector factor is 2, the vectorized loop can ! be predictive commoned, we check if predictive commoning PHI node ! is created with vector(2) type. -! { dg-final { scan-tree-dump "Executing predictive commoning without unrolling" "pcom" { xfail vect_variable_length } } } +! { dg-final { scan-tree-dump "Unrolling 2 times" "pcom" { xfail vect_variable_length } } } ! 
{ dg-final { scan-tree-dump "vectp_u.*__lsm.* = PHI <.*vectp_u.*__lsm" "pcom" { xfail vect_variable_length } } }
diff --git a/gcc/testsuite/gfortran.dg/vect/vect-8.f90 b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
index f77ec9fb87a..283c36e0ebe 100644
--- a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
+++ b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
@@ -708,5 +708,5 @@ END SUBROUTINE kernel
 
 ! { dg-final { scan-tree-dump-times "vectorized 2\[56\] loops" 1 "vect" { target aarch64_sve } } }
 ! { dg-final { scan-tree-dump-times "vectorized 2\[45\] loops" 1 "vect" { target { aarch64*-*-* && { ! aarch64_sve } } } } }
-! { dg-final { scan-tree-dump-times "vectorized 2\[234\] loops" 1 "vect" { target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
+! { dg-final { scan-tree-dump-times "vectorized 2\[345\] loops" 1 "vect" { target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
 ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { { ! vect_intdouble_cvt } && { ! aarch64*-*-* } } } } }
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 5c8b1beda38..f40a530c183 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -4335,6 +4335,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 opt_result
 vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 {
+  loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
   unsigned int i;
   stmt_vec_info first_element;
   slp_instance instance;
@@ -4351,6 +4352,28 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
         vect_analyze_slp_instance (vinfo, bst_map, first_element,
                                    slp_inst_kind_store, max_tree_size, &limit);
 
+  /* For loops also start SLP discovery from non-grouped stores. */
+  if (loop_vinfo)
+    {
+      data_reference_p dr;
+      FOR_EACH_VEC_ELT (vinfo->shared->datarefs, i, dr)
+        if (DR_IS_WRITE (dr))
+          {
+            stmt_vec_info stmt_info = vinfo->lookup_dr (dr)->stmt;
+            /* Grouped stores are already handled above. */
+            if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+              continue;
+            vec<stmt_vec_info> stmts;
+            vec<stmt_vec_info> roots = vNULL;
+            vec<tree> remain = vNULL;
+            stmts.create (1);
+            stmts.quick_push (stmt_info);
+            vect_build_slp_instance (vinfo, slp_inst_kind_store,
+                                     stmts, roots, remain, max_tree_size,
+                                     &limit, bst_map, NULL);
+          }
+    }
+
   if (bb_vec_info bb_vinfo = dyn_cast <bb_vec_info> (vinfo))
     {
       for (unsigned i = 0; i < bb_vinfo->roots.length (); ++i)
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 153348000b2..c743a77f946 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8334,10 +8334,12 @@ vectorizable_store (vec_info *vinfo,
       return vectorizable_scan_store (vinfo, stmt_info, gsi, vec_stmt, ncopies);
     }
 
-  if (grouped_store)
+  if (grouped_store || slp)
     {
       /* FORNOW */
-      gcc_assert (!loop || !nested_in_vect_loop_p (loop, stmt_info));
+      gcc_assert (!grouped_store
+                  || !loop
+                  || !nested_in_vect_loop_p (loop, stmt_info));
 
       if (slp)
         {
@@ -8346,8 +8348,9 @@ vectorizable_store (vec_info *vinfo,
              group. */
           vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
           first_stmt_info = SLP_TREE_SCALAR_STMTS (slp_node)[0];
-          gcc_assert (DR_GROUP_FIRST_ELEMENT (first_stmt_info)
-                      == first_stmt_info);
+          gcc_assert (!STMT_VINFO_GROUPED_ACCESS (first_stmt_info)
+                      || (DR_GROUP_FIRST_ELEMENT (first_stmt_info)
+                          == first_stmt_info));
           first_dr_info = STMT_VINFO_DR_INFO (first_stmt_info);
           op = vect_get_store_rhs (first_stmt_info);
         }
-- 
2.43.0
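
For readers less familiar with the terminology used in the patch, the
following is a minimal sketch (not part of the patch or the testsuite)
contrasting a non-grouped store with a grouped one.  All names here
(N, single, grouped, store_single, store_grouped, check_stores) are
made up for illustration.  The first loop has one store per iteration
that belongs to no interleaving group, which is the case the patch
starts single-lane SLP discovery from; the second loop has two adjacent
stores per iteration, which would normally form a store group of size
two.  Whether either loop is actually vectorized, and with how many SLP
instances, depends on the target and on the GCC revision in use.

```c
#include <assert.h>

#define N 16

unsigned single[N];
unsigned grouped[2 * N];

/* Non-grouped store: a single store per iteration that is not part of
   an interleaving group (hypothetical example).  */
void
store_single (unsigned x)
{
  for (int i = 0; i < N; i++)
    single[i] = x + i;
}

/* Grouped store: two stores to adjacent elements per iteration, the
   kind of access that store-group analysis would normally pick up
   (hypothetical example).  */
void
store_grouped (unsigned x)
{
  for (int i = 0; i < N; i++)
    {
      grouped[2 * i] = x + i;
      grouped[2 * i + 1] = x * 2 + i;
    }
}

/* Scalar sanity check of both loops; it says nothing about whether
   they were vectorized.  */
void
check_stores (void)
{
  store_single (3);
  store_grouped (3);
  assert (single[5] == 8);
  assert (grouped[10] == 8);
  assert (grouped[11] == 11);
}
```

To try it, call check_stores () from a main function and compile with
something like -O3 -fdump-tree-vect-details; the "vectorizing stmts
using SLP" lines in the vect dump are what the adjusted testcases above
scan for.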