Re: [RFC] Support single lane SLP early break

Richard Biener Wed, 21 Aug 2024 04:11:56 -0700

On Tue, 20 Aug 2024, Tamar Christina wrote:

> Hi,
> 
> I've been working on a prototype of moving early break to SLP.
> 
> As we've discussed on IRC I've decided to first try adding the gconds as roots
> and start SLP discovery using them as roots.
> 
> This works great and doesn't require any changes to build_slp, it also has the
> additional benefit in that we can easily (as a follow up) add groups of
> gconds and then try to SLP the roots together if the operations are the same
> and then decompose the tree based on the roots if not.
> 
> So it looks like using the roots is the best approach. However I've hit some
> issues that I could solve, but would require me to modify large chunks of code
> and would like your input before I start.
> 
> 1. roots are currently not analyzed or code-gened through vectorizable_*.
>    this is because it looks like only things used as roots so far are things
>    that all targets support (like constructors) or that will be lowered by
>    veclower later.  This is easy to fix: I can work roots into the analysis
>    part in vect_slp_analyze_node_operations and pass enough information to
>    vectorize_slp_instance_root_stmt to be able to use
>    vectorizable_early_break.
>    I have a prototype of this currently working but it's a hack and I need to do
>    it properly if it's the way you'd like to go.


There is currently no "explicit" separate analysis of the root but only
vect_slp_analyze_operations doing

                                             &cost_vec)
          /* CTOR instances require vectorized defs for the SLP tree root.  */
          || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor
              && (SLP_TREE_DEF_TYPE (SLP_INSTANCE_TREE (instance))
                  != vect_internal_def
                  /* Make sure we vectorized with the expected type.  */
                  || !useless_type_conversion_p
                        (TREE_TYPE (TREE_TYPE (gimple_assign_rhs1
                                              (instance->root_stmts[0]->stmt))),
                         TREE_TYPE (SLP_TREE_VECTYPE
                                            (SLP_INSTANCE_TREE (instance))))))
          /* Check we can vectorize the reduction.  */
          || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc
              && !vectorizable_bb_reduc_epilogue (instance, &cost_vec)))

For the transform phase we do have vectorize_slp_instance_root_stmt
(called by vect_schedule_slp).  Neither really fits the
vectorizable_* API since what the root looks like really depends on
the SLP instance kind.

So it would be above where you'd hook in the required code, adding
a slp_inst_kind_early_break or so.  Factoring the analysis part
into a vectorizable_slp_instance_root () function would be an
improvement of course.
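
A rough sketch of what such a factoring could look like, using toy stand-in
types rather than GCC's real ones.  Everything named toy_* and the
early_break kind are hypothetical; the boolean fields merely stand in for
the checks quoted above and for whatever early-break specific check ends up
being needed:

```cpp
#include <cassert>

// Toy model only: none of these types or names exist in GCC.
enum class toy_inst_kind { store, ctor, bb_reduc, early_break /* hypothetical */ };

struct toy_instance {
  toy_inst_kind kind;
  bool root_def_internal;  // stands in for SLP_TREE_DEF_TYPE (...) == vect_internal_def
  bool root_type_matches;  // stands in for the useless_type_conversion_p check
  bool reduc_epilogue_ok;  // stands in for vectorizable_bb_reduc_epilogue
  bool early_break_ok;     // stands in for an early-break check (assumed)
};

// One place that decides, per instance kind, whether the root can be vectorized.
bool toy_vectorizable_slp_instance_root(const toy_instance &inst) {
  switch (inst.kind) {
    case toy_inst_kind::ctor:
      return inst.root_def_internal && inst.root_type_matches;
    case toy_inst_kind::bb_reduc:
      return inst.reduc_epilogue_ok;
    case toy_inst_kind::early_break:
      return inst.early_break_ok;
    case toy_inst_kind::store:
      return true;  // no separate root to analyze in this toy model
  }
  assert(!"unhandled instance kind");
  return false;
}
```

The point of the switch is only that each instance kind keeps its own notion
of what a valid root is, while callers see a single entry point.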

> 2.  consider the loop:
> 
> #ifndef N
> #define N 800
> #endif
> unsigned vect_a[N];
> unsigned vect_b[N];
> 
> unsigned test4(unsigned x)
> {
>  unsigned ret = 0;
>  for (int i = 0; i < N; i++)
>  {
>    vect_b[i] = x + i;
>    if (vect_a[i]*2 != x)
>      break;
>    vect_a[i] = x;
> 
>  }
>  return ret;
> }
> 
> The build part looks like:
> 
> note:   === vect_analyze_slp ===
> note:   Analyzing vectorizable control flow: if (patt_6 != 0)
> note:   Starting SLP discovery for
> note:     patt_6 = _4 != x_9(D);
> note:   starting SLP discovery for node 0x5141280
> note:   Build SLP for patt_6 = _4 != x_9(D);
> note:   precomputed vectype: vector(4) <signed-boolean:32>
> note:   nunits = 4
> note:   vect_is_simple_use: operand x_9(D), type of def: external
> note:   vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0
> _3 * 2, type of def: internal
> note:   starting SLP discovery for node 0x51413a0
> note:   Build SLP for _4 = _3 * 2;
> note:   precomputed vectype: vector(4) unsigned int
> note:   nunits = 4
> note:   vect_is_simple_use: operand # VUSE <.MEM_10>
> vect_aD.4416[i_15], type of def: internal
> note:   vect_is_simple_use: operand 2, type of def: constant
> note:   vect_is_simple_use: operand # VUSE <.MEM_10>
> vect_aD.4416[i_15], type of def: internal
> note:   vect_is_simple_use: operand 2, type of def: constant
> note:   starting SLP discovery for node 0x5141430
> note:   Build SLP for _3 = vect_a[i_15];
> note:   precomputed vectype: vector(4) unsigned int
> note:   nunits = 4
> note:   SLP discovery for node 0x5141430 succeeded
> note:   SLP discovery for node 0x51413a0 succeeded
> note:   SLP discovery for node 0x5141280 succeeded
> note:   SLP size 3 vs. limit 10.
> note:   Final SLP tree for instance 0x5208e30:
> note:   node 0x5141280 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32>
> note:   op template: patt_6 = _4 != x_9(D);
> note:      stmt 0 patt_6 = _4 != x_9(D);
> note:      children 0x5141310 0x51413a0
> note:   node (external) 0x5141310 (max_nunits=1, refcnt=1)
> note:      { x_9(D) }
> note:   node 0x51413a0 (max_nunits=4, refcnt=2) vector(4) unsigned int
> note:   op template: _4 = _3 * 2;
> note:      stmt 0 _4 = _3 * 2;
> note:      children 0x5141430 0x51414c0
> note:   node 0x5141430 (max_nunits=4, refcnt=2) vector(4) unsigned int
> note:   op template: _3 = vect_a[i_15];
> note:      stmt 0 _3 = vect_a[i_15];
> note:      load permutation { 0 }
> note:   node (constant) 0x51414c0 (max_nunits=1, refcnt=1)
> note:      { 2 }
> 
> and codegen:
> 
> note:  ------>vectorizing statement: patt_6 = _4 != x_9(D);
> note:  transform statement.
> note:  vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0
>        _3 * 2, type of def: internal
> note:  vect_is_simple_use: vectype vector(4) unsigned int
> note:  vect_is_simple_use: operand x_9(D), type of def: external
> note:  vect_get_vec_defs_for_operand: _4
> note:  vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0
>        _3 * 2, type of def: internal
> note:    def_stmt =  _4 = _3 * 2;
> note:  vect_get_vec_defs_for_operand: x_9(D)
> note:  vect_is_simple_use: operand x_9(D), type of def: external
> note:  created new init_stmt: vect_cst__72 = {x_9(D), x_9(D), x_9(D), x_9(D)};
> note:  add new stmt: mask_patt_6.25_73 = vect__4.24_71 != vect_cst__72;
> note:  ------>vectorizing statement: if (patt_6 != 0)
> note:  transform statement.
> note:   === vectorizable_early_exit ===
> note:   vect_is_simple_use: operand _4 != x_9(D), type of def: internal
> note:   vect_is_simple_use: vectype vector(4) <signed-boolean:32>
> note:   transform early-exit.
> note:   vect_is_simple_use: operand _4 != x_9(D), type of def: internal
> note:   vect_is_simple_use: vectype vector(4) <signed-boolean:32>
> note:   vect_is_simple_use: operand 0, type of def: constant
> note:   vect_get_vec_defs_for_operand: patt_6
> note:   vect_is_simple_use: operand _4 != x_9(D), type of def: internal
> note:     def_stmt =  patt_6 = _4 != x_9(D);
> note:   vect_get_vec_defs_for_operand: 0
> note:   vect_is_simple_use: operand 0, type of def: constant
> note:   created new init_stmt: vect_cst__74 = { 0, 0, 0, 0 };
> note:   add new stmt: cmp_75 = mask_patt_6.25_73 ^ vect_cst__74;
> 
> So far so good.
> 
> However, things go wrong during SLP vect_detect_hybrid_slp analysis
> 
> note:   === vect_update_vf_for_slp ===
> note:   Loop contains SLP and non-SLP stmts
> note:   Updating vectorization factor to 4.
> note:  vectorization_factor = 4, niters = 800
> 
> This has a couple of reasons:
> 
> 1. The stores are non-grouped stores and so are never considered for SLP.

Yeah, that's an unmerged part of the all-SLP migration (I _think_ I have
posted a patch to do this).

> Now I've temporarily worked around this by doing during vect_analyze_slp:
> 
> /* Find SLP sequences starting from non-grouped stores.  */
> for (auto dr : LOOP_VINFO_DATAREFS (vinfo))
>       if (DR_IS_WRITE (dr))
>         {
>           stmt_vec_info dr_info = vinfo->lookup_stmt (DR_STMT (dr));
>           if (!dr_info)
>             continue;
> 
>           vect_analyze_slp_instance (vinfo, bst_map, dr_info,
>                                      slp_inst_kind_store, max_tree_size,
>                                      &limit);
>         }
> 
> So it follows single lane stores.  But I'm not sure I understand why this is
> needed.  I thought that your earlier work to transition to SLP only would have
> already covered single stream stores.

Nope, only single-stream interleaved stores (single element interleaving).

I've refrained from adding the "rest" yet (but it will look similar to
what you do above).
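
As a toy illustration of the seeding step only (not GCC code): walk the data
references and start one single-lane SLP seed per store that is not part of
a group.  The toy_dataref type and toy_single_lane_store_seeds are made up
for this sketch; the flags stand in for DR_IS_WRITE and
STMT_VINFO_GROUPED_ACCESS:

```cpp
#include <cassert>
#include <vector>

// Toy model only: these are not GCC's data_reference or stmt_vec_info.
struct toy_dataref {
  int stmt_id;    // identifies the statement performing the access
  bool is_write;  // stands in for DR_IS_WRITE
  bool grouped;   // stands in for STMT_VINFO_GROUPED_ACCESS
};

// Return one single-lane seed (a one-element statement list) per
// non-grouped store.
std::vector<std::vector<int>>
toy_single_lane_store_seeds(const std::vector<toy_dataref> &drs) {
  std::vector<std::vector<int>> seeds;
  for (const toy_dataref &dr : drs) {
    assert(dr.stmt_id >= 0);     // statement ids are non-negative in this model
    if (!dr.is_write) continue;  // loads are not seeds here
    if (dr.grouped) continue;    // grouped stores are assumed handled elsewhere
    seeds.push_back({dr.stmt_id});
  }
  return seeds;
}
```

For the test4 loop quoted earlier this would yield two seeds, one per store
(vect_b[i] and vect_a[i]), and none for the load.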

> The above works, but I am unsure if that's the best solution, or if I'm missing
> something.

Just bad timing ;)  I keep being distracted from working on the
remaining bits for all-SLP.

> 2. The second part that goes wrong is that due to the same IV being used by
>     the early exit and the main exit, the main exit is now pulled into analysis:
> 
> note:   === vect_detect_hybrid_slp ===
> note:   Processing hybrid candidate : ivtmp_14 = ivtmp_7 - 1;
> note:   Found loop_vect use: if (ivtmp_14 != 0)
> note:   Processing hybrid candidate : i_12 = i_15 + 1;
> note:   Marked SLP consumed stmt pure: i_12 = i_15 + 1;
> note:   Processing hybrid candidate : ivtmp_7 = PHI <ivtmp_14(6), 800(2)>
> note:   Found loop_vect use: ivtmp_14 = ivtmp_7 - 1;
> note:   Processing hybrid candidate : if (patt_6 != 0)
> note:   Found loop_vect sink: if (patt_6 != 0)
> note:   marking hybrid: patt_6 = _4 != x_9(D);
> note:   marking hybrid: _4 = _3 * 2;
> note:   marking hybrid: _3 = vect_a[i_15];
> note:   marking hybrid: i_15 = PHI <i_12(6), 0(2)>
> note:   marking hybrid: i_12 = i_15 + 1;
> 
> Is the solution here that I treat LOOP_VINFO_IV_EXIT as a sink as well, and
> forcibly ignore it?
> 
> I think this would match what the analysis code later does:
> 
> note:   ==> examining statement: if (ivtmp_14 != 0)
> note:   irrelevant.
> 
> This is the part I'm having the most trouble with.  Today I believe we never
> analyse the main loop exit because nothing pulls it into the analysis.

Probably ivcanon ensures the IV is in its own isolated use-def cycle,
otherwise I don't see how we'd run into this for example when we have
a vectorizable induction based on the same IV and stored into a
SLP memory group?

From reading both above eventually hybrid detection should ignore
!STMT_VINFO_RELEVANT loop_vect uses ... (luckily hybrid detection
will go away when we're only-SLP).
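
To make the idea concrete, here is a deliberately simplified toy model (not
GCC's vect_detect_hybrid_slp): SLP statements become hybrid only when they
are reached from a non-SLP statement that is also relevant, so an irrelevant
sink such as the main exit condition marks nothing.  toy_stmt and
toy_detect_hybrid are hypothetical names; the relevant flag stands in for
STMT_VINFO_RELEVANT being set:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: not GCC's stmt_vec_info or its hybrid detection.
struct toy_stmt {
  bool in_slp;                        // covered by an SLP instance
  bool relevant;                      // stands in for STMT_VINFO_RELEVANT
  std::vector<std::size_t> operands;  // indices of the defining statements
};

// Mark SLP statements that are (transitively) used by a relevant
// non-SLP statement.
std::vector<bool> toy_detect_hybrid(const std::vector<toy_stmt> &stmts) {
  std::vector<bool> hybrid(stmts.size(), false);
  std::vector<std::size_t> worklist;
  for (std::size_t i = 0; i < stmts.size(); ++i)
    if (!stmts[i].in_slp && stmts[i].relevant)  // irrelevant sinks are skipped
      worklist.push_back(i);
  while (!worklist.empty()) {
    std::size_t user = worklist.back();
    worklist.pop_back();
    for (std::size_t def : stmts[user].operands) {
      assert(def < stmts.size());
      if (stmts[def].in_slp && !hybrid[def]) {
        hybrid[def] = true;
        worklist.push_back(def);  // its operands become hybrid too
      }
    }
  }
  return hybrid;
}
```

With an IV cycle (a PHI and its increment) in SLP and an exit test using the
increment, nothing is marked while the exit test is irrelevant; flipping it
to relevant marks both IV statements.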

> 3. I believe I also need to analyse roots during VF, i.e.
>    vect_determine_vectorization_factor shows:
> 
> note:   ==> examining statement: if (_4 != x_9(D))
> note:   skip.
> note:   ==> examining pattern def stmt: patt_17 = _4 != x_9(D);
> note:   precomputed vectype: vector(2) <signed-boolean:32>
> note:   nunits = 2
> 
> which does not seem right.

Why's that not right?

For reference below is what I have in my dev tree for the non-grouped
store SLP.

Thanks,
Richard.

From 6fea9f34bd218437fc2d08da38f3883cac59947e Mon Sep 17 00:00:00 2001
From: Richard Biener <rguent...@suse.de>
Date: Fri, 29 Sep 2023 12:54:17 +0200
Subject: [PATCH] Handle non-grouped stores as single-lane SLP
To: gcc-patches@gcc.gnu.org

The following enables single-lane loop SLP discovery for non-grouped stores
and adjusts vectorizable_store to properly handle those.

For gfortran.dg/vect/vect-8.f90 we vectorize one additional loop,
not running into the "not falling back to strided accesses" bail-out.
I have not investigated in detail.  Similar for gcc.dg/vect/slp-19c.c.

The gcc.dg/vect/O3-pr39675-2.c and gcc.dg/vect/slp-19[abc].c SLPs
depend on the load permute lowering as the single-lane store we
now want to handle is fed from a single lane from groups of size four.
I've updated the expected number of SLPs but they FAIL.

For gfortran.dg/vect/fast-math-mgrid-resid.f predictive commoning
now unrolls the loop, the vectorization factor is the same.  I think
association during SLP build might be the reason for the difference.

There is a set of i386 target assembler test FAILs,
gcc.target/i386/pr88531-2[bc].c in particular fail because the
target cannot identify SLP emulated gathers, see another mail from me.
Others need adjustment, I've adjusted one with this patch only.

        * tree-vect-slp.cc (vect_analyze_slp): Perform single-lane
        loop SLP discovery for non-grouped stores.
        * tree-vect-stmts.cc (vectorizable_store): Always set
        vec_num for SLP.

        * gcc.dg/vect/O3-pr39675-2.c: Adjust expected number of SLP.
        * gcc.dg/vect/fast-math-vect-call-1.c: Likewise.
        * gcc.dg/vect/no-scevccp-slp-31.c: Likewise.
        * gcc.dg/vect/slp-12b.c: Likewise.
        * gcc.dg/vect/slp-12c.c: Likewise.
        * gcc.dg/vect/slp-19a.c: Likewise.
        * gcc.dg/vect/slp-19b.c: Likewise.
        * gcc.dg/vect/slp-19c.c: Likewise.
        * gcc.dg/vect/slp-4-big-array.c: Likewise.
        * gcc.dg/vect/slp-4.c: Likewise.
        * gcc.dg/vect/slp-5.c: Likewise.
        * gcc.dg/vect/slp-7.c: Likewise.
        * gcc.dg/vect/slp-perm-7.c: Likewise.
        * gcc.dg/vect/slp-37.c: Likewise.
        * gcc.dg/vect/vect-outer-slp-3.c: Disable vectorization of
        initialization loop.
        * gcc.dg/vect/slp-reduc-5.c: Likewise.
        * gcc.dg/vect/no-scevccp-outer-12.c: Un-XFAIL.  SLP can handle
        inner loop inductions with multiple vector stmt copies.
        * gfortran.dg/vect/vect-8.f90: Adjust expected number of
        vectorized loops.
        * gfortran.dg/vect/fast-math-mgrid-resid.f: Expect predictive
        commoning with unrolling.
        * gcc.target/i386/vectorize1.c: Adjust what we scan for.
---
 gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c      |  2 +-
 .../gcc.dg/vect/fast-math-vect-call-1.c       |  2 +-
 .../gcc.dg/vect/no-scevccp-outer-12.c         |  3 +--
 gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c |  5 ++--
 gcc/testsuite/gcc.dg/vect/slp-12b.c           |  2 +-
 gcc/testsuite/gcc.dg/vect/slp-12c.c           |  2 +-
 gcc/testsuite/gcc.dg/vect/slp-19a.c           |  2 +-
 gcc/testsuite/gcc.dg/vect/slp-19b.c           |  2 +-
 gcc/testsuite/gcc.dg/vect/slp-19c.c           |  4 ++--
 gcc/testsuite/gcc.dg/vect/slp-37.c            |  2 +-
 gcc/testsuite/gcc.dg/vect/slp-4-big-array.c   |  2 +-
 gcc/testsuite/gcc.dg/vect/slp-4.c             |  2 +-
 gcc/testsuite/gcc.dg/vect/slp-5.c             |  2 +-
 gcc/testsuite/gcc.dg/vect/slp-7.c             |  4 ++--
 gcc/testsuite/gcc.dg/vect/slp-perm-7.c        |  4 ++--
 gcc/testsuite/gcc.dg/vect/slp-reduc-5.c       |  3 ++-
 gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c  |  1 +
 gcc/testsuite/gcc.target/i386/vectorize1.c    |  4 ++--
 .../gfortran.dg/vect/fast-math-mgrid-resid.f  |  2 +-
 gcc/testsuite/gfortran.dg/vect/vect-8.f90     |  2 +-
 gcc/tree-vect-slp.cc                          | 23 +++++++++++++++++++
 gcc/tree-vect-stmts.cc                        | 11 +++++----
 22 files changed, 57 insertions(+), 29 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c b/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c
index c3f0f6dc1be..ddaac56cc0b 100644
--- a/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c
+++ b/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c
@@ -27,5 +27,5 @@ foo ()
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target vect_strided4 } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_strided4 } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_strided4 } } } */
   
diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c b/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c
index ad22f6e82b3..6c9b7c37b6e 100644
--- a/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c
+++ b/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c
@@ -101,4 +101,4 @@ main ()
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 4 "vect" { target { vect_call_copysignf && vect_call_sqrtf } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target { { vect_call_copysignf && vect_call_sqrtf } && vect_perm3_int } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target { { vect_call_copysignf && vect_call_sqrtf } && vect_perm3_int } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c
index c2d3031bc0c..6ace6ad022e 100644
--- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c
+++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c
@@ -46,5 +46,4 @@ int main (void)
   return 0;
 }
 
-/* Until we support multiple types in the inner loop  */
-/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { xfail { ! { aarch64*-*-* riscv*-*-* } } } } } */
+/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
index 22817a57ef8..f6ac5f60298 100644
--- a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
+++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
@@ -53,6 +53,7 @@ int main (void)
   return 0;
 }
 
+/* We cannot handle grouped accesses in outer loops.  */
+/* { dg-final { scan-tree-dump-not "OUTER LOOP VECTORIZED" "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect"  } } */
-  
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect"  } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-12b.c b/gcc/testsuite/gcc.dg/vect/slp-12b.c
index e2ea24d6c53..8e06e3bfa93 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-12b.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-12b.c
@@ -47,6 +47,6 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target { vect_strided2 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect"  { target { ! { vect_strided2 && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect"  { target { vect_strided2 && vect_int_mult } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect"  { target { vect_strided2 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect"  { target { ! { vect_strided2 && vect_int_mult } } } } } */
   
diff --git a/gcc/testsuite/gcc.dg/vect/slp-12c.c b/gcc/testsuite/gcc.dg/vect/slp-12c.c
index 9c48dff3bf4..a3536e3053b 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-12c.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-12c.c
@@ -49,5 +49,5 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target { vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect"  { target { ! vect_int_mult } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_int_mult } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_int_mult } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! vect_int_mult } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-19a.c b/gcc/testsuite/gcc.dg/vect/slp-19a.c
index ca7a0a8e456..6c21416046d 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-19a.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-19a.c
@@ -57,5 +57,5 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_strided8 } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! vect_strided8 } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_strided8 } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_strided8 } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! vect_strided8} } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-19b.c b/gcc/testsuite/gcc.dg/vect/slp-19b.c
index 4d53ac698db..10b84aab3b5 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-19b.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-19b.c
@@ -54,5 +54,5 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_strided4 } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! vect_strided4 } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_strided4 } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_strided4 } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! vect_strided4 } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-19c.c b/gcc/testsuite/gcc.dg/vect/slp-19c.c
index 188ab37a0b6..84869cadc89 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-19c.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-19c.c
@@ -105,5 +105,5 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-37.c b/gcc/testsuite/gcc.dg/vect/slp-37.c
index caee2bb508f..8a430e63847 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-37.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-37.c
@@ -60,4 +60,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_hw_misalign } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_hw_misalign } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_hw_misalign } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c b/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c
index fcda45ff368..f738a613324 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c
@@ -131,5 +131,5 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect"  } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect"  } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 6 "vect"  } } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/slp-4.c b/gcc/testsuite/gcc.dg/vect/slp-4.c
index 29e741df02b..1ecad7415ef 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-4.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-4.c
@@ -125,5 +125,5 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect"  } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect"  } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 6 "vect"  } } */
   
diff --git a/gcc/testsuite/gcc.dg/vect/slp-5.c b/gcc/testsuite/gcc.dg/vect/slp-5.c
index 6d51f6a7323..484898c2afd 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-5.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-5.c
@@ -124,5 +124,5 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect"  } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 5 "vect"  } } */
   
diff --git a/gcc/testsuite/gcc.dg/vect/slp-7.c b/gcc/testsuite/gcc.dg/vect/slp-7.c
index 2845a99dedf..f83fdc96d16 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-7.c
@@ -125,6 +125,6 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect"  { target vect_short_mult } } }*/
 /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect"  { target { ! { vect_short_mult } } } } }*/
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect"  { target vect_short_mult } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect"  { target { ! { vect_short_mult } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 5 "vect"  { target vect_short_mult } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect"  { target { ! { vect_short_mult } } } } } */
  
diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-7.c b/gcc/testsuite/gcc.dg/vect/slp-perm-7.c
index df13c37bc75..c3d903e5b11 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-7.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-7.c
@@ -97,8 +97,8 @@ int main (int argc, const char* argv[])
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target vect_perm } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_perm3_int && { ! vect_load_lanes } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { vect_perm3_int && { ! vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_load_lanes } } } */
 /* { dg-final { scan-tree-dump "Built SLP cancelled: can use load/store-lanes" "vect" { target { vect_perm3_int && vect_load_lanes } } } } */
 /* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target vect_load_lanes } } } */
 /* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target vect_load_lanes } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c b/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c
index 11f5a7414cf..0cde79d9e49 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c
@@ -36,6 +36,7 @@ int main (void)
 
   check_vect ();
 
+#pragma GCC novector
   for (i = 0; i < N; i++)
     c[i] = (i+3) * -1;
 
@@ -44,6 +45,6 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { xfail vect_no_int_min_max } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_min_max } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_min_max } } } */
 /* { dg-final { scan-tree-dump-times "VEC_PERM_EXPR" 0 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c
index 3dce51426b5..d315db5632b 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c
@@ -30,6 +30,7 @@ int main ()
 {
   check_vect ();
 
+#pragma GCC novector
   for (int i = 0; i < 40; ++i)
     image[i] = 1.;
 
diff --git a/gcc/testsuite/gcc.target/i386/vectorize1.c b/gcc/testsuite/gcc.target/i386/vectorize1.c
index f3b9bfba382..14a8c5f28b3 100644
--- a/gcc/testsuite/gcc.target/i386/vectorize1.c
+++ b/gcc/testsuite/gcc.target/i386/vectorize1.c
@@ -1,6 +1,6 @@
 /* PR middle-end/28915 */
 /* { dg-do compile } */
-/* { dg-options "-msse -O2 -ftree-vectorize -fdump-tree-vect" } */
+/* { dg-options "-msse -O2 -ftree-vectorize -fdump-tree-vect-optimized" } */
 
 extern char lanip[3][40];
 typedef struct
@@ -17,4 +17,4 @@ int set_names (void)
       tt1.t[ln] = lanip[1];
 }
 
-/* { dg-final { scan-tree-dump "vect_cst" "vect" } } */
+/* { dg-final { scan-tree-dump "optimized: loop vectorized" "vect" } } */
diff --git a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
index 2e548748296..9dda5087551 100644
--- a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
+++ b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
@@ -43,5 +43,5 @@ C
 ! vectorized loop.  If vector factor is 2, the vectorized loop can
 ! be predictive commoned, we check if predictive commoning PHI node
 ! is created with vector(2) type.
-! { dg-final { scan-tree-dump "Executing predictive commoning without unrolling" "pcom" { xfail vect_variable_length } } }
+! { dg-final { scan-tree-dump "Unrolling 2 times" "pcom" { xfail vect_variable_length } } }
 ! { dg-final { scan-tree-dump "vectp_u.*__lsm.* = PHI <.*vectp_u.*__lsm" "pcom" { xfail vect_variable_length } } }
diff --git a/gcc/testsuite/gfortran.dg/vect/vect-8.f90 b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
index f77ec9fb87a..283c36e0ebe 100644
--- a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
+++ b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
@@ -708,5 +708,5 @@ END SUBROUTINE kernel
 
 ! { dg-final { scan-tree-dump-times "vectorized 2\[56\] loops" 1 "vect" { target aarch64_sve } } }
 ! { dg-final { scan-tree-dump-times "vectorized 2\[45\] loops" 1 "vect" { target { aarch64*-*-* && { ! aarch64_sve } } } } }
-! { dg-final { scan-tree-dump-times "vectorized 2\[234\] loops" 1 "vect" { target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
+! { dg-final { scan-tree-dump-times "vectorized 2\[345\] loops" 1 "vect" { target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
 ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { { ! vect_intdouble_cvt } && { ! aarch64*-*-* } } } } }
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 5c8b1beda38..f40a530c183 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -4335,6 +4335,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 opt_result
 vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 {
+  loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
   unsigned int i;
   stmt_vec_info first_element;
   slp_instance instance;
@@ -4351,6 +4352,28 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
     vect_analyze_slp_instance (vinfo, bst_map, first_element,
                               slp_inst_kind_store, max_tree_size, &limit);
 
+  /* For loops also start SLP discovery from non-grouped stores.  */
+  if (loop_vinfo)
+    {
+      data_reference_p dr;
+      FOR_EACH_VEC_ELT (vinfo->shared->datarefs, i, dr)
+       if (DR_IS_WRITE (dr))
+         {
+           stmt_vec_info stmt_info = vinfo->lookup_dr (dr)->stmt;
+           /* Grouped stores are already handled above.  */
+           if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+             continue;
+           vec<stmt_vec_info> stmts;
+           vec<stmt_vec_info> roots = vNULL;
+           vec<tree> remain = vNULL;
+           stmts.create (1);
+           stmts.quick_push (stmt_info);
+           vect_build_slp_instance (vinfo, slp_inst_kind_store,
+                                    stmts, roots, remain, max_tree_size,
+                                    &limit, bst_map, NULL);
+         }
+    }
+
   if (bb_vec_info bb_vinfo = dyn_cast <bb_vec_info> (vinfo))
     {
       for (unsigned i = 0; i < bb_vinfo->roots.length (); ++i)
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 153348000b2..c743a77f946 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8334,10 +8334,12 @@ vectorizable_store (vec_info *vinfo,
       return vectorizable_scan_store (vinfo, stmt_info, gsi, vec_stmt, ncopies);
     }
 
-  if (grouped_store)
+  if (grouped_store || slp)
     {
       /* FORNOW */
-      gcc_assert (!loop || !nested_in_vect_loop_p (loop, stmt_info));
+      gcc_assert (!grouped_store
+                 || !loop
+                 || !nested_in_vect_loop_p (loop, stmt_info));
 
       if (slp)
         {
@@ -8346,8 +8348,9 @@ vectorizable_store (vec_info *vinfo,
              group.  */
           vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
          first_stmt_info = SLP_TREE_SCALAR_STMTS (slp_node)[0];
-         gcc_assert (DR_GROUP_FIRST_ELEMENT (first_stmt_info)
-                     == first_stmt_info);
+         gcc_assert (!STMT_VINFO_GROUPED_ACCESS (first_stmt_info)
+                     || (DR_GROUP_FIRST_ELEMENT (first_stmt_info)
+                         == first_stmt_info));
          first_dr_info = STMT_VINFO_DR_INFO (first_stmt_info);
          op = vect_get_store_rhs (first_stmt_info);
         } 
-- 
2.43.0
