RE: [RFC] Support single lane SLP early break

Tamar Christina Thu, 22 Aug 2024 14:42:34 -0700

> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Wednesday, August 21, 2024 12:12 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [RFC] Support single lane SLP early break
> 
> On Tue, 20 Aug 2024, Tamar Christina wrote:
> 
> > Hi,
> >
> > I've been working on a prototype of moving early break to SLP.
> >
> > As we've discussed on IRC I've decided to first try adding the gconds as roots
> > and start SLP discovery using them as roots.
> >
> > This works great and doesn't require any changes to build_slp; it also has the
> > additional benefit that we can easily (as a follow up) add groups of
> > gconds and then try to SLP the roots together if the operations are the same
> > and then decompose the tree based on the roots if not.
> >
> > So it looks like using the roots is the best approach. However I've hit some
> > issues that I could solve, but would require me to modify large chunks of code
> > and would like your input before I start.
> >
> > 1. roots are currently not analyzed or code-gened through vectorizable_*.
> >    This is because it looks like the only things used as roots so far are things
> >    that all targets support (like constructors) or that will be lowered by
> >    veclower later.  This is easy to fix: I can work roots into the analysis
> >    part in vect_slp_analyze_node_operations and pass enough information to
> >    vectorize_slp_instance_root_stmt to be able to use vectorizable_early_exit.
> >    I have a prototype of this currently working but it's a hack and I need to do
> >    it properly if it's the way you'd like to go.
> 
> There is currently no "explicit" separate analysis of the root but only
> vect_slp_analyze_operations doing
> 
>                                              &cost_vec)
>           /* CTOR instances require vectorized defs for the SLP tree root.  */
>           || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor
>               && (SLP_TREE_DEF_TYPE (SLP_INSTANCE_TREE (instance))
>                   != vect_internal_def
>                   /* Make sure we vectorized with the expected type.  */
>                   || !useless_type_conversion_p
>                         (TREE_TYPE (TREE_TYPE (gimple_assign_rhs1
>                                       (instance->root_stmts[0]->stmt))),
>                          TREE_TYPE (SLP_TREE_VECTYPE
>                                             (SLP_INSTANCE_TREE (instance))))))
>           /* Check we can vectorize the reduction.  */
>           || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc
>               && !vectorizable_bb_reduc_epilogue (instance, &cost_vec)))
> 
> For the transform phase we do have vectorize_slp_instance_root_stmt
> (called by vect_schedule_slp).  Neither really fits the
> vectorizable_* API since what the root looks like really depends on
> the SLP instance kind.
> 
> So it would be above where you'd hook in the required code, adding
> a slp_inst_kind_early_break or so.  Factoring the analysis part
> into a vectorizable_slp_instance_root () function would be an
> improvement of course.
> 
> > 2.  consider the loop:
> >
> > #ifndef N
> > #define N 800
> > #endif
> > unsigned vect_a[N];
> > unsigned vect_b[N];
> >
> > unsigned test4(unsigned x)
> > {
> >  unsigned ret = 0;
> >  for (int i = 0; i < N; i++)
> >  {
> >    vect_b[i] = x + i;
> >    if (vect_a[i]*2 != x)
> >      break;
> >    vect_a[i] = x;
> >
> >  }
> >  return ret;
> > }
> >
> > The build part looks like:
> >
> > note:   === vect_analyze_slp ===
> > note:   Analyzing vectorizable control flow: if (patt_6 != 0)
> > note:   Starting SLP discovery for
> > note:     patt_6 = _4 != x_9(D);
> > note:   starting SLP discovery for node 0x5141280
> > note:   Build SLP for patt_6 = _4 != x_9(D);
> > note:   precomputed vectype: vector(4) <signed-boolean:32>
> > note:   nunits = 4
> > note:   vect_is_simple_use: operand x_9(D), type of def: external
> > note:   vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0
> > _3 * 2, type of def: internal
> > note:   starting SLP discovery for node 0x51413a0
> > note:   Build SLP for _4 = _3 * 2;
> > note:   precomputed vectype: vector(4) unsigned int
> > note:   nunits = 4
> > note:   vect_is_simple_use: operand # VUSE <.MEM_10>
> > vect_aD.4416[i_15], type of def: internal
> > note:   vect_is_simple_use: operand 2, type of def: constant
> > note:   vect_is_simple_use: operand # VUSE <.MEM_10>
> > vect_aD.4416[i_15], type of def: internal
> > note:   vect_is_simple_use: operand 2, type of def: constant
> > note:   starting SLP discovery for node 0x5141430
> > note:   Build SLP for _3 = vect_a[i_15];
> > note:   precomputed vectype: vector(4) unsigned int
> > note:   nunits = 4
> > note:   SLP discovery for node 0x5141430 succeeded
> > note:   SLP discovery for node 0x51413a0 succeeded
> > note:   SLP discovery for node 0x5141280 succeeded
> > note:   SLP size 3 vs. limit 10.
> > note:   Final SLP tree for instance 0x5208e30:
> > note:   node 0x5141280 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32>
> > note:   op template: patt_6 = _4 != x_9(D);
> > note:      stmt 0 patt_6 = _4 != x_9(D);
> > note:      children 0x5141310 0x51413a0
> > note:   node (external) 0x5141310 (max_nunits=1, refcnt=1)
> > note:      { x_9(D) }
> > note:   node 0x51413a0 (max_nunits=4, refcnt=2) vector(4) unsigned int
> > note:   op template: _4 = _3 * 2;
> > note:      stmt 0 _4 = _3 * 2;
> > note:      children 0x5141430 0x51414c0
> > note:   node 0x5141430 (max_nunits=4, refcnt=2) vector(4) unsigned int
> > note:   op template: _3 = vect_a[i_15];
> > note:      stmt 0 _3 = vect_a[i_15];
> > note:      load permutation { 0 }
> > note:   node (constant) 0x51414c0 (max_nunits=1, refcnt=1)
> > note:      { 2 }
> >
> > and codegen:
> >
> > note:  ------>vectorizing statement: patt_6 = _4 != x_9(D);
> > note:  transform statement.
> > note:  vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0
> >        _3 * 2, type of def: internal
> > note:  vect_is_simple_use: vectype vector(4) unsigned int
> > note:  vect_is_simple_use: operand x_9(D), type of def: external
> > note:  vect_get_vec_defs_for_operand: _4
> > note:  vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0
> >        _3 * 2, type of def: internal
> > note:    def_stmt =  _4 = _3 * 2;
> > note:  vect_get_vec_defs_for_operand: x_9(D)
> > note:  vect_is_simple_use: operand x_9(D), type of def: external
> > note:  created new init_stmt: vect_cst__72 = {x_9(D), x_9(D), x_9(D), x_9(D)};
> > note:  add new stmt: mask_patt_6.25_73 = vect__4.24_71 != vect_cst__72;
> > note:  ------>vectorizing statement: if (patt_6 != 0)
> > note:  transform statement.
> > note:   === vectorizable_early_exit ===
> > note:   vect_is_simple_use: operand _4 != x_9(D), type of def: internal
> > note:   vect_is_simple_use: vectype vector(4) <signed-boolean:32>
> > note:   transform early-exit.
> > note:   vect_is_simple_use: operand _4 != x_9(D), type of def: internal
> > note:   vect_is_simple_use: vectype vector(4) <signed-boolean:32>
> > note:   vect_is_simple_use: operand 0, type of def: constant
> > note:   vect_get_vec_defs_for_operand: patt_6
> > note:   vect_is_simple_use: operand _4 != x_9(D), type of def: internal
> > note:     def_stmt =  patt_6 = _4 != x_9(D);
> > note:   vect_get_vec_defs_for_operand: 0
> > note:   vect_is_simple_use: operand 0, type of def: constant
> > note:   created new init_stmt: vect_cst__74 = { 0, 0, 0, 0 };
> > note:   add new stmt: cmp_75 = mask_patt_6.25_73 ^ vect_cst__74;
> >
> > So far so good.
> >
> > However, things go wrong during SLP vect_detect_hybrid_slp analysis
> >
> > note:   === vect_update_vf_for_slp ===
> > note:   Loop contains SLP and non-SLP stmts
> > note:   Updating vectorization factor to 4.
> > note:  vectorization_factor = 4, niters = 800
> >
> > This has a couple of reasons:
> >
> > 1. The stores are non-grouped stores and so are never considered for SLP.
> 
> Yeah, that's an unmerged part of the all-SLP migration (I _think_ I have
> posted a patch to do this).
> 
> > Now I've temporarily worked around this by doing the following during vect_analyze_slp:
> >
> > /* Find SLP sequences starting from non-grouped stores.  */
> > for (auto dr : LOOP_VINFO_DATAREFS (vinfo))
> >     if (DR_IS_WRITE (dr))
> >       {
> >         stmt_vec_info dr_info = vinfo->lookup_stmt (DR_STMT (dr));
> >         if (!dr_info)
> >           continue;
> >
> >         vect_analyze_slp_instance (vinfo, bst_map, dr_info,
> >                                    slp_inst_kind_store, max_tree_size,
> >                                    &limit);
> >       }
> >
> > So it follows single lane stores.  But I'm not sure I understand why this is
> > needed.  I thought that your earlier work to transition to SLP only would have
> > already covered single stream stores.
> 
> Nope, only single-stream interleaved stores (single element interleaving).
> 
> I've refrained from adding the "rest" yet (but it will look similar as to
> what you do above).
> 
> > The above works, but I am unsure if that's the best solution, or if I'm missing
> > something.
> 
> Just bad timing ;)  I keep being distracted from working on the
> remaining bits for all-SLP.
> 
> > 2. The second part that goes wrong is that due to the same IV being used by
> >     the early exit and the main exit, the main exit is now pulled into analysis:
> >
> > note:   === vect_detect_hybrid_slp ===
> > note:   Processing hybrid candidate : ivtmp_14 = ivtmp_7 - 1;
> > note:   Found loop_vect use: if (ivtmp_14 != 0)
> > note:   Processing hybrid candidate : i_12 = i_15 + 1;
> > note:   Marked SLP consumed stmt pure: i_12 = i_15 + 1;
> > note:   Processing hybrid candidate : ivtmp_7 = PHI <ivtmp_14(6), 800(2)>
> > note:   Found loop_vect use: ivtmp_14 = ivtmp_7 - 1;
> > note:   Processing hybrid candidate : if (patt_6 != 0)
> > note:   Found loop_vect sink: if (patt_6 != 0)
> > note:   marking hybrid: patt_6 = _4 != x_9(D);
> > note:   marking hybrid: _4 = _3 * 2;
> > note:   marking hybrid: _3 = vect_a[i_15];
> > note:   marking hybrid: i_15 = PHI <i_12(6), 0(2)>
> > note:   marking hybrid: i_12 = i_15 + 1;
> >
> > Is the solution here that I treat LOOP_VINFO_IV_EXIT as a sink as well, and
> > forcibly ignore it?
> >
> > I think this would match what the analysis code later does:
> >
> > note:   ==> examining statement: if (ivtmp_14 != 0)
> > note:   irrelevant.
> >
> > This is the part I'm having the most trouble with.  Today I believe we never
> > analyse the main loop exit because nothing pulls it into the analysis.
> 
> Probably ivcanon ensures the IV is in its own isolated use-def cycle,
> otherwise I don't see how we'd run into this for example when we have
> a vectorizable induction based on the same IV and stored into a
> SLP memory group?
> 
> From reading both above eventually hybrid detection should ignore
> !STMT_VINFO_RELEVANT loop_vect uses ... (luckily hybrid detection
> will go away when we're only-SLP).


Ack!

> 
> > 3. I believe I also need to analyse roots during VF, i.e.
> >    vect_determine_vectorization_factor shows:
> >
> > note:   ==> examining statement: if (_4 != x_9(D))
> > note:   skip.
> > note:   ==> examining pattern def stmt: patt_17 = _4 != x_9(D);
> > note:   precomputed vectype: vector(2) <signed-boolean:32>
> > note:   nunits = 2
> >
> > which does not seem right.
> 
> Why's that not right?
> 

I initially thought so since the patt_17 != 0 reduction could result in multiple
statements, and I mistakenly thought this meant it could affect the unroll factor.

But it can't, since the number of statements is actually determined by the
_4 != x_9 compare.

So it was just a misunderstanding.

> For reference below is what I have in my dev tree for the non-grouped
> store SLP.
>

Thanks! I've placed it in my tree and have made some nice progress since then.
Next is looking at moving the stores in SLP scheduling and hopefully getting
epilogue vectorization supported.

Cheers,
Tamar
 
> Thanks,
> Richard.
> 
> From 6fea9f34bd218437fc2d08da38f3883cac59947e Mon Sep 17 00:00:00 2001
> From: Richard Biener <rguent...@suse.de>
> Date: Fri, 29 Sep 2023 12:54:17 +0200
> Subject: [PATCH] Handle non-grouped stores as single-lane SLP
> To: gcc-patches@gcc.gnu.org
> 
> The following enables single-lane loop SLP discovery for non-grouped stores
> and adjusts vectorizable_store to properly handle those.
> 
> For gfortran.dg/vect/vect-8.f90 we vectorize one additional loop,
> not running into the "not falling back to strided accesses" bail-out.
> I have not investigated in detail.  Similar for gcc.dg/vect/slp-19c.c.
> 
> The gcc.dg/vect/O3-pr39675-2.c and gcc.dg/vect/slp-19[abc].c SLPs
> depend on the load permute lowering as the single-lane store we
> now want to handle is fed from a single lane from groups of size four.
> I've updated the expected number of SLPs but they FAIL.
> 
> For gfortran.dg/vect/fast-math-mgrid-resid.f predictive commoning
> now unrolls the loop, the vectorization factor is the same.  I think
> association during SLP build might be the reason for the difference.
> 
> There is a set of i386 target assembler test FAILs,
> gcc.target/i386/pr88531-2[bc].c in particular fail because the
> target cannot identify SLP emulated gathers, see another mail from me.
> Others need adjustment, I've adjusted one with this patch only.
> 
>       * tree-vect-slp.cc (vect_analyze_slp): Perform single-lane
>       loop SLP discovery for non-grouped stores.
>       * tree-vect-stmts.cc (vectorizable_store): Always set
>       vec_num for SLP.
> 
>       * gcc.dg/vect/O3-pr39675-2.c: Adjust expected number of SLP.
>       * gcc.dg/vect/fast-math-vect-call-1.c: Likewise.
>       * gcc.dg/vect/no-scevccp-slp-31.c: Likewise.
>       * gcc.dg/vect/slp-12b.c: Likewise.
>       * gcc.dg/vect/slp-12c.c: Likewise.
>       * gcc.dg/vect/slp-19a.c: Likewise.
>       * gcc.dg/vect/slp-19b.c: Likewise.
>       * gcc.dg/vect/slp-19c.c: Likewise.
>       * gcc.dg/vect/slp-4-big-array.c: Likewise.
>       * gcc.dg/vect/slp-4.c: Likewise.
>       * gcc.dg/vect/slp-5.c: Likewise.
>       * gcc.dg/vect/slp-7.c: Likewise.
>       * gcc.dg/vect/slp-perm-7.c: Likewise.
>       * gcc.dg/vect/slp-37.c: Likewise.
>       * gcc.dg/vect/vect-outer-slp-3.c: Disable vectorization of
>       initialization loop.
>       * gcc.dg/vect/slp-reduc-5.c: Likewise.
>       * gcc.dg/vect/no-scevccp-outer-12.c: Un-XFAIL.  SLP can handle
>       inner loop inductions with multiple vector stmt copies.
>       * gfortran.dg/vect/vect-8.f90: Adjust expected number of
>       vectorized loops.
>       * gfortran.dg/vect/fast-math-mgrid-resid.f: Expect predictive
>       commoning with unrolling.
>       * gcc.target/i386/vectorize1.c: Adjust what we scan for.
> ---
>  gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c      |  2 +-
>  .../gcc.dg/vect/fast-math-vect-call-1.c       |  2 +-
>  .../gcc.dg/vect/no-scevccp-outer-12.c         |  3 +--
>  gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c |  5 ++--
>  gcc/testsuite/gcc.dg/vect/slp-12b.c           |  2 +-
>  gcc/testsuite/gcc.dg/vect/slp-12c.c           |  2 +-
>  gcc/testsuite/gcc.dg/vect/slp-19a.c           |  2 +-
>  gcc/testsuite/gcc.dg/vect/slp-19b.c           |  2 +-
>  gcc/testsuite/gcc.dg/vect/slp-19c.c           |  4 ++--
>  gcc/testsuite/gcc.dg/vect/slp-37.c            |  2 +-
>  gcc/testsuite/gcc.dg/vect/slp-4-big-array.c   |  2 +-
>  gcc/testsuite/gcc.dg/vect/slp-4.c             |  2 +-
>  gcc/testsuite/gcc.dg/vect/slp-5.c             |  2 +-
>  gcc/testsuite/gcc.dg/vect/slp-7.c             |  4 ++--
>  gcc/testsuite/gcc.dg/vect/slp-perm-7.c        |  4 ++--
>  gcc/testsuite/gcc.dg/vect/slp-reduc-5.c       |  3 ++-
>  gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c  |  1 +
>  gcc/testsuite/gcc.target/i386/vectorize1.c    |  4 ++--
>  .../gfortran.dg/vect/fast-math-mgrid-resid.f  |  2 +-
>  gcc/testsuite/gfortran.dg/vect/vect-8.f90     |  2 +-
>  gcc/tree-vect-slp.cc                          | 23 +++++++++++++++++++
>  gcc/tree-vect-stmts.cc                        | 11 +++++----
>  22 files changed, 57 insertions(+), 29 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c
> b/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c
> index c3f0f6dc1be..ddaac56cc0b 100644
> --- a/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c
> +++ b/gcc/testsuite/gcc.dg/vect/O3-pr39675-2.c
> @@ -27,5 +27,5 @@ foo ()
>  }
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target
> vect_strided4 } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> { target
> vect_strided4 } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" {
> target vect_strided4 } } } */
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c
> b/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c
> index ad22f6e82b3..6c9b7c37b6e 100644
> --- a/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c
> +++ b/gcc/testsuite/gcc.dg/vect/fast-math-vect-call-1.c
> @@ -101,4 +101,4 @@ main ()
>  }
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 4 "vect" { target {
> vect_call_copysignf && vect_call_sqrtf } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" 
> { target
> { { vect_call_copysignf && vect_call_sqrtf } && vect_perm3_int } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" {
> target { { vect_call_copysignf && vect_call_sqrtf } && vect_perm3_int } } } } 
> */
> diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c
> b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c
> index c2d3031bc0c..6ace6ad022e 100644
> --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c
> +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-12.c
> @@ -46,5 +46,4 @@ int main (void)
>    return 0;
>  }
> 
> -/* Until we support multiple types in the inner loop  */
> -/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
> xfail {
> ! { aarch64*-*-* riscv*-*-* } } } } } */
> +/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
> b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
> index 22817a57ef8..f6ac5f60298 100644
> --- a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
> +++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
> @@ -53,6 +53,7 @@ int main (void)
>    return 0;
>  }
> 
> +/* We cannot handle grouped accesses in outer loops.  */
> +/* { dg-final { scan-tree-dump-not "OUTER LOOP VECTORIZED" "vect" } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect"  
> } } */
> -
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect"  
> } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-12b.c 
> b/gcc/testsuite/gcc.dg/vect/slp-
> 12b.c
> index e2ea24d6c53..8e06e3bfa93 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-12b.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-12b.c
> @@ -47,6 +47,6 @@ int main (void)
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target 
> {
> vect_strided2 && vect_int_mult } } } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect"  { target 
> { ! {
> vect_strided2 && vect_int_mult } } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect"  
> {
> target { vect_strided2 && vect_int_mult } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect"  
> {
> target { vect_strided2 && vect_int_mult } } } } */
>  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect"  
> { target
> { ! { vect_strided2 && vect_int_mult } } } } } */
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-12c.c 
> b/gcc/testsuite/gcc.dg/vect/slp-
> 12c.c
> index 9c48dff3bf4..a3536e3053b 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-12c.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-12c.c
> @@ -49,5 +49,5 @@ int main (void)
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target 
> {
> vect_int_mult } } } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect"  { target 
> { !
> vect_int_mult } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> { target
> vect_int_mult } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" {
> target vect_int_mult } } } */
>  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" 
> { target
> { ! vect_int_mult } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-19a.c 
> b/gcc/testsuite/gcc.dg/vect/slp-
> 19a.c
> index ca7a0a8e456..6c21416046d 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-19a.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-19a.c
> @@ -57,5 +57,5 @@ int main (void)
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target
> vect_strided8 } } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target 
> { !
> vect_strided8 } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> { target
> vect_strided8 } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" {
> target vect_strided8 } } } */
>  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" 
> { target
> { ! vect_strided8} } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-19b.c 
> b/gcc/testsuite/gcc.dg/vect/slp-
> 19b.c
> index 4d53ac698db..10b84aab3b5 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-19b.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-19b.c
> @@ -54,5 +54,5 @@ int main (void)
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target
> vect_strided4 } } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target 
> { !
> vect_strided4 } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> { target
> vect_strided4 } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" {
> target vect_strided4 } } } */
>  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" 
> { target
> { ! vect_strided4 } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-19c.c 
> b/gcc/testsuite/gcc.dg/vect/slp-
> 19c.c
> index 188ab37a0b6..84869cadc89 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-19c.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-19c.c
> @@ -105,5 +105,5 @@ int main (void)
>    return 0;
>  }
> 
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" 
> } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-37.c 
> b/gcc/testsuite/gcc.dg/vect/slp-37.c
> index caee2bb508f..8a430e63847 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-37.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-37.c
> @@ -60,4 +60,4 @@ int main (void)
>  }
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target
> vect_hw_misalign } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> { target
> vect_hw_misalign } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" {
> target vect_hw_misalign } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c
> b/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c
> index fcda45ff368..f738a613324 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-4-big-array.c
> @@ -131,5 +131,5 @@ int main (void)
>  }
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect"  } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect"  
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 6 "vect"  
> } } */
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-4.c 
> b/gcc/testsuite/gcc.dg/vect/slp-4.c
> index 29e741df02b..1ecad7415ef 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-4.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-4.c
> @@ -125,5 +125,5 @@ int main (void)
>  }
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect"  } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect"  
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 6 "vect"  
> } } */
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-5.c 
> b/gcc/testsuite/gcc.dg/vect/slp-5.c
> index 6d51f6a7323..484898c2afd 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-5.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-5.c
> @@ -124,5 +124,5 @@ int main (void)
>  }
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect"  
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 5 "vect"  
> } } */
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-7.c 
> b/gcc/testsuite/gcc.dg/vect/slp-7.c
> index 2845a99dedf..f83fdc96d16 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-7.c
> @@ -125,6 +125,6 @@ int main (void)
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect"  { target
> vect_short_mult } } }*/
>  /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect"  { target 
> { ! {
> vect_short_mult } } } } }*/
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect"  
> {
> target vect_short_mult } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect"  
> {
> target { ! { vect_short_mult } } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 5 "vect"  
> {
> target vect_short_mult } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect"  
> {
> target { ! { vect_short_mult } } } } } */
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-7.c
> b/gcc/testsuite/gcc.dg/vect/slp-perm-7.c
> index df13c37bc75..c3d903e5b11 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-perm-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-7.c
> @@ -97,8 +97,8 @@ int main (int argc, const char* argv[])
>  }
> 
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target
> vect_perm } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> { target
> { vect_perm3_int && { ! vect_load_lanes } } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" 
> { target
> vect_load_lanes } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" {
> target { vect_perm3_int && { ! vect_load_lanes } } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" {
> target vect_load_lanes } } } */
>  /* { dg-final { scan-tree-dump "Built SLP cancelled: can use 
> load/store-lanes"
> "vect" { target { vect_perm3_int && vect_load_lanes } } } } */
>  /* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target vect_load_lanes 
> } } }
> */
>  /* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target vect_load_lanes 
> } } }
> */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c
> b/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c
> index 11f5a7414cf..0cde79d9e49 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-reduc-5.c
> @@ -36,6 +36,7 @@ int main (void)
> 
>    check_vect ();
> 
> +#pragma GCC novector
>    for (i = 0; i < N; i++)
>      c[i] = (i+3) * -1;
> 
> @@ -44,6 +45,6 @@ int main (void)
>    return 0;
>  }
> 
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { xfail
> vect_no_int_min_max } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail
> vect_no_int_min_max } } } */
>  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> { xfail
> vect_no_int_min_max } } } */
>  /* { dg-final { scan-tree-dump-times "VEC_PERM_EXPR" 0 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c
> b/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c
> index 3dce51426b5..d315db5632b 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-outer-slp-3.c
> @@ -30,6 +30,7 @@ int main ()
>  {
>    check_vect ();
> 
> +#pragma GCC novector
>    for (int i = 0; i < 40; ++i)
>      image[i] = 1.;
> 
> diff --git a/gcc/testsuite/gcc.target/i386/vectorize1.c
> b/gcc/testsuite/gcc.target/i386/vectorize1.c
> index f3b9bfba382..14a8c5f28b3 100644
> --- a/gcc/testsuite/gcc.target/i386/vectorize1.c
> +++ b/gcc/testsuite/gcc.target/i386/vectorize1.c
> @@ -1,6 +1,6 @@
>  /* PR middle-end/28915 */
>  /* { dg-do compile } */
> -/* { dg-options "-msse -O2 -ftree-vectorize -fdump-tree-vect" } */
> +/* { dg-options "-msse -O2 -ftree-vectorize -fdump-tree-vect-optimized" } */
> 
>  extern char lanip[3][40];
>  typedef struct
> @@ -17,4 +17,4 @@ int set_names (void)
>        tt1.t[ln] = lanip[1];
>  }
> 
> -/* { dg-final { scan-tree-dump "vect_cst" "vect" } } */
> +/* { dg-final { scan-tree-dump "optimized: loop vectorized" "vect" } } */
> diff --git a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
> b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
> index 2e548748296..9dda5087551 100644
> --- a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
> +++ b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
> @@ -43,5 +43,5 @@ C
>  ! vectorized loop.  If vector factor is 2, the vectorized loop can
>  ! be predictive commoned, we check if predictive commoning PHI node
>  ! is created with vector(2) type.
> -! { dg-final { scan-tree-dump "Executing predictive commoning without 
> unrolling"
> "pcom" { xfail vect_variable_length } } }
> +! { dg-final { scan-tree-dump "Unrolling 2 times" "pcom" { xfail
> vect_variable_length } } }
>  ! { dg-final { scan-tree-dump "vectp_u.*__lsm.* = PHI <.*vectp_u.*__lsm" 
> "pcom"
> { xfail vect_variable_length } } }
> diff --git a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> index f77ec9fb87a..283c36e0ebe 100644
> --- a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> +++ b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> @@ -708,5 +708,5 @@ END SUBROUTINE kernel
> 
>  ! { dg-final { scan-tree-dump-times "vectorized 2\[56\] loops" 1 "vect" { 
> target
> aarch64_sve } } }
>  ! { dg-final { scan-tree-dump-times "vectorized 2\[45\] loops" 1 "vect" { 
> target {
> aarch64*-*-* && { ! aarch64_sve } } } } }
> -! { dg-final { scan-tree-dump-times "vectorized 2\[234\] loops" 1 "vect" { 
> target {
> vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
> +! { dg-final { scan-tree-dump-times "vectorized 2\[345\] loops" 1 "vect" { 
> target {
> vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
>  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target 
> { { !
> vect_intdouble_cvt } && { ! aarch64*-*-* } } } } }
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 5c8b1beda38..f40a530c183 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -4335,6 +4335,7 @@ vect_lower_load_permutations (loop_vec_info
> loop_vinfo,
>  opt_result
>  vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
>  {
> +  loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
>    unsigned int i;
>    stmt_vec_info first_element;
>    slp_instance instance;
> @@ -4351,6 +4352,28 @@ vect_analyze_slp (vec_info *vinfo, unsigned
> max_tree_size)
>      vect_analyze_slp_instance (vinfo, bst_map, first_element,
>                              slp_inst_kind_store, max_tree_size, &limit);
> 
> +  /* For loops also start SLP discovery from non-grouped stores.  */
> +  if (loop_vinfo)
> +    {
> +      data_reference_p dr;
> +      FOR_EACH_VEC_ELT (vinfo->shared->datarefs, i, dr)
> +     if (DR_IS_WRITE (dr))
> +       {
> +         stmt_vec_info stmt_info = vinfo->lookup_dr (dr)->stmt;
> +         /* Grouped stores are already handled above.  */
> +         if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
> +           continue;
> +         vec<stmt_vec_info> stmts;
> +         vec<stmt_vec_info> roots = vNULL;
> +         vec<tree> remain = vNULL;
> +         stmts.create (1);
> +         stmts.quick_push (stmt_info);
> +         vect_build_slp_instance (vinfo, slp_inst_kind_store,
> +                                  stmts, roots, remain, max_tree_size,
> +                                  &limit, bst_map, NULL);
> +       }
> +    }
> +
>    if (bb_vec_info bb_vinfo = dyn_cast <bb_vec_info> (vinfo))
>      {
>        for (unsigned i = 0; i < bb_vinfo->roots.length (); ++i)
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 153348000b2..c743a77f946 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -8334,10 +8334,12 @@ vectorizable_store (vec_info *vinfo,
>        return vectorizable_scan_store (vinfo, stmt_info, gsi, vec_stmt, 
> ncopies);
>      }
> 
> -  if (grouped_store)
> +  if (grouped_store || slp)
>      {
>        /* FORNOW */
> -      gcc_assert (!loop || !nested_in_vect_loop_p (loop, stmt_info));
> +      gcc_assert (!grouped_store
> +               || !loop
> +               || !nested_in_vect_loop_p (loop, stmt_info));
> 
>        if (slp)
>          {
> @@ -8346,8 +8348,9 @@ vectorizable_store (vec_info *vinfo,
>               group.  */
>            vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>         first_stmt_info = SLP_TREE_SCALAR_STMTS (slp_node)[0];
> -       gcc_assert (DR_GROUP_FIRST_ELEMENT (first_stmt_info)
> -                   == first_stmt_info);
> +       gcc_assert (!STMT_VINFO_GROUPED_ACCESS (first_stmt_info)
> +                   || (DR_GROUP_FIRST_ELEMENT (first_stmt_info)
> +                       == first_stmt_info));
>         first_dr_info = STMT_VINFO_DR_INFO (first_stmt_info);
>         op = vect_get_store_rhs (first_stmt_info);
>          }
> --
> 2.43.0
