Richard Biener <rguent...@suse.de> writes:
> The following is a prototype for how to represent load/store-lanes
> within SLP.  I've for now settled with having a single load node
> with multiple permute nodes acting as selection, one for each loaded lane
> and a single store node fed from all stored lanes.  For
>
>   for (int i = 0; i < 1024; ++i)
>     {
>       a[2*i] = b[2*i] + 7;
>       a[2*i+1] = b[2*i+1] * 3;
>     }
>
> you have the following SLP graph where I explain how things are set
> up and code-generated:
>
> t.c:23:21: note:   SLP graph after lowering permutations:
> t.c:23:21: note:   node 0x50dc8b0 (max_nunits=1, refcnt=1) vector(4) int
> t.c:23:21: note:   op template: *_6 = _7;
> t.c:23:21: note:        stmt 0 *_6 = _7;
> t.c:23:21: note:        stmt 1 *_12 = _13;
> t.c:23:21: note:        children 0x50dc488 0x50dc6e8
>
> This is the store node, it's marked with ldst_lanes = true during
> SLP discovery.  This node code-generates
>
>   vect_array.65[0] = vect__7.61_29;
>   vect_array.65[1] = vect__13.62_28;
>   MEM <int[8]> [(int *)vectp_a.63_27] = .STORE_LANES (vect_array.65);
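[Illustrative arithmetic: the int[8] array type is group_size (2) * nunits (4) elements, which matches the switch to build_array_type_nelts (elem_type, group_size * nunits) in the vectorizable_store hunk below.]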
>
> ...
> t.c:23:21: note:   node 0x50dc520 (max_nunits=4, refcnt=2) vector(4) int
> t.c:23:21: note:   op: VEC_PERM_EXPR
> t.c:23:21: note:        stmt 0 _5 = *_4;
> t.c:23:21: note:        lane permutation { 0[0] }
> t.c:23:21: note:        children 0x50dc948
> t.c:23:21: note:   node 0x50dc780 (max_nunits=4, refcnt=2) vector(4) int
> t.c:23:21: note:   op: VEC_PERM_EXPR
> t.c:23:21: note:        stmt 0 _11 = *_10;
> t.c:23:21: note:        lane permutation { 0[1] }
> t.c:23:21: note:        children 0x50dc948
>
> These are the selection nodes, marked with ldst_lanes = true.
> They generate no code themselves.
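[For illustration: at code generation these selection nodes just forward one of the load node's vector defs, per the vectorizable_slp_permutation_1 hunk below.  For the { 0[1] } node, vec_idx = 1 / SLP_TREE_LANES (node) = 1 and vec_num = SLP_TREE_LANES (child) / SLP_TREE_LANES (node) = 2, so copy j uses the child's vector def at index j * 2 + 1 - for the first copy that is the def extracted from vect_array[1] by the load-lanes code shown further down.]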
>
> t.c:23:21: note:   node 0x50dc948 (max_nunits=4, refcnt=3) vector(4) int
> t.c:23:21: note:   op template: _5 = *_4;
> t.c:23:21: note:        stmt 0 _5 = *_4;
> t.c:23:21: note:        stmt 1 _11 = *_10;
> t.c:23:21: note:        load permutation { 0 1 }
>
> This is the load node, marked with ldst_lanes = true (the load
> permutation is only accurate when taking into account the lane permute
> in the selection nodes).  It code-generates
>
>   vect_array.58 = .LOAD_LANES (MEM <int[8]> [(int *)vectp_b.56_33]);
>   vect__5.59_31 = vect_array.58[0];
>   vect__5.60_30 = vect_array.58[1];
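As a rough scalar model of what the .LOAD_LANES/.STORE_LANES pair computes
for one vector iteration of the loop above (illustration only, assuming
4-element int vectors, i.e. an AArch64 ld2/st2 style target):

  void
  ld2_st2_model (int * __restrict a, int *b)
  {
    int lane0[4], lane1[4];          /* vect_array[0] / vect_array[1]  */
    for (int j = 0; j < 4; ++j)      /* .LOAD_LANES de-interleaves     */
      {
        lane0[j] = b[2 * j];
        lane1[j] = b[2 * j + 1];
      }
    for (int j = 0; j < 4; ++j)      /* the two SLP lanes' operations  */
      {
        lane0[j] += 7;
        lane1[j] *= 3;
      }
    for (int j = 0; j < 4; ++j)      /* .STORE_LANES re-interleaves    */
      {
        a[2 * j] = lane0[j];
        a[2 * j + 1] = lane1[j];
      }
  }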
>
> This scheme allows code generation in vectorizable_load/store to stay
> mostly as-is.
>
> While this should support both load-lanes and (masked) store-lanes,
> the decision to use either is made at SLP discovery time and
> cannot be reversed without altering the SLP tree.  As-is, the SLP
> tree is not usable for non-store-lanes on the store side; the
> load side is OK representation-wise but will very likely fail
> permute handling, since the lowering that deals with the two-input-vector
> restriction isn't done - and because the permute node is
> marked as to-be-ignored, that doesn't work out either.  So I've put
> restrictions in place that fail vectorization if a load/store-lane
> SLP tree is later classified differently by get_load_store_type.
>
> With this I've disabled the code that scraps SLP, as it will no longer
> fire.  I'll note that, for example, gcc.target/aarch64/sve/mask_struct_store_3.c
> will not get SLP store-lanes used, because the full store SLPs just
> fine - though we then fail to handle the "splat" load-permutation
> t2.c:5:21: note:   node 0x4db2630 (max_nunits=4, refcnt=2) vector([4,4]) int
> t2.c:5:21: note:   op template: _6 = *_5;
> t2.c:5:21: note:        stmt 0 _6 = *_5;
> t2.c:5:21: note:        stmt 1 _6 = *_5;
> t2.c:5:21: note:        stmt 2 _6 = *_5;
> t2.c:5:21: note:        stmt 3 _6 = *_5;
> t2.c:5:21: note:        load permutation { 0 0 0 0 }
>
> since the load permute lowering code currently doesn't consider single loads
> from a group (or, in this case, non-grouped loads) worth lowering.
> The expectation is that the target can handle this with two interleaves of
> the loaded vector with itself.
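A rough fixed-length sketch of that trick (illustration only - the dump above
uses VLA vectors - showing how the low part of two self-interleaves yields a
four-fold duplication of the first loaded element):

  typedef int v4si __attribute__ ((vector_size (16)));

  /* Splat element 0 of V across all four lanes using two interleaves
     of the vector with itself.  */
  static v4si
  splat_by_self_interleave (v4si v)
  {
    /* { a, b, c, d } -> { a, a, b, b }  */
    v4si t = __builtin_shufflevector (v, v, 0, 4, 1, 5);
    /* { a, a, b, b } -> { a, a, a, a }  */
    return __builtin_shufflevector (t, t, 0, 4, 1, 5);
  }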
>
> So what we see here is that while the explicit SLP representation is
> helpful in some cases, in cases like this it would require changing
> the representation when we make decisions about how to vectorize.  My
> expectation is that this will all change a lot when we re-do SLP discovery
> (for loops) and when we get rid of non-SLP, as I think vectorizable_*
> should be allowed to alter the SLP graph during analysis.
>
> I'm not sure what the best way forward is - can we decide to
> live with (temporary) regressions in this area?  There is the possibility
> of doing the "non-SLP" mode by forcing single-lane discovery everywhere(?)
> as a temporary measure.  Unfortunately this alters the VF and thus
> cannot, I think, be done on-the-fly per SLP instance (much like we cannot
> currently cancel only one SLP instance without a full re-analysis).

Living with temporary regressions sounds good to me.  For something
as complicated as this transition, it seems better to concentrate on how
the final form should look rather than artificially constrain things to
be incremental improvements.  We can put any missing wheels back on
during stage 3.

Not sure I understood everything in the patch, but it LGTM FWIW.
(Pre-existing trivia: s/catched/caught/.)

Richard

>       * tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark
>       load, store and permute nodes.
>       * tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize ldst_lanes.
>       (vect_build_slp_instance): For stores, iff the target prefers
>       store-lanes, discover single-lane sub-groups; do not perform
>       interleaving lowering but mark the node with ldst_lanes.
>       (vect_lower_load_permutations): When the target supports
>       load-lanes and the loads all fit the pattern, split out
>       only a single level of permutes and mark the load and
>       permute nodes with ldst_lanes.
>       (vectorizable_slp_permutation_1): Handle the load-lane permute
>       forwarding of vector defs.
>       * tree-vect-stmts.cc (get_group_load_store_type): Support
>       load/store-lanes for SLP.
>       (vectorizable_store): Support SLP code generation for store-lanes.
>       (vectorizable_load): Support SLP code generation for load-lanes.
>       * tree-vect-loop.cc (vect_analyze_loop_2): Do not cancel SLP
>       when store-lanes can be used.
>
>       * gcc.dg/vect/slp-55.c: New testcase.
>       * gcc.dg/vect/slp-56.c: Likewise.
> ---
>  gcc/testsuite/gcc.dg/vect/slp-55.c |  37 +++++++++
>  gcc/testsuite/gcc.dg/vect/slp-56.c |  51 ++++++++++++
>  gcc/tree-vect-loop.cc              |  76 ------------------
>  gcc/tree-vect-slp.cc               | 122 +++++++++++++++++++++++++++--
>  gcc/tree-vect-stmts.cc             | 119 ++++++++++++++++++++++------
>  gcc/tree-vectorizer.h              |   3 +
>  6 files changed, 299 insertions(+), 109 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-55.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-56.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-55.c b/gcc/testsuite/gcc.dg/vect/slp-55.c
> new file mode 100644
> index 00000000000..0bf65ef6dc4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/slp-55.c
> @@ -0,0 +1,37 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target vect_int_mult } */
> +/* { dg-additional-options "-fdump-tree-optimized" } */
> +
> +void foo (int * __restrict a, int *b, int *c)
> +{
> +  for (int i = 0; i < 1024; ++i)
> +    {
> +      a[2*i] = b[i] + 7;
> +      a[2*i+1] = c[i] * 3;
> +    }
> +}
> +
> +int bar (int *b)
> +{
> +  int res = 0;
> +  for (int i = 0; i < 1024; ++i)
> +    {
> +      res += b[2*i] + 7;
> +      res += b[2*i+1] * 3;
> +    }
> +  return res;
> +}
> +
> +void baz (int * __restrict a, int *b)
> +{
> +  for (int i = 0; i < 1024; ++i)
> +    {
> +      a[2*i] = b[2*i] + 7;
> +      a[2*i+1] = b[2*i+1] * 3;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "LOAD_LANES" 2 "optimized" { target vect_load_lanes } } } */
> +/* { dg-final { scan-tree-dump-times "STORE_LANES" 2 "optimized" { target vect_load_lanes } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-56.c b/gcc/testsuite/gcc.dg/vect/slp-56.c
> new file mode 100644
> index 00000000000..0b985eae55e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/slp-56.c
> @@ -0,0 +1,51 @@
> +#include "tree-vect.h"
> +
> +/* This is a load-lane / masked-store-lane test that more reliably
> +   triggers SLP than SVE's mask_struct_store_*.c.  */
> +
> +void __attribute__ ((noipa))
> +test4 (int *__restrict dest, int *__restrict src,
> +       int *__restrict cond, int bias, int n)
> +{
> +  for (int i = 0; i < n; ++i)
> +    {
> +      int value0 = src[i * 4] + bias;
> +      int value1 = src[i * 4 + 1] * bias;
> +      int value2 = src[i * 4 + 2] + bias;
> +      int value3 = src[i * 4 + 3] * bias;
> +      if (cond[i])
> +        {
> +          dest[i * 4] = value0;
> +          dest[i * 4 + 1] = value1;
> +          dest[i * 4 + 2] = value2;
> +          dest[i * 4 + 3] = value3;
> +        }
> +    }
> +}
> +
> +int dest[16*4];
> +int src[16*4];
> +int cond[16];
> +const int dest_chk[16*4] = {0, 0, 0, 0, 9, 25, 11, 35, 0, 0, 0, 0, 17, 65, 19,
> +    75, 0, 0, 0, 0, 25, 105, 27, 115, 0, 0, 0, 0, 33, 145, 35, 155, 0, 0, 0,
> +    0, 41, 185, 43, 195, 0, 0, 0, 0, 49, 225, 51, 235, 0, 0, 0, 0, 57, 265, 59,
> +    275, 0, 0, 0, 0, 65, 305, 67, 315};
> +
> +int main()
> +{
> +  check_vect ();
> +#pragma GCC novector
> +  for (int i = 0; i < 16; ++i)
> +    cond[i] = i & 1;
> +#pragma GCC novector
> +  for (int i = 0; i < 16 * 4; ++i)
> +    src[i] = i;
> +  test4 (dest, src, cond, 5, 16);
> +#pragma GCC novector
> +  for (int i = 0; i < 16 * 4; ++i)
> +    if (dest[i] != dest_chk[i])
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target { vect_variable_length && vect_load_lanes } } } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index a64b5082bd1..0d48c4980ce 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2957,82 +2957,6 @@ start_over:
>                                      "unsupported SLP instances\n");
>         goto again;
>       }
> -
> -      /* Check whether any load in ALL SLP instances is possibly permuted.  */
> -      slp_tree load_node, slp_root;
> -      unsigned i, x;
> -      slp_instance instance;
> -      bool can_use_lanes = true;
> -      FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (loop_vinfo), x, instance)
> -     {
> -       slp_root = SLP_INSTANCE_TREE (instance);
> -       int group_size = SLP_TREE_LANES (slp_root);
> -       tree vectype = SLP_TREE_VECTYPE (slp_root);
> -       bool loads_permuted = false;
> -       FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), i, load_node)
> -         {
> -           if (!SLP_TREE_LOAD_PERMUTATION (load_node).exists ())
> -             continue;
> -           unsigned j;
> -           stmt_vec_info load_info;
> -           FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (load_node), j, load_info)
> -             if (SLP_TREE_LOAD_PERMUTATION (load_node)[j] != j)
> -               {
> -                 loads_permuted = true;
> -                 break;
> -               }
> -         }
> -
> -       /* If the loads and stores can be handled with load/store-lane
> -          instructions record it and move on to the next instance.  */
> -       if (loads_permuted
> -           && SLP_INSTANCE_KIND (instance) == slp_inst_kind_store
> -           && vect_store_lanes_supported (vectype, group_size, false)
> -                != IFN_LAST)
> -         {
> -           FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), i, load_node)
> -             if (STMT_VINFO_GROUPED_ACCESS
> -                   (SLP_TREE_REPRESENTATIVE (load_node)))
> -               {
> -                 stmt_vec_info stmt_vinfo = DR_GROUP_FIRST_ELEMENT
> -                     (SLP_TREE_REPRESENTATIVE (load_node));
> -                 /* Use SLP for strided accesses (or if we can't
> -                    load-lanes).  */
> -                 if (STMT_VINFO_STRIDED_P (stmt_vinfo)
> -                     || vect_load_lanes_supported
> -                          (STMT_VINFO_VECTYPE (stmt_vinfo),
> -                           DR_GROUP_SIZE (stmt_vinfo), false) == IFN_LAST)
> -                   break;
> -               }
> -
> -           can_use_lanes
> -             = can_use_lanes && i == SLP_INSTANCE_LOADS (instance).length ();
> -
> -           if (can_use_lanes && dump_enabled_p ())
> -             dump_printf_loc (MSG_NOTE, vect_location,
> -                              "SLP instance %p can use load/store-lanes\n",
> -                              (void *) instance);
> -         }
> -       else
> -         {
> -           can_use_lanes = false;
> -           break;
> -         }
> -     }
> -
> -      /* If all SLP instances can use load/store-lanes abort SLP and try again
> -      with SLP disabled.  */
> -      if (can_use_lanes)
> -     {
> -       ok = opt_result::failure_at (vect_location,
> -                                    "Built SLP cancelled: can use "
> -                                    "load/store-lanes\n");
> -       if (dump_enabled_p ())
> -         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                          "Built SLP cancelled: all SLP instances support "
> -                          "load/store-lanes\n");
> -       goto again;
> -     }
>      }
>  
>    /* Dissolve SLP-only groups.  */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 2dc6d365303..79fb12f134b 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -120,6 +120,7 @@ _slp_tree::_slp_tree ()
>    SLP_TREE_SIMD_CLONE_INFO (this) = vNULL;
>    SLP_TREE_DEF_TYPE (this) = vect_uninitialized_def;
>    SLP_TREE_CODE (this) = ERROR_MARK;
> +  this->ldst_lanes = false;
>    SLP_TREE_VECTYPE (this) = NULL_TREE;
>    SLP_TREE_REPRESENTATIVE (this) = NULL;
>    SLP_TREE_REF_COUNT (this) = 1;
> @@ -3600,10 +3601,27 @@ vect_build_slp_instance (vec_info *vinfo,
>        /* For loop vectorization split the RHS into arbitrary pieces of
>        size >= 1.  */
>        else if (is_a <loop_vec_info> (vinfo)
> -            && (i > 0 && i < group_size)
> -            && !vect_slp_prefer_store_lanes_p (vinfo,
> -                                               stmt_info, group_size, i))
> -     {
> +            && (i > 0 && i < group_size))
> +     {
> +       /* There are targets that cannot do even/odd interleaving schemes
> +          so they absolutely need to use load/store-lanes.  For now
> +          force single-lane SLP for them - they would be happy with
> +          uniform power-of-two lanes (but depending on element size),
> +          but even if we can use 'i' as indicator we would need to
> +          backtrack when later lanes fail to discover with the same
> +          granularity.  We cannot turn any of strided or scatter store
> +          into store-lanes.  */
> +       /* ???  If this is not in sync with what get_load_store_type
> +          later decides the SLP representation is not good for other
> +          store vectorization methods.  */
> +       bool want_store_lanes
> +         = (! STMT_VINFO_GATHER_SCATTER_P (stmt_info)
> +            && ! STMT_VINFO_STRIDED_P (stmt_info)
> +            && vect_slp_prefer_store_lanes_p (vinfo, stmt_info,
> +                                              group_size, 1));
> +       if (want_store_lanes)
> +         i = 1;
> +
>         if (dump_enabled_p ())
>           dump_printf_loc (MSG_NOTE, vect_location,
>                            "Splitting SLP group at stmt %u\n", i);
> @@ -3637,7 +3655,10 @@ vect_build_slp_instance (vec_info *vinfo,
>                                              (max_nunits, end - start));
>                 rhs_nodes.safe_push (node);
>                 start = end;
> -               end = group_size;
> +               if (want_store_lanes)
> +                 end = start + 1;
> +               else
> +                 end = group_size;
>               }
>             else
>               {
> @@ -3676,6 +3697,24 @@ vect_build_slp_instance (vec_info *vinfo,
>                                          SLP_TREE_CHILDREN
>                                            (rhs_nodes[0]).length ());
>         SLP_TREE_VECTYPE (node) = SLP_TREE_VECTYPE (rhs_nodes[0]);
> +       if (want_store_lanes)
> +         {
> +           /* For store-lanes feed the store node with all RHS nodes
> +              in order.  */
> +           node->ldst_lanes = true;
> +           SLP_TREE_CHILDREN (node)
> +             .reserve_exact (SLP_TREE_CHILDREN (rhs_nodes[0]).length ()
> +                             + rhs_nodes.length () - 1);
> +           /* First store value and possibly mask.  */
> +           SLP_TREE_CHILDREN (node)
> +             .splice (SLP_TREE_CHILDREN (rhs_nodes[0]));
> +           /* Rest of the store values.  All mask nodes are the same,
> +              this should be guaranteed by dataref group discovery.  */
> +           for (unsigned j = 1; j < rhs_nodes.length (); ++j)
> +             SLP_TREE_CHILDREN (node)
> +               .quick_push (SLP_TREE_CHILDREN (rhs_nodes[j])[0]);
> +         }
> +       else
>         for (unsigned l = 0;
>              l < SLP_TREE_CHILDREN (rhs_nodes[0]).length (); ++l)
>           {
> @@ -4057,6 +4096,42 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
>    if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
>      return;
>  
> +  /* Verify if all load permutations can be implemented with a suitably
> +     large element load-lanes operation.  */
> +  unsigned ld_lanes_lanes = SLP_TREE_LANES (loads[0]);
> +  if (exact_log2 (ld_lanes_lanes) == -1
> +      /* ???  For now only support the single-lane case as there is
> +      missing support on the store-lane side and code generation
> +      isn't up to the task yet.  */
> +      || ld_lanes_lanes != 1
> +      || vect_load_lanes_supported (SLP_TREE_VECTYPE (loads[0]),
> +                                 group_lanes / ld_lanes_lanes,
> +                                 false) == IFN_LAST)
> +    ld_lanes_lanes = 0;
> +  else
> +    /* Verify the loads access the same number of lanes aligned to
> +       ld_lanes_lanes.  */
> +    for (slp_tree load : loads)
> +      {
> +     if (SLP_TREE_LANES (load) != ld_lanes_lanes)
> +       {
> +         ld_lanes_lanes = 0;
> +         break;
> +       }
> +     unsigned first = SLP_TREE_LOAD_PERMUTATION (load)[0];
> +     if (first % ld_lanes_lanes != 0)
> +       {
> +         ld_lanes_lanes = 0;
> +         break;
> +       }
> +     for (unsigned i = 1; i < SLP_TREE_LANES (load); ++i)
> +       if (SLP_TREE_LOAD_PERMUTATION (load)[i] != first + i)
> +         {
> +           ld_lanes_lanes = 0;
> +           break;
> +         }
> +      }
> +
>    for (slp_tree load : loads)
>      {
>        /* Leave masked or gather loads alone for now.  */
> @@ -4071,7 +4146,8 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
>        with a non-1:1 load permutation around instead of canonicalizing
>        those into a load and a permute node.  Removing this early
>        check would do such canonicalization.  */
> -      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
> +      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2
> +       && ld_lanes_lanes == 0)
>       continue;
>  
>        /* First build (and possibly re-use) a load node for the
> @@ -4104,10 +4180,20 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
>       final_perm.quick_push
>         (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
>  
> +      if (ld_lanes_lanes != 0)
> +     {
> +       /* ???  If this is not in sync with what get_load_store_type
> +          later decides the SLP representation is not good for other
> +          store vectorization methods.  */
> +       l0->ldst_lanes = true;
> +       load->ldst_lanes = true;
> +     }
> +
>        while (1)
>       {
>         unsigned group_lanes = SLP_TREE_LANES (l0);
> -       if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
> +       if (ld_lanes_lanes != 0
> +           || SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
>           break;
>  
>         /* Try to lower by reducing the group to half its size using an
> @@ -9758,6 +9844,28 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
>  
>    gcc_assert (perm.length () == SLP_TREE_LANES (node));
>  
> +  /* Load-lanes permute.  This permute only acts as a forwarder to
> +     select the correct vector def of the load-lanes load which
> +     has the permuted vectors in its vector defs like
> +     { v0, w0, r0, v1, w1, r1 ... } for a ld3.  */
> +  if (node->ldst_lanes)
> +    {
> +      gcc_assert (children.length () == 1);
> +      if (!gsi)
> +     /* This is a trivial op always supported.  */
> +     return 1;
> +      slp_tree child = children[0];
> +      unsigned vec_idx = (SLP_TREE_LANE_PERMUTATION (node)[0].second
> +                       / SLP_TREE_LANES (node));
> +      unsigned vec_num = SLP_TREE_LANES (child) / SLP_TREE_LANES (node);
> +      for (unsigned i = 0; i < SLP_TREE_NUMBER_OF_VEC_STMTS (node); ++i)
> +     {
> +       tree def = SLP_TREE_VEC_DEFS (child)[i * vec_num  + vec_idx];
> +       node->push_vec_def (def);
> +     }
> +      return 1;
> +    }
> +
>    /* REPEATING_P is true if every output vector is guaranteed to use the
>       same permute vector.  We can handle that case for both variable-length
>       and constant-length vectors, but we only handle other cases for
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index fdcda0d2aba..1edcc1f819b 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1508,7 +1508,8 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>  
>    unsigned int nvectors;
>    if (slp_node)
> -    nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> +    /* ???  Incorrect for multi-lane lanes.  */
> +    nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size;
>    else
>      nvectors = vect_get_num_copies (loop_vinfo, vectype);
>  
> @@ -2069,6 +2070,14 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>                is irrelevant for them.  */
>             *alignment_support_scheme = dr_unaligned_supported;
>           }
> +       /* Try using LOAD/STORE_LANES.  */
> +       else if (slp_node->ldst_lanes
> +                && (*lanes_ifn
> +                      = (vls_type == VLS_LOAD
> +                         ? vect_load_lanes_supported (vectype, group_size, masked_p)
> +                         : vect_store_lanes_supported (vectype, group_size,
> +                                                       masked_p))) != IFN_LAST)
> +         *memory_access_type = VMAT_LOAD_STORE_LANES;
>         else
>           *memory_access_type = VMAT_CONTIGUOUS;
>  
> @@ -8189,6 +8198,16 @@ vectorizable_store (vec_info *vinfo,
>                           &lanes_ifn))
>      return false;
>  
> +  if (slp_node
> +      && slp_node->ldst_lanes
> +      && memory_access_type != VMAT_LOAD_STORE_LANES)
> +    {
> +      if (dump_enabled_p ())
> +     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                      "discovered store-lane but cannot use it.\n");
> +      return false;
> +    }
> +
>    if (mask)
>      {
>        if (memory_access_type == VMAT_CONTIGUOUS)
> @@ -8705,7 +8724,7 @@ vectorizable_store (vec_info *vinfo,
>    else
>      {
>        if (memory_access_type == VMAT_LOAD_STORE_LANES)
> -     aggr_type = build_array_type_nelts (elem_type, vec_num * nunits);
> +     aggr_type = build_array_type_nelts (elem_type, group_size * nunits);
>        else
>       aggr_type = vectype;
>        bump = vect_get_data_ptr_increment (vinfo, gsi, dr_info, aggr_type,
> @@ -8762,11 +8781,24 @@ vectorizable_store (vec_info *vinfo,
>  
>    if (memory_access_type == VMAT_LOAD_STORE_LANES)
>      {
> -      gcc_assert (!slp && grouped_store);
> +      if (costing_p && slp_node)
> +     /* Update all incoming store operand nodes, the general handling
> +        above only handles the mask and the first store operand node.  */
> +     for (slp_tree child : SLP_TREE_CHILDREN (slp_node))
> +       if (child != mask_node
> +           && !vect_maybe_update_slp_op_vectype (child, vectype))
> +         {
> +           if (dump_enabled_p ())
> +             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                              "incompatible vector types for invariants\n");
> +           return false;
> +         }
>        unsigned inside_cost = 0, prologue_cost = 0;
>        /* For costing some adjacent vector stores, we'd like to cost with
>        the total number of them once instead of cost each one by one. */
>        unsigned int n_adjacent_stores = 0;
> +      if (slp)
> +     ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size;
>        for (j = 0; j < ncopies; j++)
>       {
>         gimple *new_stmt;
> @@ -8784,7 +8816,7 @@ vectorizable_store (vec_info *vinfo,
>                 op = vect_get_store_rhs (next_stmt_info);
>                 if (costing_p)
>                   update_prologue_cost (&prologue_cost, op);
> -               else
> +               else if (!slp)
>                   {
>                     vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
>                                                    ncopies, op,
> @@ -8799,15 +8831,15 @@ vectorizable_store (vec_info *vinfo,
>               {
>                 if (mask)
>                   {
> -                   vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
> -                                                  mask, &vec_masks,
> -                                                  mask_vectype);
> +                   if (slp_node)
> +                     vect_get_slp_defs (mask_node, &vec_masks);
> +                   else
> +                     vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
> +                                                    mask, &vec_masks,
> +                                                    mask_vectype);
>                     vec_mask = vec_masks[0];
>                   }
>  
> -               /* We should have catched mismatched types earlier.  */
> -               gcc_assert (
> -                 useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
>                 dataref_ptr
>                   = vect_create_data_ref_ptr (vinfo, first_stmt_info,
>                                               aggr_type, NULL, offset, &dummy,
> @@ -8819,10 +8851,16 @@ vectorizable_store (vec_info *vinfo,
>             gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>             /* DR_CHAIN is then used as an input to
>                vect_permute_store_chain().  */
> -           for (i = 0; i < group_size; i++)
> +           if (!slp)
>               {
> -               vec_oprnd = (*gvec_oprnds[i])[j];
> -               dr_chain[i] = vec_oprnd;
> +               /* We should have catched mismatched types earlier.  */
> +               gcc_assert (
> +                 useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
> +               for (i = 0; i < group_size; i++)
> +                 {
> +                   vec_oprnd = (*gvec_oprnds[i])[j];
> +                   dr_chain[i] = vec_oprnd;
> +                 }
>               }
>             if (mask)
>               vec_mask = vec_masks[j];
> @@ -8832,12 +8870,12 @@ vectorizable_store (vec_info *vinfo,
>  
>         if (costing_p)
>           {
> -           n_adjacent_stores += vec_num;
> +           n_adjacent_stores += group_size;
>             continue;
>           }
>  
>         /* Get an array into which we can store the individual vectors.  */
> -       tree vec_array = create_vector_array (vectype, vec_num);
> +       tree vec_array = create_vector_array (vectype, group_size);
>  
>         /* Invalidate the current contents of VEC_ARRAY.  This should
>            become an RTL clobber too, which prevents the vector registers
> @@ -8845,9 +8883,19 @@ vectorizable_store (vec_info *vinfo,
>         vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
>  
>         /* Store the individual vectors into the array.  */
> -       for (i = 0; i < vec_num; i++)
> +       for (i = 0; i < group_size; i++)
>           {
> -           vec_oprnd = dr_chain[i];
> +           if (slp)
> +             {
> +               slp_tree child;
> +               if (i == 0 || !mask_node)
> +                 child = SLP_TREE_CHILDREN (slp_node)[i];
> +               else
> +                 child = SLP_TREE_CHILDREN (slp_node)[i + 1];
> +               vec_oprnd = SLP_TREE_VEC_DEFS (child)[j];
> +             }
> +           else
> +             vec_oprnd = dr_chain[i];
>             write_vector_array (vinfo, stmt_info, gsi, vec_oprnd, vec_array,
>                                 i);
>           }
> @@ -8917,9 +8965,10 @@ vectorizable_store (vec_info *vinfo,
>  
>         /* Record that VEC_ARRAY is now dead.  */
>         vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
> -       if (j == 0)
> +       if (j == 0 && !slp)
>           *vec_stmt = new_stmt;
> -       STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
> +       if (!slp)
> +         STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>       }
>  
>        if (costing_p)
> @@ -10023,6 +10072,16 @@ vectorizable_load (vec_info *vinfo,
>                           &lanes_ifn))
>      return false;
>  
> +  if (slp_node
> +      && slp_node->ldst_lanes
> +      && memory_access_type != VMAT_LOAD_STORE_LANES)
> +    {
> +      if (dump_enabled_p ())
> +     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                      "discovered load-lane but cannot use it.\n");
> +      return false;
> +    }
> +
>    if (mask)
>      {
>        if (memory_access_type == VMAT_CONTIGUOUS)
> @@ -10765,12 +10824,13 @@ vectorizable_load (vec_info *vinfo,
>      {
>        gcc_assert (alignment_support_scheme == dr_aligned
>                 || alignment_support_scheme == dr_unaligned_supported);
> -      gcc_assert (grouped_load && !slp);
>  
>        unsigned int inside_cost = 0, prologue_cost = 0;
>        /* For costing some adjacent vector loads, we'd like to cost with
>        the total number of them once instead of cost each one by one. */
>        unsigned int n_adjacent_loads = 0;
> +      if (slp_node)
> +     ncopies = slp_node->vec_stmts_size / vec_num;
>        for (j = 0; j < ncopies; j++)
>       {
>         if (costing_p)
> @@ -10884,24 +10944,31 @@ vectorizable_load (vec_info *vinfo,
>         gimple_call_set_nothrow (call, true);
>         vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
>  
> -       dr_chain.create (vec_num);
> +       if (!slp)
> +         dr_chain.create (vec_num);
>         /* Extract each vector into an SSA_NAME.  */
>         for (i = 0; i < vec_num; i++)
>           {
>             new_temp = read_vector_array (vinfo, stmt_info, gsi, scalar_dest,
>                                           vec_array, i);
> -           dr_chain.quick_push (new_temp);
> +           if (slp)
> +             slp_node->push_vec_def (new_temp);
> +           else
> +             dr_chain.quick_push (new_temp);
>           }
>  
> -       /* Record the mapping between SSA_NAMEs and statements.  */
> -       vect_record_grouped_load_vectors (vinfo, stmt_info, dr_chain);
> +       if (!slp)
> +         /* Record the mapping between SSA_NAMEs and statements.  */
> +         vect_record_grouped_load_vectors (vinfo, stmt_info, dr_chain);
>  
>         /* Record that VEC_ARRAY is now dead.  */
>         vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
>  
> -       dr_chain.release ();
> +       if (!slp)
> +         dr_chain.release ();
>  
> -       *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
> +       if (!slp_node)
> +         *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>       }
>  
>        if (costing_p)
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 8eb3ec4df86..ac288541c51 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -222,6 +222,9 @@ struct _slp_tree {
>    unsigned int lanes;
>    /* The operation of this node.  */
>    enum tree_code code;
> +  /* Whether uses of this load or feeders of this store are suitable
> +     for load/store-lanes.  */
> +  bool ldst_lanes;
>  
>    int vertex;
