On Fri, Jun 6, 2025 at 1:26 PM Robin Dapp <rdapp....@gmail.com> wrote:
>
> > In case the riscv strided vector load instruction has additional requirements
> > on the loaded (scalar) element alignment then we'd have to implement this.
> > For the moment the vectorizer will really emit scalar loads here, so that's
> > fine (though eventually inefficient).  For the strided vector load there should
> > be an alignment argument specifying element alignment, like we have for
> > .MASK_LOAD, etc.
>
> Our strided loads are similar to other vector insns in that they need to be
> element aligned for certain microarchitectures.  I guess then we indeed need to
> adjust the IFN.

Yes.  Note I don't see that we guarantee element alignment for gather/scatter
either, nor do the IFNs seem to have encoding space for alignment.  The
effective type for TBAA also seems to be missing there ...

> Regarding vector_vector_composition_type I had a try and attached a preliminary
> V3.  I'm not really happy with it (and I suppose you won't be either) because
> it's now essentially two closely related functions in one with different
> argument requirements (I needed four additional ones).

Indeed :/

I'm not sure whether handling this case as part of VMAT_STRIDED_SLP is
wise.  IIRC we do already choose VMAT_GATHER_SCATTER for some
strided loads, so why not do strided load/store handling as part of
gather/scatter handling?

Sorry to send you from A to B here ...

I think the spotted correctness issues wrt alignment/aliasing should be
addressed up-front.  In the end the gather/stride-load is probably an
UNSPEC, so there's no MEM RTX with wrong info?  How would we query the
target on whether it can handle the alignment here?  Usually we go
through vect_supportable_dr_alignment, which asks
targetm.vectorize.support_vector_misalignment, which in turn gets
packed_p as true in case the scalar load involved isn't aligned
according to its size.  But I'm not sure we'll end up there for
gather/scatter or strided loads.
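
For reference, the hook takes (machine_mode, const_tree, int misalignment,
bool is_packed).  A rough, hypothetical sketch (not the actual riscv
implementation, and TARGET_VECTOR_STRICT_ALIGN is a made-up stand-in for
whatever flag a backend uses) of how element misalignment could be wired
into it, assuming the usual backend includes:

  /* Hypothetical sketch for a GCC backend; not real riscv code.  */
  static bool
  example_support_vector_misalignment (machine_mode mode, const_tree type,
                                       int misalignment, bool is_packed)
  {
    /* is_packed is what the vectorizer passes when the scalar access
       isn't aligned according to its size.  */
    if (is_packed)
      return !TARGET_VECTOR_STRICT_ALIGN;

    /* Unknown misalignment, or one that isn't a multiple of the element
       size, likewise needs misaligned-access support.  */
    if (misalignment == -1
        || misalignment % GET_MODE_UNIT_SIZE (mode) != 0)
      return !TARGET_VECTOR_STRICT_ALIGN;

    /* Otherwise fall back to the generic check.  */
    return default_builtin_support_vector_misalignment (mode, type,
                                                        misalignment,
                                                        is_packed);
  }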

Richard.

> --
> Regards
>  Robin
>
>
> This patch enables strided loads for VMAT_STRIDED_SLP.  Instead of
> building vectors from scalars or other vectors we can use strided loads
> directly when applicable.
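>
> For illustration (a hypothetical kernel, not part of this patch or its
> testsuite), a strided SLP group of two elements that can now be loaded
> with one strided load per vector instead of scalar loads plus a vector
> construction:
>
>   void
>   foo (unsigned short *restrict dst, unsigned short *restrict src,
>        int stride, int n)
>   {
>     for (int i = 0; i < n; i++)
>       {
>         /* Two adjacent elements per iteration, at a runtime stride.  */
>         dst[2 * i] = src[i * stride];
>         dst[2 * i + 1] = src[i * stride + 1];
>       }
>   }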
>
> The current implementation limits strided loads to cases where we can
> load entire groups and not subsets of them.  A future improvement would
> be to load, e.g., a group of three uint8_t elements
>
>   g0 g1  g2,     g0 + stride g1 + stride g2 + stride, ...
>
> by
>
>   vlse16 vlse8
>
> and permute those into place (after re-interpreting as vector of
> uint8_t).
>
> For satd_8x4 in particular we can do even better by eliding the strided
> SLP load permutations, essentially turning
>
>   vlse64 v0, (a0)
>   vlse64 v1, (a1)
>   VEC_PERM_EXPR <v0, v1, { 0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25,
>   26, 27 }>;
>   VEC_PERM_EXPR <v0, v1, { 4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29,
>   30, 31 }>;
>
> into
>
>   vlse32 v0, (a0)
>   vlse32 v1, (a1)
>   vlse32 v0, 4(a0)
>   vlse32 v1, 4(a1)
>
> but that is going to be a follow up.
>
> Bootstrapped and regtested on x86, aarch64, and power10.
> Regtested on rv64gcv_zvl512b.  I'm seeing one additional failure in
> gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-7.c
> where we use a larger LMUL than we should, but IMHO this can wait.
>
>         PR target/118109
>
> gcc/ChangeLog:
>
>         * internal-fn.cc (internal_strided_fn_supported_p): New
>         function.
>         * internal-fn.h (internal_strided_fn_supported_p): Declare.
>         * tree-vect-stmts.cc (vector_vector_composition_type): Add
>         handling for strided accesses.
>         (get_group_load_store_type): Adjust return type of
>         vector_vector_composition_type.
>         (vectorizable_load): Add strided-load support for strided
>         groups.
>         * tree-vectorizer.h (enum vect_composition_kind): New enum.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/riscv/rvv/autovec/pr118019-2.c: New test.
> ---
>  gcc/internal-fn.cc                            |  21 ++
>  gcc/internal-fn.h                             |   2 +
>  .../gcc.target/riscv/rvv/autovec/pr118019-2.c |  51 ++++
>  gcc/tree-vect-stmts.cc                        | 243 ++++++++++++++----
>  gcc/tree-vectorizer.h                         |   9 +
>  5 files changed, 283 insertions(+), 43 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr118019-2.c
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 6b04443f7cd..203ba5ab6f4 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -5203,6 +5203,27 @@ internal_gather_scatter_fn_supported_p (internal_fn ifn, tree vector_type,
>    return ok;
>  }
>
> +/* Return true if the target supports a strided load/store function IFN
> +   with VECTOR_TYPE.  If supported and ELSVALS is nonzero the supported else
> +   values will be added to the vector ELSVALS points to.  */
> +
> +bool
> +internal_strided_fn_supported_p (internal_fn ifn, tree vector_type,
> +                                vec<int> *elsvals)
> +{
> +  machine_mode mode = TYPE_MODE (vector_type);
> +  optab optab = direct_internal_fn_optab (ifn);
> +  insn_code icode = direct_optab_handler (optab, mode);
> +
> +  bool ok = icode != CODE_FOR_nothing;
> +
> +  if (ok && elsvals)
> +    get_supported_else_vals
> +      (icode, internal_fn_else_index (ifn), *elsvals);
> +
> +  return ok;
> +}
> +
>  /* Return true if the target supports IFN_CHECK_{RAW,WAR}_PTRS function IFN
>     for pointers of type TYPE when the accesses have LENGTH bytes and their
>     common byte alignment is ALIGN.  */
> diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
> index afd4f8e64c7..7d386246a42 100644
> --- a/gcc/internal-fn.h
> +++ b/gcc/internal-fn.h
> @@ -242,6 +242,8 @@ extern int internal_fn_stored_value_index (internal_fn);
>  extern bool internal_gather_scatter_fn_supported_p (internal_fn, tree,
>                                                     tree, tree, int,
>                                                     vec<int> * = nullptr);
> +extern bool internal_strided_fn_supported_p (internal_fn ifn, tree vector_type,
> +                                            vec<int> *elsvals);
>  extern bool internal_check_ptrs_fn_supported_p (internal_fn, tree,
>                                                 poly_uint64, unsigned int);
>
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr118019-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr118019-2.c
> new file mode 100644
> index 00000000000..9918d4d7f52
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr118019-2.c
> @@ -0,0 +1,51 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=rv64gcv_zvl512b -mabi=lp64d -mno-vector-strict-align" } */
> +
> +/* Ensure we use strided loads.  */
> +
> +typedef unsigned char uint8_t;
> +typedef unsigned short uint16_t;
> +typedef unsigned int uint32_t;
> +
> +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3)                              \
> +  {                                                                            \
> +    int t0 = s0 + s1;                                                          \
> +    int t1 = s0 - s1;                                                          \
> +    int t2 = s2 + s3;                                                          \
> +    int t3 = s2 - s3;                                                          \
> +    d0 = t0 + t2;                                                              \
> +    d2 = t0 - t2;                                                              \
> +    d1 = t1 + t3;                                                              \
> +    d3 = t1 - t3;                                                              \
> +  }
> +
> +uint32_t
> +abs2 (uint32_t a)
> +{
> +  uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
> +  return (a + s) ^ s;
> +}
> +
> +int
> +x264_pixel_satd_8x4 (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
> +{
> +  uint32_t tmp[4][4];
> +  uint32_t a0, a1, a2, a3;
> +  int sum = 0;
> +  for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
> +    {
> +      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
> +      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
> +      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
> +      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
> +      HADAMARD4 (tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3);
> +    }
> +  for (int i = 0; i < 4; i++)
> +    {
> +      HADAMARD4 (a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]);
> +      sum += abs2 (a0) + abs2 (a1) + abs2 (a2) + abs2 (a3);
> +    }
> +  return (((uint16_t) sum) + ((uint32_t) sum >> 16)) >> 1;
> +}
> +
> +/* { dg-final { scan-assembler-times "vlse64" 8 } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 3710694ac75..801bad929c8 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1977,43 +1977,90 @@ vect_get_store_rhs (stmt_vec_info stmt_info)
>     pieces-size scalar mode for construction further.  It returns NULL_TREE if
>     fails to find the available composition.
>
> +   The other arguments mainly pertain to strided loads:
> +   GROUP_SIZE is the size of the group to load.  For strided loads it needs to
> +   match nelts, but can be larger for vector-vector or vector-element
> +   composition.
> +   VLS_TYPE specifies the type of load/store.  If KIND is nonzero it will be
> +   set to the first composition type that was selected.  Strided loads are
> +   tried first, then vector-vector initialization, then vector-element
> +   initialization.  If strided loads are selected and ELSVALS is nonzero,
> +   it will be populated with the supported else values.
> +
>     For example, for (vtype=V16QI, nelts=4), we can probably get:
>       - V16QI with PTYPE V4QI.
>       - V4SI with PTYPE SI.
>       - NULL_TREE.  */
>
>  static tree
> -vector_vector_composition_type (tree vtype, poly_uint64 nelts, tree *ptype)
> +vector_vector_composition_type (tree vtype, poly_uint64 nelts, tree *ptype,
> +                               unsigned int group_size = 0,
> +                               vec_load_store_type vls_type = VLS_LOAD,
> +                               vect_composition_kind *kind = nullptr,
> +                               vec<int> *elsvals = nullptr)
>  {
>    gcc_assert (VECTOR_TYPE_P (vtype));
>    gcc_assert (known_gt (nelts, 0U));
>
>    machine_mode vmode = TYPE_MODE (vtype);
>    if (!VECTOR_MODE_P (vmode))
> -    return NULL_TREE;
> +    {
> +      if (kind)
> +       *kind = vect_composition_none;
> +      return NULL_TREE;
> +    }
>
>    /* When we are asked to compose the vector from its components let
>       that happen directly.  */
>    if (known_eq (TYPE_VECTOR_SUBPARTS (vtype), nelts))
>      {
>        *ptype = TREE_TYPE (vtype);
> +      if (kind)
> +       *kind = vect_composition_vec;
>        return vtype;
>      }
>
>    poly_uint64 vbsize = GET_MODE_BITSIZE (vmode);
>    unsigned int pbsize;
> +
>    if (constant_multiple_p (vbsize, nelts, &pbsize))
>      {
> -      /* First check if vec_init optab supports construction from
> -        vector pieces directly.  */
> +      /* First, try strided loads.  For now we restrict ourselves to loads
> +        whose element size exactly matches the group size.  */
>        scalar_mode elmode = SCALAR_TYPE_MODE (TREE_TYPE (vtype));
> +      if (known_eq (nelts, group_size)
> +         && int_mode_for_size (pbsize, 0).exists (&elmode))
> +       {
> +         *ptype = build_nonstandard_integer_type (pbsize, 1);
> +         tree strided_vtype
> +           = get_related_vectype_for_scalar_type (TYPE_MODE (vtype),
> +                                                  *ptype, nelts);
> +
> +         internal_fn ifn = vls_type == VLS_LOAD
> +           ? IFN_MASK_LEN_STRIDED_LOAD
> +           : IFN_MASK_LEN_STRIDED_STORE;
> +
> +         if (strided_vtype
> +             && internal_strided_fn_supported_p (ifn, strided_vtype, elsvals))
> +           {
> +             if (kind)
> +               *kind = vect_composition_strided;
> +             return strided_vtype;
> +           }
> +       }
> +
> +      /* Then check if vec_init optab supports construction from
> +        vector pieces directly.  */
>        poly_uint64 inelts = pbsize / GET_MODE_BITSIZE (elmode);
>        machine_mode rmode;
> +
>        if (related_vector_mode (vmode, elmode, inelts).exists (&rmode)
>           && (convert_optab_handler (vec_init_optab, vmode, rmode)
>               != CODE_FOR_nothing))
>         {
>           *ptype = build_vector_type (TREE_TYPE (vtype), inelts);
> +         if (kind)
> +           *kind = vect_composition_vec;
>           return vtype;
>         }
>
> @@ -2025,10 +2072,14 @@ vector_vector_composition_type (tree vtype, poly_uint64 nelts, tree *ptype)
>               != CODE_FOR_nothing))
>         {
>           *ptype = build_nonstandard_integer_type (pbsize, 1);
> +         if (kind)
> +           *kind = vect_composition_elt;
>           return build_vector_type (*ptype, nelts);
>         }
>      }
>
> +  if (kind)
> +    *kind = vect_composition_none;
>    return NULL_TREE;
>  }
>
> @@ -10669,6 +10720,8 @@ vectorizable_load (vec_info *vinfo,
>        tree running_off;
>        vec<constructor_elt, va_gc> *v = NULL;
>        tree stride_base, stride_step, alias_off;
> +      bool strided_load_ok_p = false;
> +      tree stride_step_signed = NULL_TREE;
>        /* Checked by get_load_store_type.  */
>        unsigned int const_nunits = nunits.to_constant ();
>        unsigned HOST_WIDE_INT cst_offset = 0;
> @@ -10744,13 +10797,31 @@ vectorizable_load (vec_info *vinfo,
>           stride_step = cse_and_gimplify_to_preheader (loop_vinfo, stride_step);
>         }
>
> +      tree stride_step_full = NULL_TREE;
> +      auto_vec<tree> dr_chain;
> +
> +      /* For SLP permutation support we need to load the whole group,
> +        not only the number of vector stmts the permutation result
> +        fits in.  */
> +      if (slp_perm)
> +       {
> +         /* We don't yet generate SLP_TREE_LOAD_PERMUTATIONs for
> +            variable VF.  */
> +         unsigned int const_vf = vf.to_constant ();
> +         ncopies = CEIL (group_size * const_vf, const_nunits);
> +         dr_chain.create (ncopies);
> +       }
> +      else
> +       ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> +
> +      /* Initialize for VMAT_ELEMENTWISE i.e. when we must load each element
> +        separately.  */
>        running_off = offvar;
>        alias_off = build_int_cst (ref_type, 0);
>        int nloads = const_nunits;
>        int lnel = 1;
>        tree ltype = TREE_TYPE (vectype);
>        tree lvectype = vectype;
> -      auto_vec<tree> dr_chain;
>        if (memory_access_type == VMAT_STRIDED_SLP)
>         {
>           HOST_WIDE_INT n = gcd (group_size, const_nunits);
> @@ -10786,10 +10857,46 @@ vectorizable_load (vec_info *vinfo,
>           else if (n > 1)
>             {
>               tree ptype;
> -             tree vtype
> -               = vector_vector_composition_type (vectype, const_nunits / n,
> -                                                 &ptype);
> -             if (vtype != NULL_TREE)
> +             enum vect_composition_kind kind;
> +             tree vtype = vector_vector_composition_type
> +               (vectype, const_nunits / n, &ptype,
> +                group_size, VLS_LOAD, &kind, &elsvals);
> +
> +             /* Instead of loading individual vector elements and
> +                constructing a larger vector from them we can use
> +                a strided load directly.
> +                ??? For non-power-of-two groups we could build the
> +                group from smaller element sizes and permute them
> +                into place afterwards instead of relying on a more
> +                rigid vec_init.
> +                ??? For group sizes of 3, 5, 7 we could use masked
> +                strided loads.  */
> +             if (vtype != NULL_TREE
> +                 && kind == vect_composition_strided)
> +               {
> +                 dr_alignment_support dr_align = dr_aligned;
> +                 int mis_align = 0;
> +                 mis_align = dr_misalignment (first_dr_info,
> +                                              vtype);
> +                 dr_align
> +                   = vect_supportable_dr_alignment (vinfo, dr_info,
> +                                                    vtype,
> +                                                    mis_align);
> +                 if (dr_align == dr_aligned
> +                     || dr_align == dr_unaligned_supported)
> +                   {
> +                     nloads = 1;
> +                     lnel = n;
> +                     lvectype = vtype;
> +                     ltype = TREE_TYPE (vtype);
> +                     alignment_support_scheme = dr_align;
> +                     misalignment = mis_align;
> +
> +                     strided_load_ok_p = true;
> +                   }
> +               }
> +
> +             else if (vtype != NULL_TREE)
>                 {
>                   dr_alignment_support dr_align;
>                   int mis_align = 0;
> @@ -10829,19 +10936,34 @@ vectorizable_load (vec_info *vinfo,
>           ltype = build_aligned_type (ltype, align * BITS_PER_UNIT);
>         }
>
> -      /* For SLP permutation support we need to load the whole group,
> -        not only the number of vector stmts the permutation result
> -        fits in.  */
> -      if (slp_perm)
> +      /* We don't use masking here so just use any else value and don't
> +        perform any zeroing.  */
> +      tree vec_els = NULL_TREE;
> +      if (strided_load_ok_p && !costing_p)
>         {
> -         /* We don't yet generate SLP_TREE_LOAD_PERMUTATIONs for
> -            variable VF.  */
> -         unsigned int const_vf = vf.to_constant ();
> -         ncopies = CEIL (group_size * const_vf, const_nunits);
> -         dr_chain.create (ncopies);
> +         gcc_assert (elsvals.length ());
> +         maskload_elsval = *elsvals.begin ();
> +         vec_els = vect_get_mask_load_else (maskload_elsval, lvectype);
> +
> +         stride_step_full
> +           = fold_build2 (MULT_EXPR, TREE_TYPE (stride_step),
> +                          stride_step,
> +                          build_int_cst (TREE_TYPE (stride_step),
> +                                         TYPE_VECTOR_SUBPARTS (lvectype)));
> +         stride_step_full
> +           = cse_and_gimplify_to_preheader (loop_vinfo, stride_step_full);
> +
> +         tree cst_off = build_int_cst (ref_type, cst_offset);
> +         dataref_ptr
> +           = vect_create_data_ref_ptr (vinfo, first_stmt_info, lvectype,
> +                                       loop, cst_off, &dummy, gsi, &ptr_incr,
> +                                       true);
> +         stride_step_signed
> +           = fold_build1 (NOP_EXPR, signed_type_for (TREE_TYPE (stride_step)),
> +                          stride_step);
> +         stride_step_signed
> +           = cse_and_gimplify_to_preheader (loop_vinfo, stride_step_signed);
>         }
> -      else
> -       ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>
>        unsigned int group_el = 0;
>        unsigned HOST_WIDE_INT
> @@ -10859,25 +10981,52 @@ vectorizable_load (vec_info *vinfo,
>             {
>               if (costing_p)
>                 {
> +                 /* TODO: The vectype in stmt_info/slp_node is potentially
> +                    wrong as we could be using a much smaller vectype
> +                    as determined by vector_vector_composition_type.  */
> +                 if (strided_load_ok_p)
> +                   inside_cost += record_stmt_cost (cost_vec, 1,
> +                                                    vector_gather_load,
> +                                                    slp_node, 0,
> +                                                    vect_body);
>                   /* For VMAT_ELEMENTWISE, just cost it as scalar_load to
>                      avoid ICE, see PR110776.  */
> -                 if (VECTOR_TYPE_P (ltype)
> -                     && memory_access_type != VMAT_ELEMENTWISE)
> +                 else if (VECTOR_TYPE_P (ltype)
> +                          && memory_access_type != VMAT_ELEMENTWISE)
>                     n_adjacent_loads++;
>                   else
>                     inside_cost += record_stmt_cost (cost_vec, 1, scalar_load,
>                                                      slp_node, 0, vect_body);
>                   continue;
>                 }
> -             tree this_off = build_int_cst (TREE_TYPE (alias_off),
> -                                            group_el * elsz + cst_offset);
> -             tree data_ref = build2 (MEM_REF, ltype, running_off, this_off);
> -             vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
> -             new_temp = make_ssa_name (ltype);
> -             new_stmt = gimple_build_assign (new_temp, data_ref);
> -             vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> +
> +             if (!strided_load_ok_p)
> +               {
> +                 tree this_off = build_int_cst (TREE_TYPE (alias_off),
> +                                                group_el * elsz + cst_offset);
> +                 tree data_ref = build2 (MEM_REF, ltype, running_off,
> +                                         this_off);
> +                 vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
> +                 new_temp = make_ssa_name (ltype);
> +                 new_stmt = gimple_build_assign (new_temp, data_ref);
> +               }
> +             else
> +               {
> +                 mask_vectype = truth_type_for (lvectype);
> +                 tree final_mask = build_minus_one_cst (mask_vectype);
> +                 tree bias = build_int_cst (intQI_type_node, 0);
> +                 tree len = size_int (TYPE_VECTOR_SUBPARTS (lvectype));
> +                 tree zero = build_zero_cst (lvectype);
> +                 new_stmt
> +                   = gimple_build_call_internal
> +                   (IFN_MASK_LEN_STRIDED_LOAD, 7, dataref_ptr,
> +                    stride_step_signed, zero, final_mask, vec_els, len, bias);
> +                 new_temp = make_ssa_name (lvectype);
> +                 gimple_set_lhs (new_stmt, new_temp);
> +               }
>               if (nloads > 1)
>                 CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, new_temp);
> +             vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
>
>               group_el += lnel;
>               if (group_el == group_size)
> @@ -10888,12 +11037,21 @@ vectorizable_load (vec_info *vinfo,
>                      so just use the last element again.  See PR107451.  */
>                   if (known_lt (n_groups, vf))
>                     {
> -                     tree newoff = copy_ssa_name (running_off);
> -                     gimple *incr
> -                       = gimple_build_assign (newoff, POINTER_PLUS_EXPR,
> -                                              running_off, stride_step);
> -                     vect_finish_stmt_generation (vinfo, stmt_info, incr, gsi);
> -                     running_off = newoff;
> +                     if (!strided_load_ok_p)
> +                       {
> +                         tree newoff = copy_ssa_name (running_off);
> +                         gimple *incr
> +                           = gimple_build_assign (newoff, POINTER_PLUS_EXPR,
> +                                                  running_off, stride_step);
> +                         vect_finish_stmt_generation (vinfo, stmt_info, incr, gsi);
> +                         running_off = newoff;
> +                       }
> +                     else if (strided_load_ok_p && !costing_p)
> +                       {
> +                         dataref_ptr
> +                           = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr, gsi,
> +                                              stmt_info, stride_step_full);
> +                       }
>                     }
>                   group_el = 0;
>                 }
> @@ -10902,7 +11060,8 @@ vectorizable_load (vec_info *vinfo,
>           if (nloads > 1)
>             {
>               if (costing_p)
> -               inside_cost += record_stmt_cost (cost_vec, 1, vec_construct,
> +               inside_cost += record_stmt_cost (cost_vec, 1,
> +                                                vec_construct,
>                                                  slp_node, 0, vect_body);
>               else
>                 {
> @@ -10935,7 +11094,7 @@ vectorizable_load (vec_info *vinfo,
>           if (!costing_p)
>             {
>               if (slp_perm)
> -               dr_chain.quick_push (gimple_assign_lhs (new_stmt));
> +               dr_chain.quick_push (gimple_get_lhs (new_stmt));
>               else
>                 slp_node->push_vec_def (new_stmt);
>             }
> @@ -12105,9 +12264,8 @@ vectorizable_load (vec_info *vinfo,
>                               {
>                                 tree ptype;
>                                 new_vtype
> -                                 = vector_vector_composition_type (vectype,
> -                                                                   num,
> -                                                                   &ptype);
> +                                 = vector_vector_composition_type
> +                                     (vectype, num, &ptype, VLS_LOAD);
>                                 if (new_vtype)
>                                   ltype = ptype;
>                               }
> @@ -12133,9 +12291,8 @@ vectorizable_load (vec_info *vinfo,
>                                   {
>                                     tree ptype;
>                                     new_vtype
> -                                     = vector_vector_composition_type (vectype,
> -                                                                       num,
> -                                                                       &ptype);
> +                                     = vector_vector_composition_type
> +                                        (vectype, num, &ptype, VLS_LOAD);
>                                     if (new_vtype)
>                                       ltype = ptype;
>                                   }
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 7aa2b02b63c..32c7e52a46e 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -80,6 +80,15 @@ enum vect_induction_op_type {
>     vect_step_op_shr
>  };
>
> +/* Define the type of vector composition when building a vector
> +   from smaller elements/vectors.  */
> +enum vect_composition_kind {
> +    vect_composition_none = 0,
> +    vect_composition_vec,
> +    vect_composition_elt,
> +    vect_composition_strided
> +};
> +
>  /* Define type of reduction.  */
>  enum vect_reduction_type {
>    TREE_CODE_REDUCTION,
> --
> 2.49.0
>
