"Kewen.Lin" <li...@linux.ibm.com> writes:
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 06a04e3d7dd..284c15705ea 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -13389,6 +13389,13 @@ by the copy loop headers pass.
>  @item vect-epilogues-nomask
>  Enable loop epilogue vectorization using smaller vector size.
>  
> +@item vect-with-length-scope

In principle there's nothing length-specific about this option.
We could do the same for masks or for any future loop control
mechanism.  So how about vect-partial-vector-usage instead?

> +Control the scope of vector memory access with length exploitation.  0 means we
> +don't expliot any vector memory access with length, 1 means we only exploit
> +vector memory access with length for those loops whose iteration number are
> +less than VF, such as very small loop or epilogue, 2 means we want to exploit
> +vector memory access with length for any loops if possible.

Maybe:

  Controls when the loop vectorizer considers using partial vector loads
  and stores as an alternative to falling back to scalar code.  0 stops
  the vectorizer from ever using partial vector loads and stores.  1 allows
  partial vector loads and stores if vectorization removes the need for the
  code to iterate.  2 allows partial vector loads and stores in all loops.
  The parameter only has an effect on targets that support partial
  vector loads and stores.
  
> diff --git a/gcc/optabs-query.c b/gcc/optabs-query.c
> index 215d68e4225..9c351759204 100644
> --- a/gcc/optabs-query.c
> +++ b/gcc/optabs-query.c
> @@ -606,6 +606,60 @@ can_vec_mask_load_store_p (machine_mode mode,
>    return false;
>  }
>  
> +/* Return true if target supports vector load/store with length for vector
> +   mode MODE.  There are two flavors for vector load/store with length, one
> +   is to measure length with bytes, the other is to measure length with lanes.
> +   As len_{load,store} optabs point out, for the flavor with bytes, we use
> +   VnQI to wrap the other supportable same size vector modes.  Here the
> +   pointer FACTOR is to indicate that it is using VnQI to wrap if its value
> +   more than 1 and how many bytes for one element of wrapped vector mode.  */
> +
> +bool
> +can_vec_len_load_store_p (machine_mode mode, bool is_load, unsigned int *factor)
> +{
> +  optab op = is_load ? len_load_optab : len_store_optab;
> +  gcc_assert (VECTOR_MODE_P (mode));
> +
> +  /* Check if length in lanes supported for this mode directly.  */
> +  if (direct_optab_handler (op, mode))
> +    {
> +      *factor = 1;
> +      return true;
> +    }
> +
> +  /* Check if length in bytes supported for VnQI with the same vector size.  */
> +  scalar_mode emode = QImode;
> +  poly_uint64 esize = GET_MODE_SIZE (emode);

This is always equal to 1, so…

> +  poly_uint64 vsize = GET_MODE_SIZE (mode);
> +  poly_uint64 nunits;
> +
> +  /* To get how many nunits it would have if the element is QImode.  */
> +  if (multiple_p (vsize, esize, &nunits))
> +    {

…we can just set nunits to GET_MODE_SIZE (mode).

> +      machine_mode vmode;
> +      /* Check whether the related VnQI vector mode exists, as well as
> +      optab supported.  */
> +      if (related_vector_mode (mode, emode, nunits).exists (&vmode)
> +       && direct_optab_handler (op, vmode))
> +     {
> +       unsigned int mul;
> +       scalar_mode orig_emode = GET_MODE_INNER (mode);
> +       poly_uint64 orig_esize = GET_MODE_SIZE (orig_emode);
> +
> +       if (constant_multiple_p (orig_esize, esize, &mul))
> +         *factor = mul;
> +       else
> +         gcc_unreachable ();

This is just:

          *factor = GET_MODE_UNIT_SIZE (mode);

However, I think it would be better to return the vector mode that the
load or store should use, instead of this factor.  That way we can reuse
it when generating the load and store statements.

So maybe call the function get_len_load_store_mode and return an
opt_machine_mode.
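
Completely untested, but roughly the sort of thing I mean (the comment
wording doesn't matter):

  /* If the target supports a load/store with length for vector mode MODE,
     return the vector mode that the access itself should use: MODE for the
     length-in-lanes flavour, or the same-sized VnQI mode for the
     length-in-bytes flavour.  Return opt_machine_mode () otherwise.  */

  opt_machine_mode
  get_len_load_store_mode (machine_mode mode, bool is_load)
  {
    optab op = is_load ? len_load_optab : len_store_optab;
    gcc_assert (VECTOR_MODE_P (mode));

    /* Direct support for length measured in lanes.  */
    if (direct_optab_handler (op, mode))
      return mode;

    /* Otherwise check whether the same-sized VnQI mode is supported,
       measuring the length in bytes.  */
    machine_mode vmode;
    if (related_vector_mode (mode, QImode, GET_MODE_SIZE (mode)).exists (&vmode)
        && direct_optab_handler (op, vmode))
      return vmode;

    return opt_machine_mode ();
  }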

> +
> +       return true;
> +     }
> +    }
> +  else
> +    gcc_unreachable ();
> +
> +  return false;
> +}
> +
>  /* Return true if there is a compare_and_swap pattern.  */
>  
>  bool
> […]
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 9b564bb046c..daa6e8a2beb 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -968,4 +968,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
>  Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
>  Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
>  
> +-param=vect-with-length-scope=
> +Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
> +Control the vector with length exploitation scope.

Think this should be a bit more descriptive, at least saying what the
three values are (but in a more abbreviated form than the .texi above).

I think the default should be 2, with targets actively turning it down
where necessary.  That way, the decision to turn it down is more likely
to have a comment explaining why.
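
E.g. something like this (using the vect-partial-vector-usage name
suggested above and Init(2) for the reason just given; the help text is
only a suggestion):

  -param=vect-partial-vector-usage=
  Common Joined UInteger Var(param_vect_partial_vector_usage) Init(2) IntegerRange(0, 2) Param Optimization
  Controls when the loop vectorizer uses partial vectors.  0 means never, 1 means only for loops in which vectorization removes the need to iterate, 2 means for all suitable loops.  The default value is 2.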

> […]
> @@ -422,10 +423,20 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>  {
>    tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
>    tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);

IMO just “use_masks_p” would be more readable.  Same later on.

> +
>    tree ctrl_type = rgc->type;
> -  unsigned int nscalars_per_iter = rgc->max_nscalars_per_iter;
> +  /* Scale up nscalars per iteration with factor.  */
> +  unsigned int nscalars_per_iter_ft = rgc->max_nscalars_per_iter * rgc->factor;

Maybe “scaled_nscalars_per_iter”?  Not sure the comment really adds
anything here.

Or maybe “nitems_per_iter”, to keep the names shorter?

>    poly_uint64 nscalars_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);

Maybe worth inserting a scaled_nscalars_per_ctrl or nitems_per_ctrl
here, since it's used in two places below (length_limit and as
batch_nscalars_ft).
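
I.e. something like this, using the names suggested above (untested):

  unsigned int nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
  poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type) * rgc->factor;
  tree length_limit = NULL_TREE;
  /* For length-based controls, the limit is the number of items
     (possibly bytes) handled by one control.  */
  if (!use_masks_p)
    length_limit = build_int_cst (compare_type, nitems_per_ctrl);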

>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +  tree length_limit = NULL_TREE;
> +  /* For length, we need length_limit to check length in range.  */
> +  if (!vect_for_masking)
> +    {
> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
> +      length_limit = build_int_cst (compare_type, len_limit);
> +    }
>  
>    /* Calculate the maximum number of scalar values that the rgroup
>       handles in total, the number that it handles for each iteration
> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>    tree nscalars_total = niters;
>    tree nscalars_step = build_int_cst (iv_type, vf);
>    tree nscalars_skip = niters_skip;
> -  if (nscalars_per_iter != 1)
> +  if (nscalars_per_iter_ft != 1)
>      {
>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>        these multiplications don't overflow.  */
> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>                                    nscalars_total, compare_factor);
>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>            NSCALARS_SKIP to that cannot overflow.  */
>         tree const_limit = build_int_cst (compare_type,
>                                           LOOP_VINFO_VECT_FACTOR (loop_vinfo)
> -                                         * nscalars_per_iter);
> +                                         * nscalars_per_iter_ft);
>         first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>                                     nscalars_total, const_limit);
>         first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,

It looks odd that we don't need to adjust the other nscalars_* values too.
E.g. the above seems to be comparing an unscaled nscalars_total with
a scaled nscalars_per_iter.  I think the units ought to “agree”,
both here and in the rest of the function.

> […]
> @@ -617,16 +638,32 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>                                     init_ctrl, unskipped_mask);
>         else
>           init_ctrl = unskipped_mask;
> +       gcc_assert (vect_for_masking);

I think this ought to go at the beginning of the { … } block,
rather than the end.

>       }
>  
> +      /* First iteration is full.  */

This comment belongs inside the “if”.

>        if (!init_ctrl)
> -     /* First iteration is full.  */
> -     init_ctrl = build_minus_one_cst (ctrl_type);
> +     {
> +       if (vect_for_masking)
> +         init_ctrl = build_minus_one_cst (ctrl_type);
> +       else
> +         init_ctrl = length_limit;
> +     }
>  
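I.e. (with the use_masks_p rename suggested above):

      if (!init_ctrl)
        {
          /* First iteration is full.  */
          if (use_masks_p)
            init_ctrl = build_minus_one_cst (ctrl_type);
          else
            init_ctrl = length_limit;
        }
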
> […]
> @@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>    if (vect_epilogues
>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>        && prolog_peeling >= 0
> -      && known_eq (vf, lowest_vf))
> +      && known_eq (vf, lowest_vf)
> +      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
>      {
>        unsigned HOST_WIDE_INT eiters
>       = (LOOP_VINFO_INT_NITERS (loop_vinfo)

I'm still not really convinced that this check is right.  It feels
like it's hiding a problem elsewhere.

> […]
> @@ -1072,6 +1074,88 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
>    return true;
>  }
>  
> +/* Check whether we can use vector access with length based on precison
> +   comparison.  So far, to keep it simple, we only allow the case that the
> +   precision of the target supported length is larger than the precision
> +   required by loop niters.  */
> +
> +static bool
> +vect_verify_loop_lens (loop_vec_info loop_vinfo)
> +{
> +  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> +
> +  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
> +    return false;
> +
> +  /* The one which has the largest NV should have max bytes per iter.  */
> +  rgroup_controls *rgl = &(*lens)[lens->length () - 1];

“lens->last ()”.  Using a reference feels more natural here.

> +
> +  /* Work out how many bits we need to represent the length limit.  */
> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;

I think this breaks the abstraction.  There's no guarantee that the
factor is the same for each rgroup_control, so there's no guarantee
that the maximum bytes per iter comes from the last entry.  (Also, it'd
be better to avoid talking about bytes if we're trying to be general.)
I think we should take the maximum of each entry instead.
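
E.g. something like (untested):

  /* Work out the maximum number of items (elements or bytes, depending
     on the factor) that the controls have to count per scalar iteration,
     over all rgroups.  */
  unsigned int max_nitems_per_iter = 1;
  unsigned int i;
  rgroup_controls *rgl;
  FOR_EACH_VEC_ELT (*lens, i, rgl)
    max_nitems_per_iter = MAX (max_nitems_per_iter,
                               rgl->max_nscalars_per_iter * rgl->factor);

  /* Work out how many bits we need to represent the length limit.  */
  unsigned int min_ni_prec
    = vect_min_prec_for_max_niters (loop_vinfo, max_nitems_per_iter);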

> +  unsigned int min_ni_prec
> +    = vect_min_prec_for_max_niters (loop_vinfo, nscalars_per_iter_ft);
> +
> +  /* Now use the maximum of below precisions for one suitable IV type:
> +     - the IV's natural precision
> +     - the precision needed to hold: the maximum number of scalar
> +       iterations multiplied by the scale factor (min_ni_prec above)
> +     - the Pmode precision
> +  */
> +
> +  /* If min_ni_width is less than the precision of the current niters,

min_ni_prec

> +     we perfer to still use the niters type.  */
> +  unsigned int ni_prec
> +    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
> +  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
> +  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
> +
> +  unsigned int required_prec = ni_prec;
> +  if (required_prec < pmode_prec)
> +    required_prec = pmode_prec;
> +
> +  tree iv_type = NULL_TREE;
> +  if (min_ni_prec > required_prec)
> +    {

Do we need this condition?  Looks like we could just do:

  min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
  min_ni_prec = MAX (min_ni_prec, ni_prec);

and then run the loop below.

> +      opt_scalar_int_mode tmode_iter;
> +      unsigned standard_bits = 0;
> +      FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
> +      {
> +     scalar_mode tmode = tmode_iter.require ();
> +     unsigned int tbits = GET_MODE_BITSIZE (tmode);
> +
> +     /* ??? Do we really want to construct one IV whose precision exceeds
> +        BITS_PER_WORD?  */
> +     if (tbits > BITS_PER_WORD)
> +       break;
> +
> +     /* Find the first available standard integral type.  */
> +     if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
> +       {
> +         standard_bits = tbits;
> +         break;
> +       }
> +      }
> +      if (standard_bits != 0)
> +     iv_type = build_nonstandard_integer_type (standard_bits, true);

I don't think there's any need for “standard_bits” here, we can just
set “iv_type” directly before breaking.
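
I.e., with the simplification above (untested):

  tree iv_type = NULL_TREE;
  opt_scalar_int_mode tmode_iter;
  FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
    {
      scalar_mode tmode = tmode_iter.require ();
      unsigned int tbits = GET_MODE_BITSIZE (tmode);

      /* ??? Do we really want to construct one IV whose precision exceeds
         BITS_PER_WORD?  */
      if (tbits > BITS_PER_WORD)
        break;

      /* Find the first available standard integral type.  */
      if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
        {
          iv_type = build_nonstandard_integer_type (tbits, true);
          break;
        }
    }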

> +    }
> +  else
> +    iv_type = build_nonstandard_integer_type (required_prec, true);
> +
> +  if (!iv_type)
> +    {
> +      if (dump_enabled_p ())
> +     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                      "can't vectorize with length-based partial vectors"
> +                      " due to no suitable iv type.\n");
> +      return false;
> +    }
> +
> +  LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
> +  LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> +
> +  return true;
> +}
> +
>  /* Calculate the cost of one scalar iteration of the loop.  */
>  static void
>  vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
> @@ -2170,11 +2254,64 @@ start_over:
>        return ok;
>      }
>  
> -  /* Decide whether to use a fully-masked loop for this vectorization
> -     factor.  */
> -  LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> -    = (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> -       && vect_verify_full_masking (loop_vinfo));
> +  /* For now, we don't expect to mix both masking and length approaches for 
> one
> +     loop, disable it if both are recorded.  */
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +      && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ()
> +      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ())
> +    {
> +      if (dump_enabled_p ())
> +     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                      "can't vectorize a loop with partial vectors"
> +                      " because we don't expect to mix different"
> +                      " approaches with partial vectors for the"
> +                      " same loop.\n");
> +      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
> +
> +  /* Decide whether to vectorize a loop with partial vectors for
> +     this vectorization factor.  */
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +    {
> +      /* Decide whether to use fully-masked approach.  */
> +      if (vect_verify_full_masking (loop_vinfo))
> +     LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> +      /* Decide whether to use length-based approach.  */
> +      else if (vect_verify_loop_lens (loop_vinfo))
> +     {
> +       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +           || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
> +         {
> +           if (dump_enabled_p ())
> +             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                              "can't vectorize this loop with length-based"
> +                              " partial vectors approach becuase peeling"
> +                              " for alignment or gaps is required.\n");
> +           LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +         }

Why are these peeling cases necessary?  Peeling for gaps should
just mean subtracting one scalar iteration from the iteration count
and shouldn't otherwise affect the main loop.  Similarly, peeling for
alignment can be handled in the normal way, with a scalar prologue loop.

> +       else if (param_vect_with_length_scope == 0)
> +         LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;

As above, I don't think this should be length-specific.  Same for the
== 1 handling, which we could do afterwards.

> +       /* The epilogue and other known niters less than VF
> +         cases can still use vector access with length fully.  */
> +       else if (param_vect_with_length_scope == 1
> +                && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +                && !vect_known_niters_smaller_than_vf (loop_vinfo))
> +         {
> +           LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +           LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> +         }
> +       else
> +         {
> +           LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> +           LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;

Think it's better to leave this last line out, otherwise it raises
the question why we don't set it to false elsewhere as well.

> +         }
> +     }
> +      else
> +     LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
> +  else
> +    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +
>    if (dump_enabled_p ())
>      {
>        if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> @@ -2183,6 +2320,15 @@ start_over:
>        else
>       dump_printf_loc (MSG_NOTE, vect_location,
>                        "not using a fully-masked loop.\n");
> +
> +      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +     dump_printf_loc (MSG_NOTE, vect_location,
> +                      "using length-based partial"
> +                      " vectors for loop fully.\n");
> +      else
> +     dump_printf_loc (MSG_NOTE, vect_location,
> +                      "not using length-based partial"
> +                      " vectors for loop fully.\n");

Think just one message for all three cases is better, perhaps with

  "operating only on full vectors.\n"

instead of "not using a fully-masked loop.\n".  Might need some
testsuite updates though -- probably worth splitting the wording
change out into a separate patch if so.
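
I.e. eventually just something like this (the wording of the first
message is only a suggestion):

  if (dump_enabled_p ())
    {
      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
        dump_printf_loc (MSG_NOTE, vect_location,
                         "operating on partial vectors.\n");
      else
        dump_printf_loc (MSG_NOTE, vect_location,
                         "operating only on full vectors.\n");
    }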

>      }
>  
>    /* If epilog loop is required because of data accesses with gaps,
> @@ -8249,6 +8423,63 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
>    return mask;
>  }
>  
> +/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
> +   lengths for vector access with length that each control a vector of type
> +   VECTYPE.  FACTOR is only meaningful for length in bytes, and to indicate
> +   how many bytes for each element (lane).  */

Maybe:

/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
   lengths for controlling an operation on VECTYPE.  The operation splits
   each element of VECTYPE into FACTOR separate subelements, measuring
   the length as a number of these subelements.  */

> +
> +void
> +vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
> +                   unsigned int nvectors, tree vectype, unsigned int factor)
> +{
> +  gcc_assert (nvectors != 0);
> +  if (lens->length () < nvectors)
> +    lens->safe_grow_cleared (nvectors);
> +  rgroup_controls *rgl = &(*lens)[nvectors - 1];
> +
> +  /* The number of scalars per iteration, scalar occupied bytes and
> +     the number of vectors are both compile-time constants.  */
> +  unsigned int nscalars_per_iter
> +    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> +              LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> +
> +  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
> +    {
> +      rgl->max_nscalars_per_iter = nscalars_per_iter;
> +      rgl->type = vectype;
> +      rgl->factor = factor;
> +    }

This is dangerous because it ignores “factor” otherwise, and ignores
the previous factor if we overwrite it.

I think instead we should have:

  /* For now, we only support cases in which all loads and stores fall back
     to VnQI or none do.  */
  gcc_assert (!rgl->max_nscalars_per_iter
              || (rgl->factor == 1 && factor == 1)
              || (rgl->max_nscalars_per_iter * rgl->factor
                  == nscalars_per_iter * factor));

before changing rgl.

> […]
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index cdd6f6c5e5d..e0ffbab1d02 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1742,29 +1742,56 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>        return;
>      }
>  
> -  machine_mode mask_mode;
> -  if (!VECTOR_MODE_P (vecmode)
> -      || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
> -      || !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
> +  if (!VECTOR_MODE_P (vecmode))
>      {
>        if (dump_enabled_p ())
>       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                      "can't use a fully-masked loop because the target"
> -                      " doesn't have the appropriate masked load or"
> -                      " store.\n");
> +                      "can't operate on partial vectors because of"
> +                      " the unexpected mode.\n");

Maybe: “can't operate on partial vectors when emulating vector operations”

>        LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>        return;
>      }
> -  /* We might load more scalars than we need for permuting SLP loads.
> -     We checked in get_group_load_store_type that the extra elements
> -     don't leak into a new vector.  */
> +
>    poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>    unsigned int nvectors;
> -  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
> -    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
> -  else
> -    gcc_unreachable ();
> +
> +  machine_mode mask_mode;
> +  bool using_partial_vectors_p = false;
> +  if (targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
> +      && can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
> +    {
> +      /* We might load more scalars than we need for permuting SLP loads.
> +      We checked in get_group_load_store_type that the extra elements
> +      don't leak into a new vector.  */
> +      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))

Please split this out into a lambda that returns the number of vectors,
and keep the comment with it.  That way we can use it here and below.
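
E.g. something like (group_memory_nvectors is just a made-up name):

      /* We might load more scalars than we need for permuting SLP loads.
         We checked in get_group_load_store_type that the extra elements
         don't leak into a new vector.  */
      auto group_memory_nvectors = [&] () -> unsigned int
        {
          unsigned int nvectors;
          if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
            return nvectors;
          gcc_unreachable ();
        };

so that both the vect_record_loop_mask and vect_record_loop_len calls
can simply pass group_memory_nvectors ().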

> +     vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype,
> +                            scalar_mask);
> +      else
> +     gcc_unreachable ();
> +      using_partial_vectors_p = true;
> +    }
> +
> +  unsigned int factor;
> +  if (can_vec_len_load_store_p (vecmode, is_load, &factor))
> +    {
> +      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> +      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
> +     vect_record_loop_len (loop_vinfo, lens, nvectors, vectype, factor);
> +      else
> +     gcc_unreachable ();
> +      using_partial_vectors_p = true;
> +    }
> +
> +  if (!using_partial_vectors_p)
> +    {
> +      if (dump_enabled_p ())
> +     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                      "can't operate on partial vectors because the"
> +                      " target doesn't have the appropriate partial"
> +                      "vectorization load or store.\n");

missing space between “partial” and “vectorization”.

> +      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
>  }
>  
>  /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
> @@ -6936,6 +6963,28 @@ vectorizable_scan_store (vec_info *vinfo,
>    return true;
>  }
>  
> +/* For the vector type VTYPE, return the same size vector type with
> +   QImode element, which is mainly for vector load/store with length
> +   in bytes.  */
> +
> +static tree
> +vect_get_same_size_vec_for_len (tree vtype)
> +{
> +  gcc_assert (VECTOR_TYPE_P (vtype));
> +  machine_mode v_mode = TYPE_MODE (vtype);
> +  gcc_assert (GET_MODE_INNER (v_mode) != QImode);
> +
> +  /* Obtain new element counts with QImode.  */
> +  poly_uint64 vsize = GET_MODE_SIZE (v_mode);
> +  poly_uint64 esize = GET_MODE_SIZE (QImode);
> +  poly_uint64 nelts = exact_div (vsize, esize);
> +
> +  /* Build element type with QImode.  */
> +  unsigned int eprec = GET_MODE_PRECISION (QImode);
> +  tree etype = build_nonstandard_integer_type (eprec, 1);
> +
> +  return build_vector_type (etype, nelts);
> +}

As mentioned above, I think we should be getting the mode of
the vector from get_len_load_store_mode.
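
For the store case, that would look something like this (untested;
new_vmode and new_vtype are just placeholder names):

  machine_mode vmode = TYPE_MODE (vectype);
  /* We only get here if the target supports the len optab for this
     vectype, so require () is safe.  */
  machine_mode new_vmode = get_len_load_store_mode (vmode, false).require ();
  if (vmode != new_vmode)
    {
      /* Wrap the operand in the equivalent VnQI vector type before
         passing it to IFN_LEN_STORE.  */
      tree new_vtype
        = build_vector_type_for_mode (unsigned_intQI_type_node, new_vmode);
      tree var = vect_get_new_ssa_name (new_vtype, vect_simple_var);
      vec_oprnd = build1 (VIEW_CONVERT_EXPR, new_vtype, vec_oprnd);
      gassign *conv
        = gimple_build_assign (var, VIEW_CONVERT_EXPR, vec_oprnd);
      vect_finish_stmt_generation (vinfo, stmt_info, conv, gsi);
      vec_oprnd = var;
    }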

> […]
> @@ -7911,10 +7968,16 @@ vectorizable_store (vec_info *vinfo,
>             unsigned HOST_WIDE_INT align;
>  
>             tree final_mask = NULL_TREE;
> +           tree final_len = NULL_TREE;
>             if (loop_masks)
>               final_mask = vect_get_loop_mask (gsi, loop_masks,
>                                                vec_num * ncopies,
>                                                vectype, vec_num * j + i);
> +           else if (loop_lens)
> +             final_len = vect_get_loop_len (loop_vinfo, loop_lens,
> +                                            vec_num * ncopies,
> +                                            vec_num * j + i);
> +

I don't think we need this “final_len”.  Unlike for masks, we only have
a single length, and can calculate it in the “if” statement below.

>             if (vec_mask)
>               final_mask = prepare_load_store_mask (mask_vectype, final_mask,
>                                                     vec_mask, gsi);
> @@ -7994,6 +8057,34 @@ vectorizable_store (vec_info *vinfo,
>                 vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
>                 new_stmt = call;
>               }
> +           else if (final_len)
> +             {
> +               align = least_bit_hwi (misalign | align);
> +               tree ptr = build_int_cst (ref_type, align);
> +               tree vtype = TREE_TYPE (vec_oprnd);

Couldn't you just reuse “vectype”?  Worth a comment if not.

> +               /* Need conversion if it's wrapped with VnQI.  */
> +               if (!direct_optab_handler (len_store_optab,
> +                                          TYPE_MODE (vtype)))

I think this should use get_len_load_store_mode rather than querying
the optab directly.

> +                 {
> +                   tree new_vtype = vect_get_same_size_vec_for_len (vtype);
> +                   tree var
> +                     = vect_get_new_ssa_name (new_vtype, vect_simple_var);
> +                   vec_oprnd
> +                     = build1 (VIEW_CONVERT_EXPR, new_vtype, vec_oprnd);
> +                   gassign *new_stmt
> +                     = gimple_build_assign (var, VIEW_CONVERT_EXPR,
> +                                            vec_oprnd);
> +                   vect_finish_stmt_generation (vinfo, stmt_info, new_stmt,
> +                                                gsi);
> +                   vec_oprnd = var;
> +                 }
> +               gcall *call
> +                 = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
> +                                               ptr, final_len, vec_oprnd);
> +               gimple_call_set_nothrow (call, true);
> +               vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
> +               new_stmt = call;
> +             }
>             else
>               {
>                 data_ref = fold_build2 (MEM_REF, vectype,
> @@ -8531,6 +8622,7 @@ vectorizable_load (vec_info *vinfo,
>        tree dr_offset;
>  
>        gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
> +      gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));

Might as well just change the existing assert to
!LOOP_VINFO_USING_PARTIAL_VECTORS_P.

Same comments for the load code.

> […]
> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>         return false;
>       }
>  
> -      if (loop_vinfo
> -       && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> -       && reduction_type == EXTRACT_LAST_REDUCTION)
> -     vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
> -                            ncopies * vec_num, vectype, NULL);
> +      if (loop_vinfo && for_reduction
> +       && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +     {
> +       if (reduction_type == EXTRACT_LAST_REDUCTION)
> +         vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
> +                                ncopies * vec_num, vectype, NULL);
> +       /* Using partial vectors can introduce inactive lanes in the last
> +          iteration, since full vector of condition results are operated,
> +          it's unsafe here.  But if we can AND the condition mask with
> +          loop mask, it would be safe then.  */
> +       else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
> +         {
> +           scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
> +           if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
> +             {
> +               bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> +               cond.code = invert_tree_comparison (cond.code, honor_nans);
> +               if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
> +                 LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +             }
> +         }
> +       else
> +         LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +     }
>  
>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,

I don't understand this part.

> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>    *nunits_vectype_out = nunits_vectype;
>    return opt_result::success ();
>  }
> +
> +/* Generate and return statement sequence that sets vector length LEN that is:
> +
> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
> +   left_len = END_INDEX - min_of_start_and_end;
> +   rhs = min (left_len, LEN_LIMIT);
> +   LEN = rhs;
> +
> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
> +   means if we have left_len in bytes larger than 255, it can't be saturated to
> +   vector limit (vector size).  One target hook can be provided if other ports
> +   don't suffer this.
> +*/

Should be no line break before the */

Personally I think it'd be better to drop the TODO.  This isn't the only
place that would need to change if we allowed out-of-range lengths,
whereas the comment might give the impression that it is.

> +
> +gimple_seq
> +vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
> +{
> +  gimple_seq stmts = NULL;
> +  tree len_type = TREE_TYPE (len);
> +  gcc_assert (TREE_TYPE (start_index) == len_type);
> +
> +  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
> +  tree left_len = fold_build2 (MINUS_EXPR, len_type, end_index, min);
> +  left_len = fold_build2 (MIN_EXPR, len_type, left_len, len_limit);
> +
> +  tree rhs = force_gimple_operand (left_len, &stmts, true, NULL_TREE);
> +  gimple *new_stmt = gimple_build_assign (len, rhs);
> +  gimple_stmt_iterator i = gsi_last (stmts);
> +  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
> +
> +  return stmts;
> +}

It's better to build this up using gimple_build instead.
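
I.e. something like:

  gimple_seq
  vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
  {
    gimple_seq stmts = NULL;
    tree len_type = TREE_TYPE (len);
    gcc_assert (TREE_TYPE (start_index) == len_type);

    tree min = gimple_build (&stmts, MIN_EXPR, len_type, start_index, end_index);
    tree left_len = gimple_build (&stmts, MINUS_EXPR, len_type, end_index, min);
    left_len = gimple_build (&stmts, MIN_EXPR, len_type, left_len, len_limit);
    gimple_seq_add_stmt (&stmts, gimple_build_assign (len, left_len));

    return stmts;
  }

(untested, but gimple_build folds as it goes, which is generally what
we want here.)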

> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 6c830ad09f4..4155ffe1d49 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -417,6 +417,16 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
>     are compile-time constants but VF and nL can be variable (if the target
>     supports variable-length vectors).
>  
> +   Moreover, for some approach with partial vectors like being controlled
> +   by length (in bytes), it cares about the occupied bytes for each scalar.
> +   Provided that each scalar has factor bytes, the total number of scalar
> +   values becomes to factor * N, the above equation becomes to:
> +
> +       factor * N = factor * NS * VF = factor * NV * NL
> +
> +   factor * NS is the bytes of each scalar, factor * NL is the vector size
> +   in bytes.
> +
>     In classical vectorization, each iteration of the vector loop would
>     handle exactly VF iterations of the original scalar loop.  However,
>     in vector loops that are able to operate on partial vectors, a

As above, I think it'd be better to model the factor as splitting each
element into FACTOR pieces.  In that case I don't think we need to
describe it in this comment; a comment above the field should be enough.

> @@ -473,14 +483,19 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
>     first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
>     the second being indexed by the mask index 0 <= i < nV.  */
>  
> -/* The controls (like masks) needed by rgroups with nV vectors,
> +/* The controls (like masks, lengths) needed by rgroups with nV vectors,
>     according to the description above.  */

“(masks or lengths)”

>  struct rgroup_controls {
>    /* The largest nS for all rgroups that use these controls.  */
>    unsigned int max_nscalars_per_iter;
>  
> -  /* The type of control to use, based on the highest nS recorded above.
> -     For mask-based approach, it's used for mask_type.  */
> +  /* For now, it's mainly used for length-based in bytes approach, it's
> +     record the occupied bytes of each scalar.  */

Maybe:

  /* For the largest nS recorded above, the loop controls divide each scalar
     into FACTOR equal-sized pieces.  This is useful if we need to split
     element-based accesses into byte-based accesses.  */

> +  unsigned int factor;
> +
> +  /* This type is based on the highest nS recorded above.
> +     For mask-based approach, it records mask type to use.
> +     For length-based approach, it records appropriate vector type.  */

Maybe:

  /* This is a vector type with MAX_NSCALARS_PER_ITER * VF / nV elements.
     For mask-based controls, it is the type of the masks in CONTROLS.
     For length-based controls, it can be any vector type that has the
     specified number of elements; the type of the elements doesn't matter.  */

> @@ -644,6 +665,10 @@ public:
>       the vector loop can handle fewer than VF scalars.  */
>    bool using_partial_vectors_p;
>  
> +  /* True if we've decided to use partially-populated vectors for the
> +     epilogue of loop, only for length-based approach for now.  */

Don't think the bit after the comma is necessary.

> +  bool epil_using_partial_vectors_p;
> +
>    /* When we have grouped data accesses with gaps, we may introduce invalid
>       memory accesses.  We peel the last iteration of the loop to prevent
>       this.  */

Thanks,
Richard
