Re: [RFC 3/9] Implement recording/getting of mask/length for BB SLP

Richard Biener Tue, 11 Nov 2025 06:18:46 -0800

On Tue, 11 Nov 2025, Christopher Bazley wrote:

> 
> On 07/11/2025 13:42, Richard Biener wrote:
> > On Wed, 5 Nov 2025, Christopher Bazley wrote:
> >
> >> On 28/10/2025 13:29, Richard Biener wrote:
> >>> On Tue, 28 Oct 2025, Christopher Bazley wrote:
> >>>
> >>>> +tree
> >>>> +vect_slp_get_bb_mask (slp_tree slp_node, gimple_stmt_iterator *gsi,
> >>>> +                      unsigned int nvectors, tree vectype,
> >>>> +                      unsigned int index)
> >>>> +{
> >>>> +  gcc_checking_assert (SLP_TREE_CAN_USE_MASK_P (slp_node));
> >>>> +
> >>>> +  /* Only the last vector can be a partial vector.  */
> >>>> +  if (index < nvectors - 1)
> >>>> +    return NULL_TREE;
> >>>> +
> >>>> +  /* vect_get_num_copies only allows a partial vector if it is the only
> >>>> +     vector.  */
> >>>> +  if (nvectors > 1)
> >>>> +    return NULL_TREE;
> >>>> +
> >>>> +  gcc_checking_assert (nvectors == 1);
> >>>> +
> >>>> +  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> >>>> +  unsigned int group_size = SLP_TREE_LANES (slp_node);
> >>> In particular I think that in general with partial vectors the group_size
> >>> is not equal to the number of scalar lanes but instead is computed by
> >>> the "VF", thus equal to max_nunits?  This means we have to be careful
> >>> what we deal with or rather what we want to record, given locally for
> >>> a SLP node we can only compute its own total nunits based on the
> >>> number of scalar lanes and the vector type.
> >> group_size comes from scalar_stmts.length () or ops.length () in the
> >> overloaded function vect_create_new_slp_node or from TYPE_VECTOR_SUBPARTS
> >> (SLP_TREE_VECTYPE (vnode)) in vect_build_slp_tree_2.
> >>
> >> LOOP_VINFO_VECT_FACTOR ("VF") is only stored for the loop vectoriser
> >> instances, therefore it cannot affect BB SLP. The vect_slp_get_bb_mask
> >> function is only used for BB SLP.
> >>
> >> The main use of nunits.max is in calculate_unrolling_factor, which does not
> >> require it to be equal to group_size or an integral multiple of group_size,
> >> nor vice-versa. The design of that function implies that nunits.max is
> >> expected to be divisible by group_size though, so I guess it does something
> >> like VF = group_size * nunits.max.
> >>
> >>> So we might want to explicitly record the group size.  In any case
> >> Sorry but I'm not sure what you are suggesting here. The group size is
> >> already explicitly recorded in the SLP node (as 'lanes', although the term
> >> 'group_size' seems to be overloaded in the vectoriser -- e.g., group_size
> >> can also be DR_GROUP_SIZE in vectorizable_load or vectorizable_store).
> >>
> >>> 'nvectors' should be also correct here, the question is how we
> >>> compute that right now.
> >> nvectors is vector_unroll_factor (i.e. SLP_TREE_LANES / simdlen for BB
> >> SLP) in vectorizable_simd_clone_call, vect_get_num_copies (i.e.
> >> SLP_TREE_LANES / TYPE_VECTOR_SUBPARTS for BB SLP) in
> >> vectorizable_operation, and vect_get_num_copies / DR_GROUP_SIZE in
> >> vectorizable_{store|load}. The vect_get_mask function is also called with
> >> values of nvectors between 0 and vect_get_num_copies.
> >>
> >> I assumed that arguments that are valid for vect_get_loop_mask would also
> >> be valid for my new function, vect_slp_get_bb_mask, because my intention
> >> was always to share as much code as possible between loop vectorisation
> >> and SLP. It's likely that some of the code is not optimal as a result.
> > Well yes, I know all this.  But when we now add padding, the question
> > is where we should track that (or, as you seem to imply, not track it).
> > Do we increase group_size to reflect that the actual vectors have more
> > lanes?  Do we just track that in max_nunits somehow?
> SLP_TREE_LANES gives the unpadded size of a group (i.e. the number of active
> lanes), which seems reasonable to me because that is the actual size of the
> group. slp_tree_nunits simply gives the range of TYPE_VECTOR_SUBPARTS
> (vectype), i.e. including any inactive lanes in the last vector of the group.
> So, no, max_nunits (or rather, nunits.max in my patch) does not include
> padding and I didn't want/need to change it to do so.
> 
> I don't yet understand why you want the amount of padding to be tracked
> independently from TYPE_VECTOR_SUBPARTS (vectype) - SLP_TREE_LANES (slp_node).
> Even if vect_slp_get_bb_mask were modified to produce masks for partial
> vectors in cases where nvectors > 1, the amount of padding required (or, more
> usefully in this function, the number of unmasked bits) could still be derived
> from the vectype and group_size by using the remainder of a division, e.g.,
> TYPE_VECTOR_SUBPARTS (vectype) - (SLP_TREE_LANES (slp_node) %
> TYPE_VECTOR_SUBPARTS (vectype)).
> 
> > The purpose of max_nunits for BB vectorization is solely to detect
> > the case that we do not have sufficient lanes in the SLP node to
> > fill the vector lanes of the vector type we chose, thus we'd need
> > "unrolling".
> >
> > Richard.
> 
> My understanding of calculate_unrolling_factor and its calling code in
> vect_analyze_slp_instance and vect_build_slp_instance is that unrolling is
> required if the group size is less than the maximum number of lanes of all of
> the chosen vector types, or the group size is greater than nunits.max but not
> exactly divisible by it. If padding lanes were included in the group size, it
> might prevent correct detection of when unrolling is required. (If you say
> that unrolling is never required for BB SLP, that would require restructuring
> of the control flow in vect_analyze_slp_instance because currently all of the
> SLP-specific code is in the "unrolling required" block.)
> 
> For BB SLP, the vectoriser used to give up completely if the group size was
> less than the maximum number of lanes of all of the chosen vector types
> ("...do not have sufficient lanes in the SLP node to fill the vector lanes of
> the vector type we chose..."), or split the group if the group size is greater
> than nunits.max but not exactly divisible by it.


BB vectorization cannot perform unrolling.  Unrolling is only performed
to fill vector lanes with actual data - we could statically predicate
all vectors in loop vectorization and retain VF == 1, but that would
be inefficient, so we unroll N times with N computed so that all vectors
are fully populated by used lanes.
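
As a minimal sketch of that arithmetic, assuming fixed-length vectors and
plain unsigned integers in place of poly_uint64: the names below
(gcd_uint, model_unroll_factor) are hypothetical and are not functions
in the vectoriser.  The sketch computes lcm (nunits, group_size) /
group_size, i.e. how many copies of the group are needed before every
vector is filled with used lanes.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm.  */
static unsigned int
gcd_uint (unsigned int a, unsigned int b)
{
  while (b != 0)
    {
      unsigned int t = a % b;
      a = b;
      b = t;
    }
  return a;
}

/* Hypothetical model, not vectoriser code: the number of copies N of a
   group of GROUP_SIZE scalar lanes needed so that vectors of NUNITS
   lanes are completely filled.  This is
   lcm (NUNITS, GROUP_SIZE) / GROUP_SIZE, which simplifies to
   NUNITS / gcd (NUNITS, GROUP_SIZE).  */
static unsigned int
model_unroll_factor (unsigned int nunits, unsigned int group_size)
{
  assert (nunits > 0 && group_size > 0);
  return nunits / gcd_uint (nunits, group_size);
}
```

In this model a result of 1 means the group already fills whole
vectors.  Any other result is a request that BB vectorization cannot
honour, which leaves splitting or predication as the alternatives.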

Yes, there are existing checks that bail out early for BB vectorization.
We need to (and can) remove those and have the late check perform the
necessary testing.

> If the minimum number of lanes across all of the chosen vector types is
> sufficient to store the whole group then it might be possible to use tail
> predication, which is why I added !known_ge (nunits.min, group_size) to the
> conjunction that must be true before entering that block. My modification does
> not prevent groups bigger than nelems.max from being split, including in cases
> where one of the new groups resulting from such a split can be handled by
> tail-predication.
> 
> Are you suggesting that such splits should be avoided? If so, please could you
> explain the rationale?

I think I explained the situations where a split isn't necessary already
(uniform vector type as optimal configuration to cover all lanes).  If
we can do tail predication for a subset of vectors (and not strictly
require a single one), then not splitting is still possible in such a case.

I think we can retain splitting as we do now and put in the hard
requirement that predication can only be done on a single-vector
SLP node (aka SLP_TREE_LANES < lower_bound (TYPE_VECTOR_SUBPARTS)).
I think that we might need to relax this to cover some cases
involving conversions, but we'll see.
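
A minimal sketch of that requirement, assuming fixed-length vectors so
that lower_bound (TYPE_VECTOR_SUBPARTS) is just an integer.  The helper
names and the byte-per-lane mask representation are hypothetical, chosen
only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the single-vector requirement: predication is
   considered only when the node's active lanes do not fill one vector
   of MIN_NUNITS lanes.  */
static bool
model_needs_single_vector_mask (unsigned int lanes, unsigned int min_nunits)
{
  return lanes < min_nunits;
}

/* Fill MASK[0 .. NUNITS-1] with 1 for active lanes and 0 for padding
   lanes.  Returns the number of padding lanes, NUNITS - LANES.  */
static unsigned int
model_fill_tail_mask (unsigned char *mask, unsigned int nunits,
                      unsigned int lanes)
{
  assert (lanes <= nunits);
  for (unsigned int i = 0; i < nunits; i++)
    mask[i] = i < lanes;
  return nunits - lanes;
}
```

For a three-lane node and a four-lane vector type the check succeeds
and the mask is {1, 1, 1, 0} with one padding lane.  A four-lane node
fails the check because it needs no mask at all.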

Retaining the splitting has the short-term advantage that we do not have
to re-try analysis without predication, or handle the
cannot-predicate-but-would-have-to case by splitting late.

Note that I intend to push the decision on which actual vector type
we use to later, to analysis in vectorizable_*.  This impacts the ability
to choose appropriate splitting points, because that would currently
be tied to vector type choices.  Or the other way around, a split
point for an SLP node 'A' might influence the optimal choice of vector
type for its child SLP nodes (again consider 2xV2DI with 1xV4SI
child - we can split the former at V2DI boundary, but not its child
unless we demote that to V2SI).
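
The V2DI/V4SI example can be sketched as a simple divisibility check,
assuming fixed-length vectors and a split expressed as a scalar lane
index.  The function name is hypothetical and the model ignores
everything except lane counts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a node can be split after SPLIT_LANE scalar lanes
   without cutting through a vector only if SPLIT_LANE is a multiple of
   the number of lanes per vector, NUNITS.  */
static bool
model_split_on_vector_boundary (unsigned int split_lane, unsigned int nunits)
{
  assert (nunits > 0);
  return split_lane % nunits == 0;
}
```

With four scalar lanes split after lane 2, the check passes for a
two-lane type (V2DI, or the child demoted to V2SI) and fails for the
four-lane V4SI child.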

Richard.

> Thanks,
> 
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [RFC 3/9] Implement recording/getting of mask/length for BB SLP
