On 07/11/2025 13:42, Richard Biener wrote:
On Wed, 5 Nov 2025, Christopher Bazley wrote:

On 28/10/2025 13:29, Richard Biener wrote:
On Tue, 28 Oct 2025, Christopher Bazley wrote:

+tree
+vect_slp_get_bb_mask (slp_tree slp_node, gimple_stmt_iterator *gsi,
+                     unsigned int nvectors, tree vectype, unsigned int index)
+{
+  gcc_checking_assert (SLP_TREE_CAN_USE_MASK_P (slp_node));
+
+  /* Only the last vector can be a partial vector.  */
+  if (index < nvectors - 1)
+    return NULL_TREE;
+
+  /* vect_get_num_copies only allows a partial vector if it is the only
+     vector.  */
+  if (nvectors > 1)
+    return NULL_TREE;
+
+  gcc_checking_assert (nvectors == 1);
+
+  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  unsigned int group_size = SLP_TREE_LANES (slp_node);
In particular I think that in general with partial vectors the group_size
is not equal to the number of scalar lanes but instead is computed from
the "VF", thus equal to max_nunits?  This means we have to be careful
what we deal with, or rather what we want to record, given that locally
for an SLP node we can only compute its own total nunits based on the
number of scalar lanes and the vector type.
group_size comes from scalar_stmts.length () or ops.length () in the
overloaded function vect_create_new_slp_node or from TYPE_VECTOR_SUBPARTS
(SLP_TREE_VECTYPE (vnode)) in vect_build_slp_tree_2.

LOOP_VINFO_VECT_FACTOR ("VF") is only stored for the loop vectoriser
instances, therefore it cannot affect BB SLP. The vect_slp_get_bb_mask
function is only used for BB SLP.

The main use of nunits.max is in calculate_unrolling_factor, which does not
require it to be equal to group_size or an integral multiple of group_size,
nor vice-versa. The design of that function implies that nunits.max is
expected to be divisible by group_size though, so I guess it does something
like VF = group_size * nunits.max.
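
For reference, the definition I was looking at is roughly the following
(paraphrased from tree-vect-slp.cc from memory, so treat it as an
approximation rather than a quotation):

  /* Unroll the SLP instance by the smallest factor that makes the group
     fill a whole number of vectors, i.e. lcm (nunits, group_size) /
     group_size.  */
  static poly_uint64
  calculate_unrolling_factor (poly_uint64 nunits, unsigned int group_size)
  {
    return exact_div (common_multiple (nunits, group_size), group_size);
  }

so, as far as I can see, the factor collapses to 1 exactly when nunits
divides group_size.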

So we might want to explicitly record the group size.  In any case
Sorry, but I'm not sure what you are suggesting here. The group size is
already explicitly recorded in the SLP node (as 'lanes', although the term
'group_size' seems to be overloaded in the vectoriser -- e.g., group_size
can also be DR_GROUP_SIZE in vectorizable_load or vectorizable_store).

'nvectors' should also be correct here; the question is how we
compute that right now.
nvectors is vector_unroll_factor (i.e. SLP_TREE_LANES / simdlen for BB SLP) in
vectorizable_simd_clone_call, vect_get_num_copies (i.e. SLP_TREE_LANES /
TYPE_VECTOR_SUBPARTS for BB SLP) in vectorizable_operation, and
vect_get_num_copies / DR_GROUP_SIZE in vectorizable_{store|load}. The
vect_get_mask function is also called with values of nvectors between 0
and vect_get_num_copies.
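
To spell out what I expect the BB SLP value to be once partial vectors
are allowed, here is an illustrative sketch only (bb_slp_nvectors is a
made-up name, and I have fixed nunits as a compile-time constant for
simplicity; it is not how the existing helpers are written):

  /* Made-up helper: how many vectors of type VECTYPE are needed for the
     scalar lanes of NODE, allowing the last vector to be partial (i.e.
     rounding up rather than requiring an exact division).  */
  static unsigned int
  bb_slp_nvectors (slp_tree node, tree vectype)
  {
    unsigned int nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
    unsigned int lanes = SLP_TREE_LANES (node);
    return (lanes + nunits - 1) / nunits;
  }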

I assumed that arguments that are valid for vect_get_loop_mask would also be
valid for my new function, vect_slp_get_bb_mask, because my intention was
always to share as much code as possible between loop vectorisation and SLP.
It's likely that some of the code is not optimal as a result.
Well yes, I know all this.  But when we now add padding, the question
is where we should track that (or, as you seem to imply, not track it).
Do we increase group_size to reflect that the actual vectors have more
lanes?  Do we just track that in max_nunits somehow?
SLP_TREE_LANES gives the unpadded size of a group (i.e. the number of
active lanes), which seems reasonable to me because that is the actual
size of the group. slp_tree_nunits simply gives the range of
TYPE_VECTOR_SUBPARTS (vectype), i.e. including any inactive lanes in
the last vector of the group. So, no, max_nunits (or rather, nunits.max
in my patch) does not include padding and I didn't want/need to change
it to do so.

I don't yet understand why you want the amount of padding to be tracked
independently from TYPE_VECTOR_SUBPARTS (vectype) - SLP_TREE_LANES
(slp_node). Even if vect_slp_get_bb_mask were modified to produce masks
for partial vectors in cases where nvectors > 1, the amount of padding
required (or, more usefully in this function, the number of unmasked
bits) could still be derived from the vectype and group_size by using
the remainder of a division, e.g., TYPE_VECTOR_SUBPARTS (vectype) -
(SLP_TREE_LANES (slp_node) % TYPE_VECTOR_SUBPARTS (vectype)).
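
As a concrete made-up example: with a group of 6 lanes and a 4-lane
vector type, the last vector would have 6 % 4 == 2 active lanes, so the
padding is 4 - (6 % 4) == 2 lanes; both quantities fall out of the same
remainder without recording any extra state in the SLP node.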

The purpose of max_nunits for BB vectorization is solely to detect
the case that we do not have sufficient lanes in the SLP node to
fill the vector lanes of the vector type we chose, thus we'd need
"unrolling".

Richard.

My understanding of calculate_unrolling_factor and its calling code in
vect_analyze_slp_instance and vect_build_slp_instance is that unrolling
is required if the group size is less than the maximum number of lanes
of all of the chosen vector types, or the group size is greater than
nunits.max but not exactly divisible by it. If padding lanes were
included in the group size, it might prevent correct detection of when
unrolling is required. (If you say that unrolling is never required for
BB SLP, that would require restructuring of the control flow in
vect_analyze_slp_instance because currently all of the SLP-specific
code is in the "unrolling required" block.)
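
To make those two conditions concrete with made-up numbers: with a
4-lane vector type (nunits.max == 4), a 3-lane group is smaller than
nunits.max so an unrolling factor of 4 would be needed, a 6-lane group
is larger but not divisible so a factor of 2 would be needed, and only
groups of 4, 8, 12, ... lanes avoid unrolling. Padding a 3-lane or
6-lane group up to a multiple of 4 would make all three cases
indistinguishable.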

For BB SLP, the vectoriser used to give up completely if the group size
was less than the maximum number of lanes of all of the chosen vector
types ("...do not have sufficient lanes in the SLP node to fill the
vector lanes of the vector type we chose..."), or split the group if
the group size was greater than nunits.max but not exactly divisible
by it.

If the minimum number of lanes across all of the chosen vector types is
sufficient to store the whole group then it might be possible to use
tail predication, which is why I added !known_ge (nunits.min,
group_size) to the conjunction that must be true before entering that
block. My modification does not prevent groups bigger than nunits.max
from being split, including in cases where one of the new groups
resulting from such a split can be handled by tail predication.
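
For example (again with made-up numbers), a 3-lane group and a vector
type whose minimum lane count is 4 give known_ge (nunits.min,
group_size), so the !known_ge term is false, the give-up/split block is
skipped, and the group can be handled as a single partially predicated
vector instead.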

Are you suggesting that such splits should be avoided? If so, please could you explain the rationale?

Thanks,

--
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/
