On 07/11/2025 13:53, Richard Biener wrote:
> On Thu, 6 Nov 2025, Christopher Bazley wrote:
>> On 05/11/2025 12:25, Richard Biener wrote:
>>> On Tue, 4 Nov 2025, Christopher Bazley wrote:
>>>> On 28/10/2025 13:29, Richard Biener wrote:
>>>>>> +/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
>>>>>> +   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
>>>>>> +   Masking is only required for the tail, therefore NULL_TREE is returned for
>>>>>> +   every value of INDEX except the last.  Insert any set-up statements before
>>>>>> +   GSI.  */
>>>>>
>>>>> I think it might happen that some vectors are fully masked, say for a conversion from double to int and V2DImode vs. V4SImode: when we have 5 lanes, the conversion likely expects 4 V2DImode inputs to produce 2 V4SImode outputs, but the 4th V2DImode input has no active lanes at all. But maybe you handle this situation differently, I'll see.
>>>>
>>>> You hypothesise a conversion from 4 of V2DI = 8DI (8DI - 5DI = 3DI inactive, and floor(3DI / 2DI) = 1 of 2DI fully masked) to 2 of V4SI = 8SI (8SI - 5SI = 3SI inactive, and floor(3SI / 4SI) = 0 of V4SI fully masked).
>>>>
>>>> I don't think that the "1 of 2DI is fully masked" would ever happen, though, because a group of 5DI would be split long before the vectoriser attempts to materialise masks. The only reason that a group of 5DI might be allowed to survive that long would be if the number of subparts of the natural vector type (the one currently being tried by vect_slp_region) were at least 5, a factor of 5, or both. No such vector types exist.
>>>>
>>>> For example, consider this translation unit:
>>>>
>>>>   #include <stdint.h>
>>>>
>>>>   void convert(const uint64_t (*const di)[5], uint32_t (*const si)[5])
>>>>   {
>>>>     (*si)[0] = (*di)[0];
>>>>     (*si)[1] = (*di)[1];
>>>>     (*si)[2] = (*di)[2];
>>>>     (*si)[3] = (*di)[3];
>>>>     (*si)[4] = (*di)[4];
>>>>   }
>>>>
>>>> It is compiled (with -O2 -ftree-vectorize -march=armv9-a+sve --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable) as:
>>>>
>>>>   convert:
>>>>   .LFB0:
>>>>           .cfi_startproc
>>>>           ldp     q30, q31, [x0]   ; vector load the first four lanes
>>>>           ptrue   p7.d, vl2        ; enable two lanes for vector stores
>>>>           add     x2, x1, 8
>>>>           ldr     x0, [x0, 32]     ; load the fifth lane
>>>>           st1w    z30.d, p7, [x1]  ; store least-significant 32 bits of the first two lanes
>>>>           st1w    z31.d, p7, [x2]  ; store least-significant 32 bits of lanes 3 and 4
>>>>           str     w0, [x1, 16]     ; store least-significant 32 bits of the fifth lane
>>>>           ret
>>>>           .cfi_endproc
>>>>
>>>> The slp2 dump shows:
>>>>
>>>>   note: Starting SLP discovery for
>>>>   note:   (*si_13(D))[0] = _2;
>>>>   note:   (*si_13(D))[1] = _4;
>>>>   note:   (*si_13(D))[2] = _6;
>>>>   note:   (*si_13(D))[3] = _8;
>>>>   note:   (*si_13(D))[4] = _10;
>>>>   note: Created SLP node 0x4bd9e00
>>>>   note: starting SLP discovery for node 0x4bd9e00
>>>>   note: get vectype for scalar type (group size 5): uint32_t
>>>>   note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 5): vector([4,4]) unsigned int
>>>>   note: vectype: vector([4,4]) unsigned int
>>>>   note: nunits = [4,4]
>>>>   missed: Build SLP failed: unrolling required in basic block SLP
>>>>
>>>> This fails the check in vect_record_nunits because the group size of 5 may be larger than the number of subparts of vector([4,4]) unsigned int (which could be as low as 4), and 5 is never an integral multiple of [4,4]. The vectoriser therefore splits the group of 5SI into 4SI + 1SI:
>>>
>>> I had the impression the intent of this series is to _not_ split the groups in this case. On x86 with V2DImode / V4SImode (aka SSE2)
>>
>> Not exactly. Richard Sandiford did tell me (months ago) that this task is about trying to avoid splitting, but I think that is not the whole story.
>>
>> Richard's initial example of a function that is not currently vectorised, but could be with tail-predication, was:
>>
>>   void foo (char *x, int n)
>>   {
>>     x[0] += 1;
>>     x[1] += 2;
>>     x[2] += 1;
>>     x[3] += 2;
>>     x[4] += 1;
>>     x[5] += 2;
>>   }
>>
>> A group of 6QI such as that shown in the function above would not need to be split because each lane is only one byte wide, not a double word (unlike in your example of a conversion from 5DF to 5SI). A group of 6QI can always be stored in one vector of type VNx16QI, because VNx16QI's minimum number of lanes is 16.
>>
>>           ptrue   p7.b, vl6
>>           ptrue   p6.b, all
>>           ld1b    z31.b, p7/z, [x0]  ; one predicated load
>>           adrp    x1, .LC0
>>           add     x1, x1, :lo12:.LC0
>>           ld1rqb  z30.b, p6/z, [x1]
>>           add     z30.b, z31.b, z30.b
>>           st1b    z30.b, p7, [x0]    ; one predicated store
>>           ret
>>
>> If some target architecture provides both VNx8DF and VNx8SI then your example conversion wouldn't result in a split either, because the group size of 5 would certainly be smaller than the number of subparts of vector([8,8]) double, and the fact that 5 is not an integral multiple of [8,8] would be irrelevant. (SVE doesn't provide either type in implementations that I'm aware of.)
>>
>> However, I believe it could also be beneficial to be able to vectorise functions with more than a small number of operations in them (e.g., 26 instead of 6 operations):
>>
>>   void foo (char *x, int n)
>>   {
>>     x[0] += 1;   x[1] += 2;   x[2] += 1;   x[3] += 2;
>>     x[4] += 1;   x[5] += 2;   x[6] += 1;   x[7] += 2;
>>     x[8] += 1;   x[9] += 2;   x[10] += 1;  x[11] += 2;
>>     x[12] += 1;  x[13] += 2;  x[14] += 1;  x[15] += 2;
>>     x[16] += 1;  x[17] += 2;  x[18] += 1;  x[19] += 2;
>>     x[20] += 1;  x[21] += 2;  x[22] += 1;  x[23] += 2;
>>     x[24] += 1;  x[25] += 2;
>>   }
>>
>> Admittedly, such cases are probably rarer than small groups in real code. In such cases, even a group of byte-size operations might need to be split in order to be vectorised. For example, a group of 26QI additions could be vectorised with VNx16QI as 16QI + 10QI. A mask would be generated for both groups:

> Note you say "split" and mean you have two vector operations in the end. But with "split" I refer to the split into two different SLP graphs; usually, even with BB vectorization, a single SLP node can happily represent multiple vectors (with the same vector type) when necessary to fill all lanes.
Thanks for clarifying that.

My original concept of splitting was probably based on something Richard Sandiford said about the desirability of using Advanced SIMD instead of SVE to vectorise the part of a large group that does not require tail-predication. At that time, I was not aware that the target backend could automatically generate Advanced SIMD instructions for WHILE_ULT operations in which the mask has all bits set. I therefore assumed that it would be necessary to split such groups. Splitting is also a natural consequence of the existing control flow in the vect_analyze_slp_instance function.
> But, to agree to that, we still might want to do some splitting, at least on x86 where we have multiple vector sizes (and thus types for the same element type): your first example with 6 lanes could be split into a V4QImode subgraph and a V2QImode subgraph. I don't think x86 has V2QImode, but just make that V4DImode and V2DImode. A variant without the need for splitting would be using V2DImode (with three vectors), or a variant using V4DImode and masking for the second vector.
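For concreteness, I read your 6-lane scenario (with DImode elements, as you suggest) as something like the function below. This is only an illustration I have made up for this discussion, not a test case from the patch, and the name is arbitrary:

  /* Hypothetical group of six DImode additions.  */
  void
  add6 (long long *a)
  {
    a[0] += 1;
    a[1] += 2;
    a[2] += 1;
    a[3] += 2;
    a[4] += 1;
    a[5] += 2;
  }

  /* As I understand your description, the six lanes could be covered by
     three V2DImode vectors (lanes {0,1}, {2,3}, {4,5}), by a V4DImode
     subgraph for lanes 0-3 plus a V2DImode subgraph for lanes 4-5, or by
     two V4DImode vectors with the upper two lanes of the second vector
     masked.  */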
Is your concern that adding !known_ge (nunits.min, group_size) to the conjunction in the vect_analyze_slp_instance function prevents splitting of BB SLP groups known to be smaller than the minimum number of lanes of any of the chosen vector types? Can such groups really be usefully split?
Let's suppose the group size is 6 and the natural vector type (for the current iteration of the outer loop) is V8DI.
Previously, this example would have failed the following test (condition true):
  if (!max_nunits.is_constant (&const_max_nunits)
      || const_max_nunits > group_size)
which would have resulted in "Build SLP failed: store group size not a multiple of the vector size in basic block SLP" and vect_analyze_slp_instance returning false, instead of the group being split.
Any split would only occur when the next iteration of the outer loop selects V4DI, for which !known_ge (nunits.min, group_size) is true with my changes to the function (because 4 < 6). Consequently, the BB SLP block would still be entered, and the const_max_nunits > group_size test would be repeated. This time it would pass (condition false) because 4 <= 6, giving "SLP discovery succeeded but node needs splitting", and the group could be split into V4DImode and V2DImode as you described.
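To make that concrete, here is a deliberately simplified model of how the two checks behave with my changes applied. It uses plain unsigned integers rather than poly_uint64, collapses nunits.min and const_max_nunits into one value per (fixed-width) vector type, and ignores everything else that vect_analyze_slp_instance does, so it only illustrates the reasoning above rather than mirroring the real code:

  #include <stdio.h>

  /* Toy model: GROUP_SIZE scalar stmts and a candidate fixed-width
     vector type with MIN_NUNITS == MAX_NUNITS lanes.  */
  static void
  try_vectype (const char *name, unsigned group_size,
               unsigned min_nunits, unsigned max_nunits)
  {
    /* Models the effect of the added !known_ge (nunits.min, group_size)
       term: if the whole group is known to fit in one vector, the BB SLP
       splitting block is not entered and a single (possibly masked)
       vector can be used.  */
    if (min_nunits >= group_size)
      {
        printf ("%s: group of %u fits in one vector of %u lanes "
                "(mask the tail)\n", name, group_size, min_nunits);
        return;
      }

    /* Models the pre-existing const_max_nunits > group_size test inside
       the BB SLP block.  */
    if (max_nunits > group_size)
      printf ("%s: Build SLP failed (vector wider than group)\n", name);
    else
      printf ("%s: SLP discovery succeeded but node needs splitting\n",
              name);
  }

  int
  main (void)
  {
    try_vectype ("V8DI", 6, 8, 8);  /* Previously failed; one masked vector now.  */
    try_vectype ("V4DI", 6, 4, 4);  /* Split into V4DImode + V2DImode.  */
    return 0;
  }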
> Your AdvSIMD substitution for the larger case could be done by splitting the graph and choosing AdvSIMD for the half that does not need predication but SVE for the other half.
That's what the current implementation does.
> That said, as long as the vector type is the same for each part covering distinct lanes there is no need for splitting. What I'd like to understand is whether the implementation at hand from you for the masking assumes that, if masking is required (we padded lanes), there has to be exactly one hardware vector for each SLP node. Below you say that's an "invariant", so that's a yes?
The vect_analyze_slp_instance function only creates a new SLP instance for BB vectorisation with an unrolling factor not equal to one if the minimum number of lanes for all of the vector types is sufficient to store the whole group. That implies that there is exactly one hardware vector.
The vect_get_num_copies function also relies on that assumption:

  vf *= SLP_TREE_LANES (node);
  tree vectype = SLP_TREE_VECTYPE (node);
  if (known_ge (TYPE_VECTOR_SUBPARTS (vectype), vf))
    return 1;

Otherwise, exact_div would fail in a callee, vect_get_num_vectors.

My current implementation of the vect_slp_get_bb_mask function returns NULL_TREE (i.e. 'no mask') for all vectors of an SLP node with multiple vectors:
  /* vect_get_num_copies only allows a partial vector if it is the only
     vector.  */
  if (nvectors > 1)
    return NULL_TREE;

That means the tail of such a node would not be masked correctly if it needs to be masked at all. Even if that guard were removed, the following statement would also need to be made more complex to handle cases in which the group size is not also the number of active lanes:
  tree end_index = build_int_cst (cmp_type, group_size);
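To illustrate why, here is a rough model in plain C (illustration only; the real code would operate on poly_ints and trees) of how the active lanes would have to be computed per vector if that restriction were lifted. The inputs correspond to your earlier example of 5 DI lanes across V2DImode vectors and to my 26QI example with 128-bit SVE vectors:

  #include <stdio.h>

  /* If an SLP node with GROUP_SIZE active lanes were spread over NVECTORS
     vectors of NUNITS lanes each, the mask for vector INDEX could no longer
     be WHILE_ULT (0, group_size); it would be something like
     WHILE_ULT (index * nunits, group_size), and some vectors could end up
     fully masked.  */
  static void
  mask_per_vector (unsigned group_size, unsigned nunits, unsigned nvectors)
  {
    for (unsigned index = 0; index < nvectors; index++)
      {
        unsigned first = index * nunits;
        unsigned active = group_size > first ? group_size - first : 0;
        if (active > nunits)
          active = nunits;
        printf ("vector %u: %u of %u lanes active%s\n", index, active, nunits,
                active == nunits ? "" : active == 0 ? " (fully masked)"
                : " (tail mask)");
      }
  }

  int
  main (void)
  {
    mask_per_vector (5, 2, 4);   /* 5 DI lanes over four V2DImode vectors.  */
    mask_per_vector (26, 16, 2); /* 26 QI lanes over two 128-bit VNx16QI vectors.  */
    return 0;
  }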
> I'm not sure that will work out for all cases in the end. I'm fine with requiring this initially, but please keep in mind that we'd want to lift this restriction without re-doing most of what you do in a different way.
>
> Richard.

>> void foo (char * x, int n)
>> {
>>   char * vectp.14;
>>   vector([16,16]) char * vectp_x.13;
>>   vector([16,16]) char vect__34.12;
>>   vector([16,16]) char vect__33.11;
>>   char * vectp.10;
>>   vector([16,16]) char * vectp_x.9;
>>   char * vectp.8;
>>   vector([16,16]) char * vectp_x.7;
>>   vector([16,16]) char vect__2.6;
>>   vector([16,16]) char vect__1.5;
>>   char * vectp.4;
>>   vector([16,16]) char * vectp_x.3;
>>   vector([16,16]) <signed-boolean:1> slp_mask_82;
>>   vector([16,16]) <signed-boolean:1> slp_mask_86;
>>   vector([16,16]) <signed-boolean:1> slp_mask_89;
>>   vector([16,16]) <signed-boolean:1> slp_mask_93;
>>
>>   <bb 2> [local count: 1073741824]:
>>   vectp.4_81 = x_54(D);
>>   slp_mask_82 = .WHILE_ULT (0, 16, { 0, ... });
>>   vect__1.5_83 = .MASK_LOAD (vectp.4_81, 8B, slp_mask_82, { 0, ... });
>>   vect__2.6_84 = vect__1.5_83 + { 1, 2, ... };
>>   vectp.8_85 = x_54(D);
>>   slp_mask_86 = .WHILE_ULT (0, 16, { 0, ... });
>>   .MASK_STORE (vectp.8_85, 8B, slp_mask_86, vect__2.6_84);
>>   vectp.10_88 = x_54(D) + 16;
>>   slp_mask_89 = .WHILE_ULT (0, 10, { 0, ... });
>>   vect__33.11_90 = .MASK_LOAD (vectp.10_88, 8B, slp_mask_89, { 0, ... });
>>   vect__34.12_91 = vect__33.11_90 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 0, 0, 0, 0, 0, 0, ... };
>>   vectp.14_92 = x_54(D) + 16;
>>   slp_mask_93 = .WHILE_ULT (0, 10, { 0, ... });
>>   .MASK_STORE (vectp.14_92, 8B, slp_mask_93, vect__34.12_91);
>>   return;
>> }
>>
>> If advantageous, the AArch64 backend later substitutes Advanced SIMD instructions for the group that uses a variable-length vector type with a mask of known, regular length:
>>
>>           mov     x1, x0
>>           mov     w2, 513
>>           ptrue   p6.b, all
>>           ldr     q29, [x0]              ; first load is replaced with Advanced SIMD
>>           mov     z28.h, w2
>>           add     z28.b, z29.b, z28.b    ; first add is done using SVE (z29.b aliases q29)
>>           mov     x3, 10
>>           whilelo p7.b, xzr, x3
>>           adrp    x2, .LC0
>>           add     x2, x2, :lo12:.LC0
>>           ld1rqb  z30.b, p6/z, [x2]
>>           str     q28, [x1], 16          ; first store is replaced with Advanced SIMD (q28 aliases z28.b)
>>           ld1b    z31.b, p7/z, [x1]      ; second load is predicated SVE
>>           add     z30.b, z31.b, z30.b    ; second add is also done using SVE
>>           st1b    z30.b, p7, [x1]        ; second store is predicated SVE
>>           ret
>>
>> With -msve-vector-bits=128 the GIMPLE produced by the vectoriser doesn't specify any masks at all, but instead splits the group of 26 into 16 + 8 + 2:
>>
>> void foo (char * x, int n)
>> {
>>   char * vectp.20;
>>   vector(2) char * vectp_x.19;
>>   vector(2) char vect__50.18;
>>   vector(2) char vect__49.17;
>>   char * vectp.16;
>>   vector(2) char * vectp_x.15;
>>   char * vectp.14;
>>   vector(8) char * vectp_x.13;
>>   vector(8) char vect__34.12;
>>   vector(8) char vect__33.11;
>>   char * vectp.10;
>>   vector(8) char * vectp_x.9;
>>   char * vectp.8;
>>   vector(16) char * vectp_x.7;
>>   vector(16) char vect__2.6;
>>   vector(16) char vect__1.5;
>>   char * vectp.4;
>>   vector(16) char * vectp_x.3;
>>
>>   <bb 2> [local count: 1073741824]:
>>   vectp.4_81 = x_54(D);
>>   vect__1.5_82 = MEM <vector(16) char> [(char *)vectp.4_81];
>>   vect__2.6_84 = vect__1.5_82 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
>>   vectp.4_83 = vectp.4_81 + 10;
>>   vectp.8_85 = x_54(D);
>>   MEM <vector(16) char> [(char *)vectp.8_85] = vect__2.6_84;
>>   vectp.10_87 = x_54(D) + 16;
>>   vect__33.11_88 = MEM <vector(8) char> [(char *)vectp.10_87];
>>   vect__34.12_90 = vect__33.11_88 + { 1, 2, 1, 2, 1, 2, 1, 2 };
>>   vectp.10_89 = x_54(D) + 34;
>>   vectp.14_91 = x_54(D) + 16;
>>   MEM <vector(8) char> [(char *)vectp.14_91] = vect__34.12_90;
>>   vectp.16_93 = x_54(D) + 24;
>>   vect__49.17_94 = MEM <vector(2) char> [(char *)vectp.16_93];
>>   vect__50.18_96 = vect__49.17_94 + { 1, 2 };
>>   vectp.16_95 = x_54(D) + 48;
>>   _49 = MEM[(char *)x_54(D) + 24B];
>>   _50 = _49 + 1;
>>   _51 = MEM[(char *)x_54(D) + 25B];
>>   _52 = _51 + 2;
>>   vectp.20_97 = x_54(D) + 24;
>>   MEM <vector(2) char> [(char *)vectp.20_97] = vect__50.18_96;
>>   return;
>> }
>>
>> The AArch64 backend still uses SVE if available though:
>>
>>           adrp    x1, .LC0
>>           ldr     d29, [x0, 16]          ; load the middle 8 bytes using Advanced SIMD
>>           ptrue   p7.b, vl16             ; this SVE mask is actually for 2 lanes, when interpreted as doubles later!
>>           ldr     q27, [x0]              ; load the first 16 bytes using Advanced SIMD
>>           index   z30.d, #1, #1
>>           ldr     d28, [x1, #:lo12:.LC0]
>>           adrp    x1, .LC1
>>           ldr     q26, [x1, #:lo12:.LC1]
>>           add     v28.8b, v29.8b, v28.8b ; add the middle 8 bytes using Advanced SIMD (v29.8b aliases d29)
>>           add     x1, x0, 24             ; offset to the last two bytes [x0,24] and [x0,25]
>>           add     v26.16b, v27.16b, v26.16b ; add the first 16 bytes using Advanced SIMD (v27.16b aliases q27)
>>           str     d28, [x0, 16]          ; store the middle 8 bytes using Advanced SIMD
>>           str     q26, [x0]              ; store the first 16 bytes using Advanced SIMD
>>           ld1b    z31.d, p7/z, [x1]      ; load the last two bytes using SVE
>>           add     z30.b, z31.b, z30.b
>>           st1b    z30.d, p7, [x1]        ; store the last two bytes using SVE
>>           ret
>>
>> So you see there is only a loose relationship between GIMPLE vector types and instructions chosen by the backend.

>>> you'd have three V2DImode vectors, the last with one masked lane, and two V4SImode vectors, the last with three masked lanes. The 2nd V2DImode -> V4SImode conversion (2nd because two output vectors) expects two V2DImode inputs because it uses two-to-one vector pack instructions. But the 2nd V2DImode input does not exist.
>>
>> I'm not familiar with other CPU architectures, but I suspect they are neither helped nor hindered by my change.
>>
>>> That said, downthread you have comments that only a single vector element is supported when using masked operation (I don't remember exactly where). So you are hoping that the group splitting provides you with a fully "leaf" situation here?
>>
>> I think it's an invariant.
>>
>>> Keep in mind that splitting is not always a good option; with
>>>
>>>   a[0] = b[0]; a[1] = b[2]; a[2] = b[1]; a[3] = b[3];
>>>
>>> we do not split along V2DImode boundaries, but having 2x V2DImode allows the loads to be handled efficiently with shuffling. Similar situations may arise when there are vector parts. That said, if you think the current limitation to leafs does not restrict us design-wise then it's an OK initial limitation.
>>
>> Thanks!
-- 
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/
