Re: [RFC 3/9] Implement recording/getting of mask/length for BB SLP

Richard Biener Wed, 05 Nov 2025 04:25:42 -0800

On Tue, 4 Nov 2025, Christopher Bazley wrote:

> On 28/10/2025 13:29, Richard Biener wrote:
> >> +/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
> >> +   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
> >> +   Masking is only required for the tail, therefore NULL_TREE is returned for
> >> +   every value of INDEX except the last.  Insert any set-up statements before
> >> +   GSI.  */
> > I think it might happen that some vectors are fully masked, say for
> > a conversion from double to int and V2DImode vs. V4SImode when we
> > have 5 lanes the conversion likely expects 4 V2DImode inputs to
> > produce 2 V4SImode outputs, but the 4th V2DImode input has no active
> > lanes at all.
> >
> > But maybe you handle this situation differently, I'll see.
> 
> You hypothesise a conversion from 4 of V2DI = 8DI (8DI - 5DI = 3DI inactive,
> and floor(3DI / 2DI)=1 of 2DI fully masked) to 2 of V4SI = 8SI (8SI - 5SI =
> 3SI inactive and floor(3SI / 4SI)=0 of V4SI fully masked).
> 
> I don't think that the "1 of 2DI is fully masked" would ever happen though,
> because a group of 5DI would be split long before the vectoriser attempts to
> materialise masks. The only reason that a group of 5DI might be allowed to
> survive that long would be if the number of subparts of the natural vector
> type (the one currently being tried by vect_slp_region) were at least 5, a
> factor of 5, or both. No such vector types exist.
> 
> For example, consider this translation unit:
> 
> #include <stdint.h>
> 
> void convert(const uint64_t (*const di)[5], uint32_t (*const si)[5])
> {
>   (*si)[0] = (*di)[0];
>   (*si)[1] = (*di)[1];
>   (*si)[2] = (*di)[2];
>   (*si)[3] = (*di)[3];
>   (*si)[4] = (*di)[4];
> }
> 
> This is compiled (with -O2 -ftree-vectorize -march=armv9-a+sve
> --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable) as:
> 
> convert:
> .LFB0:
>         .cfi_startproc
>         ldp     q30, q31, [x0] ; vector load the first four lanes
>         ptrue   p7.d, vl2 ; enable two lanes for vector stores
>         add     x2, x1, 8
>         ldr     x0, [x0, 32] ; load the fifth lane
>         st1w    z30.d, p7, [x1] ; store least-significant 32 bits of the first two lanes
>         st1w    z31.d, p7, [x2] ; store least-significant 32 bits of lanes 3 and 4
>         str     w0, [x1, 16] ; store least-significant 32 bits of fifth lane
>         ret
>         .cfi_endproc
> 
> The slp2 dump shows:
> 
> note:   Starting SLP discovery for
> note:     (*si_13(D))[0] = _2;
> note:     (*si_13(D))[1] = _4;
> note:     (*si_13(D))[2] = _6;
> note:     (*si_13(D))[3] = _8;
> note:     (*si_13(D))[4] = _10;
> note:   Created SLP node 0x4bd9e00
> note:   starting SLP discovery for node 0x4bd9e00
> note:   get vectype for scalar type (group size 5): uint32_t
> note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 5): vector([4,4]) unsigned int
> note:   vectype: vector([4,4]) unsigned int
> note:   nunits = [4,4]
> missed:   Build SLP failed: unrolling required in basic block SLP
> 
> This fails the check in vect_record_nunits because the group size of 5 may be
> larger than the number of subparts of vector([4,4]) unsigned int (which could
> be as low as 4) and 5 is never an integral multiple of [4,4].
> 
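The check described in the quoted paragraph above can be sketched in plain C. This is only a simplified model: `struct poly2`, `maybe_exceeds`, `always_multiple` and `needs_unrolling` are hypothetical names for illustration, not GCC's poly_int API, and the condition is reconstructed from the description rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for a two-coefficient polynomial size such as
   [4,4]: the value is c0 + c1 * x for some runtime x >= 0.
   Hypothetical type, for illustration only.  */
struct poly2 { unsigned c0, c1; };

/* True if GROUP_SIZE may be larger than P, i.e. larger than P's
   minimum value (reached at x == 0).  */
static bool
maybe_exceeds (unsigned group_size, struct poly2 p)
{
  return group_size > p.c0;
}

/* True if GROUP_SIZE is an integral multiple of P for every x >= 0.  */
static bool
always_multiple (unsigned group_size, struct poly2 p)
{
  assert (p.c0 > 0);
  if (p.c1 != 0)
    /* P grows without bound, so only 0 is a multiple of every value.  */
    return group_size == 0;
  return group_size % p.c0 == 0;
}

/* Model of the "unrolling required in basic block SLP" rejection as
   described in the text (an assumption, not the actual GCC code).  */
static bool
needs_unrolling (unsigned group_size, struct poly2 nunits)
{
  return maybe_exceeds (group_size, nunits)
	 && !always_multiple (group_size, nunits);
}
```

Under this model a group of 5 with [4,4] and a group of 4 with [2,2] are both rejected, while a group of 4 with [4,4] and a group of 2 with [2,2] pass, which matches the dump.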
> The vectoriser therefore splits the group of 5SI into 4SI + 1SI:


I had the impression the intent of this series is to _not_ split the
groups in this case.  On x86 with V2DImode / V4SImode (aka SSE2)
you'd have three V2DImode vectors, the last with one masked lane
and two V4SImode vectors, the last with three masked lanes.
The 2nd V2DImode -> V4SImode conversion (2nd because there are two
output vectors) expects two V2DImode inputs because it uses two-to-one
vector pack instructions.  But its 2nd V2DImode input (the fourth
V2DImode vector overall) does not exist.
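
To make the lane arithmetic concrete, here is a minimal sketch in C. The helper names are made up for illustration and nothing here is vectoriser code; it only assumes fixed-size vectors and a pack that narrows two input vectors into one output vector.

```c
#include <assert.h>

/* Vectors of NUNITS lanes needed to hold LANES scalar lanes.  */
static unsigned
vectors_needed (unsigned lanes, unsigned nunits)
{
  assert (nunits > 0);
  return (lanes + nunits - 1) / nunits;
}

/* Inactive (masked-off) lanes in the last of those vectors.  */
static unsigned
masked_tail_lanes (unsigned lanes, unsigned nunits)
{
  return vectors_needed (lanes, nunits) * nunits - lanes;
}

/* Input vectors that a two-to-one pack would consume but that hold no
   active lane at all.  Assumes OUT_NUNITS == 2 * IN_NUNITS.  */
static unsigned
missing_pack_inputs (unsigned lanes, unsigned in_nunits,
		     unsigned out_nunits)
{
  assert (out_nunits == 2 * in_nunits);
  unsigned required = 2 * vectors_needed (lanes, out_nunits);
  return required - vectors_needed (lanes, in_nunits);
}
```

For 5 lanes this gives three V2DImode vectors with one masked lane, two V4SImode vectors with three masked lanes, and one pack input that would have to be fully masked.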

That said, downthread you have comments that only a single vector
element is supported when using masked operation (I don't remember
exactly where).  So you are hoping that the group splitting provides
you with a fully "leaf" situation here?

Keep in mind that splitting is not always a good option, like with

 a[0] = b[0];
 a[1] = b[2];
 a[2] = b[1];
 a[3] = b[3];

we do not split along V2DImode boundaries, but having 2xV2DImode
allows us to handle the loads efficiently with shuffling.  Similar
situations may arise when there are vector parts.
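
As a rough illustration of that case, the four assignments can be modelled as two contiguous two-lane loads followed by shuffles. The `v2di` struct and helpers below are hypothetical stand-ins for V2DImode registers, written in plain C rather than with any intrinsics.

```c
#include <assert.h>

/* Hypothetical model of a two-lane 64-bit vector.  */
typedef struct { long long lane[2]; } v2di;

/* Contiguous two-lane load from P.  */
static v2di
load_v2di (const long long *p)
{
  v2di v = { { p[0], p[1] } };
  return v;
}

/* Contiguous two-lane store to P.  */
static void
store_v2di (long long *p, v2di v)
{
  p[0] = v.lane[0];
  p[1] = v.lane[1];
}

/* Shuffle: take lane LANE from X and from Y (like an unpack low/high).  */
static v2di
select_lane (v2di x, v2di y, int lane)
{
  assert (lane == 0 || lane == 1);
  v2di v = { { x.lane[lane], y.lane[lane] } };
  return v;
}

/* a[0] = b[0]; a[1] = b[2]; a[2] = b[1]; a[3] = b[3];  */
static void
permuted_copy (long long *a, const long long *b)
{
  v2di lo = load_v2di (b);	/* { b[0], b[1] } */
  v2di hi = load_v2di (b + 2);	/* { b[2], b[3] } */
  store_v2di (a, select_lane (lo, hi, 0));	/* { b[0], b[2] } */
  store_v2di (a + 2, select_lane (lo, hi, 1));	/* { b[1], b[3] } */
}
```

Splitting the group along V2DImode boundaries instead would need the pairs { b[0], b[2] } and { b[1], b[3] }, neither of which is a contiguous load.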

That said, if you think the current limitation to leaves does not
restrict us design-wise then it's an OK initial limitation.

Keep in mind I'm having fixed-size vector ISAs in mind here and x86
exposes many different size vector modes, basically all power-of-two
sized modes up to 64 bytes.

Richard.

> note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 4): vector([4,4]) unsigned int
> note:   Splitting SLP group at stmt 4
> note:   Split group into 4 and 1
> note:   Starting SLP discovery for
> note:     (*si_13(D))[0] = _2;
> note:     (*si_13(D))[1] = _4;
> note:     (*si_13(D))[2] = _6;
> note:     (*si_13(D))[3] = _8;
> note:   Created SLP node 0x4bd9ec0
> note:   starting SLP discovery for node 0x4bd9ec0
> note:   get vectype for scalar type (group size 4): uint32_t
> note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 4): vector([4,4]) unsigned int
> note:   vectype: vector([4,4]) unsigned int
> note:   nunits = [4,4]
> note:   Build SLP for (*si_13(D))[0] = _2;
> note:   Build SLP for (*si_13(D))[1] = _4;
> note:   Build SLP for (*si_13(D))[2] = _6;
> note:   Build SLP for (*si_13(D))[3] = _8;
> note:   vect_is_simple_use: operand (unsigned int) _1, type of def: internal
> note:   vect_is_simple_use: operand (unsigned int) _3, type of def: internal
> note:   vect_is_simple_use: operand (unsigned int) _5, type of def: internal
> note:   vect_is_simple_use: operand (unsigned int) _7, type of def: internal
> 
> ... which goes well until it looks at the 64-bit inputs:
> 
> note:   Created SLP node 0x4bda040
> note:   starting SLP discovery for node 0x4bda040
> note:   get vectype for scalar type (group size 4): const uint64_t
> note:   get_vectype_for_scalar_type: natural type for const uint64_t (ignoring group size 4): const vector([2,2]) long unsigned int
> note:   vectype: const vector([2,2]) long unsigned int
> note:   nunits = [2,2]
> missed:   Build SLP failed: unrolling required in basic block SLP
> 
> This fails the check in vect_record_nunits because the group size of 4 may be
> larger than the number of subparts of vector([2,2]) long unsigned int (which
> could be as low as 2) and 4 is not necessarily an integral multiple of [2,2]
> (e.g. the polynomial vector length could be 2+(2*3) = 8 if the vectors are
> 512 bits).
> 
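The 2+(2*3) figure in the quoted paragraph can be reproduced with a small helper. It is a sketch that assumes the runtime indeterminate of an SVE size [c0,c1] is the number of 128-bit blocks in the vector beyond the first; `poly_nunits_at` is a hypothetical name, not a GCC function.

```c
#include <assert.h>

/* Evaluate the polynomial lane count [C0,C1] = C0 + C1 * x for an SVE
   vector of VECTOR_BITS bits, taking x = VECTOR_BITS / 128 - 1.  */
static unsigned
poly_nunits_at (unsigned c0, unsigned c1, unsigned vector_bits)
{
  assert (vector_bits >= 128 && vector_bits % 128 == 0);
  unsigned x = vector_bits / 128 - 1;
  return c0 + c1 * x;
}
```

With [2,2] this gives 2 lanes at 128 bits, where a group of 4 is an exact multiple, and 8 lanes at 512 bits, where it is not.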
> The vectoriser doesn't give up though. Instead, it falls back to scalars for
> the external node representing the 64-bit inputs:
> 
> note:   Build SLP for _1 = (*di_12(D))[0];
> note:   Build SLP for _3 = (*di_12(D))[1];
> note:   Build SLP for _5 = (*di_12(D))[2];
> note:   Build SLP for _7 = (*di_12(D))[3];
> note:   SLP discovery for node 0x4bda040 failed
> note:   Building vector operands from scalars
> note:   Created SLP node 0x4bda100
> note:   SLP discovery for node 0x4bd9f80 succeeded
> note:   SLP discovery for node 0x4bd9ec0 succeeded
> note:   SLP size 3 vs. limit 16.
> note:   Final SLP tree for instance 0x4b174b0:
> note:   node 0x4bd9ec0 (nunits.min=4, nunits.max=4, refcnt=2) vector([4,4]) unsigned int
> note:   op template: (*si_13(D))[0] = _2;
> note:       stmt 0 (*si_13(D))[0] = _2;
> note:       stmt 1 (*si_13(D))[1] = _4;
> note:       stmt 2 (*si_13(D))[2] = _6;
> note:       stmt 3 (*si_13(D))[3] = _8;
> note:       children 0x4bd9f80
> note:   node 0x4bd9f80 (nunits.min=4, nunits.max=4, refcnt=2) vector([4,4]) unsigned int
> note:   op template: _2 = (unsigned int) _1;
> note:       stmt 0 _2 = (unsigned int) _1;
> note:       stmt 1 _4 = (unsigned int) _3;
> note:       stmt 2 _6 = (unsigned int) _5;
> note:       stmt 3 _8 = (unsigned int) _7;
> note:       children 0x4bda100
> note:   node (external) 0x4bda100 (nunits.min=18446744073709551615, nunits.max=1, refcnt=1)
> note:       { _1, _3, _5, _7 }
> 
> The convert node wants vector([2,2]) long unsigned int (at least two 64-bit
> values), whose number of subparts is not guaranteed to divide 4 lanes exactly,
> so the vectoriser falls back to building from scalars:
> 
> note:   === vect_slp_analyze_operations ===
> note:   ==> examining statement: _2 = (unsigned int) _1;
> note:   get_vectype_for_scalar_type: natural type for long unsigned int (ignoring group size 4): vector([2,2]) long unsigned int
> note:   inferred vector type vector([2,2]) long unsigned int
> missed:   lanes=4 is not divisible by subparts=2.
> missed:   incompatible vector types for invariants
> note:   get_vectype_for_scalar_type: natural type for long unsigned int (ignoring group size 4): vector([2,2]) long unsigned int
> note:   get_vectype_for_scalar_type: natural type for long unsigned int (ignoring group size 4): vector([2,2]) long unsigned int
> missed:   not vectorized: relevant stmt not supported: _2 = (unsigned int) _1;
> note:   Building vector operands of 0x4bd9f80 from scalars instead
> note:   ==> examining statement: (*si_13(D))[0] = _2;
> note:   updated vectype of operand 0x4bd9f80 with 4 lanes to vector([4,4]) unsigned int
> note:   vect_model_store_cost: aligned.
> note:   vect_model_store_cost: inside_cost = 1, prologue_cost = 0 .
> note:   vect_prologue_cost_for_slp: node 0x4bd9f80, vector type vector([4,4]) unsigned int, group_size 4
> note:   === vect_bb_partition_graph ===
> note: ***** Analysis succeeded with vector mode VNx2DI
> note: SLPing BB part
> 
> However, the vectorisation with mode VNx2DI is not deemed profitable:
> 
> note: Costing subgraph:
> note: node 0x4bd9ec0 (nunits.min=4, nunits.max=4, refcnt=1) vector([4,4]) unsigned int
> note:op template: (*si_13(D))[0] = _2;
> note:     stmt 0 (*si_13(D))[0] = _2;
> note:     stmt 1 (*si_13(D))[1] = _4;
> note:     stmt 2 (*si_13(D))[2] = _6;
> note:     stmt 3 (*si_13(D))[3] = _8;
> note:     children 0x4bd9f80
> note:node (external) 0x4bd9f80 (nunits.min=4, nunits.max=4, refcnt=1) vector([4,4]) unsigned int
> note:     stmt 0 _2 = (unsigned int) _1;
> note:     stmt 1 _4 = (unsigned int) _3;
> note:     stmt 2 _6 = (unsigned int) _5;
> note:     stmt 3 _8 = (unsigned int) _7;
> note:     children 0x4bda100
> note:node (external) 0x4bda100 (nunits.min=18446744073709551615, nunits.max=1, refcnt=1)
> note:     { _1, _3, _5, _7 }
> note:Cost model analysis:
> _2 1 times scalar_store costs 1 in body
> _4 1 times scalar_store costs 1 in body
> _6 1 times scalar_store costs 1 in body
> _8 1 times scalar_store costs 1 in body
> _2 1 times vector_store costs 1 in body
> node 0x4bd9f80 1 times vec_construct costs 3 in prologue
> note: Cost model analysis for part in loop 0:
>   Vector cost: 11
>   Scalar cost: 4
> missed: not vectorized: vectorization is not profitable.
> note: ***** The result for vector mode VNx16QI would be the same
> note: ***** The result for vector mode VNx8QI would be the same
> note: ***** The result for vector mode VNx4QI would be the same
> 
> The vectoriser then successfully analyses the same block with VNx2QI:
> 
> note:   === vect_analyze_slp ===
> note:   Starting SLP discovery for
> note:     (*si_13(D))[0] = _2;
> note:     (*si_13(D))[1] = _4;
> note:     (*si_13(D))[2] = _6;
> note:     (*si_13(D))[3] = _8;
> note:     (*si_13(D))[4] = _10;
> note:   Created SLP node 0x4bd9ec0
> note:   starting SLP discovery for node 0x4bd9ec0
> note:   get vectype for scalar type (group size 5): uint32_t
> note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 5): vector([2,2]) unsigned int
> note:   vectype: vector([2,2]) unsigned int
> note:   nunits = [2,2]
> missed:   Build SLP failed: unrolling required in basic block SLP
> note:   Build SLP for (*si_13(D))[0] = _2;
> note:   Build SLP for (*si_13(D))[1] = _4;
> note:   Build SLP for (*si_13(D))[2] = _6;
> note:   Build SLP for (*si_13(D))[3] = _8;
> note:   Build SLP for (*si_13(D))[4] = _10;
> note:   SLP discovery for node 0x4bd9ec0 failed
> 
> It splits the group of 5 into 4 + 1:
> 
> 
> note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 4): vector([2,2]) unsigned int
> note:   Splitting SLP group at stmt 4
> note:   Split group into 4 and 1
> note:   Starting SLP discovery for
> note:     (*si_13(D))[0] = _2;
> note:     (*si_13(D))[1] = _4;
> note:     (*si_13(D))[2] = _6;
> note:     (*si_13(D))[3] = _8;
> note:   Created SLP node 0x4bd9f80
> note:   starting SLP discovery for node 0x4bd9f80
> note:   get vectype for scalar type (group size 4): uint32_t
> note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 4): vector([2,2]) unsigned int
> note:   vectype: vector([2,2]) unsigned int
> note:   nunits = [2,2]
> missed:   Build SLP failed: unrolling required in basic block SLP
> note:   Build SLP for (*si_13(D))[0] = _2;
> note:   Build SLP for (*si_13(D))[1] = _4;
> note:   Build SLP for (*si_13(D))[2] = _6;
> note:   Build SLP for (*si_13(D))[3] = _8;
> note:   SLP discovery for node 0x4bd9f80 failed
> 
> It then splits the group of 4 into 2 + 2:
> 
> note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 2): vector([2,2]) unsigned int
> note:   Splitting SLP group at stmt 2
> note:   Split group into 2 and 2
> note:   Starting SLP discovery for
> note:     (*si_13(D))[0] = _2;
> note:     (*si_13(D))[1] = _4;
> note:   Created SLP node 0x4bda100
> note:   starting SLP discovery for node 0x4bda100
> note:   get vectype for scalar type (group size 2): uint32_t
> note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 2): vector([2,2]) unsigned int
> note:   vectype: vector([2,2]) unsigned int
> note:   nunits = [2,2]
> note:   Build SLP for (*si_13(D))[0] = _2;
> note:   Build SLP for (*si_13(D))[1] = _4;
> note:   vect_is_simple_use: operand (unsigned int) _1, type of def: internal
> note:   vect_is_simple_use: operand (unsigned int) _3, type of def: internal
> note:   Created SLP node 0x4bd9e00
> note:   starting SLP discovery for node 0x4bd9e00
> note:   get vectype for scalar type (group size 2): unsigned int
> note:   get_vectype_for_scalar_type: natural type for unsigned int (ignoring group size 2): vector([2,2]) unsigned int
> note:   vectype: vector([2,2]) unsigned int
> note:   nunits = [2,2]
> note:   Build SLP for _2 = (unsigned int) _1;
> note:   Build SLP for _4 = (unsigned int) _3;
> note:   vect_is_simple_use: operand (*di_12(D))[0], type of def: internal
> note:   vect_is_simple_use: operand (*di_12(D))[1], type of def: internal
> note:   Created SLP node 0x4bda040
> note:   starting SLP discovery for node 0x4bda040
> note:   get vectype for scalar type (group size 2): const uint64_t
> note:   get_vectype_for_scalar_type: natural type for const uint64_t (ignoring group size 2): const vector([2,2]) long unsigned int
> note:   vectype: const vector([2,2]) long unsigned int
> note:   nunits = [2,2]
> note:   Build SLP for _1 = (*di_12(D))[0];
> note:   Build SLP for _3 = (*di_12(D))[1];
> note:   SLP discovery for node 0x4bda040 succeeded
> note:   SLP discovery for node 0x4bd9e00 succeeded
> note:   SLP discovery for node 0x4bda100 succeeded
> note:   SLP size 3 vs. limit 16.
> note:   Final SLP tree for instance 0x4b174b0:
> note:   node 0x4bda100 (nunits.min=2, nunits.max=2, refcnt=2) vector([2,2]) unsigned int
> note:   op template: (*si_13(D))[0] = _2;
> note:       stmt 0 (*si_13(D))[0] = _2;
> note:       stmt 1 (*si_13(D))[1] = _4;
> note:       children 0x4bd9e00
> note:   node 0x4bd9e00 (nunits.min=2, nunits.max=2, refcnt=2) vector([2,2]) unsigned int
> note:   op template: _2 = (unsigned int) _1;
> note:       stmt 0 _2 = (unsigned int) _1;
> note:       stmt 1 _4 = (unsigned int) _3;
> note:       children 0x4bda040
> note:   node 0x4bda040 (nunits.min=2, nunits.max=2, refcnt=2) const vector([2,2]) long unsigned int
> note:   op template: _1 = (*di_12(D))[0];
> note:       stmt 0 _1 = (*di_12(D))[0];
> note:       stmt 1 _3 = (*di_12(D))[1];
> note:       load permutation { 0 1 }
> 
> Unlike the previous attempt, this one is deemed profitable.
> 
> The resultant GIMPLE is:
> 
> void convert (const uint64_t[5] * const di, uint32_t[5] * const si)
> {
>   uint32_t * vectp.14;
>   vector([2,2]) unsigned int * vectp_si.13;
>   vector([2,2]) unsigned int vect__6.12;
>   const vector([2,2]) long unsigned int vect__5.11;
>   const uint64_t * vectp.10;
>   const vector([2,2]) long unsigned int * vectp_di.9;
>   uint32_t * vectp.8;
>   vector([2,2]) unsigned int * vectp_si.7;
>   vector([2,2]) unsigned int vect__2.6;
>   const vector([2,2]) long unsigned int vect__1.5;
>   const uint64_t * vectp.4;
>   const vector([2,2]) long unsigned int * vectp_di.3;
>   long unsigned int _1;
>   unsigned int _2;
>   long unsigned int _3;
>   unsigned int _4;
>   long unsigned int _5;
>   unsigned int _6;
>   long unsigned int _7;
>   unsigned int _8;
>   long unsigned int _9;
>   unsigned int _10;
>   vector([2,2]) <signed-boolean:8> slp_mask_20;
>   vector([2,2]) <signed-boolean:8> slp_mask_24;
>   vector([2,2]) <signed-boolean:8> slp_mask_27;
>   vector([2,2]) <signed-boolean:8> slp_mask_31;
> 
>   <bb 2> [local count: 1073741824]:
>   vectp.4_19 = &(*di_12(D))[0];
>   slp_mask_20 = .WHILE_ULT (0, 2, { 0, ... });
>   vect__1.5_21 = .MASK_LOAD (vectp.4_19, 64B, slp_mask_20, { 0, ... });
>   vect__2.6_22 = (vector([2,2]) unsigned int) vect__1.5_21;
>   _1 = (*di_12(D))[0];
>   _2 = (unsigned int) _1;
>   _3 = (*di_12(D))[1];
>   _4 = (unsigned int) _3;
>   vectp.8_23 = &(*si_13(D))[0];
>   slp_mask_24 = .WHILE_ULT (0, 2, { 0, ... });
>   .MASK_STORE (vectp.8_23, 32B, slp_mask_24, vect__2.6_22);
>   vectp.10_26 = &(*di_12(D))[2];
>   slp_mask_27 = .WHILE_ULT (0, 2, { 0, ... });
>   vect__5.11_28 = .MASK_LOAD (vectp.10_26, 64B, slp_mask_27, { 0, ... });
>   vect__6.12_29 = (vector([2,2]) unsigned int) vect__5.11_28;
>   _5 = (*di_12(D))[2];
>   _6 = (unsigned int) _5;
>   _7 = (*di_12(D))[3];
>   _8 = (unsigned int) _7;
>   vectp.14_30 = &(*si_13(D))[2];
>   slp_mask_31 = .WHILE_ULT (0, 2, { 0, ... });
>   .MASK_STORE (vectp.14_30, 32B, slp_mask_31, vect__6.12_29);
>   _9 = (*di_12(D))[4];
>   _10 = (unsigned int) _9;
>   (*si_13(D))[4] = _10;
>   return;
> }
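
For readers less familiar with the masks in the GIMPLE above, the effect of `.WHILE_ULT (0, 2, ...)` feeding a masked load, conversion and masked store can be modelled in scalar C roughly as follows. This is a sketch of the semantics only; the function names are invented here and the lane count is passed explicitly because the real vector length is not known at compile time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of WHILE_ULT: lane I of MASK is active iff START + I < END.  */
static void
while_ult (uint64_t start, uint64_t end, bool *mask, unsigned nlanes)
{
  for (unsigned i = 0; i < nlanes; i++)
    mask[i] = start + i < end;
}

/* Model of masked load + narrowing conversion + masked store: only
   active lanes are read from SRC and written to DST.  */
static void
masked_convert_store (uint32_t *dst, const uint64_t *src,
		      const bool *mask, unsigned nlanes)
{
  for (unsigned i = 0; i < nlanes; i++)
    if (mask[i])
      dst[i] = (uint32_t) src[i];
}
```

With a four-lane vector, `while_ult (0, 2, mask, 4)` activates only the first two lanes, so memory beyond the two-element group is neither read nor written whatever the runtime vector length is.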
> 
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [RFC 3/9] Implement recording/getting of mask/length for BB SLP
