On 07/11/2025 13:53, Richard Biener wrote:
> On Thu, 6 Nov 2025, Christopher Bazley wrote:
>> On 05/11/2025 12:25, Richard Biener wrote:
>>> On Tue, 4 Nov 2025, Christopher Bazley wrote:
>>>> On 28/10/2025 13:29, Richard Biener wrote:
>>>>>> +/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
>>>>>> +   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
>>>>>> +   Masking is only required for the tail, therefore NULL_TREE is returned for
>>>>>> +   every value of INDEX except the last.  Insert any set-up statements before
>>>>>> +   GSI.  */
>>>>>
>>>>> I think it might happen that some vectors are fully masked, say for a conversion from double to int and V2DImode vs. V4SImode: when we have 5 lanes, the conversion likely expects 4 V2DImode inputs to produce 2 V4SImode outputs, but the 4th V2DImode input has no active lanes at all. But maybe you handle this situation differently, I'll see.
>>>>
>>>> You hypothesise a conversion from 4 of V2DI = 8DI (8DI - 5DI = 3DI inactive, and floor(3DI / 2DI) = 1 of 2DI fully masked) to 2 of V4SI = 8SI (8SI - 5SI = 3SI inactive, and floor(3SI / 4SI) = 0 of V4SI fully masked).
>>>>
>>>> I don't think that the "1 of 2DI is fully masked" would ever happen, though, because a group of 5DI would be split long before the vectoriser attempts to materialise masks. The only reason that a group of 5DI might be allowed to survive that long would be if the number of subparts of the natural vector type (the one currently being tried by vect_slp_region) were at least 5, a factor of 5, or both. No such vector types exist.
>>>>
>>>> For example, consider this translation unit:
>>>>
>>>>   #include <stdint.h>
>>>>
>>>>   void convert(const uint64_t (*const di)[5], uint32_t (*const si)[5])
>>>>   {
>>>>     (*si)[0] = (*di)[0];
>>>>     (*si)[1] = (*di)[1];
>>>>     (*si)[2] = (*di)[2];
>>>>     (*si)[3] = (*di)[3];
>>>>     (*si)[4] = (*di)[4];
>>>>   }
>>>>
>>>> It is compiled (with -O2 -ftree-vectorize -march=armv9-a+sve --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable) as:
>>>>
>>>>   convert:
>>>>   .LFB0:
>>>>           .cfi_startproc
>>>>           ldp     q30, q31, [x0]   ; vector load the first four lanes
>>>>           ptrue   p7.d, vl2        ; enable two lanes for vector stores
>>>>           add     x2, x1, 8
>>>>           ldr     x0, [x0, 32]     ; load the fifth lane
>>>>           st1w    z30.d, p7, [x1]  ; store least-significant 32 bits of the first two lanes
>>>>           st1w    z31.d, p7, [x2]  ; store least-significant 32 bits of lanes 3 and 4
>>>>           str     w0, [x1, 16]     ; store least-significant 32 bits of the fifth lane
>>>>           ret
>>>>           .cfi_endproc
>>>>
>>>> The slp2 dump shows:
>>>>
>>>>   note: Starting SLP discovery for
>>>>   note:   (*si_13(D))[0] = _2;
>>>>   note:   (*si_13(D))[1] = _4;
>>>>   note:   (*si_13(D))[2] = _6;
>>>>   note:   (*si_13(D))[3] = _8;
>>>>   note:   (*si_13(D))[4] = _10;
>>>>   note: Created SLP node 0x4bd9e00
>>>>   note: starting SLP discovery for node 0x4bd9e00
>>>>   note: get vectype for scalar type (group size 5): uint32_t
>>>>   note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 5): vector([4,4]) unsigned int
>>>>   note: vectype: vector([4,4]) unsigned int
>>>>   note: nunits = [4,4]
>>>>   missed: Build SLP failed: unrolling required in basic block SLP
>>>>
>>>> This fails the check in vect_record_nunits because the group size of 5 may be larger than the number of subparts of vector([4,4]) unsigned int (which could be as low as 4), and 5 is never an integral multiple of [4,4]. The vectoriser therefore splits the group of 5SI into 4SI + 1SI:
>>>
>>> I had the impression the intent of this series is to _not_ split the groups in this case. On x86 with V2DImode / V4SImode (aka SSE2)
>>
>> Not exactly. Richard Sandiford did tell me (months ago) that this task is about trying to avoid splitting, but I think that is not the whole story.
>>
>> Richard's initial example of a function that is not currently vectorised, but could be with tail-predication, was:
>>
>>   void foo (char *x, int n)
>>   {
>>     x[0] += 1;
>>     x[1] += 2;
>>     x[2] += 1;
>>     x[3] += 2;
>>     x[4] += 1;
>>     x[5] += 2;
>>   }
>>
>> A group of 6QI such as that shown in the function above would not need to be split because each lane is only one byte wide, not a double word (unlike in your example of a conversion from 5DF to 5SI). A group of 6QI can always be stored in one vector of type VNx16QI, because VNx16QI's minimum number of lanes is 16.
>>
>>           ptrue   p7.b, vl6
>>           ptrue   p6.b, all
>>           ld1b    z31.b, p7/z, [x0]  ; one predicated load
>>           adrp    x1, .LC0
>>           add     x1, x1, :lo12:.LC0
>>           ld1rqb  z30.b, p6/z, [x1]
>>           add     z30.b, z31.b, z30.b
>>           st1b    z30.b, p7, [x0]    ; one predicated store
>>           ret
>>
>> If some target architecture provides both VNx8DF and VNx8SI then your example conversion wouldn't result in a split either, because the group size of 5 would certainly be smaller than the number of subparts of vector([8,8]) double, and the fact that 5 is not an integral multiple of [8,8] would be irrelevant. (SVE doesn't provide either type in implementations that I'm aware of.)
>>
>> However, I believe it could also be beneficial to be able to vectorise functions with more than a small number of operations in them (e.g., 26 instead of 6 operations):
>>
>>   void foo (char *x, int n)
>>   {
>>     x[0] += 1;   x[1] += 2;   x[2] += 1;   x[3] += 2;
>>     x[4] += 1;   x[5] += 2;   x[6] += 1;   x[7] += 2;
>>     x[8] += 1;   x[9] += 2;   x[10] += 1;  x[11] += 2;
>>     x[12] += 1;  x[13] += 2;  x[14] += 1;  x[15] += 2;
>>     x[16] += 1;  x[17] += 2;  x[18] += 1;  x[19] += 2;
>>     x[20] += 1;  x[21] += 2;  x[22] += 1;  x[23] += 2;
>>     x[24] += 1;  x[25] += 2;
>>   }
>>
>> Admittedly, such cases are probably rarer than small groups in real code. In such cases, even a group of byte-size operations might need to be split in order to be vectorised. For example, a group of 26QI additions could be vectorised with VNx16QI as 16QI + 10QI. A mask would be generated for both groups:

> Note you say "split" and mean you have two vector operations in the end. But with "split" I refer to the split into two different SLP graphs; usually, even with BB vectorization, a single SLP node can happily represent multiple vectors (with the same vector type) when necessary to fill all lanes.
Thanks for clarifying that.

My original concept of splitting was probably based on something Richard Sandiford said about the desirability of using Advanced SIMD instead of SVE to vectorise the part of a large group that does not require tail-predication. At that time, I was not aware that the target backend could automatically generate Advanced SIMD instructions for WHILE_ULT operations in which the mask has all bits set. I therefore assumed that it would be necessary to split such groups. Splitting is also a natural consequence of the existing control flow in the vect_analyze_slp_instance function.
> But, to agree to that, we still might want to do some splitting, at least on x86 where we have multiple vector sizes (and thus types for the same element type): your first example with 6 lanes could be split into a V4QImode subgraph and a V2QImode subgraph. I don't think x86 has V2QImode, but just make that V4DImode and V2DImode. A variant without the need for splitting would be using V2DImode (with three vectors), or a variant using V4DImode and masking for the second vector.
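For concreteness, I read your 6-lane scenario (with DImode elements, as you suggest) as something like the function below. This is only an illustration I have made up for this discussion, not a test case from the patch, and the name is arbitrary:

  /* Hypothetical group of six DImode additions.  */
  void
  add6 (long long *a)
  {
    a[0] += 1;
    a[1] += 2;
    a[2] += 1;
    a[3] += 2;
    a[4] += 1;
    a[5] += 2;
  }

  /* As I understand your description, the six lanes could be covered by
     three V2DImode vectors (lanes {0,1}, {2,3}, {4,5}), by a V4DImode
     subgraph for lanes 0-3 plus a V2DImode subgraph for lanes 4-5, or by
     two V4DImode vectors with the upper two lanes of the second vector
     masked.  */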
Is your concern that adding !known_ge (nunits.min, group_size) to the conjunction in the vect_analyze_slp_instance function prevents splitting of BB SLP groups known to be smaller than the minimum number of lanes of any of the chosen vector types? Can such groups really be usefully split?
Let's suppose the group size is 6 and the natural vector type (for the current iteration of the outer loop) is V8DI.
Previously, this example would have failed the following test (condition true):
  if (!max_nunits.is_constant (&const_max_nunits)
      || const_max_nunits > group_size)
which would have resulted in "Build SLP failed: store group size not a multiple of the vector size in basic block SLP" and vect_analyze_slp_instance returning false, instead of the group being split.
Any split would only occur when the next iteration of the outer loop selects V4DI, for which !known_ge (nunits.min, group_size) is true with my changes to the function (because 4 < 6). Consequently, the BB SLP block would still be entered, and the const_max_nunits > group_size test would be repeated. This time it would pass (condition false) because 4 <= 6, giving "SLP discovery succeeded but node needs splitting", and the group could be split into V4DImode and V2DImode as you described.
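To make that concrete, here is a deliberately simplified model of how the two checks behave with my changes applied. It uses plain unsigned integers rather than poly_uint64, collapses nunits.min and const_max_nunits into one value per (fixed-width) vector type, and ignores everything else that vect_analyze_slp_instance does, so it only illustrates the reasoning above rather than mirroring the real code:

  #include <stdio.h>

  /* Toy model: GROUP_SIZE scalar stmts and a candidate fixed-width
     vector type with MIN_NUNITS == MAX_NUNITS lanes.  */
  static void
  try_vectype (const char *name, unsigned group_size,
               unsigned min_nunits, unsigned max_nunits)
  {
    /* Models the effect of the added !known_ge (nunits.min, group_size)
       term: if the whole group is known to fit in one vector, the BB SLP
       splitting block is not entered and a single (possibly masked)
       vector can be used.  */
    if (min_nunits >= group_size)
      {
        printf ("%s: group of %u fits in one vector of %u lanes "
                "(mask the tail)\n", name, group_size, min_nunits);
        return;
      }

    /* Models the pre-existing const_max_nunits > group_size test inside
       the BB SLP block.  */
    if (max_nunits > group_size)
      printf ("%s: Build SLP failed (vector wider than group)\n", name);
    else
      printf ("%s: SLP discovery succeeded but node needs splitting\n",
              name);
  }

  int
  main (void)
  {
    try_vectype ("V8DI", 6, 8, 8);  /* Previously failed; one masked vector now.  */
    try_vectype ("V4DI", 6, 4, 4);  /* Split into V4DImode + V2DImode.  */
    return 0;
  }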
> Your AdvSIMD substitution for the larger case could be done by splitting the graph and choosing AdvSIMD for the half that does not need predication but SVE for the other half.
That's what the current implementation does.
> That said, as long as the vector type is the same for each part covering distinct lanes there is no need for splitting. What I'd like to understand is whether the implementation at hand from you for the masking assumes that, if masking is required (we padded lanes), there has to be exactly one hardware vector for each SLP node. Below you say that's an "invariant", so that's a yes?
The vect_analyze_slp_instance function only creates a new SLP instance for BB vectorisation with an unrolling factor not equal to one if the minimum number of lanes for all of the vector types is sufficient to store the whole group. That implies that there is exactly one hardware vector.
The vect_get_num_copies function also relies on that assumption:

  vf *= SLP_TREE_LANES (node);
  tree vectype = SLP_TREE_VECTYPE (node);
  if (known_ge (TYPE_VECTOR_SUBPARTS (vectype), vf))
    return 1;

Otherwise, exact_div would fail in a callee, vect_get_num_vectors.

My current implementation of the vect_slp_get_bb_mask function returns NULL_TREE (i.e. 'no mask') for all vectors of an SLP node with multiple vectors:
  /* vect_get_num_copies only allows a partial vector if it is the only
     vector.  */
  if (nvectors > 1)
    return NULL_TREE;

That means the tail of such a node would not be masked correctly if it needs to be masked at all. Even if that guard were removed, the following statement would also need to be made more complex to handle cases in which the group size is not also the number of active lanes:
  tree end_index = build_int_cst (cmp_type, group_size);
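To illustrate why, here is a rough model in plain C (illustration only; the real code would operate on poly_ints and trees) of how the active lanes would have to be computed per vector if that restriction were lifted. The inputs correspond to your earlier example of 5 DI lanes across V2DImode vectors and to my 26QI example with 128-bit SVE vectors:

  #include <stdio.h>

  /* If an SLP node with GROUP_SIZE active lanes were spread over NVECTORS
     vectors of NUNITS lanes each, the mask for vector INDEX could no longer
     be WHILE_ULT (0, group_size); it would be something like
     WHILE_ULT (index * nunits, group_size), and some vectors could end up
     fully masked.  */
  static void
  mask_per_vector (unsigned group_size, unsigned nunits, unsigned nvectors)
  {
    for (unsigned index = 0; index < nvectors; index++)
      {
        unsigned first = index * nunits;
        unsigned active = group_size > first ? group_size - first : 0;
        if (active > nunits)
          active = nunits;
        printf ("vector %u: %u of %u lanes active%s\n", index, active, nunits,
                active == nunits ? "" : active == 0 ? " (fully masked)"
                : " (tail mask)");
      }
  }

  int
  main (void)
  {
    mask_per_vector (5, 2, 4);   /* 5 DI lanes over four V2DImode vectors.  */
    mask_per_vector (26, 16, 2); /* 26 QI lanes over two 128-bit VNx16QI vectors.  */
    return 0;
  }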
> I'm not sure that will work out for all cases in the end. I'm fine with requiring this initially, but please keep in mind that we'd want to lift this restriction without re-doing most of what you do in a different way.
>
> Richard.

>> void foo (char * x, int n)
>> {
>>   char * vectp.14;
>>   vector([16,16]) char * vectp_x.13;
>>   vector([16,16]) char vect__34.12;
>>   vector([16,16]) char vect__33.11;
>>   char * vectp.10;
>>   vector([16,16]) char * vectp_x.9;
>>   char * vectp.8;
>>   vector([16,16]) char * vectp_x.7;
>>   vector([16,16]) char vect__2.6;
>>   vector([16,16]) char vect__1.5;
>>   char * vectp.4;
>>   vector([16,16]) char * vectp_x.3;
>>   vector([16,16]) <signed-boolean:1> slp_mask_82;
>>   vector([16,16]) <signed-boolean:1> slp_mask_86;
>>   vector([16,16]) <signed-boolean:1> slp_mask_89;
>>   vector([16,16]) <signed-boolean:1> slp_mask_93;
>>
>>   <bb 2> [local count: 1073741824]:
>>   vectp.4_81 = x_54(D);
>>   slp_mask_82 = .WHILE_ULT (0, 16, { 0, ... });
>>   vect__1.5_83 = .MASK_LOAD (vectp.4_81, 8B, slp_mask_82, { 0, ... });
>>   vect__2.6_84 = vect__1.5_83 + { 1, 2, ... };
>>   vectp.8_85 = x_54(D);
>>   slp_mask_86 = .WHILE_ULT (0, 16, { 0, ... });
>>   .MASK_STORE (vectp.8_85, 8B, slp_mask_86, vect__2.6_84);
>>   vectp.10_88 = x_54(D) + 16;
>>   slp_mask_89 = .WHILE_ULT (0, 10, { 0, ... });
>>   vect__33.11_90 = .MASK_LOAD (vectp.10_88, 8B, slp_mask_89, { 0, ... });
>>   vect__34.12_91 = vect__33.11_90 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 0, 0, 0, 0, 0, 0, ... };
>>   vectp.14_92 = x_54(D) + 16;
>>   slp_mask_93 = .WHILE_ULT (0, 10, { 0, ... });
>>   .MASK_STORE (vectp.14_92, 8B, slp_mask_93, vect__34.12_91);
>>   return;
>> }
>>
>> If advantageous, the AArch64 backend later substitutes Advanced SIMD instructions for the group that uses a variable-length vector type with a mask of known, regular length:
>>
>>           mov     x1, x0
>>           mov     w2, 513
>>           ptrue   p6.b, all
>>           ldr     q29, [x0]              ; first load is replaced with Advanced SIMD
>>           mov     z28.h, w2
>>           add     z28.b, z29.b, z28.b    ; first add is done using SVE (z29.b aliases q29)
>>           mov     x3, 10
>>           whilelo p7.b, xzr, x3
>>           adrp    x2, .LC0
>>           add     x2, x2, :lo12:.LC0
>>           ld1rqb  z30.b, p6/z, [x2]
>>           str     q28, [x1], 16          ; first store is replaced with Advanced SIMD (q28 aliases z28.b)
>>           ld1b    z31.b, p7/z, [x1]      ; second load is predicated SVE
>>           add     z30.b, z31.b, z30.b    ; second add is also done using SVE
>>           st1b    z30.b, p7, [x1]        ; second store is predicated SVE
>>           ret
>>
>> With -msve-vector-bits=128 the GIMPLE produced by the vectoriser doesn't specify any masks at all, but instead splits the group of 26 into 16 + 8 + 2:
>>
>> void foo (char * x, int n)
>> {
>>   char * vectp.20;
>>   vector(2) char * vectp_x.19;
>>   vector(2) char vect__50.18;
>>   vector(2) char vect__49.17;
>>   char * vectp.16;
>>   vector(2) char * vectp_x.15;
>>   char * vectp.14;
>>   vector(8) char * vectp_x.13;
>>   vector(8) char vect__34.12;
>>   vector(8) char vect__33.11;
>>   char * vectp.10;
>>   vector(8) char * vectp_x.9;
>>   char * vectp.8;
>>   vector(16) char * vectp_x.7;
>>   vector(16) char vect__2.6;
>>   vector(16) char vect__1.5;
>>   char * vectp.4;
>>   vector(16) char * vectp_x.3;
>>
>>   <bb 2> [local count: 1073741824]:
>>   vectp.4_81 = x_54(D);
>>   vect__1.5_82 = MEM <vector(16) char> [(char *)vectp.4_81];
>>   vect__2.6_84 = vect__1.5_82 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
>>   vectp.4_83 = vectp.4_81 + 10;
>>   vectp.8_85 = x_54(D);
>>   MEM <vector(16) char> [(char *)vectp.8_85] = vect__2.6_84;
>>   vectp.10_87 = x_54(D) + 16;
>>   vect__33.11_88 = MEM <vector(8) char> [(char *)vectp.10_87];
>>   vect__34.12_90 = vect__33.11_88 + { 1, 2, 1, 2, 1, 2, 1, 2 };
>>   vectp.10_89 = x_54(D) + 34;
>>   vectp.14_91 = x_54(D) + 16;
>>   MEM <vector(8) char> [(char *)vectp.14_91] = vect__34.12_90;
>>   vectp.16_93 = x_54(D) + 24;
>>   vect__49.17_94 = MEM <vector(2) char> [(char *)vectp.16_93];
>>   vect__50.18_96 = vect__49.17_94 + { 1, 2 };
>>   vectp.16_95 = x_54(D) + 48;
>>   _49 = MEM[(char *)x_54(D) + 24B];
>>   _50 = _49 + 1;
>>   _51 = MEM[(char *)x_54(D) + 25B];
>>   _52 = _51 + 2;
>>   vectp.20_97 = x_54(D) + 24;
>>   MEM <vector(2) char> [(char *)vectp.20_97] = vect__50.18_96;
>>   return;
>> }
>>
>> The AArch64 backend still uses SVE if available though:
>>
>>           adrp    x1, .LC0
>>           ldr     d29, [x0, 16]          ; load the middle 8 bytes using Advanced SIMD
>>           ptrue   p7.b, vl16             ; this SVE mask is actually for 2 lanes, when interpreted as doubles later!
>>           ldr     q27, [x0]              ; load the first 16 bytes using Advanced SIMD
>>           index   z30.d, #1, #1
>>           ldr     d28, [x1, #:lo12:.LC0]
>>           adrp    x1, .LC1
>>           ldr     q26, [x1, #:lo12:.LC1]
>>           add     v28.8b, v29.8b, v28.8b ; add the middle 8 bytes using Advanced SIMD (v29.8b aliases d29)
>>           add     x1, x0, 24             ; offset to the last two bytes [x0,24] and [x0,25]
>>           add     v26.16b, v27.16b, v26.16b ; add the first 16 bytes using Advanced SIMD (v27.16b aliases q27)
>>           str     d28, [x0, 16]          ; store the middle 8 bytes using Advanced SIMD
>>           str     q26, [x0]              ; store the first 16 bytes using Advanced SIMD
>>           ld1b    z31.d, p7/z, [x1]      ; load the last two bytes using SVE
>>           add     z30.b, z31.b, z30.b
>>           st1b    z30.d, p7, [x1]        ; store the last two bytes using SVE
>>           ret
>>
>> So you see there is only a loose relationship between GIMPLE vector types and instructions chosen by the backend.

>>> you'd have three V2DImode vectors, the last with one masked lane, and two V4SImode vectors, the last with three masked lanes. The 2nd V2DImode -> V4SImode conversion (2nd because two output vectors) expects two V2DImode inputs because it uses two-to-one vector pack instructions. But the 2nd V2DImode input does not exist.
>>
>> I'm not familiar with other CPU architectures, but I suspect they are neither helped nor hindered by my change.
>>
>>> That said, downthread you have comments that only a single vector element is supported when using masked operation (I don't remember exactly where). So you are hoping that the group splitting provides you with a fully "leaf" situation here?
>>
>> I think it's an invariant.
>>
>>> Keep in mind that splitting is not always a good option; with
>>>
>>>   a[0] = b[0]; a[1] = b[2]; a[2] = b[1]; a[3] = b[3];
>>>
>>> we do not split along V2DImode boundaries, but having 2x V2DImode allows the loads to be handled efficiently with shuffling. Similar situations may arise when there are vector parts. That said, if you think the current limitation to leafs does not restrict us design-wise then it's an OK initial limitation.
>>
>> Thanks!
-- 
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/
