On 05/11/2025 12:25, Richard Biener wrote:
On Tue, 4 Nov 2025, Christopher Bazley wrote:

On 28/10/2025 13:29, Richard Biener wrote:
+/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
+   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
+   Masking is only required for the tail, therefore NULL_TREE is returned for
+   every value of INDEX except the last.  Insert any set-up statements before
+   GSI.  */
I think it might happen that some vectors are fully masked, say for
a conversion from double to int and V2DImode vs. V4SImode when we
have 5 lanes the conversion likely expects 4 V2DImode inputs to
produce 2 V4SImode outputs, but the 4th V2DImode input has no active
lanes at all.

But maybe you handle this situation differently, I'll see.
You hypothesise a conversion from 4 of V2DI = 8DI (8DI - 5DI = 3DI inactive, so floor(3DI / 2DI) = 1 V2DI vector is fully masked) to 2 of V4SI = 8SI (8SI - 5SI = 3SI inactive, so floor(3SI / 4SI) = 0 V4SI vectors are fully masked).

I don't think that the "1 V2DI vector is fully masked" case would ever happen though,
because a group of 5DI would be split long before the vectoriser attempts to
materialise masks. The only reason that a group of 5DI might be allowed to
survive that long would be if the number of subparts of the natural vector
type (the one currently being tried by vect_slp_region) were at least 5, a
factor of 5, or both. No such vector types exist.

For example, consider this translation unit:

#include <stdint.h>

void convert(const uint64_t (*const di)[5], uint32_t (*const si)[5])
{
   (*si)[0] = (*di)[0];
   (*si)[1] = (*di)[1];
   (*si)[2] = (*di)[2];
   (*si)[3] = (*di)[3];
   (*si)[4] = (*di)[4];
}

It is compiled (with -O2 -ftree-vectorize -march=armv9-a+sve
--param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable) as:

convert:
.LFB0:
         .cfi_startproc
         ldp     q30, q31, [x0] ; vector load the first four lanes
         ptrue   p7.d, vl2 ; enable two lanes for vector stores
         add     x2, x1, 8
         ldr     x0, [x0, 32] ; load the fifth lane
         st1w    z30.d, p7, [x1] ; store least-significant 32 bits of the first two lanes
         st1w    z31.d, p7, [x2] ; store least-significant 32 bits of lanes 3 and 4
         str     w0, [x1, 16] ; store least-significant 32 bits of fifth lane
         ret
         .cfi_endproc

The slp2 dump shows:

note:   Starting SLP discovery for
note:     (*si_13(D))[0] = _2;
note:     (*si_13(D))[1] = _4;
note:     (*si_13(D))[2] = _6;
note:     (*si_13(D))[3] = _8;
note:     (*si_13(D))[4] = _10;
note:   Created SLP node 0x4bd9e00
note:   starting SLP discovery for node 0x4bd9e00
note:   get vectype for scalar type (group size 5): uint32_t
note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 5): vector([4,4]) unsigned int
note:   vectype: vector([4,4]) unsigned int
note:   nunits = [4,4]
missed:   Build SLP failed: unrolling required in basic block SLP

This fails the check in vect_record_nunits because the group size of 5 may be
larger than the number of subparts of vector([4,4]) unsigned int (which could
be as low as 4) and 5 is never an integral multiple of [4,4].
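
To spell out the arithmetic, here is an illustrative sketch of that check (not GCC's actual poly_int code; the helper name is made up): a group can only avoid unrolling if, for every possible vector length, it either fits in one vector or fills a whole number of vectors.

#include <stdbool.h>

/* Illustrative only: NUNITS_BASE + NUNITS_STEP * x models the number of
   subparts of a scalable vector type, e.g. vector([4,4]) unsigned int
   has 4 + 4 * x lanes for some unknown x >= 0.  */
bool
group_avoids_unrolling (unsigned group_size,
                        unsigned nunits_base, unsigned nunits_step)
{
  /* The condition must hold for every x; checking small x is enough to
     demonstrate the failure for a group of 5 against vector([4,4]).  */
  for (unsigned x = 0; x < 4; x++)
    {
      unsigned nunits = nunits_base + nunits_step * x;
      if (group_size > nunits          /* does not fit in one vector */
          && group_size % nunits != 0) /* nor fills a whole number of them */
        return false;
    }
  return true;
}

/* group_avoids_unrolling (5, 4, 4) is false (it fails at x == 0, i.e. 4
   lanes), whereas group_avoids_unrolling (6, 16, 16) is true.  */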

The vectoriser therefore splits the group of 5SI into 4SI + 1SI.
I had the impression the intent of this series is to _not_ split the
groups in this case.  On x86 with V2DImode / V4SImode (aka SSE2)
Not exactly. Richard Sandiford did tell me (months ago) that this task is about trying to avoid splitting, but I think that is not the whole story. Richard's initial example of a function that is not currently vectorised, but could be with tail-predication, was:

void
foo (char *x, int n)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}

A group of 6QI such as that shown in the function above would not need to be split, because each lane is only one byte wide, not a doubleword (unlike in your example of a conversion from 5DF to 5SI). A group of 6QI can always be stored in one vector of type VNx16QI, because VNx16QI's minimum number of lanes is 16. It is compiled as:

    ptrue    p7.b, vl6
    ptrue    p6.b, all
    ld1b    z31.b, p7/z, [x0] ; one predicated load
    adrp    x1, .LC0
    add    x1, x1, :lo12:.LC0
    ld1rqb    z30.b, p6/z, [x1]
    add    z30.b, z31.b, z30.b
    st1b    z30.b, p7, [x0] ; one predicated store
    ret

If some target architecture provides both VNx8DF and VNx8SI then your example conversion wouldn't result in a split either because the group size of 5 would certainly be smaller than the number of subparts of vector([8,8]) double and the fact that 5 is not an integral multiple of [8,8] would be irrelevant. (SVE doesn't provide either type in implementations that I'm aware of.)

However, I believe it could also be beneficial to be able to vectorise functions with more than a small number of operations in them (e.g., 26 instead of 6 operations):

void
foo (char *x, int n)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
  x[6] += 1;
  x[7] += 2;
  x[8] += 1;
  x[9] += 2;
  x[10] += 1;
  x[11] += 2;
  x[12] += 1;
  x[13] += 2;
  x[14] += 1;
  x[15] += 2;
  x[16] += 1;
  x[17] += 2;
  x[18] += 1;
  x[19] += 2;
  x[20] += 1;
  x[21] += 2;
  x[22] += 1;
  x[23] += 2;
  x[24] += 1;
  x[25] += 2;
}

Admittedly, such cases are probably rarer than small groups in real code.

In such cases, even a group of byte-sized operations might need to be split in order to be vectorised: e.g., a group of 26QI additions could be vectorised with VNx16QI as 16QI + 10QI. A mask would be generated for both groups:

void foo (char * x, int n)
{
  char * vectp.14;
  vector([16,16]) char * vectp_x.13;
  vector([16,16]) char vect__34.12;
  vector([16,16]) char vect__33.11;
  char * vectp.10;
  vector([16,16]) char * vectp_x.9;
  char * vectp.8;
  vector([16,16]) char * vectp_x.7;
  vector([16,16]) char vect__2.6;
  vector([16,16]) char vect__1.5;
  char * vectp.4;
  vector([16,16]) char * vectp_x.3;
  vector([16,16]) <signed-boolean:1> slp_mask_82;
  vector([16,16]) <signed-boolean:1> slp_mask_86;
  vector([16,16]) <signed-boolean:1> slp_mask_89;
  vector([16,16]) <signed-boolean:1> slp_mask_93;

  <bb 2> [local count: 1073741824]:
  vectp.4_81 = x_54(D);
  slp_mask_82 = .WHILE_ULT (0, 16, { 0, ... });
  vect__1.5_83 = .MASK_LOAD (vectp.4_81, 8B, slp_mask_82, { 0, ... });
  vect__2.6_84 = vect__1.5_83 + { 1, 2, ... };
  vectp.8_85 = x_54(D);
  slp_mask_86 = .WHILE_ULT (0, 16, { 0, ... });
  .MASK_STORE (vectp.8_85, 8B, slp_mask_86, vect__2.6_84);
  vectp.10_88 = x_54(D) + 16;
  slp_mask_89 = .WHILE_ULT (0, 10, { 0, ... });
  vect__33.11_90 = .MASK_LOAD (vectp.10_88, 8B, slp_mask_89, { 0, ... });
  vect__34.12_91 = vect__33.11_90 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 0, 0, 0, 0, 0, 0, ... };
  vectp.14_92 = x_54(D) + 16;
  slp_mask_93 = .WHILE_ULT (0, 10, { 0, ... });
  .MASK_STORE (vectp.14_92, 8B, slp_mask_93, vect__34.12_91);
  return;

}
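
Roughly speaking, each .WHILE_ULT above materialises a mask whose lane I is active iff BASE + I < LIMIT (like SVE's whilelo), and the masked loads and stores then only touch active lanes. A minimal scalar sketch, with made-up names rather than GCC internals or ACLE intrinsics:

#include <stdbool.h>
#include <stddef.h>

/* Scalar model of .WHILE_ULT (BASE, LIMIT, ...): lane I is active iff
   BASE + I < LIMIT, so .WHILE_ULT (0, 10, ...) enables exactly the
   first 10 lanes of a vector with at least 10 lanes.  */
void
while_ult (bool *mask, size_t nlanes, size_t base, size_t limit)
{
  for (size_t i = 0; i < nlanes; i++)
    mask[i] = base + i < limit;
}

/* Scalar model of the masked load/add/store pattern above: inactive
   lanes are neither read nor written, which is what lets the
   10-element tail group use the same vector type as the 16-element
   group without accessing memory beyond the end of the array.  */
void
masked_add (char *x, const char *addend, const bool *mask, size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    if (mask[i])
      x[i] += addend[i];
}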

If advantageous, the AArch64 backend later substitutes Advanced SIMD instructions for the group that uses a variable-length vector type with a mask of known, fixed length:

    mov    x1, x0
    mov    w2, 513
    ptrue    p6.b, all
    ldr    q29, [x0] ; first load is replaced with Advanced SIMD
    mov    z28.h, w2
    add    z28.b, z29.b, z28.b ; first add is done using SVE (z29.b aliases q29)
    mov    x3, 10
    whilelo    p7.b, xzr, x3
    adrp    x2, .LC0
    add    x2, x2, :lo12:.LC0
    ld1rqb    z30.b, p6/z, [x2]
    str    q28, [x1], 16 ; first store is replaced with Advanced SIMD (q28 aliases z28.b)
    ld1b    z31.b, p7/z, [x1] ; second load is predicated SVE
    add    z30.b, z31.b, z30.b ; second add is also done using SVE
    st1b    z30.b, p7, [x1] ; second store is predicated SVE
    ret

With -msve-vector-bits=128 the GIMPLE produced by the vectoriser doesn't specify any masks at all, but instead splits the group of 26 into 16 + 8 + 2:

void foo (char * x, int n)
{
  char * vectp.20;
  vector(2) char * vectp_x.19;
  vector(2) char vect__50.18;
  vector(2) char vect__49.17;
  char * vectp.16;
  vector(2) char * vectp_x.15;
  char * vectp.14;
  vector(8) char * vectp_x.13;
  vector(8) char vect__34.12;
  vector(8) char vect__33.11;
  char * vectp.10;
  vector(8) char * vectp_x.9;
  char * vectp.8;
  vector(16) char * vectp_x.7;
  vector(16) char vect__2.6;
  vector(16) char vect__1.5;
  char * vectp.4;
  vector(16) char * vectp_x.3;

  <bb 2> [local count: 1073741824]:
  vectp.4_81 = x_54(D);
  vect__1.5_82 = MEM <vector(16) char> [(char *)vectp.4_81];
  vect__2.6_84 = vect__1.5_82 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
  vectp.4_83 = vectp.4_81 + 10;
  vectp.8_85 = x_54(D);
  MEM <vector(16) char> [(char *)vectp.8_85] = vect__2.6_84;
  vectp.10_87 = x_54(D) + 16;
  vect__33.11_88 = MEM <vector(8) char> [(char *)vectp.10_87];
  vect__34.12_90 = vect__33.11_88 + { 1, 2, 1, 2, 1, 2, 1, 2 };
  vectp.10_89 = x_54(D) + 34;
  vectp.14_91 = x_54(D) + 16;
  MEM <vector(8) char> [(char *)vectp.14_91] = vect__34.12_90;
  vectp.16_93 = x_54(D) + 24;
  vect__49.17_94 = MEM <vector(2) char> [(char *)vectp.16_93];
  vect__50.18_96 = vect__49.17_94 + { 1, 2 };
  vectp.16_95 = x_54(D) + 48;
  _49 = MEM[(char *)x_54(D) + 24B];
  _50 = _49 + 1;
  _51 = MEM[(char *)x_54(D) + 25B];
  _52 = _51 + 2;
  vectp.20_97 = x_54(D) + 24;
  MEM <vector(2) char> [(char *)vectp.20_97] = vect__50.18_96;
  return;

}

The AArch64 backend still uses SVE if available though:

    adrp    x1, .LC0
    ldr    d29, [x0, 16] ; load the middle 8 bytes using Advanced SIMD
    ptrue    p7.b, vl16 ; this SVE mask is actually for 2 lanes, when interpreted as doubles later!
    ldr    q27, [x0] ; load the first 16 bytes using Advanced SIMD
    index    z30.d, #1, #1
    ldr    d28, [x1, #:lo12:.LC0]
    adrp    x1, .LC1
    ldr    q26, [x1, #:lo12:.LC1]
    add    v28.8b, v29.8b, v28.8b ; add the middle 8 bytes using Advanced SIMD (v29.8b aliases d29)
    add    x1, x0, 24 ; offset to the last two bytes [x0,24] and [x0,25]
    add    v26.16b, v27.16b, v26.16b ; add the first 16 bytes using Advanced SIMD (v27.16b aliases q27)
    str    d28, [x0, 16] ; store the middle 8 bytes using Advanced SIMD
    str    q26, [x0] ; store the first 16 bytes using Advanced SIMD
    ld1b    z31.d, p7/z, [x1] ; load the last two bytes using SVE
    add    z30.b, z31.b, z30.b
    st1b    z30.d, p7, [x1] ; store the last two bytes using SVE
    ret
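
For anyone puzzled by the "actually for 2 lanes" comment above: an SVE predicate has one bit per byte, and the .d forms only consider the lowest bit of each 8-byte granule, so the first 16 byte-bits set by ptrue p7.b, vl16 amount to exactly two active doubleword lanes. A sketch of that counting, with made-up names (not real SVE or ACLE code):

#include <stdbool.h>
#include <stddef.h>

/* Count active lanes when a predicate with one bit per byte is consumed
   by an instruction operating on ELEM_BYTES-wide elements; only the
   first bit of each element-sized granule is significant.  */
size_t
active_lanes (const bool *pred, size_t pred_bytes, size_t elem_bytes)
{
  size_t n = 0;
  for (size_t i = 0; i < pred_bytes; i += elem_bytes)
    if (pred[i])
      n++;
  return n;
}

/* ptrue p7.b, vl16 sets the first 16 byte-bits, so consumed as .d
   (elem_bytes == 8) that gives 16 / 8 == 2 active lanes, matching the
   two remaining bytes at [x0, 24] and [x0, 25].  */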

So you see there is only a loose relationship between GIMPLE vector types and instructions chosen by the backend.

you'd have three V2DImode vectors, the last with one masked lane and two 
V4SImode vectors, the last with three masked lanes.
The 2nd V2DImode -> V4SImode (2nd because two output vectors)
conversion expects two V2DImode inputs because it uses two-to-one
vector pack instructions.  But the 2nd V2DImode input does not exist.

I'm not familiar with other CPU architectures, but I suspect they are neither helped nor hindered by my change.

That said, downthread you have comments that only a single vector
element is supported when using masked operation (I don't remember
exactly where).  So you are hoping that the group splitting provides
you with a fully "leaf" situation here?
I think it's an invariant.
Keep in mind that splitting is not always a good option, like with

  a[0] = b[0];
  a[1] = b[2];
  a[2] = b[1];
  a[3] = b[3];

we do not split along V2DImode boundaries but having 2xV2DImode
allows to handle the loads efficiently with shuffling.  Similar
situations may arise when there's vector parts.

That said, if you think the current limitation to leafs does not
restrict us design-wise then it's an OK initial limitation.
Thanks!

--
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/
