On 28/10/2025 13:29, Richard Biener wrote:
+/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
+   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
+   Masking is only required for the tail, therefore NULL_TREE is returned for
+   every value of INDEX except the last.  Insert any set-up statements before
+   GSI.  */
I think it might happen that some vectors are fully masked, say for
a conversion from double to int and V2DImode vs. V4SImode when we
have 5 lanes the conversion likely expects 4 V2DImode inputs to
produce 2 V4SImode outputs, but the 4th V2DImode input has no active
lanes at all.

But maybe you handle this situation differently, I'll see.

You hypothesise a conversion from 4 of V2DI = 8 DI lanes (8 - 5 = 3 lanes inactive, and floor(3/2) = 1 of the V2DI inputs fully masked) to 2 of V4SI = 8 SI lanes (8 - 5 = 3 lanes inactive, and floor(3/4) = 0 of the V4SI outputs fully masked).

I don't think that the "1 of 2DI is fully masked" would ever happen though, because a group of 5DI would be split long before the vectoriser attempts to materialise masks. The only reason that a group of 5DI might be allowed to survive that long would be if the number of subparts of the natural vector type (the one currently being tried by vect_slp_region) were at least 5, a factor of 5, or both. No such vector types exist.

For example, consider this translation unit:

#include <stdint.h>

void convert(const uint64_t (*const di)[5], uint32_t (*const si)[5])
{
  (*si)[0] = (*di)[0];
  (*si)[1] = (*di)[1];
  (*si)[2] = (*di)[2];
  (*si)[3] = (*di)[3];
  (*si)[4] = (*di)[4];
}

With -O2 -ftree-vectorize -march=armv9-a+sve --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable, this is compiled as:

convert:
.LFB0:
        .cfi_startproc
        ldp     q30, q31, [x0] ; vector load the first four lanes
        ptrue   p7.d, vl2 ; enable two lanes for vector stores
        add     x2, x1, 8
        ldr     x0, [x0, 32] ; load the fifth lane
        st1w    z30.d, p7, [x1] ; store least-significant 32 bits of the first two lanes
        st1w    z31.d, p7, [x2] ; store least-significant 32 bits of lanes 3 and 4
        str     w0, [x1, 16] ; store least-significant 32 bits of fifth lane
        ret
        .cfi_endproc

The slp2 dump shows:

note:   Starting SLP discovery for
note:     (*si_13(D))[0] = _2;
note:     (*si_13(D))[1] = _4;
note:     (*si_13(D))[2] = _6;
note:     (*si_13(D))[3] = _8;
note:     (*si_13(D))[4] = _10;
note:   Created SLP node 0x4bd9e00
note:   starting SLP discovery for node 0x4bd9e00
note:   get vectype for scalar type (group size 5): uint32_t
note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 5): vector([4,4]) unsigned int
note:   vectype: vector([4,4]) unsigned int
note:   nunits = [4,4]
missed:   Build SLP failed: unrolling required in basic block SLP

This fails the check in vect_record_nunits because the group size of 5 is larger than the minimum number of subparts of vector([4,4]) unsigned int (which could be as low as 4) and 5 is never an integral multiple of [4,4].

The vectoriser therefore splits the group of 5SI into 4SI + 1SI:

note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 4): vector([4,4]) unsigned int
note:   Splitting SLP group at stmt 4
note:   Split group into 4 and 1
note:   Starting SLP discovery for
note:     (*si_13(D))[0] = _2;
note:     (*si_13(D))[1] = _4;
note:     (*si_13(D))[2] = _6;
note:     (*si_13(D))[3] = _8;
note:   Created SLP node 0x4bd9ec0
note:   starting SLP discovery for node 0x4bd9ec0
note:   get vectype for scalar type (group size 4): uint32_t
note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 4): vector([4,4]) unsigned int
note:   vectype: vector([4,4]) unsigned int
note:   nunits = [4,4]
note:   Build SLP for (*si_13(D))[0] = _2;
note:   Build SLP for (*si_13(D))[1] = _4;
note:   Build SLP for (*si_13(D))[2] = _6;
note:   Build SLP for (*si_13(D))[3] = _8;
note:   vect_is_simple_use: operand (unsigned int) _1, type of def: internal
note:   vect_is_simple_use: operand (unsigned int) _3, type of def: internal
note:   vect_is_simple_use: operand (unsigned int) _5, type of def: internal
note:   vect_is_simple_use: operand (unsigned int) _7, type of def: internal

... which goes well until it looks at the 64-bit inputs:

note:   Created SLP node 0x4bda040
note:   starting SLP discovery for node 0x4bda040
note:   get vectype for scalar type (group size 4): const uint64_t
note:   get_vectype_for_scalar_type: natural type for const uint64_t (ignoring group size 4): const vector([2,2]) long unsigned int
note:   vectype: const vector([2,2]) long unsigned int
note:   nunits = [2,2]
missed:   Build SLP failed: unrolling required in basic block SLP

This fails the check in vect_record_nunits because the group size of 4 may be larger than the number of subparts of vector([2,2]) long unsigned int (which could be as low as 2) and 4 is not necessarily an integral multiple of [2,2] (e.g. the polynomial vector length could be 2+(2*3) = 8 if the vectors are 512-bit).

The vectoriser doesn't give up though. Instead, it falls back to scalars for the external node representing the 64-bit inputs:

note:   Build SLP for _1 = (*di_12(D))[0];
note:   Build SLP for _3 = (*di_12(D))[1];
note:   Build SLP for _5 = (*di_12(D))[2];
note:   Build SLP for _7 = (*di_12(D))[3];
note:   SLP discovery for node 0x4bda040 failed
note:   Building vector operands from scalars
note:   Created SLP node 0x4bda100
note:   SLP discovery for node 0x4bd9f80 succeeded
note:   SLP discovery for node 0x4bd9ec0 succeeded
note:   SLP size 3 vs. limit 16.
note:   Final SLP tree for instance 0x4b174b0:
note:   node 0x4bd9ec0 (nunits.min=4, nunits.max=4, refcnt=2) vector([4,4]) unsigned int
note:   op template: (*si_13(D))[0] = _2;
note:       stmt 0 (*si_13(D))[0] = _2;
note:       stmt 1 (*si_13(D))[1] = _4;
note:       stmt 2 (*si_13(D))[2] = _6;
note:       stmt 3 (*si_13(D))[3] = _8;
note:       children 0x4bd9f80
note:   node 0x4bd9f80 (nunits.min=4, nunits.max=4, refcnt=2) vector([4,4]) unsigned int
note:   op template: _2 = (unsigned int) _1;
note:       stmt 0 _2 = (unsigned int) _1;
note:       stmt 1 _4 = (unsigned int) _3;
note:       stmt 2 _6 = (unsigned int) _5;
note:       stmt 3 _8 = (unsigned int) _7;
note:       children 0x4bda100
note:   node (external) 0x4bda100 (nunits.min=18446744073709551615, nunits.max=1, refcnt=1)
note:       { _1, _3, _5, _7 }

The convert node wants vector([2,2]) long unsigned int (at least two 64-bit values), whose subpart count may not divide the 4 lanes exactly, so the vectoriser falls back to building from scalars:

note:   === vect_slp_analyze_operations ===
note:   ==> examining statement: _2 = (unsigned int) _1;
note:   get_vectype_for_scalar_type: natural type for long unsigned int (ignoring group size 4): vector([2,2]) long unsigned int
note:   inferred vector type vector([2,2]) long unsigned int
missed:   lanes=4 is not divisible by subparts=2.
missed:   incompatible vector types for invariants
note:   get_vectype_for_scalar_type: natural type for long unsigned int (ignoring group size 4): vector([2,2]) long unsigned int
note:   get_vectype_for_scalar_type: natural type for long unsigned int (ignoring group size 4): vector([2,2]) long unsigned int
missed:   not vectorized: relevant stmt not supported: _2 = (unsigned int) _1;
note:   Building vector operands of 0x4bd9f80 from scalars instead
note:   ==> examining statement: (*si_13(D))[0] = _2;
note:   updated vectype of operand 0x4bd9f80 with 4 lanes to vector([4,4]) unsigned int
note:   vect_model_store_cost: aligned.
note:   vect_model_store_cost: inside_cost = 1, prologue_cost = 0 .
note:   vect_prologue_cost_for_slp: node 0x4bd9f80, vector type vector([4,4]) unsigned int, group_size 4
note:   === vect_bb_partition_graph ===
note: ***** Analysis succeeded with vector mode VNx2DI
note: SLPing BB part

However, the vectorisation with mode VNx2DI is not deemed profitable:

note: Costing subgraph:
note: node 0x4bd9ec0 (nunits.min=4, nunits.max=4, refcnt=1) vector([4,4]) unsigned int
note: op template: (*si_13(D))[0] = _2;
note:     stmt 0 (*si_13(D))[0] = _2;
note:     stmt 1 (*si_13(D))[1] = _4;
note:     stmt 2 (*si_13(D))[2] = _6;
note:     stmt 3 (*si_13(D))[3] = _8;
note:     children 0x4bd9f80
note: node (external) 0x4bd9f80 (nunits.min=4, nunits.max=4, refcnt=1) vector([4,4]) unsigned int
note:     stmt 0 _2 = (unsigned int) _1;
note:     stmt 1 _4 = (unsigned int) _3;
note:     stmt 2 _6 = (unsigned int) _5;
note:     stmt 3 _8 = (unsigned int) _7;
note:     children 0x4bda100
note: node (external) 0x4bda100 (nunits.min=18446744073709551615, nunits.max=1, refcnt=1)
note:     { _1, _3, _5, _7 }
note: Cost model analysis:
_2 1 times scalar_store costs 1 in body
_4 1 times scalar_store costs 1 in body
_6 1 times scalar_store costs 1 in body
_8 1 times scalar_store costs 1 in body
_2 1 times vector_store costs 1 in body
node 0x4bd9f80 1 times vec_construct costs 3 in prologue
note: Cost model analysis for part in loop 0:
  Vector cost: 11
  Scalar cost: 4
missed: not vectorized: vectorization is not profitable.
note: ***** The result for vector mode VNx16QI would be the same
note: ***** The result for vector mode VNx8QI would be the same
note: ***** The result for vector mode VNx4QI would be the same

The vectoriser then successfully analyses the same block with VNx2QI:

note:   === vect_analyze_slp ===
note:   Starting SLP discovery for
note:     (*si_13(D))[0] = _2;
note:     (*si_13(D))[1] = _4;
note:     (*si_13(D))[2] = _6;
note:     (*si_13(D))[3] = _8;
note:     (*si_13(D))[4] = _10;
note:   Created SLP node 0x4bd9ec0
note:   starting SLP discovery for node 0x4bd9ec0
note:   get vectype for scalar type (group size 5): uint32_t
note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 5): vector([2,2]) unsigned int
note:   vectype: vector([2,2]) unsigned int
note:   nunits = [2,2]
missed:   Build SLP failed: unrolling required in basic block SLP
note:   Build SLP for (*si_13(D))[0] = _2;
note:   Build SLP for (*si_13(D))[1] = _4;
note:   Build SLP for (*si_13(D))[2] = _6;
note:   Build SLP for (*si_13(D))[3] = _8;
note:   Build SLP for (*si_13(D))[4] = _10;
note:   SLP discovery for node 0x4bd9ec0 failed

It splits the group of 5 into 4 + 1:

note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 4): vector([2,2]) unsigned int
note:   Splitting SLP group at stmt 4
note:   Split group into 4 and 1
note:   Starting SLP discovery for
note:     (*si_13(D))[0] = _2;
note:     (*si_13(D))[1] = _4;
note:     (*si_13(D))[2] = _6;
note:     (*si_13(D))[3] = _8;
note:   Created SLP node 0x4bd9f80
note:   starting SLP discovery for node 0x4bd9f80
note:   get vectype for scalar type (group size 4): uint32_t
note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 4): vector([2,2]) unsigned int
note:   vectype: vector([2,2]) unsigned int
note:   nunits = [2,2]
missed:   Build SLP failed: unrolling required in basic block SLP
note:   Build SLP for (*si_13(D))[0] = _2;
note:   Build SLP for (*si_13(D))[1] = _4;
note:   Build SLP for (*si_13(D))[2] = _6;
note:   Build SLP for (*si_13(D))[3] = _8;
note:   SLP discovery for node 0x4bd9f80 failed

It then splits the group of 4 into 2 + 2:

note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 2): vector([2,2]) unsigned int
note:   Splitting SLP group at stmt 2
note:   Split group into 2 and 2
note:   Starting SLP discovery for
note:     (*si_13(D))[0] = _2;
note:     (*si_13(D))[1] = _4;
note:   Created SLP node 0x4bda100
note:   starting SLP discovery for node 0x4bda100
note:   get vectype for scalar type (group size 2): uint32_t
note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 2): vector([2,2]) unsigned int
note:   vectype: vector([2,2]) unsigned int
note:   nunits = [2,2]
note:   Build SLP for (*si_13(D))[0] = _2;
note:   Build SLP for (*si_13(D))[1] = _4;
note:   vect_is_simple_use: operand (unsigned int) _1, type of def: internal
note:   vect_is_simple_use: operand (unsigned int) _3, type of def: internal
note:   Created SLP node 0x4bd9e00
note:   starting SLP discovery for node 0x4bd9e00
note:   get vectype for scalar type (group size 2): unsigned int
note:   get_vectype_for_scalar_type: natural type for unsigned int (ignoring group size 2): vector([2,2]) unsigned int
note:   vectype: vector([2,2]) unsigned int
note:   nunits = [2,2]
note:   Build SLP for _2 = (unsigned int) _1;
note:   Build SLP for _4 = (unsigned int) _3;
note:   vect_is_simple_use: operand (*di_12(D))[0], type of def: internal
note:   vect_is_simple_use: operand (*di_12(D))[1], type of def: internal
note:   Created SLP node 0x4bda040
note:   starting SLP discovery for node 0x4bda040
note:   get vectype for scalar type (group size 2): const uint64_t
note:   get_vectype_for_scalar_type: natural type for const uint64_t (ignoring group size 2): const vector([2,2]) long unsigned int
note:   vectype: const vector([2,2]) long unsigned int
note:   nunits = [2,2]
note:   Build SLP for _1 = (*di_12(D))[0];
note:   Build SLP for _3 = (*di_12(D))[1];
note:   SLP discovery for node 0x4bda040 succeeded
note:   SLP discovery for node 0x4bd9e00 succeeded
note:   SLP discovery for node 0x4bda100 succeeded
note:   SLP size 3 vs. limit 16.
note:   Final SLP tree for instance 0x4b174b0:
note:   node 0x4bda100 (nunits.min=2, nunits.max=2, refcnt=2) vector([2,2]) unsigned int
note:   op template: (*si_13(D))[0] = _2;
note:       stmt 0 (*si_13(D))[0] = _2;
note:       stmt 1 (*si_13(D))[1] = _4;
note:       children 0x4bd9e00
note:   node 0x4bd9e00 (nunits.min=2, nunits.max=2, refcnt=2) vector([2,2]) unsigned int
note:   op template: _2 = (unsigned int) _1;
note:       stmt 0 _2 = (unsigned int) _1;
note:       stmt 1 _4 = (unsigned int) _3;
note:       children 0x4bda040
note:   node 0x4bda040 (nunits.min=2, nunits.max=2, refcnt=2) const vector([2,2]) long unsigned int
note:   op template: _1 = (*di_12(D))[0];
note:       stmt 0 _1 = (*di_12(D))[0];
note:       stmt 1 _3 = (*di_12(D))[1];
note:       load permutation { 0 1 }

Unlike the previous attempt, this one is deemed profitable.

The resultant GIMPLE is:

void convert (const uint64_t[5] * const di, uint32_t[5] * const si)
{
  uint32_t * vectp.14;
  vector([2,2]) unsigned int * vectp_si.13;
  vector([2,2]) unsigned int vect__6.12;
  const vector([2,2]) long unsigned int vect__5.11;
  const uint64_t * vectp.10;
  const vector([2,2]) long unsigned int * vectp_di.9;
  uint32_t * vectp.8;
  vector([2,2]) unsigned int * vectp_si.7;
  vector([2,2]) unsigned int vect__2.6;
  const vector([2,2]) long unsigned int vect__1.5;
  const uint64_t * vectp.4;
  const vector([2,2]) long unsigned int * vectp_di.3;
  long unsigned int _1;
  unsigned int _2;
  long unsigned int _3;
  unsigned int _4;
  long unsigned int _5;
  unsigned int _6;
  long unsigned int _7;
  unsigned int _8;
  long unsigned int _9;
  unsigned int _10;
  vector([2,2]) <signed-boolean:8> slp_mask_20;
  vector([2,2]) <signed-boolean:8> slp_mask_24;
  vector([2,2]) <signed-boolean:8> slp_mask_27;
  vector([2,2]) <signed-boolean:8> slp_mask_31;

  <bb 2> [local count: 1073741824]:
  vectp.4_19 = &(*di_12(D))[0];
  slp_mask_20 = .WHILE_ULT (0, 2, { 0, ... });
  vect__1.5_21 = .MASK_LOAD (vectp.4_19, 64B, slp_mask_20, { 0, ... });
  vect__2.6_22 = (vector([2,2]) unsigned int) vect__1.5_21;
  _1 = (*di_12(D))[0];
  _2 = (unsigned int) _1;
  _3 = (*di_12(D))[1];
  _4 = (unsigned int) _3;
  vectp.8_23 = &(*si_13(D))[0];
  slp_mask_24 = .WHILE_ULT (0, 2, { 0, ... });
  .MASK_STORE (vectp.8_23, 32B, slp_mask_24, vect__2.6_22);
  vectp.10_26 = &(*di_12(D))[2];
  slp_mask_27 = .WHILE_ULT (0, 2, { 0, ... });
  vect__5.11_28 = .MASK_LOAD (vectp.10_26, 64B, slp_mask_27, { 0, ... });
  vect__6.12_29 = (vector([2,2]) unsigned int) vect__5.11_28;
  _5 = (*di_12(D))[2];
  _6 = (unsigned int) _5;
  _7 = (*di_12(D))[3];
  _8 = (unsigned int) _7;
  vectp.14_30 = &(*si_13(D))[2];
  slp_mask_31 = .WHILE_ULT (0, 2, { 0, ... });
  .MASK_STORE (vectp.14_30, 32B, slp_mask_31, vect__6.12_29);
  _9 = (*di_12(D))[4];
  _10 = (unsigned int) _9;
  (*si_13(D))[4] = _10;
  return;
}

--
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/
