On 12/11/2025 14:38, Richard Biener wrote:
On Tue, 11 Nov 2025, Christopher Bazley wrote:

On 07/11/2025 13:53, Richard Biener wrote:
On Thu, 6 Nov 2025, Christopher Bazley wrote:

On 05/11/2025 12:25, Richard Biener wrote:
On Tue, 4 Nov 2025, Christopher Bazley wrote:

On 28/10/2025 13:29, Richard Biener wrote:
+/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
+   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
+   Masking is only required for the tail, therefore NULL_TREE is returned for
+   every value of INDEX except the last.  Insert any set-up statements before
+   GSI.  */
I think it might happen that some vectors are fully masked. Say, for a
conversion from double to int with V2DImode vs. V4SImode: when we have 5
lanes, the conversion likely expects 4 V2DImode inputs to produce 2 V4SImode
outputs, but the 4th V2DImode input has no active lanes at all.

But maybe you handle this situation differently, I'll see.
You hypothesise a conversion from 4 of V2DI = 8DI (8DI - 5DI = 3DI inactive,
and floor(3DI / 2DI) = 1 of 2DI fully masked) to 2 of V4SI = 8SI (8SI - 5SI =
3SI inactive, and floor(3SI / 4SI) = 0 of V4SI fully masked).
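
The same arithmetic, spelled out as a small standalone program (plain integer
arithmetic only, nothing GCC-specific; the padded size of 8 is taken from the
hypothetical above):

#include <stdio.h>

/* Report how a padded group is covered by vectors of a given width and how
   many of those vectors end up with no active lanes at all.  */
static void
report (const char *mode, unsigned lanes_per_vec, unsigned active,
        unsigned padded)
{
  unsigned nvectors = padded / lanes_per_vec;
  unsigned inactive = padded - active;
  unsigned fully_masked = inactive / lanes_per_vec;
  printf ("%s: %u vectors, %u inactive lanes, %u fully masked vector(s)\n",
          mode, nvectors, inactive, fully_masked);
}

int
main (void)
{
  /* 5 active lanes padded to 8.  */
  report ("V2DI inputs ", 2, 5, 8);  /* 4 vectors, 1 fully masked */
  report ("V4SI outputs", 4, 5, 8);  /* 2 vectors, 0 fully masked */
  return 0;
}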

I don't think that the "1 of 2DI is fully masked" would ever happen though,
because a group of 5DI would be split long before the vectoriser attempts to
materialise masks. The only reason that a group of 5DI might be allowed to
survive that long would be if the number of subparts of the natural vector
type (the one currently being tried by vect_slp_region) were at least 5, a
factor of 5, or both. No such vector types exist.

For example, consider this translation unit:

#include <stdint.h>

void convert(const uint64_t (*const di)[5], uint32_t (*const si)[5])
{
     (*si)[0] = (*di)[0];
     (*si)[1] = (*di)[1];
     (*si)[2] = (*di)[2];
     (*si)[3] = (*di)[3];
     (*si)[4] = (*di)[4];
}

It is compiled (with -O2 -ftree-vectorize -march=armv9-a+sve
--param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable) as:

convert:
.LFB0:
        .cfi_startproc
        ldp     q30, q31, [x0]   ; vector load the first four lanes
        ptrue   p7.d, vl2        ; enable two lanes for vector stores
        add     x2, x1, 8
        ldr     x0, [x0, 32]     ; load the fifth lane
        st1w    z30.d, p7, [x1]  ; store least-significant 32 bits of the first two lanes
        st1w    z31.d, p7, [x2]  ; store least-significant 32 bits of lanes 3 and 4
        str     w0, [x1, 16]     ; store least-significant 32 bits of the fifth lane
        ret
        .cfi_endproc

The slp2 dump shows:

note:   Starting SLP discovery for
note:     (*si_13(D))[0] = _2;
note:     (*si_13(D))[1] = _4;
note:     (*si_13(D))[2] = _6;
note:     (*si_13(D))[3] = _8;
note:     (*si_13(D))[4] = _10;
note:   Created SLP node 0x4bd9e00
note:   starting SLP discovery for node 0x4bd9e00
note:   get vectype for scalar type (group size 5): uint32_t
note:   get_vectype_for_scalar_type: natural type for uint32_t (ignoring
group size 5): vector([4,4]) unsigned int
note:   vectype: vector([4,4]) unsigned int
note:   nunits = [4,4]
missed:   Build SLP failed: unrolling required in basic block SLP

This fails the check in vect_record_nunits because the group size of 5 may be
larger than the number of subparts of vector([4,4]) unsigned int (which could
be as low as 4) and 5 is never an integral multiple of [4,4].
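
To make the failing condition concrete, here is a standalone model of that
check (deliberately not the GCC poly_int API: a scalable vector([4,4]) type is
represented as 4 + 4n lanes for an unknown runtime n >= 0):

#include <stdbool.h>
#include <stdio.h>

/* A scalable lane count: MIN + STEP * n for some unknown runtime n >= 0.  */
struct poly_nunits { unsigned min; unsigned step; };

/* Can we prove the whole group fits in one vector, whatever n turns out to
   be?  The worst case is n == 0.  */
static bool
known_fits_one_vector (unsigned group_size, struct poly_nunits nunits)
{
  return group_size <= nunits.min;
}

/* Can we prove the group is a whole number of vectors, whatever n is?  Only
   provable here when the lane count cannot vary at runtime.  */
static bool
known_whole_vectors (unsigned group_size, struct poly_nunits nunits)
{
  return nunits.step == 0 && group_size % nunits.min == 0;
}

int
main (void)
{
  struct poly_nunits vnx4si = { 4, 4 };  /* vector([4,4]) unsigned int */
  unsigned group_size = 5;

  if (!known_fits_one_vector (group_size, vnx4si)
      && !known_whole_vectors (group_size, vnx4si))
    printf ("Build SLP failed: unrolling required in basic block SLP\n");
  return 0;
}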

The vectoriser therefore splits the group of 5SI into 4SI + 1SI:
I had the impression the intent of this series is to _not_ split the
groups in this case.  On x86 with V2DImode / V4SImode (aka SSE2)
Not exactly. Richard Sandiford did tell me (months ago) that this task is
about trying to avoid splitting, but I think that is not the whole story.
Richard's initial example of a function that is not currently vectorised, but
could be with tail-predication, was:

void
foo (char *x, int n)
{
    x[0] += 1;
    x[1] += 2;
    x[2] += 1;
    x[3] += 2;
    x[4] += 1;
    x[5] += 2;
}

A group of 6QI such as that shown in the function above would not need to be
split because each lane is only one byte wide, not a double word (unlike in
your example of a conversion from 5DF to 5SI). A group of 6QI can always be
stored in one vector of type VNx16QI, because VNx16QI's minimum number of
lanes is 16.

      ptrue    p7.b, vl6
      ptrue    p6.b, all
      ld1b    z31.b, p7/z, [x0] ; one predicated load
      adrp    x1, .LC0
      add    x1, x1, :lo12:.LC0
      ld1rqb    z30.b, p6/z, [x1]
      add    z30.b, z31.b, z30.b
      st1b    z30.b, p7, [x0] ; one predicated store
      ret

If some target architecture provides both VNx8DF and VNx8SI then your example
conversion wouldn't result in a split either, because the group size of 5
would certainly be smaller than the number of subparts of vector([8,8]) double
and the fact that 5 is not an integral multiple of [8,8] would be irrelevant.
(SVE doesn't provide either type in implementations that I'm aware of.)

However, I believe it could also be beneficial to be able to vectorise
functions with more than a small number of operations in them (e.g., 26
instead of 6 operations):

void
foo (char *x, int n)
{
    x[0] += 1;
    x[1] += 2;
    x[2] += 1;
    x[3] += 2;
    x[4] += 1;
    x[5] += 2;
    x[6] += 1;
    x[7] += 2;
    x[8] += 1;
    x[9] += 2;
    x[10] += 1;
    x[11] += 2;
    x[12] += 1;
    x[13] += 2;
    x[14] += 1;
    x[15] += 2;
    x[16] += 1;
    x[17] += 2;
    x[18] += 1;
    x[19] += 2;
    x[20] += 1;
    x[21] += 2;
    x[22] += 1;
    x[23] += 2;
    x[24] += 1;
    x[25] += 2;
}

Admittedly, such cases are probably rarer than small groups in real code.

In such cases, even a group of byte-size operations might need to be split in
order to be vectorised. For example, a group of 26QI additions could be
vectorised with VNx16QI as 16QI + 10QI. A mask would be generated for both
groups:
Note you say "split" and mean you have two vector operations in the end.
But with "split" I refer to the split into two different SLP graphs;
usually, even with BB vectorization, a single SLP node can happily
represent multiple vectors (with the same vector type) when necessary
to fill all lanes.
Thanks for clarifying that.

My original concept of splitting was probably based on something Richard
Sandiford said about the desirability of using Advanced SIMD instead of SVE to
vectorise the part of a large group that does not require tail-predication. At
that time, I was not aware that the target backend could automatically
generate Advanced SIMD instructions for WHILE_ULT operations in which the mask
has all bits set. I therefore assumed that it would be necessary to split such
groups. Splitting is also a natural consequence of the existing control flow
in the vect_analyze_slp_instance function.

But to agree to that we still might want to do some splitting, at least
on x86 where we have multiple vector sizes (and thus types for the
same element type), your first example with 6 lanes could be split
into a V4QImode subgraph and a V2QImode subgraph.  I don't think
x86 has V2QImode, but just make that V4DImode and V2DImode.  A
variant without the need for splitting would be using V2DImode
(with three vectors) or a variant using V4DImode and masking
for the second vector.
Is your concern that adding !known_ge (nunits.min, group_size) to the
conjunction in the vect_analyze_slp_instance function prevents splitting of BB
SLP groups known to be smaller than the minimum number of lanes of any of the
chosen vector types? Can such groups really be usefully split?
If we have 5 lanes and V8DI we can either not split and mask or
split to V4DI and a scalar (a masked V2DI should be left unvectorized).

Your example is almost the same as the one I worked through below (6 lanes and V8DI).

IIRC you adjusted get_vectype_for_scalar_type to get you V8DI for
the 5 lane case while I think currently the function refuses to return
a vector with more lanes than requested.

Yes, exactly. The result would currently be "Build SLP failed: store group size not a multiple of the vector size in basic block SLP" and vect_analyze_slp_instance returning false, instead of the group being split. So the effect of my change is not to prevent the group from being split, but to give it a chance to be vectorised. That doesn't seem problematic for fixed-multi-length ISAs.

That said, my overall concern is with how your changes work on a
fixed-multi-length ISA.

While we iterate over modes as you say above, the actual mode isn't
really enforced for BB vectorization since the number of requested
lanes is allowed to override it.  So I'm not sure how we get to
a proper choice with the above V8DI vs. V4DI + scalar, or with 7
lanes to V8DI vs. 2xV4DI.  Which is why my gut feeling tells me
that some analysis on the unsplit SLP graph is probably necessary
to arrive at an optimal vector mode / split configuration.

An iterative approach should work provided that there's an opportunity to compare the cost of the code generated with VNx16 (16 masked to 5) with the cost of the code generated with V4 (4 + 1) and the cost of the code generated with V2 (2 + 2 + 1).
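
Purely as an illustration of the three alternatives being compared (this is
just lane bookkeeping, not real costs; the decision would of course come from
the target cost model):

#include <stdio.h>

/* How a 5-lane store group is covered under each candidate configuration.  */
static void
decompose (const char *name, unsigned lanes_per_vec, unsigned group_size,
           int masked)
{
  if (masked)
    printf ("%-5s: 1 vector, %u active + %u padding lanes (mask required)\n",
            name, group_size, lanes_per_vec - group_size);
  else
    printf ("%-5s: %u full vector(s) + %u scalar lane(s)\n",
            name, group_size / lanes_per_vec, group_size % lanes_per_vec);
}

int
main (void)
{
  unsigned group_size = 5;
  decompose ("VNx16", 16, group_size, 1);  /* 16 lanes masked down to 5 */
  decompose ("V4", 4, group_size, 0);      /* 4 + 1                     */
  decompose ("V2", 2, group_size, 0);      /* 2 + 2 + 1                 */
  return 0;
}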

One thing it won't do, if I understand correctly, is allow combinations like V4 + V8 + VNx16, but I currently lack the imagination to envisage how those could be more efficient than the current algorithm. In particular, the backend implicitly converts VNx16 + VNx16 to V16 + VNx16 (as previously discussed).

But I think at this point we want to get the basics working without
breaking too much of what currently works fine.

Sounds good to me.

Let's suppose the group size is 6 and the natural vector type (for the current
iteration of the outer loop) is V8DI.

Previously, this example would have failed the following test (condition
true):

      if (!max_nunits.is_constant (&const_max_nunits)
          || const_max_nunits > group_size)

which would have resulted in "Build SLP failed: store group size not a
multiple of the vector size in basic block SLP" and vect_analyze_slp_instance
returning false, instead of the group being split.

Any split would only occur when the next iteration of the outer loop selects
V4DI, for which !known_ge (nunits.min, group_size) is true with my changes to
the function (because 4 < 6). Consequently, the BB SLP block would still be
entered, and the const_max_nunits > group_size test would be repeated. This
time it would pass (condition false) because 4 <= 6, therefore "SLP discovery
succeeded but node needs splitting" and the group could be split into V4DImode
and V2DImode as you described.
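
A standalone model of that walkthrough (again not GCC code; only the constant
nunits case is modelled, and the split itself is not):

#include <stdio.h>

/* Outcome of the checks discussed above for one candidate mode, where
   NUNITS_MIN is the (constant) number of lanes of the candidate vector type
   and GROUP_SIZE the number of scalar stmts in the store group.  */
static void
try_mode (const char *mode, unsigned nunits_min, unsigned group_size,
          int with_patch)
{
  /* With the proposed change, a group that is known to fit in one vector
     skips the "unrolling required" path entirely and is masked instead.  */
  if (with_patch && nunits_min >= group_size)
    {
      printf ("%s: whole group fits in one (masked) vector\n", mode);
      return;
    }
  /* Existing test inside the BB SLP block.  */
  if (nunits_min > group_size)
    printf ("%s: Build SLP failed: store group size not a multiple of the "
            "vector size in basic block SLP\n", mode);
  else
    printf ("%s: SLP discovery succeeded but node needs splitting\n", mode);
}

int
main (void)
{
  unsigned group_size = 6;
  try_mode ("V8DI (before)", 8, group_size, 0);  /* 8 > 6: hard failure       */
  try_mode ("V8DI (after) ", 8, group_size, 1);  /* 8 >= 6: one masked vector */
  try_mode ("V4DI (after) ", 4, group_size, 1);  /* 4 < 6, 4 <= 6: split      */
  return 0;
}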

Your AdvSIMD substitution for the larger case could be done by
splitting the graph and choosing AdvSIMD for the half that does
not need predication but SVE for the other half.
That's what the current implementation does.
That said, as long as the vector type is the same for each
part covering distinct lanes there is no need for splitting.
What I'd like to understand is whether the implementation at hand
from you for the masking assumes that, if masking is required
(we padded lanes), there must be exactly one hardware vector for
each SLP node.  Below you say that's an "invariant", so that's a yes?
The vect_analyze_slp_instance function only creates a new SLP instance for BB
vectorisation with an unrolling factor not equal to one if the minimum number
of lanes for all of the vector types is sufficient to store the whole group.
That implies that there is exactly one hardware vector.

The vect_get_num_copies function also relies on that assumption:

   vf *= SLP_TREE_LANES (node);
   tree vectype = SLP_TREE_VECTYPE (node);
   if (known_ge (TYPE_VECTOR_SUBPARTS (vectype), vf))
     return 1;

Otherwise, exact_div would fail in a callee, vect_get_num_vectors.

My current implementation of the vect_slp_get_bb_mask function returns
NULL_TREE (i.e. 'no mask') for all vectors of an SLP node with multiple
vectors:

   /* vect_get_num_copies only allows a partial vector if it is the only
      vector.  */
   if (nvectors > 1)
     return NULL_TREE;

That means the tail of such a node would not be masked correctly if it needs
to be masked at all. Even if that guard were removed, the following statement
would also need to be made more complex to handle cases in which the group
size is not also the number of active lanes:

tree end_index = build_int_cst (cmp_type, group_size);
Yes, I know.  Which is why I wonder about the overall design how to
[not] represent the padding lanes.

Cases where there is more than one vector and only the last vector is partial can already be represented using my proposed scheme. It would be relatively trivial to update vect_slp_get_bb_mask to handle such cases, but I thought it might be misleading to do so because that would imply such cases can occur, whereas (currently) they cannot.

Would you like me to update vect_slp_get_bb_mask to handle such cases anyway, even though it would be dead code?

--
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/
