This message describes how the SVE patches implement full-loop predication/masking. To give an example:
void foo (int *restrict dst, int *restrict src, int count)
{
  for (int i = 0; i < count; ++i)
    dst[i] = src[i] + 1;
}

compiles to:

foo:
        cmp     w2, 0
        ble     .L1
        mov     x3, 0
        sxtw    x2, w2
        whilelo p0.s, xzr, x2
.L3:
        ld1w    z0.s, p0/z, [x1, x3, lsl 2]
        add     z0.s, z0.s, #1
        st1w    z0.s, p0, [x0, x3, lsl 2]
        incw    x3
        whilelo p0.s, x3, x2
        bne     .L3
.L1:
        ret

Here p0 is a predicate register, z0 is a vector register, and /z indicates
that inactive elements of the result should be set to zero.

WHILELO P0.S, Xa, Xb creates a predicate in which element I is true if
Xa + J < Xb for all 0 <= J <= I.  (This means for example that there are
no true elements after a false one.)  WHILELO also sets the flags based
on the contents of the predicate.

The vectoriser changes are relatively simple.  We just need to add a
loop-wide mask that is used in the following contexts:

(1) to predicate any statement with potential side effects, including
    most obviously loads and stores.  (This should also include
    floating-point arithmetic when -ffast-math isn't specified; I hope
    to fix that in the coming days.)

(2) to mask the result of vectorised conditions, such as those that feed
    IFN_MASK_LOADs and IFN_MASK_STOREs.  (An example loop of this kind
    is sketched below.)

(3) to predicate the accumulation of reduction inputs.  For this we use
    new IFN_COND_* internal functions that conditionally apply an
    operation to elements of the first vector operand.  E.g.:

      R = IFN_COND_ADD (COND, A, B)

    implements:

      R[I] = COND[I] ? A[I] + B[I] : A[I]

    (This is also what the predicated floating-point operations in (1)
    are likely to use.)

If a vectorised statement needs N copies, we need to apply N-1 levels of
unpacks on the loop mask to get the appropriate mask input.  For example,
if a statement needs 4 copies, we'd have:

                          loop mask
                         /         \
              unpack high           unpack low
               /       \             /       \
     unpack high   unpack low   unpack high   unpack low
          |             |            |             |
        copy 0        copy 1       copy 2        copy 3

We record during the analysis phase how many levels of unpacks are needed
and before the transform phase create SSA names for each node of the
unpack tree.

(Note that for SVE it would often be better to keep the number of elements
the same for all statements and use unpacked data where necessary.  That
isn't implemented yet though.)

The WHILELO itself is represented as a new internal function:
IFN_WHILE_ULT.  The prototype is:

  M = IFN_WHILE_ULT (START, LIMIT, ZERO)

where START and LIMIT are the two scalar operands described above and
where ZERO is the zero constant for the type of M.  This zero argument is
needed to distinguish WHILEs for different element sizes and element
counts; without it, any call with the same START and LIMIT would appear
to be the same operation.  The test of M for the branch-back condition
uses a plain EQ/NE_EXPR, in the same way as for optimize_mask_stores.
(A rough scalar model of IFN_WHILE_ULT and IFN_COND_ADD is given below.)

Using fully-masked loops means that it's possible and worthwhile to
vectorise something whose trip count is (or might be) less than the
vectorisation factor.

Although vector-length agnostic output is the default, the SVE port does
allow the user to specify a particular vector length on the command line
if they want to.  One particular area where this makes a difference is
alignment.  It doesn't really make sense to optimise for alignment when
the vector length is unknown, especially since SVE vectors don't need to
be a power of 2 in size.  However, when a particular vector length is
specified there is the option of trying to align pointers to that length,
like we do for fixed-length vector architectures.  (There's no guarantee
that this is a win, but the transformation is at least possible.)
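To make context (2) above a bit more concrete, here is the kind of source
loop I have in mind (the function name "bar" and the loop itself are just
an illustration, not taken from the patches or testsuite):

void bar (int *restrict dst, int *restrict src,
          int *restrict cond, int count)
{
  for (int i = 0; i < count; ++i)
    if (cond[i])
      dst[i] = src[i] + 1;
}

After if-conversion the guarded load of src[i] and store to dst[i] become
an IFN_MASK_LOAD and IFN_MASK_STORE driven by the vectorised "cond[i] != 0"
result, and with full-loop masking that result is additionally masked by
the loop mask, so that elements beyond the trip count stay inactive.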
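Also purely for illustration, the element-wise behaviour described above
for IFN_WHILE_ULT and IFN_COND_ADD can be modelled in scalar C roughly as
follows.  The names NELEMS, while_ult, cond_add and sum, and the fixed
value 16, are just for this sketch; for vector-length agnostic code the
real element count is of course only known at run time:

/* Scalar sketch only: NELEMS stands for the number of elements in one
   vector of the mode being used.  */
#define NELEMS 16

/* Model of M = IFN_WHILE_ULT (START, LIMIT, ZERO): element I of M is
   true if START + J < LIMIT for all 0 <= J <= I.  */
static void
while_ult (unsigned long start, unsigned long limit, _Bool mask[NELEMS])
{
  _Bool ok = 1;
  for (unsigned int i = 0; i < NELEMS; ++i)
    {
      ok &= (start + i < limit);
      mask[i] = ok;
    }
}

/* Model of R = IFN_COND_ADD (COND, A, B):
   R[I] = COND[I] ? A[I] + B[I] : A[I].  */
static void
cond_add (const _Bool cond[NELEMS], const int a[NELEMS],
          const int b[NELEMS], int r[NELEMS])
{
  for (unsigned int i = 0; i < NELEMS; ++i)
    r[i] = cond[i] ? a[i] + b[i] : a[i];
}

/* How a fully-masked sum reduction over SRC[0..COUNT-1] accumulates,
   element by element.  */
int
sum (const int *src, unsigned long count)
{
  int acc[NELEMS] = { 0 };
  _Bool mask[NELEMS];
  for (unsigned long i = 0; i < count; i += NELEMS)
    {
      int chunk[NELEMS] = { 0 };
      while_ult (i, count, mask);
      /* Masked load: inactive elements are zeroed, like /z above.  */
      for (unsigned int j = 0; j < NELEMS; ++j)
        if (mask[j])
          chunk[j] = src[i + j];
      /* Predicated accumulation, as in context (3).  */
      cond_add (mask, acc, chunk, acc);
    }
  int total = 0;
  for (unsigned int j = 0; j < NELEMS; ++j)
    total += acc[j];
  return total;
}

The final loop over acc corresponds to reducing the vector accumulator to
a single scalar after the masked loop exits.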
Fully-masked loops support alignment optimisations by using a partial
first iteration.  If a pointer starts X elements beyond an aligned
address, we adjust the pointers down to the aligned address and ensure
that the first mask has X leading false bits.  Any inductions then have
to be adjusted by -X times the step (which is the opposite direction to
when a prologue is used).  For example:

void foo (int *restrict dst, int count)
{
  for (int i = 0; i < count; ++i)
    dst[i] += i;
}

compiled at -O3 for a 512-bit vector length gives:

foo:
        cmp     w1, 0
        ble     .L1
        ubfx    x2, x0, 2, 4            // x2 = X (i.e. misalignment in elems)
        mov     x4, 0
        sub     x0, x0, x2, lsl 2       // x0 = aligned dst
        neg     w3, w2                  // w3 = -X
        add     x1, x2, x1, sxtw        // x1 = X + count
        whilelo p0.s, xzr, x2           // p0 is true for elems < X
        whilelo p1.s, xzr, x1           // p1 is true for elems < X + count
        index   z0.s, w3, #1            // z0 = { -X, -X+1, -X+2, ... }
        bic     p0.b, p1/z, p1.b, p0.b  // p0 = p1 & ~p0
.L3:
        ld1w    z1.s, p0/z, [x0, x4, lsl 2]
        add     z1.s, z0.s, z1.s
        st1w    z1.s, p0, [x0, x4, lsl 2]
        add     z0.s, z0.s, #16
        add     x4, x4, 16
        whilelo p0.s, x4, x1
        bne     .L3
.L1:
        ret

The result of the NEG is the -X described above.  The INDEX then returns
{ -X, -X+1, -X+2, ... }, so that element X onwards is a linear series
{ 0, 1, 2, ... }.  The two WHILEs and BIC before the loop create a
predicate in which element E is set if X <= E < X + count.  This copes
with cases in which the iteration count is small and we have both leading
and trailing false bits in the first (and only) iteration.

Thanks,
Richard