Tamar Christina <[email protected]> writes:
> The following loop
>
> char b = 41;
> int main() {
> signed char a[31];
> #pragma GCC novector
> for (int c = 0; c < 31; ++c)
> a[c] = c * c + c % 5;
> {
> signed char *d = a;
> #pragma GCC novector
> for (int c = 0; c < 31; ++c, b += -16)
> d[c] += b;
> }
> for (int c = 0; c < 31; ++c) {
> signed char e = c * c + c % 5 + 41 + c * -16;
> if (a[c] != e)
> __builtin_abort();
> }
> }
>
> compiled with -O2 -ftree-vectorize -msve-vector-bits=256 -march=armv8.2-a+sve
>
> generates
>
> ptrue p6.b, vl32
> add x2, x2, :lo12:.LC0
> add w5, w5, 16
> ld1rw z25.s, p6/z, [x2]
> strb w5, [x6, #:lo12:.LANCHOR0]
> mov w0, 0
> mov p7.b, p6.b
> mov w2, 31
> index z30.s, #0, #1
> mov z26.s, #5
> mov z27.b, #41
> .L6:
> mov z29.d, z30.d
> movprfx z28, z30
> add z28.b, z28.b, #240
> mad z29.b, p6/m, z28.b, z27.b
> mov w3, w0
> movprfx z31, z30
> smulh z31.s, p6/m, z31.s, z25.s
> add w0, w0, 8
> asr z31.s, z31.s, #1
> msb z31.s, p6/m, z26.s, z30.s
> add z31.b, z31.b, z29.b
> ld1b z29.s, p7/z, [x1]
> cmpne p7.b, p7/z, z31.b, z29.b
> b.any .L15
> add x1, x1, 8
> add z30.s, z30.s, #8
> whilelo p7.s, w0, w2
> b.any .L6
>
> This uses a predicate for the first iteration in which all bits are 1,
> i.e. all lanes active. That causes the cmpne to set the wrong CC flags.
> The second iteration uses
>
> whilelo p7.s, w0, w2
>
> which gives the correct mask layout going forward.
>
> This is due to the CSE'ing code that tries to share predicates as much as
> possible. In aarch64_expand_mov_immediate, during predicate generation, we do:
>
> /* Only the low bit of each .H, .S and .D element is defined,
> so we can set the upper bits to whatever we like. If the
> predicate is all-true in MODE, prefer to set all the undefined
> bits as well, so that we can share a single .B predicate for
> all modes. */
> if (imm == CONSTM1_RTX (mode))
> imm = CONSTM1_RTX (VNx16BImode);
>
> which essentially maps all predicates to .b unless the predicate is created
> outside the immediate expansion code.
>
> It creates the sparse predicate for the VNx4QI data mode from a VNx16QI one
> and then emits a "conversion" operation. The conversion results in a simple
> copy:
>
> mov p7.b, p6.b
>
> because in the data model for partial vectors the upper lanes are *don't
> care*, so computations using this vector are fine. However, for comparisons,
> or any operations that set flags, the predicate value does matter; otherwise
> we get the wrong flags, as above.
>
> Additionally, we don't have a way to distinguish, based on the predicate
> alone, whether the operation is partial or not; i.e. we have no "partial
> predicate" modes.
>
> There are two ways to solve this:
>
> 1. Restore the ptest for partial vector compares. This would slow down the
>    loop though and introduce a second ptrue .s, VL8 predicate.
>
> 2. Disable the sharing of partial vector predicates. This allows us to remove
>    the ptest. Since the ptest would introduce a second predicate here anyway,
>    I'm leaning towards disabling sharing between partial and full predicates.
>
> For the patch I ended up going with 1. The reason is that unsharing the
> predicate ends up pessimizing loops that operate on full vectors only
> (which are the majority of cases).
>
> I also don't fully understand all the places where we depend on this sharing
> (about 3600 ACLE tests fail assembler scans). I suspect one way to deal with
> this is to perform the sharing on GIMPLE using the new isel hook and make RTL
> constant expansion not force it manually, since in GIMPLE it's easy to follow
> compares from a back-edge to figure out whether it's safe to share the
> predicate.
>
> I also tried changing it so that during cond_vec_cbranch_any we perform an AND
> with the proper partial predicate. Unfortunately folding doesn't realize
> that the AND on the latch edge is useless (e.g. whilelo p7.s, w0, w2 makes it
> a no-op) and that the AND should be moved outside the loop. This again is
> something isel should be able to do.
>
> I also tried changing the
>
>   mov p7.b, p6.b
>
> into an AND. While this worked, folding didn't quite see that the AND can be
> eliminated, and this also pessimizes actual register copies.
>
> So for now I just undo ptest elimination for partial vectors for GCC 16 and
> will revisit it for GCC 17 when we extend ptest elimination.
>
> Unless someone has any other ideas?
FWIW, I agree this is the right way to go for GCC 16. (2) on its own
wouldn't make the SVE_I cond_vec_cbranch_any patterns valid, since the
"upper bits are don't care" is a property of the mode definitions.
General RTL optimisers would be allowed to do the same thing as the
aarch64_expand_mov_immediate code that you quote above.
If we wanted the predicates to be fully-defined:
VECTOR_BOOL_MODE (VNx16BI, 16, BI, 2);
VECTOR_BOOL_MODE (VNx8BI, 8, BI, 2);
VECTOR_BOOL_MODE (VNx4BI, 4, BI, 2);
VECTOR_BOOL_MODE (VNx2BI, 2, BI, 2);
would need to become:
VECTOR_BOOL_MODE (VNx16BI, 16, BI, 2);
VECTOR_BOOL_MODE (VNx8BI, 8, B2I, 2);
VECTOR_BOOL_MODE (VNx4BI, 4, B4I, 2);
VECTOR_BOOL_MODE (VNx2BI, 2, B8I, 2);
similarly to MVE. But we'd then probably run into the problem that SVE
wants a different canonicalisation from MVE: MVE wants sign-extension
whereas SVE would want zero-extension.
I'm not sure the trade-off would be worth it. We'd save an AND for
this case, but would introduce unnecessary ANDs in places where the
ISA really does treat the upper bits as don't care.
I suppose we could go for a half-way house in which we do add the
partial predicate modes that you mention above, and require that
(say) VNx2HI is predicated by something in which bit 4x+2 is
guaranteed to be zero. But that sounds difficult to do and would
still introduce unnecessary ANDs in some cases.
Thanks,
Richard
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> PR target/124162
> * config/aarch64/aarch64-sve.md (cond_vec_cbranch_any,
> cond_vec_cbranch_all): Drop partial vectors support.
>
> gcc/testsuite/ChangeLog:
>
> PR target/124162
> * gcc.target/aarch64/sve/vect-early-break-cbranch_16.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md
> b/gcc/config/aarch64/aarch64-sve.md
> index
> 97fd95169590c8e074422c29e81043238e128a4c..019630eb8d21941b0ca718838ae6ef23e8aaedab
> 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -9877,12 +9877,12 @@ (define_expand "<optab><mode>"
> ;; instead and about the ptest.
> (define_expand "<optab><mode>"
> [(set (pc)
> - (unspec:SVE_I
> + (unspec:SVE_FULL_I
> [(if_then_else
> (match_operator 0 "aarch64_comparison_operator"
> [(match_operand:<VPRED> 1 "register_operand")
> - (match_operand:SVE_I 2 "register_operand")
> - (match_operand:SVE_I 3 "aarch64_simd_reg_or_zero")])
> + (match_operand:SVE_FULL_I 2 "register_operand")
> + (match_operand:SVE_FULL_I 3 "aarch64_simd_reg_or_zero")])
> (label_ref (match_operand 4 ""))
> (pc))]
> COND_CBRANCH_CMP))]
> diff --git
> a/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c
> b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..9bb251f5f4766b3cb09f46b08c5f8c3dd38f613e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c
> @@ -0,0 +1,22 @@
> +/* { dg-do run { target aarch64_sve_hw } } */
> +/* { dg-options "-Ofast --param aarch64-autovec-preference=sve-only" } */
> +/* { dg-require-effective-target lp64 } */
> +
> +char b = 41;
> +int main() {
> + signed char a[31];
> +#pragma GCC novector
> + for (int c = 0; c < 31; ++c)
> + a[c] = c * c + c % 5;
> + {
> + signed char *d = a;
> +#pragma GCC novector
> + for (int c = 0; c < 31; ++c, b += -16)
> + d[c] += b;
> + }
> + for (int c = 0; c < 31; ++c) {
> + signed char e = c * c + c % 5 + 41 + c * -16;
> + if (a[c] != e)
> + __builtin_abort();
> + }
> +}