Tamar Christina <[email protected]> writes:
> The following loop
>
> char b = 41;
> int main() {
> signed char a[31];
> #pragma GCC novector
> for (int c = 0; c < 31; ++c)
> a[c] = c * c + c % 5;
> {
> signed char *d = a;
> #pragma GCC novector
> for (int c = 0; c < 31; ++c, b += -16)
> d[c] += b;
> }
> for (int c = 0; c < 31; ++c) {
> signed char e = c * c + c % 5 + 41 + c * -16;
> if (a[c] != e)
> __builtin_abort();
> }
> }
>
> compiled with -O2 -ftree-vectorize -msve-vector-bits=256 -march=armv8.2-a+sve
>
> generates
>
> ptrue p6.b, vl32
> add x2, x2, :lo12:.LC0
> add w5, w5, 16
> ld1rw z25.s, p6/z, [x2]
> strb w5, [x6, #:lo12:.LANCHOR0]
> mov w0, 0
> mov p7.b, p6.b
> mov w2, 31
> index z30.s, #0, #1
> mov z26.s, #5
> mov z27.b, #41
> .L6:
> mov z29.d, z30.d
> movprfx z28, z30
> add z28.b, z28.b, #240
> mad z29.b, p6/m, z28.b, z27.b
> mov w3, w0
> movprfx z31, z30
> smulh z31.s, p6/m, z31.s, z25.s
> add w0, w0, 8
> asr z31.s, z31.s, #1
> msb z31.s, p6/m, z26.s, z30.s
> add z31.b, z31.b, z29.b
> ld1b z29.s, p7/z, [x1]
> cmpne p7.b, p7/z, z31.b, z29.b
> b.any .L15
> add x1, x1, 8
> add z30.s, z30.s, #8
> whilelo p7.s, w0, w2
> b.any .L6
>
> This uses a predicate for the first iteration in which all bits are 1,
> i.e. all lanes active. That causes the cmpne to set the wrong CC flags.
> The second iteration uses
>
> whilelo p7.s, w0, w2
>
> which gives the correct mask layout going forward.
>
> This is due to the CSE'ing code that tries to share predicates as much as
> possible. In aarch64_expand_mov_immediate, during predicate generation, we do:
>
> /* Only the low bit of each .H, .S and .D element is defined,
> so we can set the upper bits to whatever we like. If the
> predicate is all-true in MODE, prefer to set all the undefined
> bits as well, so that we can share a single .B predicate for
> all modes. */
> if (imm == CONSTM1_RTX (mode))
> imm = CONSTM1_RTX (VNx16BImode);
>
> which essentially maps all predicates to .b unless the predicate is created
> outside the immediate expansion code.
>
> It creates the sparse predicate for the VNx4QI data mode from a VNx16QI one
> and then emits a "conversion" operation. The conversion results in a simple
> copy:
>
> mov p7.b, p6.b
>
> because in the data model for partial vectors the upper lanes are *don't
> care*, so computations using this vector are fine. However, for comparisons,
> or any operations that set flags, the predicate value does matter; otherwise
> we get the wrong flags, as above.
>
> Additionally, we don't have a way to distinguish, based on the predicate
> alone, whether the operation is partial or not; i.e. we have no "partial
> predicate" modes.
>
> There are two ways to solve this:
>
> 1. Restore the ptest for partial vector compares. This would slow down the
>    loop though and introduce a second ptrue .s, VL8 predicate.
>
> 2. Disable the sharing of partial vector predicates. This allows us to remove
>    the ptest. Since the ptest would introduce a second predicate here anyway,
>    I'm leaning towards disabling sharing between partial and full predicates.
>
> For the patch I ended up going with 1. The reason is that unsharing the
> predicate ends up pessimizing loops that operate on full vectors only
> (which are the majority of cases).
>
> I also don't fully understand all the places where we depend on this sharing
> (about 3600 ACLE tests fail assembler scans). I suspect one way to deal with
> this is to perform the sharing on GIMPLE using the new isel hook and make RTL
> constant expansion not force it manually, since in GIMPLE it's easy to follow
> compares from a back-edge to figure out whether it's safe to share the
> predicate.
>
> I also tried changing it so that during cond_vec_cbranch_any we perform an AND
> with the proper partial predicate. Unfortunately folding doesn't realize
> that the AND on the latch edge is useless (e.g. whilelo p7.s, w0, w2 makes it
> a no-op) and that the AND should be moved outside the loop. This again is
> something isel should be able to do.
>
> I also tried changing the
>
>   mov p7.b, p6.b
>
> into an AND. While this worked, folding didn't quite see that the AND can be
> eliminated, and this also pessimizes actual register copies.
>
> So for now I just undo ptest elimination for partial vectors for GCC 16 and
> will revisit it for GCC 17 when we extend ptest elimination.
>
> Unless someone has any other ideas?
FWIW, I agree this is the right way to go for GCC 16. (2) on its own
wouldn't make the SVE_I cond_vec_cbranch_any patterns valid, since the
"upper bits are don't care" is a property of the mode definitions.
General RTL optimisers would be allowed to do the same thing as the
aarch64_expand_mov_immediate code that you quote above.
If we wanted the predicates to be fully-defined:
VECTOR_BOOL_MODE (VNx16BI, 16, BI, 2);
VECTOR_BOOL_MODE (VNx8BI, 8, BI, 2);
VECTOR_BOOL_MODE (VNx4BI, 4, BI, 2);
VECTOR_BOOL_MODE (VNx2BI, 2, BI, 2);
would need to become:
VECTOR_BOOL_MODE (VNx16BI, 16, BI, 2);
VECTOR_BOOL_MODE (VNx8BI, 8, B2I, 2);
VECTOR_BOOL_MODE (VNx4BI, 4, B4I, 2);
VECTOR_BOOL_MODE (VNx2BI, 2, B8I, 2);
similarly to MVE. But we'd then probably run into the problem that SVE
wants a different canonicalisation from MVE: MVE wants sign-extension
whereas SVE would want zero-extension.
I'm not sure the trade-off would be worth it. We'd save an AND for
this case, but would introduce unnecessary ANDs in places where the
ISA really does treat the upper bits as don't care.
I suppose we could go for a half-way house in which we do add the
partial predicate modes that you mention above, and require that
(say) VNx2HI is predicated by something in which bit 4x+2 is
guaranteed to be zero. But that sounds difficult to do and would
still introduce unnecessary ANDs in some cases.
Thanks,
Richard
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> PR target/124162
> * config/aarch64/aarch64-sve.md (cond_vec_cbranch_any,
> cond_vec_cbranch_all): Drop partial vectors support.
>
> gcc/testsuite/ChangeLog:
>
> PR target/124162
> * gcc.target/aarch64/sve/vect-early-break-cbranch_16.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md
> b/gcc/config/aarch64/aarch64-sve.md
> index
> 97fd95169590c8e074422c29e81043238e128a4c..019630eb8d21941b0ca718838ae6ef23e8aaedab
> 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -9877,12 +9877,12 @@ (define_expand "<optab><mode>"
> ;; instead and about the ptest.
> (define_expand "<optab><mode>"
> [(set (pc)
> - (unspec:SVE_I
> + (unspec:SVE_FULL_I
> [(if_then_else
> (match_operator 0 "aarch64_comparison_operator"
> [(match_operand:<VPRED> 1 "register_operand")
> - (match_operand:SVE_I 2 "register_operand")
> - (match_operand:SVE_I 3 "aarch64_simd_reg_or_zero")])
> + (match_operand:SVE_FULL_I 2 "register_operand")
> + (match_operand:SVE_FULL_I 3 "aarch64_simd_reg_or_zero")])
> (label_ref (match_operand 4 ""))
> (pc))]
> COND_CBRANCH_CMP))]
> diff --git
> a/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c
> b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..9bb251f5f4766b3cb09f46b08c5f8c3dd38f613e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c
> @@ -0,0 +1,22 @@
> +/* { dg-do run { target aarch64_sve_hw } } */
> +/* { dg-options "-Ofast --param aarch64-autovec-preference=sve-only" } */
> +/* { dg-require-effective-target lp64 } */
> +
> +char b = 41;
> +int main() {
> + signed char a[31];
> +#pragma GCC novector
> + for (int c = 0; c < 31; ++c)
> + a[c] = c * c + c % 5;
> + {
> + signed char *d = a;
> +#pragma GCC novector
> + for (int c = 0; c < 31; ++c, b += -16)
> + d[c] += b;
> + }
> + for (int c = 0; c < 31; ++c) {
> + signed char e = c * c + c % 5 + 41 + c * -16;
> + if (a[c] != e)
> + __builtin_abort();
> + }
> +}