> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Wednesday, April 16, 2025 9:57 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> Subject: Re: [PATCH]middle-end: fix masking for partial vectors and early
> break [PR119351]
>
> On Wed, 16 Apr 2025, Tamar Christina wrote:
>
> > Hi All,
> >
> > The following testcase shows an incorrect masked codegen:
> >
> > #define N 512
> > #define START 1
> > #define END 505
> >
> > int x[N] __attribute__((aligned(32)));
> >
> > int __attribute__((noipa))
> > foo (void)
> > {
> >   int z = 0;
> >   for (unsigned int i = START; i < END; ++i)
> >     {
> >       z++;
> >       if (x[i] > 0)
> >         continue;
> >
> >       return z;
> >     }
> >   return -1;
> > }
> >
> > notice how there's a continue there instead of a break.  This means we
> > generate a control flow where success stays within the loop iteration:
>
> I think that's just a fancy way of rotating the loop.
>
> So the key issue is that when the exit test is "inverted", aka it
> stays in the loop when true and exits when false, we do
>
>   if (flow_bb_inside_loop_p (LOOP_VINFO_LOOP (loop_vinfo),
>                              exit_true_edge->dest))
>     {
>       new_code = EQ_EXPR;
>       reduc_optab = and_optab;
>       reduc_op = BIT_AND_EXPR;
>       cst = build_minus_one_cst (vectype);
>
> and with PFA with mask and the initial loop mask of { 0, 0, -1, -1 }
> we then exit early and the scalar loop does not correctly handle
> this case (apart from it being a missed optimization).  For the
> regular non-inverted case we use
Indeed.

>
>      auto new_code = NE_EXPR;
>      auto reduc_optab = ior_optab;
>      auto reduc_op = BIT_IOR_EXPR;
>      tree cst = build_zero_cst (vectype);
>
> and that is fine.
>
> >   mask_patt_9.12_46 = vect__1.11_45 > { 0, 0, 0, 0 };
> >   vec_mask_and_47 = mask_patt_9.12_46 & loop_mask_41;
> >   if (vec_mask_and_47 == { -1, -1, -1, -1 })
> >     goto <bb 4>; [41.48%]
> >   else
> >     goto <bb 15>; [58.52%]
> >
> > However, when loop_mask_41 is a partial mask this comparison can lead to an
> > incorrect match.  In this case the mask is:
> >
> >   # loop_mask_41 = PHI <next_mask_63(6), { 0, -1, -1, -1 }(2)>
> >
> > due to peeling for alignment with masking and compiling with
> > -msve-vector-bits=128.
> >
> > At codegen time we generate:
> >
> >         ptrue   p15.s, vl4
> >         ptrue   p7.b, vl1
> >         not     p7.b, p15/z, p7.b
> > .L5:
> >         ld1w    z29.s, p7/z, [x1, x0, lsl 2]
> >         cmpgt   p7.s, p7/z, z29.s, #0
> >         not     p7.b, p15/z, p7.b
> >         ptest   p15, p7.b
> >         b.none  .L2
> >         ...<early exit>...
> >
> > notice how at expand time the basic blocks are inverted and a not is
> > generated.  But the generated not is unmasked (or predicated over an ALL
> > true mask in this case).  This has the unintended side-effect of flipping
> > the results of the inactive lanes (which were zero'd by the cmpgt) into -1,
> > which then incorrectly causes us to not take the branch to .L2.
> >
> > This is happening because the expander has no context about the mask, and
> > since we can't mask a gcond, we do the next best thing which is to mask
> > both operands.
>
> So you make this sound as if it were a bug in the expander because
> "it doesn't know"?  I think a compare against {-1,...} is flawed,
> this case needs to compare against loop_mask, not all-ones, no?
>
> So instead of
>
> >   vec_mask_and_47 = mask_patt_9.12_46 & loop_mask_41;
> >   if (vec_mask_and_47 == { -1, -1, -1, -1 })
>
> do
>
> >   vec_mask_and_47 = mask_patt_9.12_46 & loop_mask_41;
> >   if (vec_mask_and_47 == loop_mask_41)
>
> which is sort-of what you do, of course, just in an odd way (IMO).

Yeah, but the reason I did this is that the RVV vec_len helper
vect_gen_loop_len_mask doesn't give me the mask.  So I guess the question
is what do I use in this case.  I will have to refactor
vect_gen_loop_len_mask to split it into something that can get the mask
and something that does the masking, like SVE.

I did it this way as I was expecting the optimizers to take care of it
anyway.

Do you still want this change?

Thanks,
Tamar

>
> Richard.
>
> > We already mask the compare, but this patch now also masks the constant.
> > In the normal case this means we drop it since {0, ..} & mask = {0, ..},
> > but in the case of a forall comparison we'll keep the mask, allowing the
> > generated code to correctly mask the results.
> >
> > For the above we now generate:
> >
> > .L5:
> >         ld1w    z28.s, p7/z, [x1, x0, lsl 2]
> >         cmpgt   p14.s, p7/z, z28.s, #0
> >         eors    p7.b, p15/z, p7.b, p14.b
> >         b.none  .L2
> >
> > This fixes gromacs with > 1 OpenMP threads and improves performance.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > -m32, -m64 and no issues.
> >
> > Ok for master? and backport to GCC-14?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	PR tree-optimization/119351
> > 	* tree-vect-stmts.cc (vectorizable_early_exit): Mask both operands of
> > 	the gcond for partial masking support.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	PR tree-optimization/119351
> > 	* gcc.target/aarch64/sve/pr119351.c: New test.
> > 	* gcc.target/aarch64/sve/pr119351_run.c: New test.
> >
> > ---
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c b/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..198f7edb0fc01bfc74ae231db7823e9a6f0bc119
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> > @@ -0,0 +1,38 @@
> > +/* Fix for PR119351 alignment peeling with vectors and VLS.  */
> > +/* { dg-do compile } */
> > +/* { dg-options "-Ofast -msve-vector-bits=256 --param aarch64-autovec-preference=sve-only -fdump-tree-vect-details" } */
> > +/* { dg-final { check-function-bodies "**" "" ""} } */
> > +
> > +#define N 512
> > +#define START 1
> > +#define END 505
> > +
> > +int x[N] __attribute__((aligned(32)));
> > +
> > +/*
> > +** foo:
> > +**	...
> > +**	ld1w	z[0-9]+.s, p[0-9]+/z, \[x[0-9], x[0-9], lsl 2\]
> > +**	cmpgt	p[0-9]+.s, p[0-9]+/z, z[0-9]+.s, #0
> > +**	eors	p[0-9]+.b, p[0-9]+/z, p[0-9]+.b, p[0-9]+.b
> > +**	...
> > +*/
> > +
> > +int __attribute__((noipa))
> > +foo (void)
> > +{
> > +  int z = 0;
> > +  for (unsigned int i = START; i < END; ++i)
> > +    {
> > +      z++;
> > +      if (x[i] > 0)
> > +	continue;
> > +
> > +      return z;
> > +    }
> > +  return -1;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > +/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> > +/* { dg-final { scan-tree-dump "Alignment of access forced using peeling" "vect" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr119351_run.c b/gcc/testsuite/gcc.target/aarch64/sve/pr119351_run.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..d36ab0eb7a900504e7dc2266ec5a19d1beeb5123
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr119351_run.c
> > @@ -0,0 +1,20 @@
> > +/* Fix for PR119351 alignment peeling with vectors and VLS.  */
> > +/* { dg-do run { target aarch64_sve_hw } } */
> > +/* { dg-options "-Ofast --param aarch64-autovec-preference=sve-only" } */
> > +/* { dg-additional-options "-msve-vector-bits=256" { target aarch64_sve256_hw } } */
> > +/* { dg-additional-options "-msve-vector-bits=128" { target aarch64_sve128_hw } } */
> > +
> > +#include "pr119351.c"
> > +
> > +int __attribute__ ((optimize (1)))
> > +main (void)
> > +{
> > +  x[0] = 1;
> > +  x[1] = 21;
> > +  x[2] = 39;
> > +  x[3] = 59;
> > +  int res = foo ();
> > +  if (res != 4)
> > +    __builtin_abort ();
> > +  return 0;
> > +}
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 7f874354e75e8d4d3a7196df4e9b559ac641827c..8f67483a943705d89e3f6d744cb2bdb5cf8b44db 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -13781,6 +13781,20 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info,
> >
> >    gcc_assert (new_temp);
> >
> > +  /* We have to mask both operands of the gcond because if one element of the
> > +     mask is inactive and was zero'd we would incorrectly compare true for the
> > +     inactive element.  */
> > +  if (masked_loop_p)
> > +    {
> > +      tree mask
> > +	= vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies, vectype, 0);
> > +      cst
> > +	= prepare_vec_mask (loop_vinfo, TREE_TYPE (mask), mask, cst, &cond_gsi);
> > +    }
> > +  else if (len_loop_p)
> > +    cst = vect_gen_loop_len_mask (loop_vinfo, gsi, &cond_gsi, lens,
> > +				  ncopies, vectype, cst, 0, 1);
> > +
> >    gimple_cond_set_condition (cond_stmt, new_code, new_temp, cst);
> >    update_stmt (orig_stmt);
> >
>
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
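
For readers following the thread, the small standalone sketch below models the
failure mode discussed above.  It is not part of the patch or the vectorizer:
the 4-lane shape, the lane values and the lanes_eq helper are invented for
illustration only, with predicate lanes represented as ints (0 = inactive or
false, -1 = true).  It shows why the inverted-exit gcond's compare against an
all-ones constant misfires under a partial loop mask such as { 0, -1, -1, -1 },
while comparing against the masked constant, as the patched code produces,
behaves as intended.

  /* Minimal model of the early-break gcond; not from the patch.  */
  #include <stdio.h>

  #define LANES 4

  /* Vector-wide equality, i.e. the EQ_EXPR the inverted early-break gcond
     performs on the whole mask.  */
  static int
  lanes_eq (const int *a, const int *b)
  {
    for (int i = 0; i < LANES; i++)
      if (a[i] != b[i])
        return 0;
    return 1;
  }

  int
  main (void)
  {
    /* Partial loop mask from peeling for alignment: lane 0 is inactive.  */
    int loop_mask[LANES] = { 0, -1, -1, -1 };
    /* Masked cmpgt result: every active lane tested true, the inactive lane
       was zeroed by the predicated compare.  */
    int cmp_res[LANES] = { 0, -1, -1, -1 };
    int all_ones[LANES] = { -1, -1, -1, -1 };
    int vec_mask_and[LANES], masked_cst[LANES];

    for (int i = 0; i < LANES; i++)
      {
        vec_mask_and[i] = cmp_res[i] & loop_mask[i]; /* mask_patt & loop_mask */
        masked_cst[i] = all_ones[i] & loop_mask[i];  /* the patch: cst & mask */
      }

    /* Before the patch: stay in the loop only if vec_mask_and == all-ones.
       Lane 0 can never be -1 under a partial mask, so this exits on the first
       iteration even though every active lane passed the test.  */
    printf ("compare against all-ones  : %s\n",
            lanes_eq (vec_mask_and, all_ones) ? "stay in loop" : "exit (wrong)");

    /* With the masked constant (here equal to loop_mask) only the active lanes
       decide, so the vector loop keeps iterating as intended.  */
    printf ("compare against masked cst: %s\n",
            lanes_eq (vec_mask_and, masked_cst)
            ? "stay in loop (correct)" : "exit");
    return 0;
  }

Built with any C compiler, this prints "exit (wrong)" for the compare against
all-ones and "stay in loop (correct)" for the compare against the masked
constant.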