> -----Original Message----- > From: Richard Biener <rguent...@suse.de> > Sent: Wednesday, September 3, 2025 8:36 AM > To: Tamar Christina <tamar.christ...@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com> > Subject: RE: [PATCH v2 3/3]middle-end: Use addhn for compression instead of > inclusive OR when reducing comparison values > > On Tue, 2 Sep 2025, Tamar Christina wrote: > > > > -----Original Message----- > > > From: Richard Biener <rguent...@suse.de> > > > Sent: Tuesday, September 2, 2025 3:45 PM > > > To: Tamar Christina <tamar.christ...@arm.com> > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com> > > > Subject: RE: [PATCH v2 3/3]middle-end: Use addhn for compression instead > > > of > > > inclusive OR when reducing comparison values > > > > > > On Tue, 2 Sep 2025, Tamar Christina wrote: > > > > > > > > -----Original Message----- > > > > > From: Richard Biener <rguent...@suse.de> > > > > > Sent: Tuesday, September 2, 2025 3:08 PM > > > > > To: Tamar Christina <tamar.christ...@arm.com> > > > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com> > > > > > Subject: RE: [PATCH v2 3/3]middle-end: Use addhn for compression > > > > > instead > of > > > > > inclusive OR when reducing comparison values > > > > > > > > > > On Tue, 2 Sep 2025, Tamar Christina wrote: > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Richard Biener <rguent...@suse.de> > > > > > > > Sent: Tuesday, September 2, 2025 1:30 PM > > > > > > > To: Tamar Christina <tamar.christ...@arm.com> > > > > > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com> > > > > > > > Subject: Re: [PATCH v2 3/3]middle-end: Use addhn for compression > instead > > > of > > > > > > > inclusive OR when reducing comparison values > > > > > > > > > > > > > > On Tue, 2 Sep 2025, Tamar Christina wrote: > > > > > > > > > > > > > > > Given a sequence such as > > > > > > > > > > > > > > > > int foo () > > > > > > > > { > > > > > > > > #pragma GCC unroll 4 > > > > > > > > for (int i = 0; i < N; i++) > > > > > > > > if (a[i] == 124) > > > > > > > > return 1; > > > > > > > > > > > > > > > > return 0; > > > > > > > > } > > > > > > > > > > > > > > > > where a[i] is long long, we will unroll the loop and use an OR > > > > > > > > reduction > for > > > > > > > > early break on Adv. SIMD. Afterwards the sequence is followed > > > > > > > > by a > > > > > compression > > > > > > > > sequence to compress the 128-bit vectors into 64-bits for use > > > > > > > > by the > > > branch. > > > > > > > > > > > > > > > > However if we have support for add halving and narrowing then we > can > > > > > instead > > > > > > > of > > > > > > > > using an OR, use an ADDHN which will do the combining and > narrowing. > > > > > > > > > > > > > > > > Note that for now I only do the last OR, however if we have > > > > > > > > more than > > > one > > > > > level > > > > > > > > of unrolling we could technically chain them. I will revisit > > > > > > > > this in > another > > > > > > > > up coming early break series, however an unroll of 2 is fairly > > > > > > > > common. > > > > > > > > > > > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, > > > > > > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu > > > > > > > > -m32, -m64 and no issues and about a 10% improvements > > > > > > > > in this sequence for Adv. SIMD. > > > > > > > > > > > > > > > > Ok for master? > > > > > > > > > > > > > > Hmm, so you are replacing the last bitwise OR with a > > > > > > > addhn which produces a "smaller" vector. 
So like > > > > > > > > > > > > > > V4SI tem = V4SI | V4SI; > > > > > > > if (tem != 0) > > > > > > > > > > > > > > -> > > > > > > > > > > > > > > V4HI tem = .VEC_ADD_HALVING_NARROW (V4SI, V4SI); > > > > > > > if (tem != 0) > > > > > > > > > > > > > > whatever 'halving' now stands for (isn't that > .VEC_ADD_HIGH_NARROW?) > > > > > > > > > > > > > > > > > > > Yeah, but it retrieved the high half, so open to suggestion but I > > > > > > think any > > > name > > > > > > would be confusion.. > > > > > > > > > > > > > I can't see how that's in any way faster? (the aarch64 testcases > > > > > > > unfortunately stop matching after the addhn) > > > > > > > > > > > > > > > > > > > Which is intentional. > > > > > > > > > > > > The original code with the ORR is > > > > > > > > > > > > ldp q31, q30, [x0] > > > > > > cmeq v31.2d, v31.2d, v29.2d > > > > > > cmeq v30.2d, v30.2d, v29.2d > > > > > > orr v31.16b, v31.16b, v30.16b > > > > > > umaxp v31.4s, v31.4s, v31.4s > > > > > > fmov x3, d31 > > > > > > > > > > > > because the result of the ORR is a 128-bit vector it needs to be > compressed > > > > > > into 64 bits to be transferred to GPR so the != 0 can be performed. > > > > > > > > > > > > ADDHN does the combination and compression in one step. i.e. > > > > > > > > > > > > Orr + umaxp -> addhn. > > > > > > > > > > Ah, I see. So AdvSIMD lacks a ptest, and instead you go to gpr. > > > > > The above code does a max reduction and the fmov moves the > > > > > scalar reduction result to a GPR? But with addhn you move > > > > > the whole (64bit) vector reg to a GPR? > > > > > > > > > > > > > Indeed. > > > > > > > > > It seems to me that on the vectorizer side it's not so interesting > > > > > to know the target can do addhn but that the target can't do > > > > > a {u,}cmpv4si with EQ? That is, without the patch the vectorizer > > > > > generates a GIMPLE_COND that isn't supported by the target? > > > > > > > > > > We check for cbranch, so currently you say you can do it but emulate > > > > > it with umaxp + fmov? > > > > > > > > Indeed. > > > > > > > > > > > > > > What do you do when there's only one V4SImode vector? You > > > > > could pack (truncate) that to V4HImode, right? Aka > > > > > vec_pack_trunc (x, x) -> V8HImode and the lower V4HI / 64bit is > > > > > then 'x'? > > > > > > > > For only one V4SImode vector we use UMAXP still but with itself, > > > > It essentially throws away half the lanes of the result, but that's ok > > > > Because for a cbranch all we care about is whether any is set or all > > > > is zero. We use UMAXP because that works for vectors of bytes too > > > > whereas narrowing wouldn't. > > > > > > Ah, looked up umaxp and it's a concat + reduce adjacent lanes to > > > one with MAX. We don't have an optab scheme for such instruction > > > either ;) I think the closest are the SAD_EXPR likes, that > > > reduce the number of lanes because they are widening, but those > > > have one input only. umaxp is sth like a reduc_umax_evenodd, > > > one could imagine a reduc_umax_hilo that reduces { a0, a1, a2, a3 } > > > { b0, b1, b2, b3 } as { umax (a0, b0), umax (a1, b1), umax (a2,b2), > > > umax (a3, b3) } instead. That said, I can see how addhn is > > > useful here. > > > > > > > However when SVE is available (just wasn't chosen due to costing) > > > > the goal is to use the SVE comparison predicated to 128-bits instead. > > > > > > > > This is where my up coming patch for vec_cbranch_any and > > > > vec_cbranch_all comes in. 
With only one comparison we can replace > > > > it with an SVE compare + branch, removing the need for the reduction. > > > Hmm, it all feels like somewhat of a delicate target costing thing > > > to me. > > > > > > > > > > > > > Also the inputs are vector bools(?), so you should V_C_E them to > > > > > > > data vectors before "adding" them. And check that they have > > > > > > > a vector mode that's not VnBImode for which I guess the addhn > > > > > > > semantics wouldn't necessarily be good enough. > > > > > > > > > > > > ADDHN can't be used for SVE (and so the optab isn't implemented on > > > > > > SVE > > > > > modes) > > > > > > because SVE's version is even/odd. But for SVE we also don't want > > > > > > this > > > codegen > > > > > > because SVE can branch on the result of the data compare. So we > > > > > > don't want > > > the > > > > > > intermediate forced compression. So this is strictly for Adv. SIMD. > > > > > > > > > > > > > > How would you scale this to workset.length () > 2? I suppose > > > > > > > for an even number reduce to the half element size first, for > > > > > > > odd you could make it even by first reducing two vectors with IOR? > > > > > > > If small, either check for another narrowing addhn operation or > > > > > > > continue with IOR? > > > > > > > > > > > > > > > > > > > Because the instruction can't work on bytes, having > 2 just uses > > > > > > ORR > > > > > > until we have == 2 and then ADDHN. You could use ADDHN for the > > > > > > intermediate steps, but ADDHN hits a limit when you reach bytes. > > > > > > > > > > > > However one big benefit of using the ADDHN even in > 2 cases is that > > > > > > it prevents reassoc from breaking the ORR order we created in the > > > > > > vectorizer, as it can't reassociate them back to a linear form as it > > > > > > does > > > > > > today. > > > > > > > > > > > > And the reason we can't match the ADDHN in the backend is that in > > > > > > order for us to know that the inputs are Boolean vectors we also > > > > > > have > > > > > > to match the compares. This means that the chain is longer than > > > > > > what > > > > > > combine tries since it has to match everything including the > > > > > > if_then_else > > > > > > and the reduction to set the CC. > > > > > > > > > > > > > That said, I still fail to see how addhn reduces the critical > > > > > > > latency? > > > > > > > > > > > > > > > > > > > Because it replaces 2 instructions on the critical reduction path > > > > > > with 1 > > > > > > that is half the latency of the two it replaced. See example above. > > > > > > > > > > So it's good enough to indeed combine the last two elements like your > > > > > patch does. > > > > > > > > > > That said, I still wonder about the trigger - it shouldn't be > > > > > availability of the instruction as I'd think an addhn should be > > > > > no cheaper than a simple bitwise OR. Instead it's that > > > > > cbranch on the wider vector isn't available? > > > > The addhn is actually the same latency/throughput as ORR as they're both simple > > > > vector ALU operations on all cores. But yeah, the reason this is > > > > beneficial > > > > is because of the reduction. So from that point of view it is > > > > beneficial to > > > > always use it if available. > > > Is it? At least only when cbranch on that smaller mode is available? > > Yes, it's what we use in our glibc routines like chrchr etc. which are > > inline assembly.
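(For reference, a scalar model of the property being relied on here: when the
inputs are compare masks whose lanes are either 0 or all-ones, the high half of
the lane-wise sum is nonzero exactly when the OR of the two lanes is nonzero, so
branching on the narrowed vector is equivalent to branching on the OR. This is
only an illustrative sketch, not code from the patch; it assumes 32-bit lanes
narrowed to 16 bits.)

  #include <cassert>
  #include <cstdint>

  /* Per-lane model of ADDHN / IFN_VEC_ADD_HALVING_NARROW on 32-bit lanes:
     add modulo 2^32 and keep the most significant 16 bits.  */
  static uint16_t addhn_lane (uint32_t a, uint32_t b)
  {
    return (uint16_t) ((uint32_t) (a + b) >> 16);
  }

  int main ()
  {
    const uint32_t masks[2] = { 0u, 0xffffffffu };  /* lanes are 0 or -1.  */
    for (uint32_t a : masks)
      for (uint32_t b : masks)
        {
          /* A nonzero narrowed lane <=> a nonzero OR of the two mask lanes,
             so "narrowed vector != 0" and "(a | b) != 0" branch the same way.  */
          assert ((addhn_lane (a, b) != 0) == ((a | b) != 0));
        }
    return 0;
  }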
> > > > > > > > > The reason why I don't think the vectorizer should generate anything > > > > different than the cbranch is that as mentioned above when SVE is > > > > available (which is on all modern Arm produced cores) we can replace > > > > the cbranch with an SVE compare. > > > > > > > > So testing for cbranch is temporary until it's replaced with testing for > > > > Vec_cbranch_any and vec_cbranch_all (and deprecate cbranch) which > > > > allows us more flexibility in the back end. > > > > > > Right. > > > > > > Still an unconditional use of addhn when available looks wrong to me, > > > why'd we have the cbranch check then, anyway? > > > > Because the result of the addhn is still a vector, and the cbranch check is > > there > > to ask if the target can do the vector comparison and branch. The > > vectorizer > doesn't > > particularly care how it does it though. > > Huh, but we ask > > if (direct_optab_handler (cbranch_optab, mode) == CODE_FOR_nothing) > > in this case for V4SI, but now with availability of addhn you > instead generate a branch on V4HI. So what's the point of asking for > V4SI when you use V4HI in the end?
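(To make this concrete, the gating being discussed could look roughly like the
following in vectorizable_early_exit. This is only an illustrative sketch put
together from helpers already used in the posted patch, not the actual
follow-up fix.)

  /* Sketch only: require both the narrowing-add IFN on VECTYPE and a
     cbranch on the narrowed vector mode; otherwise fall back to the
     plain bitwise OR reduction.  */
  bool addhn_supported_p
    = direct_internal_fn_supported_p (IFN_VEC_ADD_HALVING_NARROW, vectype,
                                      OPTIMIZE_FOR_SPEED);
  if (addhn_supported_p)
    {
      auto halfprec = TYPE_PRECISION (TREE_TYPE (vectype)) / 2;
      auto unsignedp = TYPE_UNSIGNED (TREE_TYPE (vectype));
      tree itype = build_nonstandard_integer_type (halfprec, unsignedp);
      tree tmp_type
        = build_vector_type (itype, TYPE_VECTOR_SUBPARTS (vectype));
      tree narrow_type = truth_type_for (tmp_type);
      /* The gcond will test the narrowed vector, so cbranch must be
         available on that mode, not just on VECTYPE's mode.  */
      if (direct_optab_handler (cbranch_optab, TYPE_MODE (narrow_type))
          == CODE_FOR_nothing)
        addhn_supported_p = false;
    }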
Doh.. yes that I agree with is bogus. Fixing. Tamar > > You should check for addhn availability plus verify you can cbranch > on the half-size vector mode I think (and otherwise not use addhn). > > > Some targets need a reduction of sorts (Adv. SIMD), we just utilize the > > fact that > > we can move the lower 64-bit of the vector as a bit pattern in one go to > > GPR. > > > > Other targets don't need any of this and just branch on the CC flags > > already set > > (SVE) > > > > Other targets have to reduce to scalar using some in-order reduction or > > similar. > > > > And some targets may just not be able to. The cbranch is there to abstract > > these > > away. > > > > The ADDHN is essentially asking whether you prefer a vector of Booleans of > > 64- > bit > > or 128-bits for the reduction of >= 2 compares with the expectation that the > vector > > of 64-bit will not be more expensive than that of 128-bits to use. > > > > I mean I could instead add a target hook that asks the target how it wants > > to > combine > > the two elements? But that feels like an abstraction that won't really be > > used by > anyone > > else.. > > Just properly test we can cbranch on the actually used mode? > > Richard. > > > Thanks, > > Tamar > > > > > > Richard. > > > > > > > Thanks, > > > > Tamar > > > > > > > > > > > > > > Richard. > > > > > > > > > > > Thanks, > > > > > > Tamar > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Tamar > > > > > > > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > > > > > > > * internal-fn.def (VEC_ADD_HALVING_NARROW): New. > > > > > > > > * doc/generic.texi: Document it. > > > > > > > > * optabs.def (vec_addh_narrow): New. > > > > > > > > * doc/md.texi: Document it. > > > > > > > > * tree-vect-stmts.cc (vectorizable_early_exit): Use > > > > > > > > addhn if > > > supported. > > > > > > > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > > > > > > > * gcc.target/aarch64/vect-early-break-addhn_1.c: New > > > > > > > > test. > > > > > > > > * gcc.target/aarch64/vect-early-break-addhn_2.c: New > > > > > > > > test. > > > > > > > > * gcc.target/aarch64/vect-early-break-addhn_3.c: New > > > > > > > > test. > > > > > > > > * gcc.target/aarch64/vect-early-break-addhn_4.c: New > > > > > > > > test. > > > > > > > > > > > > > > > > --- > > > > > > > > diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi > > > > > > > > index > > > > > > > > > > > > > > > > d4ac580a7a8b9cd339d26cb97f7eb963f83746a4..ff16ff47bbf45e795df0d230e9 > > > > > > > a885d9d218d9af 100644 > > > > > > > > --- a/gcc/doc/generic.texi > > > > > > > > +++ b/gcc/doc/generic.texi > > > > > > > > @@ -1834,6 +1834,7 @@ a value from @code{enum > annot_expr_kind}, > > > the > > > > > > > third is an @code{INTEGER_CST}. > > > > > > > > @tindex IFN_VEC_WIDEN_MINUS_LO > > > > > > > > @tindex IFN_VEC_WIDEN_MINUS_EVEN > > > > > > > > @tindex IFN_VEC_WIDEN_MINUS_ODD > > > > > > > > +@tindex IFN_VEC_ADD_HALVING_NARROW > > > > > > > > @tindex VEC_UNPACK_HI_EXPR > > > > > > > > @tindex VEC_UNPACK_LO_EXPR > > > > > > > > @tindex VEC_UNPACK_FLOAT_HI_EXPR > > > > > > > > @@ -1956,6 +1957,24 @@ vector of @code{N/2} subtractions. In > the > > > case > > > > > of > > > > > > > > vector are subtracted from the odd @code{N/2} of the first to > produce > > > the > > > > > > > > vector of @code{N/2} subtractions. 
> > > > > > > > > > > > > > > > +@item IFN_VEC_ADD_HALVING_NARROW > > > > > > > > +This internal function performs an addition of two input > > > > > > > > vectors, > > > > > > > > +then extracts the most significant half of each result element > > > > > > > > and > > > > > > > > +narrows it back to the original element width. > > > > > > > > + > > > > > > > > +Concretely, it computes: > > > > > > > > +@code{(bits(a)/2)((a + b) >> bits(a))} > > > > > > > > + > > > > > > > > +where @code{bits(a)} is the width in bits of each input > > > > > > > > element. > > > > > > > > + > > > > > > > > +Its operands are vectors containing the same number of elements > > > > > (@code{N}) > > > > > > > > +of the same integral type. The result is a vector of length > > > > > > > > @code{N}, > with > > > > > > > > +elements of an integral type whose size is half that of the > > > > > > > > input > element > > > > > > > > +type. > > > > > > > > + > > > > > > > > +This operation currently only used for early break result > > > > > > > > compression > > > when > > > > > the > > > > > > > > +result of a vector boolean can be represented as 0 or -1. > > > > > > > > + > > > > > > > > @item VEC_UNPACK_HI_EXPR > > > > > > > > @itemx VEC_UNPACK_LO_EXPR > > > > > > > > These nodes represent unpacking of the high and low parts of > > > > > > > > the > input > > > > > vector, > > > > > > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi > > > > > > > > index > > > > > > > > > > > > > > > > aba93f606eca59d31c103a05b2567fd4f3be55f3..ec0193e4eee079e00168bbaf9 > > > > > > > b28ba8d52e5d464 100644 > > > > > > > > --- a/gcc/doc/md.texi > > > > > > > > +++ b/gcc/doc/md.texi > > > > > > > > @@ -6087,6 +6087,25 @@ vectors with N signed/unsigned elements > of > > > size > > > > > > > S@. Find the absolute > > > > > > > > difference between operands 1 and 2 and widen the resulting > elements. > > > > > > > > Put the N/2 results of size 2*S in the output vector (operand > > > > > > > > 0). > > > > > > > > > > > > > > > > +@cindex @code{vec_addh_narrow@var{m}} instruction pattern > > > > > > > > +@item @samp{vec_addh_narrow@var{m}} > > > > > > > > +Signed or unsigned addition of two input vectors, then > > > > > > > > extracts the > > > > > > > > +most significant half of each result element and narrows it > > > > > > > > back to the > > > > > > > > +original element width. > > > > > > > > + > > > > > > > > +Concretely, it computes: > > > > > > > > +@code{(bits(a)/2)((a + b) >> bits(a))} > > > > > > > > + > > > > > > > > +where @code{bits(a)} is the width in bits of each input > > > > > > > > element. > > > > > > > > + > > > > > > > > +Its operands (@code{1} and @code{2}) are vectors containing the > same > > > > > > > number > > > > > > > > +of signed or unsigned integral elements (@code{N}) of size > > > > > > > > @code{S}. > > > The > > > > > > > > +result (operand @code{0}) is a vector of length @code{N}, with > elements > > > of > > > > > > > > +an integral type whose size is half that of @code{S}. > > > > > > > > + > > > > > > > > +This operation currently only used for early break result > > > > > > > > compression > > > when > > > > > the > > > > > > > > +result of a vector boolean can be represented as 0 or -1. 
> > > > > > > > + > > > > > > > > @cindex @code{vec_addsub@var{m}3} instruction pattern > > > > > > > > @item @samp{vec_addsub@var{m}3} > > > > > > > > Alternating subtract, add with even lanes doing subtract and > > > > > > > > odd > > > > > > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def > > > > > > > > index > > > > > > > > > > > > > > > > d2480a1bf7927476215bc7bb99c0b74197d2b7e9..cb18058d9f48cc0dff96ed4b > > > > > > > 31d0abc9adb67867 100644 > > > > > > > > --- a/gcc/internal-fn.def > > > > > > > > +++ b/gcc/internal-fn.def > > > > > > > > @@ -422,6 +422,8 @@ DEF_INTERNAL_OPTAB_FN > > > > > (COMPLEX_ADD_ROT270, > > > > > > > ECF_CONST, cadd270, binary) > > > > > > > > DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, > binary) > > > > > > > > DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, > > > cmul_conj, > > > > > > > binary) > > > > > > > > DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, > > > binary) > > > > > > > > +DEF_INTERNAL_OPTAB_FN (VEC_ADD_HALVING_NARROW, > ECF_CONST > > > | > > > > > > > ECF_NOTHROW, > > > > > > > > + vec_addh_narrow, binary) > > > > > > > > DEF_INTERNAL_WIDENING_OPTAB_FN (VEC_WIDEN_PLUS, > > > > > > > > ECF_CONST | ECF_NOTHROW, > > > > > > > > first, > > > > > > > > diff --git a/gcc/optabs.def b/gcc/optabs.def > > > > > > > > index > > > > > > > > > > > > > > > > 87a8b85da1592646d0a3447572e842ceb158cd97..b2bedc3692f914c2b80d797 > > > > > > > 2db81b542b32c9eb8 100644 > > > > > > > > --- a/gcc/optabs.def > > > > > > > > +++ b/gcc/optabs.def > > > > > > > > @@ -492,6 +492,7 @@ OPTAB_D (vec_widen_uabd_hi_optab, > > > > > > > "vec_widen_uabd_hi_$a") > > > > > > > > OPTAB_D (vec_widen_uabd_lo_optab, "vec_widen_uabd_lo_$a") > > > > > > > > OPTAB_D (vec_widen_uabd_odd_optab, "vec_widen_uabd_odd_$a") > > > > > > > > OPTAB_D (vec_widen_uabd_even_optab, > "vec_widen_uabd_even_$a") > > > > > > > > +OPTAB_D (vec_addh_narrow_optab, "vec_addh_narrow$a") > > > > > > > > OPTAB_D (vec_addsub_optab, "vec_addsub$a3") > > > > > > > > OPTAB_D (vec_fmaddsub_optab, "vec_fmaddsub$a4") > > > > > > > > OPTAB_D (vec_fmsubadd_optab, "vec_fmsubadd$a4") > > > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break- > addhn_1.c > > > > > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c > > > > > > > > new file mode 100644 > > > > > > > > index > > > > > > > > > > > > > > > > 0000000000000000000000000000000000000000..4ecb187513e525e0cd9b8 > > > > > > > b063e418a75a23c525d > > > > > > > > --- /dev/null > > > > > > > > +++ > > > > > > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c > > > > > > > > @@ -0,0 +1,33 @@ > > > > > > > > +/* { dg-do compile } */ > > > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details > > > > > > > > -std=c99" } > */ > > > > > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */ > > > > > > > > + > > > > > > > > +#define TYPE int > > > > > > > > +#define N 800 > > > > > > > > + > > > > > > > > +#pragma GCC target "+nosve" > > > > > > > > + > > > > > > > > +TYPE a[N]; > > > > > > > > + > > > > > > > > +/* > > > > > > > > +** foo: > > > > > > > > +** ... > > > > > > > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32 > > > > > > > > +** cmeq v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s > > > > > > > > +** cmeq v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s > > > > > > > > +** addhn v[0-9]+.4h, v[0-9]+.4s, v[0-9]+.4s > > > > > > > > +** fmov x[0-9]+, d[0-9]+ > > > > > > > > +** ... 
> > > > > > > > +*/ > > > > > > > > + > > > > > > > > +int foo () > > > > > > > > +{ > > > > > > > > +#pragma GCC unroll 8 > > > > > > > > + for (int i = 0; i < N; i++) > > > > > > > > + if (a[i] == 124) > > > > > > > > + return 1; > > > > > > > > + > > > > > > > > + return 0; > > > > > > > > +} > > > > > > > > + > > > > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALFING_NARROW" "vect" > } } > > > */ > > > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break- > addhn_2.c > > > > > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c > > > > > > > > new file mode 100644 > > > > > > > > index > > > > > > > > > > > > > > > > 0000000000000000000000000000000000000000..d67d0d13d1733935aaf80 > > > > > > > 5e59188eb8155cb5f06 > > > > > > > > --- /dev/null > > > > > > > > +++ > > > > > > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c > > > > > > > > @@ -0,0 +1,33 @@ > > > > > > > > +/* { dg-do compile } */ > > > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details > > > > > > > > -std=c99" } > */ > > > > > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */ > > > > > > > > + > > > > > > > > +#define TYPE long long > > > > > > > > +#define N 800 > > > > > > > > + > > > > > > > > +#pragma GCC target "+nosve" > > > > > > > > + > > > > > > > > +TYPE a[N]; > > > > > > > > + > > > > > > > > +/* > > > > > > > > +** foo: > > > > > > > > +** ... > > > > > > > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32 > > > > > > > > +** cmeq v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d > > > > > > > > +** cmeq v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d > > > > > > > > +** addhn v[0-9]+.2s, v[0-9]+.2d, v[0-9]+.2d > > > > > > > > +** fmov x[0-9]+, d[0-9]+ > > > > > > > > +** ... > > > > > > > > +*/ > > > > > > > > + > > > > > > > > +int foo () > > > > > > > > +{ > > > > > > > > +#pragma GCC unroll 4 > > > > > > > > + for (int i = 0; i < N; i++) > > > > > > > > + if (a[i] == 124) > > > > > > > > + return 1; > > > > > > > > + > > > > > > > > + return 0; > > > > > > > > +} > > > > > > > > + > > > > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALFING_NARROW" "vect" > } } > > > */ > > > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break- > addhn_3.c > > > > > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c > > > > > > > > new file mode 100644 > > > > > > > > index > > > > > > > > > > > > > > > > 0000000000000000000000000000000000000000..57dbc44ae0cdcbcdccd3d8 > > > > > > > dbe98c79713eaf5607 > > > > > > > > --- /dev/null > > > > > > > > +++ > > > > > > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c > > > > > > > > @@ -0,0 +1,33 @@ > > > > > > > > +/* { dg-do compile } */ > > > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details > > > > > > > > -std=c99" } > */ > > > > > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */ > > > > > > > > + > > > > > > > > +#define TYPE short > > > > > > > > +#define N 800 > > > > > > > > + > > > > > > > > +#pragma GCC target "+nosve" > > > > > > > > + > > > > > > > > +TYPE a[N]; > > > > > > > > + > > > > > > > > +/* > > > > > > > > +** foo: > > > > > > > > +** ... > > > > > > > > +** ldp q[0-9]+, q[0-9]+, \[x[0-9]+\], 32 > > > > > > > > +** cmeq v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h > > > > > > > > +** cmeq v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h > > > > > > > > +** addhn v[0-9]+.8b, v[0-9]+.8h, v[0-9]+.8h > > > > > > > > +** fmov x[0-9]+, d[0-9]+ > > > > > > > > +** ... 
> > > > > > > > +*/ > > > > > > > > + > > > > > > > > +int foo () > > > > > > > > +{ > > > > > > > > +#pragma GCC unroll 16 > > > > > > > > + for (int i = 0; i < N; i++) > > > > > > > > + if (a[i] == 124) > > > > > > > > + return 1; > > > > > > > > + > > > > > > > > + return 0; > > > > > > > > +} > > > > > > > > + > > > > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALFING_NARROW" "vect" > } } > > > */ > > > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break- > addhn_4.c > > > > > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c > > > > > > > > new file mode 100644 > > > > > > > > index > > > > > > > > > > > > > > > > 0000000000000000000000000000000000000000..8ad42b22024479283d681 > > > > > > > 4d815ef1dce411d1c72 > > > > > > > > --- /dev/null > > > > > > > > +++ > > > > > > > > b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c > > > > > > > > @@ -0,0 +1,21 @@ > > > > > > > > +/* { dg-do compile } */ > > > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details > > > > > > > > -std=c99" } > */ > > > > > > > > + > > > > > > > > +#define TYPE char > > > > > > > > +#define N 800 > > > > > > > > + > > > > > > > > +#pragma GCC target "+nosve" > > > > > > > > + > > > > > > > > +TYPE a[N]; > > > > > > > > + > > > > > > > > +int foo () > > > > > > > > +{ > > > > > > > > +#pragma GCC unroll 32 > > > > > > > > + for (int i = 0; i < N; i++) > > > > > > > > + if (a[i] == 124) > > > > > > > > + return 1; > > > > > > > > + > > > > > > > > + return 0; > > > > > > > > +} > > > > > > > > + > > > > > > > > +/* { dg-final { scan-tree-dump-not "VEC_ADD_HALFING_NARROW" > > > "vect" } } > > > > > */ > > > > > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > > > > > > > > index > > > > > > > > > > > > > > > > 1545fab364792f75bcc786ba1311b8bdc82edd70..179ce5e0a66b6f88976ffb54 > > > > > > > 4c6874d7bec999a8 100644 > > > > > > > > --- a/gcc/tree-vect-stmts.cc > > > > > > > > +++ b/gcc/tree-vect-stmts.cc > > > > > > > > @@ -12328,7 +12328,7 @@ vectorizable_early_exit (loop_vec_info > > > > > loop_vinfo, > > > > > > > stmt_vec_info stmt_info, > > > > > > > > gimple *orig_stmt = STMT_VINFO_STMT (vect_orig_stmt > (stmt_info)); > > > > > > > > gcond *cond_stmt = as_a <gcond *>(orig_stmt); > > > > > > > > > > > > > > > > - tree cst = build_zero_cst (vectype); > > > > > > > > + tree vectype_out = vectype; > > > > > > > > auto bb = gimple_bb (cond_stmt); > > > > > > > > edge exit_true_edge = EDGE_SUCC (bb, 0); > > > > > > > > if (exit_true_edge->flags & EDGE_FALSE_VALUE) > > > > > > > > @@ -12452,12 +12452,40 @@ vectorizable_early_exit (loop_vec_info > > > > > > > loop_vinfo, stmt_vec_info stmt_info, > > > > > > > > else > > > > > > > > workset.splice (stmts); > > > > > > > > > > > > > > > > + /* See if we support ADDHN and use that for the > > > > > > > > reduction. */ > > > > > > > > + internal_fn ifn = IFN_VEC_ADD_HALVING_NARROW; > > > > > > > > + bool addhn_supported_p > > > > > > > > + = direct_internal_fn_supported_p (ifn, vectype, > > > OPTIMIZE_FOR_SPEED); > > > > > > > > + tree narrow_type = NULL_TREE; > > > > > > > > + if (addhn_supported_p) > > > > > > > > + { > > > > > > > > + /* Calculate the narrowing type for the result. 
*/ > > > > > > > > + auto halfprec = TYPE_PRECISION (TREE_TYPE (vectype)) > > > > > > > > / 2; > > > > > > > > + auto unsignedp = TYPE_UNSIGNED (TREE_TYPE (vectype)); > > > > > > > > + tree itype = build_nonstandard_integer_type (halfprec, > > > unsignedp); > > > > > > > > + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); > > > > > > > > + tree tmp_type = build_vector_type (itype, nunits); > > > > > > > > + narrow_type = truth_type_for (tmp_type); > > > > > > > > + } > > > > > > > > + > > > > > > > > while (workset.length () > 1) > > > > > > > > { > > > > > > > > - new_temp = make_temp_ssa_name (vectype, NULL, > > > "vexit_reduc"); > > > > > > > > tree arg0 = workset.pop (); > > > > > > > > tree arg1 = workset.pop (); > > > > > > > > - new_stmt = gimple_build_assign (new_temp, > > > > > > > > BIT_IOR_EXPR, > > > arg0, arg1); > > > > > > > > + if (addhn_supported_p && workset.length () == 0) > > > > > > > > + { > > > > > > > > + new_stmt = gimple_build_call_internal (ifn, 2, > > > > > > > > arg0, arg1); > > > > > > > > + vectype_out = narrow_type; > > > > > > > > + new_temp = make_temp_ssa_name (vectype_out, NULL, > > > > > > > "vexit_reduc"); > > > > > > > > + gimple_call_set_lhs (as_a <gcall *> (new_stmt), > > > > > > > > new_temp); > > > > > > > > + gimple_call_set_nothrow (as_a <gcall *> > > > > > > > > (new_stmt), true); > > > > > > > > + } > > > > > > > > + else > > > > > > > > + { > > > > > > > > + new_temp = make_temp_ssa_name (vectype_out, NULL, > > > > > > > "vexit_reduc"); > > > > > > > > + new_stmt > > > > > > > > + = gimple_build_assign (new_temp, BIT_IOR_EXPR, > > > > > > > > arg0, > > > arg1); > > > > > > > > + } > > > > > > > > vect_finish_stmt_generation (loop_vinfo, stmt_info, > > > > > > > > new_stmt, > > > > > > > > &cond_gsi); > > > > > > > > workset.quick_insert (0, new_temp); > > > > > > > > @@ -12480,6 +12508,7 @@ vectorizable_early_exit (loop_vec_info > > > > > loop_vinfo, > > > > > > > stmt_vec_info stmt_info, > > > > > > > > > > > > > > > > gcc_assert (new_temp); > > > > > > > > > > > > > > > > + tree cst = build_zero_cst (vectype_out); > > > > > > > > gimple_cond_set_condition (cond_stmt, NE_EXPR, new_temp, > > > > > > > > cst); > > > > > > > > update_stmt (orig_stmt); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Richard Biener <rguent...@suse.de> > > > > > > > SUSE Software Solutions Germany GmbH, > > > > > > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > > > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > > > > > Nuernberg) > > > > > > > > > > > > > > > > -- > > > > > Richard Biener <rguent...@suse.de> > > > > > SUSE Software Solutions Germany GmbH, > > > > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > > > Nuernberg) > > > > > > > > > > -- > > > Richard Biener <rguent...@suse.de> > > > SUSE Software Solutions Germany GmbH, > > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > Nuernberg) > > > > -- > Richard Biener <rguent...@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)