RE: [PATCH v2 3/3]middle-end: Use addhn for compression instead of inclusive OR when reducing comparison values

Tamar Christina Tue, 02 Sep 2025 23:48:54 -0700

> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Wednesday, September 3, 2025 8:36 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> Subject: RE: [PATCH v2 3/3]middle-end: Use addhn for compression instead of
> inclusive OR when reducing comparison values
> 
> On Tue, 2 Sep 2025, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguent...@suse.de>
> > > Sent: Tuesday, September 2, 2025 3:45 PM
> > > To: Tamar Christina <tamar.christ...@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > > > Subject: RE: [PATCH v2 3/3]middle-end: Use addhn for compression instead of
> > > > inclusive OR when reducing comparison values
> > >
> > > On Tue, 2 Sep 2025, Tamar Christina wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <rguent...@suse.de>
> > > > > Sent: Tuesday, September 2, 2025 3:08 PM
> > > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > > > > > Subject: RE: [PATCH v2 3/3]middle-end: Use addhn for compression instead of
> > > > > > inclusive OR when reducing comparison values
> > > > >
> > > > > On Tue, 2 Sep 2025, Tamar Christina wrote:
> > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Richard Biener <rguent...@suse.de>
> > > > > > > Sent: Tuesday, September 2, 2025 1:30 PM
> > > > > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > > > > > > > Subject: Re: [PATCH v2 3/3]middle-end: Use addhn for compression instead of
> > > > > > > > inclusive OR when reducing comparison values
> > > > > > >
> > > > > > > On Tue, 2 Sep 2025, Tamar Christina wrote:
> > > > > > >
> > > > > > > > Given a sequence such as
> > > > > > > >
> > > > > > > > int foo ()
> > > > > > > > {
> > > > > > > > #pragma GCC unroll 4
> > > > > > > >   for (int i = 0; i < N; i++)
> > > > > > > >     if (a[i] == 124)
> > > > > > > >       return 1;
> > > > > > > >
> > > > > > > >   return 0;
> > > > > > > > }
> > > > > > > >
> > > > > > > > > where a[i] is long long, we will unroll the loop and use an OR
> > > > > > > > > reduction for early break on Adv. SIMD.  Afterwards the sequence
> > > > > > > > > is followed by a compression sequence to compress the 128-bit
> > > > > > > > > vectors into 64 bits for use by the branch.
> > > > > > > > >
> > > > > > > > > However, if we have support for add halving and narrowing then,
> > > > > > > > > instead of using an OR, we can use an ADDHN, which does the
> > > > > > > > > combining and narrowing in one step.
> > > > > > > > >
> > > > > > > > > Note that for now I only do the last OR; if we have more than one
> > > > > > > > > level of unrolling we could technically chain them.  I will
> > > > > > > > > revisit this in another upcoming early break series, but an
> > > > > > > > > unroll of 2 is fairly common.
> > > > > > > > >
> > > > > > > > > Bootstrapped and regtested on aarch64-none-linux-gnu,
> > > > > > > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > > > > > > > -m32, -m64 with no issues, and about a 10% improvement
> > > > > > > > > in this sequence for Adv. SIMD.
> > > > > > > >
> > > > > > > > Ok for master?
> > > > > > >
> > > > > > > Hmm, so you are replacing the last bitwise OR with a
> > > > > > > addhn which produces a "smaller" vector.  So like
> > > > > > >
> > > > > > >  V4SI tem = V4SI | V4SI;
> > > > > > >  if (tem != 0)
> > > > > > >
> > > > > > > ->
> > > > > > >
> > > > > > >  V4HI tem = .VEC_ADD_HALVING_NARROW (V4SI, V4SI);
> > > > > > >  if (tem != 0)
> > > > > > >
> > > > > > > > whatever 'halving' now stands for (isn't that
> > > > > > > > .VEC_ADD_HIGH_NARROW?)
> > > > > > >
> > > > > >
> > > > > > Yeah, but it retrieves the high half, so I'm open to suggestions, but I
> > > > > > think any name would be confusing.
> > > > > >
> > > > > > > I can't see how that's in any way faster?  (the aarch64 testcases
> > > > > > > unfortunately stop matching after the addhn)
> > > > > > >
> > > > > >
> > > > > > Which is intentional.
> > > > > >
> > > > > > The original code with the ORR is
> > > > > >
> > > > > >         ldp     q31, q30, [x0]
> > > > > >         cmeq    v31.2d, v31.2d, v29.2d
> > > > > >         cmeq    v30.2d, v30.2d, v29.2d
> > > > > >         orr     v31.16b, v31.16b, v30.16b
> > > > > >         umaxp   v31.4s, v31.4s, v31.4s
> > > > > >         fmov    x3, d31
> > > > > >
> > > > > > Because the result of the ORR is a 128-bit vector, it needs to be
> > > > > > compressed into 64 bits to be transferred to a GPR so the != 0 can
> > > > > > be performed.
> > > > > >
> > > > > > ADDHN does the combination and compression in one step, i.e.
> > > > > >
> > > > > > ORR + UMAXP -> ADDHN.
> > > > >
> > > > > Ah, I see.  So AdvSIMD lacks a ptest, and instead you go to gpr.
> > > > > The above code does a max reduction and the fmov moves the
> > > > > scalar reduction result to a GPR?  But with addhn you move
> > > > > the whole (64bit) vector reg to a GPR?
> > > > >
> > > >
> > > > Indeed.
> > > >
> > > > > It seems to me that on the vectorizer side it's not so interesting
> > > > > to know the target can do addhn but that the target can't do
> > > > > a {u,}cmpv4si with EQ?  That is, without the patch the vectorizer
> > > > > generates a GIMPLE_COND that isn't supported by the target?
> > > > >
> > > > > We check for cbranch, so currently you say you can do it but emulate
> > > > > it with umaxp + fmov?
> > > >
> > > > Indeed.
> > > >
> > > > >
> > > > > What do you do when there's only one V4SImode vector?  You
> > > > > could pack (truncate) that to V4HImode, right?  Aka
> > > > > vec_pack_trunc (x, x) -> V8HImode and the lower V4HI / 64bit is
> > > > > then 'x'?
> > > >
> > > > For only one V4SImode vector we still use UMAXP, but with itself.
> > > > It essentially throws away half the lanes of the result, but that's OK
> > > > because for a cbranch all we care about is whether any lane is set or
> > > > all are zero.  We use UMAXP because that works for vectors of bytes
> > > > too, whereas narrowing wouldn't.
> > >
> > > Ah, looked up umaxp and it's a concat + reduce adjacent lanes to
> > > one with MAX.  We don't have an optab scheme for such instruction
> > > either ;)  I think the closest are the SAD_EXPR likes, that
> > > reduce the number of lanes because they are widening, but those
> > > have one input only.  umaxp is sth like a reduc_umax_evenodd,
> > > one could imagine a reduc_umax_hilo that reduces { a0, a1, a2, a3 }
> > > { b0, b1, b2, b3 } as { umax (a0, b0), umax (a1, b1), umax (a2,b2),
> > > umax (a3, b3) } instead.  That said, I can see how addhn is
> > > useful here.
> > >
> > > > However when SVE is available (just wasn't chosen due to costing)
> > > > the goal is to use the SVE comparison predicated to 128-bits instead.
> > > >
> > > > This is where my upcoming patch for vec_cbranch_any and
> > > > vec_cbranch_all comes in.  With only one comparison we can replace
> > > > it with an SVE compare + branch, removing the need for the reduction.
> > >
> > > Hmm, it all feels like somewhat of a delicate target costing thing
> > > to me.
> > >
> > > > >
> > > > > > > Also the inputs are vector bools(?), so you should V_C_E them to
> > > > > > > data vectors before "adding" them.  And check that they have
> > > > > > > a vector mode that's not VnBImode for which I guess the addhn
> > > > > > > semantics wouldn't be necessarily good enough.
> > > > > >
> > > > > > ADDHN can't be used for SVE (and so the optab isn't implemented on
> > > > > > SVE modes) because SVE's version is even/odd.  But for SVE we also
> > > > > > don't want this codegen because SVE can branch on the result of the
> > > > > > data compare.  So we don't want the intermediate forced compression.
> > > > > > So this is strictly for Adv. SIMD.
> > > > > >
> > > > > > >
> > > > > > > How would you scale this to workset.length () > 2?  I suppose
> > > > > > > for an even number reduce to the half element size first, for
> > > > > > > odd you could make it even by first reducing two vectors with IOR?
> > > > > > > If small, either check for another narrowing addhn operation or
> > > > > > > continue with IOR?
> > > > > > >
> > > > > >
> > > > > > Because the instruction can't work on bytes, having > 2 just uses
> > > > > > ORR until we have == 2 and then ADDHN.  You could use ADDHN for the
> > > > > > intermediate steps, but ADDHN hits a limit when you reach bytes.
> > > > > >
> > > > > > However one big benefit of using the ADDHN even in > 2 cases is that
> > > > > > it prevents reassoc from breaking the ORR order we created in the
> > > > > > vectorizer, as it can't reassociate them back to a linear form as it
> > > > > > does today.
> > > > > >
> > > > > > And the reason we can't match the ADDHN in the backend is that in
> > > > > > order for us to know that the inputs are Boolean vectors we also
> > > > > > have to match the compares.  This means that the chain is longer
> > > > > > than what combine tries, since it has to match everything including
> > > > > > the if_then_else and the reduction to set the CC.
> > > > > >
> > > > > > > That said, I still fail to see how addhn reduces the critical
> > > > > > > latency?
> > > > > > >
> > > > > >
> > > > > > Because it replaces 2 instructions on the critical reduction path
> > > > > > with 1 that has half the latency of the two it replaced.  See the
> > > > > > example above.
> > > > >
> > > > > So it's good enough to indeed combine the last two elements like your
> > > > > patch does.
> > > > >
> > > > > That said, I still wonder about the trigger - it shouldn't be
> > > > > availability of the instruction as I'd think a addhn should be
> > > > > not cheaper than a simple bitwise OR.  Instead it's that
> > > > > cbranch on the wider vector isn't available?
> > > >
> > > > The addhn is actually the same latency/throughput as ORR as they're both
> > > > simple vector ALU operations on all cores.  But yeah, the reason this is
> > > > beneficial is because of the reduction.  So from that point of view it is
> > > > beneficial to always use it if available.
> > >
> > > Is it?  At least only when cbranch on that smaller mode is available?
> >
> > Yes, it's what we use in our glibc routines like chrchr etc. which are
> > inline assembly.
> >
> > >
> > > > The reason why I don't think the vectorizer should generate anything
> > > > different than the cbranch is that as mentioned above when SVE is
> > > > available (which is on all modern Arm produced cores) we can replace
> > > > the cbranch with an SVE compare.
> > > >
> > > > So testing for cbranch is temporary until it's replaced with testing for
> > > > vec_cbranch_any and vec_cbranch_all (and deprecate cbranch) which
> > > > allows us more flexibility in the back end.
> > >
> > > Right.
> > >
> > > Still an unconditional use of addhn when available looks wrong to me,
> > > why'd we have the cbranch check then, anyway?
> >
> > Because the result of the addhn is still a vector, and the cbranch check is
> > there to ask if the target can do the vector comparison and branch.  The
> > vectorizer doesn't particularly care how it does it though.
> 
> Huh, but we ask
> 
>       if (direct_optab_handler (cbranch_optab, mode) == CODE_FOR_nothing)
> 
> in this case for V4SI, but now with availability of addhn you
> instead generate a branch on V4HI.  So what's the point of asking for
> V4SI when you use V4HI in the end?


Doh... yes, I agree that is bogus.  Fixing.
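
For readers of the thread, a rough sketch of the check being asked for
(ADDHN availability on the wide mode plus cbranch on the half-size mode the
branch really uses) might look like the following.  This is not the actual
fix: the mode enum, struct toy_target and pick_branch_mode are hypothetical
stand-ins for illustration and are not GCC APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy vector modes; hypothetical stand-ins for GCC machine modes.  */
enum vmode { V16QI, V8HI, V4SI, V2DI, V8QI, V4HI, V2SI, VNONE };

/* Hypothetical target description: bit (1u << mode) is set when the
   operation is available on that mode.  */
struct toy_target
{
  unsigned addhn_modes;   /* input modes accepted by a narrowing add */
  unsigned cbranch_modes; /* modes a vector compare-and-branch accepts */
};

static bool
has_mode (unsigned mask, enum vmode m)
{
  return m != VNONE && (mask & (1u << m)) != 0;
}

/* Same lane count, half the element width; VNONE for byte vectors.  */
static enum vmode
narrow_mode (enum vmode m)
{
  switch (m)
    {
    case V8HI: return V8QI;
    case V4SI: return V4HI;
    case V2DI: return V2SI;
    default:   return VNONE;
    }
}

/* Return the mode the final branch would test, or VNONE if the toy target
   cannot branch at all.  *USE_ADDHN says whether the last combine of NVECS
   compare results would be a narrowing add rather than an IOR.  */
static enum vmode
pick_branch_mode (const struct toy_target *t, enum vmode wide,
                  unsigned nvecs, bool *use_addhn)
{
  assert (t != NULL && use_addhn != NULL);
  enum vmode narrow = narrow_mode (wide);
  /* Use the narrowing add only if the branch on the narrow mode exists.  */
  *use_addhn = (nvecs >= 2
                && has_mode (t->addhn_modes, wide)
                && has_mode (t->cbranch_modes, narrow));
  if (*use_addhn)
    return narrow;
  /* Otherwise fall back to IOR and ask about the wide mode.  */
  return has_mode (t->cbranch_modes, wide) ? wide : VNONE;
}
```

The point of the sketch is only that the cbranch query is made on the mode
that ends up in the condition, so a target with ADDHN but no branch on the
narrow mode keeps the plain IOR reduction.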

Tamar
> 
> You should check for addhn availability plus verify you can cbranch
> on the half-size vector mode I think (and otherwise not use addhn).
> 
> > Some targets need a reduction of sorts (Adv. SIMD); we just utilize the
> > fact that we can move the lower 64 bits of the vector as a bit pattern in
> > one go to a GPR.
> >
> > Other targets don't need any of this and just branch on the CC flags
> > already set (SVE).
> >
> > Other targets have to reduce to scalar using some in-order reduction or
> > similar.
> >
> > And some targets may just not be able to.  The cbranch is there to abstract
> > these away.
> >
> > The ADDHN is essentially asking whether you prefer a vector of Booleans of
> > 64 bits or 128 bits for the reduction of >= 2 compares, with the expectation
> > that the vector of 64 bits will not be more expensive to use than that of
> > 128 bits.
> >
> > I mean I could instead add a target hook that asks the target how it wants
> > to combine the two elements?  But that feels like an abstraction that won't
> > really be used by anyone else.
> 
> Just properly test we can cbranch on the actually used mode?
> 
> Richard.
> 
> > Thanks,
> > Tamar
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Tamar
> > > > > >
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Tamar
> > > > > > > >
> > > > > > > > gcc/ChangeLog:
> > > > > > > >
> > > > > > > >         * internal-fn.def (VEC_ADD_HALVING_NARROW): New.
> > > > > > > >         * doc/generic.texi: Document it.
> > > > > > > >         * optabs.def (vec_addh_narrow): New.
> > > > > > > >         * doc/md.texi: Document it.
> > > > > > > > >         * tree-vect-stmts.cc (vectorizable_early_exit): Use addhn if supported.
> > > > > > > > >
> > > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > > >
> > > > > > > > >         * gcc.target/aarch64/vect-early-break-addhn_1.c: New test.
> > > > > > > > >         * gcc.target/aarch64/vect-early-break-addhn_2.c: New test.
> > > > > > > > >         * gcc.target/aarch64/vect-early-break-addhn_3.c: New test.
> > > > > > > > >         * gcc.target/aarch64/vect-early-break-addhn_4.c: New test.
> > > > > > > >
> > > > > > > > ---
> > > > > > > > > diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
> > > > > > > > > index d4ac580a7a8b9cd339d26cb97f7eb963f83746a4..ff16ff47bbf45e795df0d230e9a885d9d218d9af 100644
> > > > > > > > > --- a/gcc/doc/generic.texi
> > > > > > > > > +++ b/gcc/doc/generic.texi
> > > > > > > > > @@ -1834,6 +1834,7 @@ a value from @code{enum annot_expr_kind}, the third is an @code{INTEGER_CST}.
> > > > > > > >  @tindex IFN_VEC_WIDEN_MINUS_LO
> > > > > > > >  @tindex IFN_VEC_WIDEN_MINUS_EVEN
> > > > > > > >  @tindex IFN_VEC_WIDEN_MINUS_ODD
> > > > > > > > +@tindex IFN_VEC_ADD_HALVING_NARROW
> > > > > > > >  @tindex VEC_UNPACK_HI_EXPR
> > > > > > > >  @tindex VEC_UNPACK_LO_EXPR
> > > > > > > >  @tindex VEC_UNPACK_FLOAT_HI_EXPR
> > > > > > > > @@ -1956,6 +1957,24 @@ vector of @code{N/2} subtractions.  In
> the
> > > case
> > > > > of
> > > > > > > >  vector are subtracted from the odd @code{N/2} of the first to
> produce
> > > the
> > > > > > > >  vector of @code{N/2} subtractions.
> > > > > > > >
> > > > > > > > +@item IFN_VEC_ADD_HALVING_NARROW
> > > > > > > > > +This internal function performs an addition of two input vectors,
> > > > > > > > > +then extracts the most significant half of each result element and
> > > > > > > > > +narrows it to half the original element width.
> > > > > > > > > +
> > > > > > > > > +Concretely, it computes:
> > > > > > > > > +@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
> > > > > > > > > +
> > > > > > > > > +where @code{bits(a)} is the width in bits of each input element.
> > > > > > > > > +
> > > > > > > > > +Its operands are vectors containing the same number of elements (@code{N})
> > > > > > > > > +of the same integral type.  The result is a vector of length @code{N}, with
> > > > > > > > > +elements of an integral type whose size is half that of the input element
> > > > > > > > > +type.
> > > > > > > > > +
> > > > > > > > > +This operation is currently only used for early break result compression when
> > > > > > > > > +the result of a vector boolean can be represented as 0 or -1.
> > > > > > > > +
> > > > > > > >  @item VEC_UNPACK_HI_EXPR
> > > > > > > >  @itemx VEC_UNPACK_LO_EXPR
> > > > > > > > >  These nodes represent unpacking of the high and low parts of the input vector,
> > > > > > > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > > > > > > index aba93f606eca59d31c103a05b2567fd4f3be55f3..ec0193e4eee079e00168bbaf9b28ba8d52e5d464 100644
> > > > > > > > > --- a/gcc/doc/md.texi
> > > > > > > > > +++ b/gcc/doc/md.texi
> > > > > > > > > @@ -6087,6 +6087,25 @@ vectors with N signed/unsigned elements of size S@.  Find the absolute
> > > > > > > > >  difference between operands 1 and 2 and widen the resulting elements.
> > > > > > > > >  Put the N/2 results of size 2*S in the output vector (operand 0).
> > > > > > > >
> > > > > > > > +@cindex @code{vec_addh_narrow@var{m}} instruction pattern
> > > > > > > > +@item @samp{vec_addh_narrow@var{m}}
> > > > > > > > > +Signed or unsigned addition of two input vectors, then extracts the
> > > > > > > > > +most significant half of each result element and narrows it to half
> > > > > > > > > +the original element width.
> > > > > > > > > +
> > > > > > > > > +Concretely, it computes:
> > > > > > > > > +@code{(bits(a)/2)((a + b) >> (bits(a)/2))}
> > > > > > > > > +
> > > > > > > > > +where @code{bits(a)} is the width in bits of each input element.
> > > > > > > > > +
> > > > > > > > > +Its operands (@code{1} and @code{2}) are vectors containing the same number
> > > > > > > > > +of signed or unsigned integral elements (@code{N}) of size @code{S}.  The
> > > > > > > > > +result (operand @code{0}) is a vector of length @code{N}, with elements of
> > > > > > > > > +an integral type whose size is half that of @code{S}.
> > > > > > > > > +
> > > > > > > > > +This operation is currently only used for early break result compression when
> > > > > > > > > +the result of a vector boolean can be represented as 0 or -1.
> > > > > > > > +
> > > > > > > >  @cindex @code{vec_addsub@var{m}3} instruction pattern
> > > > > > > >  @item @samp{vec_addsub@var{m}3}
> > > > > > > > >  Alternating subtract, add with even lanes doing subtract and odd
> > > > > > > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > > > > > > > index d2480a1bf7927476215bc7bb99c0b74197d2b7e9..cb18058d9f48cc0dff96ed4b31d0abc9adb67867 100644
> > > > > > > > > --- a/gcc/internal-fn.def
> > > > > > > > > +++ b/gcc/internal-fn.def
> > > > > > > > > @@ -422,6 +422,8 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
> > > > > > > > >  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
> > > > > > > > >  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
> > > > > > > > >  DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary)
> > > > > > > > > +DEF_INTERNAL_OPTAB_FN (VEC_ADD_HALVING_NARROW, ECF_CONST | ECF_NOTHROW,
> > > > > > > > > +                      vec_addh_narrow, binary)
> > > > > > > >  DEF_INTERNAL_WIDENING_OPTAB_FN (VEC_WIDEN_PLUS,
> > > > > > > >                                 ECF_CONST | ECF_NOTHROW,
> > > > > > > >                                 first,
> > > > > > > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > > > > > > index 87a8b85da1592646d0a3447572e842ceb158cd97..b2bedc3692f914c2b80d7972db81b542b32c9eb8 100644
> > > > > > > > > --- a/gcc/optabs.def
> > > > > > > > > +++ b/gcc/optabs.def
> > > > > > > > > @@ -492,6 +492,7 @@ OPTAB_D (vec_widen_uabd_hi_optab, "vec_widen_uabd_hi_$a")
> > > > > > > > >  OPTAB_D (vec_widen_uabd_lo_optab, "vec_widen_uabd_lo_$a")
> > > > > > > > >  OPTAB_D (vec_widen_uabd_odd_optab, "vec_widen_uabd_odd_$a")
> > > > > > > > >  OPTAB_D (vec_widen_uabd_even_optab, "vec_widen_uabd_even_$a")
> > > > > > > > +OPTAB_D (vec_addh_narrow_optab, "vec_addh_narrow$a")
> > > > > > > >  OPTAB_D (vec_addsub_optab, "vec_addsub$a3")
> > > > > > > >  OPTAB_D (vec_fmaddsub_optab, "vec_fmaddsub$a4")
> > > > > > > >  OPTAB_D (vec_fmsubadd_optab, "vec_fmsubadd$a4")
> > > > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > > > > > > > > new file mode 100644
> > > > > > > > > index 0000000000000000000000000000000000000000..4ecb187513e525e0cd9b8b063e418a75a23c525d
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_1.c
> > > > > > > > > @@ -0,0 +1,33 @@
> > > > > > > > > +/* { dg-do compile } */
> > > > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > > > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > > > > > > +
> > > > > > > > +#define TYPE int
> > > > > > > > +#define N 800
> > > > > > > > +
> > > > > > > > +#pragma GCC target "+nosve"
> > > > > > > > +
> > > > > > > > +TYPE a[N];
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > +** foo:
> > > > > > > > +**     ...
> > > > > > > > +**     ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > > > > > > +**     cmeq    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > > > > > > > +**     cmeq    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
> > > > > > > > +**     addhn   v[0-9]+.4h, v[0-9]+.4s, v[0-9]+.4s
> > > > > > > > +**     fmov    x[0-9]+, d[0-9]+
> > > > > > > > +**     ...
> > > > > > > > +*/
> > > > > > > > +
> > > > > > > > +int foo ()
> > > > > > > > +{
> > > > > > > > +#pragma GCC unroll 8
> > > > > > > > +  for (int i = 0; i < N; i++)
> > > > > > > > +    if (a[i] == 124)
> > > > > > > > +      return 1;
> > > > > > > > +
> > > > > > > > +  return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > > > > > > > > new file mode 100644
> > > > > > > > > index 0000000000000000000000000000000000000000..d67d0d13d1733935aaf805e59188eb8155cb5f06
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_2.c
> > > > > > > > > @@ -0,0 +1,33 @@
> > > > > > > > > +/* { dg-do compile } */
> > > > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > > > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > > > > > > +
> > > > > > > > +#define TYPE long long
> > > > > > > > +#define N 800
> > > > > > > > +
> > > > > > > > +#pragma GCC target "+nosve"
> > > > > > > > +
> > > > > > > > +TYPE a[N];
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > +** foo:
> > > > > > > > +**     ...
> > > > > > > > +**     ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > > > > > > +**     cmeq    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> > > > > > > > +**     cmeq    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
> > > > > > > > +**     addhn   v[0-9]+.2s, v[0-9]+.2d, v[0-9]+.2d
> > > > > > > > +**     fmov    x[0-9]+, d[0-9]+
> > > > > > > > +**     ...
> > > > > > > > +*/
> > > > > > > > +
> > > > > > > > +int foo ()
> > > > > > > > +{
> > > > > > > > +#pragma GCC unroll 4
> > > > > > > > +  for (int i = 0; i < N; i++)
> > > > > > > > +    if (a[i] == 124)
> > > > > > > > +      return 1;
> > > > > > > > +
> > > > > > > > +  return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > > > > > > > > new file mode 100644
> > > > > > > > > index 0000000000000000000000000000000000000000..57dbc44ae0cdcbcdccd3d8dbe98c79713eaf5607
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_3.c
> > > > > > > > > @@ -0,0 +1,33 @@
> > > > > > > > > +/* { dg-do compile } */
> > > > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > > > > > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > > > > > > +
> > > > > > > > +#define TYPE short
> > > > > > > > +#define N 800
> > > > > > > > +
> > > > > > > > +#pragma GCC target "+nosve"
> > > > > > > > +
> > > > > > > > +TYPE a[N];
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > +** foo:
> > > > > > > > +**     ...
> > > > > > > > +**     ldp     q[0-9]+, q[0-9]+, \[x[0-9]+\], 32
> > > > > > > > +**     cmeq    v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> > > > > > > > +**     cmeq    v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
> > > > > > > > +**     addhn   v[0-9]+.8b, v[0-9]+.8h, v[0-9]+.8h
> > > > > > > > +**     fmov    x[0-9]+, d[0-9]+
> > > > > > > > +**     ...
> > > > > > > > +*/
> > > > > > > > +
> > > > > > > > +int foo ()
> > > > > > > > +{
> > > > > > > > +#pragma GCC unroll 16
> > > > > > > > +  for (int i = 0; i < N; i++)
> > > > > > > > +    if (a[i] == 124)
> > > > > > > > +      return 1;
> > > > > > > > +
> > > > > > > > +  return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > > +/* { dg-final { scan-tree-dump "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > > > > > > > > new file mode 100644
> > > > > > > > > index 0000000000000000000000000000000000000000..8ad42b22024479283d6814d815ef1dce411d1c72
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-addhn_4.c
> > > > > > > > > @@ -0,0 +1,21 @@
> > > > > > > > > +/* { dg-do compile } */
> > > > > > > > > +/* { dg-additional-options "-O3 -fdump-tree-vect-details -std=c99" } */
> > > > > > > > +
> > > > > > > > +#define TYPE char
> > > > > > > > +#define N 800
> > > > > > > > +
> > > > > > > > +#pragma GCC target "+nosve"
> > > > > > > > +
> > > > > > > > +TYPE a[N];
> > > > > > > > +
> > > > > > > > +int foo ()
> > > > > > > > +{
> > > > > > > > +#pragma GCC unroll 32
> > > > > > > > +  for (int i = 0; i < N; i++)
> > > > > > > > +    if (a[i] == 124)
> > > > > > > > +      return 1;
> > > > > > > > +
> > > > > > > > +  return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > > +/* { dg-final { scan-tree-dump-not "VEC_ADD_HALVING_NARROW" "vect" } } */
> > > > > > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > > > > > > index 1545fab364792f75bcc786ba1311b8bdc82edd70..179ce5e0a66b6f88976ffb544c6874d7bec999a8 100644
> > > > > > > > > --- a/gcc/tree-vect-stmts.cc
> > > > > > > > > +++ b/gcc/tree-vect-stmts.cc
> > > > > > > > > @@ -12328,7 +12328,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> > > > > > > > >    gimple *orig_stmt = STMT_VINFO_STMT (vect_orig_stmt (stmt_info));
> > > > > > > > >    gcond *cond_stmt = as_a <gcond *>(orig_stmt);
> > > > > > > > >
> > > > > > > > > -  tree cst = build_zero_cst (vectype);
> > > > > > > > > +  tree vectype_out = vectype;
> > > > > > > > >    auto bb = gimple_bb (cond_stmt);
> > > > > > > > >    edge exit_true_edge = EDGE_SUCC (bb, 0);
> > > > > > > > >    if (exit_true_edge->flags & EDGE_FALSE_VALUE)
> > > > > > > > > @@ -12452,12 +12452,40 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> > > > > > > > >        else
> > > > > > > > >         workset.splice (stmts);
> > > > > > > > >
> > > > > > > > > +      /* See if we support ADDHN and use that for the reduction.  */
> > > > > > > > > +      internal_fn ifn = IFN_VEC_ADD_HALVING_NARROW;
> > > > > > > > > +      bool addhn_supported_p
> > > > > > > > > +       = direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED);
> > > > > > > > > +      tree narrow_type = NULL_TREE;
> > > > > > > > > +      if (addhn_supported_p)
> > > > > > > > > +       {
> > > > > > > > > +         /* Calculate the narrowing type for the result.  */
> > > > > > > > > +         auto halfprec = TYPE_PRECISION (TREE_TYPE (vectype)) / 2;
> > > > > > > > > +         auto unsignedp = TYPE_UNSIGNED (TREE_TYPE (vectype));
> > > > > > > > > +         tree itype = build_nonstandard_integer_type (halfprec, unsignedp);
> > > > > > > > > +         poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> > > > > > > > > +         tree tmp_type = build_vector_type (itype, nunits);
> > > > > > > > > +         narrow_type = truth_type_for (tmp_type);
> > > > > > > > > +       }
> > > > > > > > > +
> > > > > > > > >        while (workset.length () > 1)
> > > > > > > > >         {
> > > > > > > > > -         new_temp = make_temp_ssa_name (vectype, NULL, "vexit_reduc");
> > > > > > > > >           tree arg0 = workset.pop ();
> > > > > > > > >           tree arg1 = workset.pop ();
> > > > > > > > > -         new_stmt = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
> > > > > > > > > +         if (addhn_supported_p && workset.length () == 0)
> > > > > > > > > +           {
> > > > > > > > > +             new_stmt = gimple_build_call_internal (ifn, 2, arg0, arg1);
> > > > > > > > > +             vectype_out = narrow_type;
> > > > > > > > > +             new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
> > > > > > > > > +             gimple_call_set_lhs (as_a <gcall *> (new_stmt), new_temp);
> > > > > > > > > +             gimple_call_set_nothrow (as_a <gcall *> (new_stmt), true);
> > > > > > > > > +           }
> > > > > > > > > +         else
> > > > > > > > > +           {
> > > > > > > > > +             new_temp = make_temp_ssa_name (vectype_out, NULL, "vexit_reduc");
> > > > > > > > > +             new_stmt
> > > > > > > > > +               = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
> > > > > > > > > +           }
> > > > > > > > >           vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
> > > > > > > > >                                        &cond_gsi);
> > > > > > > > >           workset.quick_insert (0, new_temp);
> > > > > > > > > @@ -12480,6 +12508,7 @@ vectorizable_early_exit (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> > > > > > > > >
> > > > > > > > >    gcc_assert (new_temp);
> > > > > > > > >
> > > > > > > > > +  tree cst = build_zero_cst (vectype_out);
> > > > > > > > >    gimple_cond_set_condition (cond_stmt, NE_EXPR, new_temp, cst);
> > > > > > > > >    update_stmt (orig_stmt);
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Richard Biener <rguent...@suse.de>
> > > > > > > SUSE Software Solutions Germany GmbH,
> > > > > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > > > > Nuernberg)
> > > > > >
> > > > >
> > > > > --
> > > > > Richard Biener <rguent...@suse.de>
> > > > > SUSE Software Solutions Germany GmbH,
> > > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > > Nuernberg)
> > > >
> > >
> > > --
> > > Richard Biener <rguent...@suse.de>
> > > SUSE Software Solutions Germany GmbH,
> > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> Nuernberg)
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
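
A footnote for readers of the archived thread: the claim that ORR + UMAXP
and ADDHN agree on "any lane set" for compare results can be checked with a
small scalar model.  The sketch below assumes two 64-bit lanes per vector and
inputs that are 0 or all-ones per lane, as CMEQ produces.  The v2di struct and
the function names are made up for illustration; this models the lane
arithmetic only, not the real instructions' encodings or timings.

```c
#include <stdbool.h>
#include <stdint.h>

/* A 128-bit vector viewed as two 64-bit lanes (hypothetical model type).  */
typedef struct { uint64_t lane[2]; } v2di;

/* Model of: orr v.16b ; umaxp v.4s, v.4s, v.4s ; fmov x, d.
   The OR result is viewed as four 32-bit lanes, adjacent lanes are
   max-reduced, and the low 64 bits are moved to a scalar.  */
static uint64_t
orr_umaxp_fmov (v2di a, v2di b)
{
  uint32_t w[4];
  for (int i = 0; i < 2; i++)
    {
      uint64_t t = a.lane[i] | b.lane[i];
      w[2 * i] = (uint32_t) t;
      w[2 * i + 1] = (uint32_t) (t >> 32);
    }
  uint32_t lo = w[0] > w[1] ? w[0] : w[1];
  uint32_t hi = w[2] > w[3] ? w[2] : w[3];
  return ((uint64_t) hi << 32) | lo;
}

/* Model of: addhn v.2s, v.2d, v.2d ; fmov x, d.
   Each 64-bit lane sum (modulo 2^64) keeps only its high 32 bits.  */
static uint64_t
addhn_fmov (v2di a, v2di b)
{
  uint32_t lo = (uint32_t) ((a.lane[0] + b.lane[0]) >> 32);
  uint32_t hi = (uint32_t) ((a.lane[1] + b.lane[1]) >> 32);
  return ((uint64_t) hi << 32) | lo;
}

/* Check all 16 combinations of 0 / all-ones lanes: both sequences must be
   nonzero exactly when at least one input lane is set.  */
static bool
masks_agree (void)
{
  const uint64_t m[2] = { 0, UINT64_MAX };
  for (int i = 0; i < 16; i++)
    {
      v2di a = { { m[i & 1], m[(i >> 1) & 1] } };
      v2di b = { { m[(i >> 2) & 1], m[(i >> 3) & 1] } };
      bool any = (a.lane[0] | a.lane[1] | b.lane[0] | b.lane[1]) != 0;
      if ((orr_umaxp_fmov (a, b) != 0) != any
          || (addhn_fmov (a, b) != 0) != any)
        return false;
    }
  return true;
}
```

The equivalence relies on the inputs being masks: for an arbitrary value
such as 1 in a lane, the OR path is nonzero while the high half of the sum is
zero, which matches the documentation's restriction to boolean vectors that
are 0 or -1.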
