Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

Richard Sandiford Mon, 09 Oct 2023 03:17:46 -0700

Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Monday, October 9, 2023 10:56 AM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: Richard Biener <richard.guent...@gmail.com>; gcc-patches@gcc.gnu.org;
>> nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>;
>> Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov
>> <kyrylo.tkac...@arm.com>
>> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> 
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandif...@arm.com>
>> >> Sent: Saturday, October 7, 2023 10:58 AM
>> >> To: Richard Biener <richard.guent...@gmail.com>
>> >> Cc: Tamar Christina <tamar.christ...@arm.com>;
>> >> gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>> >> <richard.earns...@arm.com>; Marcus Shawcroft
>> >> <marcus.shawcr...@arm.com>; Kyrylo Tkachov
>> <kyrylo.tkac...@arm.com>
>> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> >>
>> >> Richard Biener <richard.guent...@gmail.com> writes:
>> >> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
>> >> <tamar.christ...@arm.com> wrote:
>> >> >>
>> >> >> > -----Original Message-----
>> >> >> > From: Richard Sandiford <richard.sandif...@arm.com>
>> >> >> > Sent: Thursday, October 5, 2023 9:26 PM
>> >> >> > To: Tamar Christina <tamar.christ...@arm.com>
>> >> >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>> >> >> > <richard.earns...@arm.com>; Marcus Shawcroft
>> >> >> > <marcus.shawcr...@arm.com>; Kyrylo Tkachov
>> >> <kyrylo.tkac...@arm.com>
>> >> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> >> cond_copysign.
>> >> >> >
>> >> >> > Tamar Christina <tamar.christ...@arm.com> writes:
>> >> >> > >> -----Original Message-----
>> >> >> > >> From: Richard Sandiford <richard.sandif...@arm.com>
>> >> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
>> >> >> > >> To: Tamar Christina <tamar.christ...@arm.com>
>> >> >> > >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard
>> >> >> > >> Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft
>> >> >> > >> <marcus.shawcr...@arm.com>; Kyrylo Tkachov
>> >> >> > <kyrylo.tkac...@arm.com>
>> >> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> >> cond_copysign.
>> >> >> > >>
>> >> >> > >> Tamar Christina <tamar.christ...@arm.com> writes:
>> >> >> > >> > Hi All,
>> >> >> > >> >
>> >> >> > >> > This adds an implementation for masked copysign along with
>> >> >> > >> > an optimized pattern for masked copysign (x, -1).
>> >> >> > >>
>> >> >> > >> It feels like we're ending up with a lot of AArch64-specific
>> >> >> > >> code that just hard- codes the observation that changing the
>> >> >> > >> sign is equivalent to changing the top bit.  We then need to
>> >> >> > >> make sure that we choose the best way of changing the top bit
>> >> >> > >> for any
>> >> given situation.
>> >> >> > >>
>> >> >> > >> Hard-coding the -1/negative case is one instance of that.
>> >> >> > >> But it looks like we also fail to use the best sequence for SVE2. 
>> >> >> > >>  E.g.
>> >> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
>> >> >> > >>
>> >> >> > >> #include <stdint.h>
>> >> >> > >>
>> >> >> > >> void f(double *restrict a, double *restrict b) {
>> >> >> > >>     for (int i = 0; i < 100; ++i)
>> >> >> > >>         a[i] = __builtin_copysign(a[i], b[i]); }
>> >> >> > >>
>> >> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
>> >> >> > >>     for (int i = 0; i < 100; ++i)
>> >> >> > >>         a[i] = (a[i] & ~c) | (b[i] & c); }
>> >> >> > >>
>> >> >> > >> gives:
>> >> >> > >>
>> >> >> > >> f:
>> >> >> > >>         mov     x2, 0
>> >> >> > >>         mov     w3, 100
>> >> >> > >>         whilelo p7.d, wzr, w3
>> >> >> > >> .L2:
>> >> >> > >>         ld1d    z30.d, p7/z, [x0, x2, lsl 3]
>> >> >> > >>         ld1d    z31.d, p7/z, [x1, x2, lsl 3]
>> >> >> > >>         and     z30.d, z30.d, #0x7fffffffffffffff
>> >> >> > >>         and     z31.d, z31.d, #0x8000000000000000
>> >> >> > >>         orr     z31.d, z31.d, z30.d
>> >> >> > >>         st1d    z31.d, p7, [x0, x2, lsl 3]
>> >> >> > >>         incd    x2
>> >> >> > >>         whilelo p7.d, w2, w3
>> >> >> > >>         b.any   .L2
>> >> >> > >>         ret
>> >> >> > >> g:
>> >> >> > >>         mov     x3, 0
>> >> >> > >>         mov     w4, 100
>> >> >> > >>         mov     z29.d, x2
>> >> >> > >>         whilelo p7.d, wzr, w4
>> >> >> > >> .L6:
>> >> >> > >>         ld1d    z30.d, p7/z, [x0, x3, lsl 3]
>> >> >> > >>         ld1d    z31.d, p7/z, [x1, x3, lsl 3]
>> >> >> > >>         bsl     z31.d, z31.d, z30.d, z29.d
>> >> >> > >>         st1d    z31.d, p7, [x0, x3, lsl 3]
>> >> >> > >>         incd    x3
>> >> >> > >>         whilelo p7.d, w3, w4
>> >> >> > >>         b.any   .L6
>> >> >> > >>         ret
>> >> >> > >>
>> >> >> > >> I saw that you originally tried to do this in match.pd and
>> >> >> > >> that the decision was to fold to copysign instead.  But
>> >> >> > >> perhaps there's a compromise where isel does something with
>> >> >> > >> the (new) copysign canonical
>> >> >> > form?
>> >> >> > >> I.e. could we go with your new version of the match.pd patch,
>> >> >> > >> and add some isel stuff as a follow-on?
>> 
>> [A]
>> 
>> >> >> > >>
>> >> >> > >
>> >> >> > > Sure if that's what's desired.... But..
>> >> >> > >
>> >> >> > > The example you posted above is for instance worse for x86
>> >> >> > > https://godbolt.org/z/x9ccqxW6T where the first operation has
>> >> >> > > a dependency chain of 2 and the latter of 3.  It's likely any
>> >> >> > > open coding of this
>> >> >> > operation is going to hurt a target.
>> >> >> > >
>> >> >> > > So I'm unsure what isel transform this into...
>> >> >> >
>> >> >> > I didn't mean that we should go straight to using isel for the
>> >> >> > general case, just for the new case.  The example above was
>> >> >> > instead trying to show the general point that hiding the logic
>> >> >> > ops in target code is
>> >> a double-edged sword.
>> >> >>
>> >> >> I see.. but the problem here is that transforming copysign (x, -1)
>> >> >> into (x | 0x8000000) would require an integer operation on an FP
>> >> >> value.  I'm happy to do it but it seems like it'll be an AArch64
>> >> >> only thing
>> >> anyway.
>> >> >>
>> >> >> If we want to do this we need to check can_change_mode_class or a
>> hook.
>> >> >> Most targets including x86 reject the conversion.  So it'll just
>> >> >> be effectively an AArch64 thing.
>> >> >>
>> >> >> You're right that the actual equivalent transformation is this
>> >> >> https://godbolt.org/z/KesfrMv5z But the target won't allow it.
>> >> >>
>> >> >> >
>> >> >> > The x86_64 example for the -1 case would be
>> >> >> > https://godbolt.org/z/b9s6MaKs8 where the isel change would be
>> >> >> > an improvement.  Without that, I guess
>> >> >> > x86_64 will need to have a similar patch to the AArch64 one.
>> >> >> >
>> >> >>
>> >> >> I think that's to be expected.  I think it's logical that every
>> >> >> target just needs to implement their optabs optimally.
>> >> >>
>> >> >> > That said, https://godbolt.org/z/e6nqoqbMh suggests that
>> >> >> > powerpc64 is probably relying on the current copysign -> neg/abs
>> transform.
>> >> >> > (Not sure why the second function uses different IVs from the
>> >> >> > first.)
>> >> >> >
>> >> >> > Personally, I wouldn't be against a target hook that indicated
>> >> >> > whether float bit manipulation is "free" for a given mode, if it
>> >> >> > comes to
>> >> that.
>> >> >>
>> >> >> I'm really sure Richi would agree there.  Generally speaking I
>> >> >> don't think people see doing FP operations on Int as beneficial.
>> >> >> But it's always
>> >> the case on AArch64.
>> >> >
>> >> > IIRC we're doing fpclassify "expansion" early for example.
>> >> >
>> >> > Note the issue I see is that the middle-end shouldn't get in the
>> >> > way of targets that have a copysign optab.  In case it's worthwhile
>> >> > to special-case a "setsign" thing then maybe provide an optab for
>> >> > that as well.  Now, that doesn't help if we want setsign to be
>> >> > expanded from the middle-end but still wan the copysign optab (and
>> >> > not require targets to implement both if they want to escape
>> >> > middle-end generic
>> >> expansion of setsign).
>> >> >
>> >> > But yes, I also thought the , 1 and , -1 cases could be special
>> >> > cased by RTL expansion (or ISEL).  But it would need to invoke
>> >> > costing which likely means target specific changes anyway... :/
>> >>
>> >> FWIW, if we had the target hook I suggested, I don't think AArch64
>> >> would need to implement copysign or xorsign optabs.  That's probably
>> >> true of
>> >> AArch32 too (at least for all combinations that are likely to matter these
>> days).
>> >>
>> >> I'd go one step further and say that, if a target wants to do its own
>> >> thing for copysign and xorsign, it clearly doesn't meet the
>> >> requirement that bit manipulation of floats is "free" for that mode.
>> >>
>> >
>> > So I'm aware I have no say here, but I'll make one last effort.
>> >
>> > The patch isn't just implementing the fneg (fabs ()) optimization just
>> because.
>> > The location where it's implemented makes a big difference.
>> >
>> > If this optimization is done late, it doesn't fix the regression
>> > fully, because doing It after all optimization passes have run means it 
>> > can't
>> properly be optimized.
>> >
>> > The point of doing the rewrite early to ORR accomplished two things:
>> >
>> > 1. It makes PRE realize that the block it's splitting would only have 1
>> instruction in it
>> >     and that such a split is not beneficial.  This is why I'm against 
>> > doing such
>> optimizations
>> >     so later. Optimizations don’t' happen in isolation, and when they make
>> sense should
>> >     happen early.  Transforming fneg (fabs (..)) in isel not only feels 
>> > wrong but
>> results in a 4%
>> >     performance loss.
>> >
>> > 2. Doing it early also gets the ORRs to be reassociated changing where the
>> loop dependent
>> >      variable lands.  Early makes it land in the merging MOVPRFX rather 
>> > than
>> requiring a SEL
>> >      at the end of the iteration.
>> >
>> > Replacing the fneg (fabs (..)) with copysign accomplishes 1 but no 2.
>> > This results in a 2-3% loss, but I can live with that given doing 1 gets 
>> > us back
>> to GCC 12 levels.
>> >
>> > Doing fneg (fabs (..)) in isel would have no meaning for me and not
>> > fix the regression.  I won't be looking to do that in that case.
>> >
>> > If it's acceptable I can transform COPYSIGN (X, -1) in gimple-isel.
>> 
>> FWIW, this is what I meant by [A] above.  I.e. go with your latest target-
>> independent change, and make isel replace COPYSIGN (X, -1) with logic
>> operations, under control of a new target hook that says that logic 
>> operations
>> on float values are as cheap as logic operations on integer values (for a 
>> given
>> float mode).  In future, hopefully all COPYSIGNs and XORSIGNs would be
>> handled the same way, under control of the same hook, but that would be a
>> separate future change.
>
> Ok, would you prefer isel or in the expander that Richi suggested?
> I assume isel is better if the intention is to later remove the expansion 
> code?


Yeah, isel sounds better to me FWIW (for that reason), but Richi would
know better.

At least with isel, there's the theoretical possibility of doing
simple optimisation before expand, or taking surrounding stmts
into account, if that ever becomes necessary for future changes.

Thanks,
Richard

Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

Reply via email to