On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina <tamar.christ...@arm.com> wrote:
>
> > -----Original Message-----
> > From: Richard Sandiford <richard.sandif...@arm.com>
> > Sent: Thursday, October 5, 2023 9:26 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> > <richard.earns...@arm.com>; Marcus Shawcroft
> > <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> > Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> >
> > Tamar Christina <tamar.christ...@arm.com> writes:
> > >> -----Original Message-----
> > >> From: Richard Sandiford <richard.sandif...@arm.com>
> > >> Sent: Thursday, October 5, 2023 8:29 PM
> > >> To: Tamar Christina <tamar.christ...@arm.com>
> > >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> > >> <richard.earns...@arm.com>; Marcus Shawcroft
> > >> <marcus.shawcr...@arm.com>; Kyrylo Tkachov
> > <kyrylo.tkac...@arm.com>
> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> > >>
> > >> Tamar Christina <tamar.christ...@arm.com> writes:
> > >> > Hi All,
> > >> >
> > >> > This adds an implementation for masked copysign along with an
> > >> > optimized pattern for masked copysign (x, -1).
> > >>
> > >> It feels like we're ending up with a lot of AArch64-specific code
> > >> that just hard- codes the observation that changing the sign is
> > >> equivalent to changing the top bit.  We then need to make sure that
> > >> we choose the best way of changing the top bit for any given situation.
> > >>
> > >> Hard-coding the -1/negative case is one instance of that.  But it
> > >> looks like we also fail to use the best sequence for SVE2.  E.g.
> > >> [https://godbolt.org/z/ajh3MM5jv]:
> > >>
> > >> #include <stdint.h>
> > >>
> > >> void f(double *restrict a, double *restrict b) {
> > >>     for (int i = 0; i < 100; ++i)
> > >>         a[i] = __builtin_copysign(a[i], b[i]); }
> > >>
> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
> > >>     for (int i = 0; i < 100; ++i)
> > >>         a[i] = (a[i] & ~c) | (b[i] & c); }
> > >>
> > >> gives:
> > >>
> > >> f:
> > >>         mov     x2, 0
> > >>         mov     w3, 100
> > >>         whilelo p7.d, wzr, w3
> > >> .L2:
> > >>         ld1d    z30.d, p7/z, [x0, x2, lsl 3]
> > >>         ld1d    z31.d, p7/z, [x1, x2, lsl 3]
> > >>         and     z30.d, z30.d, #0x7fffffffffffffff
> > >>         and     z31.d, z31.d, #0x8000000000000000
> > >>         orr     z31.d, z31.d, z30.d
> > >>         st1d    z31.d, p7, [x0, x2, lsl 3]
> > >>         incd    x2
> > >>         whilelo p7.d, w2, w3
> > >>         b.any   .L2
> > >>         ret
> > >> g:
> > >>         mov     x3, 0
> > >>         mov     w4, 100
> > >>         mov     z29.d, x2
> > >>         whilelo p7.d, wzr, w4
> > >> .L6:
> > >>         ld1d    z30.d, p7/z, [x0, x3, lsl 3]
> > >>         ld1d    z31.d, p7/z, [x1, x3, lsl 3]
> > >>         bsl     z31.d, z31.d, z30.d, z29.d
> > >>         st1d    z31.d, p7, [x0, x3, lsl 3]
> > >>         incd    x3
> > >>         whilelo p7.d, w3, w4
> > >>         b.any   .L6
> > >>         ret
> > >>
> > >> I saw that you originally tried to do this in match.pd and that the
> > >> decision was to fold to copysign instead.  But perhaps there's a
> > >> compromise where isel does something with the (new) copysign canonical
> > form?
> > >> I.e. could we go with your new version of the match.pd patch, and add
> > >> some isel stuff as a follow-on?
> > >>
> > >
> > > Sure if that's what's desired.... But..
> > >
> > > The example you posted above is for instance worse for x86
> > > https://godbolt.org/z/x9ccqxW6T where the first operation has a
> > > dependency chain of 2 and the latter of 3.  It's likely any open coding 
> > > of this
> > operation is going to hurt a target.
> > >
> > > So I'm unsure what isel transform this into...
> >
> > I didn't mean that we should go straight to using isel for the general 
> > case, just
> > for the new case.  The example above was instead trying to show the general
> > point that hiding the logic ops in target code is a double-edged sword.
>
> I see.. but the problem here is that transforming copysign (x, -1) into
> (x | 0x8000000) would require an integer operation on an FP value.  I'm happy 
> to
> do it but it seems like it'll be an AArch64 only thing anyway.
>
> If we want to do this we need to check can_change_mode_class or a hook.
> Most targets including x86 reject the conversion.  So it'll just be 
> effectively an AArch64
> thing.
>
> You're right that the actual equivalent transformation is this 
> https://godbolt.org/z/KesfrMv5z
> But the target won't allow it.
>
> >
> > The x86_64 example for the -1 case would be
> > https://godbolt.org/z/b9s6MaKs8 where the isel change would be an
> > improvement.  Without that, I guess
> > x86_64 will need to have a similar patch to the AArch64 one.
> >
>
> I think that's to be expected.  I think it's logical that every target just 
> needs to implement
> their optabs optimally.
>
> > That said, https://godbolt.org/z/e6nqoqbMh suggests that powerpc64 is
> > probably relying on the current copysign -> neg/abs transform.
> > (Not sure why the second function uses different IVs from the first.)
> >
> > Personally, I wouldn't be against a target hook that indicated whether 
> > float bit
> > manipulation is "free" for a given mode, if it comes to that.
>
> I'm really sure Richi would agree there.  Generally speaking I don't think 
> people see
> doing FP operations on Int as beneficial.  But it's always the case on 
> AArch64.

IIRC we're doing fpclassify "expansion" early for example.

Note the issue I see is that the middle-end shouldn't get in the way of
targets that have a copysign optab.  In case it's worthwhile to special-case
a "setsign" thing then maybe provide an optab for that as well.  Now, that
doesn't help if we want setsign to be expanded from the middle-end but
still wan the copysign optab (and not require targets to implement both
if they want to escape middle-end generic expansion of setsign).

But yes, I also thought the , 1 and , -1 cases could be special cased by
RTL expansion (or ISEL).  But it would need to invoke costing which likely
means target specific changes anyway... :/

So I have no good advice here but I hoped that even the generic target
specific copysign implementation with and & xor would eventually be
optimized later on RTL for constant second arg?

Richard.

> But sure, if you believe it to be beneficial I can follow up with a patch, 
> but I'd still
> need this one to allow the folding of the VEC_PERM_EXPR away.
>
> Thanks,
> Tamar
>
> >
> > Thanks,
> > Richard

Reply via email to