Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Thursday, October 5, 2023 8:29 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>> <richard.earns...@arm.com>; Marcus Shawcroft
>> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> > Hi All,
>> >
>> > This adds an implementation for masked copysign along with an
>> > optimized pattern for masked copysign (x, -1).
>>
>> It feels like we're ending up with a lot of AArch64-specific code that
>> just hard-codes the observation that changing the sign is equivalent
>> to changing the top bit.  We then need to make sure that we choose the
>> best way of changing the top bit for any given situation.
>>
>> Hard-coding the -1/negative case is one instance of that.  But it
>> looks like we also fail to use the best sequence for SVE2.  E.g.
>> [https://godbolt.org/z/ajh3MM5jv]:
>>
>> #include <stdint.h>
>>
>> void f(double *restrict a, double *restrict b) {
>>     for (int i = 0; i < 100; ++i)
>>         a[i] = __builtin_copysign(a[i], b[i]);
>> }
>>
>> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
>>     for (int i = 0; i < 100; ++i)
>>         a[i] = (a[i] & ~c) | (b[i] & c);
>> }
>>
>> gives:
>>
>> f:
>>         mov     x2, 0
>>         mov     w3, 100
>>         whilelo p7.d, wzr, w3
>> .L2:
>>         ld1d    z30.d, p7/z, [x0, x2, lsl 3]
>>         ld1d    z31.d, p7/z, [x1, x2, lsl 3]
>>         and     z30.d, z30.d, #0x7fffffffffffffff
>>         and     z31.d, z31.d, #0x8000000000000000
>>         orr     z31.d, z31.d, z30.d
>>         st1d    z31.d, p7, [x0, x2, lsl 3]
>>         incd    x2
>>         whilelo p7.d, w2, w3
>>         b.any   .L2
>>         ret
>> g:
>>         mov     x3, 0
>>         mov     w4, 100
>>         mov     z29.d, x2
>>         whilelo p7.d, wzr, w4
>> .L6:
>>         ld1d    z30.d, p7/z, [x0, x3, lsl 3]
>>         ld1d    z31.d, p7/z, [x1, x3, lsl 3]
>>         bsl     z31.d, z31.d, z30.d, z29.d
>>         st1d    z31.d, p7, [x0, x3, lsl 3]
>>         incd    x3
>>         whilelo p7.d, w3, w4
>>         b.any   .L6
>>         ret
>>
>> I saw that you originally tried to do this in match.pd and that the
>> decision was to fold to copysign instead.  But perhaps there's a
>> compromise where isel does something with the (new) copysign canonical
>> form?  I.e. could we go with your new version of the match.pd patch,
>> and add some isel stuff as a follow-on?
>>
>
> Sure, if that's what's desired... But...
>
> The example you posted above is for instance worse for x86
> https://godbolt.org/z/x9ccqxW6T
> where the first operation has a dependency chain of 2 and the second
> of 3.  It's likely that any open coding of this operation is going to
> hurt a target.
>
> So I'm unsure what isel should transform this into...
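
(For concreteness, the "sign change is a top-bit change" observation from
the quoted mail, written out in scalar C -- an illustrative sketch only,
not code from the patch; the masks assume IEEE 754 binary64:)

#include <stdint.h>
#include <string.h>

/* copysign(x, y): keep the magnitude bits of x and take the sign bit
   of y.  */
static double
copysign_bits (double x, double y)
{
  uint64_t ux, uy;
  memcpy (&ux, &x, sizeof ux);
  memcpy (&uy, &y, sizeof uy);
  ux = (ux & 0x7fffffffffffffffULL) | (uy & 0x8000000000000000ULL);
  memcpy (&x, &ux, sizeof x);
  return x;
}

/* The -1/negative special case needs no second input: it reduces to
   unconditionally setting the sign bit.  */
static double
copysign_neg (double x)
{
  uint64_t ux;
  memcpy (&ux, &x, sizeof ux);
  ux |= 0x8000000000000000ULL;
  memcpy (&x, &ux, sizeof x);
  return x;
}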
I didn't mean that we should go straight to using isel for the general
case, just for the new case.  The example above was instead trying to
show the general point that hiding the logic ops in target code is a
double-edged sword.

The x86_64 example for the -1 case would be
https://godbolt.org/z/b9s6MaKs8, where the isel change would be an
improvement.  Without that, I guess x86_64 will need to have a similar
patch to the AArch64 one.

That said, https://godbolt.org/z/e6nqoqbMh suggests that powerpc64 is
probably relying on the current copysign -> neg/abs transform.  (Not
sure why the second function uses different IVs from the first.)

Personally, I wouldn't be against a target hook that indicated whether
float bit manipulation is "free" for a given mode, if it comes to that.

Thanks,
Richard
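
(For reference, a minimal test case for the -1 case discussed above -- a
sketch on the assumption that the godbolt links show something along
these lines; the function name h is invented for illustration:)

void h (double *restrict a, double *restrict b)
{
  for (int i = 0; i < 100; ++i)
    a[i] = __builtin_copysign (b[i], -1.0);
}

With the isel change under discussion, each copysign here could lower to
a single OR with the sign-bit mask rather than a longer and/and/orr
sequence.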