Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Thursday, October 5, 2023 8:29 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>> <richard.earns...@arm.com>; Marcus Shawcroft
>> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> > Hi All,
>> >
>> > This adds an implementation for masked copysign along with an
>> > optimized pattern for masked copysign (x, -1).
>>
>> It feels like we're ending up with a lot of AArch64-specific code that
>> just hard-codes the observation that changing the sign is equivalent
>> to changing the top bit.  We then need to make sure that we choose the
>> best way of changing the top bit for any given situation.
>>
>> Hard-coding the -1/negative case is one instance of that.  But it
>> looks like we also fail to use the best sequence for SVE2.  E.g.
>> [https://godbolt.org/z/ajh3MM5jv]:
>>
>> #include <stdint.h>
>>
>> void f(double *restrict a, double *restrict b) {
>>     for (int i = 0; i < 100; ++i)
>>         a[i] = __builtin_copysign(a[i], b[i]);
>> }
>>
>> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
>>     for (int i = 0; i < 100; ++i)
>>         a[i] = (a[i] & ~c) | (b[i] & c);
>> }
>>
>> gives:
>>
>> f:
>>         mov     x2, 0
>>         mov     w3, 100
>>         whilelo p7.d, wzr, w3
>> .L2:
>>         ld1d    z30.d, p7/z, [x0, x2, lsl 3]
>>         ld1d    z31.d, p7/z, [x1, x2, lsl 3]
>>         and     z30.d, z30.d, #0x7fffffffffffffff
>>         and     z31.d, z31.d, #0x8000000000000000
>>         orr     z31.d, z31.d, z30.d
>>         st1d    z31.d, p7, [x0, x2, lsl 3]
>>         incd    x2
>>         whilelo p7.d, w2, w3
>>         b.any   .L2
>>         ret
>> g:
>>         mov     x3, 0
>>         mov     w4, 100
>>         mov     z29.d, x2
>>         whilelo p7.d, wzr, w4
>> .L6:
>>         ld1d    z30.d, p7/z, [x0, x3, lsl 3]
>>         ld1d    z31.d, p7/z, [x1, x3, lsl 3]
>>         bsl     z31.d, z31.d, z30.d, z29.d
>>         st1d    z31.d, p7, [x0, x3, lsl 3]
>>         incd    x3
>>         whilelo p7.d, w3, w4
>>         b.any   .L6
>>         ret
>>
>> I saw that you originally tried to do this in match.pd and that the
>> decision was to fold to copysign instead.  But perhaps there's a
>> compromise where isel does something with the (new) copysign canonical
>> form?  I.e. could we go with your new version of the match.pd patch,
>> and add some isel stuff as a follow-on?
>>
>
> Sure, if that's what's desired... But...
>
> The example you posted above is for instance worse for x86
> https://godbolt.org/z/x9ccqxW6T
> where the first operation has a dependency chain of 2 and the second
> of 3.  It's likely that any open coding of this operation is going to
> hurt a target.
>
> So I'm unsure what isel should transform this into...
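
(For concreteness, the "sign change is a top-bit change" observation from
the quoted mail, written out in scalar C -- an illustrative sketch only,
not code from the patch; the masks assume IEEE 754 binary64:)

#include <stdint.h>
#include <string.h>

/* copysign(x, y): keep the magnitude bits of x and take the sign bit
   of y.  */
static double
copysign_bits (double x, double y)
{
  uint64_t ux, uy;
  memcpy (&ux, &x, sizeof ux);
  memcpy (&uy, &y, sizeof uy);
  ux = (ux & 0x7fffffffffffffffULL) | (uy & 0x8000000000000000ULL);
  memcpy (&x, &ux, sizeof x);
  return x;
}

/* The -1/negative special case needs no second input: it reduces to
   unconditionally setting the sign bit.  */
static double
copysign_neg (double x)
{
  uint64_t ux;
  memcpy (&ux, &x, sizeof ux);
  ux |= 0x8000000000000000ULL;
  memcpy (&x, &ux, sizeof x);
  return x;
}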
I didn't mean that we should go straight to using isel for the general
case, just for the new case.  The example above was instead trying to
show the general point that hiding the logic ops in target code is a
double-edged sword.

The x86_64 example for the -1 case would be
https://godbolt.org/z/b9s6MaKs8, where the isel change would be an
improvement.  Without that, I guess x86_64 will need to have a similar
patch to the AArch64 one.

That said, https://godbolt.org/z/e6nqoqbMh suggests that powerpc64 is
probably relying on the current copysign -> neg/abs transform.  (Not
sure why the second function uses different IVs from the first.)

Personally, I wouldn't be against a target hook that indicated whether
float bit manipulation is "free" for a given mode, if it comes to that.

Thanks,
Richard
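
(For reference, a minimal test case for the -1 case discussed above -- a
sketch on the assumption that the godbolt links show something along
these lines; the function name h is invented for illustration:)

void h (double *restrict a, double *restrict b)
{
  for (int i = 0; i < 100; ++i)
    a[i] = __builtin_copysign (b[i], -1.0);
}

With the isel change under discussion, each copysign here could lower to
a single OR with the sign-bit mask rather than a longer and/and/orr
sequence.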