On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina <tamar.christ...@arm.com> wrote: > > > -----Original Message----- > > From: Richard Sandiford <richard.sandif...@arm.com> > > Sent: Thursday, October 5, 2023 9:26 PM > > To: Tamar Christina <tamar.christ...@arm.com> > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw > > <richard.earns...@arm.com>; Marcus Shawcroft > > <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign. > > > > Tamar Christina <tamar.christ...@arm.com> writes: > > >> -----Original Message----- > > >> From: Richard Sandiford <richard.sandif...@arm.com> > > >> Sent: Thursday, October 5, 2023 8:29 PM > > >> To: Tamar Christina <tamar.christ...@arm.com> > > >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw > > >> <richard.earns...@arm.com>; Marcus Shawcroft > > >> <marcus.shawcr...@arm.com>; Kyrylo Tkachov > > <kyrylo.tkac...@arm.com> > > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign. > > >> > > >> Tamar Christina <tamar.christ...@arm.com> writes: > > >> > Hi All, > > >> > > > >> > This adds an implementation for masked copysign along with an > > >> > optimized pattern for masked copysign (x, -1). > > >> > > >> It feels like we're ending up with a lot of AArch64-specific code > > >> that just hard- codes the observation that changing the sign is > > >> equivalent to changing the top bit. We then need to make sure that > > >> we choose the best way of changing the top bit for any given situation. > > >> > > >> Hard-coding the -1/negative case is one instance of that. But it > > >> looks like we also fail to use the best sequence for SVE2. E.g. > > >> [https://godbolt.org/z/ajh3MM5jv]: > > >> > > >> #include <stdint.h> > > >> > > >> void f(double *restrict a, double *restrict b) { > > >> for (int i = 0; i < 100; ++i) > > >> a[i] = __builtin_copysign(a[i], b[i]); } > > >> > > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) { > > >> for (int i = 0; i < 100; ++i) > > >> a[i] = (a[i] & ~c) | (b[i] & c); } > > >> > > >> gives: > > >> > > >> f: > > >> mov x2, 0 > > >> mov w3, 100 > > >> whilelo p7.d, wzr, w3 > > >> .L2: > > >> ld1d z30.d, p7/z, [x0, x2, lsl 3] > > >> ld1d z31.d, p7/z, [x1, x2, lsl 3] > > >> and z30.d, z30.d, #0x7fffffffffffffff > > >> and z31.d, z31.d, #0x8000000000000000 > > >> orr z31.d, z31.d, z30.d > > >> st1d z31.d, p7, [x0, x2, lsl 3] > > >> incd x2 > > >> whilelo p7.d, w2, w3 > > >> b.any .L2 > > >> ret > > >> g: > > >> mov x3, 0 > > >> mov w4, 100 > > >> mov z29.d, x2 > > >> whilelo p7.d, wzr, w4 > > >> .L6: > > >> ld1d z30.d, p7/z, [x0, x3, lsl 3] > > >> ld1d z31.d, p7/z, [x1, x3, lsl 3] > > >> bsl z31.d, z31.d, z30.d, z29.d > > >> st1d z31.d, p7, [x0, x3, lsl 3] > > >> incd x3 > > >> whilelo p7.d, w3, w4 > > >> b.any .L6 > > >> ret > > >> > > >> I saw that you originally tried to do this in match.pd and that the > > >> decision was to fold to copysign instead. But perhaps there's a > > >> compromise where isel does something with the (new) copysign canonical > > form? > > >> I.e. could we go with your new version of the match.pd patch, and add > > >> some isel stuff as a follow-on? > > >> > > > > > > Sure if that's what's desired.... But.. > > > > > > The example you posted above is for instance worse for x86 > > > https://godbolt.org/z/x9ccqxW6T where the first operation has a > > > dependency chain of 2 and the latter of 3. It's likely any open coding > > > of this > > operation is going to hurt a target. > > > > > > So I'm unsure what isel transform this into... > > > > I didn't mean that we should go straight to using isel for the general > > case, just > > for the new case. The example above was instead trying to show the general > > point that hiding the logic ops in target code is a double-edged sword. > > I see.. but the problem here is that transforming copysign (x, -1) into > (x | 0x8000000) would require an integer operation on an FP value. I'm happy > to > do it but it seems like it'll be an AArch64 only thing anyway. > > If we want to do this we need to check can_change_mode_class or a hook. > Most targets including x86 reject the conversion. So it'll just be > effectively an AArch64 > thing. > > You're right that the actual equivalent transformation is this > https://godbolt.org/z/KesfrMv5z > But the target won't allow it. > > > > > The x86_64 example for the -1 case would be > > https://godbolt.org/z/b9s6MaKs8 where the isel change would be an > > improvement. Without that, I guess > > x86_64 will need to have a similar patch to the AArch64 one. > > > > I think that's to be expected. I think it's logical that every target just > needs to implement > their optabs optimally. > > > That said, https://godbolt.org/z/e6nqoqbMh suggests that powerpc64 is > > probably relying on the current copysign -> neg/abs transform. > > (Not sure why the second function uses different IVs from the first.) > > > > Personally, I wouldn't be against a target hook that indicated whether > > float bit > > manipulation is "free" for a given mode, if it comes to that. > > I'm really sure Richi would agree there. Generally speaking I don't think > people see > doing FP operations on Int as beneficial. But it's always the case on > AArch64.
IIRC we're doing fpclassify "expansion" early for example. Note the issue I see is that the middle-end shouldn't get in the way of targets that have a copysign optab. In case it's worthwhile to special-case a "setsign" thing then maybe provide an optab for that as well. Now, that doesn't help if we want setsign to be expanded from the middle-end but still wan the copysign optab (and not require targets to implement both if they want to escape middle-end generic expansion of setsign). But yes, I also thought the , 1 and , -1 cases could be special cased by RTL expansion (or ISEL). But it would need to invoke costing which likely means target specific changes anyway... :/ So I have no good advice here but I hoped that even the generic target specific copysign implementation with and & xor would eventually be optimized later on RTL for constant second arg? Richard. > But sure, if you believe it to be beneficial I can follow up with a patch, > but I'd still > need this one to allow the folding of the VEC_PERM_EXPR away. > > Thanks, > Tamar > > > > > Thanks, > > Richard