> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Saturday, October 7, 2023 10:58 AM
> To: Richard Biener <richard.guent...@gmail.com>
> Cc: Tamar Christina <tamar.christ...@arm.com>; gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>
> Richard Biener <richard.guent...@gmail.com> writes:
> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina <tamar.christ...@arm.com> wrote:
> >>
> >> > -----Original Message-----
> >> > From: Richard Sandiford <richard.sandif...@arm.com>
> >> > Sent: Thursday, October 5, 2023 9:26 PM
> >> > To: Tamar Christina <tamar.christ...@arm.com>
> >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> >> >
> >> > Tamar Christina <tamar.christ...@arm.com> writes:
> >> > >> -----Original Message-----
> >> > >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
> >> > >> To: Tamar Christina <tamar.christ...@arm.com>
> >> > >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> >> > >>
> >> > >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> > >> > Hi All,
> >> > >> >
> >> > >> > This adds an implementation for masked copysign along with an optimized pattern for masked copysign (x, -1).
> >> > >>
> >> > >> It feels like we're ending up with a lot of AArch64-specific code that just hard-codes the observation that changing the sign is equivalent to changing the top bit.  We then need to make sure that we choose the best way of changing the top bit for any given situation.
> >> > >>
> >> > >> Hard-coding the -1/negative case is one instance of that.  But it looks like we also fail to use the best sequence for SVE2.
> >> > >> E.g. [https://godbolt.org/z/ajh3MM5jv]:
> >> > >>
> >> > >> #include <stdint.h>
> >> > >>
> >> > >> void f(double *restrict a, double *restrict b) {
> >> > >>     for (int i = 0; i < 100; ++i)
> >> > >>         a[i] = __builtin_copysign(a[i], b[i]);
> >> > >> }
> >> > >>
> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
> >> > >>     for (int i = 0; i < 100; ++i)
> >> > >>         a[i] = (a[i] & ~c) | (b[i] & c);
> >> > >> }
> >> > >>
> >> > >> gives:
> >> > >>
> >> > >> f:
> >> > >>         mov     x2, 0
> >> > >>         mov     w3, 100
> >> > >>         whilelo p7.d, wzr, w3
> >> > >> .L2:
> >> > >>         ld1d    z30.d, p7/z, [x0, x2, lsl 3]
> >> > >>         ld1d    z31.d, p7/z, [x1, x2, lsl 3]
> >> > >>         and     z30.d, z30.d, #0x7fffffffffffffff
> >> > >>         and     z31.d, z31.d, #0x8000000000000000
> >> > >>         orr     z31.d, z31.d, z30.d
> >> > >>         st1d    z31.d, p7, [x0, x2, lsl 3]
> >> > >>         incd    x2
> >> > >>         whilelo p7.d, w2, w3
> >> > >>         b.any   .L2
> >> > >>         ret
> >> > >> g:
> >> > >>         mov     x3, 0
> >> > >>         mov     w4, 100
> >> > >>         mov     z29.d, x2
> >> > >>         whilelo p7.d, wzr, w4
> >> > >> .L6:
> >> > >>         ld1d    z30.d, p7/z, [x0, x3, lsl 3]
> >> > >>         ld1d    z31.d, p7/z, [x1, x3, lsl 3]
> >> > >>         bsl     z31.d, z31.d, z30.d, z29.d
> >> > >>         st1d    z31.d, p7, [x0, x3, lsl 3]
> >> > >>         incd    x3
> >> > >>         whilelo p7.d, w3, w4
> >> > >>         b.any   .L6
> >> > >>         ret
> >> > >>
> >> > >> I saw that you originally tried to do this in match.pd and that the decision was to fold to copysign instead.  But perhaps there's a compromise where isel does something with the (new) copysign canonical form?  I.e. could we go with your new version of the match.pd patch, and add some isel stuff as a follow-on?
> >> > >>
> >> > >
> >> > > Sure if that's what's desired.... But..
> >> > >
> >> > > The example you posted above is for instance worse for x86 https://godbolt.org/z/x9ccqxW6T where the first operation has a dependency chain of 2 and the latter of 3.  It's likely any open coding of this operation is going to hurt a target.
> >> > >
> >> > > So I'm unsure what isel transform this into...
> >> >
> >> > I didn't mean that we should go straight to using isel for the general case, just for the new case.  The example above was instead trying to show the general point that hiding the logic ops in target code is a double-edged sword.
> >>
> >> I see.. but the problem here is that transforming copysign (x, -1) into (x | 0x8000000) would require an integer operation on an FP value.  I'm happy to do it but it seems like it'll be an AArch64 only thing anyway.
> >>
> >> If we want to do this we need to check can_change_mode_class or a hook.  Most targets including x86 reject the conversion.  So it'll just be effectively an AArch64 thing.
> >>
> >> You're right that the actual equivalent transformation is this https://godbolt.org/z/KesfrMv5z But the target won't allow it.
> >>
> >> >
> >> > The x86_64 example for the -1 case would be https://godbolt.org/z/b9s6MaKs8 where the isel change would be an improvement.  Without that, I guess x86_64 will need to have a similar patch to the AArch64 one.
> >> >
> >>
> >> I think that's to be expected.  I think it's logical that every target just needs to implement their optabs optimally.
> >>
> >> > That said, https://godbolt.org/z/e6nqoqbMh suggests that powerpc64 is probably relying on the current copysign -> neg/abs transform.  (Not sure why the second function uses different IVs from the first.)
> >> >
> >> > Personally, I wouldn't be against a target hook that indicated whether float bit manipulation is "free" for a given mode, if it comes to that.
> >>
> >> I'm really sure Richi would agree there.  Generally speaking I don't think people see doing FP operations on Int as beneficial.  But it's always the case on AArch64.
> >
> > IIRC we're doing fpclassify "expansion" early for example.
> >
> > Note the issue I see is that the middle-end shouldn't get in the way of targets that have a copysign optab.  In case it's worthwhile to special-case a "setsign" thing then maybe provide an optab for that as well.  Now, that doesn't help if we want setsign to be expanded from the middle-end but still want the copysign optab (and not require targets to implement both if they want to escape middle-end generic expansion of setsign).
> >
> > But yes, I also thought the , 1 and , -1 cases could be special cased by RTL expansion (or ISEL).  But it would need to invoke costing which likely means target specific changes anyway... :/
>
> FWIW, if we had the target hook I suggested, I don't think AArch64 would need to implement copysign or xorsign optabs.  That's probably true of AArch32 too (at least for all combinations that are likely to matter these days).
>
> I'd go one step further and say that, if a target wants to do its own thing for copysign and xorsign, it clearly doesn't meet the requirement that bit manipulation of floats is "free" for that mode.
>
So I'm aware I have no say here, but I'll make one last effort.  The patch isn't implementing the fneg (fabs ()) optimization just because; the location where it's implemented makes a big difference.  If this optimization is done late, it doesn't fully fix the regression, because doing it after all the optimization passes have run means it can't be optimized properly.

Doing the rewrite to ORR early accomplishes two things:

1. It makes PRE realize that the block it's splitting would only have one instruction in it and that such a split is not beneficial.  This is why I'm against doing such optimizations so late.  Optimizations don't happen in isolation, and when they make sense they should happen early.  Transforming fneg (fabs (..)) in isel not only feels wrong but results in a 4% performance loss.

2. Doing it early also gets the ORRs reassociated, changing where the loop-dependent variable lands.  Done early, it lands in the merging MOVPRFX rather than requiring a SEL at the end of the iteration.

Replacing the fneg (fabs (..)) with copysign accomplishes 1 but not 2.  This results in a 2-3% loss, but I can live with that, given that doing 1 gets us back to GCC 12 levels.

Doing fneg (fabs (..)) in isel would be pointless for me and would not fix the regression, so I won't be looking to do that.

If it's acceptable, I can instead transform COPYSIGN (X, -1) in gimple-isel.  Before I start on this: would that be acceptable to you?
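For reference, the scalar identity this transform relies on is that fneg (fabs (x)), copysign (x, -1.0) and OR-ing bit 63 into the representation of x all produce the same IEEE 754 binary64 bit pattern.  A minimal, self-contained C sketch of that equivalence (illustrative only, not part of the patch):

#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: -fabs (x) is the C spelling of fneg (fabs (x)),
   and memcpy is used to inspect the bits without aliasing problems.  */
static uint64_t
bits_of (double x)
{
  uint64_t b;
  memcpy (&b, &x, sizeof b);
  return b;
}

int
main (void)
{
  double vals[] = { 3.5, -2.0, 0.0, -0.0 };
  for (unsigned i = 0; i < sizeof vals / sizeof vals[0]; i++)
    {
      /* Setting the sign bit of the representation...  */
      uint64_t sign_set = bits_of (vals[i]) | (UINT64_C (1) << 63);
      /* ...matches both the fneg (fabs ()) form and copysign (x, -1.0).  */
      assert (bits_of (-fabs (vals[i])) == sign_set);
      assert (bits_of (copysign (vals[i], -1.0)) == sign_set);
    }
  return 0;
}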
Thanks,
Tamar

> > So I have no good advice here but I hoped that even the generic target specific copysign implementation with and & xor would eventually be optimized later on RTL for constant second arg?
>
> Yeah.  It looks like the required logic is there for scalars, it just needs extending to vectors.
>
> The patch below (untested beyond simple cases) seems to be enough to fix it, but using the simplify routines even for CONST_INTs might be controversial.
>
> Thanks,
> Richard
>
>
> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> index bd9443dbcc2..5a9b1745673 100644
> --- a/gcc/simplify-rtx.cc
> +++ b/gcc/simplify-rtx.cc
> @@ -3409,20 +3409,20 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
>
>        /* Canonicalize (X & C1) | C2.  */
>        if (GET_CODE (op0) == AND
> -          && CONST_INT_P (trueop1)
> -          && CONST_INT_P (XEXP (op0, 1)))
> +          && CONSTANT_P (trueop1)
> +          && CONSTANT_P (XEXP (op0, 1)))
>         {
> -          HOST_WIDE_INT mask = GET_MODE_MASK (mode);
> -          HOST_WIDE_INT c1 = INTVAL (XEXP (op0, 1));
> -          HOST_WIDE_INT c2 = INTVAL (trueop1);
> +          rtx c1 = XEXP (op0, 1);
> +          rtx c2 = trueop1;
>
>           /* If (C1&C2) == C1, then (X&C1)|C2 becomes C2.  */
> -          if ((c1 & c2) == c1
> +          if (rtx_equal_p (simplify_binary_operation (AND, mode, c1, c2), c1)
>               && !side_effects_p (XEXP (op0, 0)))
>             return trueop1;
>
>           /* If (C1|C2) == ~0 then (X&C1)|C2 becomes X|C2.  */
> -          if (((c1|c2) & mask) == mask)
> +          if (rtx_equal_p (simplify_binary_operation (IOR, mode, c1, c2),
> +                           CONSTM1_RTX (mode)))
>             return simplify_gen_binary (IOR, mode, XEXP (op0, 0), op1);
>         }
>
> @@ -3732,7 +3732,7 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
>          machines, and also has shorter instruction path length.  */
>       if (GET_CODE (op0) == AND
>           && GET_CODE (XEXP (op0, 0)) == XOR
> -          && CONST_INT_P (XEXP (op0, 1))
> +          && CONSTANT_P (XEXP (op0, 1))
>           && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1))
>         {
>           rtx a = trueop1;
> @@ -3746,7 +3746,7 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
>       /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C))  */
>       else if (GET_CODE (op0) == AND
>                && GET_CODE (XEXP (op0, 0)) == XOR
> -               && CONST_INT_P (XEXP (op0, 1))
> +               && CONSTANT_P (XEXP (op0, 1))
>                && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1))
>         {
>           rtx a = XEXP (XEXP (op0, 0), 0);
> --
> 2.25.1
>
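For completeness, the two IOR identities that the hunks in the quoted patch generalize from CONST_INT to arbitrary (for example vector) constants are: (X & C1) | C2 becomes C2 when (C1 & C2) == C1, and (X & C1) | C2 becomes X | C2 when (C1 | C2) == ~0.  The second case, with C1 = ~SIGN and C2 = SIGN, is the constant form that the generic AND/IOR expansion of copysign (x, -1) produces.  A small scalar sketch of both identities (illustrative only, not a test from the patch):

#include <assert.h>
#include <stdint.h>

int
main (void)
{
  const uint64_t sign = UINT64_C (1) << 63;
  uint64_t x = UINT64_C (0x3ff0000000000000);  /* bit pattern of 1.0 */

  /* If (C1 & C2) == C1 then (X & C1) | C2 == C2.  */
  uint64_t c1 = UINT64_C (0xff00), c2 = UINT64_C (0xffff);
  assert ((c1 & c2) == c1);
  assert (((x & c1) | c2) == c2);

  /* If (C1 | C2) == ~0 then (X & C1) | C2 == X | C2; with C1 = ~sign and
     C2 = sign this is the copysign (x, -1) case.  */
  assert ((~sign | sign) == ~UINT64_C (0));
  assert (((x & ~sign) | sign) == (x | sign));
  return 0;
}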