Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Saturday, October 7, 2023 10:58 AM
>> To: Richard Biener <richard.guent...@gmail.com>
>> Cc: Tamar Christina <tamar.christ...@arm.com>; gcc-patches@gcc.gnu.org;
>> nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>;
>> Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>>
>> Richard Biener <richard.guent...@gmail.com> writes:
>> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
>> > <tamar.christ...@arm.com> wrote:
>> >>
>> >> > -----Original Message-----
>> >> > From: Richard Sandiford <richard.sandif...@arm.com>
>> >> > Sent: Thursday, October 5, 2023 9:26 PM
>> >> > To: Tamar Christina <tamar.christ...@arm.com>
>> >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>> >> > <richard.earns...@arm.com>; Marcus Shawcroft
>> >> > <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> >> >
>> >> > Tamar Christina <tamar.christ...@arm.com> writes:
>> >> > >> -----Original Message-----
>> >> > >> From: Richard Sandiford <richard.sandif...@arm.com>
>> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
>> >> > >> To: Tamar Christina <tamar.christ...@arm.com>
>> >> > >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>> >> > >> <richard.earns...@arm.com>; Marcus Shawcroft
>> >> > >> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> >> > >>
>> >> > >> Tamar Christina <tamar.christ...@arm.com> writes:
>> >> > >> > Hi All,
>> >> > >> >
>> >> > >> > This adds an implementation for masked copysign along with an
>> >> > >> > optimized pattern for masked copysign (x, -1).
>> >> > >>
>> >> > >> It feels like we're ending up with a lot of AArch64-specific code
>> >> > >> that just hard-codes the observation that changing the sign is
>> >> > >> equivalent to changing the top bit.  We then need to make sure that
>> >> > >> we choose the best way of changing the top bit for any given
>> >> > >> situation.
>> >> > >>
>> >> > >> Hard-coding the -1/negative case is one instance of that.  But it
>> >> > >> looks like we also fail to use the best sequence for SVE2.  E.g.
>> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
>> >> > >>
>> >> > >> #include <stdint.h>
>> >> > >>
>> >> > >> void f(double *restrict a, double *restrict b) {
>> >> > >>     for (int i = 0; i < 100; ++i)
>> >> > >>       a[i] = __builtin_copysign(a[i], b[i]);
>> >> > >> }
>> >> > >>
>> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
>> >> > >>     for (int i = 0; i < 100; ++i)
>> >> > >>       a[i] = (a[i] & ~c) | (b[i] & c);
>> >> > >> }
>> >> > >>
>> >> > >> gives:
>> >> > >>
>> >> > >> f:
>> >> > >>         mov     x2, 0
>> >> > >>         mov     w3, 100
>> >> > >>         whilelo p7.d, wzr, w3
>> >> > >> .L2:
>> >> > >>         ld1d    z30.d, p7/z, [x0, x2, lsl 3]
>> >> > >>         ld1d    z31.d, p7/z, [x1, x2, lsl 3]
>> >> > >>         and     z30.d, z30.d, #0x7fffffffffffffff
>> >> > >>         and     z31.d, z31.d, #0x8000000000000000
>> >> > >>         orr     z31.d, z31.d, z30.d
>> >> > >>         st1d    z31.d, p7, [x0, x2, lsl 3]
>> >> > >>         incd    x2
>> >> > >>         whilelo p7.d, w2, w3
>> >> > >>         b.any   .L2
>> >> > >>         ret
>> >> > >> g:
>> >> > >>         mov     x3, 0
>> >> > >>         mov     w4, 100
>> >> > >>         mov     z29.d, x2
>> >> > >>         whilelo p7.d, wzr, w4
>> >> > >> .L6:
>> >> > >>         ld1d    z30.d, p7/z, [x0, x3, lsl 3]
>> >> > >>         ld1d    z31.d, p7/z, [x1, x3, lsl 3]
>> >> > >>         bsl     z31.d, z31.d, z30.d, z29.d
>> >> > >>         st1d    z31.d, p7, [x0, x3, lsl 3]
>> >> > >>         incd    x3
>> >> > >>         whilelo p7.d, w3, w4
>> >> > >>         b.any   .L6
>> >> > >>         ret
>> >> > >>
>> >> > >> I saw that you originally tried to do this in match.pd and that the
>> >> > >> decision was to fold to copysign instead.  But perhaps there's a
>> >> > >> compromise where isel does something with the (new) copysign
>> >> > >> canonical form?  I.e. could we go with your new version of the
>> >> > >> match.pd patch, and add some isel stuff as a follow-on?

[A]

>> >> > >>
>> >> > >
>> >> > > Sure if that's what's desired.... But..
>> >> > >
>> >> > > The example you posted above is for instance worse for x86
>> >> > > https://godbolt.org/z/x9ccqxW6T where the first operation has a
>> >> > > dependency chain of 2 and the latter of 3.  It's likely any open
>> >> > > coding of this operation is going to hurt a target.
>> >> > >
>> >> > > So I'm unsure what isel transform this into...
>> >> >
>> >> > I didn't mean that we should go straight to using isel for the general
>> >> > case, just for the new case.  The example above was instead trying to
>> >> > show the general point that hiding the logic ops in target code is a
>> >> > double-edged sword.
>> >>
>> >> I see.. but the problem here is that transforming copysign (x, -1) into
>> >> (x | 0x8000000) would require an integer operation on an FP value.  I'm
>> >> happy to do it but it seems like it'll be an AArch64 only thing anyway.
>> >>
>> >> If we want to do this we need to check can_change_mode_class or a hook.
>> >> Most targets including x86 reject the conversion.  So it'll just be
>> >> effectively an AArch64 thing.
>> >>
>> >> You're right that the actual equivalent transformation is this
>> >> https://godbolt.org/z/KesfrMv5z but the target won't allow it.
>> >>
>> >> >
>> >> > The x86_64 example for the -1 case would be
>> >> > https://godbolt.org/z/b9s6MaKs8 where the isel change would be an
>> >> > improvement.  Without that, I guess x86_64 will need to have a
>> >> > similar patch to the AArch64 one.
>> >> >
>> >>
>> >> I think that's to be expected.  I think it's logical that every target
>> >> just needs to implement their optabs optimally.
>> >>
>> >> > That said, https://godbolt.org/z/e6nqoqbMh suggests that powerpc64 is
>> >> > probably relying on the current copysign -> neg/abs transform.
>> >> > (Not sure why the second function uses different IVs from the first.)
>> >> >
>> >> > Personally, I wouldn't be against a target hook that indicated whether
>> >> > float bit manipulation is "free" for a given mode, if it comes to that.
>> >>
>> >> I'm really sure Richi would agree there.  Generally speaking I don't
>> >> think people see doing FP operations on Int as beneficial.  But it's
>> >> always the case on AArch64.
>> >
>> > IIRC we're doing fpclassify "expansion" early for example.
>> >
>> > Note the issue I see is that the middle-end shouldn't get in the way of
>> > targets that have a copysign optab.  In case it's worthwhile to
>> > special-case a "setsign" thing then maybe provide an optab for that as
>> > well.  Now, that doesn't help if we want setsign to be expanded from the
>> > middle-end but still want the copysign optab (and not require targets to
>> > implement both if they want to escape middle-end generic expansion of
>> > setsign).
>> >
>> > But yes, I also thought the , 1 and , -1 cases could be special cased by
>> > RTL expansion (or ISEL).  But it would need to invoke costing which
>> > likely means target specific changes anyway... :/
>>
>> FWIW, if we had the target hook I suggested, I don't think AArch64 would
>> need to implement copysign or xorsign optabs.  That's probably true of
>> AArch32 too (at least for all combinations that are likely to matter these
>> days).
>>
>> I'd go one step further and say that, if a target wants to do its own
>> thing for copysign and xorsign, it clearly doesn't meet the requirement
>> that bit manipulation of floats is "free" for that mode.
>>
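As a concrete illustration of the point being discussed -- that copysign is
just a bit-select of the top (sign) bit, and that copysign (x, -1) merely
sets it -- here is a minimal scalar C sketch.  It assumes IEEE-754 binary64
doubles and is purely illustrative, not part of any patch in this thread:

#include <stdint.h>
#include <string.h>

/* copysign (x, y): take the sign bit from y and everything else from x,
   i.e. the scalar form of the AND/AND/ORR (or BSL) sequences shown above.  */
static double
copysign_bits (double x, double y)
{
  uint64_t ux, uy, sign = (uint64_t) 1 << 63;
  memcpy (&ux, &x, sizeof ux);
  memcpy (&uy, &y, sizeof uy);
  ux = (ux & ~sign) | (uy & sign);
  memcpy (&x, &ux, sizeof x);
  return x;
}

/* The -1 special case: copysign (x, -1.0) just sets the sign bit,
   which gives the same value as -fabs (x).  */
static double
setsign_negative (double x)
{
  uint64_t ux, sign = (uint64_t) 1 << 63;
  memcpy (&ux, &x, sizeof ux);
  ux |= sign;
  memcpy (&x, &ux, sizeof x);
  return x;
}

The g() loop earlier in the thread is the same bit-select applied lane-wise
with a general mask, which is why a single BSL (SVE2) or AND/AND/ORR can
implement it.
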
>
> So I'm aware I have no say here, but I'll make one last effort.
>
> The patch isn't just implementing the fneg (fabs ()) optimization just
> because.  The location where it's implemented makes a big difference.
>
> If this optimization is done late, it doesn't fix the regression fully,
> because doing it after all optimization passes have run means it can't
> properly be optimized.
>
> The point of doing the rewrite to ORR early was to accomplish two things:
>
> 1. It makes PRE realize that the block it's splitting would only have 1
>    instruction in it and that such a split is not beneficial.  This is
>    why I'm against doing such optimizations so late.  Optimizations don't
>    happen in isolation, and when they make sense they should happen
>    early.  Transforming fneg (fabs (..)) in isel not only feels wrong but
>    results in a 4% performance loss.
>
> 2. Doing it early also gets the ORRs to be reassociated, changing where
>    the loop-dependent variable lands.  Early makes it land in the merging
>    MOVPRFX rather than requiring a SEL at the end of the iteration.
>
> Replacing the fneg (fabs (..)) with copysign accomplishes 1 but not 2.
> This results in a 2-3% loss, but I can live with that given that doing 1
> gets us back to GCC 12 levels.
>
> Doing fneg (fabs (..)) in isel would have no meaning for me and would not
> fix the regression.  I won't be looking to do that in that case.
>
> If it's acceptable I can transform COPYSIGN (X, -1) in gimple-isel.

FWIW, this is what I meant by [A] above.  I.e. go with your latest
target-independent change, and make isel replace COPYSIGN (X, -1) with
logic operations, under control of a new target hook that says that logic
operations on float values are as cheap as logic operations on integer
values (for a given float mode).

In future, hopefully all COPYSIGNs and XORSIGNs would be handled the same
way, under control of the same hook, but that would be a separate future
change.

I wasn't suggesting recognising (fneg (fabs )) in isel.

But like I say, beware of powerpc64.  It seems to rely on the current
reverse transformation of (copysign x -1) to (fneg (fabs x)).  (Could be
wrong, but would be worth checking.)

Thanks,
Richard

> So before I start on this, would this be acceptable for you?
>
> Thanks,
> Tamar
>
>> > So I have no good advice here but I hoped that even the generic target
>> > specific copysign implementation with and & xor would eventually be
>> > optimized later on RTL for constant second arg?
>>
>> Yeah.  It looks like the required logic is there for scalars, it just
>> needs extending to vectors.
>>
>> The patch below (untested beyond simple cases) seems to be enough to fix
>> it, but using the simplify routines even for CONST_INTs might be
>> controversial.
>>
>> Thanks,
>> Richard
>>
>>
>> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
>> index bd9443dbcc2..5a9b1745673 100644
>> --- a/gcc/simplify-rtx.cc
>> +++ b/gcc/simplify-rtx.cc
>> @@ -3409,20 +3409,20 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
>>
>>        /* Canonicalize (X & C1) | C2.  */
>>        if (GET_CODE (op0) == AND
>> -          && CONST_INT_P (trueop1)
>> -          && CONST_INT_P (XEXP (op0, 1)))
>> +          && CONSTANT_P (trueop1)
>> +          && CONSTANT_P (XEXP (op0, 1)))
>>          {
>> -          HOST_WIDE_INT mask = GET_MODE_MASK (mode);
>> -          HOST_WIDE_INT c1 = INTVAL (XEXP (op0, 1));
>> -          HOST_WIDE_INT c2 = INTVAL (trueop1);
>> +          rtx c1 = XEXP (op0, 1);
>> +          rtx c2 = trueop1;
>>
>>            /* If (C1&C2) == C1, then (X&C1)|C2 becomes C2.  */
>> -          if ((c1 & c2) == c1
>> +          if (rtx_equal_p (simplify_binary_operation (AND, mode, c1, c2), c1)
>>                && !side_effects_p (XEXP (op0, 0)))
>>              return trueop1;
>>
>>            /* If (C1|C2) == ~0 then (X&C1)|C2 becomes X|C2.  */
>> -          if (((c1|c2) & mask) == mask)
>> +          if (rtx_equal_p (simplify_binary_operation (IOR, mode, c1, c2),
>> +                           CONSTM1_RTX (mode)))
>>              return simplify_gen_binary (IOR, mode, XEXP (op0, 0), op1);
>>          }
>>
>> @@ -3732,7 +3732,7 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
>>       machines, and also has shorter instruction path length.  */
>>    if (GET_CODE (op0) == AND
>>        && GET_CODE (XEXP (op0, 0)) == XOR
>> -      && CONST_INT_P (XEXP (op0, 1))
>> +      && CONSTANT_P (XEXP (op0, 1))
>>        && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1))
>>      {
>>        rtx a = trueop1;
>> @@ -3746,7 +3746,7 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
>>      /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C))  */
>>    else if (GET_CODE (op0) == AND
>>             && GET_CODE (XEXP (op0, 0)) == XOR
>> -           && CONST_INT_P (XEXP (op0, 1))
>> +           && CONSTANT_P (XEXP (op0, 1))
>>             && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1))
>>      {
>>        rtx a = XEXP (XEXP (op0, 0), 0);
>> --
>> 2.25.1
>>
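
For reference, the two identities that the hunks above generalize from
CONST_INT to arbitrary (e.g. vector duplicate) constants can be spot-checked
at the source level.  A quick, purely illustrative self-check using the
binary64 copysign masks as C1/C2 (not part of the patch):

#include <assert.h>
#include <stdint.h>

int
main (void)
{
  /* The masks the copysign expansion uses for binary64.  */
  const uint64_t c1 = UINT64_C (0x7fffffffffffffff);  /* everything but the sign bit */
  const uint64_t c2 = UINT64_C (0x8000000000000000);  /* the sign bit */

  for (uint64_t x = 0; x < 1024; ++x)
    {
      /* (C1|C2) == ~0, so (X & C1) | C2 simplifies to X | C2.  */
      assert (((x & c1) | c2) == (x | c2));
      /* (C2&C2) == C2, so (X & C2) | C2 simplifies to C2.  */
      assert (((x & c2) | c2) == c2);
    }
  return 0;
}

With the draft change, the same reasoning applies when C1 and C2 are
CONST_VECTOR duplicates of these masks, which is the fold being aimed at
for the constant-second-argument copysign case.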