On Wed, Sep 8, 2021 at 3:43 PM Jakub Jelinek via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi!
>
> As the testcase shows, we miscompile @xorsign<mode>3_1 if both input
> operands are in the same register, because the splitter overwrites op1
> with op1 & mask before using op0.
>
> For dest = xorsign op0, op0 we can actually simplify it from
> dest = (op0 & mask) ^ op0 to dest = op0 & ~mask (aka abs).
>
> The expander change is an optimization improvement: if we know at
> expansion time that it is xorsign op0, op0, we can emit abs right away
> and get better code through that.
>
> The @xorsign<mode>3_1 change is a fix for the case where xorsign wasn't
> known to have the same operands during expansion, but they become the
> same during RTL optimizations.  For non-AVX we need to use an
> earlyclobber; we require dest and op1 to be the same, but op0 must be
> different because we overwrite op1 first.  For AVX the constraints
> ensure that at most 2 of the 3 operands may be the same register, and
> if both inputs are the same, the new code handles that case.
> This case can be easily tested with the xorsign<mode>3 expander change
> reverted.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
Patch LGTM.
PS: I'm curious why we need the post-reload splitter @xorsign<mode>3_1
for scalar modes; can't we just expand them into and/xor operations in
the expander, like the vector modes do?  Let me do some experiments to
see whether it is OK to remove the splitter.

> Thinking about it more this morning, while this patch fixes the problems
> revealed in the testcase, the recent PR89984 change was buggy too, but
> perhaps that can be fixed incrementally.  Because for AVX the new code
> destructively modifies op1.  If that is different from dest, say on:
> float
> foo (float x, float y)
> {
>   return x * __builtin_copysignf (1.0f, y) + y;
> }
> then we get after RA:
> (insn 8 7 9 2 (set (reg:SF 20 xmm0 [orig:82 _2 ] [82])
>         (unspec:SF [
>                 (reg:SF 20 xmm0 [88])
>                 (reg:SF 21 xmm1 [89])
>                 (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S16 A128])
>             ] UNSPEC_XORSIGN)) "hohoho.c":4:12 649 {xorsignsf3_1}
>      (nil))
> (insn 9 8 15 2 (set (reg:SF 20 xmm0 [87])
>         (plus:SF (reg:SF 20 xmm0 [orig:82 _2 ] [82])
>             (reg:SF 21 xmm1 [89]))) "hohoho.c":4:44 1021 {*fop_sf_comm}
>      (nil))
> but split the xorsign into:
> 	vandps	.LC0(%rip), %xmm1, %xmm1
> 	vxorps	%xmm0, %xmm1, %xmm0
> and then the addition:
> 	vaddss	%xmm1, %xmm0, %xmm0
> which means we miscompile it - instead of adding y in the end we add
> __builtin_copysignf (0.0f, y).
> So, I wonder whether we instead want, in addition to the &Yv <- Yv, 0
> alternative (enabled for both pre-AVX and AVX as in this patch), a
> &Yv <- Yv, Yv alternative where the destination must be different from
> the inputs, and another Yv <- Yv, Yv where it can be the same but then
> needs a match_scratch (with X for the other alternatives and =Yv for
> the last one).
> That way we'd always have a safe register to store the op1 & mask
> value into: the destination in the first alternative (known to be equal
> to op1, which is needed for non-AVX but OK for AVX too), the destination
> in the second alternative (known to be different from both inputs), or,
> in the third alternative, some other register like xmm2.  That third
> alternative covers cases like
> float bar (float x, float y) { return x * __builtin_copysignf (1.0f, y); }
> where op1 is naturally xmm1 and dest == op0 is naturally xmm0.
>
> 2021-09-07  Jakub Jelinek  <ja...@redhat.com>
>
> 	PR target/102224
> 	* config/i386/i386.md (xorsign<mode>3): If operands[1] is equal to
> 	operands[2], emit abs<mode>2 instead.
> 	(@xorsign<mode>3_1): Add early-clobbers for output operand, enable
> 	first alternative even for avx, add another alternative with
> 	=&Yv <- 0, Yv, Yvm constraints.
> 	* config/i386/i386-expand.c (ix86_split_xorsign): If op0 is equal
> 	to op1, emit vpandn instead.
>
> 	* gcc.dg/pr102224.c: New test.
> 	* gcc.target/i386/avx-pr102224.c: New test.
>
> --- gcc/config/i386/i386.md.jj	2021-09-06 14:47:43.199049975 +0200
> +++ gcc/config/i386/i386.md	2021-09-07 13:13:50.825603852 +0200
> @@ -10803,21 +10803,27 @@ (define_expand "xorsign<mode>3"
>      (match_operand:MODEF 1 "register_operand")
>      (match_operand:MODEF 2 "register_operand")]
>    "SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH"
> -  "ix86_expand_xorsign (operands); DONE;")
> +{
> +  if (rtx_equal_p (operands[1], operands[2]))
> +    emit_insn (gen_abs<mode>2 (operands[0], operands[1]));
> +  else
> +    ix86_expand_xorsign (operands);
> +  DONE;
> +})
>
>  (define_insn_and_split "@xorsign<mode>3_1"
> -  [(set (match_operand:MODEF 0 "register_operand" "=Yv,Yv")
> +  [(set (match_operand:MODEF 0 "register_operand" "=&Yv,&Yv,&Yv")
>  	(unspec:MODEF
> -	  [(match_operand:MODEF 1 "register_operand" "Yv,Yv")
> -	   (match_operand:MODEF 2 "register_operand" "0,Yv")
> -	   (match_operand:<ssevecmode> 3 "nonimmediate_operand" "Yvm,Yvm")]
> +	  [(match_operand:MODEF 1 "register_operand" "Yv,0,Yv")
> +	   (match_operand:MODEF 2 "register_operand" "0,Yv,Yv")
> +	   (match_operand:<ssevecmode> 3 "nonimmediate_operand" "Yvm,Yvm,Yvm")]
>  	  UNSPEC_XORSIGN))]
>    "SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH"
>    "#"
>    "&& reload_completed"
>    [(const_int 0)]
>    "ix86_split_xorsign (operands); DONE;"
> -  [(set_attr "isa" "noavx,avx")])
> +  [(set_attr "isa" "*,avx,avx")])
>
>  ;; One complement instructions
>
> --- gcc/config/i386/i386-expand.c.jj	2021-09-07 12:35:26.340685935 +0200
> +++ gcc/config/i386/i386-expand.c	2021-09-07 13:09:43.961032052 +0200
> @@ -2289,12 +2289,39 @@ ix86_split_xorsign (rtx operands[])
>    mode = GET_MODE (dest);
>    vmode = GET_MODE (mask);
>
> -  op1 = lowpart_subreg (vmode, op1, mode);
> -  x = gen_rtx_AND (vmode, op1, mask);
> -  emit_insn (gen_rtx_SET (op1, x));
> +  /* The constraints ensure that for non-AVX dest == op1 is
> +     different from op0, and for AVX that at most two of
> +     dest, op0 and op1 are the same register but the third one
> +     is different.  */
> +  if (rtx_equal_p (op0, op1))
> +    {
> +      gcc_assert (TARGET_AVX && !rtx_equal_p (op0, dest));
> +      if (vmode == V4SFmode)
> +	vmode = V4SImode;
> +      else
> +	{
> +	  gcc_assert (vmode == V2DFmode);
> +	  vmode = V2DImode;
> +	}
> +      mask = lowpart_subreg (vmode, mask, GET_MODE (mask));
> +      if (MEM_P (mask))
> +	{
> +	  rtx msk = lowpart_subreg (vmode, dest, mode);
> +	  emit_insn (gen_rtx_SET (msk, mask));
> +	  mask = msk;
> +	}
> +      op0 = lowpart_subreg (vmode, op0, mode);
> +      x = gen_rtx_AND (vmode, gen_rtx_NOT (vmode, mask), op0);
> +    }
> +  else
> +    {
> +      op1 = lowpart_subreg (vmode, op1, mode);
> +      x = gen_rtx_AND (vmode, op1, mask);
> +      emit_insn (gen_rtx_SET (op1, x));
>
> -  op0 = lowpart_subreg (vmode, op0, mode);
> -  x = gen_rtx_XOR (vmode, op1, op0);
> +      op0 = lowpart_subreg (vmode, op0, mode);
> +      x = gen_rtx_XOR (vmode, op1, op0);
> +    }
>
>    dest = lowpart_subreg (vmode, dest, mode);
>    emit_insn (gen_rtx_SET (dest, x));
> --- gcc/testsuite/gcc.dg/pr102224.c.jj	2021-09-07 12:17:43.088507018 +0200
> +++ gcc/testsuite/gcc.dg/pr102224.c	2021-09-07 13:11:52.923241157 +0200
> @@ -0,0 +1,49 @@
> +/* PR target/102224 */
> +/* { dg-do run } */
> +/* { dg-options "-O2" } */
> +
> +__attribute__((noipa)) float
> +foo (float x)
> +{
> +  return x * __builtin_copysignf (1.0f, x);
> +}
> +
> +__attribute__((noipa)) float
> +bar (float x, float y)
> +{
> +  return x * __builtin_copysignf (1.0f, y);
> +}
> +
> +__attribute__((noipa)) float
> +baz (float z, float x)
> +{
> +  return x * __builtin_copysignf (1.0f, x);
> +}
> +
> +__attribute__((noipa)) float
> +qux (float z, float x, float y)
> +{
> +  return x * __builtin_copysignf (1.0f, y);
> +}
> +
> +int
> +main ()
> +{
> +  if (foo (1.0f) != 1.0f
> +      || foo (-4.0f) != 4.0f)
> +    __builtin_abort ();
> +  if (bar (1.25f, 7.25f) != 1.25f
> +      || bar (1.75f, -3.25f) != -1.75f
> +      || bar (-2.25f, 7.5f) != -2.25f
> +      || bar (-3.0f, -4.0f) != 3.0f)
> +    __builtin_abort ();
> +  if (baz (5.5f, 1.0f) != 1.0f
> +      || baz (4.25f, -4.0f) != 4.0f)
> +    __builtin_abort ();
> +  if (qux (1.0f, 1.25f, 7.25f) != 1.25f
> +      || qux (2.0f, 1.75f, -3.25f) != -1.75f
> +      || qux (3.0f, -2.25f, 7.5f) != -2.25f
> +      || qux (4.0f, -3.0f, -4.0f) != 3.0f)
> +    __builtin_abort ();
> +  return 0;
> +}
> --- gcc/testsuite/gcc.target/i386/avx-pr102224.c.jj	2021-09-07 13:12:41.429567551 +0200
> +++ gcc/testsuite/gcc.target/i386/avx-pr102224.c	2021-09-07 13:13:11.397151393 +0200
> @@ -0,0 +1,23 @@
> +/* PR tree-optimization/51581 */
> +/* { dg-do run } */
> +/* { dg-options "-O2 -mavx" } */
> +/* { dg-require-effective-target avx } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "avx-check.h"
> +#endif
> +#ifndef TEST
> +#define TEST avx_test
> +#endif
> +
> +#define main main1
> +#include "../../gcc.dg/pr102224.c"
> +#undef main
> +
> +#include CHECK_H
> +
> +static void
> +TEST (void)
> +{
> +  main1 ();
> +}
>
> 	Jakub

-- 
BR,
Hongtao