On Tue, Aug 8, 2023 at 10:07 AM Richard Biener <rguent...@suse.de> wrote:
>
> On Mon, 7 Aug 2023, Uros Bizjak wrote:
>
> > On Mon, Jul 31, 2023 at 11:40 AM Richard Biener <rguent...@suse.de> wrote:
> > >
> > > On Sun, 30 Jul 2023, Uros Bizjak wrote:
> > >
> > > > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > > > named patterns in order to avoid generation of partial vector V4SFmode
> > > > trapping instructions.
> > > >
> > > > The new option is enabled by default, because even with sanitization,
> > > > a small but consistent speed up of 2 to 3% with Polyhedron capacita
> > > > benchmark can be achieved vs. scalar code.
> > > >
> > > > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > > > vs. scalar code. This is what clang does by default, as it defaults
> > > > to -fno-trapping-math.
> > >
> > > I like the new option, note you lack invoke.texi documentation where
> > > I'd also elaborate a bit on the interaction with -fno-trapping-math
> > > and the possible performance impact when NaNs or denormals leak
> > > into the upper halves and cross-reference -mdaz-ftz.
> >
> > The attached doc patch is invoke.texi entry for -mmmxfp-with-sse
> > option. It is written in a way to also cover half-float vectors. WDYT?
>
> "generate trapping floating-point operations"
>
> I'd say "generate floating-point operations that might affect the
> set of floating point status flags", the word "trapping" is IMHO
> misleading.
> Not sure if "set of floating point status flags" is the correct term,
> but it's what the C standard seems to refer to when talking about
> things you get with fegetexceptflag. feraiseexcept refers to
> "floating-point exceptions". Unfortunately the -fno-trapping-math
> documentation is similarly confusing (and maybe even wrong, I read
> it to conform to 'non-stop' IEEE arithmetic).

Thanks for suggesting the right terminology. I think that:

+@opindex mpartial-vector-math
+@item -mpartial-vector-math
+This option enables GCC to generate floating-point operations that might
+affect the set of floating point status flags on partial vectors, where
+vector elements reside in the low part of the 128-bit SSE register. Unless
+@option{-fno-trapping-math} is specified, the compiler guarantees correct
+behavior by sanitizing all input operands to have zeroes in the unused
+upper part of the vector register. Note that by using built-in functions
+or inline assembly with partial vector arguments, NaNs, denormal or invalid
+values can leak into the upper part of the vector, causing possible
+performance issues when @option{-fno-trapping-math} is in effect. These
+issues can be mitigated by manually sanitizing the upper part of the partial
+vector argument register or by using @option{-mdaz-ftz} to set
+denormals-are-zero (DAZ) flag in the MXCSR register.

now explains in adequate detail what the option does. IMO, the
"floating-point operations that might affect the set of floating point
status flags" wording correctly identifies the affected operations, so an
example, as suggested below, is not necessary.

> I'd maybe give an example of a FP operation that's _not_ affected
> by the flag (copysign?).

Please note that I have renamed the option to "-mpartial-vector-math"
with a short target-specific description:

+mpartial-vector-math
+Target Var(ix86_partial_vec_math) Init(1)
+Enable floating-point status flags setting SSE vector operations on partial vectors

which I think summarises the option (without the word "trapping"). The
same approach will be taken for Float16 operations, so it is not
specific to MMX vectors.

> Otherwise it looks OK to me.

Thanks, I have attached the RFC V2 patch; I plan to submit a formal
patch later today.

Uros.

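For readers who want to see why the upper half has to be sanitized, here is a minimal sketch; it is not part of the patch. It assumes an x86-64 target and the <xmmintrin.h> intrinsics, and the helper names add_low_halves and zero_upper_half are made up for illustration. It performs a full-width addps on a register image whose two live elements sit in lanes 0 and 1, and reports which MXCSR status flags the addition raised.

```c
#include <assert.h>
#include <float.h>
#include <xmmintrin.h>

/* Hypothetical helper (not part of the patch): adds two partial vectors
   whose two live elements sit in lanes 0 and 1 of a 4-lane register image,
   using one full-width addps.  Returns the MXCSR exception flags raised by
   the addition; out[0] and out[1] hold the actual V2SF result.  */
static unsigned int
add_low_halves (const volatile float a[4], const volatile float b[4],
                volatile float out[4])
{
  float tmp[4];

  assert (a && b && out);

  /* Clear the sticky exception flags so only this addps is observed.  */
  _MM_SET_EXCEPTION_STATE (0);

  /* _mm_set_ps takes lanes from highest to lowest.  The volatile reads
     keep the compiler from folding or moving the addition.  */
  __m128 va = _mm_set_ps (a[3], a[2], a[1], a[0]);
  __m128 vb = _mm_set_ps (b[3], b[2], b[1], b[0]);
  _mm_storeu_ps (tmp, _mm_add_ps (va, vb));

  for (int i = 0; i < 4; i++)
    out[i] = tmp[i];

  return _MM_GET_EXCEPTION_STATE ();
}

/* Hypothetical helper: the "manual sanitizing" mentioned in the doc entry,
   i.e. zero the unused upper half of the register image.  */
static void
zero_upper_half (volatile float v[4])
{
  v[2] = 0.0f;
  v[3] = 0.0f;
}
```

If lane 2 of both inputs holds FLT_MAX, add_low_halves returns a value with _MM_EXCEPT_OVERFLOW set even though the sums in lanes 0 and 1 are exact. After calling zero_upper_half on both inputs, the same call returns 0 when the two low-lane sums are exact. That spurious flag is what the sanitizing under -ftrapping-math is meant to prevent.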
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 1cc8563477a..8d9a1ae93f3 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -632,6 +632,10 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
 EnumValue
 Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
 
+mpartial-vector-math
+Target Var(ix86_partial_vec_math) Init(1)
+Enable floating-point status flags setting SSE vector operations on partial vectors
+
 mmove-max=
 Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
 Maximum number of bits that can be moved from memory to memory efficiently.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index b49554e9b8f..95f7a0113e7 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -595,7 +595,18 @@ (define_expand "movq_<mode>_to_sse"
 	  (match_operand:V2FI_V4HF 1 "nonimmediate_operand")
 	  (match_dup 2)))]
   "TARGET_SSE2"
-  "operands[2] = CONST0_RTX (<MODE>mode);")
+{
+  if (<MODE>mode == V2SFmode
+      && !flag_trapping_math)
+    {
+      rtx op1 = force_reg (<MODE>mode, operands[1]);
+      emit_move_insn (operands[0], lowpart_subreg (<mmxdoublevecmode>mode,
+						   op1, <MODE>mode));
+      DONE;
+    }
+
+  operands[2] = CONST0_RTX (<MODE>mode);
+})
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;
@@ -648,7 +659,7 @@ (define_expand "<insn>v2sf3"
 	(plusminusmult:V2SF
 	  (match_operand:V2SF 1 "nonimmediate_operand")
 	  (match_operand:V2SF 2 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -726,7 +737,7 @@ (define_expand "divv2sf3"
   [(set (match_operand:V2SF 0 "register_operand")
 	(div:V2SF (match_operand:V2SF 1 "register_operand")
 		  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -748,7 +759,7 @@ (define_expand "<code>v2sf3"
 	(smaxmin:V2SF
 	  (match_operand:V2SF 1 "register_operand")
 	  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -850,7 +861,7 @@ (define_insn "mmx_rcpit2v2sf3"
 (define_expand "sqrtv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(sqrt:V2SF (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -931,7 +942,7 @@ (define_insn_and_split "*mmx_haddv2sf3_low"
 	  (vec_select:SF
 	    (match_dup 1)
 	    (parallel [(match_operand:SI 3 "const_0_to_1_operand")]))))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math
    && INTVAL (operands[2]) != INTVAL (operands[3])
    && ix86_pre_reload_split ()"
   "#"
@@ -977,7 +988,7 @@ (define_insn_and_split "*mmx_hsubv2sf3_low"
 	  (vec_select:SF
 	    (match_dup 1)
 	    (parallel [(const_int 1)]))))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
@@ -1039,7 +1050,7 @@ (define_expand "vec_addsubv2sf3"
 	    (match_operand:V2SF 2 "nonimmediate_operand"))
 	  (plus:V2SF (match_dup 1) (match_dup 2))
 	  (const_int 1)))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE3 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -1102,7 +1113,7 @@ (define_expand "vec_cmpv2sfv2si"
 	(match_operator:V2SI 1 ""
 	  [(match_operand:V2SF 2 "nonimmediate_operand")
 	   (match_operand:V2SF 3 "nonimmediate_operand")]))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx ops[4];
   ops[3] = gen_reg_rtx (V4SFmode);
@@ -1128,7 +1139,7 @@ (define_expand "vcond<mode>v2sf"
 	     (match_operand:V2SF 5 "nonimmediate_operand")])
 	  (match_operand:V2FI 1 "general_operand")
 	  (match_operand:V2FI 2 "general_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx ops[6];
   ops[5] = gen_reg_rtx (V4SFmode);
@@ -1318,7 +1329,7 @@ (define_expand "fmav2sf4"
 	  (match_operand:V2SF 2 "nonimmediate_operand")
 	  (match_operand:V2SF 3 "nonimmediate_operand")))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1343,7 +1354,7 @@ (define_expand "fmsv2sf4"
 	  (neg:V2SF
 	    (match_operand:V2SF 3 "nonimmediate_operand"))))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1368,7 +1379,7 @@ (define_expand "fnmav2sf4"
 	  (match_operand:V2SF 2 "nonimmediate_operand")
 	  (match_operand:V2SF 3 "nonimmediate_operand")))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1394,7 +1405,7 @@ (define_expand "fnmsv2sf4"
 	  (neg:V2SF
 	    (match_operand:V2SF 3 "nonimmediate_operand"))))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1420,7 +1431,7 @@ (define_expand "fnmsv2sf4"
 (define_expand "fix_truncv2sfv2si2"
   [(set (match_operand:V2SI 0 "register_operand")
 	(fix:V2SI (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1436,7 +1447,7 @@ (define_expand "fix_truncv2sfv2si2"
 (define_expand "fixuns_truncv2sfv2si2"
   [(set (match_operand:V2SI 0 "register_operand")
 	(unsigned_fix:V2SI (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
+  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1461,7 +1472,7 @@ (define_insn "mmx_fix_truncv2sfv2si2"
 (define_expand "floatv2siv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(float:V2SF (match_operand:V2SI 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SImode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1477,7 +1488,7 @@ (define_expand "floatv2siv2sf2"
 (define_expand "floatunsv2siv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(unsigned_float:V2SF (match_operand:V2SI 1 "nonimmediate_operand")))]
-  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
+  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SImode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1754,7 +1765,7 @@ (define_expand "vec_initv2sfsf"
 (define_expand "nearbyintv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1770,7 +1781,7 @@ (define_expand "nearbyintv2sf2"
 (define_expand "rintv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1786,8 +1797,8 @@ (define_expand "rintv2sf2"
 (define_expand "lrintv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1804,7 +1815,7 @@ (define_expand "ceilv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1820,8 +1831,8 @@ (define_expand "ceilv2sf2"
 (define_expand "lceilv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1838,7 +1849,7 @@ (define_expand "floorv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1854,8 +1865,8 @@ (define_expand "floorv2sf2"
 (define_expand "lfloorv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1872,7 +1883,7 @@ (define_expand "btruncv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1889,7 +1900,7 @@ (define_expand "roundv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1905,8 +1916,8 @@ (define_expand "roundv2sf2"
 (define_expand "lroundv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 674f956f4b8..f5081c0cfb9 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1419,6 +1419,7 @@ See RS/6000 and PowerPC Options.
 -mcld -mcx16 -msahf -mmovbe -mcrc32 -mmwait
 -mrecip -mrecip=@var{opt}
 -mvzeroupper -mprefer-avx128 -mprefer-vector-width=@var{opt}
+-mpartial-vector-math
 -mmove-max=@var{bits} -mstore-max=@var{bits}
 -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4 -mavx
 -mavx2 -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl
@@ -33754,6 +33755,23 @@ This option instructs GCC to use 128-bit AVX instructions instead of
 This option instructs GCC to use @var{opt}-bit vector width in instructions
 instead of default on the selected platform.
 
+@opindex mpartial-vector-math
+@item -mpartial-vector-math
+This option enables GCC to generate floating-point operations that might
+affect the set of floating point status flags on partial vectors, where
+vector elements reside in the low part of the 128-bit SSE register. Unless
+@option{-fno-trapping-math} is specified, the compiler guarantees correct
+behavior by sanitizing all input operands to have zeroes in the unused
+upper part of the vector register. Note that by using built-in functions
+or inline assembly with partial vector arguments, NaNs, denormal or invalid
+values can leak into the upper part of the vector, causing possible
+performance issues when @option{-fno-trapping-math} is in effect. These
+issues can be mitigated by manually sanitizing the upper part of the partial
+vector argument register or by using @option{-mdaz-ftz} to set
+denormals-are-zero (DAZ) flag in the MXCSR register.
+
+This option is enabled by default.
+
 @opindex mmove-max
 @item -mmove-max=@var{bits}
 This option instructs GCC to set the maximum number of bits can be
diff --git a/gcc/testsuite/gcc.target/i386/pr110832-1.c b/gcc/testsuite/gcc.target/i386/pr110832-1.c
new file mode 100644
index 00000000000..3df22e3b5a7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110832-1.c
@@ -0,0 +1,12 @@
+/* PR target/110832 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse2 -mno-partial-vector-math" } */
+
+typedef float __attribute__((vector_size(8))) v2sf;
+
+v2sf test (v2sf a, v2sf b)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-assembler-not "addps" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr110832-2.c b/gcc/testsuite/gcc.target/i386/pr110832-2.c
new file mode 100644
index 00000000000..4d16488b4fb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110832-2.c
@@ -0,0 +1,13 @@
+/* PR target/110832 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -ftrapping-math -msse2 -mpartial-vector-math -dp" } */
+
+typedef float __attribute__((vector_size(8))) v2sf;
+
+v2sf test (v2sf a, v2sf b)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-assembler "addps" } } */
+/* { dg-final { scan-assembler-times "\\*vec_concatv4sf_0" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr110832-3.c b/gcc/testsuite/gcc.target/i386/pr110832-3.c
new file mode 100644
index 00000000000..02cb4fc8100
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110832-3.c
@@ -0,0 +1,13 @@
+/* PR target/110832 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -fno-trapping-math -msse2 -mpartial-vector-math -dp" } */
+
+typedef float __attribute__((vector_size(8))) v2sf;
+
+v2sf test (v2sf a, v2sf b)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-assembler "addps" } } */
+/* { dg-final { scan-assembler-not "\\*vec_concatv4sf_0" } } */