On Tue, Aug 15, 2017 at 4:43 PM, Richard Biener <richard.guent...@gmail.com> wrote: > On Tue, Aug 15, 2017 at 4:21 PM, Richard Biener > <richard.guent...@gmail.com> wrote: >> On Tue, Aug 15, 2017 at 3:52 PM, Martin Jambor <mjam...@suse.cz> wrote: >>> Hi Joseph, >>> >>> On Thu, May 26, 2016 at 09:02:02PM +0000, Joseph Myers wrote: >>>> On Thu, 26 May 2016, Jan Hubicka wrote: >>>> >>>> > > > +ffp-int-builtin-inexact >>>> > > > +Common Report Var(flag_fp_int_builtin_inexact) Optimization >>>> > > > +Allow built-in functions ceil, floor, round, trunc to raise >>>> > > > \"inexact\" exceptions. >>>> > >>>> > When adding new codegen option which affects the correctness, it is also >>>> > necessary to update can_inline_edge_p and inline_call. >>>> >>>> This patch version adds handling for the new option in those places. >>>> Other changes: the default for the option is corrected so that >>>> -ffp-int-builtin-inexact really is in effect by default as intended; >>>> md.texi documentation for the patterns in question is updated to >>>> describe how they are affected by this option. >>>> >>>> >>>> Add option for whether ceil etc. can raise "inexact", adjust x86 >>>> conditions. >>>> >>>> In ISO C99/C11, the ceil, floor, round and trunc functions may or may >>>> not raise the "inexact" exception for noninteger arguments. Under TS >>>> 18661-1:2014, the C bindings for IEEE 754-2008, these functions are >>>> prohibited from raising "inexact", in line with the general rule that >>>> "inexact" is only when the mathematical infinite precision result of a >>>> function differs from the result after rounding to the target type. >>>> >>>> GCC has no option to select TS 18661 requirements for not raising >>>> "inexact" when expanding built-in versions of these functions inline. >>>> Furthermore, even given such requirements, the conditions on the x86 >>>> insn patterns for these functions are unnecessarily restrictive. I'd >>>> like to make the out-of-line glibc versions follow the TS 18661 >>>> requirements; in the cases where this slows them down (the cases using >>>> x87 floating point), that makes it more important for inline versions >>>> to be used when the user does not care about "inexact". >>> >>> Unfortunately, I have found out that this commit regresses run-time of >>> 538.imagick_r by about 5% on an AMD Ryzen machine and by 9% on a >>> slightly older Intel machine when compiled with just -O2 (so with >>> generic tuning). >>> >>> The problem is that ImageMagick spends a lot time calculating ceil and >>> floor and even with with generic tuning their library implementations >>> can use the ifunc mechanism to execute an efficient SSE 4.1 >>> implementation on the processors that have it, whereas the inline >>> expansion cannot do so and is much bigger and much much slower. To >>> give you an idea, this is the profile before and after the change: >>> >>> | Symbol | 237073 | % of runtime | 237074 | % >>> of runtime | sample delta | % sample delta | >>> >>> |----------------------------------+---------+--------------+---------+--------------+--------------+----------------| >>> | MorphologyApply | 1058932 | 52.88% | 1043194 | >>> 46.65% | -15738 | 98.51 | >>> | MeanShiftImage | 508088 | 25.50% | 833378 | >>> 37.43% | 325290 | 164.02 | >>> | GetVirtualPixelsFromNexus | 173354 | 8.70% | 168298 | >>> 7.56% | -5056 | 97.08 | >>> | SetPixelCacheNexusPixels.isra.10 | 114101 | 5.72% | 112790 | >>> 5.07% | -1311 | 98.85 | >>> | __ceil_sse41 | 21404 | 1.07% | 0 | >>> 0 | -21404 | 0.00 | >>> | __floor_sse41 | 19179 | 0.96% | 0 | >>> 0 | -19179 | 0.00 | >>> >>> And all of the sample count increases in MeanShiftImage can be tracked >>> down to the line in the cource calculating >>> >>> if ((x-floor(x)) < (ceil(x)-x)) >>> >>> I am not sure what to do about this, to me it seems that the >>> -ffp-int-builtin-inexact simply has a wrong default value, at least >>> for x86_64, as it was added in order not to slow code down but does >>> exactly that (all of the slowdown of course disappears when >>> -fno-fp-int-builtin-inexact is used). >>> >>> Or is the situation somehow more complex? >> >> I suppose these days the big inline sequences for the rounding functions >> are no longer profitable for generic tuning (assuming 'generic' nowadays >> includes SSE41 support). Esp. floor/ceil includes jumpy compensation >> code. > > Note other options are to inline if (__builtin_cpu_supports > ("sse4.1")) ... else ... > or to emit a call to a (local? comdat?) __gcc_floor ifunc dispatcher and > emit the ifunc math library ourselves (like we'd do with > attribute(target(""))). > > Not sure if we really can assume glibc is intelligent enough -- does it have > non-SSE4.1 implementations for ceil/floor? Back in time I implemented > these SSE2 expansions it used the generic C code which was awfully slow...
Still uses sysdeps/ieee754/dbl-64/wordsize-64/s_ceil.c which is quite slow compared to using our inline sequence. So I'd try the "easy" way of expanding if (__builtin_cpu_supports ("sse4.1")) as the sse4.1 sequence is just a single instruction. The interesting part of the story will be to make sure we can emit that even if ! TARGET_ROUND ... Uros, any idea how to accomplish this? Or is the idea of a "local" ifunc better? Note the ABI boundary will be expensive but I guess the conditional sequence as well (and it will disturb RA even if predicted to have SSE 4.1). Richard. > Richard. > >> Note that (x - floor(x)) < (ceil(x) - x) looks like some clever >> simplification >> might speed it up. Not that I can come up with sth off my head... >> >> Richard. >> >>> Martin >>> >>> >>>> >>>> This patch fixes these issues. A new option >>>> -fno-fp-int-builtin-inexact is added to request TS 18661 rules for >>>> these functions; the default -ffp-int-builtin-inexact reflects that >>>> such exceptions are allowed by C99 and C11. (The intention is that if >>>> C2x incorporates TS 18661-1, then the default would change in C2x >>>> mode.) >>>> >>>> The x86 built-ins for rint (x87, SSE2 and SSE4.1) are made >>>> unconditionally available (no longer depending on >>>> -funsafe-math-optimizations or -fno-trapping-math); "inexact" is >>>> correct for noninteger arguments to rint. For floor, ceil and trunc, >>>> the x87 and SSE2 built-ins are OK if -ffp-int-builtin-inexact or >>>> -fno-trapping-math (they may raise "inexact" for noninteger >>>> arguments); the SSE4.1 built-ins are made to use ROUND_NO_EXC so that >>>> they do not raise "inexact" and so are OK unconditionally. >>>> >>>> Now, while there was no semantic reason for depending on >>>> -funsafe-math-optimizations, the insn patterns had such a dependence >>>> because of use of gen_truncxf<mode>2_i387_noop to truncate back to >>>> SFmode or DFmode after using frndint in XFmode. In this case a no-op >>>> truncation is safe because rounding to integer always produces an >>>> exactly representable value (the same reason why IEEE semantics say it >>>> shouldn't produce "inexact") - but of course that insn pattern isn't >>>> safe because it would also match cases where the truncation is not in >>>> fact a no-op. To allow frndint to be used for SFmode and DFmode >>>> without that unsafe pattern, the relevant frndint patterns are >>>> extended to SFmode and DFmode or new SFmode and DFmode patterns added, >>>> so that the frndint operation can be represented in RTL as an >>>> operation acting directly on SFmode or DFmode without the extension >>>> and the problematic truncation. >>>> >>>> A generic test of the new option is added, as well as x86-specific >>>> tests, both execution tests including the generic test with different >>>> x86 options and scan-assembler tests verifying that functions that >>>> should be inlined with different options are indeed inlined. >>>> >>>> I think other architectures are OK for TS 18661-1 semantics already. >>>> Considering those defining "ceil" patterns: aarch64, arm, rs6000, s390 >>>> use instructions that do not raise "inexact"; nvptx does not support >>>> floating-point exceptions. (This does mean the -f option in fact only >>>> affects one architecture, but I think it should still be a -f option; >>>> it's logically architecture-independent and is expected to be affected >>>> by future -std options, so is similar to e.g. -fexcess-precision=, >>>> which also does nothing on most architectures but is implied by -std >>>> options.) >>>> >>>> Bootstrapped with no regressions on x86_64-pc-linux-gnu. OK to >>>> commit? >>>> >>>> gcc: >>>> 2016-05-26 Joseph Myers <jos...@codesourcery.com> >>>> >>>> PR target/71276 >>>> PR target/71277 >>>> * common.opt (ffp-int-builtin-inexact): New option. >>>> * doc/invoke.texi (-fno-fp-int-builtin-inexact): Document. >>>> * doc/md.texi (floor@var{m}2, btrunc@var{m}2, round@var{m}2) >>>> (ceil@var{m}2): Document dependence on this option. >>>> * ipa-inline-transform.c (inline_call): Handle >>>> flag_fp_int_builtin_inexact. >>>> * ipa-inline.c (can_inline_edge_p): Likewise. >>>> * config/i386/i386.md (rintxf2): Do not test >>>> flag_unsafe_math_optimizations. >>>> (rint<mode>2_frndint): New define_insn. >>>> (rint<mode>2): Do not test flag_unsafe_math_optimizations for 387 >>>> or !flag_trapping_math for SSE. Just use gen_rint<mode>2_frndint >>>> for 387 instead of extending and truncating. >>>> (frndintxf2_<rounding>): Test flag_fp_int_builtin_inexact || >>>> !flag_trapping_math instead of flag_unsafe_math_optimizations. >>>> Change to frndint<mode>2_<rounding>. >>>> (frndintxf2_<rounding>_i387): Likewise. Change to >>>> frndint<mode>2_<rounding>_i387. >>>> (<rounding_insn>xf2): Likewise. >>>> (<rounding_insn><mode>2): Test flag_fp_int_builtin_inexact || >>>> !flag_trapping_math instead of flag_unsafe_math_optimizations for >>>> x87. Test TARGET_ROUND || !flag_trapping_math || >>>> flag_fp_int_builtin_inexact instead of !flag_trapping_math for >>>> SSE. Use ROUND_NO_EXC in constant operand of >>>> gen_sse4_1_round<mode>2. Just use gen_frndint<mode>2_<rounding> >>>> for 387 instead of extending and truncating. >>>> >>>> gcc/testsuite: >>>> 2016-05-26 Joseph Myers <jos...@codesourcery.com> >>>> >>>> PR target/71276 >>>> PR target/71277 >>>> * gcc.dg/torture/builtin-fp-int-inexact.c, >>>> gcc.target/i386/387-builtin-fp-int-inexact.c, >>>> gcc.target/i386/387-rint-inline-1.c, >>>> gcc.target/i386/387-rint-inline-2.c, >>>> gcc.target/i386/sse2-builtin-fp-int-inexact.c, >>>> gcc.target/i386/sse2-rint-inline-1.c, >>>> gcc.target/i386/sse2-rint-inline-2.c, >>>> gcc.target/i386/sse4_1-builtin-fp-int-inexact.c, >>>> gcc.target/i386/sse4_1-rint-inline.c: New tests. >>>>