Re: Add option for whether ceil etc. can raise "inexact", adjust x86 conditions

Richard Biener Tue, 15 Aug 2017 08:03:19 -0700

On Tue, Aug 15, 2017 at 4:43 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Tue, Aug 15, 2017 at 4:21 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Tue, Aug 15, 2017 at 3:52 PM, Martin Jambor <mjam...@suse.cz> wrote:
>>> Hi Joseph,
>>>
>>> On Thu, May 26, 2016 at 09:02:02PM +0000, Joseph Myers wrote:
>>>> On Thu, 26 May 2016, Jan Hubicka wrote:
>>>>
>>>> > > > +ffp-int-builtin-inexact
>>>> > > > +Common Report Var(flag_fp_int_builtin_inexact) Optimization
>>>> > > > +Allow built-in functions ceil, floor, round, trunc to raise 
>>>> > > > \"inexact\" exceptions.
>>>> >
>>>> > When adding new codegen option which affects the correctness, it is also
>>>> > necessary to update can_inline_edge_p and inline_call.
>>>>
>>>> This patch version adds handling for the new option in those places.
>>>> Other changes: the default for the option is corrected so that
>>>> -ffp-int-builtin-inexact really is in effect by default as intended;
>>>> md.texi documentation for the patterns in question is updated to
>>>> describe how they are affected by this option.
>>>>
>>>>
>>>> Add option for whether ceil etc. can raise "inexact", adjust x86 
>>>> conditions.
>>>>
>>>> In ISO C99/C11, the ceil, floor, round and trunc functions may or may
>>>> not raise the "inexact" exception for noninteger arguments.  Under TS
>>>> 18661-1:2014, the C bindings for IEEE 754-2008, these functions are
>>>> prohibited from raising "inexact", in line with the general rule that
>>>> "inexact" is only when the mathematical infinite precision result of a
>>>> function differs from the result after rounding to the target type.
>>>>
>>>> GCC has no option to select TS 18661 requirements for not raising
>>>> "inexact" when expanding built-in versions of these functions inline.
>>>> Furthermore, even given such requirements, the conditions on the x86
>>>> insn patterns for these functions are unnecessarily restrictive.  I'd
>>>> like to make the out-of-line glibc versions follow the TS 18661
>>>> requirements; in the cases where this slows them down (the cases using
>>>> x87 floating point), that makes it more important for inline versions
>>>> to be used when the user does not care about "inexact".
>>>
>>> Unfortunately, I have found out that this commit regresses run-time of
>>> 538.imagick_r by about 5% on an AMD Ryzen machine and by 9% on a
>>> slightly older Intel machine when compiled with just -O2 (so with
>>> generic tuning).
>>>
>>> The problem is that ImageMagick spends a lot time calculating ceil and
>>> floor and even with with generic tuning their library implementations
>>> can use the ifunc mechanism to execute an efficient SSE 4.1
>>> implementation on the processors that have it, whereas the inline
>>> expansion cannot do so and is much bigger and much much slower.  To
>>> give you an idea, this is the profile before and after the change:
>>>
>>>   | Symbol                           |  237073 | % of runtime |  237074 | % 
>>> of runtime | sample delta | % sample delta |
>>>   
>>> |----------------------------------+---------+--------------+---------+--------------+--------------+----------------|
>>>   | MorphologyApply                  | 1058932 |       52.88% | 1043194 |   
>>>     46.65% |       -15738 |          98.51 |
>>>   | MeanShiftImage                   |  508088 |       25.50% |  833378 |   
>>>     37.43% |       325290 |         164.02 |
>>>   | GetVirtualPixelsFromNexus        |  173354 |        8.70% |  168298 |   
>>>      7.56% |        -5056 |          97.08 |
>>>   | SetPixelCacheNexusPixels.isra.10 |  114101 |        5.72% |  112790 |   
>>>      5.07% |        -1311 |          98.85 |
>>>   | __ceil_sse41                     |   21404 |        1.07% |       0 |   
>>>          0 |       -21404 |           0.00 |
>>>   | __floor_sse41                    |   19179 |        0.96% |       0 |   
>>>          0 |       -19179 |           0.00 |
>>>
>>> And all of the sample count increases in MeanShiftImage can be tracked
>>> down to the line in the cource calculating
>>>
>>>   if ((x-floor(x)) < (ceil(x)-x))
>>>
>>> I am not sure what to do about this, to me it seems that the
>>> -ffp-int-builtin-inexact simply has a wrong default value, at least
>>> for x86_64, as it was added in order not to slow code down but does
>>> exactly that (all of the slowdown of course disappears when
>>> -fno-fp-int-builtin-inexact is used).
>>>
>>> Or is the situation somehow more complex?
>>
>> I suppose these days the big inline sequences for the rounding functions
>> are no longer profitable for generic tuning (assuming 'generic' nowadays
>> includes SSE41 support).  Esp. floor/ceil includes jumpy compensation
>> code.
>
> Note other options are to inline if (__builtin_cpu_supports
> ("sse4.1")) ... else ...
> or to emit a call to a (local? comdat?) __gcc_floor ifunc dispatcher and
> emit the ifunc math library ourselves (like we'd do with 
> attribute(target(""))).
>
> Not sure if we really can assume glibc is intelligent enough -- does it have
> non-SSE4.1 implementations for ceil/floor?  Back in time I implemented
> these SSE2 expansions it used the generic C code which was awfully slow...


Still uses sysdeps/ieee754/dbl-64/wordsize-64/s_ceil.c which is quite slow
compared to using our inline sequence.

So I'd try the "easy" way of expanding if (__builtin_cpu_supports ("sse4.1"))
as the sse4.1 sequence is just a single instruction.  The interesting part
of the story will be to make sure we can emit that even if ! TARGET_ROUND ...

Uros, any idea how to accomplish this?  Or is the idea of a "local" ifunc
better?  Note the ABI boundary will be expensive but I guess the conditional
sequence as well (and it will disturb RA even if predicted to have SSE 4.1).

Richard.

> Richard.
>
>> Note that (x - floor(x)) < (ceil(x) - x) looks like some clever 
>> simplification
>> might speed it up.  Not that I can come up with sth off my head...
>>
>> Richard.
>>
>>> Martin
>>>
>>>
>>>>
>>>> This patch fixes these issues.  A new option
>>>> -fno-fp-int-builtin-inexact is added to request TS 18661 rules for
>>>> these functions; the default -ffp-int-builtin-inexact reflects that
>>>> such exceptions are allowed by C99 and C11.  (The intention is that if
>>>> C2x incorporates TS 18661-1, then the default would change in C2x
>>>> mode.)
>>>>
>>>> The x86 built-ins for rint (x87, SSE2 and SSE4.1) are made
>>>> unconditionally available (no longer depending on
>>>> -funsafe-math-optimizations or -fno-trapping-math); "inexact" is
>>>> correct for noninteger arguments to rint.  For floor, ceil and trunc,
>>>> the x87 and SSE2 built-ins are OK if -ffp-int-builtin-inexact or
>>>> -fno-trapping-math (they may raise "inexact" for noninteger
>>>> arguments); the SSE4.1 built-ins are made to use ROUND_NO_EXC so that
>>>> they do not raise "inexact" and so are OK unconditionally.
>>>>
>>>> Now, while there was no semantic reason for depending on
>>>> -funsafe-math-optimizations, the insn patterns had such a dependence
>>>> because of use of gen_truncxf<mode>2_i387_noop to truncate back to
>>>> SFmode or DFmode after using frndint in XFmode.  In this case a no-op
>>>> truncation is safe because rounding to integer always produces an
>>>> exactly representable value (the same reason why IEEE semantics say it
>>>> shouldn't produce "inexact") - but of course that insn pattern isn't
>>>> safe because it would also match cases where the truncation is not in
>>>> fact a no-op.  To allow frndint to be used for SFmode and DFmode
>>>> without that unsafe pattern, the relevant frndint patterns are
>>>> extended to SFmode and DFmode or new SFmode and DFmode patterns added,
>>>> so that the frndint operation can be represented in RTL as an
>>>> operation acting directly on SFmode or DFmode without the extension
>>>> and the problematic truncation.
>>>>
>>>> A generic test of the new option is added, as well as x86-specific
>>>> tests, both execution tests including the generic test with different
>>>> x86 options and scan-assembler tests verifying that functions that
>>>> should be inlined with different options are indeed inlined.
>>>>
>>>> I think other architectures are OK for TS 18661-1 semantics already.
>>>> Considering those defining "ceil" patterns: aarch64, arm, rs6000, s390
>>>> use instructions that do not raise "inexact"; nvptx does not support
>>>> floating-point exceptions.  (This does mean the -f option in fact only
>>>> affects one architecture, but I think it should still be a -f option;
>>>> it's logically architecture-independent and is expected to be affected
>>>> by future -std options, so is similar to e.g. -fexcess-precision=,
>>>> which also does nothing on most architectures but is implied by -std
>>>> options.)
>>>>
>>>> Bootstrapped with no regressions on x86_64-pc-linux-gnu.  OK to
>>>> commit?
>>>>
>>>> gcc:
>>>> 2016-05-26  Joseph Myers  <jos...@codesourcery.com>
>>>>
>>>>       PR target/71276
>>>>       PR target/71277
>>>>       * common.opt (ffp-int-builtin-inexact): New option.
>>>>       * doc/invoke.texi (-fno-fp-int-builtin-inexact): Document.
>>>>       * doc/md.texi (floor@var{m}2, btrunc@var{m}2, round@var{m}2)
>>>>       (ceil@var{m}2): Document dependence on this option.
>>>>       * ipa-inline-transform.c (inline_call): Handle
>>>>       flag_fp_int_builtin_inexact.
>>>>       * ipa-inline.c (can_inline_edge_p): Likewise.
>>>>       * config/i386/i386.md (rintxf2): Do not test
>>>>       flag_unsafe_math_optimizations.
>>>>       (rint<mode>2_frndint): New define_insn.
>>>>       (rint<mode>2): Do not test flag_unsafe_math_optimizations for 387
>>>>       or !flag_trapping_math for SSE.  Just use gen_rint<mode>2_frndint
>>>>       for 387 instead of extending and truncating.
>>>>       (frndintxf2_<rounding>): Test flag_fp_int_builtin_inexact ||
>>>>       !flag_trapping_math instead of flag_unsafe_math_optimizations.
>>>>       Change to frndint<mode>2_<rounding>.
>>>>       (frndintxf2_<rounding>_i387): Likewise.  Change to
>>>>       frndint<mode>2_<rounding>_i387.
>>>>       (<rounding_insn>xf2): Likewise.
>>>>       (<rounding_insn><mode>2): Test flag_fp_int_builtin_inexact ||
>>>>       !flag_trapping_math instead of flag_unsafe_math_optimizations for
>>>>       x87.  Test TARGET_ROUND || !flag_trapping_math ||
>>>>       flag_fp_int_builtin_inexact instead of !flag_trapping_math for
>>>>       SSE.  Use ROUND_NO_EXC in constant operand of
>>>>       gen_sse4_1_round<mode>2.  Just use gen_frndint<mode>2_<rounding>
>>>>       for 387 instead of extending and truncating.
>>>>
>>>> gcc/testsuite:
>>>> 2016-05-26  Joseph Myers  <jos...@codesourcery.com>
>>>>
>>>>       PR target/71276
>>>>       PR target/71277
>>>>       * gcc.dg/torture/builtin-fp-int-inexact.c,
>>>>       gcc.target/i386/387-builtin-fp-int-inexact.c,
>>>>       gcc.target/i386/387-rint-inline-1.c,
>>>>       gcc.target/i386/387-rint-inline-2.c,
>>>>       gcc.target/i386/sse2-builtin-fp-int-inexact.c,
>>>>       gcc.target/i386/sse2-rint-inline-1.c,
>>>>       gcc.target/i386/sse2-rint-inline-2.c,
>>>>       gcc.target/i386/sse4_1-builtin-fp-int-inexact.c,
>>>>       gcc.target/i386/sse4_1-rint-inline.c: New tests.
>>>>

Re: Add option for whether ceil etc. can raise "inexact", adjust x86 conditions

Reply via email to