On Thu, Mar 17, 2022 at 8:41 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> Implementations of the x87 floating point instruction set have always
> had some pretty strange characteristics.  For example on the original
> Intel Pentium the FLDPI instruction (to load 3.14159... into a register)
> took 5 cycles, and the FLDZ instruction (to load 0.0) took 2 cycles,
> when a regular FLD (load from memory) took just 1 cycle!?  Given that
> back then memory latencies were much lower (relatively) than they are
> today, these instructions were all but useless except when optimizing
> for size (impressively FLDZ/FLDPI require only two bytes).
>
> Such was the world back in 2006 when Uros Bizjak first added support for
> fldz https://gcc.gnu.org/pipermail/gcc-patches/2006-November/202589.html
> and then shortly after sensibly disabled them for !optimize_size with
> https://gcc.gnu.org/pipermail/gcc-patches/2006-November/204405.html
> [which was very expertly reviewed and approved here:
> https://gcc.gnu.org/pipermail/gcc-patches/2006-November/204487.html ]
>
> "And some things that should not have been forgotten were lost.
> History became legend.  Legend became myth." -- Lord of the Rings
>
> Alas this vestigial logic still persists in the compiler today,
> so for example on x86_64 for the following function:
>
> double foo(double x) { return x + 0.0; }
>
> generates with -O2
>
> foo:    addsd   .LC0(%rip), %xmm0
>         ret
> .LC0:   .long   0
>         .long   0
>
> preferring to read the constant 0.0 from memory [the constant pool],
> except when optimizing for size.  With -Os we get:
>
> foo:    xorps   %xmm1, %xmm1
>         addsd   %xmm1, %xmm0
>         ret
>
> This is not only smaller (the two instructions require seven bytes vs.
> eight for the original addsd from memory, even before counting the
> constant pool entry) but also faster on modern hardware.  The latter
> code sequence is what both clang and msvc generate at -O2.  Indeed,
> Agner Fog documents the set of floating-point/SSE constants that are
> cheaper to materialize in a register than to load from memory.
>
> This patch shuffles the conditions on the i386 backend's *movtf_internal,
> *movdf_internal and *movsf_internal define_insns to untangle the newer
> TARGET_SSE_MATH clauses from the historical standard_80387_constant_p
> conditions.  Amongst the benefits of this are that it improves the code
> generated for PR tree-optimization/90356 and resolves PR target/86722.
> Many thanks to Hongtao whose approval of my PR 94680 "movq" patch
> unblocked this one.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check with no new failures.  Ok for mainline?
>
>
> 2022-03-17  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR target/86722
>         PR tree-optimization/90356
>         * config/i386/i386.md (*movtf_internal): Don't guard
>         standard_sse_constant_p clause by optimize_function_for_size_p.
>         (*movdf_internal): Likewise.
>         (*movsf_internal): Likewise.
>
> gcc/testsuite/ChangeLog
>         PR target/86722
>         PR tree-optimization/90356
>         * gcc.target/i386/pr86722.c: New test case.
>         * gcc.target/i386/pr90356.c: New test case.

OK, and based on your analysis, even obvious.

Thanks,
Uros.
