On Thu, Mar 17, 2022 at 8:41 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
> Implementations of the x87 floating point instruction set have always
> had some pretty strange characteristics. For example, on the original
> Intel Pentium the FLDPI instruction (to load 3.14159... into a register)
> took 5 cycles, and the FLDZ instruction (to load 0.0) took 2 cycles,
> while a regular FLD (load from memory) took just 1 cycle!? Given that
> back then memory latencies were (relatively) much lower than they are
> today, these instructions were all but useless except when optimizing
> for size (impressively, FLDZ/FLDPI require only two bytes).
>
> Such was the world back in 2006, when Uros Bizjak first added support for
> fldz in https://gcc.gnu.org/pipermail/gcc-patches/2006-November/202589.html
> and then shortly after sensibly disabled them for !optimize_size with
> https://gcc.gnu.org/pipermail/gcc-patches/2006-November/204405.html
> [which was very expertly reviewed and approved here:
> https://gcc.gnu.org/pipermail/gcc-patches/2006-November/204487.html ]
>
> "And some things that should not have been forgotten were lost.
> History became legend. Legend became myth." -- Lord of the Rings
>
> Alas, this vestigial logic still persists in the compiler today, so for
> example on x86_64 the following function:
>
> double foo(double x) { return x + 0.0; }
>
> generates with -O2:
>
> foo:  addsd   .LC0(%rip), %xmm0
>       ret
> .LC0: .long   0
>       .long   0
>
> preferring to read the constant 0.0 from memory [the constant pool],
> except when optimizing for size. With -Os we get:
>
> foo:  xorps   %xmm1, %xmm1
>       addsd   %xmm1, %xmm0
>       ret
>
> This is not only smaller (the two instructions require seven bytes vs.
> eight for the original addsd from memory, even without considering the
> constant pool) but is also faster on modern hardware. The latter code
> sequence is generated by both clang and msvc with -O2. Indeed, Agner
> Fog documents the set of floating point/SSE constants that are cheaper
> to materialize than to load from memory.
>
> This patch shuffles the conditions on the i386 backend's *movtf_internal,
> *movdf_internal and *movsf_internal define_insns to untangle the newer
> TARGET_SSE_MATH clauses from the historical standard_80387_constant_p
> conditions. Among the benefits, this improves the code generated for
> PR tree-optimization/90356 and resolves PR target/86722. Many thanks to
> Hongtao, whose approval of my PR 94680 "movq" patch unblocked this one.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check with no new failures. Ok for mainline?
>
>
> 2022-03-17  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR target/86722
>         PR tree-optimization/90356
>         * config/i386/i386.md (*movtf_internal): Don't guard
>         standard_sse_constant_p clause by optimize_function_for_size_p.
>         (*movdf_internal): Likewise.
>         (*movsf_internal): Likewise.
>
> gcc/testsuite/ChangeLog
>         PR target/86722
>         PR tree-optimization/90356
>         * gcc.target/i386/pr86722.c: New test case.
>         * gcc.target/i386/pr90356.c: New test case.
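The codegen difference is easy to check by hand. For illustration, a minimal
compile-only test in the usual DejaGnu style might look like the sketch below;
this is not the contents of the actual gcc.target/i386/pr86722.c or pr90356.c,
and the expected mnemonic (xorps) and the scan patterns are assumptions based
on the -Os output quoted above:

  /* Sketch only -- not the actual gcc.target/i386/pr86722.c.  */
  /* { dg-do compile } */
  /* { dg-options "-O2" } */

  double foo(double x) { return x + 0.0; }

  /* With the patch, 0.0 should be materialized with a register xor
     rather than loaded from the constant pool.  */
  /* { dg-final { scan-assembler "xorps" } } */
  /* { dg-final { scan-assembler-not "LC0" } } */

The same thing can be checked without the harness by compiling the function
with -O2 -S and looking for an xor of the scratch register in place of a
constant-pool load.
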
OK, and based on your analysis, even obvious.

Thanks,
Uros.