Re: [x86_64 PATCH] PR 90356: Use xor to load const_double 0.0 on SSE (always)

Uros Bizjak via Gcc-patches Thu, 17 Mar 2022 12:59:14 -0700

On Thu, Mar 17, 2022 at 8:50 PM Uros Bizjak <ubiz...@gmail.com> wrote:
>
> On Thu, Mar 17, 2022 at 8:41 PM Roger Sayle <ro...@nextmovesoftware.com> 
> wrote:
> >
> >
> > Implementations of the x87 floating point instruction set have always
> > had some pretty strange characteristics.  For example on the original
> > Intel Pentium the FLDPI instruction (to load 3.14159... into a register)
> > took 5 cycles, and the FLDZ instruction (to load 0.0) took 2 cycles,
> > when a regular FLD (load from memory) took just 1 cycle!?  Given that
> > back then memory latencies were much lower (relatively) than they are
> > today, these instructions were all but useless except when optimizing
> > for size (impressively FLDZ/FLDPI require only two bytes).
> >
> > Such was the world back in 2006 when Uros Bizjak first added support for
> > fldz https://gcc.gnu.org/pipermail/gcc-patches/2006-November/202589.html
> > and then shortly after sensibly disabled them for !optimize_size with
> > https://gcc.gnu.org/pipermail/gcc-patches/2006-November/204405.html
> > [which was very expertly reviewed and approved here:
> > https://gcc.gnu.org/pipermail/gcc-patches/2006-November/204487.html ]
> >
> > "And some things that should not have been forgotten were lost.
> > History became legend.  Legend became myth." -- Lord of the Rings
> >
> > Alas this vestigial logic still persists in the compiler today,
> > so for example on x86_64 for the following function:
> >
> > double foo(double x) { return x + 0.0; }
> >
> > generates with -O2
> >
> > foo:    addsd   .LC0(%rip), %xmm0
> >         ret
> > .LC0:   .long   0
> >         .long   0
> >
> > preferring to read the constant 0.0 from memory [the constant pool],
> > except when optimizing for size.  With -Os we get:
> >
> > foo:    xorps   %xmm1, %xmm1
> >         addsd   %xmm1, %xmm0
> >         ret
> >
> > Which is not only smaller (the two instructions require seven bytes vs.
> > eight for the original addsd from mem, even without considering the
> > constant pool) but is also faster on modern hardware.  The latter code
> > sequence is generated by both clang and msvc with -O2.  Indeed Agner
> > Fogg documents the set of floating point/SSE constants that it's
> > cheaper to materialize than to load from memory.
> >
> > This patch shuffles the conditions on the i386 backend's *movtf_internal,
> > *movdf_internal and *movsf_internal define_insns to untangle the newer
> > TARGET_SSE_MATH clauses from the historical standard_80387_constant_p
> > conditions.  Amongst the benefits of this are that it improves the code
> > generated for PR tree-optimization/90356 and resolves PR target/86722.
> > Many thanks to Hongtao whose approval of my PR 94680 "movq" patch
> > unblocked this one.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -check with no new failures.  Ok for mainline?
> >
> >
> > 2022-03-17  Roger Sayle  <ro...@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         PR target/86722
> >         PR tree-optimization/90356
> >         * config/i386/i386.md (*movtf_internal): Don't guard
> >         standard_sse_constant_p clause by optimize_function_for_size_p.
> >         (*movdf_internal): Likewise.
> >         (*movsf_internal): Likewise.
> >
> > gcc/testsuite/ChangeLog
> >         PR target/86722
> >         PR tree-optimization/90356
> >         * gcc.target/i386/pr86722.c: New test case.
> >         * gcc.target/i386/pr90356.c: New test case.
>
> OK, and based on your analysis, even obvious.


Maybe a little improvement for tests:

+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse" } */

You could add "-msse2 -mfpmath=sse" to dg-options, so it will also
compile for the ia32 target.

Uros.

Re: [x86_64 PATCH] PR 90356: Use xor to load const_double 0.0 on SSE (always)

Reply via email to