Michael Matz <m...@suse.de> wrote:
> Hello,
>
> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>
>> For -ffast-math, where the sign of -0.0 is not handled and the spurious
>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
>
> This claim would need to be proven in the wild.

I should have kept the "when" after the "and" which I had originally
written...

> |argument| > 2**52 are already integer, and shouldn't generate a
> spurious exception from the various to-int conversions, not even in
> fast-math mode for some relevant set of applications (at least SPECcpu).
>
> Btw, have you made speed measurements with your improvements?

No.

> The size improvements are obvious, but speed changes can be fairly
> unintuitive, e.g. there were old K8 CPUs where the memory loads for
> constants are actually faster than the equivalent sequence of shifting
> and masking for the >= compares.  That's an irrelevant CPU now, but it
> shows that intuition about speed consequences can be wrong.

I know. I also know of CPUs that can't load a 16-byte wide XMM register
in one go, but have to split the load into two 8-byte loads. If the
constant happens to be present in the L1 cache, it MAY load as fast as
an immediate.

BUT: on current CPUs, the code GCC generates

        movsd   .LC1(%rip), %xmm2
        movsd   .LC0(%rip), %xmm4
        movapd  %xmm0, %xmm3
        movapd  %xmm0, %xmm1
        andpd   %xmm2, %xmm3
        ucomisd %xmm3, %xmm4
        jbe     38 <_trunc+0x38>

needs

- 4 cycles if the movsd are executed in parallel and the movapd are
  handled by the register renamer,
- 5 cycles if the movsd and the movapd are executed in parallel,
- 7 cycles otherwise,

plus an unknown number of cycles if the constants are not in L1.

The proposed

        movq    rax, xmm0
        add     rax, rax
        shr     rax, 53
        cmp     eax, 53+1023
        jae     return

needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
register renamer).

Stefan
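
PS: for readers following along, here is a minimal C sketch of the
early-exit test the proposed sequence performs; the function name and
comments are mine, for illustration only, and not part of any patch:

    #include <stdint.h>
    #include <string.h>

    /* Returns nonzero when trunc() can return its argument unchanged:
       |x| >= 2**53 (no fractional bits left), NaN or Inf.  */
    static int
    needs_no_truncation (double x)
    {
      uint64_t bits;

      memcpy (&bits, &x, sizeof bits);  /* bit-cast, like "movq rax, xmm0" */
      bits += bits;                     /* shift out the sign bit          */
      return (bits >> 53) >= 53 + 1023; /* biased exponent >= 1076?        */
    }

A current compiler should turn the memcpy() into a plain XMM-to-GPR
move, so this mirrors the movq/add/shr/cmp sequence above.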