Hello,

On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> >> For -ffast-math, where the sign of -0.0 is not handled and the
> >> spurious invalid floating-point exception for |argument| >= 2**63 is
> >> acceptable,
> >
> > This claim would need to be proven in the wild.
>
> I should have left the "when" after the "and" which I originally had
> written...
>
> > |argument| > 2**52 are already integer, and shouldn't generate a
> > spurious exception from the various to-int conversions, not even in
> > fast-math mode for some relevant set of applications (at least
> > SPECcpu).
> >
> > Btw, have you made speed measurements with your improvements?
>
> No.
>
> > The size improvements are obvious, but speed changes can be fairly
> > unintuitive, e.g. there were old K8 CPUs where the memory loads for
> > constants are actually faster than the equivalent sequence of shifting
> > and masking for the >= compares. That's an irrelevant CPU now, but it
> > shows that intuition about speed consequences can be wrong.
>
> I know. I also know of CPUs that can't load a 16-byte wide XMM register
> in one go, but had to split the load into 2 8-byte loads.
>
> If the constant happens to be present in L1 cache, it MAY load as fast
> as an immediate.
> BUT: on current CPUs, the code GCC generates
>
>         movsd   .LC1(%rip), %xmm2
>         movsd   .LC0(%rip), %xmm4
>         movapd  %xmm0, %xmm3
>         movapd  %xmm0, %xmm1
>         andpd   %xmm2, %xmm3
>         ucomisd %xmm3, %xmm4
>         jbe     38 <_trunc+0x38>
>
> needs
> - 4 cycles if the movsd are executed in parallel and the movapd are
>   handled by the register renamer,
> - 5 cycles if the movsd and the movapd are executed in parallel,
> - 7 cycles else,
> plus an unknown number of cycles if the constants are not in L1.
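[For reference, the quoted sequence's test can be sketched in C, on the
assumption (suggested by the trunc context) that .LC0 holds 2**52 and
.LC1 the sign-clearing mask; trunc_sketch is an illustrative name, not
GCC's actual expansion:]

```c
#include <math.h>

/* Sketch of the test behind the quoted sequence: clear the sign bit
   (andpd with the assumed .LC1 mask), compare |x| against 2**52
   (assumed .LC0, via ucomisd), and skip the int round-trip when |x|
   is already integral. */
double trunc_sketch(double x)
{
    const double two52 = 4503599627370496.0; /* 2**52, assumed .LC0 */
    double ax = fabs(x);                     /* andpd  %xmm2, %xmm3  */
    if (!(ax < two52))                       /* ucomisd + jbe (NaN too) */
        return x;                            /* already integral     */
    double t = (double)(long long)x;         /* cvttsd2si / cvtsi2sd */
    return copysign(t, x);                   /* keep the sign of -0.x */
}
```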
You also need to consider the case that the to-int converters are called
in a loop (which ultimately are the only interesting cases for
performance), where it's possible to load the constants before the loop
and keep them in registers (at the expense of two registers of pressure,
of course), effectively removing the loads from cost considerations.
It's all tough choices, which is why stuff needs to be measured in some
contexts :-)  (I do like your sequences, btw; it's just not 100% clearcut
that they are always a speed improvement.)


Ciao,
Michael.

> The proposed
>
>         movq  rax, xmm0
>         add   rax, rax
>         shr   rax, 53
>         cmp   eax, 53+1023
>         jae   return
>
> needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
> register renamer).
>
> Stefan
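[For readers following along, the exponent test in the quoted sequence
can be sketched in C; exponent_says_integral is an illustrative name,
and the constant 53+1023 simply mirrors the quoted compare:]

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the quoted GPR test: move the double's bits to an integer
   register (movq), shift out the sign bit (add rax, rax), keep only
   the 11-bit biased exponent (shr rax, 53), and compare it against
   the quoted threshold (cmp eax, 53+1023; jae). */
static int exponent_says_integral(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits); /* movq rax, xmm0                 */
    bits += bits;                   /* add rax, rax: drop the sign bit */
    bits >>= 53;                    /* shr rax, 53: biased exponent    */
    return bits >= 53 + 1023;       /* cmp eax, 53+1023; jae return    */
}
```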