Hello,

On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> >> For -ffast-math, where the sign of -0.0 is not handled and the
> >> spurious invalid floating-point exception for |argument| >= 2**63 is
> >> acceptable,
> >
> > This claim would need to be proven in the wild.
>
> I should have left the "when" after the "and" which I originally had
> written...
>
> > |argument| > 2**52 are already integer, and shouldn't generate a
> > spurious exception from the various to-int conversions, not even in
> > fast-math mode for some relevant set of applications (at least
> > SPECcpu).
> >
> > Btw, have you made speed measurements with your improvements?
>
> No.
>
> > The size improvements are obvious, but speed changes can be fairly
> > unintuitive, e.g. there were old K8 CPUs where the memory loads for
> > constants are actually faster than the equivalent sequence of shifting
> > and masking for the >= compares. That's an irrelevant CPU now, but it
> > shows that intuition about speed consequences can be wrong.
>
> I know. I also know of CPUs that can't load a 16-byte wide XMM register
> in one go, but had to split the load into 2 8-byte loads.
>
> If the constant happens to be present in L1 cache, it MAY load as fast
> as an immediate.
> BUT: on current CPUs, the code GCC generates
>
>         movsd   .LC1(%rip), %xmm2
>         movsd   .LC0(%rip), %xmm4
>         movapd  %xmm0, %xmm3
>         movapd  %xmm0, %xmm1
>         andpd   %xmm2, %xmm3
>         ucomisd %xmm3, %xmm4
>         jbe     38 <_trunc+0x38>
>
> needs
> - 4 cycles if the movsd are executed in parallel and the movapd are
>   handled by the register renamer,
> - 5 cycles if the movsd and the movapd are executed in parallel,
> - 7 cycles else,
> plus an unknown number of cycles if the constants are not in L1.
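[For reference, the quoted sequence's test can be sketched in C, on the
assumption (suggested by the trunc context) that .LC0 holds 2**52 and
.LC1 the sign-clearing mask; trunc_sketch is an illustrative name, not
GCC's actual expansion:]

```c
#include <math.h>

/* Sketch of the test behind the quoted sequence: clear the sign bit
   (andpd with the assumed .LC1 mask), compare |x| against 2**52
   (assumed .LC0, via ucomisd), and skip the int round-trip when |x|
   is already integral. */
double trunc_sketch(double x)
{
    const double two52 = 4503599627370496.0; /* 2**52, assumed .LC0 */
    double ax = fabs(x);                     /* andpd  %xmm2, %xmm3  */
    if (!(ax < two52))                       /* ucomisd + jbe (NaN too) */
        return x;                            /* already integral     */
    double t = (double)(long long)x;         /* cvttsd2si / cvtsi2sd */
    return copysign(t, x);                   /* keep the sign of -0.x */
}
```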
You also need to consider the case that the to-int converters are called
in a loop (which ultimately are the only interesting cases for
performance), where it's possible to load the constants before the loop
and keep them in registers (at the expense of two registers of pressure,
of course), effectively removing the loads from cost considerations.
It's all tough choices, which is why stuff needs to be measured in some
contexts :-)  (I do like your sequences, btw; it's just not 100% clearcut
that they are always a speed improvement.)


Ciao,
Michael.

> The proposed
>
>         movq  rax, xmm0
>         add   rax, rax
>         shr   rax, 53
>         cmp   eax, 53+1023
>         jae   return
>
> needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
> register renamer).
>
> Stefan
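[For readers following along, the exponent test in the quoted sequence
can be sketched in C; exponent_says_integral is an illustrative name,
and the constant 53+1023 simply mirrors the quoted compare:]

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the quoted GPR test: move the double's bits to an integer
   register (movq), shift out the sign bit (add rax, rax), keep only
   the 11-bit biased exponent (shr rax, 53), and compare it against
   the quoted threshold (cmp eax, 53+1023; jae). */
static int exponent_says_integral(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits); /* movq rax, xmm0                 */
    bits += bits;                   /* add rax, rax: drop the sign bit */
    bits >>= 53;                    /* shr rax, 53: biased exponent    */
    return bits >= 53 + 1023;       /* cmp eax, 53+1023; jae return    */
}
```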