Michael Matz <m...@suse.de> wrote:
> Hello,
>
> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>
>> For -ffast-math, where the sign of -0.0 is not handled and the spurious
>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
>
> This claim would need to be proven in the wild.

I should have kept the "when" after the "and" which I had originally
written...

> |argument| > 2**52 are already integer, and shouldn't generate a
> spurious exception from the various to-int conversions, not even in
> fast-math mode for some relevant set of applications (at least SPECcpu).
>
> Btw, have you made speed measurements with your improvements?

No.

> The size improvements are obvious, but speed changes can be fairly
> unintuitive, e.g. there were old K8 CPUs where the memory loads for
> constants are actually faster than the equivalent sequence of shifting
> and masking for the >= compares.  That's an irrelevant CPU now, but it
> shows that intuition about speed consequences can be wrong.

I know. I also know of CPUs that can't load a 16-byte wide XMM register
in one go, but have to split the load into two 8-byte loads. If the
constant happens to be present in the L1 cache, it MAY load as fast as
an immediate.

BUT: on current CPUs, the code GCC generates

        movsd   .LC1(%rip), %xmm2
        movsd   .LC0(%rip), %xmm4
        movapd  %xmm0, %xmm3
        movapd  %xmm0, %xmm1
        andpd   %xmm2, %xmm3
        ucomisd %xmm3, %xmm4
        jbe     38 <_trunc+0x38>

needs

- 4 cycles if the movsd are executed in parallel and the movapd are
  handled by the register renamer,
- 5 cycles if the movsd and the movapd are executed in parallel,
- 7 cycles otherwise,

plus an unknown number of cycles if the constants are not in L1.

The proposed

        movq    rax, xmm0
        add     rax, rax
        shr     rax, 53
        cmp     eax, 53+1023
        jae     return

needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
register renamer).

Stefan
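
PS: for readers following along, here is a minimal C sketch of the
early-exit test the proposed sequence performs; the function name and
comments are mine, for illustration only, and not part of any patch:

    #include <stdint.h>
    #include <string.h>

    /* Returns nonzero when trunc() can return its argument unchanged:
       |x| >= 2**53 (no fractional bits left), NaN or Inf.  */
    static int
    needs_no_truncation (double x)
    {
      uint64_t bits;

      memcpy (&bits, &x, sizeof bits);  /* bit-cast, like "movq rax, xmm0" */
      bits += bits;                     /* shift out the sign bit          */
      return (bits >> 53) >= 53 + 1023; /* biased exponent >= 1076?        */
    }

A current compiler should turn the memcpy() into a plain XMM-to-GPR
move, so this mirrors the movq/add/shr/cmp sequence above.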