Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

Stefan Kanthak Fri, 06 Aug 2021 07:43:48 -0700

Gabriel Paubert <paub...@iram.es> wrote:


> On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert <paub...@iram.es> wrote:
>> 
>> > Hi,
>> > 
>> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:

[...]

>> >> The whole idea behind these implementations is to get rid of loading
>> >> floating-point constants to perform comparisons.
>> > 
>> > Indeed, but what I had in mind was something along the following lines:
>> > 
>> > movq rax,xmm0   # and copy rax to say rcx, if needed later
>> > shrq rax,52     # move sign and exponent to 12 LSBs 
>> > andl eax,0x7ff  # mask the sign
>> > cmpl eax,0x434  # value to be checked
>> > ja return       # exponent too large, we're done (what about NaNs?)
>> > cvttsd2si rax,xmm0 # safe after exponent check
>> > cvtsi2sd xmm0,rax  # conversion done
>> > 
>> > and a bit more to handle the corner cases (essentially preserve the
>> > sign to be correct between -1 and -0.0).
>> 
>> The sign of -0.0 is the only corner case and already handled in my code.
>> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
>> preserved, as in the code GCC generates as well as my code.
> 
> I don't know what the standard says about NaNs in this case, I seem to
> remember that arithmetic instructions typically produce QNaN when one of
> the inputs is a NaN, whether signaling or not. 

<https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
and its cousins as well as the C standard say

| If x is NaN, a NaN shall be returned.

That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.
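
To illustrate why NaNs pass through untouched, here is a minimal C sketch
of the two exponent extractions discussed in this thread. The helper names
(bits_of, exponent_shift_mask, exponent_add_shift, takes_early_exit) are
hypothetical, and the sketch assumes double is an IEEE 754 binary64 value.
NaNs and infinities carry the biased exponent 0x7ff, so they satisfy the
"exponent >= BIAS + 53" test and are returned unchanged, with no
arithmetic instruction that could quiet an SNaN.

```c
#include <stdint.h>
#include <string.h>

#define BIAS 1023u  /* exponent bias of IEEE 754 binary64 */

/* Hypothetical helper: reinterpret the bits of a double as an integer,
   which is what "movq rax, xmm0" does. */
static uint64_t bits_of(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* "shr rax,52; and eax,0x7ff": shift sign and exponent down,
   then mask off the sign bit. */
static unsigned exponent_shift_mask(double x)
{
    return (unsigned)(bits_of(x) >> 52) & 0x7ffu;
}

/* "add rax,rax; shr rax,53": doubling pushes the sign bit out of the
   register, so no mask is needed afterwards. */
static unsigned exponent_add_shift(double x)
{
    uint64_t u = bits_of(x);
    return (unsigned)((u + u) >> 53);
}

/* Nonzero when the argument is NaN, infinite, or |argument| >= 0x1.0p53,
   i.e. when the assembly sequence jumps straight to the return. */
static int takes_early_exit(double x)
{
    return exponent_add_shift(x) >= BIAS + 53u;
}
```

Both extractions yield the same 11-bit value for every input; the second
form merely drops the mask.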

>> > But the CPU can (speculatively) start the conversions early, so the
>> > dependency chain is rather short.
>> 
>> Correct.
>>  
>> > I don't know if it's faster than your new code,
>> 
>> It should be faster.
>> 
>> > I'm almost sure that it's shorter.
>> 
>> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
>> 5+4+5+5+2=21 bytes.
>> 
>> JFTR: better use "add rax,rax; shr rax,53" instead of
>>       "shr rax,52; and eax,0x7ff" and save 2 bytes.
> 
> Indeed, I don't have the exact size of instructions in my head,
> especially since I've not written x86 assembly since the mid 90s.
> 
> In any case, with your last improvement, the code is now down to a
> single 32 bit immediate constant. And I don't see how to eliminate it...
> 
>> 
>> Complete properly optimized code for __builtin_trunc is then as follows
>> (11 instructions, 44 bytes):
>> 
>> .code64
>> .intel_syntax
>> .equ    BIAS, 1023
>> .text
>>         movq    rax, xmm0    # rax = argument
>>         add     rax, rax
>>         shr     rax, 53      # rax = exponent of |argument|
>>         cmp     eax, BIAS + 53
>>         jae     .Lexit       # argument indefinite?
> 
> Maybe s/.Lexit/.L0/

Surely!

>>                              # |argument| >= 0x1.0p53?
>>         cvttsd2si rax, xmm0  # rax = trunc(argument)
>>         cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>>         psrlq   xmm0, 63
>>         psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>>         orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>> .L0:    ret
>> .end
>> 
> 
> This looks nice.

Let's see how to convince GCC to generate such code sequences...
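
As a starting point for such experiments, the assembly sequence quoted
above can be mirrored in C roughly as follows. This is only a sketch under
stated assumptions: trunc_sketch is a hypothetical name, double is assumed
to be IEEE 754 binary64, and nothing here claims that GCC actually turns
this source into the 11-instruction listing.

```c
#include <stdint.h>
#include <string.h>

#define BIAS 1023u  /* exponent bias of IEEE 754 binary64 */

/* Hypothetical C rendering of the assembly sequence; not GCC's output. */
double trunc_sketch(double x)
{
    uint64_t u, v;
    double t;

    memcpy(&u, &x, sizeof u);           /* movq rax, xmm0 */

    /* add rax, rax; shr rax, 53: biased exponent of |x| */
    if ((unsigned)((u + u) >> 53) >= BIAS + 53u)
        return x;                       /* NaN, infinity or |x| >= 0x1.0p53 */

    /* cvttsd2si + cvtsi2sd: safe because |x| < 0x1.0p53 fits in int64_t */
    t = (double)(int64_t)x;

    /* psrlq/psllq/orpd: copy the sign bit of x into the result,
       so that arguments in (-1.0, -0.0] yield -0.0 */
    memcpy(&v, &t, sizeof v);
    v |= u & UINT64_C(0x8000000000000000);
    memcpy(&t, &v, sizeof t);
    return t;
}
```

For example, trunc_sketch(-0.5) gives -0.0 and trunc_sketch(2.75) gives
2.0, while a NaN argument comes back with its bits unchanged.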

Stefan
