Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

Richard Biener via Gcc Fri, 06 Aug 2021 06:00:15 -0700

On Fri, Aug 6, 2021 at 2:47 PM Stefan Kanthak <stefan.kant...@nexgo.de> wrote:
>
> Gabriel Paubert <paub...@iram.es> wrote:
>
> > Hi,
> >
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert <paub...@iram.es> wrote:
> >>
> >>
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>
> >> >>                               .intel_syntax
> >> >>                               .text
> >> >>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
> >> >>    5:   48 f7 d8              neg     rax
> >> >>                         #     jz      .L0          # argument zero?
> >> >>    8:   70 16                 jo      .L0          # argument indefinite?
> >> >>                                                    # argument overflows 64-bit integer?
> >> >>    a:   48 f7 d8              neg     rax
> >> >>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >> >>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
> >> >>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >> >>   20:   c3              .L0:  ret
> >> >>                               .end
> >> >
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >>
> >> Right, I overlooked this fault. Thanks for pointing out.
> >>
> >> > So using your code may require some option (-ffast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >>
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisons.
> >
> > Indeed, but what I had in mind was something along the following lines:
> >
> > movq rax,xmm0   # and copy rax to say rcx, if needed later
> > shrq rax,52     # move sign and exponent to 12 LSBs
> > andl eax,0x7ff  # mask the sign
> > cmpl eax,0x434  # value to be checked
> > ja return       # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax  # conversion done
> >
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
>
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, as in the code GCC generates as well as my code.
>
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
>
> Correct.
>
> > I don't know if it's faster than your new code,
>
> It should be faster.
>
> > I'm almost sure that it's shorter.
>
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
> 5+4+5+5+2=21 bytes.
>
> JFTR: better use "add rax,rax; shr rax,53" instead of
>       "shr rax,52; and eax,0x7ff" and save 2 bytes.
>
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
>
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
>         movq    rax, xmm0    # rax = argument
>         add     rax, rax
>         shr     rax, 53      # rax = exponent of |argument|
>         cmp     eax, BIAS + 53
>         jae     .Lexit       # argument indefinite?
>                              # |argument| >= 0x1.0p53?
>         cvttsd2si rax, xmm0  # rax = trunc(argument)
>         cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>         psrlq   xmm0, 63
>         psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>         orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> .Lexit: ret
> .end
>
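
For readers who prefer C, here is a rough sketch of what the sequence
above computes. The name trunc_via_int is made up for illustration, and
the sketch assumes IEEE 754 binary64 doubles and a 64-bit integer type;
it mirrors the steps rather than claiming to be what GCC would emit.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: truncate toward zero via a 64-bit integer round trip. */
static double trunc_via_int(double x)
{
    uint64_t bits, tbits;
    double t;

    memcpy(&bits, &x, sizeof bits);               /* movq rax, xmm0 */

    /* add rax, rax; shr rax, 53: biased exponent without the sign bit.
       An exponent of at least BIAS + 53 means |x| >= 2**53, Inf or NaN;
       such values are returned unchanged. */
    if (((bits << 1) >> 53) >= (uint64_t)(1023 + 53))
        return x;

    t = (double)(int64_t)x;                       /* cvttsd2si + cvtsi2sd */

    memcpy(&tbits, &t, sizeof tbits);
    tbits |= (bits >> 63) << 63;                  /* psrlq/psllq 63 + orpd */
    memcpy(&t, &tbits, sizeof t);
    return t;                                     /* sign of -0.0 preserved */
}
```

The exponent test runs before the conversion, so the conversion is only
reached for values that fit in a signed 64-bit integer.
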
> @Richard Biener (et al.):
>
> 1. Is a primitive for "floating-point > 2**x", which generates such
>    an "integer" code sequence, already available, at least for
>    float/binary32 and double/binary64?


Not that I know of, but it should be possible to craft that.
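
As a minimal sketch of what such a primitive would compute, assuming
IEEE 754 binary64: the helper name exponent_at_least is hypothetical
(it is not an existing GCC primitive), and note that the exponent test
answers "|x| >= 2**n" rather than a strict ">".

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero when |x| >= 2**n, or when x is Inf or NaN.
   Uses only integer operations on the bit pattern of a binary64 value.
   Intended for -1022 <= n <= 1023. */
static int exponent_at_least(double x, int n)
{
    uint64_t bits;

    memcpy(&bits, &x, sizeof bits);       /* movq rax, xmm0 */
    bits += bits;                         /* add rax, rax: shift out the sign */
    bits >>= 53;                          /* shr rax, 53: biased exponent */
    return bits >= (uint64_t)(1023 + n);  /* cmp eax, BIAS + n */
}
```

For float/binary32 the same idea would use a 32-bit pattern, a shift
by 24 and a bias of 127.
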

> 2. the procedural code generator for __builtin_trunc() etc.  uses
>    __builtin_fabs() and __builtin_copysign() as building blocks.
>    These would need to (and of course should) be modified to generate
>    psllq/psrlq pairs instead of andpd/andnpd referencing a memory
>    location with either -0.0 or ~(-0.0).
>
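
To illustrate the shift pairs on the bit pattern, here is a small C
sketch, assuming IEEE 754 binary64. All four function names are
invented for this example; they are not GCC internals.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for moving between a double and its bit pattern. */
static uint64_t to_bits(double x)
{
    uint64_t b;
    memcpy(&b, &x, sizeof b);
    return b;
}

static double from_bits(uint64_t b)
{
    double x;
    memcpy(&x, &b, sizeof x);
    return x;
}

/* psllq 1; psrlq 1: clear the sign bit without a mask constant. */
static double fabs_by_shift(double x)
{
    return from_bits((to_bits(x) << 1) >> 1);
}

/* psrlq 63; psllq 63 isolates the sign of sgn; OR it onto |mag|. */
static double copysign_by_shift(double mag, double sgn)
{
    uint64_t m = (to_bits(mag) << 1) >> 1;
    uint64_t s = (to_bits(sgn) >> 63) << 63;
    return from_bits(m | s);
}
```

Neither function needs a constant loaded from memory, which is the
point of the proposal.
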
> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
>
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
>         cvttsd2si rax, xmm0  # rax = trunc(argument)
>         jo      .Lexit       # argument indefinite?
>                              # |argument| > 0x1.0p63?
>         cvtsi2sd xmm0, rax   # xmm0 = trunc(argument)
> .Lexit: ret
> .end
>
> [...]
>
> >> Right, the conversions dominate both the original and the code I posted.
> >> It's easy to get rid of them, with still slightly shorter and faster
> >> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> >> 57 + 32 = 89 bytes):
> >>
> >>                                         .code64
> >>                                         .intel_syntax noprefix
> >>                                         .text
> >>    0:   48 b8 00 00 00 00 00 00 30 43   mov     rax, 0x4330000000000000
> >>    a:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = 0x1.0p52 = 4503599627370496.0
> >>    f:   48 b8 00 00 00 00 00 00 f0 3f   mov     rax, 0x3FF0000000000000
> >>   19:   f2 0f 10 c8                     movsd   xmm1, xmm0       # xmm1 = argument
> >>   1d:   66 0f 73 f0 01                  psllq   xmm0, 1
> >>   22:   66 0f 73 d0 01                  psrlq   xmm0, 1          # xmm0 = |argument|
> >>   27:   66 0f 73 d1 3f                  psrlq   xmm1, 63
> >>   2c:   66 0f 73 f1 3f                  psllq   xmm1, 63         # xmm1 = (argument & -0.0) ? -0.0 : +0.0
> >>   31:   f2 0f 10 d8                     movsd   xmm3, xmm0
> >>   35:   f2 0f 58 c2                     addsd   xmm0, xmm2       # xmm0 = |argument| + 0x1.0p52
> >>   39:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = |argument| + 0x1.0p52 - 0x1.0p52
> >>                                                                  #      = rint(|argument|)
> >>   3d:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = -0x1.0p0 = -1.0
> >
> > Huh? I see +1.0, -1 would be 0xBFF0000000000000.
>
> Spurious error in the comment.
> I modified code which uses -1.0 and performs (a commutative)
> "addsd xmm2, xmm2" instead of "subsd xmm0, xmm2" to save a "movsd"
> instruction.
>
> >>   42:   f2 0f c2 d8 01                  cmpltsd xmm3, xmm0       # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
> >>   47:   66 0f 54 d3                     andpd   xmm2, xmm3       # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >>   4b:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = rint(|argument|)
> >>                                                                  #      - (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >>                                                                  #      = trunc(|argument|)
> >>   4f:   66 0f 56 c1                     orpd    xmm0, xmm1       # xmm0 = trunc(argument)
> >>   53:   c3                              ret
>
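
The add/subtract trick in the listing above can be sketched in C as
follows. This assumes IEEE 754 binary64, the default round-to-nearest
mode, and no -ffast-math. The name trunc_via_add_sub is made up, and
the early return for |x| >= 2**52 is an addition of this sketch that
the listing does not contain.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: truncate toward zero without integer conversions. */
static double trunc_via_add_sub(double x)
{
    const double two52 = 4503599627370496.0;  /* 0x1.0p52 */
    uint64_t bits, sign;
    double ax, r;

    memcpy(&bits, &x, sizeof bits);
    sign = (bits >> 63) << 63;                /* psrlq/psllq 63: sign only */
    bits = (bits << 1) >> 1;                  /* psllq/psrlq 1: |x| */
    memcpy(&ax, &bits, sizeof ax);

    /* Guard added in this sketch: values at or above 2**52, Inf and NaN
       are already integral (or not numbers) and are returned unchanged. */
    if (!(ax < two52))
        return x;

    /* volatile keeps the compiler from folding the add/subtract pair. */
    volatile double sum = ax + two52;         /* addsd */
    r = sum - two52;                          /* subsd: r = rint(|x|) */
    if (ax < r)                               /* cmpltsd + andpd + subsd */
        r -= 1.0;                             /* rounded up, so step back */

    memcpy(&bits, &r, sizeof bits);
    bits |= sign;                             /* orpd: restore the sign */
    memcpy(&r, &bits, sizeof r);
    return r;
}
```

The branch on (ax < r) stands in for the branchless compare-and-mask
step of the assembly; a compiler is free to turn it back into one.
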
> regards
> Stefan
