Gabriel Paubert <paub...@iram.es> wrote:
> On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert <paub...@iram.es> wrote:
>>
>> > Hi,
>> >
>> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:

[...]

>> >> The whole idea behind these implementations is to get rid of loading
>> >> floating-point constants to perform comparisons.
>> >
>> > Indeed, but what I had in mind was something along the following lines:
>> >
>> >     movq      rax,xmm0     # and copy rax to say rcx, if needed later
>> >     shrq      rax,52       # move sign and exponent to 12 LSBs
>> >     andl      eax,0x7ff    # mask the sign
>> >     cmpl      eax,0x434    # value to be checked
>> >     ja        return       # exponent too large, we're done (what about NaNs?)
>> >     cvttsd2si rax,xmm0     # safe after exponent check
>> >     cvtsi2sd  xmm0,rax     # conversion done
>> >
>> > and a bit more to handle the corner cases (essentially preserve the
>> > sign to be correct between -1 and -0.0).
>>
>> The sign of -0.0 is the only corner case and already handled in my code.
>> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
>> preserved, as in the code GCC generates as well as my code.
>
> I don't know what the standard says about NaNs in this case, I seem to
> remember that arithmetic instructions typically produce QNaN when one of
> the inputs is a NaN, whether signaling or not.

<https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
and its cousins as well as the C standard say

| If x is NaN, a NaN shall be returned.

That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.

>> > But the CPU can (speculatively) start the conversions early, so the
>> > dependency chain is rather short.
>>
>> Correct.
>>
>> > I don't know if it's faster than your new code,
>>
>> It should be faster.
>>
>> > I'm almost sure that it's shorter.
>>
>> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
>> 5+4+5+5+2=21 bytes.
>>
>> JFTR: better use "add rax,rax; shr rax,53" instead of
>> "shr rax,52; and eax,0x7ff" and save 2 bytes.
>
> Indeed, I don't have the exact size of instructions in my head,
> especially since I've not written x86 assembly since the mid 90s.
>
> In any case, with your last improvement, the code is now down to a
> single 32 bit immediate constant. And I don't see how to eliminate it...
>
>>
>> Complete properly optimized code for __builtin_trunc is then as follows
>> (11 instructions, 44 bytes):
>>
>>     .code64
>>     .intel_syntax
>>     .equ      BIAS, 1023
>>     .text
>>     movq      rax, xmm0    # rax = argument
>>     add       rax, rax
>>     shr       rax, 53      # rax = exponent of |argument|
>>     cmp       eax, BIAS + 53
>>     jae       .Lexit       # argument indefinite?
>
> Maybe s/.Lexit/.L0/

Surely!

>>                            # |argument| >= 0x1.0p53?
>>     cvttsd2si rax, xmm0    # rax = trunc(argument)
>>     cvtsi2sd  xmm1, rax    # xmm1 = trunc(argument)
>>     psrlq     xmm0, 63
>>     psllq     xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>>     orpd      xmm0, xmm1   # xmm0 = trunc(argument)
>> .L0: ret
>>     .end
>>
>
> This looks nice.

Let's see how to convince GCC to generate such code sequences...

Stefan
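
PS: for anyone who wants to check the corner cases without an assembler,
the final sequence can be modelled roughly in C. This is only a sketch,
assuming IEEE 754 binary64 doubles and a 64-bit two's complement integer
type; the names bits_of and trunc_sketch are made up for this illustration
and appear nowhere in GCC or libm, and a compiler is not guaranteed to
turn it into the instruction sequence quoted above.

```c
#include <stdint.h>
#include <string.h>

#define BIAS 1023

/* Hypothetical helper: reinterpret the bits of a double as an integer. */
static uint64_t bits_of(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Hypothetical name: C model of the 11-instruction sequence. */
static double trunc_sketch(double x)
{
    uint64_t u = bits_of(x);
    uint64_t exponent = (u + u) >> 53;  /* add rax, rax; shr rax, 53 */

    if (exponent >= BIAS + 53)          /* cmp eax, BIAS + 53; jae .L0 */
        return x;                       /* integral, infinite or NaN */

    int64_t i = (int64_t)x;             /* cvttsd2si: |x| < 0x1.0p53 here */
    double t = (double)i;               /* cvtsi2sd */

    uint64_t sign = (u >> 63) << 63;    /* psrlq 63; psllq 63 */
    uint64_t r = bits_of(t) | sign;     /* orpd: restores the sign of -0.0 */
    memcpy(&t, &r, sizeof t);
    return t;
}
```

The only place where the final OR changes anything is an argument in
(-1, -0.0]: there the integer round trip yields +0.0 and the saved sign
bit turns it back into -0.0. Whether a signaling NaN survives the early
return unquieted depends on how the compiler passes the value around, so
the model says nothing reliable about that point.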