On Fri, Aug 6, 2021 at 2:47 PM Stefan Kanthak <stefan.kant...@nexgo.de> wrote:
>
> Gabriel Paubert <paub...@iram.es> wrote:
>
> > Hi,
> >
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert <paub...@iram.es> wrote:
> >>
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>
> >> >>                               .intel_syntax
> >> >>                               .text
> >> >>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0   # rax = trunc(argument)
> >> >>    5:   48 f7 d8              neg     rax
> >> >>                         #     jz      .L0           # argument zero?
> >> >>    8:   70 16                 jo      .L0           # argument indefinite?
> >> >>                                                     # argument overflows 64-bit integer?
> >> >>    a:   48 f7 d8              neg     rax
> >> >>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax    # xmm1 = trunc(argument)
> >> >>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
> >> >>   17:   66 0f 73 f0 3f        psllq   xmm0, 63      # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >>   1c:   66 0f 56 c1           orpd    xmm0, xmm1    # xmm0 = trunc(argument)
> >> >>   20:   c3              .L0:  ret
> >> >>                               .end
> >> >
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >>
> >> Right, I overlooked this fault. Thanks for pointing it out.
> >>
> >> > So using your code may require some option (-ffast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >>
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisons.
> >
> > Indeed, but what I had in mind was something along the following lines:
> >
> > movq rax,xmm0      # and copy rax to say rcx, if needed later
> > shrq rax,52        # move sign and exponent to 12 LSBs
> > andl eax,0x7ff     # mask the sign
> > cmpl eax,0x434     # value to be checked
> > ja return          # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax  # conversion done
> >
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
>
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, as in the code GCC generates as well as my code.
>
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
>
> Correct.
>
> > I don't know if it's faster than your new code,
>
> It should be faster.
>
> > I'm almost sure that it's shorter.
>
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
> 5+4+5+5+2=21 bytes.
>
> JFTR: better use "add rax,rax; shr rax,53" instead of
>       "shr rax,52; and eax,0x7ff" and save 2 bytes.
>
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
>
>                 .code64
>                 .intel_syntax
>                 .equ    BIAS, 1023
>                 .text
>                 movq    rax, xmm0         # rax = argument
>                 add     rax, rax
>                 shr     rax, 53           # rax = exponent of |argument|
>                 cmp     eax, BIAS + 53
>                 jae     .Lexit            # argument indefinite?
>                                           # |argument| >= 0x1.0p53?
>                 cvttsd2si rax, xmm0       # rax = trunc(argument)
>                 cvtsi2sd xmm1, rax        # xmm1 = trunc(argument)
>                 psrlq   xmm0, 63
>                 psllq   xmm0, 63          # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>                 orpd    xmm0, xmm1        # xmm0 = trunc(argument)
>         .Lexit: ret
>                 .end
>
> @Richard Biener (et al.):
>
> 1. Is a primitive for "floating-point > 2**x", which generates such
>    an "integer" code sequence, already available, at least for
>    float/binary32 and double/binary64?

Not that I know, but it should be possible to craft that.

> 2. the procedural code generator for __builtin_trunc() etc. uses
>    __builtin_fabs() and __builtin_copysign() as building blocks.
>    These would need to (and of course should) be modified to generate
>    psllq/psrlq pairs instead of andpd/andnpd referencing a memory
>    location with either -0.0 or ~(-0.0).
>
> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
>
>                 .code64
>                 .intel_syntax
>                 .equ    BIAS, 1023
>                 .text
>                 cvttsd2si rax, xmm0       # rax = trunc(argument)
>                 jo      .Lexit            # argument indefinite?
>                                           # |argument| > 0x1.0p63?
>                 cvtsi2sd xmm0, rax        # xmm0 = trunc(argument)
>         .Lexit: ret
>                 .end
>
> [...]
>
> >> Right, the conversions dominate both the original and the code I posted.
> >> It's easy to get rid of them, with still slightly shorter and faster
> >> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> >> 57 + 32 = 89 bytes):
> >>
> >>                                         .code64
> >>                                         .intel_syntax noprefix
> >>                                         .text
> >>    0:   48 b8 00 00 00 00 00 00 30 43   mov     rax, 0x4330000000000000
> >>    a:   66 48 0f 6e d0                  movq    xmm2, rax     # xmm2 = 0x1.0p52 = 4503599627370496.0
> >>    f:   48 b8 00 00 00 00 00 00 f0 3f   mov     rax, 0x3FF0000000000000
> >>   19:   f2 0f 10 c8                     movsd   xmm1, xmm0    # xmm1 = argument
> >>   1d:   66 0f 73 f0 01                  psllq   xmm0, 1
> >>   22:   66 0f 73 d0 01                  psrlq   xmm0, 1       # xmm0 = |argument|
> >>   27:   66 0f 73 d1 3f                  psrlq   xmm1, 63
> >>   2c:   66 0f 73 f1 3f                  psllq   xmm1, 63      # xmm1 = (argument & -0.0) ? -0.0 : +0.0
> >>   31:   f2 0f 10 d8                     movsd   xmm3, xmm0
> >>   35:   f2 0f 58 c2                     addsd   xmm0, xmm2    # xmm0 = |argument| + 0x1.0p52
> >>   39:   f2 0f 5c c2                     subsd   xmm0, xmm2    # xmm0 = |argument| + 0x1.0p52 - 0x1.0p52
> >>                                                               #      = rint(|argument|)
> >>   3d:   66 48 0f 6e d0                  movq    xmm2, rax     # xmm2 = -0x1.0p0 = -1.0
> >
> > Huh? I see +1.0, -1 would be 0xBFF0000000000000.
>
> Spurious error in the comment.
> I modified code which uses -1.0 and performs (a commutative) "addsd xmm2, xmm2"
> instead of "subsd xmm0, xmm2" to save a "movsd" instruction.
>
> >>   42:   f2 0f c2 d8 01                  cmpltsd xmm3, xmm0    # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
> >>   47:   66 0f 54 d3                     andpd   xmm2, xmm3    # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >>   4b:   f2 0f 5c c2                     subsd   xmm0, xmm2    # xmm0 = rint(|argument|)
> >>                                                               #      - (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >>                                                               #      = trunc(|argument|)
> >>   4f:   66 0f 56 c1                     orpd    xmm0, xmm1    # xmm0 = trunc(argument)
> >>   53:   c3                              ret
>
> regards
> Stefan
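
For readers who want to check the exponent-test variant discussed above
without assembling it, here is a rough C model of the 11-instruction
sequence. It is only a sketch under stated assumptions: the function name
trunc_via_int is hypothetical, double is assumed to be IEEE 754 binary64,
and a compiler is free to emit different instructions than the ones named
in the comments.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of GCC): models the exponent check,
 * the integer round trip and the sign restoration step by step. */
static double trunc_via_int(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* movq rax, xmm0 */

    /* Shift the sign bit out, then move the 11 exponent bits down:
     * models "add rax, rax; shr rax, 53". */
    uint64_t exponent = (bits << 1) >> 53;

    /* 1023 is the binary64 exponent bias. NaN, infinity and every
     * |x| >= 2**53 are already integral or must be preserved as-is. */
    if (exponent >= 1023 + 53)
        return x;

    int64_t i = (int64_t)x;                  /* cvttsd2si: truncate toward zero */
    double t = (double)i;                    /* cvtsi2sd: exact, |i| < 2**53 */

    /* OR the argument's sign bit back in so that values in (-1, -0.0]
     * yield -0.0; the psrlq/psllq by 63 pair isolates the same bit. */
    uint64_t tbits;
    memcpy(&tbits, &t, sizeof tbits);
    tbits |= bits & UINT64_C(0x8000000000000000);
    memcpy(&t, &tbits, sizeof t);
    return t;
}
```

With this model, trunc_via_int(-0.5) returns -0.0 rather than +0.0, which
is the corner case the sign handling exists for, and a NaN argument is
returned unchanged because its exponent field is 0x7ff.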