https://llvm.org/bugs/show_bug.cgi?id=31602
Bug ID: 31602
Summary: [X86] float/double -> unsigned long conversion slow when inputs are predictable
Product: libraries
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P
Component: Backend: X86
Assignee: unassignedb...@nondot.org
Reporter: mku...@google.com
CC: llvm-bugs@lists.llvm.org
Classification: Unclassified

SSE and AVX (up until AVX512) don't have instructions to convert from FP (either float or double) to unsigned long. So, these conversions have to be emulated using FP -> signed long conversions.

GCC lowers this:

    unsigned long foo(double x) {
      return x;
    }

as:

    foo(double):
            movsd   .LC0(%rip), %xmm1
            ucomisd %xmm1, %xmm0
            jnb     .L2
            cvttsd2siq      %xmm0, %rax
            ret
    .L2:
            subsd   %xmm1, %xmm0
            movabsq $-9223372036854775808, %rdx
            cvttsd2siq      %xmm0, %rax
            xorq    %rdx, %rax
            ret
    .LC0:
            .long   0
            .long   1138753536

That is: check whether the value is in the signed range, and if not, force it into range, convert, and correct the result.

What we do, on the other hand, is:

    .LCPI0_0:
            .quad   4890909195324358656     # double 9.2233720368547758E+18
    foo(double):
            movsd   .LCPI0_0(%rip), %xmm1
            movapd  %xmm0, %xmm2
            subsd   %xmm1, %xmm2
            cvttsd2si       %xmm2, %rax
            movabsq $-9223372036854775808, %rcx # imm = 0x8000000000000000
            xorq    %rax, %rcx
            cvttsd2si       %xmm0, %rax
            ucomisd %xmm1, %xmm0
            cmovaeq %rcx, %rax
            retq

Which is basically an if-converted version of the GCC code.

Since cvttsd2si has a fairly long latency, the GCC version is much faster when the branch is well-predicted, and slower when it's not. But it seems like in most cases this branch should be well-predicted, e.g. if all inputs are "small" and actually fit into the signed range.

A few additional notes:

1) Our current version is problematic in the presence of FP exceptions, see PR17686.

2) I tried playing around with selecting on the input instead of the output, but that doesn't really improve the situation, since we then need to adjust the sign bit of the output of one of the converts.
There are two options here: (1) adjusting and selecting again between the original and the adjusted version, or (2) fudging the adjustment so that it's a nop for the right convert. ICC generates code which is basically (2). This avoids the problem in PR17686, but both options appear to be even slower than what we have now.
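
For illustration, the two single-conversion strategies discussed above can be sketched in C. This is only a rough sketch under stated assumptions, not compiler output: the names to_u64_branchy, to_u64_select_input, TWO_63 and SIGN_BIT are made up for this example, inputs are assumed to lie in [0, 2^64), and whether a compiler turns the ternaries into selects or branches is up to the compiler. A direct transcription of the current if-converted sequence is left out, because converting an out-of-range double to a signed integer is undefined behavior in C.

```c
#include <stdint.h>

/* Hypothetical helper constants for this sketch.
   TWO_63 is 2^63 as a double: the first value that no longer fits in int64_t. */
static const double TWO_63 = 9223372036854775808.0;
static const uint64_t SIGN_BIT = UINT64_C(1) << 63;

/* Branchy form, in the spirit of the GCC sequence: exactly one
   conversion is executed, chosen by a compare and a branch. */
uint64_t to_u64_branchy(double x) {
    if (!(x >= TWO_63)) {
        /* In signed range: convert directly. */
        return (uint64_t)(int64_t)x;
    }
    /* Out of signed range: shift down by 2^63, convert,
       then put the top bit back with an xor. */
    return (uint64_t)(int64_t)(x - TWO_63) ^ SIGN_BIT;
}

/* Select-on-input form, in the spirit of option (2): pick the
   adjustment before converting so that it is a no-op (subtract 0.0,
   xor 0) for inputs that are already in the signed range. */
uint64_t to_u64_select_input(double x) {
    int big = x >= TWO_63;
    double adjust = big ? TWO_63 : 0.0;   /* amount to subtract before converting */
    uint64_t fixup = big ? SIGN_BIT : 0;  /* bit to restore after converting */
    return (uint64_t)(int64_t)(x - adjust) ^ fixup;
}
```

In the second function only in-range values ever reach the conversion, at the cost of two selects ahead of it on the critical path.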