https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106
--- Comment #8 from JunMa <JunMa at linux dot alibaba.com> ---
(In reply to Alexander Monakov from comment #6)
> Reopening and confirming, GCC's code looks less efficient than possible for
> no good reason.
>
> CDCE does
>
>     y = sqrt (x);
>      ==>
>     y = IFN_SQRT (x);
>     if (__builtin_isless (x, 0))
>       sqrt (x);
>
> but it could do
>
>     y = IFN_SQRT (x);
>     if (__builtin_isless (x, 0))
>       y = sqrt (x);
>
> (note two assignments to y)
> what is the difference between this and LLVM's approach ?
> or to mimic LLVM's approach:
>
>     if (__builtin_isless (x, 0))
>       y = sqrt (x);
>     else
>       y = IFN_SQRT (x);

I have finished a patch that does the same as LLVM in the cdce pass, and
tested it with the case below:

#include <math.h>

int main ()
{
  float x = 1.0;
  float y = 0.0;
  for (int i=0; i<100000000; i++)
    {
      y += sqrtf (x+i);
    }
  return y;
}

For x86-64 at -O2 I get:

# original asm of IFN_SQRT part
.L4:
        pxor    %xmm0, %xmm0
        cvtsi2ssl       %ebx, %xmm0
        addss   %xmm3, %xmm0
        ucomiss %xmm0, %xmm4
        movaps  %xmm0, %xmm2
        sqrtss  %xmm2, %xmm2
        ja      .L7

and perf stat:

     1,423,652,277      cycles                    #    2.180 GHz                      (83.31%)
     1,121,862,980      stalled-cycles-frontend   #   78.80% frontend cycles idle     (83.31%)
       634,957,413      stalled-cycles-backend    #   44.60% backend cycles idle      (66.62%)
     1,102,109,423      instructions              #    0.77  insn per cycle
                                                  #    1.02  stalled cycles per insn  (83.31%)
       200,400,940      branches                  #  306.873 M/sec                    (83.44%)
             7,734      branch-misses             #    0.00% of all branches          (83.44%)

# transformed asm
.L4:
        pxor    %xmm0, %xmm0
        cvtsi2ssl       %ebx, %xmm0
        addss   %xmm3, %xmm0
        ucomiss %xmm0, %xmm2
        ja      .L8
        sqrtss  %xmm0, %xmm0

and perf stat:

     1,418,560,722      cycles                    #    2.180 GHz                      (83.25%)
     1,116,732,674      stalled-cycles-frontend   #   78.72% frontend cycles idle     (83.25%)
       674,837,417      stalled-cycles-backend    #   47.57% backend cycles idle      (66.81%)
     1,003,067,037      instructions              #    0.71  insn per cycle
                                                  #    1.11  stalled cycles per insn  (83.41%)
       200,619,151      branches                  #  308.272 M/sec                    (83.40%)
             5,637      branch-misses             #    0.00% of all branches          (83.28%)

The transformed case executes about 9% fewer instructions and is marginally
faster (roughly 0.4% fewer cycles),
which looks good to me. However, one thing I noticed is that the original
case has fewer 'stalled-cycles-backend', since its code has better ILP. I'm
not sure which approach is better.

Environment:
gcc version: gcc trunk@270488
OS: centos7.2
HW: Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz
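
For readers who want to compare the three shapes discussed above at the source level, here is a minimal sketch in plain C. It is only a model, not the patch and not what the cdce pass emits: ifn_sqrt_model and lib_sqrt_model are hypothetical stand-ins (IFN_SQRT is a GCC-internal function with no C spelling), the Newton iteration is there only to keep the sketch free of library calls, and the assumption that the library call differs from the inline form only by setting errno to EDOM for negative input is a simplification.

```c
#include <errno.h>
#include <math.h>   /* isless, NAN, INFINITY; no libm function is called */

/* Hypothetical model of IFN_SQRT: produces the value and never touches
   errno.  Newton's iteration is an assumption made for this sketch; it is
   not how GCC expands IFN_SQRT.  */
static float
ifn_sqrt_model (float x)
{
  if (!(x > 0.0f))
    return x == 0.0f ? x : NAN;   /* 0 -> 0; negative or NaN -> NaN */
  if (x == INFINITY)
    return x;
  float r = x < 1.0f ? 1.0f : x;  /* start at or above the root */
  for (int i = 0; i < 100; i++)
    r = 0.5f * (r + x / r);
  return r;
}

/* Hypothetical model of the library sqrtf with errno reporting enabled:
   same value, plus errno = EDOM for negative input.  */
static float
lib_sqrt_model (float x)
{
  if (isless (x, 0.0f))
    errno = EDOM;
  return ifn_sqrt_model (x);
}

/* Shape attributed to CDCE in comment #6: the value always comes from the
   inline form; the call is kept only for its errno side effect.  */
static float
sqrt_cdce_current (float x)
{
  float y = ifn_sqrt_model (x);
  if (isless (x, 0.0f))
    (void) lib_sqrt_model (x);
  return y;
}

/* First alternative from comment #6: two assignments to y.  */
static float
sqrt_two_assignments (float x)
{
  float y = ifn_sqrt_model (x);
  if (isless (x, 0.0f))
    y = lib_sqrt_model (x);
  return y;
}

/* LLVM-like shape: exactly one of the two forms runs on each path.  */
static float
sqrt_llvm_like (float x)
{
  float y;
  if (isless (x, 0.0f))
    y = lib_sqrt_model (x);
  else
    y = ifn_sqrt_model (x);
  return y;
}
```

Under this model the three functions return the same value for every input and report a negative argument through errno in the same way; they differ only in how the inline computation and the call are arranged around the branch, which is the part that affects the generated loop.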