https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106

--- Comment #8 from JunMa <JunMa at linux dot alibaba.com> ---
(In reply to Alexander Monakov from comment #6)
> Reopening and confirming, GCC's code looks less efficient than possible for
> no good reason.
> 
> CDCE does
> 
>         y = sqrt (x);
>      ==>
>         y = IFN_SQRT (x);
>         if (__builtin_isless (x, 0))
>             sqrt (x);
> 
> but it could do
> 
>         y = IFN_SQRT (x);
>         if (__builtin_isless (x, 0))
>             y = sqrt (x);
> 
> (note two assignments to y)
> 

What is the difference between this and LLVM's approach?

> or to mimic LLVM's approach:
> 
>         if (__builtin_isless (x, 0))
>             y = sqrt (x);
>         else
>             y = IFN_SQRT (x);

I have finished a patch that makes the cdce pass do the same as LLVM, and
tested it with the case below:

  #include <math.h>
  int main () {
    float x = 1.0;
    float y = 0;
    for (int i=0; i<100000000; i++) {
      y += sqrtf (x+i);
    }
    return y;
  }

On x86-64 with -O2 I got:

  # original asm of IFN_SQRT part
.L4:
  pxor  %xmm0, %xmm0
  cvtsi2ssl  %ebx, %xmm0
  addss  %xmm3, %xmm0
  ucomiss %xmm0, %xmm4
  movaps %xmm0, %xmm2
  sqrtss %xmm2, %xmm2
  ja  .L7

and perf stat:
     1,423,652,277      cycles                    #    2.180 GHz                      (83.31%)
     1,121,862,980      stalled-cycles-frontend   #   78.80% frontend cycles idle     (83.31%)
       634,957,413      stalled-cycles-backend    #   44.60% backend cycles idle      (66.62%)
     1,102,109,423      instructions              #    0.77  insn per cycle
                                                  #    1.02  stalled cycles per insn  (83.31%)
       200,400,940      branches                  #  306.873 M/sec                    (83.44%)
             7,734      branch-misses             #    0.00% of all branches          (83.44%)



  # transformed asm
.L4:
  pxor  %xmm0, %xmm0
  cvtsi2ssl  %ebx, %xmm0
  addss  %xmm3, %xmm0
  ucomiss %xmm0, %xmm2
  ja   .L8
  sqrtss %xmm0, %xmm0

and perf stat:
     1,418,560,722      cycles                    #    2.180 GHz                      (83.25%)
     1,116,732,674      stalled-cycles-frontend   #   78.72% frontend cycles idle     (83.25%)
       674,837,417      stalled-cycles-backend    #   47.57% backend cycles idle      (66.81%)
     1,003,067,037      instructions              #    0.71  insn per cycle
                                                  #    1.11  stalled cycles per insn  (83.41%)
       200,619,151      branches                  #  308.272 M/sec                    (83.40%)
             5,637      branch-misses             #    0.00% of all branches          (83.28%)


The transformed case executes fewer instructions (about 1.00 billion versus
1.10 billion) and takes slightly fewer cycles, which looks good to me. However,
one thing I noticed is that the original case has fewer
'stalled-cycles-backend' (44.60% versus 47.57%), since its code has better ILP.
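
To make the comparison concrete, the relative changes can be recomputed from
the raw perf counters above. A minimal sketch follows; the helper names
pct_change and ipc are my own, not part of perf.

```c
/* Relative change from `before` to `after`, in percent.
   A negative result means the counter went down. */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Instructions retired per cycle. */
static double ipc(double instructions, double cycles)
{
    return instructions / cycles;
}
```

With the counters above, pct_change (1102109423, 1003067037) is about -9.0
for instructions and pct_change (1423652277, 1418560722) is about -0.4 for
cycles, while ipc gives about 0.77 for the original and 0.71 for the
transformed code, matching the perf output.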

I'm not sure which approach is better.

Environment:
gcc version:  gcc trunk@270488 
OS: centos7.2
HW: Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz
