https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124434

--- Comment #12 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #5)
> The loop at -O0:
> ```
> .L3:
>         fldt    -16(%rbp) // load into x87 stack
>         fldt    -48(%rbp) // load into x87 stack
>         fmulp   %st, %st(1) //multiply
>         fldt    -64(%rbp) // load into x87 stack
>         faddp   %st, %st(1) // add
>         fstpt   -16(%rbp) // store from x87 stack
>         addl    $1, -20(%rbp)
> .L2:
>         cmpl    $999999999, -20(%rbp)
>         jle     .L3
> ```
> 
> Nothing is kept in the x87 stack, which slows down x87 in general as
> transferring between the x87 stack and the normal stack is slow and there is
> no load bypass.

Note that clang produces similar (actually slightly worse) code at -O0 too:
```
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        cmpl    $1000000000, -52(%rbp)          # imm = 0x3B9ACA00
        jge     .LBB0_4
# %bb.2:                                #   in Loop: Header=BB0_1 Depth=1
        fldt    -16(%rbp)
        fldt    -32(%rbp)
        fldt    -48(%rbp) ;; load into x87 stack
        fxch    %st(2) ;; exchange top with st(2)
        fmulp   %st, %st(1) ;; st(1) *= top, then pop
        faddp   %st, %st(1) ;; st(1) += top, then pop
        fstpt   -16(%rbp) ;; store top and pop
# %bb.3:                                #   in Loop: Header=BB0_1 Depth=1
        movl    -52(%rbp), %eax
        addl    $1, %eax
        movl    %eax, -52(%rbp)
        jmp     .LBB0_1
```
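
For reference, both listings look like a multiply-add on `long double` values in a counted loop. The original test case is not quoted in this comment, so the following is only a sketch inferred from the assembly; the function name `mul_add_loop` and its parameter names are made up for illustration:
```c
/* Hypothetical reconstruction of the loop shape seen in the listings above.
   Assumption: the three stack slots hold long double values, x is the
   accumulator that is reloaded and stored back on every iteration at -O0. */
long double mul_add_loop(long double x, long double a, long double b, int n)
{
    for (int i = 0; i < n; i++)
        x = x * a + b;   /* fldt, fldt, fmulp, fldt, faddp, fstpt at -O0 */
    return x;
}
```
With `n` set to 1000000000 this would match the loop bounds in both listings. At -O0 every variable lives in its stack slot, so each iteration pays for three x87 loads and one store.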
