https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124434
--- Comment #12 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #5)
> The loop at -O0:
> ```
> .L3:
> fldt -16(%rbp) // load into x87 stack
> fldt -48(%rbp) // load into x87 stack
> fmulp %st, %st(1) //multiply
> fldt -64(%rbp) // load into x87 stack
> faddp %st, %st(1) // add
> fstpt -16(%rbp) // store from x87 stack
> addl $1, -20(%rbp)
> .L2:
> cmpl $999999999, -20(%rbp)
> jle .L3
> ```
>
> Nothing is kept on the x87 stack, which slows down x87 code in general, as
> transferring between the x87 stack and the normal stack is slow and there is
> no load bypass.
Note that clang produces similar (actually slightly worse) code at -O0 too:
```
.LBB0_1: # =>This Inner Loop Header: Depth=1
cmpl $1000000000, -52(%rbp) # imm = 0x3B9ACA00
jge .LBB0_4
# %bb.2: # in Loop: Header=BB0_1 Depth=1
fldt -16(%rbp)
fldt -32(%rbp)
fldt -48(%rbp) ;; load into x87 stack
fxch %st(2) ;; exchange top with stack-2
fmulp %st, %st(1) ;; top*=stack-1; and remove stack-1
faddp %st, %st(1) ;; top+=stack-1 and remove stack-1
fstpt -16(%rbp) ;; store top and pop
# %bb.3: # in Loop: Header=BB0_1 Depth=1
movl -52(%rbp), %eax
addl $1, %eax
movl %eax, -52(%rbp)
jmp .LBB0_1
```
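
For reference, a loop of roughly the following shape would give the load/multiply/add/store pattern seen in both listings. This is only a sketch inferred from the assembly: the function name `iterate`, the parameter names and the signature are assumptions, not the testcase from this bug.
```c
/* Hypothetical reconstruction of the loop body; all names are assumptions.
   With long double on x86-64 each x * a + b is done on the x87 stack, and
   at -O0 every operand is reloaded from memory and x is stored back on
   every iteration. */
long double iterate(long double x, long double a, long double b, int n)
{
    /* The listings above correspond to n == 1000000000. */
    for (int i = 0; i < n; i++)
        x = x * a + b;
    return x;
}
```
With optimization enabled, a compiler is free to keep x, a and b on the x87 stack across iterations, which is the case the quoted comment contrasts against.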