https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124434
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The loop at -O0:
```
.L3:
fldt -16(%rbp) // load into x87 stack
fldt -48(%rbp) // load into x87 stack
fmulp %st, %st(1) // multiply
fldt -64(%rbp) // load into x87 stack
faddp %st, %st(1) // add
fstpt -16(%rbp) // store from x87 stack
addl $1, -20(%rbp)
.L2:
cmpl $999999999, -20(%rbp)
jle .L3
```
Nothing is kept on the x87 stack across iterations, which slows down x87 code
in general: transferring values between the x87 stack and the normal stack is
slow, and there is no load bypass.
At -O1:
```
.L2:
fmul %st(1), %st // multiply, operands already on the x87 stack
fadd %st(2), %st // add, operand already on the x87 stack
subl $1, %eax
jne .L2
```
Everything is kept on the x87 stack.
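
For reference, a loop of roughly this shape would be expected to produce code like the above. This is only a sketch: the original test case is not quoted in this comment, so the function name `iterate` and its parameters are hypothetical, reconstructed from the assembly (one `long double` multiply and one add per iteration, with the result stored back into the accumulator).

```c
/* Hypothetical reduction, not the reporter's test case.
 * On x86-64, long double is the 80-bit x87 extended type, so GCC
 * uses x87 instructions (fldt/fmul/fadd/fstpt) for this arithmetic. */
long double iterate(long double x, long double a, long double b, int n)
{
    for (int i = 0; i < n; i++)
        x = x * a + b;   /* one multiply and one add per iteration */
    return x;
}
```

Assuming the function is saved as `loop.c` (a made-up file name), comparing the output of `gcc -O0 -S loop.c` and `gcc -O1 -S loop.c` should show the difference: at -O0 every variable lives in a stack slot and is reloaded and stored each iteration, while at -O1 the values are likely to stay on the x87 stack for the whole loop. The exact instructions and register assignments may differ from the listings above.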