https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu.org
           Keywords|EH                          |
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is that we fail to sink
  d_29 = {t_28, t_28, t_28, t_28};
we compute a good place in select_best_block but then since it is at the
same loop depth as the original place we apply
  /* If BEST_BB is at the same nesting level, then require it to have
     significantly lower execution frequency to avoid gratuitous movement.  */
  if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
      /* If result of comparison is unknown, prefer EARLY_BB.
         Thus use !(...>=..) rather than (...<...)  */
      && !(best_bb->count * 100 >= early_bb->count * threshold))
    return best_bb;
and fail to sink. I'm not exactly sure why we do the above - we probably
should when best_bb post-dominates early_bb, and also if the sunk stmt
possibly (or provably) will enlarge the lifetime of its uses (but that's also
hard to guess since we process sinking of the defs of the uses only
afterwards). In this case we have a single use and a single def, so
sinking shouldn't make things worse. We could also weigh in the
spilling class of a reg here.
In our case we have the dominated block with a higher(!) count than
the dominating block which means the profile is corrupt.
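To make the effect of the quoted condition concrete, here is a minimal stand-alone sketch of just that one test. The `Block` struct and `same_depth_check_accepts` are hypothetical names for illustration, counts are plain known integers (the "unknown" case of GCC's profile counts is not modelled), and 75 is only an example threshold value.

```cpp
#include <cstdint>

// Hypothetical stand-in for a basic block; only the two fields the
// quoted condition reads are modelled.
struct Block {
    int loop_depth;       // loop nesting level
    std::uint64_t count;  // execution count, assumed to be known
};

// Models only the quoted same-depth condition: returns true when
// best_bb would be accepted by it.  The rest of select_best_block
// is not modelled, so "false" here just means "this test did not
// pick best_bb".  threshold is a percentage.
bool same_depth_check_accepts(const Block& early_bb, const Block& best_bb,
                              unsigned threshold) {
    return best_bb.loop_depth == early_bb.loop_depth
        && !(best_bb.count * 100 >= early_bb.count * threshold);
}
```

With an early_bb count of 1000 and a threshold of 75, a best_bb count of 500 passes the test, while 900 does not. A best_bb count of 1200, i.e. higher than the dominating block as in the corrupt profile described above, fails it as well, even with the threshold at 100.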
With --param sink-frequency-threshold we sink the ctor and the feeding
division but still get
.L5:
        movq    (%rbx), %rax
        pxor    %xmm1, %xmm1
        leaq    0(%rbp,%rax), %rdx
        .p2align 4,,10
        .p2align 3
.L4:
        movaps  (%rsp), %xmm0
        addps   (%rax), %xmm0
        addq    $16, %rax
        movaps  %xmm0, -16(%rax)
        addps   %xmm0, %xmm1
        cmpq    %rax, %rdx
        jne     .L4
        movaps  %xmm1, %xmm0
        movhlps %xmm1, %xmm0
        addps   %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        shufps  $85, %xmm1, %xmm0
        addps   %xmm1, %xmm0
.LEHB1:
        call    _Z1gf
        addq    $8, %rbx
        cmpq    %rbx, %r12
        jne     .L5
because we (rightfully so) refuse to sink into the outer loop. What we
fail to do is hoist the reload out of the inner loop (I suppose
clang does exactly that).
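As a source-level analogy of the missed transformation (not the testcase from this PR, and not what any pass actually emits), the sketch below models the inner loop at .L4 twice: once re-reading the invariant value from its slot on every iteration, as the `movaps (%rsp), %xmm0` does, and once with that read hoisted in front of the loop. `Vec4`, `add`, and both function names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical 4-lane float vector standing in for an SSE register.
struct Vec4 { float lane[4]; };

// Lane-wise addition, the analogue of addps.
static Vec4 add(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Shape of the loop at .L4: the invariant value is reloaded from its
// slot on every iteration.
Vec4 inner_reload_each_iter(Vec4* data, std::size_t n, const Vec4* slot) {
    Vec4 acc{};
    for (std::size_t i = 0; i < n; ++i) {
        Vec4 d = *slot;             // reload inside the loop
        data[i] = add(d, data[i]);  // data[i] += d
        acc = add(acc, data[i]);    // running sum
    }
    return acc;
}

// Same loop with the reload hoisted in front of it.  This is only
// equivalent because the slot is not written by the loop, which holds
// for a private spill slot (an assumption of this sketch).
Vec4 inner_reload_hoisted(Vec4* data, std::size_t n, const Vec4* slot) {
    Vec4 acc{};
    const Vec4 d = *slot;           // single reload per inner-loop entry
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = add(d, data[i]);
        acc = add(acc, data[i]);
    }
    return acc;
}
```

Both variants compute the same result; the hoisted one performs one slot read per entry to the inner loop instead of one per iteration.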
We don't have any pass after reload that would perform loop invariant motion.
I'm not sure how this situation is handled in general in RA - is a post-RA
pass optimizing the spill/reload placement "globally" usually done?