[Bug target/120941] [16 Regression] 24-40% slowdown of 519.lbm_r on Zen2 and 470.lbm on Zen5 since r16-1644-gaba3b9d3a48a07

rguenther at suse dot de via Gcc-bugs Tue, 15 Jul 2025 06:24:44 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120941


--- Comment #26 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 15 Jul 2025, pheeck at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120941
> 
> --- Comment #25 from Filip Kastl <pheeck at gcc dot gnu.org> ---
> (In reply to H.J. Lu from comment #24)
> > Why is it bad for znver2?
> 
> Oh, I thought we were trying to figure that out.  Spilling because of register
> pressure, as richi suggested in comment 3, is the best guess we currently 
> have.
> 
> I'll see if I can confirm that there is some extra spilling.

For the testcase you reduced to, sanitized a bit:

enum { ST, SB, ET, EB, WT, WB };
void LBM_initializeGrid(double *grid) {
  grid[ST] = grid[SB] = grid[ET] = grid[EB] =
      grid[WT] = grid[WB] = 1.0 / 36.0;
}

this is

LBM_initializeGrid:
.LFB0:
        .cfi_startproc
        vmovddup        .LC1(%rip), %xmm0
        vmovupd %xmm0, 32(%rdi)
        vbroadcastsd    .LC1(%rip), %ymm0
        vmovupd %ymm0, (%rdi)
        vzeroupper
        ret

vs.

LBM_initializeGrid:
.LFB0:
        .cfi_startproc
        vbroadcastsd    .LC1(%rip), %ymm0
        vmovupd %xmm0, 32(%rdi)
        vmovupd %ymm0, (%rdi)
        vzeroupper
        ret

The latter (new) version is better.  As I said, I would expect extra
spilling if the two uses are far apart, because the shared register
has to stay live in between.  I would have restricted the optimization
to uses within a single basic block, for example.  If we had a
tunable/--param for that, you could check whether it helps with the
regressions.
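
To make the "far apart" case concrete, here is a minimal sketch of the
shape I mean.  It is not taken from lbm; init_far_apart, the flag
argument and the loop standing in for unrelated work are all made up
for illustration.  The same constant is stored as four adjacent doubles
in one basic block and as two adjacent doubles in another, with a loop
in between.  Whether GCC shares one broadcast across the two stores
here, and whether that costs a spill, is exactly what would need
checking; the sketch only shows the code shape.

```c
#include <stddef.h>

/* Hypothetical names; only the enum entries mirror the reduced testcase. */
enum { ST, SB, ET, EB, WT, WB, N_ENTRIES };

double init_far_apart(double *grid, size_t n_work, int flag)
{
    const double w = 1.0 / 36.0;
    double acc = 0.0;

    /* Use 1: four adjacent stores of w, a candidate for one 256-bit store. */
    grid[ST] = grid[SB] = grid[ET] = grid[EB] = w;

    /* Stand-in for unrelated work between the two uses.  In real code this
       region is where other values would compete for vector registers. */
    for (size_t i = 0; i < n_work; i++)
        acc += grid[i % 4] * (double)(i + 1);

    if (flag) {
        /* Use 2: a different basic block; two adjacent stores of the same
           constant, a candidate for one 128-bit store. */
        grid[WT] = grid[WB] = w;
    }

    return acc;
}
```

Restricting the sharing to a single basic block would cover the reduced
testcase above (both stores are in the same block) but not a function
shaped like this one, where the constant would be loaded again for the
second use.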
