[Bug target/120941] [16 Regression] 24-40% slowdown of 519.lbm_r on Zen2 and 470.lbm on Zen5 since r16-1644-gaba3b9d3a48a07

pheeck at gcc dot gnu.org via Gcc-bugs Fri, 25 Jul 2025 06:54:39 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120941


--- Comment #28 from Filip Kastl <pheeck at gcc dot gnu.org> ---
Created attachment 61965
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61965&action=edit
testcase 2 (reduced lbm, where the spill can be seen)

Ok, I think I have confirmed that there is a spill going on.  I also have a new
reduced testcase, where the spill can be seen.

Compile the testcase on r16-1643-gd073bb6cfc219d and r16-1644-gaba3b9d3a48a07
using

gcc -std=gnu99 -c -o lbm.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -Ofast \
    -march=znver2 -mtune=znver2 -flto -fpermissive -std=gnu17 -DSPEC_LP64 lbm.c
gcc -std=gnu99 -c -o main.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -Ofast \
    -march=znver2 -mtune=znver2 -flto -fpermissive -std=gnu17 -DSPEC_LP64 main.c
gcc -std=gnu99 -Ofast -march=znver2 -mtune=znver2 -g -flto -fpermissive \
    -std=gnu17 lbm.o main.o -lm -o lbm_r

and disassemble the two binaries.  When you compare the disassemblies, you'll
see:

 r16-1643:                                  r16-1644:
 cs nopw 0x0(%rax,%rax,1)                 | vbroadcastsd 0xfc9(%rip),%ymm6 # 402040
                                          |
 test   %ebx,%ebx                         |
 je     4012d3 <main+0x293>               | vbroadcastsd 0xfb0(%rip),%ymm5 # 402050
 lea    0x2(%rbx),%eax                    |
 neg    %eax                              |
 mov    %eax,%esi                         | vmovapd %ymm6,-0x50(%rbp)
 shr    $1,%esi                           | vmovapd %ymm5,-0x30(%rbp)
 cmp    $0xc,%eax                         | nopw   0x0(%rax,%rax,1)
 jbe    401315 <main+0x2d5>               | test   %ebx,%ebx
 vmovd  %ebx,%xmm2                        | je     4012cc <main+0x28c>
 vbroadcastsd 0xf89(%rip),%ymm4 # 402040  | lea    0x2(%rbx),%eax
                                          | neg    %eax
                                          | mov    %eax,%esi
 vbroadcastsd 0xf90(%rip),%ymm3 # 402050  | shr    $1,%esi
                                          | cmp    $0xc,%eax
                                          | jbe    401307 <main+0x2c7>
                                          | vmovd  %ebx,%xmm2
                                          | vmovapd -0x30(%rbp),%ymm4
                                          | vmovapd -0x50(%rbp),%ymm3
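
A comparison like the one above can be produced roughly as follows.  This is
only a sketch: the binary names lbm_r.r16-1643 and lbm_r.r16-1644 are
placeholders, and strip_addrs is a hypothetical helper, not something from the
testcase.  It drops the leading address column so that the diff is not
dominated by shifted instruction addresses (jump targets will still differ).

```shell
# Hypothetical helper: remove the leading "  4012a0:" address column from
# objdump output read on stdin, so shifted addresses do not show up as diffs.
strip_addrs() {
    sed -E 's/^[[:space:]]*[0-9a-f]+:[[:space:]]*//'
}

# Usage (placeholder binary names; --disassemble=SYM needs a reasonably
# recent binutils, plain -d works everywhere but prints all functions):
#   objdump -d --no-show-raw-insn --disassemble=main lbm_r.r16-1643 | strip_addrs > old.s
#   objdump -d --no-show-raw-insn --disassemble=main lbm_r.r16-1644 | strip_addrs > new.s
#   diff -y -W 100 old.s new.s
```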

There are two jumps in this part of the binary.  There are also the two
vbroadcastsd instructions reading from memory.  With r16-1644, the values are
loaded and broadcast first and then spilled onto the stack right away.  Then
come the jumps.  Finally, the spilled values are reloaded from the stack.  In
contrast, the r16-1643 binary delays the broadcasts until this point and there
is no spill.
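
To check for this pattern without reading the whole disassembly, something
like the following sketch can count the candidate spill stores and reloads.
count_ymm_spills is a hypothetical helper; it assumes AT&T syntax and only
matches vmovapd of a ymm register to or from a negative %rbp offset, as in the
listing above, so it will miss spills that use other instructions or %rsp.

```shell
# Hypothetical helper: count ymm spill stores and reloads in AT&T-syntax
# disassembly read from stdin.  Only %rbp-relative slots with negative
# offsets are matched.
count_ymm_spills() {
    awk '
        /vmovapd[ \t]+%ymm[0-9]+,-0x[0-9a-f]+\(%rbp\)/ { stores++ }
        /vmovapd[ \t]+-0x[0-9a-f]+\(%rbp\),%ymm[0-9]+/ { reloads++ }
        END { printf "stores=%d reloads=%d\n", stores, reloads }
    '
}

# Usage (placeholder binary name):
#   objdump -d --no-show-raw-insn lbm_r.r16-1644 | count_ymm_spills
```

Running it on both binaries gives a quick number to compare; any hits still
need to be checked by hand against the surrounding code.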

If you look into the rrvl RTL dump, you see that the rrvl pass is responsible
for moving the broadcasts upwards and inserting the moves that end up as
vmovapd -0x30(%rbp),%ymm4 and vmovapd -0x50(%rbp),%ymm3.  At that point those
are still just moves between registers, though.

So to me this seems consistent with Richi's hypothesis.  There is an extra
spill and it is caused by rrvl moving loads across basic blocks.
