https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120941

--- Comment #28 from Filip Kastl <pheeck at gcc dot gnu.org> ---
Created attachment 61965
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61965&action=edit
testcase 2 (reduced lbm, where the spill can be seen)

Ok, I think I have confirmed that there is a spill going on.  I also have a new
reduced testcase, where the spill can be seen.

Compile the testcase on r16-1643-gd073bb6cfc219d and r16-1644-gaba3b9d3a48a07
using

gcc -std=gnu99 -c -o lbm.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -Ofast -march=znver2 -mtune=znver2 -flto -fpermissive -std=gnu17 -DSPEC_LP64 lbm.c
gcc -std=gnu99 -c -o main.o -DSPEC -DNDEBUG -DSPEC_AUTO_SUPPRESS_OPENMP -Ofast -march=znver2 -mtune=znver2 -flto -fpermissive -std=gnu17 -DSPEC_LP64 main.c
gcc -std=gnu99 -Ofast -march=znver2 -mtune=znver2 -g -flto -fpermissive -std=gnu17 lbm.o main.o -lm -o lbm_r

and disassemble the two binaries.  When you compare the disassemblies, you'll
see:

r16-1643:                               | r16-1644:
cs nopw 0x0(%rax,%rax,1)                | vbroadcastsd 0xfc9(%rip),%ymm6 # 402040
test %ebx,%ebx                          |
je 4012d3 <main+0x293>                  | vbroadcastsd 0xfb0(%rip),%ymm5 # 402050
lea 0x2(%rbx),%eax                      |
neg %eax                                |
mov %eax,%esi                           | vmovapd %ymm6,-0x50(%rbp)
shr $1,%esi                             | vmovapd %ymm5,-0x30(%rbp)
cmp $0xc,%eax                           | nopw 0x0(%rax,%rax,1)
jbe 401315 <main+0x2d5>                 | test %ebx,%ebx
vmovd %ebx,%xmm2                        | je 4012cc <main+0x28c>
vbroadcastsd 0xf89(%rip),%ymm4 # 402040 | lea 0x2(%rbx),%eax
                                        | neg %eax
                                        | mov %eax,%esi
vbroadcastsd 0xf90(%rip),%ymm3 # 402050 | shr $1,%esi
                                        | cmp $0xc,%eax
                                        | jbe 401307 <main+0x2c7>
                                        | vmovd %ebx,%xmm2
                                        | vmovapd -0x30(%rbp),%ymm4
                                        | vmovapd -0x50(%rbp),%ymm3

There are two jumps in this part of the binary.  There are also the two
vbroadcastsd instructions reading from memory.  With r16-1644, the values are
loaded, broadcast, and spilled onto the stack right away.  Then there are the
jumps.  Finally, the spilled values are read back from the stack.  In contrast,
the r16-1643 binary delays the broadcasting until this point and there is no
spill.

If you look into the rrvl RTL dump, you see that the rrvl pass is responsible
for moving the broadcasts upwards and for inserting the moves that end up as
vmovapd -0x30(%rbp),%ymm4 and vmovapd -0x50(%rbp),%ymm3 in the final binary.
At that point in the pipeline those are just moves between registers, though.

So to me this seems consistent with Richi's hypothesis.  There is an extra
spill and it is caused by rrvl moving loads across basic blocks.
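
For anyone repeating the comparison, here is a rough sketch of how the two disassemblies could be lined up.  It is not the exact procedure used for the listing in this comment.  The helper names main_insns and count_ymm_stack_moves, the r16-1643/ and r16-1644/ directory names, and the output file names are all made up for illustration.  The count is only a heuristic: a vmovapd to or from an %rbp-relative slot is not necessarily a spill, and spills that use other instructions or %rsp-relative slots are not counted.

```shell
# Filter: read `objdump -d --no-show-raw-insn` output on stdin and print only
# the instructions of main, without the leading address column, so that a
# diff shows instruction changes rather than shifted addresses.
main_insns() {
    awk '/^[0-9a-f]+ <main>:$/ { in_main = 1; next }
         in_main && /^$/       { in_main = 0 }
         in_main' |
        sed -E 's/^ *[0-9a-f]+:[[:space:]]*//'
}

# Filter: count vmovapd stores of a %ymm register to an %rbp-relative slot
# and vmovapd loads from such a slot (heuristic, see the note above).
count_ymm_stack_moves() {
    grep -cE 'vmovapd[[:space:]]+(%ymm[0-9]+,-?0x[0-9a-f]+\(%rbp\)|-?0x[0-9a-f]+\(%rbp\),%ymm[0-9]+)'
}

# Usage (directory and file names are placeholders):
#   objdump -d --no-show-raw-insn r16-1643/lbm_r | main_insns > main-1643.s
#   objdump -d --no-show-raw-insn r16-1644/lbm_r | main_insns > main-1644.s
#   diff -y -W 160 main-1643.s main-1644.s
#   count_ymm_stack_moves < main-1644.s
```

Jump targets still differ between the two listings after the address column is stripped, so some noise remains in the side-by-side diff.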