https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84200
--- Comment #4 from Martin Liška <marxin at gcc dot gnu.org> ---
So the small benchmark spends most of its time in the first function:

  75.08%  lbm_r_peak.gcc7  lbm_r_peak.gcc7-m64  [.] LBM_performStreamCollideTRT
   0.76%  lbm_r_peak.gcc7  lbm_r_peak.gcc7-m64  [.] LBM_handleInOutFlow
   0.71%  specperl         specperl             [.] Perl_hv_common

Function disassembly:

         :      Disassembly of section .text:
         :
         :      0000000000401910 <LBM_performStreamCollideTRT>:
         :      LBM_performStreamCollideTRT():
    0.00 :        401910:       sub    $0x98,%rsp
    0.00 :        401917:       lea    0xc65d400(%rdi),%rdx
    0.00 :        40191e:       jmpq   401efc <LBM_performStreamCollideTRT+0x5ec>
    0.00 :        401923:       nopl   0x0(%rax,%rax,1)
    0.00 :        401928:       mov    0x23e1(%rip),%rax        # 403d10 <_IO_stdin_used+0x1c0>
    0.00 :        40192f:       movsd  0x23f9(%rip),%xmm2        # 403d30 <_IO_stdin_used+0x1e0>
    0.00 :        401937:       pxor   %xmm10,%xmm10
    0.00 :        40193c:       movsd  0x23d4(%rip),%xmm0        # 403d18 <_IO_stdin_used+0x1c8>
    0.00 :        401944:       movsd  0x23dc(%rip),%xmm5        # 403d28 <_IO_stdin_used+0x1d8>
...
    0.23 :        40238c:       movsd  0x88(%rdi),%xmm0
    4.10 :        402394:       movsd  %xmm0,-0x1868e0(%rsi)
    0.39 :        40239c:       movsd  0x90(%rdi),%xmm0
    4.90 :        4023a4:       movsd  %xmm0,0x186b18(%rsi)
    3.68 :        4023ac:       jmpq   401ee5 <LBM_performStreamCollideTRT+0x5d5>
    0.00 :        4023b1:       add    $0x98,%rsp
    0.00 :        4023b8:       retq

The function is full of SSE instructions, and it's a huge loop nest. The only
difference after the problematic revision is that the function now starts at
0000000000401910. Before the commit it used to start at 0000000000401900.
There's no difference in the generated assembly.

I experimented with the address of the function, and it becomes fast again at
address 0000000000401940. So I'm blaming some instruction cache in the Zen CPU,
where the newly placed code probably crosses a boundary and thus is much
slower. Note that the slowdown is really huge for such a small memory layout
change.

In order to make it fast, one can add: -falign-functions=32.
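
To make the alignment argument concrete, here is a small sketch that computes how far each of the three start addresses quoted above sits past a 16-, 32- and 64-byte boundary. The block sizes are chosen for illustration only and are an assumption, as are the helper name offset_in_block and the labels; nothing here is taken from the GCC sources or from AMD documentation.

```python
# Function start addresses quoted in the comment above; the labels are ours.
ADDRESSES = {
    "before the commit (fast)": 0x401900,
    "after the commit (slow)": 0x401910,
    "experiment (fast again)": 0x401940,
}

# Candidate block sizes in bytes. Assumed for illustration; the comment
# itself does not say which boundary the Zen front end cares about.
BLOCK_SIZES = (16, 32, 64)


def offset_in_block(address: int, block_size: int) -> int:
    """Return how many bytes `address` lies past the previous boundary."""
    return address % block_size


if __name__ == "__main__":
    for label, address in ADDRESSES.items():
        offsets = ", ".join(
            f"mod {size} = {offset_in_block(address, size):2d}"
            for size in BLOCK_SIZES
        )
        print(f"{address:#x}: {offsets}  ({label})")
```

Both fast addresses come out as multiples of 32 (and of 64), while the slow one is only 16-byte aligned. That matches the observation that -falign-functions=32 restores the speed, but it does not by itself show which hardware structure is responsible.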