https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84200

--- Comment #4 from Martin Liška <marxin at gcc dot gnu.org> ---
So the small benchmark spends all the time in the first function:

  75.08%  lbm_r_peak.gcc7  lbm_r_peak.gcc7-m64          [.] LBM_performStreamCollideTRT
   0.76%  lbm_r_peak.gcc7  lbm_r_peak.gcc7-m64          [.] LBM_handleInOutFlow 
   0.71%  specperl         specperl                     [.] Perl_hv_common

Function disassembly:

         :      Disassembly of section .text:
         :
         :      0000000000401910 <LBM_performStreamCollideTRT>:
         :      LBM_performStreamCollideTRT():
    0.00 :        401910:       sub    $0x98,%rsp
    0.00 :        401917:       lea    0xc65d400(%rdi),%rdx
    0.00 :        40191e:       jmpq   401efc <LBM_performStreamCollideTRT+0x5ec>
    0.00 :        401923:       nopl   0x0(%rax,%rax,1)
    0.00 :        401928:       mov    0x23e1(%rip),%rax        # 403d10 <_IO_stdin_used+0x1c0>
    0.00 :        40192f:       movsd  0x23f9(%rip),%xmm2        # 403d30 <_IO_stdin_used+0x1e0>
    0.00 :        401937:       pxor   %xmm10,%xmm10
    0.00 :        40193c:       movsd  0x23d4(%rip),%xmm0        # 403d18 <_IO_stdin_used+0x1c8>
    0.00 :        401944:       movsd  0x23dc(%rip),%xmm5        # 403d28 <_IO_stdin_used+0x1d8>
...
    0.23 :        40238c:       movsd  0x88(%rdi),%xmm0
    4.10 :        402394:       movsd  %xmm0,-0x1868e0(%rsi)
    0.39 :        40239c:       movsd  0x90(%rdi),%xmm0
    4.90 :        4023a4:       movsd  %xmm0,0x186b18(%rsi)
    3.68 :        4023ac:       jmpq   401ee5 <LBM_performStreamCollideTRT+0x5d5>
    0.00 :        4023b1:       add    $0x98,%rsp
    0.00 :        4023b8:       retq   

The function is full of SSE instructions, and it's a huge loop nest.
The difference after the problematic revision is that the function now starts
at 0000000000401910; before the commit it started at 0000000000401900. There's
no difference in the generated assembly. I experimented with the address of the
function, and it becomes fast again at address 0000000000401940. So I'm blaming
some instruction cache in the Zen CPU, where the newly generated assembly
probably crosses a boundary and is thus much slower. Note that the slowdown is
really huge for such a small memory layout change. To make it fast, one can
add -falign-functions=32.
