https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90056
--- Comment #1 from Martin Liška <marxin at gcc dot gnu.org> ---
Created attachment 46169
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46169&action=edit
perf annotate - Ofast native vs. Ofast native PGO

I'm attaching HTML and txt perf annotate output for the Ofast native and Ofast native PGO builds.
As seen, it's still the same story: high register pressure leads to spilling of some of the induction variables.
For these builds, the most significant difference is:

GOOD:

         :            if(block(row, 4, i4) <= 0) cycle
    0.00 :   41c660:  mov    (%r9),%r12d
    1.99 :   41c663:  mov    %r11d,0x80(%rsp)
    0.11 :   41c66b:  mov    %r11d,%edx
    0.02 :   41c66e:  test   %r12d,%r12d
    0.15 :   41c671:  jg     41c7b0 <__brute_force_MOD_digits_2+0xe00>
    0.01 :   41c677:  inc    %r11
    0.64 :   41c67a:  add    $0x144,%r9
    0.13 :   41c681:  add    $0x144,%r8
    0.05 :   41c688:  add    $0x144,%r10
         :            do i4 = l(4), u(4)
    0.15 :   41c68f:  cmp    %r11d,0x6c(%rsp)
    2.39 :   41c694:  jge    41c660 <__brute_force_MOD_digits_2+0xcb0>
    0.00 :   41c696:  mov    0x168(%rsp),%r10
    0.55 :   41c69e:  mov    0x170(%rsp),%r9
    0.08 :   41c6a6:  mov    0x178(%rsp),%r11
    0.05 :   41c6ae:  mov    0x180(%rsp),%r8
         :            block(row, 4:9, i3) = block(row, 4:9, i3) + 10

BAD:

         :            if(block(row, 4, i4) <= 0) cycle
    0.05 :   41a8b0:  mov    (%r11),%edi
    0.78 :   41a8b3:  mov    %r10d,0x84(%rsp)
    0.04 :   41a8bb:  mov    %r10d,%r13d
    0.01 :   41a8be:  test   %edi,%edi
    0.26 :   41a8c0:  jg     41aa10 <__brute_force_MOD_digits_2+0x1210>
    0.44 :   41a8c6:  addq   $0x144,0x48(%rsp)
    4.04 :   41a8cf:  addq   $0x144,0x58(%rsp)
    1.31 :   41a8d8:  inc    %r10
    0.02 :   41a8db:  add    $0x144,%r11
         :            do i4 = l(4), u(4)
    0.01 :   41a8e2:  cmp    %r10d,0x88(%rsp)
    0.25 :   41a8ea:  jge    41a8b0 <__brute_force_MOD_digits_2+0x10b0>
         :            block(row, 4:9, i3) = block(row, 4:9, i3) + 10
    0.03 :   41a8ec:  mov    0xd0(%rsp),%r15
    0.27 :   41a8f4:  addl   $0xa,-0xdc(%r15)
    0.20 :   41a8fc:  addl   $0xa,-0xb8(%r15)
    0.01 :   41a904:  addl   $0xa,-0x94(%r15)
    0.07 :   41a90c:  addl   $0xa,-0x70(%r15)
    0.05 :   41a911:  addl   $0xa,-0x4c(%r15)
    0.06 :   41a916:  addl   $0xa,-0x28(%r15)

The benchmark is quite unpredictable, so I'm leaving that for now.
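
For anyone cross-checking the listings: the constants in them can be tied back to the Fortran source with a little address arithmetic. The sketch below is in C and only reproduces the index math. It assumes, without confirmation from this comment, that block is a 9x9x9 array of 4-byte integers stored column-major; the names block_offset, stride_dim2 and stride_dim3 are made up for illustration. Under that assumption, stepping i4 by one moves a pointer by 0x144 bytes, which matches the three strided pointers advanced in the i4 loop (all three in registers in GOOD, two of them in stack slots via addq in BAD), and stepping the second index by one moves it by 0x24 bytes, which matches the spacing of the six addl targets (-0xdc ... -0x28).

```c
#include <stddef.h>

/* Assumption (not stated in the comment): integer(4) :: block(9, 9, 9),
 * laid out column-major as Fortran does. */
enum { N = 9, ELEM_BYTES = 4 };

/* Byte offset of block(i, j, k) from block(1, 1, 1), 1-based indices. */
size_t block_offset(int i, int j, int k)
{
    size_t elems = (size_t)(i - 1)
                 + (size_t)N * (size_t)(j - 1)
                 + (size_t)N * (size_t)N * (size_t)(k - 1);
    return (size_t)ELEM_BYTES * elems;
}

/* Distance between block(row, 4, i4) and block(row, 4, i4 + 1):
 * the per-iteration pointer increment of the i4 loop. */
size_t stride_dim3(void)
{
    return block_offset(1, 4, 2) - block_offset(1, 4, 1);
}

/* Distance between block(row, c, i3) and block(row, c + 1, i3):
 * the spacing of the unrolled block(row, 4:9, i3) updates. */
size_t stride_dim2(void)
{
    return block_offset(1, 5, 1) - block_offset(1, 4, 1);
}
```

If the assumed shape is right, stride_dim3() equals the 0x144 seen in the add/addq instructions and stride_dim2() equals the 0x24 gap between consecutive addl displacements.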