https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116410
Bug ID: 116410
Summary: fat-lto-objects generates different and inefficient code compared with no-fat-lto-objects
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: ipa
Assignee: unassigned at gcc dot gnu.org
Reporter: yinyuefengyi at gmail dot com
Target Milestone: ---

It seems unexpected that -ffat-lto-objects generates different code from -fno-fat-lto-objects (the GCC default). Unfortunately, the code produced with -ffat-lto-objects is slower than the code produced with -fno-fat-lto-objects, by 2% or more. Even worse, many OS releases such as Fedora/Red Hat add -ffat-lto-objects as a global flag when building the distribution, which means every package built that way is slowed down.

One typical example was found in zstd and can be reproduced as follows:

    git clone https://github.com/facebook/zstd.git
    cd zstd/programs

    export CFLAGS="-O2 -flto=auto -g"
    make zstd V=1 -j
    gdb -batch -ex "disassemble/r ZSTD_rescaleFreqs" zstd > nofat.asm

    export CFLAGS="-O2 -flto=auto -g -ffat-lto-objects"
    make clean
    make zstd V=1 -j
    gdb -batch -ex "disassemble/r ZSTD_rescaleFreqs" zstd > fat.asm

An excerpt of the code generated with fat-lto-objects:

    0x000000000043aca3 <+99>:   0f 29 54 24 10            movaps %xmm2,0x10(%rsp)
    0x000000000043aca8 <+104>:  0f 29 04 24               movaps %xmm0,(%rsp)
    0x000000000043acac <+108>:  0f 29 54 24 20            movaps %xmm2,0x20(%rsp)
    0x000000000043acb1 <+113>:  0f 29 54 24 30            movaps %xmm2,0x30(%rsp)
    0x000000000043acb6 <+118>:  0f 29 54 24 40            movaps %xmm2,0x40(%rsp)
    0x000000000043acbb <+123>:  0f 29 54 24 50            movaps %xmm2,0x50(%rsp)
    0x000000000043acc0 <+128>:  0f 29 54 24 60            movaps %xmm2,0x60(%rsp)
    0x000000000043acc5 <+133>:  0f 29 54 24 70            movaps %xmm2,0x70(%rsp)
    0x000000000043acca <+138>:  0f 29 94 24 80 00 00 00   movaps %xmm2,0x80(%rsp)
    0x000000000043acd2 <+146>:  0f 11 00                  movups %xmm0,(%rax)
    0x000000000043acd5 <+149>:  66 0f 6f 7c 24 10         movdqa 0x10(%rsp),%xmm7
    0x000000000043acdb <+155>:  66 0f ef c0               pxor   %xmm0,%xmm0
    0x000000000043acdf <+159>:  0f 11 78 10               movups %xmm7,0x10(%rax)
    0x000000000043ace3 <+163>:  66 0f 6f 7c 24 20         movdqa 0x20(%rsp),%xmm7
    0x000000000043ace9 <+169>:  0f 11 78 20               movups %xmm7,0x20(%rax)
    0x000000000043aced <+173>:  66 0f 6f 7c 24 30         movdqa 0x30(%rsp),%xmm7
    0x000000000043acf3 <+179>:  0f 11 78 30               movups %xmm7,0x30(%rax)
    0x000000000043acf7 <+183>:  66 0f 6f 7c 24 40         movdqa 0x40(%rsp),%xmm7
    0x000000000043acfd <+189>:  0f 11 78 40               movups %xmm7,0x40(%rax)
    0x000000000043ad01 <+193>:  66 0f 6f 7c 24 50         movdqa 0x50(%rsp),%xmm7
    0x000000000043ad07 <+199>:  0f 11 78 50               movups %xmm7,0x50(%rax)
    0x000000000043ad0b <+203>:  66 0f 6f 74 24 60         movdqa 0x60(%rsp),%xmm6
    0x000000000043ad11 <+209>:  0f 11 70 60               movups %xmm6,0x60(%rax)
    0x000000000043ad15 <+213>:  66 0f 6f 7c 24 70         movdqa 0x70(%rsp),%xmm7
    0x000000000043ad1b <+219>:  0f 11 78 70               movups %xmm7,0x70(%rax)

The same piece of code generated with no-fat-lto-objects:

    0x000000000043ab03 <+99>:   0f 11 50 10               movups %xmm2,0x10(%rax)
    0x000000000043ab07 <+103>:  0f 11 00                  movups %xmm0,(%rax)
    0x000000000043ab0a <+106>:  0f 29 04 24               movaps %xmm0,(%rsp)
    0x000000000043ab0e <+110>:  66 0f ef c0               pxor   %xmm0,%xmm0
    0x000000000043ab12 <+114>:  0f 11 50 20               movups %xmm2,0x20(%rax)
    0x000000000043ab16 <+118>:  0f 11 50 30               movups %xmm2,0x30(%rax)
    0x000000000043ab1a <+122>:  0f 11 50 40               movups %xmm2,0x40(%rax)
    0x000000000043ab1e <+126>:  0f 11 50 50               movups %xmm2,0x50(%rax)
    0x000000000043ab22 <+130>:  0f 11 50 60               movups %xmm2,0x60(%rax)
    0x000000000043ab26 <+134>:  0f 11 50 70               movups %xmm2,0x70(%rax)
    0x000000000043ab2a <+138>:  0f 11 90 80 00 00 00      movups %xmm2,0x80(%rax)
    0x000000000043ab31 <+145>:  48 89 e0                  mov    %rsp,%rax
    0x000000000043ab34 <+148>:  0f 29 54 24 10            movaps %xmm2,0x10(%rsp)
    0x000000000043ab39 <+153>:  0f 29 54 24 20            movaps %xmm2,0x20(%rsp)
    0x000000000043ab3e <+158>:  0f 29 54 24 30            movaps %xmm2,0x30(%rsp)
    0x000000000043ab43 <+163>:  0f 29 54 24 40            movaps %xmm2,0x40(%rsp)
    0x000000000043ab48 <+168>:  0f 29 54 24 50            movaps %xmm2,0x50(%rsp)
    0x000000000043ab4d <+173>:  0f 29 54 24 60            movaps %xmm2,0x60(%rsp)
    0x000000000043ab52 <+178>:  0f 29 54 24 70            movaps %xmm2,0x70(%rsp)
    0x000000000043ab57 <+183>:  0f 29 94 24 80 00 00 00   movaps %xmm2,0x80(%rsp)
    0x000000000043ab5f <+191>:  90                        nop

In the fat-lto-objects version, the values are first stored to the stack and then reloaded from it (movdqa) before being stored through %rax; the no-fat-lto-objects version stores %xmm2 through %rax directly.

I did an initial investigation and found that the summary information starts to differ in the ipa-modref pass:

    /* Compute no-LTO summaries when local optimization is going to happen.  */
    bool nolto = (!ipa
                  || ((!flag_lto || flag_fat_lto_objects) && !in_lto_p)
                  || (in_lto_p && !flag_wpa
                      && flag_incremental_link != INCREMENTAL_LINK_LTO));

nolto is true with -ffat-lto-objects but false with -fno-fat-lto-objects, so the summaries computed afterwards differ. This results in different alias analysis information, and DSE fails to remove the redundant loads/stores to the stack.

Is this a valid bug?