[Bug target/99083] New: Big run-time regressions of 519.lbm_r with LTO

jamborm at gcc dot gnu.org via Gcc-bugs Fri, 12 Feb 2021 15:32:41 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99083


            Bug ID: 99083
           Summary: Big run-time regressions of 519.lbm_r with LTO
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: ubizjak at gmail dot com
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

On AMD Zen2 CPUs, 519.lbm_r is 62.12% slower when built with -O2 and
-flto than when built without LTO.  It is also 62.12% slower than when
built by GCC 10 with the same two options.  My measurements match
those from LNT on a different Zen2 machine:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=325.477.0&plot.1=312.477.0&plot.2=349.477.0&plot.3=278.477.0&plot.4=401.477.0&plot.5=298.477.0

Compiling the benchmark with -Ofast -march=native -flto also produces
code that is slower than the non-LTO build, by 8.07% on Zen2 and 6.06%
on Zen3.  The Zen2 case has also been caught by LNT:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=295.477.0&plot.1=293.477.0&plot.2=287.477.0&plot.3=286.477.0&;

I have bisected both of these regressions (on Zen2s) to:

  commit 4c61e35f20fe2ffeb9421dbd6f26c767a234a4a0
  Author: Uros Bizjak <[email protected]>
  Date:   Wed Dec 9 21:06:07 2020 +0100

      i386: Remove REG_ALLOC_ORDER definition

      REG_ALLOC_ORDER just defines what the default is set to.

      2020-12-09  Uroš Bizjak  <[email protected]>

      gcc/    
              * config/i386/i386.h (REG_ALLOC_ORDER): Remove

...which looks like it was supposed to be a no-op, but I looked at the
-O2 LTO case and the assembly generated by this commit definitely
differs from the assembly produced by the previous one in instruction
selection, spilling and even some scheduling.  For example, I see
hunks like:

@@ -994,10 +996,10 @@
        movapd  %xmm13, %xmm9
        movsd   96(%rsp), %xmm13
        subsd   %xmm12, %xmm9
-       movsd   256(%rsp), %xmm12
+       movq    %rbx, %xmm12
+       mulsd   %xmm6, %xmm12
        movsd   %xmm5, 15904(%rdx)
        movsd   72(%rax), %xmm5
-       mulsd   %xmm6, %xmm12
        mulsd   %xmm0, %xmm9
        subsd   %xmm10, %xmm5
        movsd   216(%rsp), %xmm10

The -Ofast native LTO assemblies also differ.
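
To separate real code changes from mere reordering in a hunk like the
one above, one option is to compare the removed and added instructions
as multisets.  The following is only a minimal sketch: the helper name
classify and the HUNK string are illustrative and not part of any GCC
tooling, and only the changed lines of the hunk are copied in.

```python
from collections import Counter

# Changed lines copied from the hunk quoted in this report
# (context lines omitted).
HUNK = """\
-       movsd   256(%rsp), %xmm12
+       movq    %rbx, %xmm12
+       mulsd   %xmm6, %xmm12
-       mulsd   %xmm6, %xmm12
"""


def classify(hunk):
    """Return (removed, added) Counters of instructions in a diff hunk."""
    removed, added = Counter(), Counter()
    for line in hunk.splitlines():
        if not line or line[0] not in "+-":
            continue  # skip context lines and blanks
        insn = " ".join(line[1:].split())  # normalize whitespace
        (added if line[0] == "+" else removed)[insn] += 1
    return removed, added


removed, added = classify(HUNK)
# Counter subtraction drops instructions present on both sides,
# i.e. those that only moved within the hunk.
only_removed = removed - added
only_added = added - removed
print("gone:", sorted(only_removed))
print("new: ", sorted(only_added))
```

For this hunk the mulsd cancels out (it only moved), which leaves the
reload from the stack slot 256(%rsp) on the removed side and the movq
from %rbx on the added side.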


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
