https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119055

            Bug ID: 119055
           Summary: [15 Regression] 5-8% slowdown of 456.hmmer since
                    r15-7605-gc5752c1f01316a
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pheeck at gcc dot gnu.org
                CC: rsandifo at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

As seen here

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=468.180.0

there was an 8% execution-time slowdown of the 456.hmmer SPEC 2006
benchmark when run with -Ofast -march=native and PGO on an AMD Zen3 machine.
I bisected it to r15-7605-gc5752c1f01316a:

Author: Richard Sandiford <richard.sandif...@arm.com>
Date:   Tue Feb 18 11:00:57 2025 +0000

    late-combine: Tighten register class check [PR108840]

    gcc.target/aarch64/pr108840.c has failed since r15-268-g9dbff9c05520
    (which means that I really ought to have looked at it earlier).

    The test wants us to fold an SImode AND into all shifts that use it.
    This is something that late-combine is supposed to do, but:

    (1) the pre-RA pass chickened out because of a register pressure check

    (2) the post-RA pass can't handle it, because the shift uses are in
        QImode and the sets are in SImode

    Both are things that would be good to fix.  But (1) is particularly
    silly.  The constraints on the AND have "rk" for the destination
    (so allowing the stack pointer) and "r" for the first source.
    Including the stack pointer made the destination seem more permissive
    than the source.

    The intention was instead to check whether there are any
    *allocatable* registers in the destination class that aren't
    present in the source.

    That's enough for all tests but the last one.  The last one still
    fails because combine merges the final shift with the move into
    the hard return register, giving an arithmetic instruction with
    a hard register destination.  Pre-RA late-combine currently punts
    on those, again due to register pressure concerns.  That too is
    something I'd like to relax, but not for GCC 15.  In the interim,
    the best thing seems to be to disable combine for the test.

    gcc/
            PR rtl-optimization/108840
            * late-combine.cc (late_combine::check_register_pressure):
            Take only allocatable registers into account when checking
            the permissiveness of register classes.

    gcc/testsuite/
            PR rtl-optimization/108840
            * gcc.target/aarch64/pr108840.c: Run at -O2 but disable combine.

 gcc/late-combine.cc                         | 10 ++++++++--
 gcc/testsuite/gcc.target/aarch64/pr108840.c |  2 +-
 2 files changed, 9 insertions(+), 3 deletions(-)
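
To make the change described in the commit message concrete, here is a minimal
sketch of the two ways of comparing register classes. It is only a model: the
type RegSet, the bit layout, the constants and both function names are
hypothetical and are not taken from late-combine.cc. It assumes one bit per
hard register, with "r" covering 31 general registers and "rk" adding the
stack pointer, which is never allocatable.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: one bit per hard register. None of these names or
// bit assignments come from GCC; they only illustrate the idea.
using RegSet = std::uint32_t;

constexpr RegSet GENERAL_REGS  = 0x7fffffffu;               // models "r"
constexpr RegSet STACK_REG     = 0x80000000u;               // models sp
constexpr RegSet GENERAL_OR_SP = GENERAL_REGS | STACK_REG;  // models "rk"
constexpr RegSet ALLOCATABLE   = GENERAL_REGS;              // sp is excluded

// Looser check: any register in dest that is missing from src makes the
// destination class look more permissive than the source class.
constexpr bool dest_more_permissive_any(RegSet dest, RegSet src) {
    return (dest & ~src) != 0;
}

// Tightened check: only allocatable registers are taken into account.
constexpr bool dest_more_permissive_allocatable(RegSet dest, RegSet src,
                                                RegSet allocatable) {
    return (dest & allocatable & ~src) != 0;
}

// Exercises both checks on the "rk" destination / "r" source pair.
inline void demo() {
    // Counting sp, the destination appears more permissive.
    assert(dest_more_permissive_any(GENERAL_OR_SP, GENERAL_REGS));
    // Restricted to allocatable registers, it does not.
    assert(!dest_more_permissive_allocatable(GENERAL_OR_SP, GENERAL_REGS,
                                             ALLOCATABLE));
}
```

In this model the looser check reports the "rk" destination as more permissive
than the "r" source purely because of the stack pointer, while the
allocatable-only check does not. That would let the pre-RA pass perform more
substitutions than before, which may be relevant to the slowdown reported here.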

This is a regression against GCC 14. See the comparison
here:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.8=1047.180.0&plot.9=468.180.0


There were also these 456.hmmer slowdowns in the same timeframe (so probably
caused by the same commit):

5% Zen4 -Ofast -march=generic
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=967.180.0

5% Zen4 -O2 -march=native
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=961.180.0
(although here the graph is noisy)


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
