https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

            Bug ID: 98782
           Summary: IRA artificially creating spills due to BB frequencies
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: ra
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
                CC: jgreenhalgh at gcc dot gnu.org, vmakarov at gcc dot gnu.org
  Target Milestone: ---

Created attachment 50020
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50020&action=edit
dumps and source

Hi,

James and I have been investigating the exchange2 regression that has been
haunting Trunk

since:

        commit 1118a3ff9d3ad6a64bba25dc01e7703325e23d92
        Author: Jan Hubicka <j...@suse.cz>
        Date:   Tue Aug 11 12:02:32 2020 +0200
    Do not combine PRED_LOOP_GUARD and PRED_LOOP_GUARD_WITH_RECURSION

That patch fixed the prediction frequency for the basic blocks in some
of the recursively specialized functions in exchange2. Unfortunately, it
also created a large (12+%) performance regression on multiple targets
(including x86).

After initially blaming the new prediction frequencies, and significant work
from Jan and Martin we have good confidence in the probabilities, however they
appear to be exposing issues with the probability-based cost models in IRA
causing additional spills after artificially limiting register pressure by
excluding caller-saved registers across a call site in the loop.

This testcase (tuned) for AArch64 shows the issue:

    void bar (int, int, int, int);
    int foo (int x, char* foo) {
      int tmp = x * 753;
      int t2 = tmp + 7;
      int t3 = tmp * 7;
      int c1 = 753;
      int c2 = c1 + 7;
      int c3 = c3 * 7;
      for (int i = 0; i < 1024; i++) {
        if (__builtin_expect_with_probability (foo[i] != 0, 1, SPILLER))
          bar(x, tmp, t2, t3);
        c1 += foo[i+1];
        c2 *= foo[i+1];
        c3 += c2;
      }
      return c1 + c2 + c3;
    }

You can see the difference in the basic block labeled with L2 (using ffixed
to provoke register pressure):

Good compile:
  gcc -DSPILLER=0.5 -fno-shrink-wrap -fno-schedule-insns -O3   -ffixed-x23
-ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-x28 -fno-reorder-blocks

    .L2:
        ldrb    w0, [x19, 1]!
        add     w22, w22, w0
        mul     w20, w20, w0
        add     w21, w21, w20
        cmp     x7, x19
        bne     .L5

Bad compile:
  gcc -DSPILLER=0.51 -fno-shrink-wrap -fno-schedule-insns -O3   -ffixed-x23
-ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-x28 -fno-reorder-blocks

    .L2:
        ldrb    w0, [x19, 1]!
        add     w21, w21, w0
        mul     w20, w20, w0
        ldr     x0, [sp, 64]         <<<<<< Reload of x0
        add     w22, w22, w20
        cmp     x0, x19
        bne     .L5


Neither of us are an expert in this area by any means, I think what we're
seeing can
be explained by this line in the IRA dump:

Good:
   Allocno a5r104 of GENERAL_REGS(24) has 17 avail. regs  4-15 18-22, node: 
4-15 18-22 (confl regs =  0-3 16-17 23-85)
Bad:
    Allocno a5r104 of GENERAL_REGS(24) has 4 avail. regs  19-22, node:  19-22
(confl regs =  0-3 16-17 23-85)

In the bad case it looks like all the available registers go down, but these
particular ones have so few left over that it causes the spill to occur.

The change in available registers comes from this code in
setup_profitable_hardregs:

              if (ALLOCNO_UPDATED_MEMORY_COST (a) < costs[j]
                  /* Do not remove HARD_REGNO for static chain pointer
                     pseudo when non-local goto is used.  */
                  && ! non_spilled_static_chain_regno_p (ALLOCNO_REGNO (a)))
                CLEAR_HARD_REG_BIT (data->profitable_hard_regs,
                                    hard_regno);

Both sides of the ALLOCNO_UPDATED_MEMORY_COST (a) < costs[j] calculation
make use of frequency, but there is some asymmetry.

In the case of the bigger exchange2 regression if just revert
1118a3ff9d3ad6a64bba25dc01e7703325e23d92 which affects the BB frequencies we
get a slightly higher score than in GCC 10, indicating that the changes in IPA
are indeed sound.

It also gives us a comparison where the entire sequence up to reload is exactly
the same, aside from the counts in the BB and the frequencies.

Good BB:

;;  succ:       66 [always]  count:53687092 (estimated locally) (FALLTHRU)
;; lr  out       29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438
;; live  out     29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438

;; basic block 66, loop depth 0, count 107374184 (estimated locally), maybe hot
;;  prev block 65, next block 67, flags: (REACHABLE, RTL, MODIFIED)
;;  pred:       64 [50.0% (guessed)]  count:53687092 (estimated locally)
;;              65 [always]  count:53687092 (estimated locally) (FALLTHRU)

Bad BB:

;;  succ:       66 [always]  count:3487081 (estimated locally) (FALLTHRU)
;; lr  out       29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438
;; live  out     29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438

;; basic block 66, loop depth 0, count 6974163 (estimated locally), maybe hot
;;  prev block 65, next block 67, flags: (REACHABLE, RTL, MODIFIED)
;;  pred:       64 [50.0% (guessed)]  count:3487082 (estimated locally)
;;              65 [always]  count:3487081 (estimated locally) (FALLTHRU)

>From there everything seems to change including the costs for register classes

Good:

a8(r112,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:1445,14170
FP_LO_REGS:1445,14170 FP_REGS:1445,14170 POINTER_AND_FP_REGS:1445,14170
MEM:1156,11336
a45(r112,l1) costs: GENERAL_REGS:0,0 FP_LO8_REGS:12725,12725
FP_LO_REGS:12725,12725 FP_REGS:12725,12725 POINTER_AND_FP_REGS:12725,12725
MEM:10180,10180
Allocno a8r112 of GENERAL_REGS(30) has 26 avail. regs  1-15 18-28, node:  1-15
18-28 (confl regs =  0 16-17 29-85)

Bad:

a8(r112,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:85,610 FP_LO_REGS:85,610
FP_REGS:85,610 POINTER_AND_FP_REGS:85,610 MEM:68,488
a45(r112,l1) costs: GENERAL_REGS:0,0 FP_LO8_REGS:525,525 FP_LO_REGS:525,525
FP_REGS:525,525 POINTER_AND_FP_REGS:525,525 MEM:420,420
Allocno a8r112 of GENERAL_REGS(30) has 10 avail. regs  19-28, node:  19-28
(confl regs =  0 16-17 29-85)

and a new spill a89 is created for a value that is actually live through all
branches inside the loop:

Good IRA sets:

    Hard reg set forest:
      0:( 0-28 30 32-63)@0
        1:( 0-17 19-25 27-28 30)@112080
          2:( 1-15 19-25 27-28)@196432
            3:( 19-25 27-28)@280

Bad IRA sets:

    Hard reg set forest:
      0:( 0-28 30 32-63)@0
        1:( 0-22 24 28 30)@121976
          2:( 1-15 18-22 24 28)@116576
            3:( 19-22 24 28)@8864
      Spill a87(r99,l2)
      Spill a89(r112,l2)

In the exchange2 example the function call seems thousands of instructions
away, but is part of the main loop.
The function call is self recursive call (before specialization) and only needs
x0.  So it looks like caller saved registers are being severely
penalized here.

That is to say, the likelihood of RA needing a caller-saved register is so high
it should just use it, even in the
presence of the function call.  Since whether it spills the caller-saved or the
temp register as it's trying now would
make no difference if it has to spill at the call site.  But if it doesn't then
this would result in better code.

We would appreciate any tips/help/workarounds we could do to mitigate this as
the regression is quite significant.

I have attached for convenience the dump files and generated code for the good
and bad cases for the example above.

In exchange2 this code is generated in the
__brute_force_MOD_digits_2.constprop.4 specialization.

Reply via email to