https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782
Bug ID: 98782 Summary: IRA artificially creating spills due to BB frequencies Product: gcc Version: 11.0 Status: UNCONFIRMED Keywords: ra Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org CC: jgreenhalgh at gcc dot gnu.org, vmakarov at gcc dot gnu.org Target Milestone: --- Created attachment 50020 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50020&action=edit dumps and source Hi, James and I have been investigating the exchange2 regression that has been haunting Trunk since: commit 1118a3ff9d3ad6a64bba25dc01e7703325e23d92 Author: Jan Hubicka <j...@suse.cz> Date: Tue Aug 11 12:02:32 2020 +0200 Do not combine PRED_LOOP_GUARD and PRED_LOOP_GUARD_WITH_RECURSION That patch fixed the prediction frequency for the basic blocks in some of the recursively specialized functions in exchange2. Unfortunately, it also created a large (12+%) performance regression on multiple targets (including x86). After initially blaming the new prediction frequencies, and significant work from Jan and Martin we have good confidence in the probabilities, however they appear to be exposing issues with the probability-based cost models in IRA causing additional spills after artificially limiting register pressure by excluding caller-saved registers across a call site in the loop. This testcase (tuned) for AArch64 shows the issue: void bar (int, int, int, int); int foo (int x, char* foo) { int tmp = x * 753; int t2 = tmp + 7; int t3 = tmp * 7; int c1 = 753; int c2 = c1 + 7; int c3 = c3 * 7; for (int i = 0; i < 1024; i++) { if (__builtin_expect_with_probability (foo[i] != 0, 1, SPILLER)) bar(x, tmp, t2, t3); c1 += foo[i+1]; c2 *= foo[i+1]; c3 += c2; } return c1 + c2 + c3; } You can see the difference in the basic block labeled with L2 (using ffixed to provoke register pressure): Good compile: gcc -DSPILLER=0.5 -fno-shrink-wrap -fno-schedule-insns -O3 -ffixed-x23 -ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-x28 -fno-reorder-blocks .L2: ldrb w0, [x19, 1]! add w22, w22, w0 mul w20, w20, w0 add w21, w21, w20 cmp x7, x19 bne .L5 Bad compile: gcc -DSPILLER=0.51 -fno-shrink-wrap -fno-schedule-insns -O3 -ffixed-x23 -ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-x28 -fno-reorder-blocks .L2: ldrb w0, [x19, 1]! add w21, w21, w0 mul w20, w20, w0 ldr x0, [sp, 64] <<<<<< Reload of x0 add w22, w22, w20 cmp x0, x19 bne .L5 Neither of us are an expert in this area by any means, I think what we're seeing can be explained by this line in the IRA dump: Good: Allocno a5r104 of GENERAL_REGS(24) has 17 avail. regs 4-15 18-22, node: 4-15 18-22 (confl regs = 0-3 16-17 23-85) Bad: Allocno a5r104 of GENERAL_REGS(24) has 4 avail. regs 19-22, node: 19-22 (confl regs = 0-3 16-17 23-85) In the bad case it looks like all the available registers go down, but these particular ones have so few left over that it causes the spill to occur. The change in available registers comes from this code in setup_profitable_hardregs: if (ALLOCNO_UPDATED_MEMORY_COST (a) < costs[j] /* Do not remove HARD_REGNO for static chain pointer pseudo when non-local goto is used. */ && ! non_spilled_static_chain_regno_p (ALLOCNO_REGNO (a))) CLEAR_HARD_REG_BIT (data->profitable_hard_regs, hard_regno); Both sides of the ALLOCNO_UPDATED_MEMORY_COST (a) < costs[j] calculation make use of frequency, but there is some asymmetry. In the case of the bigger exchange2 regression if just revert 1118a3ff9d3ad6a64bba25dc01e7703325e23d92 which affects the BB frequencies we get a slightly higher score than in GCC 10, indicating that the changes in IPA are indeed sound. It also gives us a comparison where the entire sequence up to reload is exactly the same, aside from the counts in the BB and the frequencies. Good BB: ;; succ: 66 [always] count:53687092 (estimated locally) (FALLTHRU) ;; lr out 29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438 ;; live out 29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438 ;; basic block 66, loop depth 0, count 107374184 (estimated locally), maybe hot ;; prev block 65, next block 67, flags: (REACHABLE, RTL, MODIFIED) ;; pred: 64 [50.0% (guessed)] count:53687092 (estimated locally) ;; 65 [always] count:53687092 (estimated locally) (FALLTHRU) Bad BB: ;; succ: 66 [always] count:3487081 (estimated locally) (FALLTHRU) ;; lr out 29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438 ;; live out 29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438 ;; basic block 66, loop depth 0, count 6974163 (estimated locally), maybe hot ;; prev block 65, next block 67, flags: (REACHABLE, RTL, MODIFIED) ;; pred: 64 [50.0% (guessed)] count:3487082 (estimated locally) ;; 65 [always] count:3487081 (estimated locally) (FALLTHRU) >From there everything seems to change including the costs for register classes Good: a8(r112,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:1445,14170 FP_LO_REGS:1445,14170 FP_REGS:1445,14170 POINTER_AND_FP_REGS:1445,14170 MEM:1156,11336 a45(r112,l1) costs: GENERAL_REGS:0,0 FP_LO8_REGS:12725,12725 FP_LO_REGS:12725,12725 FP_REGS:12725,12725 POINTER_AND_FP_REGS:12725,12725 MEM:10180,10180 Allocno a8r112 of GENERAL_REGS(30) has 26 avail. regs 1-15 18-28, node: 1-15 18-28 (confl regs = 0 16-17 29-85) Bad: a8(r112,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:85,610 FP_LO_REGS:85,610 FP_REGS:85,610 POINTER_AND_FP_REGS:85,610 MEM:68,488 a45(r112,l1) costs: GENERAL_REGS:0,0 FP_LO8_REGS:525,525 FP_LO_REGS:525,525 FP_REGS:525,525 POINTER_AND_FP_REGS:525,525 MEM:420,420 Allocno a8r112 of GENERAL_REGS(30) has 10 avail. regs 19-28, node: 19-28 (confl regs = 0 16-17 29-85) and a new spill a89 is created for a value that is actually live through all branches inside the loop: Good IRA sets: Hard reg set forest: 0:( 0-28 30 32-63)@0 1:( 0-17 19-25 27-28 30)@112080 2:( 1-15 19-25 27-28)@196432 3:( 19-25 27-28)@280 Bad IRA sets: Hard reg set forest: 0:( 0-28 30 32-63)@0 1:( 0-22 24 28 30)@121976 2:( 1-15 18-22 24 28)@116576 3:( 19-22 24 28)@8864 Spill a87(r99,l2) Spill a89(r112,l2) In the exchange2 example the function call seems thousands of instructions away, but is part of the main loop. The function call is self recursive call (before specialization) and only needs x0. So it looks like caller saved registers are being severely penalized here. That is to say, the likelihood of RA needing a caller-saved register is so high it should just use it, even in the presence of the function call. Since whether it spills the caller-saved or the temp register as it's trying now would make no difference if it has to spill at the call site. But if it doesn't then this would result in better code. We would appreciate any tips/help/workarounds we could do to mitigate this as the regression is quite significant. I have attached for convenience the dump files and generated code for the good and bad cases for the example above. In exchange2 this code is generated in the __brute_force_MOD_digits_2.constprop.4 specialization.