Wilco Dijkstra <wilco.dijks...@arm.com> writes: > Hi Richard, > >> The patch below is what I meant. It passes bootstrap & regression-test >> on aarch64-linux-gnu (and so produces the same results for the tests >> that you changed). Do you see any problems with this version? >> If not, I think we should go with it. > > Thanks for the detailed example - unfortunately there are issues with it. > Early expansion means more instructions to deal with in RTL and fewer > optimizations - it even affects inlining (I see more calls/returns in the > instruction frequencies).
Do you have an example? It seems odd that rtl changes after cfgexpand would affect inlining. > Worse, this change completely disables rematerialization of FP immediates > which implies extra spilling. A basic example goes like this: > > void g(void); > double bad_remat (double x) > { > x += 5.347897294; > g(); > x *= 5.347897294; > return x; > } > > which with -O2 -fomit-frame-pointer -ffixed-d8 -ffixed-d9 -ffixed-d10 > -ffixed-d11 -ffixed-d12 -ffixed-d13 -ffixed-d14 now compiles to: > > adrp x0, .LC0 > str x30, [sp, -32]! > ldr d31, [x0, #:lo12:.LC0] > str d15, [sp, 8] > fadd d15, d0, d31 > str d31, [sp, 24] > bl g > ldr d31, [sp, 24] > fmul d0, d15, d31 > ldr d15, [sp, 8] > ldr x30, [sp], 32 > ret Hmm, interesting. Thanks for the example. The reason this works with your patch and not mine seems to be a direct consequence of the fix for PR37273: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37273 It means that, if we start with: (set (reg R) (mem const_pool)) REG_EQUIV: legitimate_const then the final costing pass starts out with a significantly negative memory cost for R. Assigning memory is taken to be a saving: if (set != 0 && REG_P (SET_DEST (set)) && MEM_P (SET_SRC (set)) && (note = find_reg_note (insn, REG_EQUIV, NULL_RTX)) != NULL_RTX && ((MEM_P (XEXP (note, 0)) && !side_effects_p (SET_SRC (set))) || (CONSTANT_P (XEXP (note, 0)) && targetm.legitimate_constant_p (GET_MODE (SET_DEST (set)), XEXP (note, 0)) && REG_N_SETS (REGNO (SET_DEST (set))) == 1)) && general_operand (SET_SRC (set), GET_MODE (SET_SRC (set))) /* LRA does not use equiv with a symbol for PIC code. */ && (! ira_use_lra_p || ! pic_offset_table_rtx || ! contains_symbol_ref_p (XEXP (note, 0)))) Thus for your patch IRA assigns a call-clobbered register while for my patch it assumes that memory is cheaper than using caller saves. If I disable the CONSTANT_P part of the check, to mimic what IRA does with your patch, then I get the expected output. The patch for PR37273 predates LRA and so was aimed at reload's version of caller saves. AIUI, the problem there was that the caller saves didn't take advantage of equivalences and so would always save to the stack. LRA is smarter than that, so I suspect the check could be dropped. We should then get better code in cases where the rematerialised pseudo is used multiple times between two calls. Even so, the code above is clearly assuming that the equivalence will be replaced by a memory where necessary. That doesn't happen due to the following code in lra_constraints: /* If it is not a reverse equivalence, we check that a pseudo in rhs of the init insn is not dying in the insn. Otherwise, the live info at the beginning of the corresponding BB might be wrong after we removed the insn. When the equiv can be a constant, the right hand side of the init insn can be a pseudo. */ || (! reverse_equiv_p (i) && (init_insn_rhs_dead_pseudo_p (i) /* If we reloaded the pseudo in an equivalence init insn, we cannot remove the equiv init insns and the init insns might write into const memory in this case. */ || contains_reloaded_insn_p (i))) This specifically mentions constants, but the behaviour seems odd for them. We don't use init_insns directly to rematerialise the constant. We simply replace the pseudo with the constant and reload it afresh. So I wouldn't have expected the check above to be necessary for constants. In particular, we check elsewhere whether the constant is legitimate, and things like that. Adding !CONSTANT_P (x) to that condition also "fixes" the testcase. Note that if you change the constant in your example to (say) 7902, the output with current trunk is similarly bad: str x30, [sp, -32]! mov x0, 244091581366272 movk x0, 0x40be, lsl 48 str d15, [sp, 8] fmov d15, x0 fadd d0, d0, d15 str d0, [sp, 24] bl g ldr d0, [sp, 24] fmul d0, d0, d15 ldr d15, [sp, 8] ldr x30, [sp], 32 ret Even without the secondary reload you mention below, the RA should be able to reload from the constant pool rather than the stack. So ISTM that there is an underlying problem that current trunk happens to work around in some cases. I don't think what the move patterns are doing is right on first principles. > Recent changes have been moving in the opposite direction - keeping > high-level constructs (like GOT accesses) as a single operation works out > better for register allocation and allows more optimization. But I think we should look more into why that is, at least in this case. (I've tried to start that process above.) The effect is to hide address arithmetic from the main allocator (IRA) and only expose it during LRA. This means that, in high register GPR pressure, LRA has to scavenge and spill more than it needs to. Since having insn conditions dependent on can_create_pseudo_p is IMO wrong in principle, and since the patch wasn't approved in its committed form, I think we should revert it for GCC 15 and revisit this for GCC 16. Thanks, Richard > So keeping FP immediates as standard move instructions until regalloc > is best. Supporting MOV/FMOV in regalloc would require another secondary > reload (and would then allow rematerialization of these constants). > > Cheers, > Wilco