https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782
--- Comment #15 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> That is, we're trading two memory accesses in the call branch
> (if we allocate R) against one memory access in both branches
> (if we spill R).  As the call branch gets more likely,
> the cost of doing two memory accesses there gets higher
> relative to the cost of doing one memory access in both branches.
> And that seems like the right behaviour in principle.
>
> From that point of view, it doesn't look like the memory and register
> costs of R are too wrong here.  The things being costed are the store
> and load around the call (which do exist if we allocate a call-clobbered
> register) and the loads at each use site (which do exist if we spill R).

Indeed, I don't think the heuristics are wrong, but CALL_FREQ grows much
more quickly than BB_FREQ, and at small values the costs are quite
sensitive to any change: the edge probabilities can barely change while
BB_FREQ changes dramatically.

> Like Feng Xue says in comment 1, I think the main missed optimisation
> opportunity here is that foo + 1024 is invariant, so if we allocate
> a call-clobbered register, we could save R once outside the loop
> and reload it after each call.  That would give:
>
> - a store of R outside the loop (low execution count)
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters

Yes, that is the ideal solution, but it also requires more changes to RA.
Instead I've chosen a middle ground here (the same as yours, but done in
ira_tune_allocno_costs instead): store and load only inside the loop, but
do so only in the BB that contains the call.

This is a major improvement over the current situation, because when you
have several nested loops and the value is invariant across a number of
them, you run into problems when each of those BBs naturally has very high
register pressure.  As you say:

> - a store of R outside the loop (low execution count)
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters
> - a load of R inside the loop with freq 0.49 * loop iters

and if the loop has several BBs (like a long if/then/elseif/else chain),
the load has to happen in every BB in the loop.  That's why we get the
large number of spills we currently do.  By forcing the spill to happen
only in the BB with the call inside the loop, the other BBs are freed from
all the loads.

> If we force R to be allocated a call-clobbered register instead
> of being spilled (and changing nothing else, via a hack to
> ira-color.c:improve_allocation) then we generate:
>
> - a store of R inside the loop (before the call) with freq 0.51 * loop iters
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters

I essentially did the same thing, but I think in a more conservative way:
when there is just a single call inside the entire loop, I force it to
assign the call-clobbered register if it needs one.  This removed the
loads with freq 0.49 but left the ones with freq 0.51.

I use call counts as the measure here because with 1 call and multiple BBs
inside the live range you know that at most 1 BB will have a call, and so
the rest won't have any.  Essentially, if you have high register pressure,
make only the part that increases it further pay for the spills.  As you
say it's not perfect, but it is a conservative improvement over the current
situation.

> which is cheaper than both the current approaches.  We don't do that
> optimisation yet though, so the current costing seems to reflect what we
> currently generate.
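To put rough numbers on the three schemes being compared, here is a purely
illustrative calculation.  It assumes a load and a store have the same cost
and reuses the 0.51/0.49 branch probabilities from above; the trip count and
all names are made up and don't come from the GCC sources:

/* Illustrative only: rough memory-access cost of the three allocation
   choices discussed above, for one run of the loop.  */
#include <stdio.h>

int
main (void)
{
  double iters = 1000.0;     /* hypothetical loop trip count */
  double call_prob = 0.51;   /* probability of the branch with the call */
  double mem_cost = 1.0;     /* assumed cost of one load or one store */

  /* Spill R: store once outside the loop, load at each use site
     (freq 0.51 + 0.49 per iteration).  */
  double spill = mem_cost + iters * (0.51 + 0.49) * mem_cost;

  /* Call-clobbered R with save/restore inside the loop: a store before
     and a load after the call, both with freq 0.51.  */
  double clobbered = iters * call_prob * 2 * mem_cost;

  /* Ideal: store R once outside the loop, reload only after each call.  */
  double ideal = mem_cost + iters * call_prob * mem_cost;

  printf ("spill %.0f, call-clobbered %.0f, ideal %.0f\n",
          spill, clobbered, ideal);
  return 0;
}

With these assumptions the in-loop save/restore (1020) is already slightly
more expensive than the full spill (1001), which is why the current costing
prefers the spill, while the ideal scheme (511) is clearly the cheapest.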
In many (if not most) architectures stores are significantly cheaper than
loads though, so the store before the call doesn't end up making that much
of a difference, but yes, it adds up if you have many of them.  So removing
it is indeed optimal, but that seems like a very hard one to do.  I would
assume the live range for the loop starts at the body of the loop, so I
would imagine it's very hard to tell reload to spill outside of the
allocnos it's currently allocating for?

> I don't know how well the above translates to the original example
> though.  Are some of the spilled values in exchange loop-invariant
> as well?

Yes, I believe so.  It's a bit hard for me to tell since the functions are
huge and have many nested loops...  In RTL the BBs are quite large as well
after the constprop and recursive inlining stuff, but the behaviour is
consistent with the minimal problem here.
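For reference, here is a sketch of the kind of minimal problem I mean (a
hypothetical reduction, not the original testcase): r = foo + 1024 is
loop-invariant and used on every iteration, but only lives across a call on
roughly half of them, so IRA has to choose between spilling r and keeping
it in a call-clobbered register with a save/restore around the call.

extern int foo[];
extern void bar (int *);

void
f (int *p, int n)
{
  int *r = foo + 1024;        /* loop-invariant address */
  for (int i = 0; i < n; i++)
    {
      if (p[i])               /* call branch, roughly 50% probability */
        bar (r);              /* the call clobbers the call-used registers */
      else
        r[i & 63] += p[i];    /* use of r on the non-call path */
    }
}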