https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782
--- Comment #16 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> --- (In reply to Tamar Christina from comment #15) > > From that point of view, it doesn't look like the memory and register > > costs of R are too wrong here. The things being costed are the store > > and load around the call (which do exist if we allocate a call-clobbered > > register) and the loads at each use site (which do exist if we spill R). > > Indeed, I don't think the heuristics are wrong, but because one frequency > CALL_FREQ grows much quicker than BB_FREQ and at the smaller values they are > a > bit sensitive to any changes. The edge probabilities can barely change while > the BB_FREQ can change dramatically. On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration count. It shouldn't grow or shrink based on the value of SPILLED. That's because every execution of the loop body involves exactly one reference to r104. SPILLED specifies the probability that that single reference is the “call” use rather than the “non-call” use, but it doesn't change the total number of references per iteration. So I think the only reason we see the different ALLOCNO_FREQs in: ALLOCNO_FREQ 989, … vs: ALLOCNO_FREQ 990, … is round-off error. If the values had more precision, I think we'd have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ. At one extreme, if SPILLED is very low (but if we nevertheless continue to duplicate the common part of the loop body) then storing and loading around the call becomes very cheap (ALLOCNO_CALL_FREQ is much lower than the conceptually fixed ALLOCNO_FREQ). If SPILLED is very high (but again we continue to duplicate the common part of the loop body) then storing and loading around the call becomes very expensive. So I guess the question is: where is the cut-off? Given that: - the loop iterates 1024 times - there is a single store outside the loop - the target claims that loads and stores have equal cost ISTM that the cut-off really is in the range [0.5, 0.51]. If the value of SPILLED reflects the real probability at runtime then I think IRA is doing the right thing, given the claimed costs of storing and loading. I guess the problems are: - SPILLED is only a guess - the aarch64 costs of stores and loads don't reflect reality > > Like Feng Xue says in comment 1, I think the main missed optimisation > > opportunity here is that foo + 1024 is invariant, so if we allocate > > a call-clobbered register, we could save R once outside the loop > > and reload it after each call. That would give: > > > > - a store of R outside the loop (low execution count) > > - a load of R inside the loop (after the call) with freq 0.51 * loop iters > > > > Yes, that is the ideal solution, but also requires more changes to RA. True :-) > Instead I've chosen a middle ground here (same as yours but done in > ira_tune_allocno_costs instead), which is to store and load only inside > the loop, but to do so only in the BB which contains the call. I don't think you were saying otherwise, but just FTR: I wasn't proposing a solution, I was just describing a hack. It seemed to me like IRA was making the right decision for r104 in isolation, for the given SPILLED value and target costs. My hack to force an allocation for r104 made things worse. > > which is cheaper than both the current approaches. We don't do that > > optimisation yet though, so the current costing seems to reflect what we > > currently generate. > > In many (if not most) Arches stores are significantly cheaper than the loads > though. So the store before the call doesn't end up making that much of a > difference, but yes it adds up if you have many of them. Yeah. Could we fix the problem that way instead? The only reason IRA is treating loads and stores as equal cost is because aarch64 asked it to :-) At least for the reduced testcase, ISTM that IRA is making the right choice for the “loads and stores are equally expensive” assumption. It also seems to make the right choice for a “loads are more expensive than stores” assumption. It's up to the target to choose which is right. So I think the reduced testcase is showing a target bug.