On 12/14/2017 12:26 PM, Richard Sandiford wrote:
>>> How does it relate to what LRA can do?  AFAIK LRA doesn't try to find
>>> any global optimal solution and previous hardreg assignments may work
>>> against it?
>
> Yeah, both of those are problems.  But the more important problem is
> that it can't increase the live ranges of input registers as easily.
> Doing it before RA means that IRA gets to see the new ranges.
LRA does not work on a global basis.  Its scope is somewhere between a
basic block and an extended basic block.
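For concreteness, the scenario under discussion -- no hard registers preserved across calls, so every value live across a call must either be spilled/reloaded or recomputed -- can be sketched as a toy transform.  This is hypothetical Python, not GCC code; `cheap` is a made-up stand-in for the pass's real remat-candidate test.

```python
# Toy illustration of remat across calls: with no call-preserved
# registers, every value live across a call needs a spill/reload --
# unless it is cheap to recompute, in which case we can rematerialize
# it after the call instead.  Instructions are modeled as tuples.

def remat_across_calls(insns, cheap):
    out, live = [], set()
    for insn in insns:
        if insn[0] == 'def':
            live.add(insn[1])
            out.append(insn)
        elif insn[0] == 'call':
            out.append(insn)
            for r in sorted(live):
                # recompute cheap values; everything else needs a reload
                out.append(('def', r) if r in cheap else ('reload', r))
        else:  # ('use', reg)
            out.append(insn)
    return out

after = remat_across_calls(
    [('def', 'v0'), ('def', 'v1'), ('call',), ('use', 'v0'), ('use', 'v1')],
    cheap={'v0'})
# v0 is recomputed after the call; v1 still needs a spill slot.
```

The point of the example is only the cost asymmetry: the rematerialized value costs one recomputation per call crossed, while the spilled one costs a store plus a load and a stack slot.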
Remat (along with caller-saves) is really just a case of range splitting
in my mind.  So you really want the pass either directly integrated with
IRA or run prior to IRA.

You can do splitting in response to a failure to get a hard register and
try to hook back into IRA to color those new objects.  I had reasonably
good success with that approach when I was looking at the allocators
prior to LRA.  Basically I let IRA do its thing.  When it was done I
walked through the IL splitting ranges to make hard registers available
at key points.  Then I'd call back into IRA (using existing mechanisms)
to try allocation again for the allocnos that had not been colored and
any new ones.  The key was that there was some very simple and easy
range splitting you could do on already-allocated allocnos that in turn
would free up hard registers.

That kind of model doesn't seem to fit here terribly well.  It's not a
lack of hard regs that's the problem, but simply not having any hard
regs available across calls.  So splitting the range of some allocno
that did get a hard register isn't going to help color any of the
allocnos that did not get a register.

If we go back further (circa 1998) we did a pre-allocation range
splitting pass.  We had it working marginally OK, but never really as
well as we wanted.  In that model we looked at pseudos that were likely
going to be hard to allocate and split them into multiple new pseudos.
We tracked the relationship between the new and original pseudos so that
reload could shove them back together in cases where that made the most
sense.  We had copyin/copyout insns to move back and forth between the
range copies and the original pseudo as needed.  I don't remember the
heuristics that drove when/where to split.  Meissner might, since my
recollection is that he did the major lifting there.

But again, I don't think that model works here either.  It did nothing
WRT remat.
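The bookkeeping in that 1998-era model can be sketched roughly like this.  This is a hypothetical Python toy, not the actual pass (which worked on RTL); `split_range`, the string-based insns, and the copy-insn syntax are all made up.  The point is the new-pseudo-to-original map that lets a later pass (reload, in the description above) glue ranges back together when that is cheaper.

```python
# Toy sketch of pre-allocation range splitting: a pseudo's live range is
# cut at chosen points, each tail gets a fresh pseudo connected by a copy
# insn, and the new->original mapping is recorded so ranges can later be
# recombined.  Purely illustrative.

def split_range(insns, pseudo, split_points, next_reg):
    """Split `pseudo`'s live range before each insn index in
    `split_points`.  Returns (new_insns, range_map) where range_map maps
    each new pseudo back to the original one."""
    range_map = {}
    cur = pseudo
    out = []
    for i, insn in enumerate(insns):
        if i in split_points:
            new = f"r{next_reg}"
            next_reg += 1
            out.append(f"{new} = {cur}")   # copy insn at the split point
            range_map[new] = pseudo
            cur = new
        out.append(insn.replace(pseudo, cur))
    return out, range_map

new_insns, rmap = split_range(
    ["r100 = a + b", "use r100", "call foo", "use r100"],
    "r100", split_points={3}, next_reg=200)
# r100's range now ends at the call; r200 covers the post-call use, and
# rmap records that r200 came from r100.
```

After the split, the two shorter ranges can be colored independently, which is exactly what makes splitting useful when pressure (rather than call-clobbering) is the problem.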
I know we pondered remat in the context of revamping caller-saves in the
early 90s to help SPARC FP.  But my recollection is that once we had
caller-saves handling the basics well, the performance gains were enough
that digging into remat was never really explored.

Anyway, that's a bit of history.  IMHO remat has to run prior to
allocation or be integrated with allocation.  In general I'd expect
running before and independent of IRA to be easier to implement, but
slightly less performant than tight integration with IRA.

In addition to potentially avoiding spilling, we get an added benefit
for SVE: we avoid variable-sized stack frames if we can eliminate *all*
instances of SVE registers live across calls.  I'm guessing those are
relatively rare to begin with, based on comments within the actual code.

>
>>> That said - I would have expected remat to be done before the
>>> first scheduling pass?  Even before pass_sms (not sure
>>> what pass_live_range_shrinkage does).  Or be integrated
>>> with scheduling and its register pressure cost model.
>
> SMS shouldn't be a problem.  Early remat wouldn't introduce new
> instructions into a loop unless the loop also had a call, which would
> prevent SMS.  And although it's theoretically possible that it could
> remove instructions from a loop, that would only happen if:
>
> (a) the instruction actually computes the same value every time, so
>     could have been moved outside the loop; and
>
> (b) the result is only used after a following call (and in particular
>     isn't used within the loop itself)
>
> (a) is a missed optimisation and (b) seems unlikely.
>
> Integrating remat into scheduling would make it much less powerful,
> since scheduling does only limited code motion between blocks.
>
> Doing it before scheduling would be good in principle, but there
> would then need to be a fake dependency between the call and remat
> instructions to stop the scheduler moving the remat instructions
> back before the call.
> Adding early remat was a way of avoiding such
> fake dependencies in "every" pass, but it might be that scheduling
> is one case in which the dependencies make sense.
>
> Either way, being able to run the pass before scheduling seems
> like a future enhancement, blocked on a future enhancement to
> the scheduler.
I'd expect it to be post-scheduling, simply because otherwise you have
to ensure scheduling doesn't muck it back up.  And there's been talk t

>
>>> Also I would have expected the approach to apply to all modes,
>>> just the cost of spilling is different.  But if you can, say, reduce
>>> register pressure by one by rematerializing a bit-not then that
>>> should be always profitable, no?  postreload-cse will come to
>>> the rescue anyhow.
>
> But that would then mean taking the register pressure into account when
> deciding whether to rematerialise.  On the one hand that would make it
> hard to do before scheduling (which decides the final pre-RA pressure).
> It would also make it a significantly different algorithm, since it
> wouldn't be a standard availability problem any more.
Right.  In general you don't want to remat or range split speculatively --
otherwise you end up doing a lot of work later to pull things back
together.  Been there, done that.  Pressure estimation and costing
models are very important here, as the scope of things under
consideration is great and the possibility of making things worse is
very real.

>
> For that use case, pressure-dependent remat in the scheduler might
> be a better approach, like you were suggesting.  The early remat
> pass is specifically for the extreme case of no registers being
> call-preserved, where it's more important that we don't miss
> remat opportunities, and more important that we treat it as
> a global problem.
This case feels different, largely because we're talking about a fairly
uncommon case and the cost if we hit that case is very high.
ie, we've got a larger amount of wiggle room if we don't get the costing
model terribly accurate.

jeff