RE: Understanding IRA (The Story so far)

Ian Bolton Mon, 14 Dec 2009 03:15:33 -0800

For those who have been following my adventure in the world of IRA, or
just
want to take advantage of some of the time I've spent on this, here's a
detailed recap of my exploits and my current status (which I'm very
happy
with):


I initially thought IRA was calculating costs wrongly for our
architecture
and so I experimented with adding more costs in different places.  This
effectively raised the priority of coloring certain allocnos and it led
to
good performance on some benchmarks, but it was really a complete hack
and
not an acceptable solution, so I went back to the drawing board.

After chatting with Vlad and Jeff, I realised IRA was calculating costs
correctly and the issue was that the costs only reflect the preferences
of
each allocno, as opposed to their "agnosticism" (i.e. not caring what
register they get), and so instructions that will happily take any
register
would take whichever one you showed them first; this meant that high
priority
agnostic instructions (that are colored first) were pinching BOTTOM_REGS
before the lower-priority special instructions got a chance.  To address
this, I came up with the simple solution of changing the
REG_ALLOC_ORDER,
where I show TOP_CREGS (the top half of our register set) to agnostic
instructions before the callee-save BOTTOM_REGS (c7-c15) by default, so
they
take TOP_CREGS when all other costs are equal; meanwhile, allocnos
requiring
BOTTOM_REGS were given them because they'd been left unallocated and
their
costs were lower than the TOP_CREGS.

This resulted in good performance improvements on some key benchmarks,
where
previously agnostic instructions were unwittingly using up these special
BOTTOM_REGS and leading to us allocating TOP_CREGS for operands in
special
instructions (which then required spills and moves to get a BOTTOM_REG
in
time to do the special instruction).  The main drawback of this
alternate
REG_ALLOC_ORDER was that we reduce the chance to reuse a callee-save
register
if we pick a TOP_CREG, because it doesn't work on all instructions, and
I
ended up getting regressions on some low-pressure benchmarks because we
used
more callee-save registers than previously.

I then tried various other ways of achieving the performance
improvements I
got on the high register pressure benchmarks and eliminating the
regressions
on the low-pressure ones:

* I tried using '?' constraints on agnostic instructions, such that they
erred to TOP_CREGS instead of callee-save BOTTOM_REGS, but this led to
too
much bias away from the more reusable BOTTOM_REGS (even when there was
no
need to avoid them) and I subsequently got regressions;

* I tried restricting the effect of the '?', by not using it for the
register
preferencing pass and by only adding on the alt_cost if I'd determined
the
allocno never requires BOTTOM_REG, and this got some decent performance,
but
I was still effectively penalising agnostic instructions for using
BOTTOM_REGS regardless of whether others allocnos were demanding them,
so I
got regressions;

* I tried coloring the ones that need BOTTOM_REGS first, which maximises
the
reuse opportunities, but this was essentially a rejection of the
existing
tried-and-tested coloring heuristics in IRA and consequently led to some
regressions;

* I tried combining the new REG_ALLOC_ORDER with a modified coloring
heuristic that uses the need for BOTTOM_REGS as a tie-breaker when two
allocnos are otherwise indistinguishable, but this was overriding the
ALLOCNO_NUM ordering (which I guess represents live range starting
points?),
so I got some regressions.

Fortunately, I was hit by another epiphany on Monday night and I
realised I
could emulate the new REG_ALLOC_ORDER dynamically within assign_hard_reg
and
only when necessary.  When allocating for an agnostic allocno, if it
conflicted with other allocnos that needed BOTTOM_REGS, I added 1 to the
full_cost of each of the BOTTOM_REGS so this allocno took a TOP_CREG.
This
gave me my best performance overall, but there were still some
regressions on
low-pressure benchmarks.  Whilst investigating them, I realised I had
gone
beyond my original new REG_ALLOC_ORDER because I was now avoiding all
BOTTOM_REGS instead of just the callee-save BOTTOM_REGS, so I changed
assign_hard_reg so it only adds 1 to each callee-save BOTTOM_REG.  This
means
we happily use up the caller-save BOTTOM_REGS (c1 - c6), regardless of
whether the allocno is agnostic or not, and this leads to more
reusability in
low-pressure cases (whereas avoiding all BOTTOM_REGS for an agnostic
instruction decreases reusability).

Note that I had been experimenting with Priority algorithm and
Chaitin-Briggs
up until this point and have decided to stick with Priority for now
because
we are using Priority already and this minimises the disruption and
reduces
the risk of regressions on untested inputs.  (Results generally show CB
to be
better performing and smaller code-size, but it suffers badly in very
high
pressure cases and spills more than Priority.)

As of yesterday, I had two versions of the Priority algorithm IRA: one
that
biases against all BOTTOM_REGS when an agnostic allocno conflicts with
one
that wants BOTTOM_REGS; and one that biases against only the callee-save
BOTTOM_REGS.  The former got me better performance on high-pressure
benchmarks and the latter got me better performance on the ones with
some low
pressure.  Either were an improvement on what I started with (default
IRA),
but I wanted to have the best of both worlds and so I needed a hybrid
that
biases against only callee-save BOTTOM_REGS for low-pressure and biases
against all BOTTOM_REGS for high-pressure.

Sure, I could have put in some ugly hack that looks at the register
pressure
number and picks one or the other, but instead I came up with a more
intelligent solution, which takes advantage of our single caller-save
TOP_CREG (c16), which appears after the caller-save BOTTOM_REGS in our
REG_ALLOC_ORDER.  Now, when an agnostic allocno conflicts with one that
wants
BOTTOM_REGS, I add 1 to the cost of c16 too, so we still give out c1, c2
etc
early on (because they have lower costs than all the callee-save regs
and
they appear earlier in the REG_ALLOC_ORDER than c16) to whichever
allocno
gets there first.  But later, when all the caller-save regs are
allocated, we
will use our first callee-save reg (agnostic allocnos will take c17,
ones
that need BOTTOM_REGS will take c7).  At this point, the future cost of
this
allocated callee-save reg becomes cheaper in future than the caller-save
regs, and we opt to use this in preference to the caller-save regs
(including
c1-c6).  The result is that, for high pressure, we are holding back all
the
BOTTOM_REGS for those that need them.

The performance of this version is my best ever and I am very happy with
it.
It doesn't quite get the highest performance I've seen on one of the
high-
pressure benchmarks, because we may still be giving up some BOTTOM_REGS
to
agnostic instructions early on, but it's pretty pretty good otherwise.

Thanks for reading and happy allocating!

RE: Understanding IRA (The Story so far)

Reply via email to