Re: IRA for GCC 4.4

Peter Bergner Fri, 25 Apr 2008 15:22:58 -0700

On Thu, 2008-04-24 at 20:23 -0400, Vladimir Makarov wrote:
> Hi, Peter.  The last time I looked at the conflict builder 
> (ra-conflict.c), I did not see the compressed matrix.  Is it in the 
> trunk?  What should I look at?


Yes, the compressed bit matrix was committed as revision 129037 on
October 5th, so it's been there a while.  Note that the old square
bit matrix was used not only for testing for conflicts, but also for
visiting an allocno's neighbors.  The new code (and all compilers I've
worked on/with), use a {,compressed} upper triangular bit matrix for
testing for conflicts and an adjacency list for visiting neighbors.

The code that allocates and initializes the compressed bit matrix is in
global.c.  If you remember how a upper triangular bit matrix works, it's
just one big bit vector, where the bit number that represents the conflict
between allocnos LOW and HIGH is given by either of these two functions:

  1) bitnum = f(HIGH) + LOW
  2) bitnum = f(LOW) + HIGH

where:

  1) f(HIGH) = (HIGH * (HIGH - 1)) / 2
  2) f(LOW) = LOW * (max_allocno - LOW) + (LOW * (LOW - 1)) / 2 - LOW - 1

As mentioned in some of the conflict graph bit matrix literature (actually,
they only mention expression #1 above), the expensive functions f(HIGH) and
f(LOW) can be precomputed and stored in an array, so to access the conflict
graph bits only takes a load and an addition.  Below is an example bit matrix
with initialized array:
                  
                0    1    2     3    4    5    6    7    8    9   10   11
------        -------------------------------------------------------------
| -1 |      0 |    |  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 |
------        -------------------------------------------------------------
|  9 |      1 |    |    | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
------        -------------------------------------------------------------
| 18 |      2 |    |    |    | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
------        -------------------------------------------------------------
| 26 |      3 |    |    |    |    | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 |
------        -------------------------------------------------------------
| 33 |      4 |    |    |    |    |    | 38 | 39 | 40 | 41 | 42 | 43 | 44 |
------        -------------------------------------------------------------
| 39 |      5 |    |    |    |    |    |    | 45 | 46 | 47 | 48 | 49 | 50 |
------        -------------------------------------------------------------
| 44 |      6 |    |    |    |    |    |    |    | 51 | 52 | 53 | 54 | 55 |
------        -------------------------------------------------------------
| 48 |      7 |    |    |    |    |    |    |    |    | 56 | 57 | 58 | 59 |
------        -------------------------------------------------------------
| 51 |      8 |    |    |    |    |    |    |    |    |    | 60 | 61 | 62 |
------        -------------------------------------------------------------
| 53 |      9 |    |    |    |    |    |    |    |    |    |    | 63 | 64 |
------        -------------------------------------------------------------
| 54 |     10 |    |    |    |    |    |    |    |    |    |    |    | 65 |
------        -------------------------------------------------------------
| NA |     11 |    |    |    |    |    |    |    |    |    |    |    |    |
------        -------------------------------------------------------------

As an example, if we look at the interference between allocnos 8 and 10, we
compute "array[8] + 10" = "51 + 10" = "61", which if you look above, you will
see is the correct bit number for that interference bit.

The difference between a compressed upper triangular bit matrix from a standard
upper triangular bit matrix like the one above, is we eliminate space from the
bit matrix for conflicts we _know_ can never exist.  The easiest case to catch,
and the only one we catch at the moment, is that allocnos that are "local" to a
basic block B1 cannot conflict with allocnos that are local to basic block B2,
where B1 != B2.  To take advantage of this fact, I updated the code in global.c
to sort the allocnos such that all "global" allocnos (allocnos that are live in
more than one basic block) are given smaller allocno numbers than the "local"
allocnos, and all allocnos for a given basic block are grouped together in a
contiguous range to allocno numbers.  The sorting is accomplished by:

  /* ...so we can sort them in the order we want them to receive
     their allocnos.  */
  qsort (reg_allocno, max_allocno, sizeof (int), regno_compare);

Once we have them sorted, our conceptual view of the compressed bit matrix
will now look like:

                  G    G    G    B0   B0   B0   B1   B1   B2   B2   B2   B2

                  0    1    2     3    4    5    6    7    8    9   10   11
------          -------------------------------------------------------------
| -1 |    G   0 |    |  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 |
------          -------------------------------------------------------------
|  9 |    G   1 |    |    | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
------          -------------------------------------------------------------
| 18 |    G   2 |    |    |    | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
------          -------------------------------------------------------------
| 26 |    B0  3 |    |    |    |    | 30 | 31 |  * |  * |  * |  * |  * |  * |
------          -------------------------------------------------------------
| 27 |    B0  4 |    |    |    |    |    | 32 |  * |  * |  * |  * |  * |  * |
------          -------------------------------------------------------------
| NA |    B0  5 |    |    |    |    |    |    |  * |  * |  * |  * |  * |  * |
------          -------------------------------------------------------------
| 26 |    B1  6 |    |    |    |    |    |    |    | 33 |  * |  * |  * |  * |
------          -------------------------------------------------------------
| NA |    B1  7 |    |    |    |    |    |    |    |    |  * |  * |  * |  * |
------          -------------------------------------------------------------
| 25 |    B2  8 |    |    |    |    |    |    |    |    |    | 34 | 35 | 36 |
------          -------------------------------------------------------------
| 27 |    B2  9 |    |    |    |    |    |    |    |    |    |    | 37 | 38 |
------          -------------------------------------------------------------
| 28 |    B2 10 |    |    |    |    |    |    |    |    |    |    |    | 39 |
------          -------------------------------------------------------------
| NA |    B2 11 |    |    |    |    |    |    |    |    |    |    |    |    |
------          -------------------------------------------------------------

I have augmented the figure showing which allocnos are global (G) and which
are local (B0, B1 and B2).  The "*" slots in the matrix correspond to conflicts
that cannot exist between the local allocnos.  By sorting the allocnos, we
have grouped the unneeded space together to make it easy to eliminate.
You'll also notice the initialized array now has different values than before.
This is how we conceptually "skip" over the empty space we have not allocated.

As an example, if we look at the interference between allocnos 8 and 10, we
compute "array[8] + 10" = "25 + 10" = "35", which if you look above, you will
see is the correct bit number for that interference bit.  The code that
computes the precomputed bit matrix values as well as the number of bits in
the bit matrix is the following code, also from global.c:

  max_bitnum = 0;
  max_global_bitnum = 0;
  for (i = 0; i < (size_t) max_allocno; i++)
    {
      int regno = allocno[i].reg;
      int blk = regno_basic_block (regno);
      int row_size = --num_allocnos_per_blk[blk];
      reg_allocno[regno] = (int) i;
      partial_bitnum[i] = (row_size > 0) ? max_bitnum - ((int) i + 1) : -1;
      max_bitnum += row_size;
      if (blk == 0)
        max_global_bitnum += row_size;
    }

Once that is finished, we're ready to allocate the smaller bit vector used for
the bit matrix.  The other benefit is that the accesses to this compressed
bit matrix are exactly like those of a standard upper triangular bit matrix.
A third benefit is that since all of the conflict bits for allocnos from the
same basic block are consecutive in the bit vector, we should get slightly
better cache usage than if they we're sorted that way.

The amount of space savings over a standard upper triangular bit matrix
depends on the structure of the function we're compiling.  Functions with
all global allocnos or functions with one basic block (eg, fpppp from SPEC97)
cause us to degrade into a standard upper triangular bit matrix.  OTOH,
functions like 177.mesa/src/get.c:gl_GetBooleanv() from SPEC2000 show
huge wins.  For gl_GetBooleanv(), we allocate 201613 bytes for the square
bit matrix, 100727 bytes for the standard upper triangular bit matrix and
1722 bytes for the compressed upper triangular bit matrix.  That means
the compressed bit matrix is 117 times smaller than the square bit matrix
and nearly 60 times smaller than the standard upper triangular bit matrix.


I am thinking about how we can compress the matrix even further.  If I had
region info, I should be able to easily sort the global allonos into
groups, such that global allocnos in different groups do not conflict.
Kenny had some region building code, but he said it was costly to compute.
I may play around with it, if for no other reason than just to see what
type of space savings it would buy.  I'm also trying to come up with some
ideas on how we can sort allocnos that are local to the same basic block,
such that we can get space savings for them as well.  If anyone has ideas,
let me know! :)



> I have also another question.  I saw that sparset was used for the 
> conflict builder.  I tried that too when I worked on YARA project.  I 
> even wanted to contribute a generic sparset implementation.  But I found 
> that in general case bitmaps are not worse the sparse sets and much 
> better if we take  a needed space into account.  May be you have another 
> impression?  It would be very interesting for me to hear it.  I found 
> that bitmaps have more advanced design than sparsets.  I always wanted 
> to find inventors the bitmaps but never tracked them down.

The sparseset representation is only used for the allocnos_live set.
The old version was a bit vector to match up with the square bit matrix.
Since I changed that to save space, I had to reimplement allocnos_live.
Danny suggested I use a bitmap for that set and I tried it, but I found
for the particular usage of allocnos_live, a sparseset was noticeably
faster than bitmaps.  I'll note that the main operations on allocnos_live
are to add allocnos to the set, remove allonos to the set and iterate over
the members of the set and occasionally clear the entire set.  These are
all O(1) operations for the sparseset with fairly low constant factors
too.  I didn't look too closely, but I'm guessing that the main problem
with bitmaps for this type of usage was the slower iterating over all
of the members of the set versus the sparseset.

Obviously, bitmaps are much better than sparsesets wrt space usage, so
you have to use them sparingly.  You wouldn't want an array of these
things! :)  But there are use cases where they work very very well.
The currently "live" set is one such use.  Another use I have found
where they work well is in the needLoad set used by Briggs' allocator.

Whether you want/should use a sparseset really depends on the number
and type of set operations your particular usage will see.  I'm sure
there are many usage cases where bitmaps are superior to sparsesets,
just like there are usage cases where sparsesets are superior.  I know
that sounds like a cop-out, but it really does depend on how you're
going to use it.


Peter

Re: IRA for GCC 4.4

Reply via email to