On Thu, 2008-04-24 at 20:23 -0400, Vladimir Makarov wrote: > Hi, Peter. The last time I looked at the conflict builder > (ra-conflict.c), I did not see the compressed matrix. Is it in the > trunk? What should I look at?
Yes, the compressed bit matrix was committed as revision 129037 on October 5th, so it's been there a while. Note that the old square bit matrix was used not only for testing for conflicts, but also for visiting an allocno's neighbors. The new code (and all compilers I've worked on/with), use a {,compressed} upper triangular bit matrix for testing for conflicts and an adjacency list for visiting neighbors. The code that allocates and initializes the compressed bit matrix is in global.c. If you remember how a upper triangular bit matrix works, it's just one big bit vector, where the bit number that represents the conflict between allocnos LOW and HIGH is given by either of these two functions: 1) bitnum = f(HIGH) + LOW 2) bitnum = f(LOW) + HIGH where: 1) f(HIGH) = (HIGH * (HIGH - 1)) / 2 2) f(LOW) = LOW * (max_allocno - LOW) + (LOW * (LOW - 1)) / 2 - LOW - 1 As mentioned in some of the conflict graph bit matrix literature (actually, they only mention expression #1 above), the expensive functions f(HIGH) and f(LOW) can be precomputed and stored in an array, so to access the conflict graph bits only takes a load and an addition. Below is an example bit matrix with initialized array: 0 1 2 3 4 5 6 7 8 9 10 11 ------ ------------------------------------------------------------- | -1 | 0 | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ------ ------------------------------------------------------------- | 9 | 1 | | | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | ------ ------------------------------------------------------------- | 18 | 2 | | | | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | ------ ------------------------------------------------------------- | 26 | 3 | | | | | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | ------ ------------------------------------------------------------- | 33 | 4 | | | | | | 38 | 39 | 40 | 41 | 42 | 43 | 44 | ------ ------------------------------------------------------------- | 39 | 5 | | | | | | | 45 | 46 | 47 | 48 | 49 | 50 | ------ ------------------------------------------------------------- | 44 | 6 | | | | | | | | 51 | 52 | 53 | 54 | 55 | ------ ------------------------------------------------------------- | 48 | 7 | | | | | | | | | 56 | 57 | 58 | 59 | ------ ------------------------------------------------------------- | 51 | 8 | | | | | | | | | | 60 | 61 | 62 | ------ ------------------------------------------------------------- | 53 | 9 | | | | | | | | | | | 63 | 64 | ------ ------------------------------------------------------------- | 54 | 10 | | | | | | | | | | | | 65 | ------ ------------------------------------------------------------- | NA | 11 | | | | | | | | | | | | | ------ ------------------------------------------------------------- As an example, if we look at the interference between allocnos 8 and 10, we compute "array[8] + 10" = "51 + 10" = "61", which if you look above, you will see is the correct bit number for that interference bit. The difference between a compressed upper triangular bit matrix from a standard upper triangular bit matrix like the one above, is we eliminate space from the bit matrix for conflicts we _know_ can never exist. The easiest case to catch, and the only one we catch at the moment, is that allocnos that are "local" to a basic block B1 cannot conflict with allocnos that are local to basic block B2, where B1 != B2. To take advantage of this fact, I updated the code in global.c to sort the allocnos such that all "global" allocnos (allocnos that are live in more than one basic block) are given smaller allocno numbers than the "local" allocnos, and all allocnos for a given basic block are grouped together in a contiguous range to allocno numbers. The sorting is accomplished by: /* ...so we can sort them in the order we want them to receive their allocnos. */ qsort (reg_allocno, max_allocno, sizeof (int), regno_compare); Once we have them sorted, our conceptual view of the compressed bit matrix will now look like: G G G B0 B0 B0 B1 B1 B2 B2 B2 B2 0 1 2 3 4 5 6 7 8 9 10 11 ------ ------------------------------------------------------------- | -1 | G 0 | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ------ ------------------------------------------------------------- | 9 | G 1 | | | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | ------ ------------------------------------------------------------- | 18 | G 2 | | | | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | ------ ------------------------------------------------------------- | 26 | B0 3 | | | | | 30 | 31 | * | * | * | * | * | * | ------ ------------------------------------------------------------- | 27 | B0 4 | | | | | | 32 | * | * | * | * | * | * | ------ ------------------------------------------------------------- | NA | B0 5 | | | | | | | * | * | * | * | * | * | ------ ------------------------------------------------------------- | 26 | B1 6 | | | | | | | | 33 | * | * | * | * | ------ ------------------------------------------------------------- | NA | B1 7 | | | | | | | | | * | * | * | * | ------ ------------------------------------------------------------- | 25 | B2 8 | | | | | | | | | | 34 | 35 | 36 | ------ ------------------------------------------------------------- | 27 | B2 9 | | | | | | | | | | | 37 | 38 | ------ ------------------------------------------------------------- | 28 | B2 10 | | | | | | | | | | | | 39 | ------ ------------------------------------------------------------- | NA | B2 11 | | | | | | | | | | | | | ------ ------------------------------------------------------------- I have augmented the figure showing which allocnos are global (G) and which are local (B0, B1 and B2). The "*" slots in the matrix correspond to conflicts that cannot exist between the local allocnos. By sorting the allocnos, we have grouped the unneeded space together to make it easy to eliminate. You'll also notice the initialized array now has different values than before. This is how we conceptually "skip" over the empty space we have not allocated. As an example, if we look at the interference between allocnos 8 and 10, we compute "array[8] + 10" = "25 + 10" = "35", which if you look above, you will see is the correct bit number for that interference bit. The code that computes the precomputed bit matrix values as well as the number of bits in the bit matrix is the following code, also from global.c: max_bitnum = 0; max_global_bitnum = 0; for (i = 0; i < (size_t) max_allocno; i++) { int regno = allocno[i].reg; int blk = regno_basic_block (regno); int row_size = --num_allocnos_per_blk[blk]; reg_allocno[regno] = (int) i; partial_bitnum[i] = (row_size > 0) ? max_bitnum - ((int) i + 1) : -1; max_bitnum += row_size; if (blk == 0) max_global_bitnum += row_size; } Once that is finished, we're ready to allocate the smaller bit vector used for the bit matrix. The other benefit is that the accesses to this compressed bit matrix are exactly like those of a standard upper triangular bit matrix. A third benefit is that since all of the conflict bits for allocnos from the same basic block are consecutive in the bit vector, we should get slightly better cache usage than if they we're sorted that way. The amount of space savings over a standard upper triangular bit matrix depends on the structure of the function we're compiling. Functions with all global allocnos or functions with one basic block (eg, fpppp from SPEC97) cause us to degrade into a standard upper triangular bit matrix. OTOH, functions like 177.mesa/src/get.c:gl_GetBooleanv() from SPEC2000 show huge wins. For gl_GetBooleanv(), we allocate 201613 bytes for the square bit matrix, 100727 bytes for the standard upper triangular bit matrix and 1722 bytes for the compressed upper triangular bit matrix. That means the compressed bit matrix is 117 times smaller than the square bit matrix and nearly 60 times smaller than the standard upper triangular bit matrix. I am thinking about how we can compress the matrix even further. If I had region info, I should be able to easily sort the global allonos into groups, such that global allocnos in different groups do not conflict. Kenny had some region building code, but he said it was costly to compute. I may play around with it, if for no other reason than just to see what type of space savings it would buy. I'm also trying to come up with some ideas on how we can sort allocnos that are local to the same basic block, such that we can get space savings for them as well. If anyone has ideas, let me know! :) > I have also another question. I saw that sparset was used for the > conflict builder. I tried that too when I worked on YARA project. I > even wanted to contribute a generic sparset implementation. But I found > that in general case bitmaps are not worse the sparse sets and much > better if we take a needed space into account. May be you have another > impression? It would be very interesting for me to hear it. I found > that bitmaps have more advanced design than sparsets. I always wanted > to find inventors the bitmaps but never tracked them down. The sparseset representation is only used for the allocnos_live set. The old version was a bit vector to match up with the square bit matrix. Since I changed that to save space, I had to reimplement allocnos_live. Danny suggested I use a bitmap for that set and I tried it, but I found for the particular usage of allocnos_live, a sparseset was noticeably faster than bitmaps. I'll note that the main operations on allocnos_live are to add allocnos to the set, remove allonos to the set and iterate over the members of the set and occasionally clear the entire set. These are all O(1) operations for the sparseset with fairly low constant factors too. I didn't look too closely, but I'm guessing that the main problem with bitmaps for this type of usage was the slower iterating over all of the members of the set versus the sparseset. Obviously, bitmaps are much better than sparsesets wrt space usage, so you have to use them sparingly. You wouldn't want an array of these things! :) But there are use cases where they work very very well. The currently "live" set is one such use. Another use I have found where they work well is in the needLoad set used by Briggs' allocator. Whether you want/should use a sparseset really depends on the number and type of set operations your particular usage will see. I'm sure there are many usage cases where bitmaps are superior to sparsesets, just like there are usage cases where sparsesets are superior. I know that sounds like a cop-out, but it really does depend on how you're going to use it. Peter