I've done a bunch of reading, and though I'm not finished, I'm starting to converge on the following overall algorithm, based on the assumptions specified below. I'm not _necessarily_ looking for comments at this point, because I haven't finished evaluating the specifics of several algorithms, but feel free to comment.
Foreword: I'm looking at a hybrid of several algorithms, since different size regions have different advantages and disadvantages. The typical downside of hybridizing is extra overhead in the inner loop; here, the cost is a greater number of branches in the inner loop. You can skip to the posted algorithm pseudo-code for a quick overview. Since we're inlining the data size, I'm selecting which sub-algorithm to use based on the size. The data type (be it a raw character buffer like a string, or a nested data structure such as an Array) will have to be known and thus interleaved within the GC for now. What I'd like to do is define vtable-ish ops for each data type and have the perl make-script generate the core routine by concatenating the different types. I don't want the overhead of a vtable function call, but I also don't want to expose the garbage collector to extension writers. More to come on this.

Small data-types ( size <= 256 bytes ): Ideally there's zero header space, given the relative space cost, and since we're already using the size this should be possible. For now, I'm looking towards a generic copying collector. Generational schemes can require additional header space (either for marking or for boundary references). The only advantage of making this generational would be to minimize the scan depth (most generational algorithms don't descend through data structures living in aged heaps), but this requires a write-barrier, and I'm currently opposed to that (it defeats much of the advantage of abstracted garbage collection over ref-counting). I'm still considering it for terminal (non-descending) data types. I figure small data types will be the most common by far, so the less work done per object the better. Further, since each object will probably be visited on each reclamation stage, we don't get to take advantage of anti-swap-thrashing techniques without imposing relatively high out-of-band overhead.
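To make the headerless small-object region concrete, here's a minimal sketch of semi-space bump allocation and copy-out. All the names here (SmallHeap, small_alloc, etc.) are mine for illustration, not from any actual parrot source, and the root-scanning/forwarding machinery is omitted; the point is just that with the size supplied by the reference, no per-object header is needed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: headerless bump allocation in a semi-space. */
typedef struct {
    char  *from_space;   /* current allocation space */
    char  *to_space;     /* copy target during reclamation */
    size_t size;         /* bytes per semi-space */
    size_t top;          /* bump-pointer offset */
} SmallHeap;

static void *small_alloc(SmallHeap *h, size_t n)
{
    if (h->top + n > h->size)
        return NULL;               /* caller must trigger gc_reclaim() */
    void *p = h->from_space + h->top;
    h->top += n;
    return p;
}

/* Reclamation resets the bump pointer, then copies each live object
 * into the to-space; the size comes from the reference, not a header. */
static void small_begin_gc(SmallHeap *h) { h->top = 0; }

static void *small_copy(SmallHeap *h, const void *obj, size_t n)
{
    void *dst = h->to_space + h->top;
    memcpy(dst, obj, n);
    h->top += n;
    return dst;
}

static void small_flip(SmallHeap *h)
{
    char *t = h->from_space;
    h->from_space = h->to_space;
    h->to_space = t;
}
```

After small_flip(), the to-space holds only live objects and becomes the new allocation space, so everything left behind is reclaimed implicitly, which is why zero per-object work is spent on the dead.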
The associated heap is grown (doubled in size) whenever less than 33% of the heap is available after a reclamation. Unfortunately we don't know that we want to grow the heap until we've finished reclaiming, and a regrowth involves another copying phase, so the growth is deferred until the next collection. I found that this isn't so bad if we keep the max data size much less than 33% of the heap. I'm looking at 64K as a starting heap size, and 256B is much less than 21K.

Medium data-types ( 256 < size <= page-size ): Here we can afford header prefixes, while copies are more expensive (relative to other operations). Allocations in this region can be pretty erratic and thus hard to coalesce, so I'm looking at using a generational scheme. Unfortunately many such schemes "age" the wrong data, or require a lot of extra copying. Some of the nicer schemes only require comparison to a water-mark (below which an object is considered aged, and above which it is considered young and thus likely to expire soon). Such watermark schemes allow adaptive modification of the water-mark(s): when we find that most data is not dying, we raise the water-mark (to avoid unneeded copying); when we find that a lot of aged data is dying, we lower it (to alleviate garbage in the aged region). I'm currently fiddling with a compacting collector here. This requires two stages. The first stage uses a "version" tag prefixed to each object (a 4-byte integer where possible). On each execution of gc_reclaim, the version number is incremented. In the Dead-Object-Detection stage (which, for comparison, is also where small objects are copied out to their semi-space), the versions of all live objects are updated. Additionally, the allocated sizes of both the young and old regions are calculated.
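The version-tag idea boils down to very little code. Here's a sketch under my own naming (MedHeader, gc_version, etc. are illustrative, not real parrot identifiers): bump a global version at the start of each gc_reclaim, stamp every live object during DOD, and anything still carrying a stale version afterwards is garbage for the compactor to squeeze out:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-byte version prefix for medium objects. */
typedef struct {
    uint32_t version;   /* stamped with the current GC version when live */
    uint32_t size;      /* payload size, used later by the compactor */
} MedHeader;

static uint32_t gc_version = 0;

/* Called once at the top of gc_reclaim(). */
static void med_begin_reclaim(void) { gc_version++; }

/* Called from DOD for every reachable medium object. */
static void med_mark_live(MedHeader *h) { h->version = gc_version; }

/* Consulted by the stage-2 compactor while walking the heap. */
static int med_is_dead(const MedHeader *h)
{
    return h->version != gc_version;
}
```

One design note: because the tag is only compared for equality against the current version, there's no per-collection clearing pass over the heap, which matters given that this region is walked twice already.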
Unlike the copying collector, we have the option of growing the mid-sized heap before performing the copying/compaction, providing more up-to-date and thus accurate feedback. If we don't copy, then we only compact the "young" region. Thus the advantage of aging objects is to avoid copying them (at the expense of leaving fragments). Alternatively, if we determine that there is too much free space in the aged section, we can temporarily reset the water-mark and compact that region too. The big advantage of compaction is that age is directly related to how far up the heap we are; objects near the top are guaranteed to be younger than those at lower levels. The water-mark is therefore an accurate representation of age. Since this heap is efficiently growable, I'm looking to start it at 16K (or some fraction of the semi-space), since each thread requires its own local set of heaps. The main disadvantage of this scheme is that we access each live object twice. But since medium allocations "should" be statistically rarer than small-object allocations, we should see better overall performance (I have yet to verify this for our needs). I'm not done evaluating generational schemes, and I might still be in favor of using a write-barrier to minimize the descending depth (when arrays of arrays are used, for example).

Large objects ( size > PAGE ): Any header overhead should be acceptable here, both computationally and space-wise, since we can't afford to leave huge memory chunks to accumulate and rot. Ideally we'd be able to "re-mmap" pages so as to maximize contiguous free space, but I don't think that's portable (or even possible on Linux). Allocations are relegated to the underlying malloc/free-style memory manager. It's likely that pages of this size will warrant consolidating the memory space of all threads.
In other words, the cost of contending for malloc locks is overshadowed by the potentially wasted space of keeping separate free-lists of objects of size 4K or more (oft-times up to a meg). To allow DOD, I'm adding a simple thread-specific doubly-linked-list prefix. Allocations occur on the "from-list". When garbage is collected, live objects are moved to the "to-list". Note that multiple accesses to the same object merely move it back to the head of the "to-list", so we're fine. When DOD is done, a second stage manually "free"s all remaining objects on the "from-list". Since these objects are really large and should be few and far between, this overhead should be acceptable, especially considering that the cost of freeing some objects is minuscule compared to moving large ones. In the ideal form of this algorithm, the remaining elements of the from-list are appended to a free-list in constant time, but this assumes either that all objects are roughly the same size, or that the objects were pre-segregated into separate to/from/free-lists based on their size. That would require the underlying memory manager to be integrated with the garbage collector -- an assumption I'm not sure I want to make at this point. Finally, the to-list is reattached to the from-list in constant time. Preliminary tests showed that significantly more memory was wasted by allocating/forgetting large memory blocks than by raw allocating/forgetting of many smaller blocks (one dead object of size 64K dwarfs a thousand dead 16-byte objects). I grafted a "gc_reclaim()" call onto the underlying memory manager, run prior to requesting more memory from the system. The (surprising) result was that garbage collection was initiated exclusively by the underlying system, and NOT by the filling of the copying/compacting collector heaps.
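The from-list/to-list mechanics above are standard intrusive doubly-linked-list operations; here's a sketch with circular sentinel heads (all names -- GcLink, mark_live, etc. -- are mine for illustration). Note how a re-visited object just gets moved to the head of the to-list again, which is exactly the harmless double-visit case mentioned above:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical doubly-linked-list prefix carried by each large object. */
typedef struct GcLink {
    struct GcLink *prev, *next;
} GcLink;

static void list_init(GcLink *head)
{
    head->prev = head->next = head;   /* circular list with sentinel */
}

static void list_unlink(GcLink *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static void list_push(GcLink *head, GcLink *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* DOD: a live object moves from-list -> to-list in O(1).  Calling this
 * twice on the same object merely re-heads it on the to-list. */
static void mark_live(GcLink *to_list, GcLink *obj)
{
    list_unlink(obj);
    list_push(to_list, obj);
}

static int list_empty(const GcLink *head) { return head->next == head; }
```

Stage 2 then just walks whatever is still on the from-list and frees it; the to-list becomes the next cycle's from-list.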
Even though this was a synthetic benchmark and I batch-allocated 256K worth of pages from the OS on each VM growth, I never had the other regions get more than 60% full. I tested this by putting a wrapper around the GNU malloc/free calls. I assumed optimal consolidation and simply modified an accumulator and compared it to a water-mark; this should be sufficient to abstract reclaiming from the underlying memory subsystem. If the GC is merged with the mem-subsys, then perhaps a rarely run compactor should be applied to any movable objects within the VM. Even if the VM shares objects between threads, only the objects in the current thread need be moved (albeit with considerably less efficiency).

Constant-sized objects (object handles): Handles don't necessarily have to be fixed-position, but apparently that's desired. One advantage of fixed-size data is that we generally don't need to consolidate anyway (i.e., we don't have to move them). Another is that if we did move handles, references would need "forwarding pointers", which slow down reclamation. Unfortunately we still need to collect dead handles. I'd really like to simply apply a doubly-linked list as with the large objects. Since handles are all the same size, we can move the "to-list" directly to the "free-list" in constant time. Some algorithms I was considering earlier needed access to ALL object handles within these linked lists, which was frustrating since XS-code could locally declare an object and thus forego the list prefix. I think this is irrelevant with the above selected algorithms, however, so we're good. My unit testing has a fixed root set consisting of an array of handles, so I've avoided the issue of handle reclaiming thus far. In general, however, arena allocations perform well, even in GC'd heaps.
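The constant-time "move the to-list directly to the free-list" step deserves spelling out, since it's what same-sized objects buy us over the large-object case (where each from-list survivor must be freed individually). A whole circular list can be spliced onto another in O(1), regardless of length; this sketch uses my own illustrative names (HLink, hlist_splice):

```c
#include <assert.h>
#include <stddef.h>

typedef struct HLink {
    struct HLink *prev, *next;
} HLink;

static void hlist_init(HLink *head) { head->prev = head->next = head; }

static void hlist_push(HLink *head, HLink *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Move ALL of src's nodes onto the front of dst in constant time,
 * leaving src empty.  This is the stage-2 "from-list -> free-list"
 * and "to-list -> from-list" operation for handles and arenas. */
static void hlist_splice(HLink *dst, HLink *src)
{
    if (src->next == src) return;      /* src is empty: nothing to do */
    src->next->prev = dst;             /* src's first node follows dst */
    src->prev->next = dst->next;       /* src's last node precedes dst's old first */
    dst->next->prev = src->prev;
    dst->next = src->next;
    hlist_init(src);                   /* leave src empty */
}
```

So the per-collection cost for an arena is O(live objects moved during DOD) plus two O(1) splices, with no per-dead-object work at all.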
If our core (or even parrot byte-code) determines that a constant-sized object will be allocated regularly, and the object is large enough to warrant a header, then a separate storage area similar to the handle arenas could be used. The only problem is that it adds to the size of the switch statement that directs the DOD to the appropriate data area. Theoretically we'd have a switch like this:

    gc_reclaim() {
        if ( small_heap_should_grow ) { grow_small_heap() }

        for r in roots {
            { ... }                       // move r from from-list -> to-list (DOD)
            PMCPtr* ptr = r->data;
            size = ptr->size;
            if ( size < GC_SMALL_SIZE ) {
                { ... }                   // small-object copying collector
            } else {
                switch ( size ) {
                    case ARENA_SIZE1: { ... }    // arena'd collectors
                    case ARENA_SIZE2: { ... }
                    ...
                    default:
                        if ( size < PAGE_SIZE ) { ... }  // medium object (compactor)
                        else                    { ... }  // large object (alloc/free)
                }
            }
        }

        if ( smallHeapNotBigEnough() ) { markSmallHeapShouldGrow() }

        // Reclaiming stage
        MediumObjectStage2: { ... }  // O(num_bytes) grow-heap, or
                                     // O(num_bytes) compact young (and old?)
        HandleStage2:       { ... }  // const-time from-list -> free-list,
                                     // const-time to-list -> from-list
        Arena1Stage2:       { ... }  // const-time from-list -> free-list,
                                     // const-time to-list -> from-list
        Arena2Stage2:       { ... }  // const-time from-list -> free-list,
                                     // const-time to-list -> from-list
        ...
        LargeObjectStage2:  { ... }  // O(num_large_items) free items on from-list,
                                     // const-time to-list -> from-list
    } // end gc_reclaim

Note that the { ... } bodies may require further switching on r->type to account for descending into arrays, etc. Alternatively, r->type could be the outer switch, but with extra pre-processing complexity. The switch statement is optional and really should be based on profiling. Medium-to-large powers of two might fit the bill, since arrays tend to grow on such size boundaries once they exceed a certain size (256, 512, 1024, 2048, 4096, 8192, ...).
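For what it's worth, the size-based dispatch at the heart of that loop compiles down to something like the following (the constants, the Region enum, and gc_route are made up for illustration; real arena sizes would come from profiling as noted above):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants; GC_SMALL_SIZE matches the post, the rest
 * are placeholders. */
#define GC_SMALL_SIZE 256
#define GC_PAGE_SIZE  4096
#define ARENA_SIZE1   512
#define ARENA_SIZE2   1024

typedef enum { REGION_SMALL, REGION_ARENA, REGION_MEDIUM, REGION_LARGE } Region;

/* Pick the sub-collector from the inlined object size.  The small-object
 * test comes first, outside the switch, so the common case stays cheap. */
static Region gc_route(size_t size)
{
    if (size < GC_SMALL_SIZE)
        return REGION_SMALL;                 /* copying collector */
    switch (size) {
    case ARENA_SIZE1:                        /* arena'd collectors */
    case ARENA_SIZE2:
        return REGION_ARENA;
    default:
        return size < GC_PAGE_SIZE ? REGION_MEDIUM   /* compactor */
                                   : REGION_LARGE;   /* alloc/free */
    }
}
```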
Note also that I didn't put small objects in the "default" clause, since that would make the most common case slower. The only side effect is that all arenas must allocate more than GC_SMALL_SIZE bytes. For GC_SMALL_SIZE = 256B, a 16B linked-list header would be 3% or 6% overhead depending on pointer size. Compare this to 0% for the copying collector and at most 1.5%/3% overhead for the compacting layer.

Deterministic object death: The last topic is that of known object death. This is especially important for large objects or growing objects. The more we can minimize dead objects created by resizing, the less often we'll have to garbage-collect. Strings and arrays are notorious for this and are at the heart of perl. Unfortunately the copying/compacting algorithms can't take advantage of early freeing (allocations only occur at the top of their stacks), but the large-object and arena regions can immediately put objects back onto their free lists. The most obvious way of doing this is through:

    ObjPtr* gc_resize(ObjPtr* ptr, new_size);

which can free certain objects when and where appropriate. Other variations are found within incremental garbage-collection schemes (which I haven't fully explored). Finally, for very large objects it might be worthwhile to define a separate "ref-counted" set of vtables that auto-reclaim. Any such method would still utilize the garbage collector on the back end (whenever any heap gets full), and both the space overhead of a ref-count on such an object and the computational overhead for such a small subset of objects would be negligible; the only downside is the extra set of vtables (such as for "large strings"). Again, this is something that should be decided after profiling.

Status: Thus far I've written a boiled-down version of parrot with simplified PMCs and the small-growable-copying-collector coupled with the large-allocator, to help flesh out the details/problems discussed above. Enjoy... -Michael