On Sunday 11 November 2001 05:41 pm, Michael G Schwern wrote:
> While at JAOO, Andy Hunt told me about a little trick Matsumoto is
> (was?) trying out for Ruby to speed up its garbage collection. It
> goes something like this (keeping in mind I know very little about GC).
>
> Assumption: Most variables have only one reference (themselves).
>
> So when a variable is first created, all it has is a single bit
> representing its refcount. When allocated, it's turned on. When it
> falls out of scope, it's flipped off and swept away.
>
> When a variable is referenced a second time, it's transmogrified into
> a new structure that can support the necessarily more involved garbage
> collection system.
>
> Basically, for simple variables with no external references, you can
> use a very quick allocation/deallocation model. It's only necessary
> to apply the big guns and extra memory for variables with external
> references.
>
> This is the full extent of my understanding, so I'll just throw it out
> there and hope someone runs with it.
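Here's roughly how I picture that trick in C. All of these names are my own invention (nothing from Ruby's actual internals), and I've stored the bit as a "promoted" flag rather than a literal refcount bit, which I believe amounts to the same thing:

    /* Illustrative sketch of the one-bit refcount trick. */
    #include <stdlib.h>

    typedef struct obj_header {
        unsigned shared : 1;   /* 0 = exactly one reference; 1 = promoted
                                  to the full, heavyweight GC scheme */
        /* ... payload ... */
    } obj_header;

    obj_header *obj_new(void) {
        obj_header *o = malloc(sizeof *o);
        o->shared = 0;         /* created with its single implicit ref */
        return o;
    }

    /* Taking a second reference transmogrifies the object: here just a
     * flag, but a real system would move it into a structure with a
     * real refcount or mark bits. */
    void obj_ref(obj_header *o) {
        o->shared = 1;
    }

    /* When the reference goes out of scope, unshared objects are freed
     * immediately; shared ones are left to the heavyweight collector. */
    void obj_unref(obj_header *o) {
        if (!o->shared)
            free(o);
        /* else: defer to the full garbage collector */
    }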
What follows is unfortunately rather long; I've tried to go back and label common segments for interest and reference.

White papers:
I've been very eagerly researching GC techniques, reading whatever white papers I could get my hands on (mostly linkable via previous posts, or acm.com). I'm by no means finished, but here's what I've discovered so far.

Counting revisited:
Most memory freeing requires SOME extra inlined code. Reference counting, as it turns out, is just a variation of garbage collection, most of which use approximations to determine reclaimable objects. Ref counting is also an approximation, since things like circular lists prevent some reclaiming. The main problem with pure ref-counting is that every memory reference requires fiddling with if-statements, adds, and subtracts. This can be somewhat alleviated by having the compiler determine when a variable should be created and destroyed (if no refs are ever made to it), though that is akin to being a stack-based auto-variable. If ref-counting is only an approximation (unlike perl5), such as the 1-bit method (which was detailed in some papers I read), then you still have the overhead of reclaiming, freeing, and consolidating.

Cost measurement:
The cost of an allocator can be based on the cost of allocation, freeing, reclaiming, copying/compaction, etc. (This is a terse review, so I could be missing stuff.) If there is a lot of allocated memory, then copying GCs will be very expensive, but so will mark/sweep collectors. If there are a lot of accesses (reads/writes) relative to the amount of allocation space, then ref-counting will be very expensive (but so are several other GC techniques, discussed below). In a given time period, if there are more frees than allocations, then copying GCs will be faster than mark-sweep or ref-counting. Ref-counting has very good cache coherency (when coupled with a good memory manager like vmem), while a copying collector will regularly page-thrash as it navigates every page in each heap (and it requires more total memory). For small applications with bounded memory requirements, copying collectors have very little overhead, very little slack space, and keep the usable memory in L2/L3 cache.

Mark and Sweep:
Mark-and-sweep "compacting" is good for large memory objects, which would take a long time to copy. A popular algorithm is to keep a doubly linked list of every memory object within an arena (ideally of a fixed size). When memory is starved, navigate the root set, but instead of copying objects, transition them from the "from" linked list to the "to" linked list. When you're done, add any items remaining in the from-list to the free-list. Finally, put the to-list back into the from-list so the operation can later be repeated. You avoid the overhead of copying (in a copying GC) and the overhead of freeing (in reference counting), but you're plagued with fragmentation. Additionally, you need an extra two pointers per memory object (for the linked list). If the memory objects are all the same size, you avoid the fragmentation issues (but that's harder to do with larger memory sizes, unless you use geometric sizes). The paper suggests that variations on the copying-collector theme are ideal for small memory sizes (where copying is quick compared to updating pointers), and non-copying methods for larger memory.
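To make the from/to list idea concrete, here's a rough sketch; the names and layout are purely illustrative, not anything prescribed by the papers:

    /* Sketch of the two-list mark/sweep scheme described above. */
    #include <stddef.h>

    typedef struct gc_obj {
        struct gc_obj *prev, *next;   /* the two extra pointers */
        int marked;                   /* reset before each cycle */
        /* ... payload ... */
    } gc_obj;

    typedef struct { gc_obj *head; } gc_list;

    static gc_list from_list, to_list, free_list;

    static void list_remove(gc_obj *o) {
        if (o->prev) o->prev->next = o->next;
        if (o->next) o->next->prev = o->prev;
    }

    static void list_push(gc_list *l, gc_obj *o) {
        o->prev = NULL;
        o->next = l->head;
        if (l->head) l->head->prev = o;
        l->head = o;
    }

    /* Called for each live object found while walking the root set:
     * unlink it from the from-list and relink it on the to-list.
     * Nothing is copied, so even huge objects cost O(1) to "mark". */
    void gc_mark(gc_obj *o) {
        if (o->marked) return;
        o->marked = 1;
        if (from_list.head == o) from_list.head = o->next;
        list_remove(o);
        list_push(&to_list, o);
        /* ... recurse into any pointers held by o ... */
    }

    /* After the root-set walk, everything left on the from-list is
     * garbage; the to-list becomes the next cycle's from-list. */
    void gc_sweep(void) {
        gc_obj *o, *next;
        for (o = from_list.head; o; o = next) {
            next = o->next;
            list_push(&free_list, o);
        }
        from_list = to_list;
        to_list.head = NULL;
    }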
Further, if a heap is homogeneous and single-threaded (which includes having a heap per thread), then the linked-list method can use compaction to remove fragmentation/slack. This will occur much less often than the full copying of a copying GC, and older data will migrate towards the front of the heap, thereby avoiding having to slide the entire heap.

Generational copying collectors:
The more sophisticated algorithms use what are called generational copying collectors. Similar to the compactor variation above, we want to take advantage of the fact that younger objects have a higher probability of being reclaimed than older objects. In other words, the more times an object survives a garbage collection, the more likely it won't need to be rechecked for reclaiming. One of the approaches I'm currently reading about (used in many Lisp architectures) involves using multiple heaps (and forgoing the "semi-space" copy buffer of a raw copying collector). In the simplest case, each time the top-most heap is full, a reclamation phase is initiated, and live objects _in that heap_ are copied into a lower heap generation. If the lower heap then fills, a reclamation occurs for _that_ heap, pushing still-live values even further down. The main idea is that old values (those from previous reclamation generations) are not checked for liveness (we play the odds that we won't need to), and most importantly, we don't _copy_ them needlessly. We effectively produce a linked list of heaps where the further down the list you go, the older the objects are. Typical algorithms don't pass values down on the first reclamation, so as to avoid excessive heap depth. I'm currently working on one implementation that uses a semi-space (as in a generic copying collector) for the first tier. Subsequent tiers are twice the size, so that multiple reclaims are guaranteed at each stage. The paper suggests that there should be a finite number of tiers, and that the bottom level is a special case (probably using compaction instead of copying).

There are several advantages to this overall architecture. During the marking/reclaiming phase, each handle (PMC / string structure) has its reference address checked against the tier level. If it's known not to be part of the collection, then we don't modify its contents (or even read from it, unless it's an indirection, like an array). This allows old, rarely accessed variables to be paged out to disk. To further this, objects that are known not to contain pointers can be placed into their own heaps, to prevent ANY page-swapping during garbage collection (reading the pointers would otherwise require paging them back into memory on each root-set sweep). Other than paged memory, the active memory is the top tier, which is relatively small and thus fully cacheable.

Copier advantages:
One of the most obvious advantages of a copying collector is that a stack carver can be used for allocation. This suggests that a single heap should be used for many object sizes (at least for the top tier of a generational method). Here allocations are only a few assembly-language instructions, there is zero overhead per reference assignment or PMC value modification, and we get safe XS code (no memory leaks, no premature freeing). A generational architecture theoretically minimizes the overhead of large memory allocations, since the main disadvantage is the copying overhead, and we've dramatically minimized the amount of copied data.
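For reference, the stack-carver allocation for the top tier really is only a few instructions. Sizes and names here are illustrative:

    /* Sketch of a "stack carver" (bump-pointer) allocator for the
     * top tier of a generational copying GC. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint8_t *base;   /* start of the tier's heap */
        uint8_t *top;    /* next free byte: the allocator's only state */
        uint8_t *limit;  /* end of the tier's heap */
    } tier_heap;

    void *tier_alloc(tier_heap *h, size_t size) {
        size = (size + 7) & ~(size_t)7;   /* 8-byte alignment */
        if (h->top + size > h->limit)
            return NULL;                  /* heap full: trigger a reclaim */
        void *p = h->top;
        h->top += size;                   /* carve off the top: no free
                                             list, no per-size buckets */
        return p;
    }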
I didn't see benchmarks in the paper, so in my mock whiteboard designs I've already worked out a scheme where, for a sample execution run (with 66% of variables retained across each reclamation period), there was 136% overhead for my generational copier versus 200% copying overhead with a raw copying collector. Coupled with the cache coherence, this isn't too bad.

Real Time, read/write barriers:
The next problem is that of "real-time" operation. This was actually the primary focus of the paper. It is extremely complex when you have multiple threads and an interactive environment which shouldn't periodically "pause" simply because an inner loop allocs and releases references to objects. The larger the memory frame, the longer the pauses will be, and there's little that can be done. All the incremental schemes I've read about impose greater overhead, which reduces total throughput; worse, some run the risk of playing catch-up: if the GC thread runs slower than the mutator thread(s) (the threads that do work and modify root-set contents / allocate more objects), then the GC may never complete a single pass, thereby not reclaiming _any_ memory (short of bringing all mutators to a halt due to memory starvation). More importantly, most "background" GC techniques I've read about require what are called reader/writer barriers, because the detection/marking stage has the GC navigating root sets that are potentially changing under concurrent mutators. To prevent this, either ALL reads of collectable memory must involve extra code, or all pointer modifications must. Given our register style, P1 = P2 op P3, it would make sense that a writer barrier would be more efficient (one piece of extra code instead of two), but reader barriers are usually implemented in hardware (read-protecting an entire page, then having the OS trap accesses and pass control to GC handlers). In those special cases (unlikely for us), such reader barriers are blazingly fast in the general case, and only slow while a GC is evaluating a given page. Here is where the similarity to reference counting comes into play: ref counting is a close cousin of reader/writer barriers. Given our requirements, I seriously doubt that we can rely upon ANY code that requires reader/writer barriers, since XS code would also need them (even if it went through API functions), and we'd not have escaped XS hell.

No barriers for us?:
Unless Dan or whoever else believes it's worth relegating barriers to API macros/functions, GCing will have to do something similar to "snapshot-at-beginning", which does grunt work at memory allocation time (possibly due to memory starvation), then adds a job to a separately threaded GC thread to use the marked snapshot. For primitive copying collectors there are no mark bits; data is simply copied as discovered. This obviously conflicts, and requires either hybrid algorithms or forgoing the separately threaded GC reclaimer/compactor.

Shared memory:
Assuming a desirable algorithm exists that allows partial workload within a separate thread, then for any modification of an object's position, reader/writer barriers are needed. Shared memory objects (which require locks anyway) could very easily implement this. A lock is a more expensive version of a barrier, but this is a sunk cost (in economics terms). Since we're allowing multiple interpreters and thus threads, there is the requirement that one thread's GC run can not modify another thread's data (via compaction/freeing/etc.) without such barriers.
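For example, a software write barrier in our register style might look roughly like this. It's a snapshot-at-beginning-flavored sketch; none of these names are real Parrot API, and the gc_remember hook is assumed:

    /* Sketch of a writer barrier: only pointer *stores* pay the cost,
     * plain reads stay barrier-free. */
    typedef struct pmc PMC;

    extern int  gc_marking_in_progress;
    extern void gc_remember(PMC *old_target);  /* assumed snapshot hook */

    static inline void write_barrier_store(PMC **slot, PMC *value) {
        if (gc_marking_in_progress && *slot != NULL)
            gc_remember(*slot);  /* record the overwritten edge so a
                                    concurrent mark phase can't miss a
                                    live object */
        *slot = value;
    }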
Thus (and it's probably obvious), each thread will require its own independent GC heap(s) (plural in the case of generational collectors). Given the minuscule overhead of allocating a heap and setting a "top" pointer, this should not have performance overhead, though it does hurt memory space requirements (the waste is minimized by using a miniature top-tier generational GC). Theoretically each thread wants multiple megs of data, and so long as none is explicitly "shared" (via locks, etc.), there can be no memory-usage swapping between threads. Each thread's heap will most likely be "grow-only", since it's difficult to shrink efficiently (you can't easily speculate about the required memory sizes without wasteful analysis passes).

Incremental (real-time):
Incremental techniques are noteworthy, but I haven't yet fully understood them. A simple [perversion of one] algorithm is that there is no separate copying thread, nor a periodic copying/compacting run; instead, each allocation copies/frees exactly one object, thereby making room for subsequent memory allocations. The algorithm seems deceptively simple, since whenever I try to run through the details in my head, there are lots of gotchas. In theory this is real-time, since we're amortizing the costs across every allocation. Unlike ref-counting, however, we incur little or no cost for general usage (read/write barriers were heavily discussed, which obviously involve usage overhead). The devil's in the details, though, since my guess is that these incremental operations DO require "SOME" work to be done in the barrier (perhaps the marking or copying). A sketch of the one-object-per-allocation idea follows the next two segments.

Rant:
Given current systems, I'm willing to sacrifice SOME memory allocation efficiency to make a faster algorithm, but it's important to note that simple copying collectors can never have enough memory to avoid inefficient thrashing (copying), since at some point you're relegated to page-thrashing, and the paper suggests that all the evidence weighs against this approach. So you can't just give it some ungodly memory size like 4 meg to start with (like some Java implementations), double its size on each starvation, and hope the problem goes away.

My general suggestion:
Right now, with my current level of knowledge, a generational copying GC with some as-yet-undetermined method of handling the bottom-tier heaps seems to be the best. You do pause, but if the top tier is something like 1K, the pauses should be very short (albeit many). Larger memory allocations (several K in size) should most likely use linked lists and, optionally, compaction, since we're unlikely to see regular allocations of size 50K (you'd quickly run out of memory on most systems anyway, even if it were all freeable).

Vmem ideal:
Just as a side note, I'm _really_ excited about a concept in the works: using vmem AS the GC. This assumes knowledge found in my previous email: http://archive.develooper.com/perl6-internals%40perl.org/msg06099.html Thus far in my analysis, vmem has only one fatal flaw: slabs with a large number of objects are likely to have poor distribution properties, and thus have the potential to starve neighboring slabs. For example, if I allocate a million 8-byte objects, then free every other one, I don't have any memory left (even after consolidation) for any other object size (I instead have to go to the OS for more). Granted, eventually, when freeing those objects, I'll release pages back to the vmem core, which can be efficiently reused by other object sizes.
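Here's the promised sketch of the one-object-per-allocation incremental idea, in the spirit of an incremental copying collector. The collector hooks are left as assumed externs, since those are exactly the details I haven't worked out:

    /* Sketch: amortize collection work across allocations. */
    #include <stddef.h>

    extern void *bump_alloc(size_t size);      /* assumed top-tier carver */
    extern void *next_unscanned_object(void);  /* assumed GC work queue */
    extern void  evacuate(void *obj);          /* assumed: copy obj to
                                                  to-space and leave a
                                                  forwarding pointer */

    void *incremental_alloc(size_t size) {
        /* Before handing out new memory, advance the collector by
         * exactly one object's worth of work. */
        void *pending = next_unscanned_object();
        if (pending != NULL)
            evacuate(pending);
        return bump_alloc(size);
    }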
The spark of interest I got was that with GC, we're allowed to relocate objects. Since we have externally allocated handles with at least some ability to reference them, we could easily address this "freeing every other object" problem. Before we allocate new segments from the OS (via sbrk or mmap), we call "reclaim", which frees cached values in all the depots. But there's no guarantee that we have enough contiguous free objects to relinquish a page-sized slab from neighboring object stores. BUT, if we're allowed to relocate, we can check all the slabs (of which there are only a couple dozen) and determine whether a particular buffer is "mostly" empty while there are free slots in neighboring buffers. If we can somehow move the remaining live objects out of the slab and free it, then we won't even have to reclaim depot-cached objects (which would otherwise reduce run-time efficiency by reducing the "average case" utilization). The easiest way this could work is if objects had handles, and had pointers back to their associated handles. If the handle happened to be owned by the thread making the memory request (the one that starved memory in the first place), then it simply copies the data and moves the pointer in the associated handle. If it's in another thread, it would require locking of the handle (or possibly utilize ...). Currently, the only way this works is for single threading or "shared" objects. Having a pointer back to the handle is a bit of overhead, and also not compatible with general vmem execution. Variations involve performing this "compaction" in the reclamation stage, or having independent slabs (and thus depots and magazines) for "relocatable memory". The latter would involve any generic memory that had [lockable] handles. In the spirit of vmem, callback functions would be provided in the configuration of such slabs/depots, which would correctly manipulate arbitrary handle types. Given that such callbacks would be rare (invoked only when the slab compactor believes it can free an entire page), the overhead should be minimal.

Vmem + shared mem:
Currently this seems seriously plausible for shared PMCs/strings. One nice thing is that, aside from the mutex locking, this is very fast and efficient, even compared to the copying collectors. vmem is highly cache-friendly, only requires a handful of assembly-language instructions for allocations in the average case, and is bounded in the worst case. The only time copying collectors beat this vmem approach in performance is when some large percentage of all allocated memory is suddenly freed: a copying GC would quickly find the remaining objects, copy them, and be done, while this approach requires individually freeing each remaining object. But since shared objects are more complex, that is probably required anyway. If we do use something like vmem, and augment it for use with synchronized relocatable objects, we avoid a lot of extra complexity/cost for shared memory, as well as extra memory wasted on independent heaps.

So long, and thanks for all the fish...

-Michael
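P.S. A rough sketch of the handle back-pointer relocation, for concreteness. None of these are real vmem structures, and I'm assuming the handle's data pointer targets the slot itself:

    /* Sketch: evacuate one live object from a mostly-empty slab into a
     * free slot of a neighboring slab, so the source slab's page can be
     * returned to the vmem core. Caller owns (or has locked) the handle. */
    #include <string.h>

    typedef struct handle {
        void *data;          /* the only pointer the program uses */
        /* ... flags, lock, etc. ... */
    } handle;

    typedef struct slab_slot {
        handle *owner;       /* back-pointer from object to its handle */
        /* ... object payload follows ... */
    } slab_slot;

    void relocate_slot(slab_slot *src, slab_slot *dst, size_t slot_size) {
        memcpy(dst, src, slot_size);   /* move payload + back-pointer */
        dst->owner->data = dst;        /* retarget the handle: the program
                                          never sees the old address again */
    }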