On Sunday 11 November 2001 05:41 pm, Michael G Schwern wrote:
> While at JAOO, Andy Hunt told me about a little trick Matsumoto is
> (was?) trying out for Ruby to speed up its garbage collection.  It
> goes something like this (keeping in mind I know very little about GC).
>
>
> Assumption: Most variables have only one reference (themselves).
>
> So when a variable is first created, all it has is a single bit
> representing its refcount.  When allocated, it's turned on.  When it
> falls out of scope it's flipped off and swept away.
>
> When a variable is referenced a second time, it's transmogrified into
> a new structure that can support the necessarily more involved garbage
> collection system.
>
>
> Basically, for simple variables with no external references, you can
> perform a very quick allocation/deallocation model.  It's only
> necessary to apply the big guns and extra memory for variables with
> external references.
>
>
> This is the full extent of my understanding, so I'll just throw it out
> there and hope someone runs with it.

What follows is unfortunately rather long; I've tried to go back and label 
common segments for interest and reference.

White papers:
I've been very eagerly researching GC techniques, reading whatever white 
papers I could get my hands on (mostly linkable via previous posts, or 
acm.com).  I'm by no means finished, but here's what I've discovered so far:

Counting revisited:
Most memory freeing requires SOME extra inlined code.  Reference counting, 
as it turns out, is just a variation of garbage collection, most forms of 
which use approximations for determining reclaimable objects.  Ref counting 
is also an approximation, since things like circular lists prevent some 
objects from ever being reclaimed.

The main problem with pure ref-counting is that every memory reference 
requires fiddling with if-statements, adds, and subtracts.  This can be 
somewhat alleviated by having the compiler determine when a variable should 
be created and destroyed (if no refs are ever made to it), though this is 
akin to a stack-based auto-variable.  If ref-counting is only an 
approximation (unlike perl5), as with the 1-bit method (which was detailed 
in some papers I read), then you still have the overhead of 
reclaiming/freeing/consolidating.
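
To make the idea concrete, here's a minimal sketch in C of how such a 1-bit 
scheme might look (the names are mine, not from any paper, and a real 
implementation would pack the bit into a pointer tag or the object header 
rather than burn a whole word on it):

    #include <stdlib.h>

    typedef struct obj {
        unsigned shared;     /* 0 = exactly one reference; 1 = promoted */
        unsigned long rc;    /* full refcount, only used once promoted  */
        /* ... payload ... */
    } obj_t;

    obj_t *obj_new(void) {
        return calloc(1, sizeof(obj_t));   /* shared = 0: single owner */
    }

    /* Taking a second reference promotes the object to full counting. */
    void obj_ref(obj_t *o) {
        if (!o->shared) {
            o->shared = 1;
            o->rc = 2;       /* the original owner plus this reference */
        } else {
            o->rc++;
        }
    }

    /* Dropping a reference (or the owner falling out of scope). */
    void obj_unref(obj_t *o) {
        if (!o->shared || --o->rc == 0)
            free(o);     /* common case: never promoted, free at once */
    }

Most objects never take the promotion branch, so the common path is a 
single flag test on creation and destruction.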

Cost measurement:
The cost of an allocator can be based on the cost of allocation, freeing, 
reclaiming, copying/compaction etc. (This is a terse review, so I could be 
missing stuff).  If there is a lot of allocated memory, then copying GCs will 
be very expensive, but so will mark/sweep collectors.  If there are a lot of 
accesses (reads/writes) relative to the amount of allocation space, then 
ref-counting will be very expensive (but so are several other GC techniques, 
discussed below).  In a given time period, if there are more frees than 
allocations, then copying GCs will be faster than mark-sweep / ref-counting.  
Ref-counting has very good cache coherency (when coupled with a good memory 
manager like vmem), while copying collectors will regularly page-thrash as 
they navigate every page in each heap (also requiring more total memory).   
For small applications that have bounded memory requirements, copying 
collectors have very little overhead, very little slack space, and keep the 
usable memory in L2/L3 cache.

Mark and Sweep:
Mark and sweep "compacting" is good for large memory objects which would take 
a long time to copy.  A popular algorithm is to have a doubly linked list for 
every memory object within this arena (ideally of a fixed size).  When memory 
is starved, navigate the root set, but instead of copying objects, you 
transition them from the "from-linked-list" to the "to-linked-list".  When 
you're done, you add any remaining items in the from-linked-list to the 
free-list.  Finally, you put the to-linked-list back into the 
from-linked-list so the operation can be later repeated.  You avoid the 
overhead of copying (in a copying gc), and the overhead of freeing (in 
reference counting), but you're plagued with internal fragmentation.  
Additionally, you need an extra 2 pointers for each memory object (for the 
linked list).  If the memory objects were the same size, then you avoid 
fragmentation issues (but that's harder to do with larger memory sizes, 
unless you use geometric sizes).  The paper suggests that variations on the 
copying collector theme are ideal for small memory sizes (where copying 
is quick compared to updating pointers), and using non-copying methods for 
larger memory.  Further, if a heap is homogeneous and single-threaded (which 
includes having one heap per thread), then the linked-list method can 
utilize compaction to remove fragmentation / slack.  This will occur much 
less often than the full copying of a copying GC, and older data will 
migrate towards the front of the heap, thereby preventing having to slide 
the entire heap.
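
A rough sketch of that from-list/to-list scheme in C (intrusive circular 
lists with a sentinel; the names and structure are my own invention, and 
tracing of payload pointers is elided):

    typedef struct node {
        struct node *prev, *next;
        int marked;
        /* ... payload ... */
    } node_t;

    typedef struct { node_t head; } list_t;   /* circular, with sentinel */

    static void list_init(list_t *l) {
        l->head.prev = l->head.next = &l->head;
    }

    static void unlink_node(node_t *n) {
        n->prev->next = n->next;
        n->next->prev = n->prev;
    }

    static void push_node(list_t *l, node_t *n) {
        n->next = l->head.next;
        n->prev = &l->head;
        l->head.next->prev = n;
        l->head.next = n;
    }

    /* "Marking" is just moving a node between the two lists. */
    void mark(list_t *to, node_t *n) {
        if (n->marked)
            return;
        n->marked = 1;
        unlink_node(n);
        push_node(to, n);
        /* ... recurse into any pointers the payload holds ... */
    }

    /* After the root set is traced, everything left on the from-list is
       garbage: splice it onto the free list wholesale.  The to-list then
       becomes the from-list for the next cycle (clearing marked bits as
       nodes are handed back out). */
    void sweep(list_t *from, list_t *freelist) {
        while (from->head.next != &from->head) {
            node_t *dead = from->head.next;
            unlink_node(dead);
            push_node(freelist, dead);
        }
    }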

Generational copying collectors:
The more sophisticated algorithms use what are called generational copying 
collectors.  Similar to the compactor variation above, we want to take 
advantage of the fact that younger objects have a higher probability of 
being reclaimed than older objects.  In other words, the more times an 
object survives a garbage collection, the less likely it is to need 
rechecking for reclamation.  One of the approaches that I'm currently 
reading about (used in many Lisp architectures) involves using multiple 
heaps (and forgoing the "semi-space" copy buffer of a raw copying 
collector).  In the simplest case, each time the top-most heap is full, a 
reclamation phase is initiated, and valid objects _in that heap_ are copied 
into a lower heap generation.  If the lower heap then fills, a reclamation 
occurs for _that_ heap to push still-live values even further down.  The 
main idea is that old values (those of previous reclamation generations) are 
not checked for liveness (we play the odds that we're less likely to need 
to), and most importantly, we don't _copy_ them needlessly.  We effectively 
produce a linked list of heaps where the further down the list you go, the 
older the objects are.  Typically, algorithms don't pass values down on the 
first reclamation, so as to avoid excessive heap depth.  I'm currently 
working on one implementation that uses a semi-space (as in a generic 
copying collector) for the first tier.  Subsequent tiers are twice the size, 
so that multiple reclaims are guaranteed at each stage.  The paper suggests 
that there should be a finite number of tiers, and that the bottom level is 
a special case (probably utilizing compaction instead of copying).  There 
are several advantages to this [overall architecture].  During the 
marking/reclaiming phase, each handle (PMC / String structure) has its 
reference address checked against the tier level.  If it's known not to be 
part of the collection, then we don't modify its contents (or even read from 
it, unless it's an indirection, like an array).  This allows old [rarely 
accessed] variables to be paged out to disk.  To further this, objects that 
are known not to contain pointers can be placed into their own heaps to 
prevent ANY page-swapping during garbage collection (reading the pointers 
would require paging back into memory on each root-set sweep).  Other than 
memory paging, the active memory is the top tier, which is relatively small, 
and thus fully cacheable.
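
Here's a sketch of the tiered layout I have in mind (purely illustrative 
names; root-set tracing and forwarding pointers are elided):

    #include <stdlib.h>
    #include <string.h>

    #define NTIERS 4

    typedef struct {
        char *base, *top, *end;   /* bump arena: top is the next free byte */
    } tier_t;

    static tier_t tiers[NTIERS];

    void tiers_init(size_t tier0_size) {
        size_t sz = tier0_size;
        for (int i = 0; i < NTIERS; i++, sz *= 2) {  /* each tier doubles */
            tiers[i].base = tiers[i].top = malloc(sz);
            tiers[i].end  = tiers[i].base + sz;
        }
    }

    /* Promote one surviving object from tier n into tier n+1.  A real
       collector would also leave a forwarding pointer at the old address
       so other references can be updated; the bottom tier (NTIERS-1)
       would be handled specially (compaction rather than copying). */
    void *promote(int n, void *obj, size_t size) {
        tier_t *dst = &tiers[n + 1];
        if ((size_t)(dst->end - dst->top) < size) {
            /* tier n+1 is itself full: collect it first, pushing its
               survivors down into tier n+2, then retry */
        }
        void *copy = dst->top;
        memcpy(copy, obj, size);
        dst->top += size;
        return copy;
    }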

Copier advantages:
One of the most obvious advantages of a copying collector is that a stack 
carver can be used for allocation.  This suggests that a single heap should 
be used for many object sizes (at least for the top tier of a generational 
method).  Here allocations are only a few assembly-language instructions, 
there is zero overhead per reference assignment or PMC value modification, 
and we obtain safe XS code (no memory leaks, no premature freeing).  
Generational architectures theoretically minimize the overhead of large 
memory allocations, since the main disadvantage is the copying overhead, and 
we've dramatically minimized the amount of copied data.  I didn't see 
benchmarks in the paper, but in my mock white-board designs I've already 
determined a scheme in which, for a sample execution run (66% of variables 
retained across each reclamation period), there was 136% overhead for my 
generational copier versus 200% copying overhead with a raw copying 
collector.  Coupled with cache-coherence, this isn't too bad.
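
The stack carver itself is trivial; something like this sketch (with the 
collection trigger stubbed out):

    #include <stddef.h>

    static char *heap_top;    /* next free byte              */
    static char *heap_end;    /* one past the last usable byte */

    void *gc_alloc(size_t size) {
        size = (size + 7) & ~(size_t)7;    /* keep 8-byte alignment */
        if ((size_t)(heap_end - heap_top) < size)
            return NULL;  /* would trigger a collection here, then retry */
        void *p = heap_top;
        heap_top += size;
        return p;
    }

In the common case that's one add, one compare, and one store.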

Real Time, read/write barriers:
The next problem is that of "real-time" operation.  This was actually the 
primary focus of the paper.  It is extremely complex when you have multiple 
threads and an interactive environment which shouldn't periodically "pause" 
simply because an inner loop allocs and releases references to objects.  The 
larger the memory frame, the longer the pauses will be, and there's little 
that can be done about it.  All the incremental schemes I've read about 
impose greater overhead, which reduces total throughput, and what's worse, 
some run the risk of playing catch-up: if the GC thread runs slower than the 
mutator thread(s) (the threads that do work and modify root-set contents / 
allocate more objects), then the GC may never complete a single pass, 
thereby not reclaiming _any_ memory (without bringing all mutators to a halt 
due to memory starvation).  More importantly, most "background" GC 
techniques I've read about require what are called read/write barriers.  
This is because the detection/marking stage has the GC navigating root sets 
that are potentially changing due to concurrent mutators.  To prevent this, 
either ALL reads of collectable memory must involve extra code, or all 
pointer modifications must.  Given our register style, P1 = P2 op P3, it 
would make sense that a write barrier would be more efficient (one piece of 
extra code instead of two), but read barriers are usually implemented in 
hardware (protecting an entire page, then having the OS trap accesses and 
pass control to GC-handlers).  In these special cases (unlikely for us), 
such read barriers are blazingly fast in the general case, and only slow 
while a GC is evaluating a given page.  Here is where the similarity to 
reference counting comes into play: ref counting is a close cousin of 
read/write barriers.  Given our requirements, I seriously doubt that we can 
rely upon ANY code that requires read/write barriers, since XS code would 
also need them (even if it went through API functions), and we'd not have 
escaped XS-hell.
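
A software write barrier might look like this sketch (remember_slot is a 
hypothetical hook for whatever the chosen algorithm needs: greying the 
target, marking a card, appending to a store buffer, etc.):

    typedef struct pmc pmc_t;

    /* Hypothetical GC hook: record that this slot was mutated. */
    extern void remember_slot(pmc_t **slot);

    #define WRITE_BARRIER(slot, value) \
        do { remember_slot(&(slot)); (slot) = (value); } while (0)

    /* A register-style assignment P1 = P2 then funnels through it
       (op name invented for illustration): */
    void op_set_p_p(pmc_t **p1, pmc_t *p2) {
        WRITE_BARRIER(*p1, p2);
    }

The point is that every pointer store in the core AND in XS code would 
have to go through that one macro, which is exactly the discipline I doubt 
we can enforce.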

No barriers for us?:
Unless Dan or whoever else believes that it's worth relegating barriers to 
API macros/functions, GCing will have to do something similar to 
"snapshot-at-beginning", which does grunt-work at memory-allocation time 
(possibly due to memory starvation), then adds a job for a separately 
threaded GC-thread to utilize the marked snapshot.  For primitive copying 
collectors, there are no mark-bits; data is simply copied as discovered.  
This obviously conflicts, and either requires hybrid algorithms or forgoing 
the separately threaded GC-reclaimer/compactor.

Shared memory: 
Assuming a desirable algorithm exists that allows a partial workload within 
a separate thread, then for any object-position modification, read/write 
barriers are needed.  Shared-memory objects (which require locks anyway) 
could very easily implement this.  A lock is a more expensive version of a 
barrier, but this is a sunk cost (in economics terms).  Since we're allowing 
multiple interpreters and thus threads, there is the requirement that one 
thread's GC run cannot modify another thread's data (via 
compaction/freeing/etc.) without such barriers.  Thus (and it's probably 
obvious), each thread will require its own independent GC heap(s) (plural in 
the case of generational).  Given the minuscule overhead of allocating a 
heap and setting a "top" pointer, this should not have performance overhead, 
though it does hurt memory space requirements (memory waste is minimized by 
using a miniature top-tier generational GC).  Theoretically each thread 
wants multiple megs of data, and so long as none of it is explicitly 
"shared" (via locks, etc.), there can be no memory-usage swapping between 
threads.  Each thread's heap will most likely be "grow-only", since it's 
difficult to shrink efficiently (you can't easily speculate as to the 
required memory sizes without wasteful analysis passes).

Incremental (real-time):
Incremental techniques are noteworthy, but I haven't yet fully understood 
them.  A simple [perversion of one] algorithm is that there is no separate 
copying thread, nor is there a periodic copying/compacting run.  Each 
allocation copies/frees exactly one object, thereby making room for 
subsequent memory allocations.  The algorithm seems deceptively simple, 
since whenever I try to run through the details in my head, there are lots 
of gotchas.  In theory, this is real-time, since we're amortizing the costs 
across every allocation.  Unlike ref-counting, however, we incur little or 
no cost for general usage (read/write barriers were heavily discussed, which 
obviously involve usage overhead).  The devil's in the details, though, 
since my guess is that these incremental operations DO require "SOME" work 
to be done in the barrier (perhaps the marking or copying).
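
In sketch form, with all of the hard parts hidden behind a hypothetical 
gc_step:

    #include <stddef.h>

    extern void *raw_alloc(size_t size);  /* underlying allocator        */
    extern int   gc_step(void);           /* scan/copy/free ONE object;
                                             returns 0 when a pass ends  */

    /* Every allocation pays for one fixed-size slice of collector work,
       so no single pause grows with the size of the heap. */
    void *incremental_alloc(size_t size) {
        gc_step();
        return raw_alloc(size);
    }

All of the gotchas live inside gc_step, of course, which is exactly where 
my head starts hurting.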

Rant:
Given current systems, I'm willing to sacrifice SOME memory-allocation 
efficiency to make a faster algorithm, but it's important to note that 
simple copying collectors can never have enough memory to avoid inefficient 
thrashing (copying), since at some point you are relegated to 
page-thrashing, and the paper suggests that all the evidence argues against 
this approach.  So you can't just give it some ungodly memory size like 
4 meg to start with (like some Java implementations), then double its size 
on each starvation, and hope the problem goes away.

My general suggestion:
Right now, with my current level of knowledge, a generational copying GC 
with some as-yet-undetermined method of handling the bottom-tier heaps seems 
to be the best.  You do pause, but if the top tier is around 1K, then the 
pauses should be very short (albeit many).  Larger memory allocations 
(several K in size) should most likely utilize linked lists and optionally 
compaction, since we're unlikely to see regular allocations of size 50K 
(you'd quickly run out of memory on most systems anyway, even if it were all 
fully freeable).

Vmem ideal:
Just as a side note, I'm _really_ excited about a concept in the works: 
namely, using vmem AS the GC.  This assumes knowledge found in my previous 
email:
http://archive.develooper.com/perl6-internals%40perl.org/msg06099.html
Thus far in my analysis, vmem has only one fatal flaw: slabs with a large 
number of objects are likely to have poor distribution properties and thus 
have the potential of starving neighboring slabs.  For example, if I 
allocate a million 8-byte objects, then free every other one, I don't have 
any memory left (even after consolidation) for any other object size (I 
instead have to go to the OS for more).  Granted, eventually, when freeing 
those objects, I'll release pages back to the vmem core which can be 
efficiently reused by other object sizes.  The spark of interest I got was 
that with GC, we're allowed to relocate objects.  Since we have externally 
allocated handles with at least some ability to reference them, we could 
easily address this "freeing every other object" problem.  Before we 
allocate new segments from the OS (via sbrk or mmap), we call "reclaim", 
which frees cached values in all the depots.  But there's no guarantee that 
we have enough contiguous free objects to relinquish a page-sized slab from 
neighboring object stores.  BUT, if we're allowed to relocate, we can check 
all the slabs (of which there are only a couple dozen) and determine whether 
a particular buffer is "mostly" empty while there are free slots in 
neighboring buffers.  If we can somehow move the remaining allocated objects 
out of such a slab and free it, then we won't even have to reclaim 
depot-cached objects (which would otherwise reduce run-time efficiency by 
reducing the "average case" utilization).  The easiest way this could work 
is if objects had handles, and had pointers back to their associated 
handles.  If the handle happened to be owned by the thread making the memory 
request (the one that starved memory in the first place), then it simply 
copies the data and moves the pointer in the associated handle.  If it's in 
another thread, it would require locking of the handle.  Currently, the only 
way this works is for single threading or "shared" objects.  Having a 
pointer back to the handle is a bit of overhead, and also not compatible 
with general vmem execution.  Variations involve performing this 
"compaction" in the reclamation stage, or having independent slabs (and thus 
depots and magazines) for "relocatable memory".  The latter would involve 
any generic memory that had [lockable] handles.  In the spirit of vmem, 
callback functions would be provided at the configuration of such 
slabs/depots, which would correctly manipulate arbitrary handle types.  Such 
callbacks would be rare (only invoked when the slab compactor believes it 
can free an entire page), so their cost should be negligible.
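
A sketch of the handle back-pointer idea (all names invented; cross-thread 
locking elided):

    #include <string.h>

    typedef struct handle handle_t;

    typedef struct body {
        handle_t *owner;   /* back-pointer to the one handle that owns us */
        size_t    size;    /* total size, header included */
        /* ... payload follows ... */
    } body_t;

    struct handle {
        body_t *body;      /* the only pointer into the slab */
    };

    /* Evacuate one live object from a mostly-empty slab into dst, then
       patch its handle.  Once every live body has been moved out, the
       whole slab page can be returned to the vmem core. */
    void relocate(body_t *obj, void *dst) {
        memcpy(dst, obj, obj->size);
        body_t *moved = dst;
        moved->owner->body = moved;
    }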

Vmem + shared mem:
Currently this seems seriously plausible for shared PMCs/strings.  One nice 
thing is that, aside from the mutex locking, this is very fast and 
efficient, even compared to the copying collectors.  vmem is highly 
cache-friendly, only requires a handful of assembly-language instructions 
for allocations in the average case, and is bounded in the worst case.  The 
only time copying collectors beat this vmem approach in performance is when 
some large percentage of all allocated memory is suddenly freed: copying GCs 
would quickly find the remaining objects, copy them, and be done, while this 
approach requires individually freeing each remaining object.  But since 
shared objects are more complex, this is probably required anyway.  If we do 
utilize something like vmem, and augment it for use with synchronized 
relocatable objects, we avoid a lot of extra complexity / cost for shared 
memory, as well as extra memory wasted on independent heaps.

So long and thanks for all the fish....
-Michael
