I've done a bunch of reading, and though I'm not finished, I'm starting to
look towards the following overall algorithm, based on the assumptions
specified below.  I'm not _necessarily_ looking for comments at this point,
because I haven't finished evaluating the specifics of several algorithms,
but feel free to comment.

Foreword:
I'm looking at a hybrid of several algorithms, since different size regions
have different advantages and disadvantages.  The typical downside of
hybridizing is extra overhead in the inner loop; here, the cost is in
a greater number of branches.  You can skip ahead to the posted algorithm
pseudo-code for a quick overview.


Since we're inlining the data-size, I'm selecting which sub-algorithm to
use based on the size.  The data-type (be it a raw character buffer
like a string, or a nested data-structure such as an Array) will have
to be known and thus interleaved within the GC for now.  What I'd like to do
is define vtable-ish ops for each data-type and have the perl make-script
generate the core routine by concatenating the different types.
I don't want the overhead of a vtable function call, but I also don't want
to expose the garbage collector to extension writers.  More to come
on this.
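
As a rough sketch of what I mean (in C, with purely illustrative names and
a made-up PMC layout -- nothing here is the real parrot API), the per-type
ops get pasted into the core routine by the make-script instead of being
reached through a function pointer:

#include <stddef.h>

/* Illustrative stand-ins only -- the real PMC layout and marking entry
 * point are parrot's, not these. */
typedef struct PMC PMC;
struct PMC {
    int    type;
    size_t elems;
    PMC  **data;
};
enum { TYPE_STRING, TYPE_ARRAY };
void gc_mark_live(PMC *obj);            /* assumed to exist elsewhere */

/* Per-type ops as macros so the make-script can paste them straight into
 * the core routine -- no vtable function call in the inner loop. */
#define GC_TRACE_STRING(obj)  ((void)0)   /* raw buffer: nothing to descend */
#define GC_TRACE_ARRAY(obj)                           \
    do {                                              \
        for (size_t i_ = 0; i_ < (obj)->elems; i_++)  \
            gc_mark_live((obj)->data[i_]);            \
    } while (0)

static void gc_trace_object(PMC *obj)
{
    switch (obj->type) {
    case TYPE_STRING: GC_TRACE_STRING(obj); break;
    case TYPE_ARRAY:  GC_TRACE_ARRAY(obj);  break;
    /* ... one case per data-type, emitted by the make-script ... */
    }
}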

Small data-types ( size <= 256 Bytes):
Ideally, there's zero header-space due to the relative space-cost, and
since we're already using the size this should be possible.  For now, I'm
looking towards a generic copying-collector.  Generational schemes can
require additional header-space (either for marking or for
boundary-references).  The only advantage of making this generational would
be to minimize the scan depth (most generational algorithms don't descend
through data-structures living in aged heaps), but this requires a
write-barrier, and I'm currently opposed to that (it defeats much of the
advantage of abstracted garbage collection over ref-counting).  I'm still
considering this for terminal (non-descending) data-types.  I figure small
data-types will be the most common by far, and so the less work done per
object the better.  Further, since each object will probably be visited on
each reclamation stage, we don't get to take advantage of
anti-swap-thrashing techniques without imposing relatively high
out-of-band overhead.  The associated heap is grown (doubled in size)
whenever less than 33% of the heap is available after a reclamation.
Unfortunately we don't know that we want to grow the heap until we've
finished reclaiming, and a regrowth involves another copying phase, so
it's deferred until the next collection.  I found that this isn't so bad
if we keep the max data-size much less than 33% of the heap.  I'm looking
at 64K as a starting heap-size, thus 256B is much less than 21K.
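
As a minimal sketch of that deferred-growth policy (names and the exact
bookkeeping are hypothetical; it just shows the 33%/doubling decision being
made after a reclamation and acted on at the start of the next one):

#include <stddef.h>
#include <stdbool.h>

/* Hypothetical small-object semi-space state; names are illustrative. */
typedef struct {
    size_t heap_size;     /* bytes in the active semi-space (64K to start) */
    size_t bytes_live;    /* bytes copied to to-space this reclamation     */
    bool   should_grow;   /* set now, acted on at the *next* collection    */
} SmallHeap;

/* End of a reclamation: only now do we know how full the heap is, so the
 * actual growth is deferred rather than re-copying immediately. */
static void small_heap_check_growth(SmallHeap *h)
{
    size_t free_bytes = h->heap_size - h->bytes_live;
    if (free_bytes * 3 < h->heap_size)       /* less than 33% free */
        h->should_grow = true;
}

/* Start of the next gc_reclaim(), before any copying happens. */
static void small_heap_maybe_grow(SmallHeap *h)
{
    if (h->should_grow) {
        h->heap_size *= 2;                   /* doubling policy */
        h->should_grow = false;
        /* ...allocate new semi-spaces of heap_size here... */
    }
}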

Medium data-types ( 256 < size <= page-size ):
Here we can afford header-prefixes, while copies are more expensive
(relative to other operations).  Allocations in this region can be pretty
erratic and thus hard to coalesce, so I'm looking at using a generational
scheme.  Unfortunately many such schemes "age" the wrong data, or require
a lot of extra copying.  Some of the nicer schemes only require comparison
to a water-mark (below which an object is considered aged, and above which
the object is considered young and thus likely to expire soon). Such
watermark schemes allow adaptive modification of the water-mark(s).  In
other words, when we find that most data is not dying, we want to raise
the water-mark (to avoid unneeded copying).  When we find that a lot of
aged data is dying, we want to lower the water-mark (to alleviate garbage
in the aged region).  I'm currently fiddling with a compacting collector
here.  This requires two stages.  The first stage applies a "version" tag
prefixed to each object (a 4-byte integer where possible).  On each
execution of gc_reclaim, the version number is incremented.  In the
Dead-Object-Detection stage (which is also where small
objects are copied out to their semi-space), the versions of all live objects
are updated.  Additionally, the allocated size of both the young and old
regions are calculated.  Unlike the copying-collector, we have the option
of growing the mid-sized heap before performing the copying/compaction;
providing more up-to-date and thus more accurate feedback.  If we don't copy,
then we only compact the "young" region.  Thus the advantage of aging
objects is to avoid copying them (at the expense of leaving fragments).
Alternatively, if we determine that there is too much free space in the
aged section, we can temporarily reset the water-mark and compact that
region too.  The big advantage of compaction is that the age is directly
related to how far up the heap we are; objects near the top are guaranteed
to be younger than those at lower levels.  The water-mark is therefore an
accurate representation of age.  Since this heap is efficiently growable,
I'm looking to start it at 16K (or some fraction of the semi-space), since
each thread requires its own local set of heaps.  The main disadvantage
of this scheme is that we access each live-object twice.  But since this
"should" be statistically more rare than small-object allocation, we
should see better overall performance (I have yet to verify this for our
needs).  I'm not done evaluating generational schemes, and I might still
be in favor of using a write-barrier to minimize the descent depth (when
arrays of arrays are used, for example).
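
Roughly, the version-tag / water-mark bookkeeping I'm fiddling with looks
something like this (a sketch only: the header layout and names are made up,
and the actual compaction step is omitted):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical medium-object header and heap state; names are mine. */
typedef struct {
    uint32_t version;    /* stamped with the current GC pass when live  */
    uint32_t size;       /* object size, needed for walking/compacting  */
} MedHeader;

typedef struct {
    uint32_t gc_version;  /* incremented on each gc_reclaim()            */
    char    *water_mark;  /* below: aged objects; above: young objects   */
    size_t   live_young;  /* live bytes above the water-mark (DOD tally) */
    size_t   live_old;    /* live bytes below the water-mark             */
} MedHeap;

/* DOD stage: stamp a reachable medium object as live and tally its region. */
static void med_mark_live(MedHeap *h, MedHeader *obj)
{
    if (obj->version == h->gc_version)
        return;                               /* already visited this pass */
    obj->version = h->gc_version;
    if ((char *)obj < h->water_mark)
        h->live_old   += obj->size;
    else
        h->live_young += obj->size;
}

/* After DOD, live_young/live_old tell us -- before any copying -- whether to
 * grow the heap, compact just the young region, or temporarily reset the
 * water-mark and compact the aged region as well. */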


Large Objects ( size > PAGE ):
Any header overhead should be acceptable here, both computationally and
space-wise, since we can't afford to leave huge memory chunks to
accumulate and rot.  Ideally we'd be able to "remap" pages so as to
maximize contiguous free space, but I don't think that's portable (or even
possible on Linux).  Allocations are relegated to the underlying
malloc/free-style memory manager.  It's likely that the pages of this size
will warrant consolidating the memory space of all threads.  In other
words, the cost of contending for malloc locks is overshadowed by the
potentially wasted space of having separate free-lists of objects of size
4K or more (oft-times up to a meg).  To allow DOD, I'm adding a simple
thread-specific doubly-linked list prefix.  Allocations occur on the
"from-list".  When garbage is being collected, live objects are moved to the
"to-list".  Note that multiple accesses to the same object merely move it
from the "to-list" back to the head of the "to-list", so we're fine.
When we're done with DOD, we have a second stage where we manually "free"
all remaining objects on the "from-list".  Since these objects are really large
and should be few and far between, this overhead should be acceptable,
especially considering that the cost of freeing some objects is minuscule
compared to that of moving large ones.  In the ideal form of this algorithm, the
remaining elements in the from-list are appended to a free-list in
constant time, but this assumes either that all objects are roughly the
same size, or that the objects were pre-segregated into separate
to/from/free-lists based on their size.  This would assume that the
underlying memory-manager was integrated with the garbage collector; an
assumption I'm not sure I want to make at this point.  Finally the to-list
is reattached to the from-list in constant time.  Preliminary tests
showed that significantly more memory was wasted by allocating/forgetting
large memory blocks than by raw allocating/forgetting of many smaller
blocks (one dead object of size 64K dwarfs a thousand dead 16-byte
objects).  I grafted a "gc_reclaim()" call onto the underlying memory-manager
prior to requesting more memory from the system.  The result
(surprisingly) was that garbage collection was initiated exclusively by
the underlying system, and NOT due to the filling of the
copying/compacting collector heaps.  Even though this was a synthetic
benchmark and I batch allocated 256K worth of pages from the OS on each VM
growth, I never had the other regions get more than 60% full.  I tested
this by putting a wrapper around the GNU malloc/free calls.  I assumed
optimal consolidation and simply modified an accumulator, and compared it
to a water-mark; this should be sufficient to abstract reclaiming from the
underlying memory subsystem. If the GC is merged with the mem-subsys, then
perhaps a rarely run compactor should be applied to any movable objects
within the VM.  Even if the VM shares objects from different threads, only
the objects in the current thread need be moved (albeit with considerably
less efficiency).
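
For the curious, here's a stripped-down sketch of the from-list/to-list
mechanics (the header layout and names are mine; it assumes the list prefix
and the payload are allocated as one malloc'd block):

#include <stdlib.h>

/* Hypothetical large-object list prefix; the payload follows this header
 * in the same malloc'd block. */
typedef struct LargeHdr {
    struct LargeHdr *prev, *next;
} LargeHdr;

typedef struct {
    LargeHdr from;   /* sentinel: objects allocated since the last GC */
    LargeHdr to;     /* sentinel: objects found live during DOD       */
} LargeLists;

static void large_lists_init(LargeLists *ll)
{
    ll->from.next = ll->from.prev = &ll->from;
    ll->to.next   = ll->to.prev   = &ll->to;
}

static void list_unlink(LargeHdr *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static void list_push(LargeHdr *head, LargeHdr *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* DOD: a reachable large object is moved to the to-list.  If it's already
 * there, this just re-links it at the head, so repeated visits are safe. */
static void large_mark_live(LargeLists *ll, LargeHdr *obj)
{
    list_unlink(obj);
    list_push(&ll->to, obj);
}

/* Stage 2: everything still on the from-list is garbage, so free it; then
 * the to-list becomes the new from-list in constant time. */
static void large_sweep(LargeLists *ll)
{
    LargeHdr *n = ll->from.next;
    while (n != &ll->from) {
        LargeHdr *next = n->next;
        free(n);                              /* header + payload, one block */
        n = next;
    }

    if (ll->to.next == &ll->to) {             /* to-list is empty            */
        ll->from.next = ll->from.prev = &ll->from;
    } else {                                  /* constant-time splice        */
        ll->from.next = ll->to.next;
        ll->from.prev = ll->to.prev;
        ll->from.next->prev = &ll->from;
        ll->from.prev->next = &ll->from;
    }
    ll->to.next = ll->to.prev = &ll->to;      /* reset to-list to empty      */
}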

Constant-sized objects (object handles):
Handles don't necessarily have to be fixed in position, but apparently that's
desired.  One advantage of fixed-size data is that we don't generally
need to consolidate anyway (i.e. we don't have to move them).  Another
advantage is that if we instead moved handles, references would need
"forwarding-pointers", which slows down reclamation.  Unfortunately
we still need to collect dead handles.  I'm really wanting to simply apply
a doubly linked list as with the large objects.  Since they're the same
size, then we can move the "to-list" directly to the "free-list" in
constant time.  Some algorithms I was considering earlier needed to have
access to ALL object-handles within these linked-lists, which was
frustrating since XS-code could locally declare an object and thus forgo
the list-prefix.  I think this is irrelevant with the above selected
algorithms, however, so we're good.  My unit testing has a fixed root-set
consisting of an array of handles, so I've avoided the issue of handle
reclaiming thus far.  In general, however, arena allocations have high
performance, even in GC'd heaps.  If our core (or even parrot-byte-code)
determines that a constant-sized object will be allocated regularly, and
the object is large enough to warrant a header, then a separate storage
area similar to handle-arenas could be used.  The only problem is that it
adds to the size of the switch statement that directs the DOD to the
appropriate data-area.  Theoretically we'd have a switch like this:

gc_reclaim() {

  if ( small_heap_should_grow ) { grow_small_heap() }

  for r in roots {
    { ... } // move r from from-list -> to-list

    // DOD
    PMCPtr* ptr = r->data;
    size = ptr->size;
    if ( size < GC_SMALL_SIZE ) { ... } // small object copying-collector
    else {
      switch ( size ) {
      case ARENA_SIZE1: { ... } // arena'd collectors
      case ARENA_SIZE2: { ... }
      ...
      default:
        if ( size < PAGE_SIZE ) { ... } // medium object (compactor)
        else { ... } // large object (alloc/free)
      }
    }
  }

  if ( small_heap_not_big_enough() ) { mark_small_heap_should_grow() }

  // Reclaiming stage
  MediumObjectStage2: { ... } // O(num_bytes) grow-heap or
                              // O(num_bytes) compact young, old?

  HandleStage2: { ... } // const-time from-list -> free-list,
                        // const-time to-list -> from-list
  Arena1Stage2: { ... } // const-time from-list -> free-list,
                        // const-time to-list -> from-list
  Arena2Stage2: { ... } // const-time from-list -> free-list,
                        // const-time to-list -> from-list
  ...
  LargeObjectStage2: { ... } // O(num_large_items) free items on from-list,
                             // const-time to-list -> from-list

} // end gc_reclaim

Note that the { ... } may require further switching on r->type to
account for descending arrays, etc.  Alternatively, the r->type could
be the outer switch, but with extra pre-processing complexity.

The switch statement is optional and really should be based on
profiling.  Medium-to-large powers of two might fit this bill, since
arrays tend to grow on such size boundaries once they exceed a
certain size (256, 512, 1024, 2048, 4096, 8192, ... ).  Note also that
I didn't put small objects in the "default" clause, since that makes the
most common case slower.  The only side effect is that all arenas
must allocate more than GC_SMALL_SIZE bytes.  For GC_SMALL_SIZE =
256B, an 8B or 16B linked-list header (two pointers) would be roughly 3% or
6% overhead depending on pointer-size.  Compare this to 0% for the
copying-collector and 1.5%/3% overhead at most for the compacting layer.


Deterministic Object Death:
The last topic is that of known object death.  This is especially
important for large objects or growing objects.  The more we can
minimize dead-objects created by resizing, the less
often we'll have to garbage collect.  Strings and arrays are notorious
for this, and they are at the heart of perl.  Unfortunately the
copying/compacting algorithms can't take
advantage of early freeing (allocations only occur at the top of their
stacks), but the large-object and arena regions can immediately put objects
back onto their free-lists.  The most obvious way of doing this is through:

ObjPtr* gc_resize(ObjPtr* ptr, size_t new_size);

which can free certain objects when and where appropriate.  Other variations
show up in incremental garbage collection schemes (which I haven't
fully explored).  Finally, for very large objects it might be
worthwhile to define a separate "ref-counted" set of vtables that
auto-reclaim.  Any such method would still utilize the garbage
collector on the back-end (whenever any heap gets full), and both the
space and computational overhead of a ref-count would be negligible
for such large objects and such a small subset of objects, so the only
down-side is the extra set of vtables (such as for "large strings").
Again, this is something that should be decided after profiling.
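
A sketch of what gc_resize might look like (everything here is illustrative
-- the header, the page-size test, and the trivial allocator stub -- but it
shows the large-object case handing memory back immediately while the
copied/compacted regions just leave garbage behind):

#include <stdlib.h>
#include <string.h>
#include <stddef.h>

#define GC_PAGE_SIZE 4096u                 /* placeholder page size        */

typedef struct { size_t size; } ObjHdr;    /* payload follows the header   */

/* Stub allocator for the sketch; the real one would dispatch on size to
 * the small/medium/large regions. */
static ObjHdr *gc_alloc(size_t size)
{
    ObjHdr *h = malloc(sizeof(ObjHdr) + size);
    h->size = size;
    return h;
}

static ObjHdr *gc_resize(ObjHdr *old, size_t new_size)
{
    ObjHdr *fresh = gc_alloc(new_size);
    size_t  keep  = old->size < new_size ? old->size : new_size;

    memcpy(fresh + 1, old + 1, keep);      /* copy the surviving payload   */

    if (old->size > GC_PAGE_SIZE) {
        /* Large object: (after unlinking it from its from-list) it can go
         * straight back to the memory manager instead of waiting to be
         * found dead at the next reclamation. */
        free(old);
    }
    /* Small/medium objects: the old copy simply becomes garbage for the
     * copier/compactor to reclaim on its next pass. */
    return fresh;
}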

Status:
Thus far I've written a boiled-down version of parrot with simplified
PMCs and the small growable copying-collector coupled with the
large-object allocator, to help flesh out the details/problems discussed
above.

Enjoy...

-Michael
