This references the previously posted (by someone else) article on SUN's libu memory architecture. If needed I can run through the archives and find the URL.
I've already written a threaded implementation with decent responsiveness; it did as much as SUN's libu claimed it would. Recently, however, I redesigned the architecture from scratch, taking advantage of many possible optimizations. Since it's towards the end of the day, I figured I'd document my architecture here and wait for comments before beginning the rewrite / re-debug.

As a refresher, SUN's design is based on 4 layers: vmem, slab, depot and magazine-cache.

At the bottom is vmem, which maps "pages" of memory (in libu, new pages are acquired from mmap; excess pages are readily freed via munmap).

The next higher layer is the slab, which partitions a contiguous collection of pages into an array of raw memory objects (objects being fixed-size pieces of memory). Thus each object size requires a separate slab. The slab layer requests n pages from vmem to construct a new slab when it runs out of memory objects, and it frees slabs back to vmem when all the individual objects have been reclaimed. Both of these occur automatically, without external intervention.

The next higher layer is the "depot", which stores "constructed" objects, e.g. objects that are initialized and potentially contain references to other memory objects. When memory requests are made and no free objects exist, the request is passed down to the associated slab (of corresponding size) and a constructor is applied. This happens automatically. Frees to the slab are asynchronous; they occur when the entire system is memory starved, or when a depot has too many free objects. In any case, prior to a free to the slab, the destructor is applied to each freed memory object.

The highest layer is the magazine-cache. The sole function of the cache is to give multiple threads thread-local access to the depot. It does this by caching up to 2*M objects in a thread-local magazine structure. On an alloc, when the magazine is empty, thread-synchronized access is made to the underlying depot / slab / vmem layers. On a free, when the magazine is full (contains 2*M objects), thread-synchronized access to the lower layers is performed. Most of the time there is a mixture of allocs and frees, which avoids lower-level accesses (and thus mutex locks). Aside from alleviating lock contention, this minimizes the sharing of memory objects between threads, and on multi-CPU systems it avoids cache pollution (due to cache lines dropping to the Shared MESI state).

So far, nothing revolutionary, albeit clever on SUN's part. The next attribute, however, is the use of external boundary tags. The idea is that the allocated memory itself is never written to (SUN's design goals included support for read-only allocations). Further, these external tags don't have to be contiguous. Runs of contiguous memory regions are collected together in a segment; segments are allocation regions (such as via mmap). Within a segment, consolidation may occur, and when fully consolidated, the segment is freed. Hash tables are used to associate a user's memory pointer with these external boundary tags. From this, SUN claims constant (bounded) alloc/free time no matter what is going on. From my analysis, the only possible wrinkle in this constant time is the reclaiming stage (which makes n reclaims, where n is the number of depots, a relatively low and constant number).

At run time, "object caches" can be defined for arbitrarily sized objects, with optionally configured constructors and destructors. Such object references use the magazine-layer interface.
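To make the magazine-cache behavior described above concrete, here is a rough sketch of the per-thread allocation path under the M / 2*M policy. This is purely illustrative; the type and helper names (tl_cache_t, depot_exchange_for_full, the fixed M) are mine, not SUN's actual API.

  /* Illustrative sketch only -- not SUN's code.  A thread-local cache holds
   * two magazines (mini-stacks) of up to M objects each; the depot is only
   * touched (under its mutex) when both magazines are empty.               */
  enum { M = 8 };                       /* magazine capacity, arbitrary here */

  typedef struct magazine {
      int   count;                      /* number of cached free objects     */
      void* objs[M];
  } magazine_t;

  typedef struct tl_cache {
      magazine_t*   loaded;             /* magazine currently allocated from */
      magazine_t*   previous;           /* the other magazine                */
      struct depot* depot;              /* shared, mutex-protected layer     */
  } tl_cache_t;

  void* depot_exchange_for_full(tl_cache_t* c);    /* hypothetical helper */

  void* tl_cache_alloc(tl_cache_t* c)
  {
      if (c->loaded->count > 0)                    /* common case: no locking */
          return c->loaded->objs[--c->loaded->count];

      if (c->previous->count > 0) {                /* swap magazines and retry */
          magazine_t* tmp = c->loaded;
          c->loaded   = c->previous;
          c->previous = tmp;
          return c->loaded->objs[--c->loaded->count];
      }
      /* both magazines empty: lock the depot and exchange an empty magazine
       * for a full one, falling through to the slab / vmem layers if needed */
      return depot_exchange_for_full(c);
  }

The free path is the mirror image: push into the loaded magazine, and only when both magazines are full (2*M cached objects) hand a full magazine back to the depot under its lock.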
For generic memory access, vmem is the API, but small memory requests are redirected to one of several predefined "quantum caches" (of sizes 1 page, 2 pages, ... q_max pages), which alleviates lock contention on vmem when it is used generically. SUN addressed the memory-fragmentation problem by placing a lower bound on slab sizes: a slab is at least 3 times the object size, and in most cases 16 - 32 times the object size. Further, with q-caches, all allocations up to q_max have slabs of the same size, so slabs used by 8- and 16-byte object allocations can be trivially intermixed. The only danger is if you never free all the objects within each slab. I call this the shot-gun effect: little bits are scattered throughout memory space, preventing the freeing of slabs, and thus preventing the interchange of slabs when needed.

My design is as follows. sbrk is completely avoided, which leaves open the possibility of having two separate memory managers. The following functions are public (a usage sketch follows below):

  void*    vm_alloc(size)
  void     vm_free(void*)
  void*    vm_realloc(void*, size)
  int      vm_sizeof(void*)
  void     vm_dump_stats()

  // automatically allocates a slab (reuses one if possible)
  depot_t* depot_create( ... )
  int      depot_size(depot_t*)
  void     depot_reclaim(depot_t*)      // called internally or by GC

  // cache_create also stores the pointer in thread-local storage,
  // keyed by the depot
  cache_t* cache_create(depot_t*, ... )

  // note: objects are of fixed size
  void*    cache_alloc(magazine_t*)
  void     cache_free(magazine_t*, void*)

  // these two pull the cache_t pointers from thread-local storage
  void*    depot_alloc(depot_t*)
  void     depot_free(depot_t*, void*)

Compilation is contingent upon a MULTITHREAD preprocessor flag. With this flag disabled, all mutexes are obviously avoided, but the entire magazine-cache structure is avoided as well; depots are directly accessed by object caches. All the other features are at least partially beneficial even in single-threaded mode.

My initial impression is that a page should be 1024 bytes (though I'm also looking at 512 bytes to reduce the shot-gun effect of 8-byte allocs). I've actually found that q-caches were detrimental to things like consolidation and the calling depth (overhead) of a worst-case allocation, so I'm currently leaving them out. Instead I'm using a power-of-two based dq-cache (for "differential q-cache"). SUN's q-caches are multiples of the page size up to q_max, so there are q_max q-caches (usually around 5). What I'm calling dq-caches are subdivisions of a single page. My understanding is that SUN's vmem assumed all sub-page allocations would go through object caches, or through a vmem configured with some incredibly small page size (like 4 bytes), but that requires many bytes of overhead per allocation, and I know that we'll need a lot of sub-32-byte allocations; thus this was unacceptable to me. Hence my use of very large page sizes together with dq-caches. This was my first real deviation from SUN's proposal.
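As a quick illustration of how I expect the public functions above to hang together, here's a hedged usage sketch. The parameter lists for depot_create / cache_create aren't final, the header name is a placeholder, and the generic sizes are arbitrary:

  /* Hypothetical usage of the proposed API -- nothing here is final. */
  #include "vmem.h"                             /* placeholder header name       */

  typedef struct { double x, y, z; } point_t;   /* an example fixed-size object  */

  void example(void)
  {
      /* explicit object cache: one depot per object size, plus a per-thread cache */
      depot_t* point_depot = depot_create(sizeof(point_t) /* , ... */);
      cache_create(point_depot /* , ... */);    /* installs this thread's cache    */

      /* depot_alloc / depot_free find the thread-local cache_t on their own */
      point_t* p = depot_alloc(point_depot);
      depot_free(point_depot, p);

      /* generic path: vm_alloc decides between dq-cache, page runs, or mmap */
      void* buf = vm_alloc(3000);
      buf = vm_realloc(buf, 6000);
      vm_free(buf);
  }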
One thing I was conscious of was memory alignment. SUN uses a complex function in place of the simple vmem_xxx calls for arbitrary alignments. Further, their OS version of vmem requires a size parameter to all vmem functions, so you'd have to remember an allocation's size when freeing or reallocating (a vmem_sizeof would be paradoxical). Their paper didn't say what they did in libu, but I'm assuming they simply prepended 4 bytes containing the memory size to each allocation. I didn't want this, since it adds too much overhead for small allocations AND further frustrates memory-aligned allocations.

I address the alignment problem by requiring that ALL vmem allocations be aligned on page boundaries. This trivializes alignment for sub-page allocations, and there are somewhat sophisticated ways of doing alignment for multiples of a page size. For those alignments that match the memory size, dq-caches are already aligned. For bizarre alignments (still below a page), the size is rounded up to a page (thus using slack space). For sizes larger than a page, the size is rounded up to a whole number of pages, 2*num_pages are allocated, and then an offset is calculated to make the memory aligned; all prior and subsequent pages are then freed. Since a page is an extremely useful size (used by ALL dq-caches), no memory is wasted (aside from up to half a page of slack space on average, which might be recovered via vm_realloc anyway).

As with most memory managers, there is an upper bound on memory sizes beyond which mmap is used exclusively. At the moment I'm leaning towards 128K. I'm debating using a power-of-two round-up on such allocations to remove some burden from the OS.

Now comes the freaky part. SUN utilizes a hash for ALL non-explicit object memory allocations. From this they can deduce whether a free goes to their q-cache, the regular paging system, or munmap. Obviously a flood of 8-byte allocs could bring any hash table to its knees (and would violate SUN's claim of real-time access). I'm assuming that by requiring the memory size even on a free, vmem can use multiple hash tables, one for each size. But I don't require the memory size to be passed in, NOR do I prepend allocations with the memory size. The only thing I have left is memory alignment. I assume that any explicit object allocation never uses vmem, and that implicit object caches (via the dq-caches) are _always_ offset from a page-aligned boundary (revisited below). Further, special user alignment requirements should either use explicit object caches or the separate vmemx_(free|alloc|realloc), as with SUN. The only remaining ambiguity is between mmap'd and page-mapped allocations. For this, I assume that mmap frees are very rare (since you can't have a million allocations of size >= 128K), so they are relegated to a slower path:

  vmem_free(void* ptr):
      ptr_1k = align1k( ptr )
      if ptr_1k != ptr:
          we're in a dq-cache; pass to it        (described later)
      else if ptr_1k in segment-hash:
          we're paged
      else if ptr_1k in mmap-hash:
          we're mmap'd
      else:
          throw exception "invalid mem reference"

Thus I utilize two separate hashes: one for normal allocations (size <= 128K) and one for larger allocations that require mmap. Note that the hash-table data structures are _completely_ inlined within the data segments; no mini-mallocs are required. This complicates the use of mmap for aligned memory requests, but those are rare enough to warrant the extra processing.
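Before getting to the structures, here is a hedged sketch of the over-allocate-and-trim alignment path described above (for alignments larger than a page, using the PAGE_SIZE constant defined with the structures below). It assumes the requested alignment is a multiple of PAGE_SIZE and no larger than the rounded-up size; free_page_run is a hypothetical helper that returns whole pages to the owning segment:

  /* Sketch only: aligned allocation for align > PAGE_SIZE. */
  #include <stdint.h>

  void* vm_alloc(size_t size);                        /* declared above           */
  void  free_page_run(void* first_page, size_t n);    /* hypothetical; n may be 0 */

  void* vmx_alloc_aligned(size_t size, size_t align)  /* name is a placeholder    */
  {
      size_t pages  = (size + PAGE_SIZE - 1) / PAGE_SIZE;   /* round up to pages  */
      size_t padded = 2 * pages;                            /* over-allocate      */

      uintptr_t raw = (uintptr_t) vm_alloc(padded * PAGE_SIZE);
      uintptr_t mem = (raw + align - 1) & ~(uintptr_t)(align - 1);   /* align up  */
      size_t   lead = (mem - raw) / PAGE_SIZE;              /* whole leading pages */

      /* hand the unused leading and trailing pages back; since everything is in
       * page units, nothing is stranded                                          */
      free_page_run((void*) raw, lead);
      free_page_run((void*) (mem + pages * PAGE_SIZE), padded - pages - lead);
      return (void*) mem;
  }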
The structures are as follows:

  #define NUM_PAGES_PER_SEGMENT  255    // possibly 127
  #define PAGE_SIZE              1024
  #define SEGMENT_SIZE  ALIGN_1K( NUM_PAGES_PER_SEGMENT * PAGE_SIZE + sizeof(page_seg_t) )

  //
  // mmap allocation
  //
  typedef struct _mm_seg_t {
      void*             address;      // hash-unique identifier; holds the address of &data[0]
      struct _mm_seg_t *next, *prev;  // hash doubly-linked-list elements
      int               size;         // for use in realloc
      void*             data[0];      // beginning of data
  } mm_seg_t;

  mm_seg_t        g_mm_seg_hash[ 128 ];
  pthread_mutex_t g_mm_seg_hash_lock;

  //
  // segment allocation
  //
  typedef struct _page_seg_hash_t {
      // the address is implicit for segments
      struct _page_seg_hash_t *next, *prev;   // hash doubly-linked-list elements
  } page_seg_hash_t;

  typedef struct {
      unsigned char prev_tag_size, tag_size, f_free;
  } tag_t;

  typedef struct _page_seg_t {
      page_seg_hash_t hash_header;    // used by the hashing algorithms
      // the segment size is constant and thus implied
      int    free_pages;              // allows a quick check of whether to free the segment
      tag_t  boundry_tags[ NUM_PAGES_PER_SEGMENT + 2 ];
                                      // boundary tags for each page, plus 2 delimiting
                                      // (begin/end) tags
      void*  data[0];                 // physical location of the pages
  } page_seg_t;

  page_seg_hash_t g_page_seg_hash[ 128 ];
  pthread_mutex_t g_page_seg_hash_lock;

  //
  // free pointer chains
  //
  typedef struct _page_free_t {
      struct _page_free_t *next, *prev;
      page_seg_t*          header;
  } page_free_t;

  typedef struct _available_page_free_t {
      struct _available_page_free_t *next, *prev;
      int chain_size;                 // index into g_free_chains
  } available_page_free_t;

  page_free_t            g_free_chains[ NUM_PAGES_PER_SEGMENT ];
  available_page_free_t  g_available_free_head;

From the above you can see the embedding of the hash data structures. Segments are directly allocated from mmap as needed (no pre-allocation at process init), as with:

  page_seg_t* seg = mmap(0, SEGMENT_SIZE, .... );

Tags are implicitly linked to pages: offsets within seg->boundry_tags[] correspond directly to PAGE_SIZE offsets into seg->data[], and vice versa. If NUM_PAGES_PER_SEGMENT is kept below 256, a single byte can be used for relative addresses, so segments max out at 256K. Additionally, one page is left unused (reserved for the header), so that we don't acquire 256K + sizeof(header). The first and last boundary tags are set to "allocated" so as to avoid consolidating beyond the edges of the segment. Allocating a segment requires initializing the boundry_tags array (either 129 or 257 entries) as well as inserting into the segment hash table.

The mmap hash table and the segment hash table have separate mutexes, since they are independent. I'm looking to make the hash tables statically sized, with 128 buckets. Thus 640 MB worth of 256-page segments (at 1K per page), evenly distributed across the hash buckets, gives about 19 layers of linked-list lookups (in the worst case there would be around 2500 lookups). The hash key currently is:

  int key = (addr >> 18) & 127;

256K-aligned mmaps should be assigned linearly (at least initially), and thus distribute pseudo-evenly throughout the hash buckets. It's assumed that segment_size / 2 is the maximum practical allocation using pages, so above 128K mmap is used directly for alloc/free.

With P pages in a segment, there are only P possible memory sizes, so P free-list chains are used for allocation. This avoids much of the trouble associated with power-of-two free-chain lists. The only problem is searching the free-chain lists: on initialization there is a single 255-page free element, and allocating 1 page would require navigating 254 empty pointer chains. So I also keep a doubly-linked list of non-empty free-pointer chains. Thus, on failing to find any free page run of size 1, the next non-empty free-pointer list is immediately found to be of size 255. Granted, this requires starting at the head of the available-free list, so if you're looking for page_size >= 250 and there are entries in most free chains, you get poorer results (but it is at least bounded).
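Here is a hedged sketch of that allocation search over the chains. It treats each g_free_chains[] entry and g_available_free_head as circular-list sentinels, and assumes chain_size records the run size in pages (with chain i-1 holding runs of exactly i pages); splitting the chosen run and updating its boundary tags is elided:

  /* Sketch only: find a free run of at least n_pages.  The caller holds the
   * segment lock.                                                           */
  page_free_t* find_free_run(int n_pages)
  {
      /* exact fit first */
      page_free_t* exact = &g_free_chains[n_pages - 1];
      if (exact->next != exact)
          return exact->next;

      /* otherwise walk the (much shorter) list of non-empty chains and take
       * the first one that is big enough; the remainder gets re-chained     */
      available_page_free_t* a;
      for (a = g_available_free_head.next; a != &g_available_free_head; a = a->next) {
          if (a->chain_size >= n_pages)
              return g_free_chains[a->chain_size - 1].next;
      }
      return NULL;   /* nothing large enough: reclaim the depots, then mmap a new segment */
  }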
The external boundary tags allow 1K-aligned memory requests with almost no overhead. They also allow semi-efficient (constant-time) consolidation, though since the segment is of constant size I'm going to look into using a buddy system as described in "Understanding the Linux Kernel". Though faster, I believe the buddy system would be more wasteful of memory and less inclined to "grow" memory on reallocs. I should be able to easily interchange the two, since they're such a small part of the vmem design.

vmem_init will have to initialize both hash tables and the free-memory list. It will also have to initialize the dq-caches (below). Spawning new threads might also require creating thread-local magazines for each dq-cache (depending on whether a thread_id -> cpu_id mapping is possible, in which case vmem_init could initialize one per CPU instead, though that requires locks in the magazine-caches). Before calling mmap to allocate a new segment, reclaim is applied to the list of depots. This causes a speed hiccup, as well as leaving some important caches starved, but I can't think of any other time this would be as valuable. Perhaps a separate GC thread could do it, but that's not portable.

Though this is a mishmash, it should cover all the technology of the basic vmem layer.

Next is the slab layer. Each slab has an associated centralized data structure which contains various configuration info (alignment, object size, pointer chains for both slabs and slabs-with-free-objects, etc.). I'm happy with my previous work on the slab layer, so I'm not going to post it here, other than to mention that each slab has a header containing at least:

  typedef struct _slab_t {
      int   object_size;      // size of all allocations within this slab; used by
                              // vmem_free, vmem_sizeof, and vmem_realloc to identify
                              // the size of a void*.  Note that this is redundant
                              // with slab_descriptor_t.
      struct _slab_t *next, *prev, *next_free, *prev_free;
      int   num_free;         // free objects in this slab (used to see when the slab
                              // is completely free)
      void* first_free;       // used for slab allocation
      void* data[0];          // beginning of data
  } slab_t;

  typedef struct _slab_descriptor_t {
      pthread_mutex_t lock;
      int     object_size;
      int     num_pages;
      slab_t *next, *prev, *next_free, *prev_free;
      struct _slab_descriptor_t *next_desc, *prev_desc;
      ...                     // various precalculated info for more efficient operation
  } slab_descriptor_t;

  slab_descriptor_t g_slab_descriptor_list_head;

For slab sizes <= page_size, the presence of the prefixed header means that no object address can fall exactly on a page boundary. This fact is exploited by vmem_free(void*), vmem_sizeof(void*), and vmem_realloc(void*, new_size): they can skip navigation of g_page_seg_hash entirely, go to the page boundary, and read the object size from the slab header. Since the existence of an external pointer means the slab is not about to be freed, no locking needs to occur for such a lookup. However, in order to allocate from or free to the slab, the slab_descriptor must be locked. As soon as all of a slab's objects have been returned, the slab itself is sent to vmem_free. Each newly created slab is inserted into the slab-descriptor list, which allows reuse by subsequent internal calls to slab_create(..).
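A hedged sketch of that lock-free size lookup, using the header above (PAGE_SIZE as defined earlier). The rounding trick only applies to sub-page allocations, which is exactly when the slab header is guaranteed to sit at the start of the page:

  /* Sketch only. */
  #include <stdint.h>

  static int vm_sizeof_small(void* ptr)
  {
      uintptr_t p    = (uintptr_t) ptr;
      slab_t*   slab = (slab_t*)(p & ~(uintptr_t)(PAGE_SIZE - 1));   /* round down to the page */

      /* the caller still holds a live pointer into this slab, so the slab cannot
       * be returned to vmem out from under us -- no lock is needed just to read  */
      return slab->object_size;
  }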
The next layer is the depot. SUN outlined a constructor / destructor policy, but the most common case (used by the dq-caches, at least) is volatile objects with no constructor or destructor. Since we're not storing valid objects, we are free to overwrite their contents with free-pointer chains (as the slab and vmem subsystems do). Note that the constructed case then becomes a special case warranting a separate set of handlers: my previous implementation adhered to SUN's design, and I'm renaming those routines obj_magazine_XXX and obj_depot_XXX. They have no use within the dq-caches, and their requirements impede efficiency. My simplified depot design therefore requires a lower-bound object size of sizeof(void*) * 2, since multiple free pointers are stored in the object itself.

SUN's design used a separately allocated structure called a magazine, which was a mini-stack. My design uses the objects themselves as linked lists for the stack. The magazines are chained together (in the depot) by a second linked list stored in the top-most element of each stack, as with:

  typedef struct _magazine_el_t {
      struct _magazine_el_t* next_el;
      struct _magazine_el_t* next_stack;
  } magazine_el_t;

  typedef union _object_t {
      magazine_el_t el;
      void*         data[0];
  } object_t;

  typedef struct _depot_t {
      object_t* full_magazines;
      int max_stack_size;       // increased after contention on the depot lock
      int max_num_magazines;    // max number of magazines to hold before freeing to the slab
      int num_free_objects;
      pthread_mutex_t lock;
      struct _depot_t *next_depot, *prev_depot;   // for g_depot_list
      ...                       // other items from my previous implementation
  } depot_t;

  depot_t g_depot_list_head;

From this there is no need to maintain two lists (one of filled magazines and one of empty magazines), since there is no concept of an empty magazine. This helps performance, since you don't have to allocate magazine structures (which in my previous implementation did impede free(..) performance, as well as memory usage).

All access to the depot is thread-synchronized, but since each memory-request size maps to a different depot, there is a smaller likelihood of contention. Contention is further addressed by increasing max_stack_size after a failed try_lock on the depot; it is reduced when reclaim calls are made. I haven't decided how to deal with max_num_magazines, since theoretically this is handled by reclaim.

Lastly, the magazine-cache layer uses:

  typedef struct _cache_t {
      int       num_data_objects;
      int       num_prev_objects;
      object_t *data, *prev;
      depot_t  *depot;
  } cache_t;

This is pretty close to the SUN architecture, except that we're not using magazines but the data objects themselves as linked-list stacks. When max_stack_size cache_free's have accumulated, data and prev are swapped. If data is still too full, it's placed into the depot (after locking); instead of allocating an empty magazine, data is simply reset to null and num_data_objects to zero.

I haven't completely thought out the shared read-only use of depot->max_stack_size. I'm inclined to fix it at 5 for the time being (since I don't know how bad it is to read this value while another thread is potentially overwriting it). I might just pull a volatile copy of it into cache_t and have it updated on every access to the depot. Additionally, if the stack size is dynamic, I'll have to add yet another field to magazine_el_t, which increases the lower bound of memory sizes (from 8 to 12, or in the case of 64-bit pointers, from 16 to 20).

Note that the use of constructed objects still requires external magazines. In my previous implementation I held an upper bound of 10 items per magazine, which meant 48 bytes per magazine. An interesting ramification is that an object free periodically requires an alloc from the 48-byte cache (which thankfully isn't constructed; otherwise it would quickly become recursive). That implementation kept only a single free magazine in the depot so as not to waste them (since all constructed object-caches need them).
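Returning to the unconstructed (dq-cache) path, here's a hedged sketch of cache_alloc / cache_free over the cache_t above, treating the objects themselves as the stack links. The depot hand-off helpers (depot_pop_stack, depot_push_stack) are placeholders for the locked depot access:

  /* Sketch only.  Objects double as stack links via object_t.el.next_el, so no
   * separate magazine structures are ever allocated.                            */
  object_t* depot_pop_stack(depot_t* d, int* count);      /* hypothetical, locks d */
  void      depot_push_stack(depot_t* d, object_t* top);  /* hypothetical, locks d */

  void* cache_alloc(cache_t* c)
  {
      if (c->data == NULL) {
          /* local stack empty: swap with prev, and if both are empty go to the depot */
          object_t* t = c->data;  int n = c->num_data_objects;
          c->data = c->prev;      c->num_data_objects = c->num_prev_objects;
          c->prev = t;            c->num_prev_objects = n;

          if (c->data == NULL)
              c->data = depot_pop_stack(c->depot, &c->num_data_objects);
          if (c->data == NULL)
              return NULL;                       /* depot / slab / vmem exhausted */
      }
      object_t* obj = c->data;
      c->data = (object_t*) obj->el.next_el;     /* pop the top object */
      c->num_data_objects--;
      return obj;
  }

  void cache_free(cache_t* c, void* ptr)
  {
      object_t* obj   = (object_t*) ptr;
      obj->el.next_el = (magazine_el_t*) c->data; /* push onto the local stack */
      c->data         = obj;
      c->num_data_objects++;

      if (c->num_data_objects >= c->depot->max_stack_size) {
          /* swap data and prev; if the new data stack is also full, hand it to
           * the depot (locking) and start over with an empty stack              */
          object_t* t = c->data;  int n = c->num_data_objects;
          c->data = c->prev;      c->num_data_objects = c->num_prev_objects;
          c->prev = t;            c->num_prev_objects = n;

          if (c->num_data_objects >= c->depot->max_stack_size) {
              depot_push_stack(c->depot, c->data);
              c->data = NULL;
              c->num_data_objects = 0;
          }
      }
  }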
Piecing things together: with page_size = 1024 there are 12 dq-cache sizes: 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256, and 384 (powers of two interleaved with three times powers of two). Each thread needs a thread-local pointer to the appropriate caches. As before, depot_alloc / depot_free access this thread-local storage, so when vmem wants to use these dq-caches it goes through the depot_xx functions. I _could_ supply a vmem function variant which is given a pointer to the structure of thread-specific caches, which would avoid this semi-expensive indirection.

For small memory requests the flow goes as follows:

  vmem_alloc(sz) -> depot_alloc(..) -> cache_alloc(..)
    [ -> (access depot)
      [ -> (access slab)
        [ -> vmem_alloc(slab->num_pages)
          [ -> mmap(..) ] ] ] ]

where [..] are successively rarer (though periodic) occurrences. The depot and slab accesses occur without additional function calls. Pure object allocations skip the first one or two steps.

For medium-sized allocations the flow is (including the above use of vmem by the slab):

  vmem_alloc(sz) -> (lock g_seg_lock) -> (traverse free-list)
    [ -> (traverse available-free-lists to find a free-list)
      [ -> seg = mmap(..) -> (init segment) -> (add seg to seg_hash) ] ]
    [ -> (split free chunk into two smaller chunks) ]
    -> (allocate mem-object)

For large allocations:

  vmem_alloc(sz) -> (lock g_mm_seg_lock) -> mmap(..) -> (add ptr to mm_seg_hash)

For small frees:

  vmem_free(ptr) -> depot_free(..) -> cache_free(..)
    [ -> (access depot)
      [ -> (access slab)
        [ -> vmem_free(slab)
          [ -> munmap(..) ] ] ] ]

For medium frees:

  vmem_free(ptr) -> (lock g_seg_lock) -> (traverse g_seg_hash to find the associated seg)
    -> (consolidate free chain)
    -> [ ( (remove seg from g_seg_hash) -> munmap(seg) )
         OR
         ( (attach ptr to free-list) [ -> (add free-list to g_available_free_lists) ] ) ]

For large frees:

  vmem_free(ptr) -> (lock g_mm_seg_lock) -> (traverse g_mm_seg_hash) -> (remove entry) -> munmap(..)

Note that vmem does not lock until it has to. Most notably, vmem -> cache -> vmem only locks on the second invocation of vmem, since the first invocation only analyzes the passed parameters.

One interesting thing to note: there's no underlying need to use mmap as the back end. My previous prototype just used malloc / free for the back end. One disadvantage of doing so is the double memory initialization; on the other hand, it can dramatically reduce OS calls (mmaps) if the underlying memory manager can handle 256K+ allocations in a non-locking manner. (GNU malloc does a good job of not blocking.)
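Pulling those flows together, here is a hedged sketch of the top-level vm_alloc dispatch. The size-class table matches the dq-cache sizes above; dq_depot_for, page_alloc, and mm_alloc are placeholder names for the three paths just described:

  /* Sketch of the vm_alloc front end.  Only parameter analysis happens here;
   * any locking happens in the lower layers (segment hash or mmap hash).      */
  #define VM_MMAP_THRESHOLD (128 * 1024)        /* beyond this, straight to mmap */

  static const int dq_sizes[12] =
      { 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384 };

  void* vm_alloc(size_t size)
  {
      if (size <= 384) {                        /* small: dq-cache via the thread-local cache */
          int i = 0;
          while (dq_sizes[i] < (int) size)      /* smallest class that fits */
              i++;
          return depot_alloc(dq_depot_for(i));  /* dq_depot_for(): hypothetical lookup */
      }
      if (size <= VM_MMAP_THRESHOLD)            /* medium: a page run within a segment */
          return page_alloc((size + PAGE_SIZE - 1) / PAGE_SIZE);   /* hypothetical */

      return mm_alloc(size);                    /* large: dedicated mmap + mm_seg_hash */
  }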
There are two obvious shortcomings. First, with the 8-byte dq-cache there are 124 objects per slab. As long as a single object remains checked out of the slab (perhaps dormant in an idle thread's magazine-cache, beyond the reach of a reclaimer), the slab cannot be reclaimed to vmem. It's possible to allocate 1 permanent object, then 123 temporary, then 1 permanent, ad infinitum; when all the temporaries are freed (possibly several megs' worth), the permanent allocations prevent any slabs from being freed (for reuse by other memory sizes, though differing objects of the same size can still reuse them). The only way to compensate for this is to use a smaller number of objects per slab. Unfortunately, if we used a slab size smaller than a page we'd violate our anti-fragmentation policy (inherent in SUN's design). Alternatively, we could reduce the page size (to 512 bytes) at the expense of fewer dq-caches and a higher number of mmaps (128K segments and a 64K upper bound before per-allocation mmaps). On Linux, at least, this could be a heavy burden, since its VM mappings use linear linked lists (instead of Solaris's vmem-based real-time system).

The second shortcoming is that vmem has only two locks, no matter how many threads are running. Thankfully this is only a problem for allocations with sizes between page_size and max_pages * page_size. If allocations in that region are known to be popular, a cache should be configured for the appropriate sizes. There is one data type in particular, however, that thwarts this: the dynamic array (and the associated hash-bucket array). It doubles in size on each growth, and will thus traverse several mid-range sizes (1, 2, 4, 8, 16, 32, 64, and 128 pages). To make up for this, it's entirely possible that reallocs can avoid memory copies in this central region, thereby reducing realloc compute time. Additionally, if more memory is needed, the segment lock can be released while performing garbage collection (traversing the depot list for reclaiming, locking each depot as it goes).

In my previous experimentation there was a measurable performance gap between direct object-cache calls and indirect vmem calls, and an even bigger hit when the caches themselves used locks (which would allow a reclaimer to flush magazine stack values). Thus I still maintain that caches should not use locks, and that Parrot's interpreter should directly use cache objects for all known fixed-size objects (which unfortunately excludes dynamic arrays).

This is a lot of material and very verbal (light on actual code), so it's probably hard to read, but it's at least on record for me to reference. This was all information I had on my whiteboard.

Enjoy

-Michael