On Mon, 18 Jun 2007 02:58:50 -0700 [EMAIL PROTECTED] wrote:

> Slab defragmentation occurs either
>
> 1. Unconditionally when kmem_cache_shrink is called on a slab cache, either
>    by the kernel calling kmem_cache_shrink directly or by slabinfo
>    triggering slab shrinking. This form performs defragmentation on all
>    nodes of a NUMA system.
>
> 2. Conditionally when kmem_cache_defrag(<percentage>, <node>) is called.
>
>    The defragmentation is only performed if the fragmentation of the slab
>    is higher than the specified percentage. Fragmentation ratios are
>    measured by calculating the percentage of objects in use compared to
>    the total number of objects that the slab cache could hold.
>
>    kmem_cache_defrag takes a node parameter. This can either be -1 if
>    defragmentation should be performed on all nodes, or a node number.
>    If a node number is specified then defragmentation is only performed
>    on that specific node.
>
>    Slab defragmentation is a memory intensive operation that can be
>    sped up in a NUMA system if mostly node-local memory is accessed. That
>    is the case if we have just reclaimed memory on a node.
>
> For defragmentation SLUB first generates a sorted list of partial slabs.
> Sorting is performed according to the number of objects allocated.
> Thus the slabs with the fewest objects will be at the end.
>
> We extract slabs off the tail of that list until we have either reached a
> minimum number of slabs or until we encounter a slab that has more than a
> quarter of its objects allocated. Then we attempt to remove the objects
> from each of the slabs taken.
>
> In order for a slab cache to support defragmentation a couple of functions
> must be defined via kmem_cache_ops. These are
>
> void *get(struct kmem_cache *s, int nr, void **objects)
>
>	Must obtain a reference to the listed objects. SLUB guarantees that
>	the objects are still allocated. However, other threads may be
>	blocked in slab_free attempting to free objects in the slab. These
>	may succeed as soon as get() returns to the slab allocator. The
>	function must be able to detect this situation and void the attempt
>	to handle such objects (for example by voiding the corresponding
>	entry in the objects array).
>
>	No slab operations may be performed in get_reference(). Interrupts
s/get_reference/get/, yes?

>	are disabled. What can be done is very limited. The slab lock
>	for the page with the object is taken. Any attempt to perform a slab
>	operation may lead to a deadlock.
>
>	get() returns a private pointer that is passed to kick(). Should we
>	be unable to obtain all references then that pointer may indicate
>	to the kick() function that it should not attempt any object removal
>	or move but simply drop the reference counts.
>
> void kick(struct kmem_cache *, int nr, void **objects, void *get_result)
>
>	After SLUB has established references to the objects in a
>	slab it will drop all locks and then use kick() to move objects out
>	of the slab. The existence of the objects is guaranteed by virtue of
>	the references obtained earlier via get(). The callback may perform
>	any slab operation since no locks are held at the time of the call.
>
>	The callback should remove the object from the slab in some way. This
>	may be accomplished by reclaiming the object and then running
>	kmem_cache_free(), or by reallocating it and then running
>	kmem_cache_free(). Reallocation is advantageous because the partial
>	slabs were just sorted to place the partial slabs with the most
>	objects first. Reallocation is likely to result in filling up a slab
>	in addition to freeing up one, so that it too can be removed from
>	the partial list.
>
>	kick() does not return a result. SLUB will check the number of
>	remaining objects in the slab. If all objects were removed then
>	we know that the operation was successful.

Nice changelog ;)

> +static int __kmem_cache_vacate(struct kmem_cache *s,
> +		struct page *page, unsigned long flags, void *scratch)
> +{
> +	void **vector = scratch;
> +	void *p;
> +	void *addr = page_address(page);
> +	DECLARE_BITMAP(map, s->objects);

A variable-sized local. We have a few of these in-kernel. What's the
worst case here? With 4k pages and 4-byte objects it's 128 bytes of
stack? Seems acceptable.

(What's the smallest object size slub will create? 4 bytes?)

To hold off a concurrent free while defragging, the code relies upon
slab_lock() on the current page, yes? But slab_lock() isn't taken for
slabs whose objects are larger than PAGE_SIZE. How's that handled?

Overall: looks good. It'd be nice to get a buffer_head shrinker in
place and see how that goes from a proof-of-concept POV.

How much testing has been done on this code, and of what form, and with
what results?
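
A side note on the fragmentation check described under point 2 of the
quoted changelog: the sketch below is only one reading of that text, not
the patch's code. The helper name is invented, and treating
"fragmentation" as 100 minus the in-use percentage (and comparing that
against the threshold) is an assumption drawn from the wording above.

	/*
	 * Sketch only -- encodes the two sentences from the changelog:
	 * the ratio is objects-in-use versus the objects the cache could
	 * hold, and defragmentation runs when the resulting fragmentation
	 * is higher than the requested percentage.
	 */
	static int should_defrag(unsigned long in_use, unsigned long capacity,
				 int threshold_pct)
	{
		unsigned long used_pct, frag_pct;

		if (!capacity)
			return 0;	/* empty cache: nothing to defragment */

		used_pct = in_use * 100 / capacity;
		frag_pct = 100 - used_pct;	/* assumed definition */

		return frag_pct > threshold_pct;
	}

With the interface as described, a call like kmem_cache_defrag(30, -1)
would presumably request defragmentation on all nodes for caches more
than 30% fragmented, while kmem_cache_defrag(30, 1) would restrict it to
node 1 (the 30 is just an example threshold).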
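
The get()/kick() contract spelled out above lends itself to a short
illustration. The sketch below is not taken from the patch: struct foo,
foo_relocate() and the kmem_cache_ops field names are invented for the
example; only the two callback signatures and the locking rules come
from the quoted changelog.

	#include <linux/slab.h>
	#include <linux/atomic.h>

	/* Invented object type, just to have something to pin. */
	struct foo {
		atomic_t refcount;
		/* ... payload ... */
	};

	/* Hypothetical: allocate a replacement and repoint all users to it. */
	static void foo_relocate(struct foo *f)
	{
		/* ... cache-specific logic, omitted ... */
	}

	/*
	 * get(): pin every listed object.  Per the changelog this runs with
	 * interrupts disabled and the slab lock held, so no slab operation
	 * may be performed here.  A concurrent slab_free() may already be
	 * waiting; objects we fail to pin are voided so kick() skips them.
	 */
	static void *foo_get(struct kmem_cache *s, int nr, void **objects)
	{
		int i;

		for (i = 0; i < nr; i++) {
			struct foo *f = objects[i];

			if (!atomic_inc_not_zero(&f->refcount))
				objects[i] = NULL;	/* being freed; leave it alone */
		}

		/* Private cookie handed to kick(); unused in this sketch. */
		return NULL;
	}

	/*
	 * kick(): called with no locks held, so slab operations are allowed.
	 * Move each still-pinned object out of the slab, e.g. by reallocating
	 * it (new allocations land in the fuller partial slabs sorted to the
	 * front of the list) and then freeing the old copy.
	 */
	static void foo_kick(struct kmem_cache *s, int nr, void **objects,
			     void *private)
	{
		int i;

		for (i = 0; i < nr; i++) {
			struct foo *f = objects[i];

			if (!f)
				continue;	/* voided by foo_get() */

			foo_relocate(f);
			/* Drop our pin; assume foo_relocate() left us as the
			 * last user of the old copy. */
			if (atomic_dec_and_test(&f->refcount))
				kmem_cache_free(s, f);
		}
	}

	/* Field names and wiring are assumptions about the patch's API. */
	static const struct kmem_cache_ops foo_defrag_ops = {
		.get  = foo_get,
		.kick = foo_kick,
	};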
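
On the DECLARE_BITMAP stack-usage question: the 128-byte figure follows
directly from the assumptions stated above (4 KiB pages, 4-byte
objects); whether SLUB actually creates 4-byte objects is the open
question. Spelled out:

	/* Worst case as estimated in the review, taken at face value. */
	#define ASSUMED_PAGE_SIZE	4096	/* 4 KiB page */
	#define ASSUMED_OBJ_SIZE	4	/* smallest object size assumed above */

	/* One bit per object: 4096 / 4 = 1024 bits in the bitmap ... */
	#define MAX_OBJECTS	(ASSUMED_PAGE_SIZE / ASSUMED_OBJ_SIZE)

	/* ... which is 1024 / 8 = 128 bytes of on-stack bitmap. */
	#define MAP_BYTES	(MAX_OBJECTS / 8)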