On Mon, Mar 17, 2025 at 03:33:03PM +0100, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will set up a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
>
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> refill one of the sheaves.
>
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
>
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() with a specific
> node (not NUMA_NO_NODE), sheaves are bypassed.
>
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fall back to bulk
> alloc/free to slabs directly to avoid double copying.
>
> Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
> allocated or freed using the sheaves. Counters sheaf_refill,
> sheaf_flush_main and sheaf_flush_other count objects filled or flushed
> from or to slab pages, and can be used to assess how effective the
> caching is. The refill and flush operations will also count towards the
> usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
> counters.
>
> Access to the percpu sheaves is protected by localtry_trylock() when
> potential callers include irq context, and localtry_lock() otherwise
> (such as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
>
> A current limitation is that when slub_debug is enabled for a cache with
> percpu sheaves, the objects in the array are considered as allocated from
> the slub_debug perspective, and the alloc/free debugging hooks occur
> when moving the objects between the array and slab pages. This means
> that e.g. a use-after-free that occurs for an object cached in the
> array is undetected. Collected alloc/free stacktraces might also be less
> useful. This limitation could be changed in the future.
>
> On the other hand, KASAN, kmemcg and other hooks are executed on actual
> allocations and frees by kmem_cache users even if those use the array,
> so their debugging or accounting accuracy should be unaffected.
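
A drive-by question on the interface, with an (untested) sketch of my
understanding below. The cache name, object type and the capacity of 32
are just made-up examples. Opting in is simply a matter of passing a
non-zero sheaf_capacity via kmem_cache_args, right?

	struct kmem_cache_args args = {
		/* enable percpu sheaves, each holding up to 32 objects */
		.sheaf_capacity = 32,
	};

	s = kmem_cache_create("my_obj_cache", sizeof(struct my_obj), &args, 0);
	if (!s)
		return -ENOMEM;
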
>
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---
>  include/linux/slab.h |   34 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1029 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1020 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 7686054dd494cc65def7f58748718e03eb78e481..0e1b25228c77140d05b5b4433c9d7923de36ec05 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -453,12 +489,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
>   */
>  static nodemask_t slab_nodes;
>  
> -#ifndef CONFIG_SLUB_TINY
>  /*
>   * Workqueue used for flush_cpu_slab().
>   */
>  static struct workqueue_struct *flushwq;
> -#endif
> +
> +struct slub_flush_work {
> +	struct work_struct work;
> +	struct kmem_cache *s;
> +	bool skip;
> +};
> +
> +static DEFINE_MUTEX(flush_lock);
> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>  
>  /********************************************************************
>   *			Core slab cache functions
> @@ -2410,6 +2453,358 @@ static void *setup_object(struct kmem_cache *s, void *object)
>  	return object;
>  }
> +/*
> + * Bulk free objects to the percpu sheaves.
> + * Unlike free_to_pcs() this includes the calls to all necessary hooks
> + * and the fallback to freeing to slab pages.
> + */
> +static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +{

[...snip...]

> +next_batch:
> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
> +		goto fallback;
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +		struct slab_sheaf *empty;
> +
> +		if (!pcs->spare) {
> +			empty = barn_get_empty_sheaf(pcs->barn);
> +			if (empty) {
> +				pcs->spare = pcs->main;
> +				pcs->main = empty;
> +				goto do_free;
> +			}
> +			goto no_empty;

Maybe a silly question, but since neither alloc_from_pcs_bulk() nor
free_to_pcs_bulk() allocates empty sheaves (and they only sometimes put
empty or full sheaves into the barn), shouldn't we expect the barn to
usually have no sheaves available when only the bulk interfaces are used?

> +		}
> +
> +		if (pcs->spare->size < s->sheaf_capacity) {
> +			stat(s, SHEAF_SWAP);
> +			swap(pcs->main, pcs->spare);
> +			goto do_free;
> +		}
> +
> +		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +		if (!IS_ERR(empty)) {
> +			pcs->main = empty;
> +			goto do_free;
> +		}
> +
> +no_empty:
> +		localtry_unlock(&s->cpu_sheaves->lock);
> +
> +		/*
> +		 * if we depleted all empty sheaves in the barn or there are too
> +		 * many full sheaves, free the rest to slab pages
> +		 */
> +fallback:
> +		__kmem_cache_free_bulk(s, size, p);
> +		return;
> +	}
> +
> +do_free:
> +	main = pcs->main;
> +	batch = min(size, s->sheaf_capacity - main->size);
> +
> +	memcpy(main->objects + main->size, p, batch * sizeof(void *));
> +	main->size += batch;
> +
> +	localtry_unlock(&s->cpu_sheaves->lock);
> +
> +	stat_add(s, FREE_PCS, batch);
> +
> +	if (batch < size) {
> +		p += batch;
> +		size -= batch;
> +		goto next_batch;
> +	}
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -5309,8 +6145,8 @@ static inline int calculate_order(unsigned int size)
>  	return -ENOSYS;
>  }
>  
> -static void
> -init_kmem_cache_node(struct kmem_cache_node *n)
> +static bool
> +init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
>  {

Why is the return type bool, when it always succeeds?

>  	n->nr_partial = 0;
>  	spin_lock_init(&n->list_lock);
> @@ -5320,6 +6156,11 @@ init_kmem_cache_node(struct kmem_cache_node *n)
>  	atomic_long_set(&n->total_objects, 0);
>  	INIT_LIST_HEAD(&n->full);
>  #endif
> +	n->barn = barn;
> +	if (barn)
> +		barn_init(barn);
> +
> +	return true;
>  }
>  
>  #ifndef CONFIG_SLUB_TINY
> @@ -5385,7 +6250,7 @@ static void early_kmem_cache_node_alloc(int node)
>  	slab->freelist = get_freepointer(kmem_cache_node, n);
>  	slab->inuse = 1;
>  	kmem_cache_node->node[node] = n;
> -	init_kmem_cache_node(n);
> +	init_kmem_cache_node(n, NULL);
>  	inc_slabs_node(kmem_cache_node, node, slab->objects);
>  
>  	/*
> @@ -5421,20 +6295,27 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>  
>  	for_each_node_mask(node, slab_nodes) {
>  		struct kmem_cache_node *n;
> +		struct node_barn *barn = NULL;
>  
>  		if (slab_state == DOWN) {
>  			early_kmem_cache_node_alloc(node);
>  			continue;
>  		}
> +
> +		if (s->cpu_sheaves) {
> +			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
> +
> +			if (!barn)
> +				return 0;
> +		}
> +
>  		n = kmem_cache_alloc_node(kmem_cache_node,
>  						GFP_KERNEL, node);
> -
> -		if (!n) {
> -			free_kmem_cache_nodes(s);
> +		if (!n)
>  			return 0;
> -		}

Looks like it's leaking the barn if the allocation of kmem_cache_node fails?
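
Something like the following (untested) is what I had in mind, i.e. free
the barn before bailing out, since at this point it is not yet attached
to any kmem_cache_node:

	n = kmem_cache_alloc_node(kmem_cache_node,
					GFP_KERNEL, node);
	if (!n) {
		/* the barn was allocated just above and is unused so far */
		kfree(barn);
		return 0;
	}
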
> -		init_kmem_cache_node(n);
> +		init_kmem_cache_node(n, barn);
> +
>  		s->node[node] = n;
>  	}
>  	return 1;
> @@ -6005,12 +6891,24 @@ static int slab_mem_going_online_callback(void *arg)
>  	 */
>  	mutex_lock(&slab_mutex);
>  	list_for_each_entry(s, &slab_caches, list) {
> +		struct node_barn *barn = NULL;
> +
>  		/*
>  		 * The structure may already exist if the node was previously
>  		 * onlined and offlined.
>  		 */
>  		if (get_node(s, nid))
>  			continue;
> +
> +		if (s->cpu_sheaves) {
> +			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
> +
> +			if (!barn) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
> +		}
> +

Ditto. Otherwise looks good to me :)

>  		/*
>  		 * XXX: kmem_cache_alloc_node will fallback to other nodes
>  		 *      since memory is not yet available from the node that
> @@ -6021,7 +6919,9 @@ static int slab_mem_going_online_callback(void *arg)
>  			ret = -ENOMEM;
>  			goto out;
>  		}
> -		init_kmem_cache_node(n);
> +
> +		init_kmem_cache_node(n, barn);
> +
>  		s->node[nid] = n;
>  	}
>  	/*

-- 
Cheers,
Harry (formerly known as Hyeonggon)