On Mon, Mar 17, 2025 at 03:33:03PM +0100, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will set up a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
>
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> refill one of the sheaves.
>
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
>
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() with a specific
> node (not NUMA_NO_NODE), sheaves are bypassed.
>
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fall back to bulk
> alloc/free to slabs directly to avoid double copying.
>
> Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
> allocated or freed using the sheaves. Counters sheaf_refill,
> sheaf_flush_main and sheaf_flush_other count objects filled or flushed
> from or to slab pages, and can be used to assess how effective the
> caching is. The refill and flush operations will also count towards the
> usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
> counters.
>
> Access to the percpu sheaves is protected by localtry_trylock() when
> potential callers include irq context, and localtry_lock() otherwise
> (such as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
>
> A current limitation is that when slub_debug is enabled for a cache with
> percpu sheaves, the objects in the array are considered as allocated from
> the slub_debug perspective, and the alloc/free debugging hooks occur
> when moving the objects between the array and slab pages. This means
> that e.g. a use-after-free that occurs for an object cached in the
> array is undetected. Collected alloc/free stacktraces might also be less
> useful. This limitation could be changed in the future.
>
> On the other hand, KASAN, kmemcg and other hooks are executed on actual
> allocations and frees by kmem_cache users even if those use the array,
> so their debugging or accounting accuracy should be unaffected.
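
A drive-by question on the interface, with an (untested) sketch of my
understanding below. The cache name, object type and the capacity of 32
are just made-up examples. Opting in is simply a matter of passing a
non-zero sheaf_capacity via kmem_cache_args, right?

	struct kmem_cache_args args = {
		/* enable percpu sheaves, each holding up to 32 objects */
		.sheaf_capacity = 32,
	};

	s = kmem_cache_create("my_obj_cache", sizeof(struct my_obj), &args, 0);
	if (!s)
		return -ENOMEM;
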
>
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---
>  include/linux/slab.h |   34 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1029 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1020 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 7686054dd494cc65def7f58748718e03eb78e481..0e1b25228c77140d05b5b4433c9d7923de36ec05 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -453,12 +489,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
>   */
>  static nodemask_t slab_nodes;
>  
> -#ifndef CONFIG_SLUB_TINY
>  /*
>   * Workqueue used for flush_cpu_slab().
>   */
>  static struct workqueue_struct *flushwq;
> -#endif
> +
> +struct slub_flush_work {
> +	struct work_struct work;
> +	struct kmem_cache *s;
> +	bool skip;
> +};
> +
> +static DEFINE_MUTEX(flush_lock);
> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>  
>  /********************************************************************
>   *			Core slab cache functions
> @@ -2410,6 +2453,358 @@ static void *setup_object(struct kmem_cache *s, void *object)
>  	return object;
>  }
> +/*
> + * Bulk free objects to the percpu sheaves.
> + * Unlike free_to_pcs() this includes the calls to all necessary hooks
> + * and the fallback to freeing to slab pages.
> + */
> +static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +{

[...snip...]

> +next_batch:
> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
> +		goto fallback;
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +		struct slab_sheaf *empty;
> +
> +		if (!pcs->spare) {
> +			empty = barn_get_empty_sheaf(pcs->barn);
> +			if (empty) {
> +				pcs->spare = pcs->main;
> +				pcs->main = empty;
> +				goto do_free;
> +			}
> +			goto no_empty;

Maybe a silly question, but since neither alloc_from_pcs_bulk() nor
free_to_pcs_bulk() allocates empty sheaves (and they only sometimes put
empty or full sheaves into the barn), shouldn't we expect the barn to
usually have no sheaves available when only the bulk interfaces are used?

> +		}
> +
> +		if (pcs->spare->size < s->sheaf_capacity) {
> +			stat(s, SHEAF_SWAP);
> +			swap(pcs->main, pcs->spare);
> +			goto do_free;
> +		}
> +
> +		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +		if (!IS_ERR(empty)) {
> +			pcs->main = empty;
> +			goto do_free;
> +		}
> +
> +no_empty:
> +		localtry_unlock(&s->cpu_sheaves->lock);
> +
> +		/*
> +		 * if we depleted all empty sheaves in the barn or there are too
> +		 * many full sheaves, free the rest to slab pages
> +		 */
> +fallback:
> +		__kmem_cache_free_bulk(s, size, p);
> +		return;
> +	}
> +
> +do_free:
> +	main = pcs->main;
> +	batch = min(size, s->sheaf_capacity - main->size);
> +
> +	memcpy(main->objects + main->size, p, batch * sizeof(void *));
> +	main->size += batch;
> +
> +	localtry_unlock(&s->cpu_sheaves->lock);
> +
> +	stat_add(s, FREE_PCS, batch);
> +
> +	if (batch < size) {
> +		p += batch;
> +		size -= batch;
> +		goto next_batch;
> +	}
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -5309,8 +6145,8 @@ static inline int calculate_order(unsigned int size)
>  	return -ENOSYS;
>  }
>  
> -static void
> -init_kmem_cache_node(struct kmem_cache_node *n)
> +static bool
> +init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
>  {

Why is the return type bool, when it always succeeds?

>  	n->nr_partial = 0;
>  	spin_lock_init(&n->list_lock);
> @@ -5320,6 +6156,11 @@ init_kmem_cache_node(struct kmem_cache_node *n)
>  	atomic_long_set(&n->total_objects, 0);
>  	INIT_LIST_HEAD(&n->full);
>  #endif
> +	n->barn = barn;
> +	if (barn)
> +		barn_init(barn);
> +
> +	return true;
>  }
>  
>  #ifndef CONFIG_SLUB_TINY
> @@ -5385,7 +6250,7 @@ static void early_kmem_cache_node_alloc(int node)
>  	slab->freelist = get_freepointer(kmem_cache_node, n);
>  	slab->inuse = 1;
>  	kmem_cache_node->node[node] = n;
> -	init_kmem_cache_node(n);
> +	init_kmem_cache_node(n, NULL);
>  	inc_slabs_node(kmem_cache_node, node, slab->objects);
>  
>  	/*
> @@ -5421,20 +6295,27 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>  
>  	for_each_node_mask(node, slab_nodes) {
>  		struct kmem_cache_node *n;
> +		struct node_barn *barn = NULL;
>  
>  		if (slab_state == DOWN) {
>  			early_kmem_cache_node_alloc(node);
>  			continue;
>  		}
> +
> +		if (s->cpu_sheaves) {
> +			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
> +
> +			if (!barn)
> +				return 0;
> +		}
> +
>  		n = kmem_cache_alloc_node(kmem_cache_node,
>  						GFP_KERNEL, node);
> -
> -		if (!n) {
> -			free_kmem_cache_nodes(s);
> +		if (!n)
>  			return 0;
> -		}

Looks like it's leaking the barn if the allocation of kmem_cache_node fails?
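
Something like the following (untested) is what I had in mind, i.e. free
the barn before bailing out, since at this point it is not yet attached
to any kmem_cache_node:

	n = kmem_cache_alloc_node(kmem_cache_node,
					GFP_KERNEL, node);
	if (!n) {
		/* the barn was allocated just above and is unused so far */
		kfree(barn);
		return 0;
	}
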
> -		init_kmem_cache_node(n);
> +		init_kmem_cache_node(n, barn);
> +
>  		s->node[node] = n;
>  	}
>  	return 1;
> @@ -6005,12 +6891,24 @@ static int slab_mem_going_online_callback(void *arg)
>  	 */
>  	mutex_lock(&slab_mutex);
>  	list_for_each_entry(s, &slab_caches, list) {
> +		struct node_barn *barn = NULL;
> +
>  		/*
>  		 * The structure may already exist if the node was previously
>  		 * onlined and offlined.
>  		 */
>  		if (get_node(s, nid))
>  			continue;
> +
> +		if (s->cpu_sheaves) {
> +			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
> +
> +			if (!barn) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
> +		}
> +

Ditto. Otherwise looks good to me :)

>  		/*
>  		 * XXX: kmem_cache_alloc_node will fallback to other nodes
>  		 *      since memory is not yet available from the node that
> @@ -6021,7 +6919,9 @@ static int slab_mem_going_online_callback(void *arg)
>  			ret = -ENOMEM;
>  			goto out;
>  		}
> -		init_kmem_cache_node(n);
> +
> +		init_kmem_cache_node(n, barn);
> +
>  		s->node[nid] = n;
>  	}
>  	/*

-- 
Cheers,
Harry (formerly known as Hyeonggon)