ive been tinkering with per cpu memory in the kernel.
per cpu memory is pretty much what it sounds like: you allocate
memory for each cpu to operate on independently of the rest of the
system, thereby reducing contention between cpus on cache lines.
this diff introduces wrappers around the kernel memory allocators,
so when you ask to allocate N bytes, you get an N sized allocation
for each cpu, and a way to access that memory from each cpu.
cpumem_get() and cpumem_put() are wrappers around pool_get() and
pool_put(), and cpumem_malloc() and cpumem_free() are wrappers
around malloc() and free(). instead of returning a direct reference
to memory, these return a struct cpumem pointer. you then later get
a reference to each cpu's allocation with cpumem_enter(), and release
that reference with cpumem_leave().
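as a rough userland sketch of the MP indirection (the *_model names
are mine, and the cpu number is passed in explicitly here instead of
coming from curcpu()):

```c
#include <stdlib.h>

/* toy model of the MP layout: an array with one struct cpumem slot
 * per cpu, each slot pointing at that cpu's private allocation */
struct cpumem {
	void *mem;
};

static struct cpumem *
cpumem_malloc_model(size_t sz, unsigned int ncpus)
{
	struct cpumem *cm = calloc(ncpus, sizeof(*cm));
	unsigned int cpu;

	for (cpu = 0; cpu < ncpus; cpu++)
		cm[cpu].mem = calloc(1, sz);

	return (cm);
}

/* cpumem_enter() boils down to indexing the map by the current cpu */
static void *
cpumem_enter_model(struct cpumem *cm, unsigned int cpu)
{
	return (cm[cpu].mem);
}
```

each "cpu" sees only its own slot, so writes never share cache lines
with another cpu's counters.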
im still debating whether the API should protect against
interrupts on the local cpu by handling spls for you. at the moment
it is up to the caller to manually splfoo() and splx(), but im half
convinced that cpumem_enter and _leave should do that on behalf of
the caller.
this diff also includes two uses of the percpu code. one is to
provide per cpu caches of pool items, and the other is per cpu
counters for mbuf statistics.
ive added a wrapper around percpu memory for counters. basically
you ask it for N counters, where each counter is a uint64_t.
counters_enter will give you a per cpu reference to these N counters
which you can increment and decrement as you wish. internally the
api will version the per cpu counters so a reader can know when
they're consistent, which is important on 32bit archs (where 64bit
ops arent necessarily atomic), or where you want several counters
to be consistent with each other (like packet and byte counters).
counters_read is provided to use the above machinery for a consistent
read.
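a single-threaded userland sketch of the generation scheme (the
*_model names are mine; the real api also needs the membars and the
iteration across every cpu's counters):

```c
#include <stdint.h>
#include <string.h>

/* model of the counters layout: slot 0 is the generation number,
 * followed by the n counters. the generation is odd while a writer
 * is mid-update and even otherwise. */

static void
counters_add_model(uint64_t *c, unsigned int idx, uint64_t v)
{
	c[0]++;			/* odd: update in progress */
	c[1 + idx] += v;
	c[0]++;			/* even: update complete */
}

/* returns 1 if the copy in out[] is consistent, 0 if the caller
 * should retry because a writer was active */
static int
counters_read_model(const uint64_t *c, uint64_t *out, unsigned int n)
{
	uint64_t enter = c[0];

	if (enter & 1)
		return (0);
	memcpy(out, c + 1, n * sizeof(uint64_t));
	return (c[0] == enter);
}
```

a reader that sees the same even generation before and after the copy
knows all n counters belong to the same update, even without 64bit
atomics.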
the per cpu caches in pools are modelled on the ones described in
the "Magazines and Vmem: Extending the Slab Allocator to Many CPUs
and Arbitrary Resources" paper by Jeff Bonwick and Jonathan Adams.
pools are based on slabs, so it seems defensible to use them as the
basis for further improvements.
like the magazine paper, it maintains a pair of "magazines" of pool
items on each cpu. when both are full, one gets pushed to a global
depot. if both are empty it will try to allocate a whole magazine
from the depot, and if that fails it will fall through to the normal
pool_get() allocation. this scheme for amortising access to the
global data structures is the big take-away from the paper in my
opinion.
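the get/put policy can be sketched with item counts standing in for
real item lists (MAG_SIZE and the *_model names are mine, and locking
is left out entirely):

```c
/* toy model of the two-magazine policy: each cpu keeps an active
 * and a previous list, counted in items; full lists migrate to a
 * shared depot, and empty cpus pull whole lists back */
#define MAG_SIZE 4

struct depot_model {
	int nmags;		/* full lists in the global depot */
};

struct cache_model {
	int actv;		/* items in the active list */
	int prev;		/* items in the previous list */
};

static void
cache_put_model(struct cache_model *pc, struct depot_model *d)
{
	if (pc->actv >= MAG_SIZE) {
		if (pc->prev > 0)
			d->nmags++;	/* push a full list to the depot */
		pc->prev = pc->actv;
		pc->actv = 0;
	}
	pc->actv++;			/* freed item joins the active list */
}

/* returns 1 if the cache satisfied the get, 0 if the caller must
 * fall through to the normal pool_get() path */
static int
cache_get_model(struct cache_model *pc, struct depot_model *d)
{
	if (pc->actv == 0) {
		if (pc->prev > 0) {
			pc->actv = pc->prev;	/* swap in the other list */
			pc->prev = 0;
		} else if (d->nmags > 0) {
			d->nmags--;	/* pull a full list from the depot */
			pc->actv = MAG_SIZE;
		} else
			return (0);
	}
	pc->actv--;
	return (1);
}
```

the global depot is only touched once per MAG_SIZE operations in the
steady state, which is the whole point of the magazine layer.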
unlike the paper though, the per cpu caches in pools take advantage
of the fact that pools are not a caching memory allocator, ie, we
dont keep track of pool items in a constructed state so we can
scribble over the memory to our heart's content. with that in mind,
the per cpu caches in pools build linked lists of free items rather
than allocate magazines to point at pool items. in the future this
will greatly simplify scaling the size of magazines. right now there
is no point because there's no contention on anything in the kernel
except the big lock.
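the free-list trick itself is just threading a next pointer through
the dead items; a minimal userland sketch (names are mine):

```c
#include <stddef.h>

/* because pool items carry no constructed state, the first word of
 * a free item can be scribbled on to hold the list link, so no
 * separate magazine array is needed */
static void *free_head;

static void
item_free_model(void *v)
{
	*(void **)v = free_head;	/* link lives inside the item */
	free_head = v;
}

static void *
item_get_model(void)
{
	void *v = free_head;

	if (v != NULL)
		free_head = *(void **)v;
	return (v);
}
```

since the list costs nothing beyond the items themselves, growing the
"magazine" size later is just a matter of counting more items onto
the same chain.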
there are some consequences of the per cpu pool caches. the most
important is that a hard limit on pool items cannot work because
that requires access to a shared count, which makes per cpu caches
completely pointless. a compromise might be to limit the total
number of pages available to the pool rather than limiting individual
pool item counts. this would work fine in the mbuf layer for example.
finally, two things to note.
firstly, ive written an alternate backend for this stuff for
uniprocessor kernels that should collapse down to a simple pointer
deref, rather than an indirect reference through the map of cpus
to allocations. i havent tested it at all though.
secondly, there is a bootstrapping problem with per cpu data
structures, which is very apparent in the mbuf layer. the problem
is we dont know how many cpus are in the system until we're halfway
through attaching device drivers. however, if we want
to use percpu data structures during attach we need to know how
many cpus we have. mbufs are allocated during attach, so we need
to know how many cpus we have before attach.
solaris deals with this problem by assuming MAXCPUS when allocating
the map of cpus to allocations. im not a fan of this because on
sparc64 MAXCPUS is 256, but my v445 has 4. using MAXCPUS for the
per cpu map will cause me to waste nearly 2k of memory (256 cpus *
8 bytes per pointer), which makes per cpu data structures less
attractive.
i also want to avoid conditionals in hot code like the mbuf layer,
so i dont want to write if (mbstat != NULL) { cpumem_enter(mbstat); }
etc, when that test will evaluate true all the time except during
boot.
instead i have a compromise where i allocate a single cpu's worth
of memory as a global to be used during boot. we only spin up cpus
late during boot (relatively speaking) so we can assume ncpus is 1
until after the hardware has attached. at that point percpu_init
bootstraps the per cpu map allocations and the mbuf layer reallocates
the mbstat percpu mem with it.
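a sketch of that handover under those assumptions (the *_model names
are mine; the real code does this via CPUMEM_BOOT_MEMORY and
cpumem_realloc()):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NCOUNTERS 4

/* static single-cpu memory that is safe to use from early boot,
 * before we know how many cpus exist */
static uint64_t boot_counters[NCOUNTERS];
static uint64_t *counters = boot_counters;

/* once ncpus is known, build the real allocation and carry the
 * boot-time counts across, like cpumem_realloc() does in the diff */
static void
counters_bootstrap_model(unsigned int ncpus)
{
	uint64_t *real = calloc(ncpus * NCOUNTERS, sizeof(uint64_t));

	memcpy(real, counters, NCOUNTERS * sizeof(uint64_t));
	counters = real;
}
```

nothing accounted during attach is lost: cpu0 simply inherits the
boot counts when the per cpu map is stood up.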
the bet is that we will waste less memory on these boot globals
than we would waste by assuming MAXCPUS on the majority of systems.
if i do end up with a sparc64 machine with 256
cpus, i am almost certainly going to have enough ram to cope with
losing some bytes here and there.
thoughts?
Index: sys/percpu.h
===================================================================
RCS file: sys/percpu.h
diff -N sys/percpu.h
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ sys/percpu.h 11 Aug 2016 03:52:12 -0000
@@ -0,0 +1,153 @@
+/* $OpenBSD$ */
+
+/*
+ * Copyright (c) 2016 David Gwynne <[email protected]>
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#ifndef _SYS_PERCPU_H_
+#define _SYS_PERCPU_H_
+
+#ifndef __upunused /* this should go in param.h */
+#ifdef MULTIPROCESSOR
+#define __upunused
+#else
+#define __upunused __attribute__((__unused__))
+#endif
+#endif
+
+struct cpumem {
+ void *mem;
+};
+
+struct cpumem_iter {
+ unsigned int cpu;
+} __upunused;
+
+struct counters_ref {
+ uint64_t *gen;
+};
+
+#ifdef _KERNEL
+struct pool;
+
+struct cpumem *cpumem_get(struct pool *);
+void cpumem_put(struct pool *, struct cpumem *);
+
+struct cpumem *cpumem_malloc(size_t, int);
+struct cpumem *cpumem_realloc(struct cpumem *, size_t, int);
+void cpumem_free(struct cpumem *, int, size_t);
+
+#ifdef MULTIPROCESSOR
+static inline void *
+cpumem_enter(struct cpumem *cm)
+{
+ unsigned int cpu = CPU_INFO_UNIT(curcpu());
+ return (cm[cpu].mem);
+}
+
+static inline void
+cpumem_leave(struct cpumem *cm, void *mem)
+{
+ /* KDASSERT? */
+}
+
+void *cpumem_first(struct cpumem_iter *, struct cpumem *);
+void *cpumem_next(struct cpumem_iter *, struct cpumem *);
+
+#define CPUMEM_BOOT_MEMORY(_name, _sz) \
+static struct {							\
+ struct cpumem cpumem; \
+ unsigned char mem[_sz]; \
+} _name##_boot_cpumem = { \
+ .cpumem = { _name##_boot_cpumem.mem } \
+}
+
+#define CPUMEM_BOOT_INITIALIZER(_name) \
+ { &_name##_boot_cpumem.cpumem }
+
+#else /* MULTIPROCESSOR */
+static inline void *
+cpumem_enter(struct cpumem *cm)
+{
+ return (cm);
+}
+
+static inline void
+cpumem_leave(struct cpumem *cm, void *mem)
+{
+ /* KDASSERT? */
+}
+
+static inline void *
+cpumem_first(struct cpumem_iter *i, struct cpumem *cm)
+{
+ return (cm);
+}
+
+static inline void *
+cpumem_next(struct cpumem_iter *i, struct cpumem *cm)
+{
+ return (NULL);
+}
+
+#define CPUMEM_BOOT_MEMORY(_name, _sz) \
+static struct {							\
+ unsigned char mem[_sz]; \
+} _name##_boot_cpumem
+
+#define CPUMEM_BOOT_INITIALIZER(_name)				\
+ { (struct cpumem *)&_name##_boot_cpumem.mem }
+
+#endif /* MULTIPROCESSOR */
+
+#define CPUMEM_FOREACH(_var, _iter, _cpumem) \
+ for ((_var) = cpumem_first((_iter), (_cpumem)); \
+ (_var) != NULL; \
+ (_var) = cpumem_next((_iter), (_cpumem)))
+
+struct cpumem *counters_alloc(unsigned int, int);
+struct cpumem *counters_realloc(struct cpumem *, unsigned int, int);
+void counters_free(struct cpumem *, int, unsigned int);
+void counters_read(struct cpumem *, uint64_t *, unsigned int);
+void counters_zero(struct cpumem *, unsigned int);
+
+#ifdef MULTIPROCESSOR
+uint64_t *counters_enter(struct counters_ref *, struct cpumem *);
+void counters_leave(struct counters_ref *, struct cpumem *);
+
+#define COUNTERS_BOOT_MEMORY(_name, _n) \
+ CPUMEM_BOOT_MEMORY(_name, ((_n) + 1) * sizeof(uint64_t))
+#else
+static inline uint64_t *
+counters_enter(struct counters_ref *r, struct cpumem *cm)
+{
+ r->gen = cpumem_enter(cm);
+	return (r->gen);
+}
+
+static inline void
+counters_leave(struct counters_ref *r, struct cpumem *cm)
+{
+ cpumem_leave(cm, r->gen);
+}
+
+#define COUNTERS_BOOT_MEMORY(_name, _n)				\
+ CPUMEM_BOOT_MEMORY(_name, (_n) * sizeof(uint64_t))
+#endif
+
+#define COUNTERS_BOOT_INITIALIZER(_name) CPUMEM_BOOT_INITIALIZER(_name)
+
+#endif /* _KERNEL */
+#endif /* _SYS_PERCPU_H_ */
Index: kern/subr_percpu.c
===================================================================
RCS file: kern/subr_percpu.c
diff -N kern/subr_percpu.c
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ kern/subr_percpu.c 11 Aug 2016 03:52:12 -0000
@@ -0,0 +1,341 @@
+/* $OpenBSD$ */
+
+/*
+ * Copyright (c) 2016 David Gwynne <[email protected]>
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/pool.h>
+#include <sys/malloc.h>
+#include <sys/types.h>
+#include <sys/atomic.h>
+
+#include <sys/percpu.h>
+
+#ifndef CACHELINESIZE
+#define CACHELINESIZE 64
+#endif
+
+#ifdef MULTIPROCESSOR
+struct pool cpumem_pl;
+
+void
+percpu_init(void)
+{
+ pool_init(&cpumem_pl, sizeof(struct cpumem) * ncpus, 0, 0,
+ PR_WAITOK, "percpumem", &pool_allocator_single);
+ pool_setipl(&cpumem_pl, IPL_NONE);
+}
+
+struct cpumem *
+cpumem_get(struct pool *pp)
+{
+ struct cpumem *cm;
+ unsigned int cpu;
+
+ cm = pool_get(&cpumem_pl, PR_WAITOK);
+
+ for (cpu = 0; cpu < ncpus; cpu++)
+ cm[cpu].mem = pool_get(pp, PR_WAITOK | PR_ZERO);
+
+ return (cm);
+}
+
+void
+cpumem_put(struct pool *pp, struct cpumem *cm)
+{
+ unsigned int cpu;
+
+ for (cpu = 0; cpu < ncpus; cpu++)
+ pool_put(pp, cm[cpu].mem);
+
+ pool_put(&cpumem_pl, cm);
+}
+
+struct cpumem *
+cpumem_malloc(size_t sz, int type)
+{
+ struct cpumem *cm;
+ unsigned int cpu;
+
+ sz = roundup(sz, CACHELINESIZE);
+
+ cm = pool_get(&cpumem_pl, PR_WAITOK);
+
+ for (cpu = 0; cpu < ncpus; cpu++)
+ cm[cpu].mem = malloc(sz, type, M_WAITOK | M_ZERO);
+
+ return (cm);
+}
+
+struct cpumem *
+cpumem_realloc(struct cpumem *bootcm, size_t sz, int type)
+{
+ struct cpumem *newcm;
+
+ newcm = cpumem_malloc(sz, type);
+ memcpy(newcm[0].mem, bootcm[0].mem, sz);
+
+ return (newcm);
+}
+
+void
+cpumem_free(struct cpumem *cm, int type, size_t sz)
+{
+ unsigned int cpu;
+
+ sz = roundup(sz, CACHELINESIZE);
+
+ for (cpu = 0; cpu < ncpus; cpu++)
+ free(cm[cpu].mem, type, sz);
+
+ pool_put(&cpumem_pl, cm);
+}
+
+void *
+cpumem_first(struct cpumem_iter *i, struct cpumem *cm)
+{
+ i->cpu = 0;
+
+ return (cm[0].mem);
+}
+
+void *
+cpumem_next(struct cpumem_iter *i, struct cpumem *cm)
+{
+ unsigned int cpu = ++i->cpu;
+
+ if (cpu >= ncpus)
+ return (NULL);
+
+ return (cm[cpu].mem);
+}
+
+struct cpumem *
+counters_alloc(unsigned int n, int type)
+{
+ struct cpumem *cm;
+ struct cpumem_iter cmi;
+ uint64_t *counters;
+ unsigned int i;
+
+ KASSERT(n > 0);
+
+ n++; /* add space for a generation number */
+ cm = cpumem_malloc(n * sizeof(uint64_t), type);
+
+ CPUMEM_FOREACH(counters, &cmi, cm) {
+ for (i = 0; i < n; i++)
+ counters[i] = 0;
+ }
+
+ return (cm);
+}
+
+struct cpumem *
+counters_realloc(struct cpumem *cm, unsigned int n, int type)
+{
+ n++; /* the generation number */
+ return (cpumem_realloc(cm, n * sizeof(uint64_t), type));
+}
+
+void
+counters_free(struct cpumem *cm, int type, unsigned int n)
+{
+ n++; /* generation number */
+ cpumem_free(cm, type, n * sizeof(uint64_t));
+}
+
+uint64_t *
+counters_enter(struct counters_ref *ref, struct cpumem *cm)
+{
+ ref->gen = cpumem_enter(cm);
+	(*ref->gen)++;		/* make the generation number odd */
+	membar_producer();	/* publish the odd generation first */
+	return (ref->gen + 1);
+}
+
+void
+counters_leave(struct counters_ref *ref, struct cpumem *cm)
+{
+ membar_producer();
+ (*ref->gen)++; /* make the generation number even again */
+ cpumem_leave(cm, ref->gen);
+}
+
+void
+counters_read(struct cpumem *cm, uint64_t *output, unsigned int n)
+{
+ struct cpumem_iter cmi;
+ uint64_t *gen, *counters, *temp;
+ uint64_t enter, leave;
+ unsigned int i;
+
+ for (i = 0; i < n; i++)
+ output[i] = 0;
+
+ temp = mallocarray(n, sizeof(uint64_t), M_TEMP, M_WAITOK);
+
+ gen = cpumem_first(&cmi, cm);
+ do {
+ counters = gen + 1;
+
+ enter = *gen;
+ for (;;) {
+ /* the generation number is odd during an update */
+ while (enter & 1) {
+ yield();
+ membar_consumer();
+ enter = *gen;
+ }
+
+ for (i = 0; i < n; i++)
+ temp[i] = counters[i];
+
+ membar_consumer();
+ leave = *gen;
+
+ if (enter == leave)
+ break;
+
+ enter = leave;
+ }
+
+ for (i = 0; i < n; i++)
+ output[i] += temp[i];
+
+ gen = cpumem_next(&cmi, cm);
+ } while (gen != NULL);
+
+ free(temp, M_TEMP, n * sizeof(uint64_t));
+}
+
+void
+counters_zero(struct cpumem *cm, unsigned int n)
+{
+ struct cpumem_iter cmi;
+ uint64_t *counters;
+ unsigned int i;
+
+ n++; /* zero the generation numbers too */
+
+ counters = cpumem_first(&cmi, cm);
+ do {
+ for (i = 0; i < n; i++)
+ counters[i] = 0;
+
+ counters = cpumem_next(&cmi, cm);
+ } while (counters != NULL);
+}
+
+#else /* MULTIPROCESSOR */
+
+/*
+ * Uniprocessor implementation of per-CPU data structures.
+ *
+ * UP percpu memory is a single memory allocation cast to/from the
+ * cpumem struct. It is not padded to the cacheline size because
+ * there are no other cpus to contend with.
+ */
+
+void
+percpu_init(void)
+{
+ /* nop */
+}
+
+struct cpumem *
+cpumem_get(struct pool *pp)
+{
+ return (pool_get(pp, PR_WAITOK));
+}
+
+void
+cpumem_put(struct pool *pp, struct cpumem *cm)
+{
+ pool_put(pp, cm);
+}
+
+struct cpumem *
+cpumem_malloc(size_t sz, int type)
+{
+ return (malloc(sz, type, M_WAITOK));
+}
+
+struct cpumem *
+cpumem_realloc(struct cpumem *cm, size_t sz, int type)
+{
+ return (cm);
+}
+
+void
+cpumem_free(struct cpumem *cm, int type, size_t sz)
+{
+ free(cm, type, sz);
+}
+
+struct cpumem *
+counters_alloc(unsigned int n, int type)
+{
+ KASSERT(n > 0);
+
+ return (cpumem_malloc(n * sizeof(uint64_t), type));
+}
+
+struct cpumem *
+counters_realloc(struct cpumem *cm, unsigned int n, int type)
+{
+	/* this is unnecessary, but symmetrical */
+ return (cpumem_realloc(cm, n * sizeof(uint64_t), type));
+}
+
+void
+counters_free(struct cpumem *cm, int type, unsigned int n)
+{
+ cpumem_free(cm, type, n * sizeof(uint64_t));
+}
+
+void
+counters_read(struct cpumem *cm, uint64_t *output, unsigned int n)
+{
+ uint64_t *counters;
+ unsigned int i;
+ int s;
+
+ counters = (uint64_t *)cm;
+
+ s = splhigh();
+ for (i = 0; i < n; i++)
+ output[i] = counters[i];
+ splx(s);
+}
+
+void
+counters_zero(struct cpumem *cm, unsigned int n)
+{
+ uint64_t *counters;
+ unsigned int i;
+ int s;
+
+ counters = (uint64_t *)cm;
+
+ s = splhigh();
+ for (i = 0; i < n; i++)
+ counters[i] = 0;
+ splx(s);
+}
+
+#endif /* MULTIPROCESSOR */
+
Index: conf/files
===================================================================
RCS file: /cvs/src/sys/conf/files,v
retrieving revision 1.622
diff -u -p -r1.622 files
--- conf/files 5 Aug 2016 19:00:25 -0000 1.622
+++ conf/files 11 Aug 2016 03:52:12 -0000
@@ -687,6 +687,7 @@ file kern/subr_evcount.c
file kern/subr_extent.c
file kern/subr_hibernate.c hibernate
file kern/subr_log.c
+file kern/subr_percpu.c
file kern/subr_poison.c diagnostic
file kern/subr_pool.c
file kern/dma_alloc.c
Index: kern/init_main.c
===================================================================
RCS file: /cvs/src/sys/kern/init_main.c,v
retrieving revision 1.253
diff -u -p -r1.253 init_main.c
--- kern/init_main.c 17 May 2016 23:28:03 -0000 1.253
+++ kern/init_main.c 11 Aug 2016 03:52:12 -0000
@@ -143,6 +143,7 @@ void init_exec(void);
void kqueue_init(void);
void taskq_init(void);
void pool_gc_pages(void *);
+void percpu_init(void);
extern char sigcode[], esigcode[], sigcoderet[];
#ifdef SYSCALL_DEBUG
@@ -413,6 +414,9 @@ main(void *framep)
__guard_local = newguard;
}
#endif
+
+ percpu_init(); /* per cpu memory allocation */
+ mbcache(); /* enable per cpu caches on mbuf pools */
/* init exec and emul */
init_exec();
Index: kern/kern_sysctl.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
retrieving revision 1.306
diff -u -p -r1.306 kern_sysctl.c
--- kern/kern_sysctl.c 14 Jul 2016 15:39:40 -0000 1.306
+++ kern/kern_sysctl.c 11 Aug 2016 03:52:12 -0000
@@ -77,6 +77,7 @@
#include <sys/sched.h>
#include <sys/mount.h>
#include <sys/syscallargs.h>
+#include <sys/percpu.h>
#include <uvm/uvm_extern.h>
@@ -386,9 +387,24 @@ kern_sysctl(int *name, u_int namelen, vo
case KERN_FILE:
return (sysctl_file(name + 1, namelen - 1, oldp, oldlenp, p));
#endif
- case KERN_MBSTAT:
- return (sysctl_rdstruct(oldp, oldlenp, newp, &mbstat,
- sizeof(mbstat)));
+ case KERN_MBSTAT: {
+ extern struct cpumem *mbstat;
+ uint64_t counters[MBSTAT_COUNT];
+ struct mbstat mbs;
+ unsigned int i;
+
+ memset(&mbs, 0, sizeof(mbs));
+ counters_read(mbstat, counters, MBSTAT_COUNT);
+ for (i = 0; i < MBSTAT_TYPES; i++)
+ mbs.m_mtypes[i] = counters[i];
+
+ mbs.m_drops = counters[MBSTAT_DROPS];
+ mbs.m_wait = counters[MBSTAT_WAIT];
+ mbs.m_drain = counters[MBSTAT_DRAIN];
+
+ return (sysctl_rdstruct(oldp, oldlenp, newp,
+ &mbs, sizeof(mbs)));
+ }
#ifdef GPROF
case KERN_PROF:
return (sysctl_doprof(name + 1, namelen - 1, oldp, oldlenp,
Index: kern/subr_pool.c
===================================================================
RCS file: /cvs/src/sys/kern/subr_pool.c,v
retrieving revision 1.194
diff -u -p -r1.194 subr_pool.c
--- kern/subr_pool.c 15 Jan 2016 11:21:58 -0000 1.194
+++ kern/subr_pool.c 11 Aug 2016 03:52:12 -0000
@@ -42,6 +42,7 @@
#include <sys/sysctl.h>
#include <sys/task.h>
#include <sys/timeout.h>
+#include <sys/percpu.h>
#include <uvm/uvm_extern.h>
@@ -74,6 +75,27 @@ struct rwlock pool_lock = RWLOCK_INITIAL
/* Private pool for page header structures */
struct pool phpool;
+struct pool_list {
+ struct pool_list *pl_next; /* next in list */
+ unsigned long pl_cookie;
+ struct pool_list *pl_nextl; /* next list */
+ unsigned long pl_nitems; /* items in list */
+};
+
+struct pool_cache {
+ struct pool_list *pc_actv;
+ unsigned long pc_nactv; /* cache pc_actv nitems */
+ struct pool_list *pc_prev;
+
+ uint64_t pc_gets;
+ uint64_t pc_puts;
+ uint64_t pc_fails;
+
+ int pc_nout;
+};
+
+struct pool pool_caches; /* per cpu cache entries */
+
struct pool_item_header {
/* Page headers */
TAILQ_ENTRY(pool_item_header)
@@ -117,6 +139,8 @@ int pool_chk(struct pool *);
void pool_get_done(void *, void *);
void pool_runqueue(struct pool *, int);
+void pool_cache_destroy(struct pool *);
+
void *pool_allocator_alloc(struct pool *, int, int *);
void pool_allocator_free(struct pool *, void *);
@@ -347,11 +371,52 @@ pool_init(struct pool *pp, size_t size,
}
void
+pool_cache_init(struct pool *pp)
+{
+ struct cpumem *cm;
+ struct pool_cache *pc;
+ struct cpumem_iter i;
+
+ if (pool_caches.pr_size == 0) {
+ pool_init(&pool_caches, sizeof(struct pool_cache), 64, 0,
+ PR_WAITOK, "plcache", NULL);
+ pool_setipl(&pool_caches, IPL_NONE);
+ }
+
+ KASSERT(pp->pr_size >= sizeof(*pc));
+
+ cm = cpumem_get(&pool_caches);
+
+ mtx_init(&pp->pr_cache_mtx, pp->pr_ipl);
+ pp->pr_cache_list = NULL;
+ pp->pr_cache_nlist = 0;
+ pp->pr_cache_items = 8;
+ pp->pr_cache_contention = 0;
+ pp->pr_cache_contention_prev = 0;
+
+ CPUMEM_FOREACH(pc, &i, cm) {
+ pc->pc_actv = NULL;
+ pc->pc_nactv = 0;
+ pc->pc_prev = NULL;
+
+ pc->pc_gets = 0;
+ pc->pc_puts = 0;
+ pc->pc_fails = 0;
+ pc->pc_nout = 0;
+ }
+
+ pp->pr_cache = cm;
+}
+
+void
pool_setipl(struct pool *pp, int ipl)
{
pp->pr_ipl = ipl;
mtx_init(&pp->pr_mtx, ipl);
mtx_init(&pp->pr_requests_mtx, ipl);
+
+ if (pp->pr_cache != NULL)
+ mtx_init(&pp->pr_cache_mtx, ipl);
}
/*
@@ -363,6 +428,9 @@ pool_destroy(struct pool *pp)
struct pool_item_header *ph;
struct pool *prev, *iter;
+ if (pp->pr_cache != NULL)
+ pool_cache_destroy(pp);
+
#ifdef DIAGNOSTIC
if (pp->pr_nout != 0)
panic("%s: pool busy: still out: %u", __func__, pp->pr_nout);
@@ -397,6 +465,48 @@ pool_destroy(struct pool *pp)
KASSERT(TAILQ_EMPTY(&pp->pr_partpages));
}
+struct pool_list *
+pool_list_put(struct pool *pp, struct pool_list *pl)
+{
+ struct pool_list *rpl, *npl;
+
+ if (pl == NULL)
+ return (NULL);
+
+	rpl = pl->pl_nextl;
+
+ do {
+ npl = pl->pl_next;
+ pool_put(pp, pl);
+ pl = npl;
+ } while (pl != NULL);
+
+ return (rpl);
+}
+
+void
+pool_cache_destroy(struct pool *pp)
+{
+ struct pool_cache *pc;
+ struct pool_list *pl;
+ struct cpumem_iter i;
+ struct cpumem *cm;
+
+ cm = pp->pr_cache;
+ pp->pr_cache = NULL; /* make pool_put avoid the cache */
+
+ CPUMEM_FOREACH(pc, &i, cm) {
+ pool_list_put(pp, pc->pc_actv);
+ pool_list_put(pp, pc->pc_prev);
+ }
+
+ cpumem_put(&pool_caches, cm);
+
+ pl = pp->pr_cache_list;
+ while (pl != NULL)
+ pl = pool_list_put(pp, pl);
+}
+
void
pool_request_init(struct pool_request *pr,
void (*handler)(void *, void *), void *cookie)
@@ -420,6 +530,121 @@ struct pool_get_memory {
void * volatile v;
};
+static inline void
+pool_list_enter(struct pool *pp)
+{
+ if (mtx_enter_try(&pp->pr_cache_mtx) == 0) {
+ mtx_enter(&pp->pr_cache_mtx);
+ pp->pr_cache_contention++;
+ }
+}
+
+static inline void
+pool_list_leave(struct pool *pp)
+{
+ mtx_leave(&pp->pr_cache_mtx);
+}
+
+static inline struct pool_list *
+pool_list_alloc(struct pool *pp, struct pool_cache *pc)
+{
+ struct pool_list *pl;
+
+ pool_list_enter(pp);
+ pl = pp->pr_cache_list;
+ if (pl != NULL) {
+ pp->pr_cache_list = pl->pl_nextl;
+ pp->pr_cache_nlist--;
+ }
+
+ pp->pr_cache_nout += pc->pc_nout;
+ pc->pc_nout = 0;
+ pool_list_leave(pp);
+
+
+ return (pl);
+}
+
+static inline void
+pool_list_free(struct pool *pp, struct pool_cache *pc, struct pool_list *pl)
+{
+ pool_list_enter(pp);
+ pl->pl_nextl = pp->pr_cache_list;
+ pp->pr_cache_list = pl;
+ pp->pr_cache_nlist++;
+
+ pp->pr_cache_nout += pc->pc_nout;
+ pc->pc_nout = 0;
+ pool_list_leave(pp);
+}
+
+void *
+pool_cache_get(struct pool *pp)
+{
+ struct pool_cache *pc;
+ struct pool_list *pl;
+ int s;
+
+ pc = cpumem_enter(pp->pr_cache);
+ s = splraise(pp->pr_ipl);
+
+ if (pc->pc_actv != NULL) {
+ pl = pc->pc_actv;
+ } else if (pc->pc_prev != NULL) {
+ pl = pc->pc_prev;
+ pc->pc_prev = NULL;
+ } else if ((pl = pool_list_alloc(pp, pc)) == NULL) {
+ pc->pc_fails++;
+ goto done;
+ }
+
+ pc->pc_actv = pl->pl_next;
+ pc->pc_nactv = pl->pl_nitems - 1;
+ pc->pc_gets++;
+ pc->pc_nout++;
+done:
+ splx(s);
+ cpumem_leave(pp->pr_cache, pc);
+
+ return (pl);
+}
+
+void
+pool_cache_put(struct pool *pp, void *v)
+{
+ struct pool_cache *pc;
+ struct pool_list *pl = v;
+ unsigned long cache_items = pp->pr_cache_items;
+ unsigned long nitems;
+ int s;
+
+ pc = cpumem_enter(pp->pr_cache);
+ s = splraise(pp->pr_ipl);
+
+ nitems = pc->pc_nactv;
+ if (__predict_false(nitems >= cache_items)) {
+ if (pc->pc_prev != NULL)
+ pool_list_free(pp, pc, pc->pc_prev);
+
+ pc->pc_prev = pc->pc_actv;
+
+ pc->pc_actv = NULL;
+ pc->pc_nactv = 0;
+ nitems = 0;
+ }
+
+ pl->pl_next = pc->pc_actv;
+ pl->pl_nitems = ++nitems;
+
+ pc->pc_actv = pl;
+ pc->pc_nactv = nitems;
+
+ pc->pc_puts++;
+ pc->pc_nout--;
+ splx(s);
+ cpumem_leave(pp->pr_cache, pc);
+}
+
/*
* Grab an item from the pool.
*/
@@ -429,8 +654,13 @@ pool_get(struct pool *pp, int flags)
void *v = NULL;
int slowdown = 0;
- KASSERT(flags & (PR_WAITOK | PR_NOWAIT));
+ if (pp->pr_cache != NULL) {
+ v = pool_cache_get(pp);
+ if (v != NULL)
+ goto good;
+ }
+ KASSERT(flags & (PR_WAITOK | PR_NOWAIT));
mtx_enter(&pp->pr_mtx);
if (pp->pr_nout >= pp->pr_hardlimit) {
@@ -462,6 +692,7 @@ pool_get(struct pool *pp, int flags)
v = mem.v;
}
+good:
if (ISSET(flags, PR_ZERO))
memset(v, 0, pp->pr_size);
@@ -551,7 +782,7 @@ pool_do_get(struct pool *pp, int flags,
MUTEX_ASSERT_LOCKED(&pp->pr_mtx);
if (pp->pr_ipl != -1)
- splassert(pp->pr_ipl);
+ splassertpl(pp->pr_ipl, pp->pr_wchan);
/*
* Account for this item now to avoid races if we need to give up
@@ -641,6 +872,11 @@ pool_put(struct pool *pp, void *v)
panic("%s: NULL item", __func__);
#endif
+ if (pp->pr_cache != NULL && TAILQ_EMPTY(&pp->pr_requests)) {
+ pool_cache_put(pp, v);
+ return;
+ }
+
mtx_enter(&pp->pr_mtx);
if (pp->pr_ipl != -1)
@@ -1345,6 +1581,21 @@ sysctl_dopool(int *name, u_int namelen,
pi.pr_nidle = pp->pr_nidle;
if (pp->pr_ipl != -1)
mtx_leave(&pp->pr_mtx);
+
+ if (pp->pr_cache != NULL) {
+ struct pool_cache *pc;
+ struct cpumem_iter i;
+
+ mtx_enter(&pp->pr_cache_mtx);
+ CPUMEM_FOREACH(pc, &i, pp->pr_cache) {
+ pi.pr_nout += pc->pc_nout;
+ pi.pr_nget += pc->pc_gets; /* XXX */
+ pi.pr_nput += pc->pc_puts; /* XXX */
+ }
+
+ pi.pr_nout += pp->pr_cache_nout;
+ mtx_leave(&pp->pr_cache_mtx);
+ }
rv = sysctl_rdstruct(oldp, oldlenp, NULL, &pi, sizeof(pi));
break;
Index: kern/uipc_mbuf.c
===================================================================
RCS file: /cvs/src/sys/kern/uipc_mbuf.c,v
retrieving revision 1.226
diff -u -p -r1.226 uipc_mbuf.c
--- kern/uipc_mbuf.c 13 Jun 2016 21:24:43 -0000 1.226
+++ kern/uipc_mbuf.c 11 Aug 2016 03:52:12 -0000
@@ -83,6 +83,7 @@
#include <sys/domain.h>
#include <sys/protosw.h>
#include <sys/pool.h>
+#include <sys/percpu.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
@@ -99,9 +100,11 @@
#include <net/pfvar.h>
#endif /* NPF > 0 */
-struct mbstat mbstat; /* mbuf stats */
-struct mutex mbstatmtx = MUTEX_INITIALIZER(IPL_NET);
-struct pool mbpool; /* mbuf pool */
+/* mbuf stats */
+COUNTERS_BOOT_MEMORY(mbstat_boot, MBSTAT_COUNT);
+struct cpumem *mbstat = COUNTERS_BOOT_INITIALIZER(mbstat_boot);
+/* mbuf pools */
+struct pool mbpool;
struct pool mtagpool;
/* mbuf cluster pools */
@@ -133,8 +136,8 @@ void m_zero(struct mbuf *);
static void (*mextfree_fns[4])(caddr_t, u_int, void *);
static u_int num_extfree_fns;
-const char *mclpool_warnmsg =
- "WARNING: mclpools limit reached; increase kern.maxclusters";
+const char *mbufpl_warnmsg =
+ "WARNING: mbuf limit reached; increase kern.maxclusters";
/*
* Initialize the mbuf allocator.
@@ -167,7 +170,6 @@ mbinit(void)
mclnames[i], NULL);
pool_setipl(&mclpools[i], IPL_NET);
pool_set_constraints(&mclpools[i], &kp_dma_contig);
- pool_setlowat(&mclpools[i], mcllowat);
}
(void)mextfree_register(m_extfree_pool);
@@ -177,27 +179,22 @@ mbinit(void)
}
void
-nmbclust_update(void)
+mbcache(void)
{
int i;
- /*
- * Set the hard limit on the mclpools to the number of
- * mbuf clusters the kernel is to support. Log the limit
- * reached message max once a minute.
- */
- for (i = 0; i < nitems(mclsizes); i++) {
- (void)pool_sethardlimit(&mclpools[i], nmbclust,
- mclpool_warnmsg, 60);
- /*
- * XXX this needs to be reconsidered.
- * Setting the high water mark to nmbclust is too high
- * but we need to have enough spare buffers around so that
- * allocations in interrupt context don't fail or mclgeti()
- * drivers may end up with empty rings.
- */
- pool_sethiwat(&mclpools[i], nmbclust);
- }
- pool_sethiwat(&mbpool, nmbclust);
+
+ mbstat = counters_realloc(mbstat, MBSTAT_COUNT, M_DEVBUF);
+
+ pool_cache_init(&mbpool);
+ pool_cache_init(&mtagpool);
+ for (i = 0; i < nitems(mclsizes); i++)
+ pool_cache_init(&mclpools[i]);
+}
+
+void
+nmbclust_update(void)
+{
+ (void)pool_sethardlimit(&mbpool, nmbclust, mbufpl_warnmsg, 60);
}
/*
@@ -207,14 +204,21 @@ struct mbuf *
m_get(int nowait, int type)
{
struct mbuf *m;
+ struct counters_ref cr;
+ uint64_t *counters;
+ int s;
+
+ KDASSERT(type < MT_NTYPES);
m = pool_get(&mbpool, nowait == M_WAIT ? PR_WAITOK : PR_NOWAIT);
if (m == NULL)
return (NULL);
- mtx_enter(&mbstatmtx);
- mbstat.m_mtypes[type]++;
- mtx_leave(&mbstatmtx);
+ s = splnet();
+ counters = counters_enter(&cr, mbstat);
+ counters[type]++;
+ counters_leave(&cr, mbstat);
+ splx(s);
m->m_type = type;
m->m_next = NULL;
@@ -233,14 +237,21 @@ struct mbuf *
m_gethdr(int nowait, int type)
{
struct mbuf *m;
+ struct counters_ref cr;
+ uint64_t *counters;
+ int s;
+
+ KDASSERT(type < MT_NTYPES);
m = pool_get(&mbpool, nowait == M_WAIT ? PR_WAITOK : PR_NOWAIT);
if (m == NULL)
return (NULL);
- mtx_enter(&mbstatmtx);
- mbstat.m_mtypes[type]++;
- mtx_leave(&mbstatmtx);
+ s = splnet();
+ counters = counters_enter(&cr, mbstat);
+ counters[type]++;
+ counters_leave(&cr, mbstat);
+ splx(s);
m->m_type = type;
@@ -352,13 +363,18 @@ struct mbuf *
m_free(struct mbuf *m)
{
struct mbuf *n;
+ struct counters_ref cr;
+ uint64_t *counters;
+ int s;
if (m == NULL)
return (NULL);
- mtx_enter(&mbstatmtx);
- mbstat.m_mtypes[m->m_type]--;
- mtx_leave(&mbstatmtx);
+ s = splnet();
+ counters = counters_enter(&cr, mbstat);
+ counters[m->m_type]--;
+ counters_leave(&cr, mbstat);
+ splx(s);
n = m->m_next;
if (m->m_flags & M_ZEROIZE) {
Index: sys/mbuf.h
===================================================================
RCS file: /cvs/src/sys/sys/mbuf.h,v
retrieving revision 1.216
diff -u -p -r1.216 mbuf.h
--- sys/mbuf.h 19 Jul 2016 08:13:45 -0000 1.216
+++ sys/mbuf.h 11 Aug 2016 03:52:12 -0000
@@ -236,6 +236,7 @@ struct mbuf {
#define MT_FTABLE 5 /* fragment reassembly header */
#define MT_CONTROL 6 /* extra-data protocol message */
#define MT_OOBDATA 7 /* expedited data */
+#define MT_NTYPES 8
/* flowid field */
#define M_FLOWID_VALID 0x8000 /* is the flowid set */
@@ -397,6 +398,12 @@ struct mbstat {
u_short m_mtypes[256]; /* type specific mbuf allocations */
};
+#define MBSTAT_TYPES MT_NTYPES
+#define MBSTAT_DROPS (MBSTAT_TYPES + 0)
+#define MBSTAT_WAIT (MBSTAT_TYPES + 1)
+#define MBSTAT_DRAIN (MBSTAT_TYPES + 2)
+#define MBSTAT_COUNT (MBSTAT_TYPES + 3)
+
#include <sys/mutex.h>
struct mbuf_list {
@@ -414,7 +421,6 @@ struct mbuf_queue {
#ifdef _KERNEL
-extern struct mbstat mbstat;
extern int nmbclust; /* limit on the # of clusters */
extern int mblowat; /* mbuf low water mark */
extern int mcllowat; /* mbuf cluster low water mark */
@@ -423,6 +429,7 @@ extern int max_protohdr; /* largest pro
extern int max_hdr; /* largest link+protocol header */
void mbinit(void);
+void mbcache(void);
struct mbuf *m_copym2(struct mbuf *, int, int, int);
struct mbuf *m_copym(struct mbuf *, int, int, int);
struct mbuf *m_free(struct mbuf *);
Index: sys/pool.h
===================================================================
RCS file: /cvs/src/sys/sys/pool.h,v
retrieving revision 1.59
diff -u -p -r1.59 pool.h
--- sys/pool.h 21 Apr 2016 04:09:28 -0000 1.59
+++ sys/pool.h 11 Aug 2016 03:52:12 -0000
@@ -84,6 +84,9 @@ struct pool_allocator {
TAILQ_HEAD(pool_pagelist, pool_item_header);
+struct pool_list;
+struct cpumem;
+
struct pool {
struct mutex pr_mtx;
SIMPLEQ_ENTRY(pool)
@@ -118,12 +121,23 @@ struct pool {
#define PR_LIMITFAIL 0x0004 /* M_CANFAIL */
#define PR_ZERO 0x0008 /* M_ZERO */
#define PR_WANTED 0x0100
+#define PR_CPUCACHE 0x0200
int pr_ipl;
RB_HEAD(phtree, pool_item_header)
pr_phtree;
+ struct cpumem * pr_cache;
+ struct mutex pr_cache_mtx;
+ struct pool_list *
+ pr_cache_list;
+ u_int pr_cache_nlist;
+ u_int pr_cache_items;
+ u_int pr_cache_contention;
+ u_int pr_cache_contention_prev;
+ int pr_cache_nout;
+
u_int pr_align;
u_int pr_maxcolors; /* Cache coloring */
int pr_phoffset; /* Offset in page of page header */
@@ -175,6 +189,7 @@ struct pool_request {
void pool_init(struct pool *, size_t, u_int, u_int, int,
const char *, struct pool_allocator *);
+void pool_cache_init(struct pool *);
void pool_destroy(struct pool *);
void pool_setipl(struct pool *, int);
void pool_setlowat(struct pool *, int);
Index: sys/srp.h
===================================================================
RCS file: /cvs/src/sys/sys/srp.h,v
retrieving revision 1.11
diff -u -p -r1.11 srp.h
--- sys/srp.h 7 Jun 2016 07:53:33 -0000 1.11
+++ sys/srp.h 11 Aug 2016 03:52:12 -0000
@@ -21,10 +21,12 @@
#include <sys/refcnt.h>
+#ifndef __upunused
#ifdef MULTIPROCESSOR
#define __upunused
#else
#define __upunused __attribute__((__unused__))
+#endif
#endif
struct srp {