Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> On Fri, Jan 27, 2006 at 03:08:47PM -0800, Andrew Morton wrote:
> > Andrew Morton <[EMAIL PROTECTED]> wrote:
> > >
> > > Oh, and because vm_acct_memory() is counting a singleton object, it can
> > > use DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of
> > > kmalloc overhead.
> >
> > Actually, I don't think that's true.  We're allocating a sizeof(long) with
> > kmalloc_node() so there shouldn't be memory wastage.
>
> Oh yeah there is.  Each dynamic per-cpu object would have been at least
> (NR_CPUS * sizeof(void *) + num_cpus_possible * cacheline_size).
> Now kmalloc_node will fall back on size-32 for the allocation of a long, so
> replace the cacheline_size above with 32 -- which then means dynamic per-cpu
> data are not on a cacheline boundary anymore (most modern CPUs have
> 64-byte/128-byte cache lines), which means per-cpu data could end up false
> shared....
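To put numbers on the overhead being described, here is a rough, purely
illustrative sketch -- struct percpu_sketch and alloc_percpu_long_sketch()
are made-up names, not the real mm/slab.c code -- of what one dynamic
per-cpu allocation of a long costs under that scheme:

#include <linux/slab.h>
#include <linux/topology.h>

struct percpu_sketch {
	void *ptrs[NR_CPUS];	/* NR_CPUS * sizeof(void *) of bookkeeping */
};

/* error handling of the per-CPU allocations omitted for brevity */
static struct percpu_sketch *alloc_percpu_long_sketch(void)
{
	struct percpu_sketch *p = kzalloc(sizeof(*p), GFP_KERNEL);
	int cpu;

	if (!p)
		return NULL;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpu_possible(cpu))
			continue;
		/*
		 * A sizeof(long) request falls into the size-32 slab, so
		 * each per-CPU copy costs 32 bytes and is not guaranteed
		 * to start on a 64/128-byte cacheline boundary.
		 */
		p->ptrs[cpu] = kmalloc_node(sizeof(long), GFP_KERNEL,
					    cpu_to_node(cpu));
	}
	return p;
}

That works out to NR_CPUS * sizeof(void *) for the pointer array plus
num_cpus_possible * 32 for the data itself -- the figure quoted above --
and none of those 32-byte objects is guaranteed a cache line to itself.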
OK.  But isn't the core of the problem the fact that __alloc_percpu() is
using kmalloc_node() rather than a (new, as-yet-unimplemented)
kmalloc_cpu()?  kmalloc_cpu() wouldn't need the L1 cache alignment.

It might be worth creating just a small number of per-cpu slabs (4-byte,
8-byte).  A kmalloc_cpu() would just need a per-cpu array of kmem_cache_t*'s,
and internally it'd allocate with kmalloc_node(..., cpu_to_node(cpu)), no?

Or we could just give __alloc_percpu() a custom, hand-rolled,
not-cacheline-padded sizeof(long) slab per CPU and use that if
(size == sizeof(long)).

Or something.
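Something along these lines, say -- just a sketch of the kmalloc_cpu() idea,
with made-up names (kmalloc_cpu_cachep, kmalloc_cpu_init), no error handling
and no CPU-hotplug awareness; a real version would also want a distinct name
per cache, or one cache per node rather than per CPU:

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/topology.h>

static kmem_cache_t *kmalloc_cpu_cachep[NR_CPUS];

static int __init kmalloc_cpu_init(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpu_possible(cpu))
			continue;
		/*
		 * A tiny slab, align=0 and no SLAB_HWCACHE_ALIGN: objects in
		 * this cache are only ever touched by one CPU, so there is
		 * no false sharing to pad against.
		 */
		kmalloc_cpu_cachep[cpu] = kmem_cache_create("kmalloc-cpu",
					sizeof(long), 0, 0, NULL, NULL);
	}
	return 0;
}

static void *kmalloc_cpu(gfp_t flags, int cpu)
{
	/* back the object with memory from the CPU's home node */
	return kmem_cache_alloc_node(kmalloc_cpu_cachep[cpu], flags,
				     cpu_to_node(cpu));
}

The point is just that because a given slab page is only ever used by one
CPU, the objects don't need to be padded out to a cacheline at all, which is
the padding (or, after the size-32 fallback, the lack of it) that the generic
kmalloc_node() path gets wrong for per-cpu data.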