On Fri, 2007-10-26 at 14:37 -0700, Christoph Lameter wrote:
> On Fri, 26 Oct 2007, Lee Schermerhorn wrote:
>
> > > > Now, if we could replace the 'cpuset_mems_allowed' nodemask with a
> > > > pointer to something stable, it might be a win.
> > >
> > > The memory policies are already shared and have refcounters for that
> > > purpose.
> >
> > I must have missed that in the code I'm reading :)
>
> What is the benefit of having pointers to nodemasks?  We likely would need
> to have refcounts in those nodemasks too?  So we duplicate a lot of
> the characteristics of memory policies?
Hi, Christoph:

Removing the nodemasks from the mempolicy and allocating them only when
needed is something that you and Mel and I discussed last month, in the
context of Mel's "one zonelist filtered by nodemask" patches.  I just
put together the dynamic nodemask patch [included below FYI, NOT for
serious consideration] to see what it looked like and whether it helped.
Conclusion:  it's ugly/complex [especially trying to keep the nodemasks
embedded for systems that don't require more than a pointer's worth of
bits], and it probably doesn't help much if most uses of non-default
mempolicy require a nodemask.  I only brought it up again because now
you all are considering another nodemask per policy.

In fact, I only considered it in the first place because nodemasks on
our [HP's] platform don't require more than a pointer's worth of bits
[today, at least--I don't know about future plans].  However, since we
share an arch--ia64--with SGI, and distros don't want to support special
kernels for different vendors if they can avoid it, we have 1K-bit
nodemasks.  Since this is ia64 we're talking about, most folks don't
care.  Now that you're going to do the same for x86_64, it might become
more visible.  Then again, maybe there are few enough mempolicy structs
that no one will care anyway.

Note:  I don't [didn't] think I need to ref count the nodemasks
associated with the mempolicies because they are allocated when the
mempolicy is and destroyed when the policy is--not shared.  Just like
the custom zonelist for bind policy, and we have no ref count there.
I.e., they're protected by the mempolicy's ref count.  However, now that
you bring it up, I'm wondering about the effects of policy remapping,
and whether we have the reference counting or indirect protection
[mmap_sem, whatever] correct there in current code.  I'll have to take
a look.

Lee
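P.S.  To put numbers on "more than a pointer's worth of bits", here's a
minimal userspace sketch [illustration only, NOT part of the patch]
that mirrors the kernel's bitmap-based nodemask_t.  MAX_NUMNODES=1024
is an assumption matching the 1K-bit distro config mentioned above:

#include <stdio.h>

#define MAX_NUMNODES	1024	/* assumed: 1 << NODES_SHIFT, NODES_SHIFT=10 */
#define BITS_PER_LONG	(8 * sizeof(unsigned long))

typedef struct {	/* mirrors the kernel's DECLARE_BITMAP layout */
	unsigned long bits[(MAX_NUMNODES + BITS_PER_LONG - 1) / BITS_PER_LONG];
} nodemask_t;

int main(void)
{
	/* 128 bytes embedded vs 8 bytes for a pointer on LP64 */
	printf("sizeof(nodemask_t)   = %zu\n", sizeof(nodemask_t));
	printf("sizeof(nodemask_t *) = %zu\n", sizeof(nodemask_t *));
	return 0;
}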
PATCH/RFC Memory Policy: dynamically allocate policy nodemasks

Something that Christoph Lameter, Mel Gorman and I discussed.

Once Mel's "one zonelist" patches go in, we no longer have the MPOL_BIND
custom zonelist in the mempolicy struct.  However, we still have 2
nodemask_t's that can get quite large:  one in the union with the
preferred_node, and one for the cpuset current mems allowed.  Note that
this 2nd nodemask_t does NOT depend on whether or not cpusets are
configured.

So, on an ia64 platform with NODES_SHIFT configured at 8 [256 nodes],
this results in a nodemask_t size of 32 bytes and a mempolicy size of
72 bytes.  Real soon now, we'll be seeing x86_64 platforms with hundreds
of nodes.  Indirect/dynamically allocated policy nodemasks can reduce
this size, for policies that don't need them, to:  TODO

[Some distros ship with NODES_SHIFT=10 => 128 byte nodemasks.]

However, on platforms and configurations where
BITS_PER_LONG >= (1 << NODES_SHIFT), indirect policy nodemasks are
unnecessary overhead, as the maximum number of nodes will fit in a
pointer's worth of data.  So, this patch implements indirect,
dynamically allocated nodemasks for memory policies only when
BITS_PER_LONG < (1 << NODES_SHIFT).

etc...

Signed-off-by: Lee Schermerhorn <[EMAIL PROTECTED]>

 include/linux/mempolicy.h |   20 +++++-
 mm/mempolicy.c            |  134 ++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 135 insertions(+), 19 deletions(-)
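[Aside, before the diff itself:  here's how the configuration test
evaluates for a range of NODES_SHIFT values--a worked example in
userspace, NOT part of the patch, assuming a 64-bit kernel:

#include <stdio.h>

#define BITS_PER_LONG	64	/* assumed: 64-bit kernel */

/* the same test the patch uses to choose the representation */
#define INDIRECT(nodes_shift)	(BITS_PER_LONG < (1 << (nodes_shift)))

int main(void)
{
	int shift;

	for (shift = 4; shift <= 10; shift++)
		printf("NODES_SHIFT=%2d -> %4d nodes -> %s nodemask\n",
		       shift, 1 << shift,
		       INDIRECT(shift) ? "indirect" : "embedded");
	return 0;
}

With BITS_PER_LONG=64, everything up through NODES_SHIFT=6 [64 nodes]
stays embedded; NODES_SHIFT=7 and up goes indirect.]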
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h	2007-10-02 14:11:20.000000000 -0400
+++ Linux/include/linux/mempolicy.h	2007-10-02 16:07:43.000000000 -0400
@@ -47,6 +47,12 @@ struct mm_struct;
 
 #ifdef CONFIG_NUMA
 
+#if BITS_PER_LONG < (1 << NODES_SHIFT)
+#define INDIRECT_POLICY_NODEMASK	1
+#else
+#define INDIRECT_POLICY_NODEMASK	0
+#endif
+
 /*
  * Describe a memory policy.
  *
@@ -71,11 +77,19 @@ struct mempolicy {
 	atomic_t refcnt;
 	short policy;	/* See MPOL_* above */
 	union {
-		short		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave/bind */
+		short		 preferred_node; /* preferred */
+#if INDIRECT_POLICY_NODEMASK
+		nodemask_t	*nodes;		/* interleave/bind */
+#else
+		nodemask_t	 nodes;		/* interleave/bind */
+#endif
 		/* undefined for default */
 	} v;
-	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
+#if INDIRECT_POLICY_NODEMASK
+	nodemask_t *cpuset_mems_allowed; /* mempolicy relative to these nodes */
+#else
+	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
+#endif
 };
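[Aside:  the mm/mempolicy.c changes below funnel every nodemask access
through four helpers--set_policy_nodemask(), new_policy_nodemask(),
policy_nodemask_ref() and free_policy_nodemask()--each with an indirect
and an embedded variant, so call sites compile unchanged under either
representation.  A hypothetical caller, for illustration only [this
function is NOT in the patch], would look like:

/*
 * Hypothetical example, not part of the patch:  whichever
 * representation is configured, policy_nodemask_ref() yields a
 * nodemask_t * that the generic nodemask ops can consume.
 */
static int policy_includes_node(struct mempolicy *pol, int nid)
{
	return node_isset(nid, *policy_nodemask_ref(&pol->v.nodes));
}
]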
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-10-02 14:12:35.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-10-02 17:18:24.000000000 -0400
@@ -100,6 +100,7 @@
 
 static struct kmem_cache *policy_cache;
 static struct kmem_cache *sn_cache;
+static struct kmem_cache *nodemask_cache;
 
 /* Highest zone. An specific allocation for a zone below that is not
    policied. */
@@ -158,10 +159,68 @@ static int is_valid_nodemask(nodemask_t
 	return 0;
 }
 
+#if INDIRECT_POLICY_NODEMASK
+/*
+ * mempolicy operations for indirect policy nodemasks
+ * operate on pointers to nodemask_t pointers in mempolicy struct
+ */
+static int set_policy_nodemask(nodemask_t **pol_nodes, nodemask_t *nodes)
+{
+	**pol_nodes = *nodes;
+	return 0;
+}
+
+static int new_policy_nodemask(nodemask_t **pol_nodes, nodemask_t *nodes)
+{
+	*pol_nodes = kmem_cache_alloc(nodemask_cache, GFP_KERNEL);
+	if (!*pol_nodes)
+		return -ENOMEM;
+	return set_policy_nodemask(pol_nodes, nodes);
+}
+
+static nodemask_t *policy_nodemask_ref(nodemask_t **pol_nodes)
+{
+	return *pol_nodes;
+}
+
+static void free_policy_nodemask(nodemask_t **pol_nodes)
+{
+	kmem_cache_free(nodemask_cache, *pol_nodes);
+}
+
+#else
+
+/*
+ * mempolicy operations for embedded policy nodemasks
+ * operate on pointers to nodemasks embedded in mempolicy structs
+ */
+static int set_policy_nodemask(nodemask_t *pol_nodes, nodemask_t *nodes)
+{
+	*pol_nodes = *nodes;
+	return 0;
+}
+
+static int new_policy_nodemask(nodemask_t *pol_nodes, nodemask_t *nodes)
+{
+	return set_policy_nodemask(pol_nodes, nodes);
+}
+
+static nodemask_t *policy_nodemask_ref(nodemask_t *pol_nodes)
+{
+	return pol_nodes;
+}
+
+static void free_policy_nodemask(nodemask_t *pol_nodes)
+{
+}
+#endif
+
 /* Create a new policy */
 static struct mempolicy *mpol_new(int mode, nodemask_t *nodes)
 {
 	struct mempolicy *policy;
+	nodemask_t mems_allowed;
+	int ret;
 
 	pr_debug("setting mode %d nodes[0] %lx\n",
 		 mode, nodes ? nodes_addr(*nodes)[0] : -1);
@@ -175,14 +234,18 @@ static struct mempolicy *mpol_new(int mo
 	atomic_set(&policy->refcnt, 1);
 	switch (mode) {
 	case MPOL_INTERLEAVE:
-		policy->v.nodes = *nodes;
+		ret = new_policy_nodemask(&policy->v.nodes, nodes);
+		if (ret)
+			return ERR_PTR(ret);
 		if (nodes_weight(*nodes) == 0) {
 			mode |= MPOL_CONTEXT;
 			break;
 		}
-		nodes_and(policy->v.nodes, policy->v.nodes,
+		nodes_and(*policy_nodemask_ref(&policy->v.nodes),
+			  *policy_nodemask_ref(&policy->v.nodes),
 			  node_states[N_HIGH_MEMORY]);
-		if (nodes_weight(policy->v.nodes) == 0) {
+		if (nodes_weight(*policy_nodemask_ref(&policy->v.nodes)) == 0) {
+			free_policy_nodemask(&policy->v.nodes);
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
 		}
@@ -197,11 +260,21 @@ static struct mempolicy *mpol_new(int mo
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
 		}
-		policy->v.nodes = *nodes;
+		ret = new_policy_nodemask(&policy->v.nodes, nodes);
+		if (ret)
+			return ERR_PTR(ret);
 		break;
 	}
 	policy->policy = mode;
-	policy->cpuset_mems_allowed = cpuset_mems_allowed(current);
+
+//TODO:	I ought to be able to figure out how to reference the
+//	mems_allowed [current task's or node_possible_map] w/o
+//	allocating a new copy.  Need more helper function?
+	mems_allowed = cpuset_mems_allowed(current);
+	ret = new_policy_nodemask(&policy->cpuset_mems_allowed,
+				  &mems_allowed);
+	if (ret)
+		return ERR_PTR(ret);
 	return policy;
 }
@@ -464,7 +537,7 @@ static nodemask_t *get_interleave_nodes(
 	if (unlikely(p->policy & MPOL_CONTEXT))
 		return &cpuset_current_mems_allowed;
 
-	return &p->v.nodes;
+	return policy_nodemask_ref(&p->v.nodes);
 }
 
 /* Set the process memory policy */
@@ -1135,8 +1208,8 @@ static inline nodemask_t *nodemask_polic
 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
 	if (unlikely(policy->policy == MPOL_BIND &&
 		     gfp_zone(gfp) >= policy_zone &&
-		     cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)))
-		return &policy->v.nodes;
+		     cpuset_nodemask_valid_mems_allowed(policy_nodemask_ref(&policy->v.nodes))))
+		return policy_nodemask_ref(&policy->v.nodes);
 
 	return NULL;
 }
@@ -1200,8 +1273,9 @@ unsigned slab_node(struct mempolicy *pol
 		struct zoneref *z;
 		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
 		zonelist = &NODE_DATA(numa_node_id())->node_zonelist;
-		z = first_zones_zonelist(zonelist, &policy->v.nodes,
-					 highest_zoneidx);
+		z = first_zones_zonelist(zonelist,
+					 policy_nodemask_ref(&policy->v.nodes),
+					 highest_zoneidx);
 		return zonelist_node_idx(z);
 	}
@@ -1421,7 +1495,25 @@ struct mempolicy *__mpol_copy(struct mem
 		nodemask_t mems = cpuset_mems_allowed(current);
 		mpol_rebind_policy(old, &mems);
 	}
-	*new = *old;
+
+	/*
+	 * need to copy members explicitly to handle indirect vs
+	 * embedded nodemasks
+	 */
+	new->policy = old->policy;
+	switch (policy_mode(old)) {
+	case MPOL_PREFERRED:
+		new->v.preferred_node = old->v.preferred_node;
+		break;
+
+	case MPOL_BIND:
+		/*FALL THROUGH*/
+	case MPOL_INTERLEAVE:
+		new_policy_nodemask(&new->v.nodes,
+				    policy_nodemask_ref(&old->v.nodes));
+	}
+	new_policy_nodemask(&new->cpuset_mems_allowed,
+			    policy_nodemask_ref(&old->cpuset_mems_allowed));
 	atomic_set(&new->refcnt, 1);
 	return new;
 }
@@ -1441,7 +1533,9 @@ int __mpol_equal(struct mempolicy *a, st
 		 * works for MPOL_BIND as it shouldn't have MPOL_CONTEXT set
 		 */
 		return a->policy & MPOL_CONTEXT ||
-			nodes_equal(a->v.nodes, b->v.nodes);
+			nodes_equal(
+				*policy_nodemask_ref(&a->v.nodes),
+				*policy_nodemask_ref(&b->v.nodes));
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
 	default:
@@ -1455,6 +1549,8 @@ void __mpol_free(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
+	free_policy_nodemask(&p->v.nodes);
+	free_policy_nodemask(&p->cpuset_mems_allowed);
 	kmem_cache_free(policy_cache, p);
 }
@@ -1646,7 +1742,8 @@ int mpol_set_shared_policy(struct shared
 	pr_debug("set_shared_policy %lx sz %lu %d %lx\n",
 		 vma->vm_pgoff,
 		 sz, npol? npol->policy : -1,
-		 npol ? nodes_addr(npol->v.nodes)[0] : -1);
+		 npol ? nodes_addr(*policy_nodemask_ref(&npol->v.nodes))[0] :
+			-1);
 
 	if (npol) {
 		new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
@@ -1694,6 +1791,10 @@ void __init numa_policy_init(void)
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
 
+	if (INDIRECT_POLICY_NODEMASK)
+		nodemask_cache = kmem_cache_create("nodemask",
+						   sizeof(nodemask_t),
+						   0, SLAB_PANIC, NULL);
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or
@@ -1738,7 +1839,7 @@ static void mpol_rebind_policy(struct me
 	if (!pol)
 		return;
 
-	mpolmask = &pol->cpuset_mems_allowed;
+	mpolmask = policy_nodemask_ref(&pol->cpuset_mems_allowed);
 	if (nodes_equal(*mpolmask, *newmask))
 		return;
 
@@ -1746,8 +1847,9 @@ static void mpol_rebind_policy(struct me
 	case MPOL_BIND:
 		/* Fall through */
 	case MPOL_INTERLEAVE:
-		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
-		pol->v.nodes = tmp;
+		nodes_remap(tmp, *policy_nodemask_ref(&pol->v.nodes),
+			    *mpolmask, *newmask);
+		set_policy_nodemask(&pol->v.nodes, &tmp);
 		*mpolmask = *newmask;
 		current->il_next = node_remap(current->il_next,
 					      *mpolmask, *newmask);