On Sat, 2007-10-27 at 12:16 -0700, David Rientjes wrote:
> On Fri, 26 Oct 2007, David Rientjes wrote:
>
> > Hacking and requiring an updated version of libnuma to allow empty
> > nodemasks to be passed is a poor solution; if mempolicy's are supposed to
> > be independent from cpusets, then what semantics does an empty nodemask
> > actually imply when using MPOL_INTERLEAVE? To me, it means the entire
> > set_mempolicy() should be a no-op, and that's exactly how mainline
> > currently treats it _as_well_ as libnuma. So justifying this change in
> > the man page is respectible, but passing an empty nodemask just doesn't
> > make sense.
> >
>
> Another reason that passing an empty nodemask to set_mempolicy() doesn't
> make sense is that libnuma uses numa_set_interleave_mask(&numa_no_nodes)
> to disable interleaving completely.
>

David: as we discussed when you contacted me off-list about this, the
libnuma API and the system call interface are two quite different APIs.
For example, numa_set_interleave_mask(&numa_no_nodes) does not pass
MPOL_INTERLEAVE with an empty mask to set_mempolicy(). Rather, it
"installs" an MPOL_DEFAULT policy, which internally just deletes the
task's mempolicy, allowing fallback to system default policy. I would not
propose to change this behavior, nor break libnuma in any way.

For others who weren't involved in the off-list exchange, here's an
excerpt from my response to David:

[
At the libnuma level, I think we need an explicit
"numa_set_interleave_allowed()"--analogous to "numa_set_localalloc()".

The current "numa_alloc_interleaved()" should, I think, allocate on all
*allowed* nodes, rather than all nodes. It can do this using the sys
call interface as defined.

Independent of cpuset-independent interleave, an application needs to
pass a valid subset of the current mems allowed to
"numa_alloc_interleaved_subset()". An application can now obtain the
mems_allowed using the MPOL_F_MEMS_ALLOWED flag that I added, but we need
a libnuma wrapper for this as well. [Yeah, this info can change at any
time, but that's always been the case....]

"numa_interleave_memory()" is essentially mbind(), I think [not looking
at the libnuma source code at this moment]. Maybe provide
"numa_interleave_memory_allowed(void *mem, size_t size)" ???

Finally, I think we need to add a query function:
"nodemask_t numa_get_mems_allowed()" to return the mask of valid nodes in
the current context [cpuset]. This would just be a wrapper around
get_mempolicy() with the MPOL_F_MEMS_ALLOWED flag.
]

Couple of comments on the above:

1. "the sys call interface as defined" in the 2nd paragraph of the
   excerpt refers to my patch that uses null/empty nodemask to indicate
   "all allowed".

2. As this thread progresses, you've discussed relaxing the requirement
   that applications pass a valid subset of mems_allowed.
   I.e., something that was illegal becomes legal. An API change, I
   think. But, a backward compatible one, so that's OK, right? :-)

3. If we do change the semantics of the mempolicy system calls to allow
   nodes outside of the cpuset, then maybe we don't need to query the
   mems allowed. I still find it useful, but not absolutely
   necessary--e.g., to construct a nodemask that will be acceptable in
   the current cpuset.

4. I looked at libnuma source. numa_interleave_memory() does use mbind()
   which, again, does not complain about nodemasks that include
   non-allowed nodes.

Another thing occurs to me: perhaps numactl would need an additional
'nodes' specifier such as 'allowed'. Alternatively, 'all' could be
redefined to mean 'all allowed'. This is independent of how you specify
'all allowed' to the system call.

Regards,
Lee
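
For readers following along, the nodemask rules discussed above can be
modelled in a few lines of C. This is only a userspace sketch, not kernel
or libnuma code. It assumes a node mask fits in one unsigned long, and the
names nodemask_bits, nodes_is_subset(), interleave_nodes_for() and
clamp_to_allowed() are hypothetical, invented for illustration. A real
program would presumably obtain the allowed mask from get_mempolicy() with
the MPOL_F_MEMS_ALLOWED flag rather than pass it in by hand.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical simplification: one unsigned long holds the node mask,
 * bit N set means node N. Real node masks can be wider than this. */
typedef unsigned long nodemask_bits;

/* True if every node in 'mask' is also present in 'allowed'. */
static inline bool nodes_is_subset(nodemask_bits mask, nodemask_bits allowed)
{
    return (mask & ~allowed) == 0;
}

/*
 * Model of the interleave semantics proposed in this thread:
 *   - empty request -> interleave over all allowed nodes
 *   - valid subset  -> use the request unchanged
 *   - anything else -> rejected with -EINVAL
 * Returns 0 and stores the result in *effective, or returns -EINVAL
 * and leaves *effective untouched.
 */
static inline int interleave_nodes_for(nodemask_bits requested,
                                       nodemask_bits allowed,
                                       nodemask_bits *effective)
{
    if (requested == 0) {
        *effective = allowed;
        return 0;
    }
    if (!nodes_is_subset(requested, allowed))
        return -EINVAL;
    *effective = requested;
    return 0;
}

/* Build a mask acceptable in the current cpuset by dropping any
 * node that is not in the allowed set. */
static inline nodemask_bits clamp_to_allowed(nodemask_bits wanted,
                                             nodemask_bits allowed)
{
    return wanted & allowed;
}
```

With an allowed set of nodes 1 and 2 (0x6), an empty request yields 0x6, a
request for node 1 alone (0x2) is used as given, and a request for nodes 0
and 3 (0x9) is rejected. If the subset requirement were relaxed as
discussed in comment 2, the -EINVAL branch would go away and something
like clamp_to_allowed() would take its place.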