Issue: Are the nodes and nodemasks passed into set_mempolicy() to be presumed relative to the cpuset or not? [Careful, this question doesn't mean what you might think it means.]
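For context, and not part of the argument itself, the interface in question looks roughly like this (prototype as in numaif.h, quoting from memory):

	#include <numaif.h>

	/*
	 * mode is one of MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND or
	 * MPOL_INTERLEAVE; nodemask is a bit mask of node numbers and
	 * maxnode is the number of bits in that mask.  The question in
	 * this mail is how those node numbers should be interpreted for
	 * a task sitting in a cpuset.
	 */
	long set_mempolicy(int mode, const unsigned long *nodemask,
			   unsigned long maxnode);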
Let's say our system has 100 nodes, numbered 0-99, and we have a task in a cpuset that includes the twenty nodes 10-29 at the moment. Currently, if that task does say an MPOL_PREFERRED on node 12, we take that to mean the 3rd node of its cpuset. If we move that task to a cpuset on nodes 40-59, the kernel will change that MPOL_PREFERRED to node 42. Similarly for the other MPOL_* policies.

OK so far ... seems reasonable. Node numbers passed into the set_mempolicy() call are taken to be absolute node numbers that are interpreted relative to the task's current cpuset, perhaps unbeknownst to the calling task, and remapped if that cpuset changes.

But now imagine that a task happens to be in a cpuset of just two nodes, and wants to request an MPOL_PREFERRED policy for the fourth node of its cpuset, anytime there actually is a fourth node. That task can't say that using numbering relative to its current cpuset, because that cpuset only has two nodes. It could say it relative to a mask of all possible nodes by asking for the fourth possible node, likely numbered node 3.

If that task happened to be in a cpuset on nodes 10 and 11, asking for the fourth node in the system (node 3) would still be rather unambiguous, as node 3 can't be either of 10 or 11, so it must be relative to all possible nodes, meaning "the fourth available node, if I'm ever fortunate enough to have that many nodes."

But if that task happened to be in a cpuset on nodes 2 and 3, then the node number 3 could mean:

	Choice A: as it does today, the second node in the task's cpuset, or
	Choice B: the fourth node in the system, if available, just as it
		  did in the case above involving a cpuset on nodes 10 and 11.

Let me restate this. Either way, passing in node 3 means node 3, as numbered in the system. But the question is whether (Choice A) node 3 is specified because it is the second node in the task's cpuset, or (Choice B) because it is the fourth node in the system.

Choice A is what we do now. But if we stay with Choice A, then a task stuck in a small cpuset at the moment can't express non-trivial mempolicies for larger cpusets that it might be in later.

Choice B lets the task calculate its mempolicy mask as if it owned the entire system, and express whatever elaborate mempolicy placement it might need, when blessed with enough memory nodes to matter. The system would automatically scrunch that request down to whatever is the current size and placement of the cpuset holding that task.

Given a clean slate, I prefer Choice B. But Choice B is incompatible. Switching now would break tasks that have been carefully adapting their set_mempolicy() requests to whatever nodes were in their current cpuset. This is probably too incompatible to be acceptable. Therefore it must be Choice A.

However ... if I approach this from another angle, I can show it should be Choice B. Fasten your seatbelt ...

Before the days of cpusets, Choice B was essentially how it was. Tasks computing memory policies for set_mempolicy() calls computed node numbers as if they owned the entire system. Essentially, cpusets introduced an incompatibility, imposing Choice A instead. If a task wants to name the fourth node allowed to it in a memory policy, it can no longer just say node "3"; it now has to determine its cpuset and count off the fourth node that is currently allowed to it. This is an inherently racy calculation, of the sort that some of us would find unacceptable, because it can't cope very easily with simultaneously changing cpusets.
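To make that racy count-off concrete, here is a minimal user-level sketch (not code from any real app; get_allowed_nodes() is a hypothetical helper -- a real program might parse the Mems_allowed line of /proc/self/status). The point is that under Choice A, a task that thinks in terms of "my Nth allowed node" has to read its current cpuset and do the counting itself, and nothing stops the cpuset from changing between that read and the set_mempolicy() call:

	#include <numaif.h>		/* set_mempolicy(), MPOL_PREFERRED */

	#define MAX_NODES	128	/* assumed upper bound for this sketch */
	#define BITS_PER_LONG	(8 * sizeof(unsigned long))
	#define LONGS_NEEDED	((MAX_NODES + BITS_PER_LONG - 1) / BITS_PER_LONG)

	/* Hypothetical: fill 'mask' with the caller's currently allowed nodes. */
	extern int get_allowed_nodes(unsigned long *mask, int maxnode);

	int prefer_nth_allowed_node(int n)
	{
		unsigned long allowed[LONGS_NEEDED] = { 0 };
		unsigned long pref[LONGS_NEEDED] = { 0 };
		int node, count = 0;

		if (get_allowed_nodes(allowed, MAX_NODES) < 0)
			return -1;

		/* Count off the n-th set bit -- racy against cpuset changes. */
		for (node = 0; node < MAX_NODES; node++) {
			if (!(allowed[node / BITS_PER_LONG] &
			      (1UL << (node % BITS_PER_LONG))))
				continue;
			if (++count == n) {
				pref[node / BITS_PER_LONG] |=
					1UL << (node % BITS_PER_LONG);
				return set_mempolicy(MPOL_PREFERRED, pref,
						     MAX_NODES);
			}
		}
		return -1;	/* fewer than n nodes currently allowed */
	}

Under Choice B that whole loop would go away: the task would just set bit n-1 in a mask sized for the whole system and let the kernel scrunch it down to whatever cpuset it happens to be in.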
My hunch (unsupported by any real evidence or experience) is that there is very little user level code that actually depends on this incompatible change imposed by cpusets. I'm guessing that most codes making heavy use of memory policies are still written as if the task owns the system, and would be ill prepared to cope with a heavy cpuset environment. If that's the case, we'd break less user code by going with Choice B.

I have a little bit of code that will notice the difference (so if we go with Choice B, there has to be a way for user level code that cares to probe which choice applies), but I'm not a major user of mempolicy calls. I'll have to rely on the experience of others more involved with memory-policy-aware user code as to which Choice would be less disruptive.

Recommendations?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401