Issue: Are the nodes and nodemasks passed into set_mempolicy() to be presumed relative to the cpuset or not? [Careful, this question doesn't mean what you might think it means.]
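For context, and not part of the argument itself, the interface in question looks roughly like this (prototype as in numaif.h, quoting from memory):

	#include <numaif.h>

	/*
	 * mode is one of MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND or
	 * MPOL_INTERLEAVE; nodemask is a bit mask of node numbers and
	 * maxnode is the number of bits in that mask.  The question in
	 * this mail is how those node numbers should be interpreted for
	 * a task sitting in a cpuset.
	 */
	long set_mempolicy(int mode, const unsigned long *nodemask,
			   unsigned long maxnode);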
Let's say our system has 100 nodes, numbered 0-99, and we have a task in a cpuset that includes the twenty nodes 10-29 at the moment. Currently, if that task does say an MPOL_PREFERRED on node 12, we take that to mean the 3rd node of its cpuset. If we move that task to a cpuset on nodes 40-59, the kernel will change that MPOL_PREFERRED to node 42. Similarly for the other MPOL_* policies.

OK so far ... seems reasonable. Node numbers passed into the set_mempolicy() call are taken to be absolute node numbers that are interpreted relative to the task's current cpuset, perhaps unbeknownst to the calling task, and remapped if that cpuset changes.

But now imagine that a task happens to be in a cpuset of just two nodes, and wants to request an MPOL_PREFERRED policy for the fourth node of its cpuset, anytime there actually is a fourth node. That task can't say that using numbering relative to its current cpuset, because that cpuset only has two nodes. It could say it relative to a mask of all possible nodes by asking for the fourth possible node, likely numbered node 3.

If that task happened to be in a cpuset on nodes 10 and 11, asking for the fourth node in the system (node 3) would still be rather unambiguous, as node 3 can't be either of 10 or 11, so it must be relative to all possible nodes, meaning "the fourth available node, if I'm ever fortunate enough to have that many nodes."

But if that task happened to be in a cpuset on nodes 2 and 3, then the node number 3 could mean:

	Choice A: as it does today, the second node in the task's cpuset, or
	Choice B: the fourth node in the system, if available, just as it
		  did in the case above involving a cpuset on nodes 10 and 11.

Let me restate this. Either way, passing in node 3 means node 3, as numbered in the system. But the question is whether (Choice A) node 3 is specified because it is the second node in the task's cpuset, or (Choice B) because it is the fourth node in the system.

Choice A is what we do now. But if we stay with Choice A, then a task stuck in a small cpuset at the moment can't express non-trivial mempolicies for larger cpusets that it might be in later.

Choice B lets the task calculate its mempolicy mask as if it owned the entire system, and express whatever elaborate mempolicy placement it might need, when blessed with enough memory nodes to matter. The system would automatically scrunch that request down to whatever is the current size and placement of the cpuset holding that task.

Given a clean slate, I prefer Choice B. But Choice B is incompatible. Switching now would break tasks that have been carefully adapting their set_mempolicy() requests to whatever nodes were in their current cpuset. This is probably too incompatible to be acceptable. Therefore it must be Choice A.

However ... if I approach this from another angle, I can show it should be Choice B. Fasten your seatbelt ...

Before the days of cpusets, Choice B was essentially how it was. Tasks computing memory policies for set_mempolicy() calls computed node numbers as if they owned the entire system. Essentially, cpusets introduced an incompatibility, imposing Choice A instead. If a task wants to name the fourth node allowed to it in a memory policy, it can no longer just say node "3"; it now has to determine its cpuset and count off the fourth node that is currently allowed to it. This is an inherently racy calculation, of the sort that some of us would find unacceptable, because it can't cope very easily with simultaneously changing cpusets.
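To make that racy count-off concrete, here is a minimal user-level sketch (not code from any real app; get_allowed_nodes() is a hypothetical helper -- a real program might parse the Mems_allowed line of /proc/self/status). The point is that under Choice A, a task that thinks in terms of "my Nth allowed node" has to read its current cpuset and do the counting itself, and nothing stops the cpuset from changing between that read and the set_mempolicy() call:

	#include <numaif.h>		/* set_mempolicy(), MPOL_PREFERRED */

	#define MAX_NODES	128	/* assumed upper bound for this sketch */
	#define BITS_PER_LONG	(8 * sizeof(unsigned long))
	#define LONGS_NEEDED	((MAX_NODES + BITS_PER_LONG - 1) / BITS_PER_LONG)

	/* Hypothetical: fill 'mask' with the caller's currently allowed nodes. */
	extern int get_allowed_nodes(unsigned long *mask, int maxnode);

	int prefer_nth_allowed_node(int n)
	{
		unsigned long allowed[LONGS_NEEDED] = { 0 };
		unsigned long pref[LONGS_NEEDED] = { 0 };
		int node, count = 0;

		if (get_allowed_nodes(allowed, MAX_NODES) < 0)
			return -1;

		/* Count off the n-th set bit -- racy against cpuset changes. */
		for (node = 0; node < MAX_NODES; node++) {
			if (!(allowed[node / BITS_PER_LONG] &
			      (1UL << (node % BITS_PER_LONG))))
				continue;
			if (++count == n) {
				pref[node / BITS_PER_LONG] |=
					1UL << (node % BITS_PER_LONG);
				return set_mempolicy(MPOL_PREFERRED, pref,
						     MAX_NODES);
			}
		}
		return -1;	/* fewer than n nodes currently allowed */
	}

Under Choice B that whole loop would go away: the task would just set bit n-1 in a mask sized for the whole system and let the kernel scrunch it down to whatever cpuset it happens to be in.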
My hunch (unsupported by any real evidence or experience) is that there is very little user level code that actually depends on this incompatible change imposed by cpusets. I'm guessing that most codes making heavy use of memory policies are still written as if the task owns the system, and would be ill prepared to cope with a heavy cpuset environment. If that's the case, we'd break less user code by going with Choice B.

I have a little bit of code that will notice the difference (so if we go with Choice B, there has to be a way for user level code that cares to probe which choice applies), but I'm not a major user of mempolicy calls. I'll have to rely on the experience of others more involved with memory-policy-aware user code as to which Choice would be less disruptive.

Recommendations?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401