On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> On Tue, Jun 02, 2026 at 09:57:48AM +0100, Gregory Price wrote:
> > On Tue, Jun 02, 2026 at 12:16:50PM +1000, Balbir Singh wrote:
> > > 
> > > I was thinking we wouldn't need explicit flags and that allocations would
> > > happen from user space using __GFP_THISNODE to the node or via a nodemask
> > > based on nodes of interest. Is there a reason to add this flag, a system
> > > might have more than one source of N_MEMORY_PRIVATE?
> > > 
> > 
> > There's a few things to unpack here.  I discussed this many times on
> > list and at LSF, but to reiterate.
> > 
> > 1) __GFP_THISNODE is insufficient to enforce isolation and otherwise
> >    not particularly useful.  Additionally, from userland, it's not
> >    something you can actually set.
> 
> I was thinking mbind()/mempolicy() is how we get to it. It already
> accepts a nodemask.
>

First let me say:  I want to enable mbind access to these nodes.

But let me caveat:  I think that needs more time to develop, and
in the meantime, we can enable the /dev/xxx pattern somewhat trivially.

First let me address a few things about mbind/mempolicy and how they
interact with page_alloc.c.  I gave this overview at LSF, but I don't
remember if I posted it in any of my follow-ups.


1) Fallback lists are filtered by nodemask, the nodemask does not replace
   the fallback list.

Here is how the page allocator fallback lists and nodemasks interact:

   Fallbacks A:  A B 
   Fallbacks B:  B A
   Fallbacks C:  C A B   (Private)
   Fallbacks D:  D B A   (Private)

Let's say you pass:

   alloc_pages_node(C, ..., nodemask(A,C,D))

So we get

  Fallback(C,A,B) & nodemask(A,C,D) -> iterate(C,A)
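
To make the filtering concrete, here is a toy userspace model of the
interaction above.  It is only a sketch: the node names and fallback
orders are the ones from the example, and candidate_nodes() is a
made-up helper, not anything in page_alloc.c.

```python
# Toy model only: fallback orders copied from the example above.
FALLBACKS = {
    "A": ["A", "B"],
    "B": ["B", "A"],
    "C": ["C", "A", "B"],  # private
    "D": ["D", "B", "A"],  # private
}


def candidate_nodes(preferred, nodemask=None):
    """Return the nodes that would be tried, in order.

    The nodemask only filters the preferred node's fallback list;
    it never adds nodes that the list does not already contain.
    """
    order = FALLBACKS[preferred]
    if nodemask is None:
        return list(order)
    return [node for node in order if node in nodemask]


# D is in the nodemask but not in C's fallback list, so it is never tried.
print(candidate_nodes("C", {"A", "C", "D"}))  # ['C', 'A']
# A single-node nodemask leaves nothing to fall back to.
print(candidate_nodes("C", {"C"}))            # ['C']
```

Note that D never shows up for a C-preferred allocation, no matter
what the nodemask says.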

If we wanted to change this behavior, realistically we'd be looking for
a way to add specific nodes to certain fallback lists - rather than
modify the nodemask interaction in some way.

I think this is out of scope for the first iteration - so supporting
anything other than mbind() from the start is just pointless.

The only feasible mempolicy you can apply is single-node bind, so
realistically you can only support mbind.


2) Full mempolicy support doesn't really make sense

   task mempolicy PROBABLY should never really touch private nodes,
   while VMA policy certainly can.  Assuming we're able to support
   multi-private-node masks, none of the non-bind mempolicies even
   make sense for most private nodes (interleave? weighted interleave?)

   I haven't worked through all the implications of a task policy having
   a private node attached, but the longer I think about it, the less it
   makes sense to just support this outright.


3) Introducing mbind support is not just a simple nodemask on a VMA.
   It also implies migration, cgroup/cpuset, and UAPI interactions.

   a) migration:
      
      mbind/mempolicy can and will engage migration when it is called
      with certain flags.  Migration has subtle LRU interactions, but
      the patch set I have at least allows this to work.

   b) cgroup/cpuset:
   
      cpuset.mems rebinding will cause private nodes to be quietly
      rebound to non-private nodes within a nodemask.

   c) UAPI:

      Following from (a) and (b), we really want MPOL_F_STATIC_NODES
      to be required for mbind to be applied to a private node, so that
      it is never forcefully remapped.

      That's a UAPI semantic change specific to private nodes that we
      should really take time to consider.
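
A toy model of the rebinding problem from (b) and (c), as a sketch
only.  It assumes a simplified ordinal remap (the n-th node of the old
allowed set maps to the n-th node of the new one, wrapping around) for
the non-static case, and a plain intersection for the static case.
The function names are hypothetical, and the case where the result
ends up empty is not modeled.

```python
def remap(policy, old_allowed, new_allowed):
    """Simplified ordinal remap of policy nodes from old to new allowed set."""
    old = sorted(old_allowed)
    new = sorted(new_allowed)
    if not new:
        return set(policy)
    out = set()
    for node in policy:
        if node in old:
            # n-th node of the old set -> n-th node of the new set (wrapping)
            out.add(new[old.index(node) % len(new)])
        else:
            out.add(node)  # nodes outside the old set are left alone
    return out


def rebind(policy, old_allowed, new_allowed, static=False):
    """Model a cpuset.mems change hitting an existing policy nodemask."""
    if static:
        # Static: keep the nodes the user asked for, minus disallowed ones.
        return set(policy) & set(new_allowed)
    return remap(policy, old_allowed, new_allowed)


# Nodes 0 and 1 are normal, node 2 is private (assumption for this example).
# cpuset.mems shrinks from {0,1,2} to {1,2}; node 2 is still allowed.
print(rebind({2}, {0, 1, 2}, {1, 2}))               # {1}
print(rebind({2}, {0, 1, 2}, {1, 2}, static=True))  # {2}
```

In this model the non-static policy quietly moves from the private
node to a normal one even though the private node is still allowed,
which is exactly the behavior we want to rule out.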


4) File VMA interactions don't entirely make sense with mbind

   In theory you might want:

   fd = open("somefile", ...);
   mem = mmap(fd, ...);
   mbind(mem, ..., private_node);
   for page in mem:
      mem[page_off] /* fault file into private memory */

   In reality: This does not work the way you want.

   I went digging, and we need a few mild extensions to allow
   migration on mbind to work for pagecache pages.  The fault
   path also does not always respect the VMA mempolicy.

   You also start getting into the question of "what happens when
   the node is out of memory and you don't have reclaim support?".
   The OOM implications jump out at you pretty aggressively.

   Moreover, other tasks can force the page cache pages to be moved
   as well.  So the programming model here just kind of sucks.

   Works great for anon memory though :]

For all these reasons, I think mbind/mempolicy support for private
nodes needs to be brought in as follow-up work - not introduced as
part of the baseline set.

> > 
> >    for node in possible_nodes:
> >        alloc_pages_node(private_node, __GFP_THISNODE)
> > 
> >    In fact it's the opposite semantic of what we want.
> >    THISNODE says: "Do not fall back to OTHER nodes".
> > 
> 
> That's why we need to control the fallback nodes carefully for
> N_MEMORY_PRIVATE
>

My point is that __GFP_THISNODE is not actually useful.

If we go by nodemask, submitting a single-node nodemask is the
equivalent of an empty fallback list.

If we gate access to a private node by __GFP_THISNODE... this is the
same as just providing a single-node nodemask (putting aside the OOM
implications for a moment).

And it doesn't even buy you any new filtering ability against existing
nodemask iterators that may already utilize __GFP_THISNODE.  i.e.

   for node in online_nodes:
       alloc_pages_node(node, __GFP_THISNODE, ...)
       /* Alloc per-node resources */

   This pattern is undesirable, but completely valid.

So overloading/requiring __GFP_THISNODE is just not useful.
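
As a sketch of why the gate matters, here is a toy model comparing the
two gating choices against that per-node loop.  The node names, the
PRIVATE set, and both helper functions are assumptions made up for
illustration; "private_flag" stands in for an explicit opt-in like
__GFP_PRIVATE.

```python
PRIVATE = {"C", "D"}           # assumed private nodes
ONLINE = ["A", "B", "C", "D"]  # assumed online nodes


def allowed(node, thisnode, private_flag, gate):
    """Decide whether an allocation may land on `node` under a given gate."""
    if node not in PRIVATE:
        return True
    if gate == "thisnode":
        return thisnode       # gate private nodes on THISNODE
    if gate == "private":
        return private_flag   # gate private nodes on an explicit opt-in
    raise ValueError(f"unknown gate: {gate}")


def per_node_setup(gate):
    """Existing pattern: one THISNODE allocation per online node, no opt-in."""
    return [n for n in ONLINE
            if allowed(n, thisnode=True, private_flag=False, gate=gate)]


print(per_node_setup("thisnode"))  # ['A', 'B', 'C', 'D']
print(per_node_setup("private"))   # ['A', 'B']
```

With the THISNODE gate, the existing per-node loop still lands on the
private nodes; only the explicit opt-in keeps it out.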

I will follow up soon with a new version that limits the private node
interface to just nodemask and fallback list controls.

I need to test a few more things related to removing normal nodes from
private node fallbacks before I feel comfortable shipping without
__GFP_PRIVATE.

> >    The semantic we want is "Do not allow allocations from private
> >    nodes UNLESS we specifically request" (__GFP_PRIVATE).
> > 
> >    __GFP_THISNODE does not actually buy you anything here, AND it's
> >    worse, in the scenario where a private node makes its way into the
> >    preferred slot (via possible_nodes or some other nodemask), the
> >    allocator cannot fall back to a node it can access.
> > 
> >    __GFP_THISNODE cannot be overloaded to do anything useful here.
> 
> Let me clarify, I meant to say, let's use a nodemask for allocation
> and __GFP_THISNODE gets us to the node we desire, if that is the only
> node. My earlier comment might not have been clear.
>

My point was that __GFP_THISNODE is pointless and reduces to providing a
single node nodemask anyway.

The contention over __GFP_PRIVATE is a bit ideological - do we want:

  1) A hard guarantee that allocations to a private node are controlled
     (__GFP_PRIVATE implies the caller knows what it's doing)

  or

  2) A soft guarantee (fallback list isolation only), and needing to
     deal with undesired behavior that's "not technically a bug"
     associated with existing users of global nodemasks (possible,
     online, etc).

I am arguing for #1 - the community has argued for #2 and "fixing
existing nodemask users".  I think we can ship #2 and pivot to #1 if we
find fixing existing users is infeasible or too much of a maintenance
burden.

> 
> Why not use mbind() APIs? Do we want to gate allocation/privileges
> via a /dev?
>

We want to eventually enable it, but we really need to treat these
extensions as a separate step from the base so that the UAPI
implications are given proper scrutiny.

In the short term, /dev/xxx and driver-local/service-local control
of a node is still very useful.

For example, for my compressed memory work, I have found that if
implemented as a swap backend - the kernel can manage the node without
any UAPI implications at all :].

A driver managing memory on a private node could do the same.

~Gregory
