On Tue, Jun 02, 2026 at 09:57:48AM +0100, Gregory Price wrote:
> On Tue, Jun 02, 2026 at 12:16:50PM +1000, Balbir Singh wrote:
> > On Sun, May 24, 2026 at 09:50:06PM -0400, Gregory Price wrote:
> > >
> > > I'm debating on whether to include OPS_MEMPOLICY in the initial version
> > > if only because it's not intuitive how it interacts with pagecache. That
> > > needs more time to bake.
> > >
> >
> > It makes sense to look at it and then decide if it makes sense.
> >
>
> I am thinking i will ship without any OPS flags at all for now and the
> have the introduction of ops as a separate series.
>
> > > alloc_pages_node() is the kernel interface
> >
> > I was think we wouldn't need explicit flags and that allocations would
> > happen from user space using __GFP_THISNODE to the node or via a nodemask
> > based on nodes of interest. Is there a reason to add this flag, a system
> > might have more than one source of N_MEMORY_PRIVATE?
> >
>
> There's a few things to unpack here. I discussed this many times on
> list and at LSF, but to reiterate.
>
> 1) __GFP_THISNODE is insufficient to enforce isolation and otherwise
>    not particularly useful. Additionally, from userland, it's not
>    something you can actually set.

I was thinking mbind()/set_mempolicy() is how we get to it. Both already
accept a nodemask.

>
>    for node in possible_nodes:
>        alloc_pages_node(private_node, __GFP_THISNODE)
>
>    In fact it's the opposite semantic of what we want.
>    THISNODE says: "Do not fallback back to OTHER nodes".
> That's why we need to control the fallback nodes carefully for N_MEMORY_PRIVATE
>    The semantic we want is "Do not allow allocations from private
>    nodes UNLESS we specifically request" (__GFP_PRIVATE).
>
>    __GFP_THISNODE does not actually buy you anything here, AND it's
>    worse, in the scenario where a private node makes its way into the
>    preferred slot (via possible_nodes or some other nodemask), the
>    allocator cannot fall back to a node it can access.
>
>    __GFP_THISNODE cannot be overloaded to do anything useful here.

Let me clarify: I meant that we use a nodemask for the allocation, and
__GFP_THISNODE gets us to the node we want when that is the only node in
the mask. My earlier comment might not have been clear.

>
> 2) We're trying not to expose *ANY* userland APIs for this, at all.
>
>    The ultimate goal here should be one of two things:
>
>    1) fd = open(/dev/xxx, ...);
>       mem = mmap(fd, ...);
>       mem[0] = 0xDEADBEEF; /* Fault device page into page table */
>
>       In this case, the driver is responsible for doing the
>       alloc_pages_node() call.
>
>    or
>
>    2) mem = mmap(NULL, ..., ANON);
>       mbind(mem, ..., private_node);
>       mem[0] = 0xDEADBEEF; /* Fault device page into page table */
>
>       in this case mempolicy.c is responsible for doing the
>       alloc_pages_node() call via the _mpol() alloc variants.
>
>    Addition OPT flags (reclaim, compaction, whatever), would
>    (optionally) allow mm/ to operate on the device memory with, for
>    example, mmu_notifier callbacks to tell the device to invalidate
>    whatever it's caching about that page.
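
For concreteness, here is roughly what flow 2) looks like from userspace
with the mbind() syscall as it exists today. This is only a sketch: the
sketch_* names are made up for illustration, the raw syscall is used to
avoid depending on libnuma, and it binds to an ordinary node id. Whether
mbind() would accept a private node at all is the open question in this
thread, so nothing here should be read as the proposed behaviour.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Same value as MPOL_BIND in <linux/mempolicy.h>; redefined locally so the
 * sketch does not need the kernel uapi headers or libnuma. */
#define SKETCH_MPOL_BIND 2

/* Hypothetical helper: one-word nodemask with only `node` set.
 * Returns 0 if `node` does not fit in a single unsigned long. */
unsigned long sketch_single_node_mask(int node)
{
	if (node < 0 || node >= (int)(8 * sizeof(unsigned long)))
		return 0;
	return 1UL << node;
}

/* Hypothetical helper: anonymous mmap, mbind to `node`, touch first word.
 * Returns 0 on success or a negative errno value. */
int sketch_bind_and_touch(size_t len, int node)
{
	unsigned long mask = sketch_single_node_mask(node);
	unsigned int *mem;
	int ret = 0;

	if (mask == 0 || len < sizeof(*mem))
		return -EINVAL;

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -errno;

	/* maxnode is "number of mask bits + 1"; the kernel subtracts one. */
	if (syscall(SYS_mbind, mem, (unsigned long)len, (long)SKETCH_MPOL_BIND,
		    &mask, (unsigned long)(8 * sizeof(mask) + 1), 0UL) != 0)
		ret = -errno;
	else
		mem[0] = 0xDEADBEEF; /* first touch faults a page in under the policy */

	munmap(mem, len);
	return ret;
}
```

Calling sketch_bind_and_touch(4096, 0) binds one page to node 0. In a
sandbox that filters mbind() it returns a negative errno instead.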
>
> This would all be relatively transparent the userland, all userland
> "knows" is that it's getting memory from a device (/dev/xxx) or a
> node it's otherwise aware of hosting device memory somehow.
>

Why not use the mbind() APIs? Do we want to gate allocation/privileges
via a /dev file?

Balbir
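
P.S. To make sure we mean the same thing by the two semantics in point 1),
here is a toy userspace model. It is not kernel code: TOY_THISNODE and
TOY_PRIVATE are stand-ins for __GFP_THISNODE and the proposed __GFP_PRIVATE,
struct toy_node and toy_alloc() are invented for this mail, and the
round-robin walk is a simplification of real zonelist fallback. It only
encodes the two rules as quoted above: THISNODE means "no fallback to other
nodes", and private nodes are skipped unless explicitly requested.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_THISNODE (1u << 0) /* stand-in for __GFP_THISNODE */
#define TOY_PRIVATE  (1u << 1) /* stand-in for the proposed __GFP_PRIVATE */

struct toy_node {
	bool is_private; /* models membership in N_MEMORY_PRIVATE */
	int free_pages;
};

/* Try `preferred` first, then the other nodes in wrap-around order.
 * Returns the node id that served the page, or -1 on failure. */
int toy_alloc(struct toy_node *nodes, int nr_nodes, int preferred,
	      unsigned int flags)
{
	if (nodes == NULL || preferred < 0 || preferred >= nr_nodes)
		return -1;

	for (int i = 0; i < nr_nodes; i++) {
		int nid = (preferred + i) % nr_nodes;
		struct toy_node *n = &nodes[nid];

		/* Opt-in rule: private nodes are invisible without TOY_PRIVATE. */
		bool allowed = !n->is_private || (flags & TOY_PRIVATE) != 0;

		if (allowed && n->free_pages > 0) {
			n->free_pages--;
			return nid;
		}

		/* THISNODE rule: never look past the preferred node. */
		if (flags & TOY_THISNODE)
			return -1;
	}
	return -1;
}
```

With node 0 normal and node 1 private, a plain request that prefers node 1
falls back to node 0, the same request with TOY_THISNODE fails outright,
and only TOY_PRIVATE actually lands on node 1.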
