On Tue, Jun 02, 2026 at 09:57:48AM +0100, Gregory Price wrote:
> On Tue, Jun 02, 2026 at 12:16:50PM +1000, Balbir Singh wrote:
> > On Sun, May 24, 2026 at 09:50:06PM -0400, Gregory Price wrote:
> > > 
> > > I'm debating on whether to include OPS_MEMPOLICY in the initial version
> > > if only because it's not intuitive how it interacts with pagecache. That
> > > needs more time to bake.
> > >
> > 
> > It makes sense to look at it and then decide if it makes sense.
> >
> 
> I am thinking I will ship without any OPS flags at all for now and then
> have the introduction of ops as a separate series.
> 
> > > alloc_pages_node() is the kernel interface
> > 
> > I was thinking we wouldn't need explicit flags and that allocations would
> > happen from user space using __GFP_THISNODE to the node or via a nodemask
> > based on nodes of interest. Is there a reason to add this flag, a system
> > might have more than one source of N_MEMORY_PRIVATE?
> > 
> 
> There's a few things to unpack here.  I discussed this many times on
> list and at LSF, but to reiterate.
> 
> 1) __GFP_THISNODE is insufficient to enforce isolation and otherwise
>    not particularly useful.  Additionally, from userland, it's not
>    something you can actually set.

I was thinking mbind()/set_mempolicy() is how we get to it. It already
accepts a nodemask.
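
To make the mbind() route concrete, here is a rough userland sketch of
what I have in mind. It is only an illustration: the helper names
(nodemask_single, map_bound_to_node), the mask size and the node id
passed in are my assumptions, and whether mbind() should accept a
private node at all is exactly the open question. It uses the raw
syscall so it does not depend on libnuma's <numaif.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* MPOL_BIND value from the mbind(2) uapi, defined locally to avoid <numaif.h>. */
#define SKETCH_MPOL_BIND 2

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MASK_LONGS 16 /* hypothetical mask size, ample for a sketch */

/* Build a nodemask with exactly one bit set: the node we want. */
static void nodemask_single(unsigned long *mask, size_t nlongs, int node)
{
    assert(node >= 0 && (size_t)node < nlongs * BITS_PER_LONG);
    memset(mask, 0, nlongs * sizeof(*mask));
    mask[node / BITS_PER_LONG] |= 1UL << (node % BITS_PER_LONG);
}

/*
 * Map anonymous memory and bind it to a single node.
 * Returns the mapping, or NULL on failure (errno is set).
 */
static void *map_bound_to_node(size_t len, int node)
{
    unsigned long mask[MASK_LONGS];
    void *mem;

    nodemask_single(mask, MASK_LONGS, node);

    mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;

    /* maxnode is the mask size in bits; the kernel treats it as an upper bound. */
    if (syscall(SYS_mbind, mem, len, SKETCH_MPOL_BIND, mask,
                (unsigned long)(MASK_LONGS * BITS_PER_LONG), 0UL) != 0) {
        munmap(mem, len);
        return NULL;
    }
    return mem; /* the first write to mem faults a page in under the policy */
}
```

A caller would do mem = map_bound_to_node(len, nid) and then touch
mem[0]; the nodemask is the only place the target node is named.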

> 
>    for node in possible_nodes:
>        alloc_pages_node(private_node, __GFP_THISNODE)
> 
>    In fact it's the opposite semantic of what we want.
>    THISNODE says: "Do not fall back to OTHER nodes".
> 

That's why we need to control the fallback nodes carefully for
N_MEMORY_PRIVATE.
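
To check that we are talking about the same semantics, here is a toy
userland model of the fallback walk. It is not kernel code: toy_node,
toy_alloc and the TOY_* flags are made-up names that only mimic
__GFP_THISNODE and the proposed __GFP_PRIVATE, and the fallback order is
just an array I pass in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_THISNODE (1u << 0) /* models __GFP_THISNODE */
#define TOY_PRIVATE  (1u << 1) /* models the proposed __GFP_PRIVATE */

struct toy_node {
    bool is_private;     /* models membership in N_MEMORY_PRIVATE */
    unsigned free_pages; /* pages left on this node */
};

/*
 * Walk the fallback order, starting at the preferred node (order[0]).
 * Returns the index of the node allocated from, or -1 on failure.
 */
static int toy_alloc(struct toy_node *nodes, const int *order, size_t n,
                     unsigned flags)
{
    for (size_t i = 0; i < n; i++) {
        struct toy_node *node = &nodes[order[i]];
        /* Private nodes are skipped unless explicitly requested. */
        bool allowed = !node->is_private || (flags & TOY_PRIVATE);

        if (allowed && node->free_pages > 0) {
            node->free_pages--;
            return order[i];
        }
        /* THISNODE: only the preferred node is tried, no fallback. */
        if (flags & TOY_THISNODE)
            break;
    }
    return -1;
}
```

With a private node in the preferred slot, TOY_THISNODE alone fails
outright, no flags falls through to the next ordinary node, and
TOY_PRIVATE is the only way to get a page from the private node.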

>    The semantic we want is "Do not allow allocations from private
>    nodes UNLESS we specifically request" (__GFP_PRIVATE).
> 
>    __GFP_THISNODE does not actually buy you anything here, AND it's
>    worse, in the scenario where a private node makes its way into the
>    preferred slot (via possible_nodes or some other nodemask), the
>    allocator cannot fall back to a node it can access.
> 
>    __GFP_THISNODE cannot be overloaded to do anything useful here.

Let me clarify, I meant to say, let's use a nodemask for allocation
and __GFP_THISNODE gets us to the node we desire, if that is the only
node. My earlier comment might not have been clear.

> 
> 2) We're trying not to expose *ANY* userland APIs for this, at all.
> 
>    The ultimate goal here should be one of two things:
> 
>    1) fd = open(/dev/xxx, ...);
>       mem = mmap(fd, ...);
>       mem[0] = 0xDEADBEEF; /* Fault device page into page table */
> 
>       In this case, the driver is responsible for doing the
>       alloc_pages_node() call.
> 
>    or
> 
>    2) mem = mmap(NULL, ..., ANON);
>       mbind(mem, ..., private_node);
>       mem[0] = 0xDEADBEEF; /* Fault device page into page table */
> 
>       in this case mempolicy.c is responsible for doing the
>       alloc_pages_node() call via the _mpol() alloc variants.
> 
> Additional OPT flags (reclaim, compaction, whatever), would
> (optionally) allow mm/ to operate on the device memory with, for
> example, mmu_notifier callbacks to tell the device to invalidate
> whatever it's caching about that page.
> 
> This would all be relatively transparent to userland; all userland
> "knows" is that it's getting memory from a device (/dev/xxx) or a
> node it's otherwise aware of hosting device memory somehow.
> 

Why not use the mbind() APIs? Do we want to gate allocation/privileges
via a /dev?

Balbir
