On 11/13/25 06:29, Gregory Price wrote:
> This is a code RFC for discussion related to
>
> "Mempolicy is dead, long live memory policy!"
> https://lpc.events/event/19/contributions/2143/
>

:) I am trying to read through your series, but in the past I tried
https://lwn.net/Articles/720380/

> base-commit: 24172e0d79900908cf5ebf366600616d29c9b417
> (version notes at end)
>
> At LSF 2026, I plan to discuss:
> - Why? (In short: shunting to DAX is a failed pattern for users)
> - Other designs I considered (mempolicy, cpusets, zone_device)
> - Why mempolicy.c and cpusets as-is are insufficient
> - SPM types seeking this form of interface (Accelerator, Compression)
> - Platform extensions that would be nice to see (SPM-only Bits)
>
> Open Questions
> - Single SPM nodemask, or multiple based on features?
> - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> - Allocate extra "possible" NUMA nodes for flexibility?
> - Should SPM Nodes be zone-restricted? (MOVABLE only?)
> - How to handle things like reclaim and compaction on these nodes.
>
>
> With this set, we aim to enable allocation of "specific purpose memory"
> with the page allocator (mm/page_alloc.c) without exposing the same
> memory as "System RAM". Unless a non-userland component explicitly
> requests it, and does so with the GFP_SPM_NODE flag, memory on these
> nodes cannot be allocated.
>
> This isolation mechanism is a requirement for memory policies which
> depend on certain sets of memory never being used outside special
> interfaces (such as a specific mm/component or driver).
>
> We present an example of using this mechanism within ZSWAP, as-if
> a "compressed memory node" was present. How to describe the features
> of memory present on nodes is left up to comment here and at LPC '26.
>
> Userspace-driven allocations are restricted by the sysram_nodes mask;
> nothing in userspace can explicitly request memory from SPM nodes.
>
> Instead, the intent is to create new components which understand memory
> features and register those nodes with those components. This abstracts
> the hardware complexity away from userland while also not requiring new
> memory innovations to carry entirely new allocators.
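
As an illustration only, the isolation rule quoted above can be modelled
in a few lines of userspace C. This is a sketch under stated assumptions,
not code from the series: model_nodemask_t, MODEL_GFP_SPM_NODE,
model_node_bit() and model_node_allowed() are hypothetical names, a node
mask is reduced to a 64-bit word, and the model ignores cpusets, zones
and everything else the page allocator actually consults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MODEL_MAX_NODES 64

typedef uint64_t model_nodemask_t;      /* bit N set => node N is in the mask */

#define MODEL_GFP_SPM_NODE (1u << 0)    /* stand-in for the proposed GFP_SPM_NODE */

/* Return the mask bit for a node id that is already known to be in range. */
static model_nodemask_t model_node_bit(int nid)
{
	assert(nid >= 0 && nid < MODEL_MAX_NODES);
	return (model_nodemask_t)1 << nid;
}

/*
 * Decide whether an allocation may be served from node 'nid'.
 *
 * Assumption of this model: without the SPM flag only nodes in
 * 'sysram_nodes' are eligible; with the flag the caller has opted in
 * explicitly, so any valid node is eligible.
 */
static bool model_node_allowed(model_nodemask_t sysram_nodes, int nid,
			       unsigned int gfp)
{
	if (nid < 0 || nid >= MODEL_MAX_NODES)
		return false;

	if (gfp & MODEL_GFP_SPM_NODE)
		return true;

	return (sysram_nodes & model_node_bit(nid)) != 0;
}
```

With sysram_nodes = 0x3 (nodes 0 and 1), model_node_allowed(0x3, 2, 0)
is false, while model_node_allowed(0x3, 2, MODEL_GFP_SPM_NODE) is true:
node 2 is reachable only by a caller that passes the flag.
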
>
> The ZSwap example demonstrates this with the `mt_spm_nodemask`. This
> hack treats all spm nodes as-if they are compressed memory nodes, and
> we bypass the software compression logic in zswap in favor of simply
> copying memory directly to the allocated page. In a real design
>
> There are 4 major changes in this set:
>
> 1) Introducing mt_sysram_nodelist in mm/memory-tiers.c which denotes
>    the set of nodes which are eligible for use as normal system ram
>
>    Some existing users now pass mt_sysram_nodelist into the page
>    allocator instead of NULL, but passing a NULL pointer in will simply
>    have it replaced by mt_sysram_nodelist anyway. Should a fully NULL
>    pointer still make it to the page allocator, without GFP_SPM_NODE
>    SPM node zones will simply be skipped.
>
>    mt_sysram_nodelist is always guaranteed to contain the N_MEMORY nodes
>    present during __init, but if empty the use of mt_sysram_nodes()
>    will return a NULL to preserve current behavior.
>
>
> 2) The addition of `cpuset.mems.sysram` which restricts allocations to
>    `mt_sysram_nodes` unless GFP_SPM_NODE is used.
>
>    SPM Nodes are still allowed in cpuset.mems.allowed and effective.
>
>    This is done to allow separate control over sysram and SPM node sets
>    by cgroups while maintaining the existing hierarchical rules.
>
>    current cpuset configuration
>    cpuset.mems_allowed
>     |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
>     |->tasks.mems_allowed    < cpuset.mems_effective
>
>    new cpuset configuration
>    cpuset.mems_allowed
>     |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
>     |.sysram_nodes           < (mems_effective ∩ default_sys_nodemask)
>     |->task.sysram_nodes     < cpuset.sysram_nodes
>
>    This means mems_allowed still restricts all node usage in any given
>    task context, which is the existing behavior.
>
> 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
>    capacity being added should mark the node as an SPM Node.
>
>    A node is either SysRAM or SPM - never both. Attempting to add
>    incompatible memory to a node results in hotplug failure.
>
>    DAX and CXL are made aware of the bit and have `spm_node` bits added
>    to their relevant subsystems.
>
> 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
>    from the provided node or nodemask. It changes the behavior of
>    the cpuset mems_allowed and mt_node_allowed() checks.
>
> v1->v2:
> - naming improvements
>     default_node -> sysram_node
>     protected -> spm (Specific Purpose Memory)
> - add missing constify patch
> - add patch to update callers of __cpuset_zone_allowed
> - add additional logic to the mm sysram_nodes patch
> - fix bot build issues (ifdef config builds)
> - fix out-of-tree driver build issues (function renames)
> - change compressed_nodelist to spm_nodelist
> - add latch mechanism for sysram/spm nodes (Dan Williams)
>   this drops some extra memory-hotplug logic which is nice
>
> v1:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Gregory Price (11):
>   mm: constify oom_control, scan_control, and alloc_context nodemask
>   mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
>   gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations
>   memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes
>   mm: restrict slub, oom, compaction, and page_alloc to sysram by
>     default
>   mm,cpusets: rename task->mems_allowed to task->sysram_nodes
>   cpuset: introduce cpuset.mems.sysram
>   mm/memory_hotplug: add MHP_SPM_NODE flag
>   drivers/dax: add spm_node bit to dev_dax
>   drivers/cxl: add spm_node bit to cxl region
>   [HACK] mm/zswap: compressed ram integration example
>
>  drivers/cxl/core/region.c       |  30 ++++++
>  drivers/cxl/cxl.h               |   2 +
>  drivers/dax/bus.c               |  39 ++++++++
>  drivers/dax/bus.h               |   1 +
>  drivers/dax/cxl.c               |   1 +
>  drivers/dax/dax-private.h       |   1 +
>  drivers/dax/kmem.c              |   2 +
>  fs/proc/array.c                 |   2 +-
>  include/linux/cpuset.h          |  62 +++++++------
>  include/linux/gfp_types.h       |   5 +
>  include/linux/memory-tiers.h    |  47 ++++++++++
>  include/linux/memory_hotplug.h  |  10 ++
>  include/linux/mempolicy.h       |   2 +-
>  include/linux/mm.h              |   4 +-
>  include/linux/mmzone.h          |   6 +-
>  include/linux/oom.h             |   2 +-
>  include/linux/sched.h           |   6 +-
>  include/linux/swap.h            |   2 +-
>  init/init_task.c                |   2 +-
>  kernel/cgroup/cpuset-internal.h |   8 ++
>  kernel/cgroup/cpuset-v1.c       |   7 ++
>  kernel/cgroup/cpuset.c          | 158 ++++++++++++++++++++------------
>  kernel/fork.c                   |   2 +-
>  kernel/sched/fair.c             |   4 +-
>  mm/compaction.c                 |  10 +-
>  mm/hugetlb.c                    |   8 +-
>  mm/internal.h                   |   2 +-
>  mm/memcontrol.c                 |   3 +-
>  mm/memory-tiers.c               |  66 ++++++++++++-
>  mm/memory_hotplug.c             |   7 ++
>  mm/mempolicy.c                  |  34 +++----
>  mm/migrate.c                    |   4 +-
>  mm/mmzone.c                     |   5 +-
>  mm/oom_kill.c                   |  11 ++-
>  mm/page_alloc.c                 |  57 +++++++-----
>  mm/show_mem.c                   |  11 ++-
>  mm/slub.c                       |  15 ++-
>  mm/vmscan.c                     |   6 +-
>  mm/zswap.c                      |  66 ++++++++++++-
>  39 files changed, 532 insertions(+), 178 deletions(-)
>

Balbir

