On Mon, Nov 24, 2025 at 10:19:37AM +0100, David Hildenbrand (Red Hat) wrote:
> [...]
>
Apologies in advance for the wall of text; both of your questions really
do cut to the core of the series. The first (SPM nodes) is basically a
plumbing problem I haven't had time to address pre-LPC, and the second
(GFP) is a design decision that is definitely still up in the air. So
consider this a dump of everything I wouldn't have had time to cover in
the LPC session.

> > 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
> >    capacity being added should mark the node as an SPM Node.
>
> Sounds a bit like the wrong interface for configuring this. This smells like
> a per-node setting that should be configured before hotplugging any memory.

Assuming you're specifically talking about the MHP portion of this: I
agree, and I think the plumbing ultimately goes through ACPI and kernel
configs. This was my shortest path to demonstrate a functional prototype
by LPC. I think the most likely option is simply reserving additional
NUMA nodes for hotpluggable regions based on a Kconfig setting.

I think the real setup process should look as follows:

1. At __init time, Linux reserves additional SPM nodes based on some
   configuration (build? runtime? etc). Essentially create: nodes[N_SPM]

2. At SPM setup time, a driver registers an "Abstract Type" with
   mm/memory-tiers.c which maps SPM->Type. This gives the core some
   management callback infrastructure without polluting the core with
   device-specific nonsense. This also gives the driver a chance to
   define things like SLIT distances for those nodes, which otherwise
   won't exist.

3. At hotplug time, memory_hotplug.c should only have to flip a bit in
   `mt_sysram_nodes` if NID is not in nodes[N_SPM]. That logic is still
   there to ensure the base filtering works as intended.
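To make (1) and (3) concrete, here is a minimal sketch. N_SPM,
CONFIG_SPM_NR_NODES, and spm_reserve_nodes() are all hypothetical names;
none of this exists today:

/* (1) After SRAT/CEDT parsing: reserve unused node IDs as SPM nodes. */
static int __init spm_reserve_nodes(void)
{
	int nid, nr = 0;

	for (nid = MAX_NUMNODES - 1; nid >= 0 && nr < CONFIG_SPM_NR_NODES; nid--) {
		if (node_state(nid, N_POSSIBLE))
			continue;	/* already described by firmware tables */
		node_set(nid, node_states[N_POSSIBLE]);
		node_set(nid, node_states[N_SPM]);	/* hypothetical new node state */
		nr++;
	}
	return 0;
}

/* (3) In memory_hotplug.c: only mark the node as SysRAM if it was not
 * reserved as SPM (mt_sysram_nodes is from this series). */
	if (!node_state(nid, N_SPM))
		node_set(nid, mt_sysram_nodes);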
I haven't quite figured out how to plumb out nodes[N_SPM] as described
above, but I did figure out how to demonstrate roughly the same effect
through memory_hotplug.c - hopefully that much is clear.

The problem with the above plan is whether it "makes sense" according to
ACPI specs and friends. This operates in "Ambiguity Land", which is
uncomfortable.

======== How Linux ingests ACPI Tables to make NUMA nodes =======

For the sake of completeness: NUMA nodes are "marked as possible"
primarily via entries in the ACPI SRAT (System Resource Affinity Table).

https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html

    Subtable Type : 01 [Memory Affinity]
    Length : 28
    Proximity Domain : 00000001    <- NUMA Node 1

A proximity domain (PXM) is simply a logical grouping of components, as
described by the platform firmware to the OSPM. Linux takes PXMs and
maps them to NUMA nodes. In most cases NR_PXM == NR_NODES, but not
always.

For example, if the CXL Early Discovery Table (CEDT) describes a CXL
memory region for which there is no SRAT entry, Linux reserves a "Fake
PXM" id and marks that ID as a "possible" NUMA node.

= drivers/acpi/numa/srat.c =

int __init acpi_numa_init(void)
{
	...
	/* fake_pxm is the next unused PXM value after SRAT parsing */
	for (i = 0, fake_pxm = -1; i < MAX_NUMNODES; i++) {
		if (node_to_pxm_map[i] > fake_pxm)
			fake_pxm = node_to_pxm_map[i];
	}
	last_real_pxm = fake_pxm;
	fake_pxm++;
	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
			      &fake_pxm);
	...
}

static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
				   void *arg, const unsigned long table_end)
{
	...
	/* No SRAT description. Create a new node. */
	node = acpi_map_pxm_to_node(*fake_pxm);
	...
	node_set(node, numa_nodes_parsed);  <- this is used to set N_POSSIBLE
}

Here's where we get into "Specification Ambiguity": the ACPI spec does
not (as far as I can see) prevent a memory region from being associated
with multiple proximity domains (NUMA nodes). Therefore, the platform
could actually report it multiple times in the SRAT in order to reserve
multiple NUMA node possibilities for the same device.
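For example, a (purely illustrative) SRAT could carry two Memory
Affinity entries for the same physical range:

    Subtable Type : 01 [Memory Affinity]
    Proximity Domain : 00000002    <- NUMA Node 2, normal SysRAM view
    Base Address : 0000001000000000
    Address Length : 0000000400000000

    Subtable Type : 01 [Memory Affinity]
    Proximity Domain : 00000003    <- NUMA Node 3, reserved for SPM use
    Base Address : 0000001000000000    <- same range, reported twice
    Address Length : 0000000400000000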
A further extension to ACPI could be used to mark such Memory PXMs as
"Specific Purpose" - similar to the EFI_MEMORY_SP bit used to mark
memory regions as "Soft Reserved". (This would probably break quite a
lot of existing Linux code; a quick browse around gives you the sense
that there's an assumption that a given page can only be affiliated
with one possible NUMA node.)

But Linux could also utilize build or runtime settings to add
additional nodes which are reserved for SPM use - but are otherwise
left out of all the default maps. This at least seems reasonable.

Note: N_POSSIBLE is set at __init time, and is more or less expected to
never change. It's probably preferable to work within this restriction
rather than try to change it. Many race conditions.

<skippable wall>
================= Spec nonsense for reference ====================
(ACPI 6.5 Spec)

5.2.16.2 Memory Affinity Structure

  The Memory Affinity structure provides the following topology
  information statically to the operating system:
  * The association between a memory range and the proximity domain
    to which it belongs
  * Information about whether the memory range can be hot-plugged.

5.2.19 Maximum System Characteristics Table (MSCT)

  This section describes the format of the Maximum System
  Characteristic Table (MSCT), which provides OSPM with information
  characteristics of a system's maximum topology capabilities. If the
  system maximum topology is not known up front at boot time, then this
  table is not present. OSPM will use information provided by the MSCT
  only when the System Resource Affinity Table (SRAT) exists. The MSCT
  must contain all proximity and clock domains defined in the SRAT.

  -- field: Maximum Number of Proximity Domains
  Indicates the maximum number of Proximity Domains ever possible in
  the system.

In theory the platform could make MAX_NODES > (NR_NODES in SRAT), and
that delta could be used to indicate the presence of SPM nodes. This
doesn't solve the SLIT PXM distance problem.

6.2.14 _PXM (Proximity)

  This optional object is used to describe proximity domain
  associations within a machine. _PXM evaluates to an integer that
  identifies a device as belonging to a Proximity Domain defined in the
  System Resource Affinity Table (SRAT). OSPM assumes that two devices
  in the same proximity domain are tightly coupled.

17.2.1 System Resource Affinity Table Definition

  The optional System Resource Affinity Table (SRAT) provides the boot
  time description of the processor and memory ranges belonging to a
  system locality. OSPM will consume the SRAT only at boot time. For
  any devices not in the SRAT, OSPM should use _PXM (Proximity) for
  them or their ancestors that are hot-added into the system after
  boot up.

  The SRAT describes the system locality that all processors and memory
  present in a system belong to at system boot. This includes memory
  that can be hot-added (that is memory that can be added to the system
  while it is running, without requiring a reboot). OSPM can use this
  information to optimize the performance of NUMA architecture systems.
  For example, OSPM could utilize this information to optimize
  allocation of memory resources and the scheduling of software threads.
=============================================================
</skippable wall>

So TL;DR: Yes, I agree, this should be configured at __init time, but
while we work on that plumbing, the memory_hotplug.c interface can be
used to unblock exploratory work (such as Alistair's GPU interests).

> > 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
> >    from the provided node or nodemask. It changes the behavior of
> >    the cpuset mems_allowed and mt_node_allowed() checks.
>
> I wonder why that is required. Couldn't we disallow allocation from one of
> these special nodes as default, and only allow it if someone explicitly
> passes in the node for allocation?
>
> What's the problem with that?

Simple answer: We can choose how hard this guardrail is to break. This
initial attempt makes it "Hard": you cannot "accidentally" allocate SPM,
the call must be explicit.

Removing the GFP would work, and would make it "Easier" to access SPM
memory. (There would be other adjustments needed, but the idea is the
same.) To do this, you would revert the mems_allowed check changes in
cpuset to always check mems_allowed (instead of sysram_nodes).

This would allow a trivial:

    mbind(range, SPM_NODE_ID)

which is great, but is also an incredible tripping hazard:

    numactl --interleave --all

and in kernel land:

    __alloc_pages_noprof(..., nodes[N_MEMORY])

These would now instantly be able to land on SPM node memory.

The first pass leverages the GFP flag to make all these tripping hazards
disappear. You can pass a completely garbage nodemask into the page
allocator and still rest assured that you won't touch SPM nodes.
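For illustration, reaching SPM memory in-kernel would then require both
an explicit nodemask AND the flag. A rough sketch of the call shape
(spm_nid stands in for a driver-known SPM node):

	nodemask_t spm_mask = NODE_MASK_NONE;
	struct page *page;

	node_set(spm_nid, spm_mask);
	/* Without __GFP_SPM_NODE, the sysram_nodes filter would skip
	 * this node no matter what the nodemask says. */
	page = __alloc_pages_noprof(GFP_KERNEL | __GFP_SPM_NODE, order,
				    spm_nid, &spm_mask);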
So TL;DR: "What do we want here?" (if anything at all)

For completeness, here are the page_alloc/cpuset/mempolicy interactions
which led me to a GFP flag as the "loosening mechanism" for the filter,
rather than allowing any nodemask to "just work". Apologies again for
the wall of text here; I'm essentially dumping ~6 months of research and
prototyping.

====================

There are basically 3 components which interact with each other:

1) the page allocator nodemask / zone logic
2) cpuset.mems_allowed
3) mempolicy (task, vma)

and now:

4) GFP_SPM_NODE

=== 1) the page allocator nodemask and zone iteration logic

- the page allocator uses prepare_alloc_pages() to decide what
  alloc_context.nodemask will contain
- the nodemask can be NULL or a set of nodes
- the zone iteration logic (for_next_zone_zonelist_nodemask()) will
  iterate all zones if mask=NULL; otherwise, it skips zones on nodes
  not present in the mask
- the value of alloc_context.nodemask may change - for example, it may
  end up loosened in an interrupt context, or if
  reclaim/compaction/fallbacks are invoked

Some issues might be obvious: it would be bad, for example, for an
interrupt to have its allocation context loosened to nodes[N_MEMORY] and
end up allocating SPM memory. Capturing all of these scenarios would be
very difficult, if not impossible.

The page allocator does an initial filtering of nodes if nodemask=NULL,
or it defers the filter operation to the allocation logic if a nodemask
is present (or we're in an interrupt context):

static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
		int preferred_nid, nodemask_t *nodemask,
		struct alloc_context *ac, gfp_t *alloc_gfp,
		unsigned int *alloc_flags)
{
	...
	ac->nodemask = nodemask;
	if (cpuset_enabled()) {
		...
		if (in_task() && !ac->nodemask)
			ac->nodemask = &cpuset_current_mems_allowed;
			     ^^^^ current_task.mems_allowed ^^^^
		else
			*alloc_flags |= ALLOC_CPUSET;
			     ^^^ apply cpuset check during allocation instead ^^^
	}
}

Note here: if cpuset is not enabled, we don't filter! Patch 05/11 uses
mt_sysram_nodes to filter in that scenario.

In the actual allocation logic, we use this nodemask (or cpusets) to
filter out unwanted nodes:

static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
		       const struct alloc_context *ac)
{
	z = ac->preferred_zoneref;
	for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
					ac->nodemask) {
	     ^ if nodemask=NULL - iterates ALL zones in all nodes ^
		...
		if (cpuset_enabled() && (alloc_flags & ALLOC_CPUSET) &&
		    !__cpuset_zone_allowed(zone, gfp_mask))
			continue;
	     ^^^^^^^^ skip zone if not in mems_allowed ^^^^^^^^

Of course, we could change the page allocator logic more explicitly to
support this kind of scenario. For example: we might add
alloc_spm_pages(), which checks mems_allowed instead of sysram_nodes.

I tried this, and the code duplication and spaghetti it resulted in was
embarrassing. It did work, but adding hundreds of lines to page_alloc.c,
with the risk of breaking something, led me to quickly discard it. It
also just bluntly made using SPM memory worse - you just want to call
alloc_pages(nodemask) and be done with it.

This is what led me to focus on modifying cpuset.mems_allowed and adding
global filter logic for when cpusets is disabled.

=== 2) cpuset.mems

- cpuset.mems_allowed is the "primary filter" for most allocations
- if cpusets is not enabled, basically all nodes are "allowed"
- cpuset.mems_allowed is an *inherited value*: child cgroups are
  restricted by the parent's mems_allowed
  (cpuset.effective_mems is the actual nodemask filter)

cpuset.mems_allowed as-is cannot both restrict *AND* allow SPM nodes.
See the filtering functions above: if you remove an SPM node from
root_cgroup.cpuset.mems_allowed to prevent all of its children from
using it, you effectively prevent ANYTHING from using it: the node is
simply not allowed. Since all tasks operate from within the root context
or its children, you can never "allow" the node. If you don't remove the
SPM node from the root cgroup, you aren't preventing tasks in the root
cgroup from accessing the memory.
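To illustrate the trap (conceptual; assume node 3 is the SPM node):

    Option A: remove node 3 from the root cpuset
        root.mems_allowed = 0-2
        -> children are restricted to subsets of 0-2
        -> node 3 is unreachable by *everyone*, forever

    Option B: leave node 3 in the root cpuset
        root.mems_allowed = 0-3
        -> every task in the root cgroup is already allowed node 3
        -> no isolation at all

The split described below is a third option: node 3 stays in
mems_allowed (so an explicit request can still pass the cpuset check),
while sysram_nodes (0-2) is what the default paths filter against.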
I chose to break mems_allowed into (mems_allowed, sysram_nodes) to:

a) create simple nodemask=NULL default nodemask filters:
   mt_sysram_nodes, cpuset.sysram_nodes, task.sysram_nodes
b) leverage the existing cpuset filtering mechanism in mems_allowed
   checks
c) simplify the non-cpuset filter mechanism to a 2-line change in
   page_alloc.c

-- from Patch 04/11:

@@ -3753,6 +3754,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		if ((alloc_flags & ALLOC_CPUSET) &&
 		    !cpuset_zone_allowed(zone, gfp_mask))
 			continue;
+		else if (!mt_node_allowed(zone_to_nid(zone), gfp_mask))
+			continue;

The page_alloc.c changes are much cleaner and easier to understand this
way.

=== 3) mempolicy

- mempolicy allows you to change the task or vma node-policy, separate
  from (but restricted by) cpuset.mems
- there are some policies like interleave which provide (ALL) options,
  which create, basically, a nodemask=nodes[N_MEMORY] scenario
- this is entirely controllable via userspace
- there exists a lot of software out there which makes use of this
  interface via numactl syscalls (set_mempolicy, mbind, etc)
- there is a global "default" mempolicy which is leveraged when
  task->mempolicy=NULL or vma->vm_policy=NULL. The default policy is
  essentially "allocate from the local node, but fall back to any
  possible node as needed"

During my initial explorations I started by looking at whether a filter
function could be implemented via the global policy. It should be
somewhat obvious that this falls apart completely as soon as you find
that the page allocator actually filters using cpusets. So mempolicies
are dead as a candidate for any real isolation mechanism. A mempolicy is
nothing more than a suggestion at best, and is actually explicitly
ignored by things like reclaim. (cough: Mempolicy is dead, long live
Memory Policy)

I was also very worried about introducing an SPM Node solution which
presented as an isolation mechanism... and then immediately crashed and
burned when deployed by anyone already using numactl.

I have since, however, been experimenting with how you might enable
mempolicy to include SPM nodes more explicitly (with the GFP flag).
(Attached at the end, completely untested, just conceptual.)

=== 4) GFP_SPM_NODE

Once the filtering functions are in place (sysram_nodes), we've hit a
point where absolutely nothing can actually touch those nodes at all.
So that was requirement #1... but of course we do actually want to
allocate this memory, that's the point.

But now we have a choice... If a node is present in the nodemask, we
can:

1) filter it based on sysram_nodes
   a) cpuset.sysram_nodes, or
   b) mt_sysram_nodes
or
2) filter it based on mems_allowed
   a) cpuset.effective_mems, or
   b) nodes[N_MEMORY]

The first choice is "Hard Guardrails" - it requires both an explicit
mask AND the GFP flag to reach SPM memory. The second choice is "Soft
Guardrails" - more or less any nodemask is allowed, and we trust the
callers to be sane. (Both are sketched below.)
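In get_page_from_freelist() terms, the two options look roughly like
this (illustrative pseudo-logic, not the literal diff from the series):

	/* Option 1, "Hard": default-deny. The GFP flag is the only way
	 * to widen the filter from sysram_nodes to mems_allowed. */
	if (!(gfp_mask & __GFP_SPM_NODE) &&
	    !node_isset(zone_to_nid(zone), mt_sysram_nodes))
		continue;	/* skip SPM zones unless explicitly flagged */

	/* Option 2, "Soft": any explicit nodemask is trusted; only
	 * implicit (nodemask=NULL) allocations get the sysram filter. */
	if (!ac->nodemask &&
	    !node_isset(zone_to_nid(zone), mt_sysram_nodes))
		continue;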
The cpuset filter functions already had a gfp argument, by the way:

    bool cpuset_current_node_allowed(int node, gfp_t gfp_mask) {...}

I chose the former for the first pass due to the mempolicy section
above. If someone has an idea of how to apply this filtering logic
WITHOUT the GFP flag, I am absolutely open to suggestions. My only other
idea was separate alloc_spm_pages() interfaces, and that just felt bad.

~Gregory

--------------- mempolicy extension ----------

mempolicy: add MPOL_F_SPM_NODE

Add a way for mempolicies to access SPM nodes. Require
MPOL_F_STATIC_NODES to prevent the policy mask from being remapped onto
other nodes.

Note: This doesn't work as-is because mempolicies are restricted by
cpuset.sysram_nodes instead of cpuset.mems_allowed, so the nodemask will
be rejected. This can be changed in the new/rebind mempolicy interfaces.

Signed-off-by: Gregory Price

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 8fbbe613611a..c26aa8fb56d3 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -31,6 +31,7 @@ enum {
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
 #define MPOL_F_NUMA_BALANCING	(1 << 13) /* Optimize with NUMA balancing if possible */
+#define MPOL_F_SPM_NODE		(1 << 12) /* Nodemask contains SPM Nodes */
 
 /*
  * MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
diff --git a/mm/memory.c b/mm/memory.c
index b59ae7ce42eb..7097d7045954 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3459,8 +3459,14 @@ static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
 {
 	struct file *vm_file = vma->vm_file;
 
-	if (vm_file)
-		return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
+	if (vm_file) {
+		gfp_t gfp;
+		gfp = mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
+		if (vma->vm_policy)
+			gfp |= (vma->vm_policy->flags & MPOL_F_SPM_NODE) ?
+				__GFP_SPM_NODE : 0;
+		return gfp;
+	}
 
 	/*
 	 * Special mappings (e.g. VDSO) do not have any file so fake
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e1e8a1f3e1a2..2b4d23983ef8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1652,6 +1652,8 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
+	if ((*flags & MPOL_F_SPM_NODE) && !(*flags & MPOL_F_STATIC_NODES))
+		return -EINVAL;
 	if (*flags & MPOL_F_NUMA_BALANCING) {
 		if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
 			*flags |= (MPOL_F_MOF | MPOL_F_MORON);
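For reference, a hypothetical userspace consumer of the above would look
roughly like this (untested; assumes the flag value from the patch and
that node 3 has been set up as an SPM node):

#include <numaif.h>

/* From the conceptual patch above; not in upstream headers. */
#ifndef MPOL_F_SPM_NODE
#define MPOL_F_SPM_NODE (1 << 12)
#endif

int bind_to_spm_node(void)
{
	unsigned long nodemask = 1UL << 3;	/* assume node 3 is SPM */

	/* MPOL_F_STATIC_NODES is required so the mask isn't remapped. */
	return set_mempolicy(MPOL_BIND | MPOL_F_STATIC_NODES | MPOL_F_SPM_NODE,
			     &nodemask, sizeof(nodemask) * 8);
}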

