On 17.11.21 15:30, Jonathan Cameron wrote:
> On Tue, 16 Nov 2021 12:11:29 +0100
> David Hildenbrand <da...@redhat.com> wrote:
>
>>>>
>>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>>>> this memory is exposed via cpu-less, special nodes. In contrast to real
>>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
>>>> like that yet, but it might just be a matter of time).
>>>
>>> I suppose some of that maybe covered by GENERIC_AFFINITY entries in SRAT
>>> some by MEMORY entries. Or nodes created dynamically like with normal
>>> hotplug memory.
>>>
>
Hi Jonathan,

> The naming of the define is unhelpful. GENERIC_AFFINITY here corresponds
> to Generic Initiator Affinity. So no good for memory. This is meant for
> representation of accelerators / network cards etc so you can get the NUMA
> characteristics for them accessing Memory in other nodes.
>
> My understanding of 'traditional' memory hotplug is that typically the
> PA into which memory is hotplugged is known at boot time whether or not
> the memory is physically present. As such, you present that in SRAT and rely
> on the EFI memory map / other information sources to know the memory isn't
> there. When it is hotplugged later the address is looked up in SRAT to
> identify the NUMA node.

In virtualized environments we use the SRAT only to indicate the
hotpluggable region (-> indicate the maximum possible PFN to the guest
OS); the actual present memory + PXM assignment is not done via SRAT.
We differ quite a lot here from actual hardware, I think.

>
> That model is less useful for more flexible entities like virtio-mem or
> indeed physical hardware such as CXL type 3 memory devices which typically
> need their own nodes.
>
> For the CXL type 3 option, currently proposal is to use the CXL table entries
> representing Physical Address space regions to work out how many NUMA nodes
> are needed and just create extra ones at boot.
> https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.st...@dwillia2-desk3.amr.corp.intel.com
>
> It's a heuristic as we might need more nodes to represent things well kernel
> side, but it's better than nothing and less effort that true dynamic node
> creation.
> If you chase through the earlier versions of Alison's patch you will find some
> discussion of that.
>
> I wonder if virtio-mem should just grow a CDAT instance via a DOE?
>
> That would make all this stuff discoverable via PCI config space rather than
> ACPI. CDAT is at:
> https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
> (nothing stops others using it though AFAIK).
>
> However, then we'd actually need either dynamic node creation in the OS, or
> some sort of reserved pool of extra nodes. Long term it may be the most
> flexible option.

I think for virtio-mem it's actually a bit simpler:

a) The user defined an empty node on the QEMU cmdline.

b) The user assigned a virtio-mem device to a node, either when
   coldplugging or hotplugging the device.

So we don't actually "hotplug" a new node; the (possible) node is already
known to QEMU right when starting up. It's just a matter of exposing that
fact to the guest OS -- similar to how we expose the maximum possible PFN
to the guest OS. It seems to boil down to an ACPI limitation.

Conceptually, virtio-mem on an empty node in QEMU is not that different
from hot/coldplugging a CPU to an empty node or hot/coldplugging a
DIMM/NVDIMM to an empty node. But I guess it all just doesn't work with
QEMU as of now.

In current x86-64 code, we define the "hotpluggable region" in
hw/i386/acpi-build.c via

    build_srat_memory(table_data, machine->device_memory->base,
                      hotpluggable_address_space_size, nb_numa_nodes - 1,
                      MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);

So we tell the guest OS "this range is hotpluggable" and "it belongs to
this node unless the device says something different". From both values
we can -- when under QEMU -- conclude the maximum possible PFN and the
maximum possible node.
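For reference, here is a simplified, standalone sketch of roughly what the
guest-side SRAT parsing does with such a memory affinity entry. It is only
loosely modeled on drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init();
the struct layout, helper names, constants and example values below are made
up for illustration and are not the actual kernel code:

/*
 * Simplified, standalone model of how a guest consumes an SRAT memory
 * affinity entry -- loosely modeled on
 * drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init().
 * Names, types and values are illustrative only.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT   12
#define MAX_NUMNODES 64

struct srat_mem_affinity {
    uint64_t base;          /* start of the range */
    uint64_t length;        /* size of the range */
    uint32_t proximity;     /* PXM stored in the entry */
    int hotpluggable;       /* MEM_AFFINITY_HOTPLUGGABLE set? */
};

static int pxm_to_node_map[MAX_NUMNODES];  /* PXM -> logical node (+1) */
static int nodes_parsed[MAX_NUMNODES];     /* nodes marked "possible" */
static uint64_t max_possible_pfn;

/* Allocate (or look up) a logical node id for a PXM. */
static int map_pxm_to_node(uint32_t pxm)
{
    static int next_node;

    if (!pxm_to_node_map[pxm]) {
        pxm_to_node_map[pxm] = ++next_node;
    }
    return pxm_to_node_map[pxm] - 1;
}

static void memory_affinity_init(const struct srat_mem_affinity *ma)
{
    uint64_t end = ma->base + ma->length;
    int node = map_pxm_to_node(ma->proximity);

    /*
     * Only the PXM stored in the entry is used: that one node becomes
     * "possible" and the maximum possible PFN grows. Nothing derives
     * "all nodes up to this one exist" from the entry.
     */
    nodes_parsed[node] = 1;
    if (((end - 1) >> PAGE_SHIFT) > max_possible_pfn) {
        max_possible_pfn = (end - 1) >> PAGE_SHIFT;
    }

    printf("SRAT: node %d, PXM %" PRIu32 " [0x%" PRIx64 "-0x%" PRIx64 "]%s\n",
           node, ma->proximity, ma->base, end - 1,
           ma->hotpluggable ? " hotplug" : "");
}

int main(void)
{
    /* One hotpluggable entry covering the device memory region (made up). */
    struct srat_mem_affinity ma = {
        .base = 0x100000000ULL,
        .length = 0x200000000ULL,
        .proximity = 3,
        .hotpluggable = 1,
    };
    int i;

    memory_affinity_init(&ma);

    for (i = 0; i < MAX_NUMNODES; i++) {
        if (nodes_parsed[i]) {
            printf("possible node: %d\n", i);
        }
    }
    printf("max_possible_pfn: 0x%" PRIx64 "\n", max_possible_pfn);
    return 0;
}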
But the latter is not what Linux does: it simply maps the last NUMA node
(indicated in the memory entry) to a PXM
(-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).

I do wonder if we could simply expose the same hotpluggable range via
multiple nodes:

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index a3ad6abd33..6c0ab442ea 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2084,6 +2084,22 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
      * providing _PXM method if necessary.
      */
     if (hotpluggable_address_space_size) {
+        /*
+         * For the guest to "know" about possible nodes, we'll indicate the
+         * same hotpluggable region to all empty nodes.
+         */
+        for (i = 0; i < nb_numa_nodes - 1; i++) {
+            if (machine->numa_state->nodes[i].node_mem > 0) {
+                continue;
+            }
+            build_srat_memory(table_data, machine->device_memory->base,
+                              hotpluggable_address_space_size, i,
+                              MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
+        }
+        /*
+         * Historically, we always indicated all hotpluggable memory to the
+         * last node -- if it was empty or not.
+         */
         build_srat_memory(table_data, machine->device_memory->base,
                           hotpluggable_address_space_size, nb_numa_nodes - 1,
                           MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);

Of course, this won't make CPU hotplug to empty nodes happy if we don't
have memory hotplug enabled for a VM. I did not check in detail if that
is valid according to ACPI -- Linux might eat it (did not try yet,
though).

-- 
Thanks,

David / dhildenb