On Mon, 25 Dec 2023 10:26:01 +0530
<ank...@nvidia.com> wrote:

> From: Ankit Agrawal <ank...@nvidia.com>
>
> There are upcoming devices which allow the CPU to cache coherently access
> their memory. It is sensible to expose such memory as NUMA nodes separate
> from the sysmem node to the OS. The ACPI spec provides a scheme in SRAT
> called Generic Initiator Affinity Structure [1] to allow an association
> between a Proximity Domain (PXM) and a Generic Initiator (GI) (e.g.
> heterogeneous processors and accelerators, GPUs, and I/O devices with
> integrated compute or DMA engines).
>
> While a single node per device may cover several use cases, it is however
> insufficient for full utilization of the NVIDIA GPU's MIG
> (Multi-Instance GPU) [2] feature. The feature allows partitioning of the
> GPU device resources (including device memory) into several (up to 8)
> isolated instances. Each partitioned memory region requires a dedicated NUMA
> node to operate. The partitions are not fixed and they can be created/deleted
> at runtime.
>
> Linux does not provide a means to dynamically create/destroy NUMA nodes
> and implementing such a feature is expected to be non-trivial. The nodes
> that the OS discovers at boot time while parsing SRAT remain fixed. So we
> utilize the GI Affinity structures that allow association between nodes
> and devices. Multiple GI structures per device/BDF are possible, allowing
> creation of multiple nodes in the VM by exposing a unique PXM in each of these
> structures.
>
> Implement the mechanism to build the GI affinity structures as QEMU currently
> does not. Introduce a new acpi-generic-initiator object that allows an
> association of a set of nodes with a device. During SRAT creation, all such
> objects are identified and used to add the GI Affinity Structures. Currently,
> only PCI devices are supported. On a multi-device system, each device supporting
> the feature needs a unique acpi-generic-initiator object with its own set of
> NUMA nodes associated with it.
>
> The admin will create a range of 8 nodes and associate that with the device
> using the acpi-generic-initiator object. While a configuration of fewer than
> 8 nodes per device is allowed, such a configuration will prevent utilization of
> the feature to the fullest. This setting is applicable to all the Grace+Hopper
> systems. The following is an example of the QEMU command line arguments to
> create 8 nodes and link them to the device 'dev0':
>
>    -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
>    -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
>    -numa node,nodeid=8 -numa node,nodeid=9 \
>    -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>    -object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \
>

I'd find it helpful to see the resulting chunk of SRAT for these examples (disassembled) in this cover letter and the patches (where there are more examples).
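
Short of a real disassembly, the node-to-device mapping described in the cover letter can be sketched in a few lines of Python. This is an illustration only, not QEMU code and not iasl output. The names GIEntry, parse_host_nodes, gi_entries and numa_args are hypothetical, and the sketch assumes that one GI Affinity Structure is emitted per node in host-nodes, with the PXM equal to the NUMA node id.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GIEntry:
    """Logical content of one GI Affinity Structure (hypothetical model)."""
    proximity_domain: int  # assumed equal to the NUMA node id
    pci_dev: str           # id of the associated PCI device, e.g. "dev0"


def parse_host_nodes(spec):
    """Expand a node list such as "2-9" or "2,4-6" into node ids."""
    nodes = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            nodes.extend(range(int(lo), int(hi) + 1))
        else:
            nodes.append(int(part))
    return nodes


def gi_entries(pci_dev, host_nodes):
    """One entry per node: the same device, a unique PXM each."""
    return [GIEntry(n, pci_dev) for n in parse_host_nodes(host_nodes)]


def numa_args(host_nodes):
    """The -numa arguments that declare the nodes named in host_nodes."""
    return [f"-numa node,nodeid={n}" for n in parse_host_nodes(host_nodes)]


# The example from the cover letter: device 'dev0', host-nodes=2-9.
entries = gi_entries("dev0", "2-9")
for e in entries:
    print(f"GI structure: PXM={e.proximity_domain} device={e.pci_dev}")
```

For host-nodes=2-9 this yields eight entries, all pointing at dev0, with PXM values 2 through 9. A disassembled SRAT for the same command line would presumably show eight corresponding Generic Initiator Affinity Structures.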