On Mon, Mar 11, 2024 at 11:39:11AM +0100, Cédric Le Goater wrote:
> On 3/8/24 15:55, ank...@nvidia.com wrote:
> > From: Ankit Agrawal <ank...@nvidia.com>
> >
> > There are upcoming devices which allow the CPU to access their memory
> > cache-coherently. It is sensible to expose such memory as NUMA nodes
> > separate from the sysmem node to the OS. The ACPI spec provides a scheme
> > in SRAT called Generic Initiator Affinity Structure [1] to allow an
> > association between a Proximity Domain (PXM) and a Generic Initiator (GI)
> > (e.g. heterogeneous processors and accelerators, GPUs, and I/O devices
> > with integrated compute or DMA engines).
> >
> > While a single node per device may cover several use cases, it is
> > insufficient for full utilization of the NVIDIA GPU MIG
> > (Multi-Instance GPU) [2] feature. The feature allows partitioning of the
> > GPU device resources (including device memory) into several (up to 8)
> > isolated instances. Each memory partition requires a dedicated NUMA node
> > to operate. The partitions are not fixed and they can be created/deleted
> > at runtime.
> >
> > The Linux OS does not provide a means to dynamically create/destroy NUMA
> > nodes, and implementing such a feature is expected to be non-trivial. The
> > nodes that the OS discovers at boot time while parsing SRAT remain fixed.
> > So we utilize the GI Affinity structures, which allow association between
> > nodes and devices. Multiple GI structures per device/BDF are possible,
> > allowing creation of multiple nodes in the VM by exposing a unique PXM in
> > each of these structures.
> >
> > Implement the mechanism to build the GI affinity structures as Qemu
> > currently does not. Introduce a new acpi-generic-initiator object to allow
> > the host admin to link a device with an associated NUMA node. Qemu
> > maintains this association and uses this object to build the requisite GI
> > Affinity Structure.
> >
> > When multiple NUMA nodes are associated with a device, it is required to
> > create as many acpi-generic-initiator objects, each representing a unique
> > device:node association.
> >
> > The following is one decoded GI Affinity Structure in the VM ACPI SRAT.
> > [0C8h 0200   1]                Subtable Type : 05 [Generic Initiator Affinity]
> > [0C9h 0201   1]                       Length : 20
> >
> > [0CAh 0202   1]                    Reserved1 : 00
> > [0CBh 0203   1]           Device Handle Type : 01
> > [0CCh 0204   4]             Proximity Domain : 00000007
> > [0D0h 0208  16]                Device Handle : 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00
> > [0E0h 0224   4]        Flags (decoded below) : 00000001
> >                                      Enabled : 1
> > [0E4h 0228   4]                    Reserved2 : 00000000
> >
> > [0E8h 0232   1]                Subtable Type : 05 [Generic Initiator Affinity]
> > [0E9h 0233   1]                       Length : 20
> >
> > On Grace Hopper systems, an admin will create a range of 8 nodes and
> > associate them with the device using the acpi-generic-initiator object.
> > While a configuration of fewer than 8 nodes per device is allowed, such a
> > configuration will prevent utilization of the feature to the fullest. This
> > setting is applicable to all the Grace+Hopper systems.
> > The following is an example of the Qemu command line arguments to create
> > 8 nodes and link them to the device 'dev0':
> >
> > -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> > -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> > -numa node,nodeid=8 -numa node,nodeid=9 \
> > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> > -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
> > -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
> > -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
> > -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
> > -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
> > -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
> > -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
> > -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9
> >
> > The performance benefits can be realized by providing the NUMA node
> > distances appropriately (through libvirt tags or Qemu params). The admin
> > can get the distance among nodes in hardware using `numactl -H`.
> >
> > This series goes along with the recently added vfio-pci variant driver [3].
> >
> > Applied over v8.2.2
> > base commit: 11aa0b1ff115b86160c4d37e7c37e6a6b13b77ea
> >
> > [1] ACPI Spec 6.3, Section 5.2.16.6
> > Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu [2]
> > Link: https://lore.kernel.org/all/20240220115055.23546-1-ank...@nvidia.com/ [3]
> >
> > Link for v8:
> > Link: https://lore.kernel.org/all/20240306123317.4691-1-ank...@nvidia.com/
>
> v9 looks ready for QEMU 9.0. An Ack from the ACPI supporters is missing
> though.
>
> Michal, Igor, Ani,
>
> Did you have time to take a look?
>
> Thanks
>
> C.

I tagged it already.

> >
> > v8 -> v9
> > - Removed unused included headers based on Jonathan's suggestion.
> > - Collected Reviewed-by from Jonathan.
> > - Added acpi-generic-initiator support for i386
> > - Moved HMAT change from patch 1/2 to 2/3.
> > - Fixed nits.
> >
> > v7 -> v8
> > - Replaced the code to collect the acpi-generic-initiator objects
> >   with the code to use recursive helper object_child_foreach_recursive
> >   based on suggestion from Jonathan Cameron.
> > - Added sanity check for the node id passed to the
> >   acpi-generic-initiator object.
> > - Added change to use GI as HMAT initiator as per Jonathan's suggestion.
> > - Fixed nits pointed out by Marcus and Jonathan.
> > - Collected Marcus' Acked-by.
> > - Rebased to v8.2.2.
> >
> > v6 -> v7
> > - Updated code and the commit message to make acpi-generic-initiator
> >   define a 1:1 relationship between device and node based on
> >   Jonathan Cameron's suggestion.
> > - Updated commit message to include the decoded GI entry in the SRAT.
> > - Rebased to v8.2.1.
> >
> > v5 -> v6
> > - Updated commit message for the [1/2] and the cover letter.
> > - Updated the acpi-generic-initiator object comment description for
> >   clarity on the input host-nodes.
> > - Rebased to v8.2.0-rc4.
> >
> > v4 -> v5
> > - Removed acpi-dev option until full support.
> > - The NUMA nodes are saved as bitmap instead of uint16List.
> > - Replaced asserts with exit calls.
> > - Addressed other miscellaneous comments.
> >
> > v3 -> v4
> > - Changed the ':' delimited way to a uint16 array to communicate the
> >   nodes associated with the device.
> > - Added asserts to handle invalid inputs.
> > - Addressed other miscellaneous v3 comments.
> >
> > v2 -> v3
> > - Changed param to accept a ':' delimited list of NUMA nodes, instead
> >   of a range.
> > - Removed nvidia-acpi-generic-initiator object.
> > - Addressed miscellaneous comments in v2.
> >
> > v1 -> v2
> > - Removed dependency on sysfs to communicate the feature with variant
> >   module.
> > - Use GI Affinity SRAT structure instead of Memory Affinity.
> > - No DSDT entries needed to communicate the PXM for the device. SRAT GI
> >   structure is used instead.
> > - New objects introduced to establish link between device and nodes.
> >
> > Ankit Agrawal (3):
> >   qom: new object to associate device to NUMA node
> >   hw/acpi: Implement the SRAT GI affinity structure
> >   hw/i386/acpi-build: Add support for SRAT Generic Initiator structures
> >
> >  hw/acpi/acpi_generic_initiator.c         | 148 +++++++++++++++++++++++
> >  hw/acpi/hmat.c                           |   2 +-
> >  hw/acpi/meson.build                      |   1 +
> >  hw/arm/virt-acpi-build.c                 |   3 +
> >  hw/core/numa.c                           |   3 +-
> >  hw/i386/acpi-build.c                     |   3 +
> >  include/hw/acpi/acpi_generic_initiator.h |  47 +++++++
> >  include/sysemu/numa.h                    |   1 +
> >  qapi/qom.json                            |  17 +++
> >  9 files changed, 223 insertions(+), 2 deletions(-)
> >  create mode 100644 hw/acpi/acpi_generic_initiator.c
> >  create mode 100644 include/hw/acpi/acpi_generic_initiator.h
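
For readers following the decoded SRAT entry quoted above, here is a minimal Python sketch of how the 32-byte (Length 0x20) GI Affinity Structure could be packed and parsed. It is not the QEMU code from this series. The field order and sizes are taken from the dump in the cover letter; the names pack_gi and unpack_gi are hypothetical, and the 16-byte Device Handle is treated as opaque bytes copied from the dump rather than built from a segment/BDF.

```python
import struct

# Little-endian layout, in the order shown by the decoded dump:
# type(1) length(1) reserved1(1) handle_type(1) pxm(4) handle(16) flags(4) reserved2(4)
GI_FORMAT = "<BBBBI16sII"
GI_TYPE = 0x05      # Subtable Type: Generic Initiator Affinity
GI_LENGTH = 0x20    # 32 bytes
GI_ENABLED = 0x1    # Flags bit 0


def pack_gi(pxm, handle, handle_type=1, flags=GI_ENABLED):
    """Pack one GI Affinity Structure (hypothetical helper, for illustration)."""
    if len(handle) != 16:
        raise ValueError("device handle must be exactly 16 bytes")
    return struct.pack(GI_FORMAT, GI_TYPE, GI_LENGTH, 0,
                       handle_type, pxm, handle, flags, 0)


def unpack_gi(raw):
    """Parse one GI Affinity Structure back into its fields."""
    typ, length, _r1, handle_type, pxm, handle, flags, _r2 = struct.unpack(GI_FORMAT, raw)
    if typ != GI_TYPE or length != GI_LENGTH:
        raise ValueError("not a Generic Initiator Affinity Structure")
    return {
        "handle_type": handle_type,
        "pxm": pxm,
        "handle": handle,
        "enabled": bool(flags & GI_ENABLED),
    }


# Device Handle bytes copied verbatim from the dump: 00 00 20 00 then 12 zero bytes.
handle = bytes.fromhex("00002000") + bytes(12)
entry = pack_gi(pxm=7, handle=handle)
decoded = unpack_gi(entry)
```

With one such entry per acpi-generic-initiator object, the eight objects in the example command line would yield eight structures that differ only in the Proximity Domain field.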