On Wed, 6 Mar 2024 12:33:16 +0000 <ank...@nvidia.com> wrote: > From: Ankit Agrawal <ank...@nvidia.com> > > NVIDIA GPU's support MIG (Mult-Instance GPUs) feature [1], which allows > partitioning of the GPU device resources (including device memory) into > several (upto 8) isolated instances. Each of the partitioned memory needs > a dedicated NUMA node to operate. The partitions are not fixed and they > can be created/deleted at runtime. > > Unfortunately Linux OS does not provide a means to dynamically create/destroy > NUMA nodes and such feature implementation is not expected to be trivial. The > nodes that OS discovers at the boot time while parsing SRAT remains fixed. So > we utilize the Generic Initiator Affinity structures that allows association > between nodes and devices. Multiple GI structures per BDF is possible, > allowing creation of multiple nodes by exposing unique PXM in each of these > structures. > > Implement the mechanism to build the GI affinity structures as Qemu currently > does not. Introduce a new acpi-generic-initiator object to allow host admin > link a device with an associated NUMA node. Qemu maintains this association > and use this object to build the requisite GI Affinity Structure. > > When multiple NUMA nodes are associated with a device, it is required to > create those many number of acpi-generic-initiator objects, each representing > a unique device:node association. > > Following is one of a decoded GI affinity structure in VM ACPI SRAT. > [0C8h 0200 1] Subtable Type : 05 [Generic Initiator Affinity] > [0C9h 0201 1] Length : 20 > > [0CAh 0202 1] Reserved1 : 00 > [0CBh 0203 1] Device Handle Type : 01 > [0CCh 0204 4] Proximity Domain : 00000007 > [0D0h 0208 16] Device Handle : 00 00 20 00 00 00 00 00 00 00 > 00 > 00 00 00 00 00 > [0E0h 0224 4] Flags (decoded below) : 00000001 > Enabled : 1 > [0E4h 0228 4] Reserved2 : 00000000 > > [0E8h 0232 1] Subtable Type : 05 [Generic Initiator Affinity] > [0E9h 0233 1] Length : 20 > > An admin can provide a range of acpi-generic-initiator objects, each > associating a device (by providing the id through pci-dev argument) > to the desired NUMA node (using the node argument). Currently, only PCI > device is supported. > > For the grace hopper system, create a range of 8 nodes and associate that > with the device using the acpi-generic-initiator object. While a configuration > of less than 8 nodes per device is allowed, such configuration will prevent > utilization of the feature to the fullest. The following sample creates 8 > nodes per PCI device for a VM with 2 PCI devices and link them to the > respecitve PCI device using acpi-generic-initiator objects: > > -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \ > -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \ > -numa node,nodeid=8 -numa node,nodeid=9 \ > -device > vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \ > -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \ > -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \ > -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \ > -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \ > -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \ > -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \ > -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \ > -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \ > > -numa node,nodeid=10 -numa node,nodeid=11 -numa node,nodeid=12 \ > -numa node,nodeid=13 -numa node,nodeid=14 -numa node,nodeid=15 \ > -numa node,nodeid=16 -numa node,nodeid=17 \ > -device > vfio-pci-nohotplug,host=0009:01:01.0,bus=pcie.0,addr=05.0,rombar=0,id=dev1 \ > -object acpi-generic-initiator,id=gi8,pci-dev=dev1,node=10 \ > -object acpi-generic-initiator,id=gi9,pci-dev=dev1,node=11 \ > -object acpi-generic-initiator,id=gi10,pci-dev=dev1,node=12 \ > -object acpi-generic-initiator,id=gi11,pci-dev=dev1,node=13 \ > -object acpi-generic-initiator,id=gi12,pci-dev=dev1,node=14 \ > -object acpi-generic-initiator,id=gi13,pci-dev=dev1,node=15 \ > -object acpi-generic-initiator,id=gi14,pci-dev=dev1,node=16 \ > -object acpi-generic-initiator,id=gi15,pci-dev=dev1,node=17 \ > > The performance benefits can be realized by providing the NUMA node distances > appropriately (through libvirt tags or Qemu params). The admin can get the > distance among nodes in hardware using `numactl -H`.
That's a lot of description when you could just have claimed you want a normal GI node for HMAT and we'd have all believed you ;) > > Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu [1] > Cc: Jonathan Cameron <qemu-devel@nongnu.org> > Cc: Alex Williamson <alex.william...@redhat.com> > Cc: Markus Armbruster <arm...@redhat.com> > Acked-by: Markus Armbruster <arm...@redhat.com> > Signed-off-by: Ankit Agrawal <ank...@nvidia.com> Hi Ankit, Some minor things inline. With the includes tidied up. Reviewed-by: Jonathan Cameron <jonathan.came...@huawei.com> > diff --git a/include/hw/acpi/acpi_generic_initiator.h > b/include/hw/acpi/acpi_generic_initiator.h > new file mode 100644 > index 0000000000..23d0b591c6 > --- /dev/null > +++ b/include/hw/acpi/acpi_generic_initiator.h > @@ -0,0 +1,32 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved > + */ > + > +#ifndef ACPI_GENERIC_INITIATOR_H > +#define ACPI_GENERIC_INITIATOR_H > + > +#include "hw/mem/pc-dimm.h" Why? > +#include "hw/acpi/bios-linker-loader.h" > +#include "hw/acpi/aml-build.h" > +#include "sysemu/numa.h" This should only include headers that it uses directly. If they are needed down in the c files, then include them there. Most of these are not related to what you have here. > +#include "qemu/uuid.h" > +#include "qom/object.h" > +#include "qom/object_interfaces.h" > + > +#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator" > + > +typedef struct AcpiGenericInitiator { > + /* private */ > + Object parent; > + > + /* public */ > + char *pci_dev; > + uint16_t node; > +} AcpiGenericInitiator; > + > +typedef struct AcpiGenericInitiatorClass { > + ObjectClass parent_class; > +} AcpiGenericInitiatorClass; Trivial, but you could push the class definition down into the c file given it's not accessed from anywhere else. > + > +#endif