On Mon, Mar 11, 2024 at 11:39:11AM +0100, Cédric Le Goater wrote:
> On 3/8/24 15:55, ank...@nvidia.com wrote:
> > From: Ankit Agrawal <ank...@nvidia.com>
> > 
> > There are upcoming devices which allow the CPU to access their memory
> > cache-coherently. It is sensible to expose such memory to the OS as NUMA
> > nodes separate from the sysmem node. The ACPI spec provides a scheme in
> > SRAT called Generic Initiator Affinity Structure [1] to allow an
> > association between a Proximity Domain (PXM) and a Generic Initiator (GI)
> > (e.g. heterogeneous processors and accelerators, GPUs, and I/O devices
> > with integrated compute or DMA engines).
> > 
> > While a single node per device may cover several use cases, it is
> > insufficient for full utilization of the NVIDIA GPUs' MIG
> > (Multi-Instance GPU) [2] feature. The feature allows partitioning of the
> > GPU device resources (including device memory) into several (up to 8)
> > isolated instances. Each memory partition requires a dedicated NUMA
> > node to operate. The partitions are not fixed; they can be created or
> > deleted at runtime.
> > 
> > Linux does not provide a means to dynamically create/destroy NUMA nodes,
> > and implementing such a feature is expected to be non-trivial. The nodes
> > that the OS discovers at boot time while parsing SRAT remain fixed. So we
> > utilize the GI Affinity structures, which allow association between nodes
> > and devices. Multiple GI structures per device/BDF are possible, allowing
> > creation of multiple nodes in the VM by exposing a unique PXM in each of
> > these structures.
> > 
> > Implement the mechanism to build the GI affinity structures, as QEMU
> > currently does not. Introduce a new acpi-generic-initiator object to allow
> > the host admin to link a device with an associated NUMA node. QEMU
> > maintains this association and uses this object to build the requisite GI
> > Affinity Structure.
> > 
> > When multiple NUMA nodes are associated with a device, it is required to
> > create that many acpi-generic-initiator objects, each representing
> > a unique device:node association.
> > 
> > The following is a decoded GI affinity structure from the VM ACPI SRAT:
> > [0C8h 0200   1]                Subtable Type : 05 [Generic Initiator Affinity]
> > [0C9h 0201   1]                       Length : 20
> > 
> > [0CAh 0202   1]                    Reserved1 : 00
> > [0CBh 0203   1]           Device Handle Type : 01
> > [0CCh 0204   4]             Proximity Domain : 00000007
> > [0D0h 0208  16]                Device Handle : 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00
> > [0E0h 0224   4]        Flags (decoded below) : 00000001
> >                                       Enabled : 1
> > [0E4h 0228   4]                    Reserved2 : 00000000
> > 
> > [0E8h 0232   1]                Subtable Type : 05 [Generic Initiator Affinity]
> > [0E9h 0233   1]                       Length : 20
> > 
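For readers decoding the dump above, the 32-byte (0x20) entry can be reproduced with a short Python sketch. The field order and sizes are taken from the decoded entry. Splitting the 16-byte device handle into a 2-byte PCI segment, a 2-byte BDF and 12 reserved bytes is an assumption based on the PCI device handle format in the ACPI spec, and `pack_gi_affinity` is a hypothetical helper, not QEMU code.

```python
import struct

# Layout read off the decoded SRAT entry (little-endian, no padding):
# type(1) length(1) reserved1(1) handle_type(1) pxm(4) handle(16) flags(4) reserved2(4)
# The 16-byte handle is assumed to be segment(2) + BDF(2) + 12 reserved bytes.
GI_FORMAT = "<BBBBIHH12sII"


def pack_gi_affinity(pxm, segment, bus, dev, fn, enabled=True):
    """Hypothetical helper: build one Generic Initiator Affinity entry."""
    bdf = (bus << 8) | (dev << 3) | fn  # standard PCI bus/device/function encoding
    return struct.pack(
        GI_FORMAT,
        5,                           # Subtable Type: Generic Initiator Affinity
        struct.calcsize(GI_FORMAT),  # Length: 32 bytes (0x20)
        0,                           # Reserved1
        1,                           # Device Handle Type: 1 (PCI)
        pxm,                         # Proximity Domain
        segment,                     # PCI segment
        bdf,                         # PCI bus/device/function
        bytes(12),                   # reserved part of the handle
        1 if enabled else 0,         # Flags: bit 0 = Enabled
        0,                           # Reserved2
    )


# A device at slot 04.0 on bus 0, linked to proximity domain 7.
entry = pack_gi_affinity(pxm=7, segment=0, bus=0, dev=4, fn=0)
print(entry.hex(" "))
```

With these inputs the handle bytes begin `00 00 20 00`, which is consistent with the dump if the entry describes a device at addr=04.0 on bus 0.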
> > On Grace Hopper systems, an admin will create a range of 8 nodes and
> > associate them with the device using the acpi-generic-initiator object.
> > While a configuration of fewer than 8 nodes per device is allowed, such a
> > configuration will prevent utilization of the feature to the fullest. This
> > setting is applicable to all the Grace+Hopper systems. The following is an
> > example of the QEMU command line arguments to create 8 nodes and link them
> > to the device 'dev0':
> > 
> > -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> > -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> > -numa node,nodeid=8 -numa node,nodeid=9 \
> > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> > -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
> > -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
> > -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
> > -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
> > -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
> > -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
> > -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
> > -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9
> > 
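Since the eight -numa/-object pairs above differ only in their indices, they can be generated rather than typed by hand. The following is a minimal Python sketch; `gi_args` is a hypothetical helper, and the id naming (gi0..gi7) simply mirrors the example. The -device line is not generated and still has to be supplied separately.

```python
def gi_args(dev_id, first_node, count=8):
    """Hypothetical helper: emit -numa/-object arguments for `count` nodes."""
    nodes = range(first_node, first_node + count)
    args = []
    # One NUMA node per potential partition.
    for node in nodes:
        args += ["-numa", f"node,nodeid={node}"]
    # One acpi-generic-initiator object per device:node association.
    for i, node in enumerate(nodes):
        args += [
            "-object",
            f"acpi-generic-initiator,id=gi{i},pci-dev={dev_id},node={node}",
        ]
    return args


args = gi_args("dev0", first_node=2)
# Print one option and its value per line.
for i in range(0, len(args), 2):
    print(args[i], args[i + 1])
```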
> > The performance benefits can be realized by providing the NUMA node
> > distances appropriately (through libvirt tags or QEMU parameters). The
> > admin can get the distance among nodes in hardware using `numactl -H`.
> > 
> > This series goes along with the recently added vfio-pci variant driver [3].
> > 
> > Applied over v8.2.2
> > base commit: 11aa0b1ff115b86160c4d37e7c37e6a6b13b77ea
> > 
> > [1] ACPI Spec 6.3, Section 5.2.16.6
> > Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu [2]
> > Link: https://lore.kernel.org/all/20240220115055.23546-1-ank...@nvidia.com/ [3]
> > 
> > Link for v8:
> > Link: https://lore.kernel.org/all/20240306123317.4691-1-ank...@nvidia.com/
> 
> v9 looks ready for QEMU 9.0. An Ack from the ACPI supporters is missing
> though.
> 
> Michal, Igor, Ani,
> 
> Did you have time to take a look ?
> 
> Thanks
> 
> C.

I tagged it already.

> 
> 
> > v8 -> v9
> > - Removed unused included headers based on Jonathan's suggestion.
> > - Collected Reviewed-by from Jonathan.
> > - Added acpi-generic-initiator support for i386
> > - Moved HMAT change from patch 1/2 to 2/3.
> > - Fixed nits.
> > 
> > v7 -> v8
> > - Replaced the code to collect the acpi-generic-initiator objects
> >    with the code to use recursive helper object_child_foreach_recursive
> >    based on suggestion from Jonathan Cameron.
> > - Added sanity check for the node id passed to the
> >    acpi-generic-initiator object.
> > - Added change to use GI as HMAT initiator as per Jonathan's suggestion.
> > - Fixed nits pointed out by Marcus and Jonathan.
> > - Collected Marcus' Acked-by.
> > - Rebased to v8.2.2.
> > 
> > v6 -> v7
> > - Updated code and the commit message to make acpi-generic-initiator
> >    define a 1:1 relationship between device and node based on
> >    Jonathan Cameron's suggestion.
> > - Updated commit message to include the decoded GI entry in the SRAT.
> > - Rebased to v8.2.1.
> > 
> > v5 -> v6
> > - Updated commit message for the [1/2] and the cover letter.
> > - Updated the acpi-generic-initiator object comment description for
> >    clarity on the input host-nodes.
> > - Rebased to v8.2.0-rc4.
> > 
> > v4 -> v5
> > - Removed acpi-dev option until full support.
> > - The NUMA nodes are saved as bitmap instead of uint16List.
> > - Replaced asserts with exit calls.
> > - Addressed other miscellaneous comments.
> > 
> > v3 -> v4
> > - changed the ':' delimited way to a uint16 array to communicate the
> > nodes associated with the device.
> > - added asserts to handle invalid inputs.
> > - addressed other miscellaneous v3 comments.
> > 
> > v2 -> v3
> > - changed param to accept a ':' delimited list of NUMA nodes, instead
> > of a range.
> > - Removed nvidia-acpi-generic-initiator object.
> > - Addressed miscellaneous comments in v2.
> > 
> > v1 -> v2
> > - Removed dependency on sysfs to communicate the feature with variant
> > module.
> > - Use GI Affinity SRAT structure instead of Memory Affinity.
> > - No DSDT entries needed to communicate the PXM for the device. SRAT GI
> > structure is used instead.
> > - New objects introduced to establish link between device and nodes.
> > 
> > Ankit Agrawal (3):
> >    qom: new object to associate device to NUMA node
> >    hw/acpi: Implement the SRAT GI affinity structure
> >    hw/i386/acpi-build: Add support for SRAT Generic Initiator structures
> > 
> >   hw/acpi/acpi_generic_initiator.c         | 148 +++++++++++++++++++++++
> >   hw/acpi/hmat.c                           |   2 +-
> >   hw/acpi/meson.build                      |   1 +
> >   hw/arm/virt-acpi-build.c                 |   3 +
> >   hw/core/numa.c                           |   3 +-
> >   hw/i386/acpi-build.c                     |   3 +
> >   include/hw/acpi/acpi_generic_initiator.h |  47 +++++++
> >   include/sysemu/numa.h                    |   1 +
> >   qapi/qom.json                            |  17 +++
> >   9 files changed, 223 insertions(+), 2 deletions(-)
> >   create mode 100644 hw/acpi/acpi_generic_initiator.c
> >   create mode 100644 include/hw/acpi/acpi_generic_initiator.h
> > 

