On Thu, Feb 06, 2025 at 10:02:25AM +0000, Shameerali Kolothum Thodi wrote:
> Hi Daniel,
> 
> > -----Original Message-----
> > From: Daniel P. Berrangé <berra...@redhat.com>
> > Sent: Friday, January 31, 2025 9:42 PM
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.th...@huawei.com>
> > Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org;
> > eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com;
> > nicol...@nvidia.com; ddut...@redhat.com; Linuxarm
> > <linux...@huawei.com>; Wangzhou (B) <wangzh...@hisilicon.com>;
> > jiangkunkun <jiangkun...@huawei.com>; Jonathan Cameron
> > <jonathan.came...@huawei.com>; zhangfei....@linaro.org
> > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> > nested SMMUv3
> > 
> > On Thu, Jan 30, 2025 at 06:09:24PM +0000, Shameerali Kolothum Thodi
> > wrote:
> > >
> > > Each "arm-smmuv3-nested" instance, when the first device gets attached
> > > to it, will create a S2 HWPT and a corresponding SMMUv3 domain in the
> > > kernel SMMUv3 driver. This domain will have a pointer representing the
> > > physical SMMUv3 that the device belongs to. And any other device which
> > > belongs to the same physical SMMUv3 can share this S2 domain.
> > 
> > Ok, so given two guest SMMUv3s, A and B, and two host SMMUv3s,
> > C and D, we could end up with A&C and B&D paired, or we could
> > end up with A&D and B&C paired, depending on whether we plug
> > the first VFIO device into guest SMMUv3 A or B.
> > 
> > This is bad.  Behaviour must not vary depending on the order
> > in which we create devices.
> > 
> > A guest SMMUv3 is paired to a guest PXB. A guest PXB is liable
> > to be paired to a guest NUMA node. A guest NUMA node is liable
> > to be paired to a host NUMA node. The guest/host SMMU pairing
> > must be chosen such that it makes conceptual sense wrt the
> > guest PXB NUMA to host NUMA pairing.
> > 
> > If the kernel picks guest<->host SMMU pairings on a first-device
> > first-paired basis, this can end up with incorrect guest NUMA
> > configurations.
> 
> Ok. I am trying to understand how this can happen, as I assume the
> guest PXB NUMA node is chosen based on whatever device we are
> attaching to it, i.e. based on the NUMA node that device belongs to
> on the physical host.
> 
> And the physical SMMUv3's NUMA node will be the same as that of the
> device it is associated with, won't it?
> 
> For example I have a system here, that has 8 phys SMMUv3s and numa
> assignments on this is something like below,
> 
> Phys SMMUv3.0 --> node 0
>   \..dev1 --> node 0
> Phys SMMUv3.1 --> node 0
>   \..dev2 --> node 0
> Phys SMMUv3.2 --> node 0
> Phys SMMUv3.3 --> node 0
> 
> Phys SMMUv3.4 --> node 1
> Phys SMMUv3.5 --> node 1
>   \..dev5 --> node 1
> Phys SMMUv3.6 --> node 1
> Phys SMMUv3.7 --> node 1
> 
> 
> If I have to assign, say, dev1, dev2 and dev5 to a guest, we need to
> specify 3 "arm-smmuv3-accel" instances, as they belong to different
> phys SMMUv3s.
> 
> -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0 \
> -device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=0 \
> -device pxb-pcie,id=pcie.3,bus_nr=3,bus=pcie.0,numa_id=1 \
> -device arm-smmuv3-accel,id=smmuv1,bus=pcie.1 \
> -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2 \
> -device arm-smmuv3-accel,id=smmuv3,bus=pcie.3 \
> -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \
> -device pcie-root-port,id=pcie.port2,bus=pcie.3,chassis=2 \
> -device pcie-root-port,id=pcie.port3,bus=pcie.2,chassis=3 \
> -device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0 \
> -device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0 \
> -device vfio-pci,host=0000:dev5,bus=pcie.port3,iommufd=iommufd0
> 
> So I guess even if we don't specify the physical SMMUv3 association
> explicitly, the kernel will work that out based on the devices the
> guest SMMUv3 is attached to (and hence the NUMA association), right?

It isn't about checking the devices, it is about the same guest SMMU
getting differing host SMMU associations depending on device ordering.

> In other words, how does an explicit association help us here?
> 
> Or is it that the Guest PXB numa_id allocation is not always based
> on device numa_id?

Let's simplify to 2 SMMUs for shorter CLIs.

To start with, assume a physical host with two SMMUs, and
two PCI devices we want to assign:

  0000:dev1 - associated with host SMMU 1, and host NUMA node 0
  0000:dev2 - associated with host SMMU 2, and host NUMA node 1
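As a hedged aside, the host-side association can be inspected via standard
Linux sysfs (the BDF below is a placeholder; the exact SMMU device names
vary per system):

```shell
# Placeholder BDF for dev1 -- substitute the real address.
BDF=0000:75:00.0

# NUMA node the PCI device belongs to (-1 if unknown).
cat /sys/bus/pci/devices/$BDF/numa_node

# The IOMMU (physical SMMUv3) the device sits behind; the 'iommu'
# symlink points at the SMMU's sysfs device when one is present.
readlink /sys/bus/pci/devices/$BDF/iommu
```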

So now we configure QEMU like this:

 -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0
 -device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=1
 -device arm-smmuv3-accel,id=smmuv1,bus=pcie.1
 -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
 -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1
 -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2
 -device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0
 -device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0

For brevity I'm not going to show the config for host/guest NUMA mappings,
but assume that guest NUMA node 0 has been configured to map to host NUMA
node 0 and guest node 1 to host node 1.
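For concreteness, that omitted guest/host NUMA wiring could look something
like the following sketch, using QEMU's memory-backend host-node binding
(sizes and IDs are illustrative):

```shell
# Pin each guest NUMA node's memory to the matching host node.
-object memory-backend-ram,id=mem0,size=2G,host-nodes=0,policy=bind \
-object memory-backend-ram,id=mem1,size=2G,host-nodes=1,policy=bind \
-numa node,nodeid=0,memdev=mem0 \
-numa node,nodeid=1,memdev=mem1
```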

In this order of QEMU CLI args we get

  VFIO device 0000:dev1 causes the kernel to associate guest smmuv1 with
  host SMMU 1.

  VFIO device 0000:dev2 causes the kernel to associate guest smmuv2 with
  host SMMU 2.

Now consider swapping the ordering of the VFIO devices on the QEMU CLI:

 -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0
 -device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=1
 -device arm-smmuv3-accel,id=smmuv1,bus=pcie.1
 -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
 -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1
 -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2
 -device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0
 -device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0

In this order of QEMU CLI args we get

  VFIO device 0000:dev2 causes the kernel to associate guest smmuv1 with
  host SMMU 2.

  VFIO device 0000:dev1 causes the kernel to associate guest smmuv2 with
  host SMMU 1.

This is broken, as now we have inconsistent NUMA mappings between host
and guest: 0000:dev2 is associated with a PXB on NUMA node 1, but is
handled by a guest SMMU that was paired with a PXB on NUMA node 0.

This is because the kernel is doing first-come first-matched logic for
mapping guest and host SMMUs, and thus is sensitive to the ordering of
the VFIO devices on the CLI. We need to be ordering-invariant, which
means libvirt must tell QEMU which host and guest SMMUs to pair
together, and QEMU must in turn tell the kernel.
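One way that explicit pairing could surface on the CLI is sketched below.
Note the "host-smmu" property is purely hypothetical, not an existing QEMU
option; it only illustrates the idea of naming the host SMMU up front:

```shell
# Hypothetical sketch: pin each guest SMMU to a named host SMMU, so the
# pairing no longer depends on VFIO device ordering.
# 'host-smmu=' is NOT a real property today.
-device arm-smmuv3-accel,id=smmuv1,bus=pcie.1,host-smmu=smmu3.0x0000000140000000 \
-device arm-smmuv3-accel,id=smmuv2,bus=pcie.2,host-smmu=smmu3.0x0000000148000000
```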

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|