> -----Original Message-----
> From: Daniel P. Berrangé <berra...@redhat.com>
> Sent: Thursday, February 6, 2025 10:37 AM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.th...@huawei.com>
> Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; eric.au...@redhat.com;
> peter.mayd...@linaro.org; j...@nvidia.com; nicol...@nvidia.com;
> ddut...@redhat.com; Linuxarm <linux...@huawei.com>; Wangzhou (B)
> <wangzh...@hisilicon.com>; jiangkunkun <jiangkun...@huawei.com>;
> Jonathan Cameron <jonathan.came...@huawei.com>; zhangfei....@linaro.org;
> nath...@nvidia.com
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> On Thu, Feb 06, 2025 at 10:02:25AM +0000, Shameerali Kolothum Thodi wrote:
> > Hi Daniel,
> >
> > > -----Original Message-----
> > > From: Daniel P. Berrangé <berra...@redhat.com>
> > > Sent: Friday, January 31, 2025 9:42 PM
> > > To: Shameerali Kolothum Thodi <shameerali.kolothum.th...@huawei.com>
> > > Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; eric.au...@redhat.com;
> > > peter.mayd...@linaro.org; j...@nvidia.com; nicol...@nvidia.com;
> > > ddut...@redhat.com; Linuxarm <linux...@huawei.com>; Wangzhou (B)
> > > <wangzh...@hisilicon.com>; jiangkunkun <jiangkun...@huawei.com>;
> > > Jonathan Cameron <jonathan.came...@huawei.com>; zhangfei....@linaro.org
> > > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for
> > > user-creatable nested SMMUv3
> > >
> > > On Thu, Jan 30, 2025 at 06:09:24PM +0000, Shameerali Kolothum Thodi wrote:
> > > >
> > > > Each "arm-smmuv3-nested" instance, when the first device gets attached
> > > > to it, will create a S2 HWPT and a corresponding SMMUv3 domain in the
> > > > kernel SMMUv3 driver. This domain will have a pointer representing the
> > > > physical SMMUv3 that the device belongs to. And any other device which
> > > > belongs to the same physical SMMUv3 can share this S2 domain.
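The S2-domain sharing scheme described in the quoted paragraph can be sketched as a toy model. This is illustrative only: the class, field, and function names below are hypothetical, not the actual QEMU or kernel data structures.

```python
# Toy model of the quoted behaviour: the first device attached to a
# guest SMMU instance creates the S2 HWPT and records the physical
# SMMU the device sits behind; later devices behind the same physical
# SMMU share that S2 domain, and mismatched devices are rejected.

class GuestSmmu:
    def __init__(self):
        self.s2_domain = None  # created lazily on first device attach

    def attach_device(self, phys_smmu):
        if self.s2_domain is None:
            # First attach: create the S2 HWPT, recording the physical
            # SMMU association for this guest instance.
            self.s2_domain = {"phys_smmu": phys_smmu}
        elif self.s2_domain["phys_smmu"] != phys_smmu:
            raise ValueError("device is behind a different physical SMMU")
        return self.s2_domain

g = GuestSmmu()
d1 = g.attach_device("phys-smmu-0")  # creates the S2 HWPT
d2 = g.attach_device("phys-smmu-0")  # shares the existing S2 domain
assert d1 is d2
```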
> > > >
> > > Ok, so given two guest SMMUv3s, A and B, and two host SMMUv3s,
> > > C and D, we could end up with A&C and B&D paired, or we could
> > > end up with A&D and B&C paired, depending on whether we plug
> > > the first VFIO device into guest SMMUv3 A or B.
> > >
> > > This is bad. Behaviour must not vary depending on the order
> > > in which we create devices.
> > >
> > > A guest SMMUv3 is paired to a guest PXB. A guest PXB is liable
> > > to be paired to a guest NUMA node. A guest NUMA node is liable
> > > to be paired to a host NUMA node. The guest/host SMMU pairing
> > > must be chosen such that it makes conceptual sense wrt the
> > > guest PXB NUMA to host NUMA pairing.
> > >
> > > If the kernel picks guest<->host SMMU pairings on a first-device
> > > first-paired basis, this can end up with incorrect guest NUMA
> > > configurations.
> >
> > Ok. I am trying to understand how this can happen, as I assume the
> > guest PXB NUMA node is picked based on whatever device we are
> > attaching to it, i.e. based on which numa_id that device belongs to
> > on the physical host.
> >
> > And the physical SMMUv3 NUMA id will be the same as that of the
> > device it is associated with. Isn't it?
> >
> > For example, I have a system here that has 8 phys SMMUv3s, and the
> > NUMA assignments on it are something like below:
> >
> > Phys SMMUv3.0 --> node 0
> >   \..dev1 --> node 0
> > Phys SMMUv3.1 --> node 0
> >   \..dev2 --> node 0
> > Phys SMMUv3.2 --> node 0
> > Phys SMMUv3.3 --> node 0
> >
> > Phys SMMUv3.4 --> node 1
> > Phys SMMUv3.5 --> node 1
> >   \..dev5 --> node 1
> > Phys SMMUv3.6 --> node 1
> > Phys SMMUv3.7 --> node 1
> >
> > If I have to assign, say, dev1, dev2 and dev5 to a guest, we need to
> > specify 3 "arm-smmuv3-accel" instances, as they belong to different
> > phys SMMUv3s.
> >
> > -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0 \
> > -device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=0 \
> > -device pxb-pcie,id=pcie.3,bus_nr=3,bus=pcie.0,numa_id=1 \
> > -device arm-smmuv3-accel,id=smmuv1,bus=pcie.1 \
> > -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2 \
> > -device arm-smmuv3-accel,id=smmuv3,bus=pcie.3 \
> > -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \
> > -device pcie-root-port,id=pcie.port2,bus=pcie.3,chassis=2 \
> > -device pcie-root-port,id=pcie.port3,bus=pcie.2,chassis=3 \
> > -device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0 \
> > -device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0 \
> > -device vfio-pci,host=0000:dev5,bus=pcie.port3,iommufd=iommufd0
> >
> > So I guess even if we don't specify the physical SMMUv3 association
> > explicitly, the kernel will check that based on the devices the guest
> > SMMUv3 is attached to (and hence the NUMA association), right?
>
> It isn't about checking the devices, it is about the guest SMMU
> getting differing host SMMU associations.
>
> > In other words, how does an explicit association help us here?
> >
> > Or is it that the guest PXB numa_id allocation is not always based
> > on the device numa_id?
>
> Let's simplify to 2 SMMUs for shorter CLIs.
>
> So to start with, we assume a physical host with two SMMUs, and
> two PCI devices we want to assign:
>
> 0000:dev1 - associated with host SMMU 1, and host NUMA node 0
> 0000:dev2 - associated with host SMMU 2, and host NUMA node 1
>
> So now we configure QEMU like this:
>
> -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0
> -device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=1
> -device arm-smmuv3-accel,id=smmuv1,bus=pcie.1
> -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
> -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1
> -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2
> -device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0
> -device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0
>
> For brevity I'm not going to show the config for host/guest NUMA
> mappings, but assume that guest NUMA node 0 has been configured to map
> to host NUMA node 0, and guest node 1 to host node 1.
>
> In this order of QEMU CLI args we get:
>
> VFIO device 0000:dev1 causes the kernel to associate guest smmuv1 with
> host SMMU 1.
>
> VFIO device 0000:dev2 causes the kernel to associate guest smmuv2 with
> host SMMU 2.
>
> Now consider we swap the ordering of the VFIO devices on the QEMU CLI:
>
> -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0
> -device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=1
> -device arm-smmuv3-accel,id=smmuv1,bus=pcie.1
> -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
> -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1
> -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2
> -device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0
> -device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0
>
> In this order of QEMU CLI args we get:
>
> VFIO device 0000:dev2 causes the kernel to associate guest smmuv1 with
> host SMMU 2.
>
> VFIO device 0000:dev1 causes the kernel to associate guest smmuv2 with
> host SMMU 1.
>
> This is broken, as now we have inconsistent NUMA mappings between host
> and guest. 0000:dev2 is associated with a PXB on NUMA node 1, but
> associated with a guest SMMU that was paired with a PXB on NUMA node 0.
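The failure mode Daniel describes here, a kernel that pairs guest and host SMMUs purely in first-device-attached order, can be sketched as a toy model. This is a hypothetical policy with illustrative names, not actual kernel code, and (as the reply below argues) not what actually happens.

```python
# Hypothetical "first-device first-paired" policy: the Nth VFIO device
# attached binds the Nth guest SMMU to that device's host SMMU,
# ignoring which guest bus the device was actually plugged into.

def pair_in_attach_order(guest_smmus, attach_order, host_smmu_of):
    return {g: host_smmu_of[d] for g, d in zip(guest_smmus, attach_order)}

host_smmu_of = {"dev1": "hostSMMU1", "dev2": "hostSMMU2"}

a = pair_in_attach_order(["smmuv1", "smmuv2"], ["dev1", "dev2"], host_smmu_of)
b = pair_in_attach_order(["smmuv1", "smmuv2"], ["dev2", "dev1"], host_smmu_of)

# Swapping the CLI order of the vfio-pci devices flips the pairing,
# which is exactly the order-dependent behaviour being criticized.
assert a == {"smmuv1": "hostSMMU1", "smmuv2": "hostSMMU2"}
assert b == {"smmuv1": "hostSMMU2", "smmuv2": "hostSMMU1"}
```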
Hmm.. I don't think just swapping the order will change the association
with the guest SMMU here. Because we have:

> -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2

At smmuv3-accel realize time, this will result in:

  pci_setup_iommu(primary_bus, ops, smmu_state);

And when the vfio dev realization happens:

  set_iommu_device()
    smmu_dev_set_iommu_device(bus, smmu_state, ...) --> this is where
    the guest smmuv3 --> host smmuv3 association is first established.

And any further vfio dev attached to this guest SMMU will only succeed
if it belongs to the same phys SMMU. i.e., the guest SMMU to PCI bus
association actually makes sure you have the same guest SMMU for the
device:

  smmuv2 --> pcie.2 --> (pxb-pcie, numa_id=1)
  0000:dev2 --> pcie.port2 --> pcie.2 --> smmuv2 (pxb-pcie, numa_id=1)

Hence the association of 0000:dev2 to guest smmuv2 remains the same. I
hope this is clear. And I am not sure the association can be broken in
any other way, unless the QEMU CLI assigns the dev to a different PXB.

Maybe it is that one of my earlier replies caused this confusion, that
the ordering of the VFIO devices on the QEMU CLI would affect the
association.

Thanks,
Shameer
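The argument above, that the guest SMMU a device lands on is fixed by the bus topology (dev --> root port --> pxb --> smmu), so vfio-pci CLI ordering cannot change the association, can be sketched as a toy model. All names and the lookup tables are illustrative, not QEMU internals.

```python
# Toy model: which guest SMMU a device is associated with is derived
# purely from the bus it is plugged into, so attach order cannot
# change the resulting guest<->host SMMU association.

BUS_TO_GUEST_SMMU = {"pcie.1": "smmuv1", "pcie.2": "smmuv2"}
PORT_TO_BUS = {"pcie.port1": "pcie.1", "pcie.port2": "pcie.2"}
HOST_SMMU_OF = {"dev1": "hostSMMU1", "dev2": "hostSMMU2"}

def associate(attach_order):
    assoc = {}
    for dev, port in attach_order:
        # The guest SMMU is determined by topology, not attach order.
        guest_smmu = BUS_TO_GUEST_SMMU[PORT_TO_BUS[port]]
        # First attach pins this guest SMMU to the device's host SMMU;
        # any later device on the same guest SMMU must match.
        assoc.setdefault(guest_smmu, HOST_SMMU_OF[dev])
        assert assoc[guest_smmu] == HOST_SMMU_OF[dev]
    return assoc

order_a = [("dev1", "pcie.port1"), ("dev2", "pcie.port2")]
order_b = [("dev2", "pcie.port2"), ("dev1", "pcie.port1")]

# Both CLI orderings yield the same guest<->host SMMU association.
assert associate(order_a) == associate(order_b) == {
    "smmuv1": "hostSMMU1", "smmuv2": "hostSMMU2"}
```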