Hi Eric,

On Mon, Jan 16, 2023 at 07:47:09AM -0500, Eric Auger wrote:
[...]
> once we attempt to plug such devices downstream to the pcie-to-pci
> bridge, those devices are put in a singleton group. The pcie-to-pci
> bridge disappears from the radar (not attached to any group), and the
> pcie root port the pcie-to-pci bridge is plugged onto is put in a
> separate singleton group. So the topology is wrong and definitively
> different from the one observed with the intel iommu.
> 
> I suspect some other issue in the viot/pci probing on guest kernel
> side. I would appreciate on that kernel part.
> 
> qemu command excerpt:
> for x86_64:
> 
> -device ioh3420,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
> -device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
> -netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
> -device e1000,netdev=mynet0,bus=pcie_pci_bridge1
> 
> guest pci topology:
> 
> -[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
>            +-01.0  Device 1234:1111
>            +-02.0-[01-02]----00.0-[02]----00.0  Intel Corporation 82540EM 
> Gigabit Ethernet Controller
> 
> wrong guest topology:
> /sys/kernel/iommu_groups/2/devices/0000:00:02.0 (pcie root port)
> /sys/kernel/iommu_groups/1/devices/0000:02:00.0 (E1000)
> no group for 0000:01:00:0 (the pcie-to-pci bridge)

This highlights several problems in the kernel:


(1) The pcie-to-pci bridge doesn't get an IOMMU group. That's a bug in the
VIOT driver, where we use the wrong struct device when dealing with PCI
aliases. I'll send a fix.


(2) e1000 gets a different group than the pcie-to-pci bridge, and than
other devices on that bus. This wrongly enables assigning aliased devices
to different VMs.

It's not specific to virtio-iommu, I can get the same result with SMMUv3,
both DT and ACPI:

qemu-system-aarch64 -M virt,gic-version=3,its=on,iommu=smmuv3 -cpu max -m 4G 
-smp 8 -kernel Image -initrd initrd -append console=ttyAMA0 -nographic \
        -device pcie-root-port,chassis=1,id=pcie.1,bus=pcie.0 \
        -device pcie-pci-bridge,id=pcie.2,bus=pcie.1 \
        -device e1000,netdev=net0,bus=pcie.2,addr=1.0 -netdev user,id=net0 \
        -device e1000,netdev=net1,bus=pcie.2,addr=2.0 -netdev user,id=net1

I think the logic in pci_device_group() expects the devices to be probed
in a specific top-down order, so that when we get to a device, all its
parents already have a group. Starting with arbitrary devices would
require walking down and across the tree to find aliases which is too
complex.

As you noted Intel IOMMU doesn't have this problem, because probing
happens in the expected order. The driver loads at the right time and
bus_iommu_probe() ends up calling pci_device_group() in the right order.
For virtio-iommu and SMMU, the drivers are subject to probe deferral, and
devices drivers are bound kinda randomly by the driver core. In that case
it's acpi/of_iommu_configure() that calls pci_device_group().

I'm not sure what the best approach is to fix this. It seems wrong to rely
on previous driver probes in pci_device_group() at all, because even if
you probe in the right order, you could build a kernel without the PCIe
port driver and the aliased endpoints will still be put in different
groups. Maybe pci_device_group(), when creating a new group, should
immediately assign that group to all parents that alias the device?
Or maybe we can reuse the RID alias infrastructure used by quirks.


(3) with the SMMU, additional devices on the PCI bus can't get an IOMMU
group:

[    2.019587] e1000 0000:02:01.0: Adding to iommu group 0
[    2.389750] e1000 0000:02:02.0: stream 512 already in tree

Due to cdf315f907d4 ("iommu/arm-smmu-v3: Maintain a SID->device
structure"), the SMMUv3 driver doesn't support RID aliasing anymore.
The structure needs to be extended with multiple devices per SID, in which
case the fault handler will choose an arbitrary device associated to the
faulting SID.


(4) When everything else works, on Arm ACPI the PCIe root port is put in a
separate group due to ACS being enabled, but on other platforms (including
Arm DT), the root port is in the same group. I haven't looked into that.


> with intel iommu we get the following topology:
> /sys/kernel/iommu_groups/2/devices/0000:02:00.0
> /sys/kernel/iommu_groups/2/devices/0000:01:00.0
> /sys/kernel/iommu_groups/2/devices/0000:00:02.0
> 
> on aarch64 we get the same using those devices:
> -device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
> -device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
> -netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
> -device e1000,netdev=mynet0,bus=pcie_pci_bridge1
> 
> Signed-off-by: Eric Auger <eric.au...@redhat.com>
> ---
>  hw/virtio/virtio-iommu.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 23c470977e..58c367b744 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -178,6 +178,21 @@ static IOMMUMemoryRegion *virtio_iommu_mr(VirtIOIOMMU 
> *s, uint32_t sid)
>          dev = iommu_pci_bus->pbdev[devfn];
>          if (dev) {
>              return &dev->iommu_mr;
> +        } else { /* check possible aliasing */
> +            PCIBus *pbus = iommu_pci_bus->bus;
> +
> +            if (!pci_bus_is_express(pbus)) {
> +                PCIDevice *parent = pbus->parent_dev;
> +
> +                if (pci_is_express(parent) &&
> +                    pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> +                    devfn = PCI_DEVFN(0, 0);
> +                    dev = iommu_pci_bus->pbdev[devfn];
> +                    if (dev) {
> +                        return &dev->iommu_mr;
> +                    }
> +                }
> +            }

Yes, I think that makes sense. Maybe the comment should refer to
pci_device_iommu_address_space(), which explains what is happening. Since
this exactly mirrors what that function does, do we also need the case
where the parent is not PCIe?  In that case the bus is the parent and
devfn is that of the bridge

Thanks,
Jean

Reply via email to