Hi Eric,

On Mon, Jan 16, 2023 at 07:47:09AM -0500, Eric Auger wrote:
[...]
> once we attempt to plug such devices downstream to the pcie-to-pci
> bridge, those devices are put in a singleton group. The pcie-to-pci
> bridge disappears from the radar (not attached to any group), and the
> pcie root port the pcie-to-pci bridge is plugged onto is put in a
> separate singleton group. So the topology is wrong and definitively
> different from the one observed with the intel iommu.
>
> I suspect some other issue in the viot/pci probing on guest kernel
> side. I would appreciate on that kernel part.
>
> qemu command excerpt:
> for x86_64:
>
> -device ioh3420,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
> -device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
> -netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
> -device e1000,netdev=mynet0,bus=pcie_pci_bridge1
>
> guest pci topology:
>
> -[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
>            +-01.0  Device 1234:1111
>            +-02.0-[01-02]----00.0-[02]----00.0  Intel Corporation 82540EM Gigabit Ethernet Controller
>
> wrong guest topology:
> /sys/kernel/iommu_groups/2/devices/0000:00:02.0 (pcie root port)
> /sys/kernel/iommu_groups/1/devices/0000:02:00.0 (E1000)
> no group for 0000:01:00:0 (the pcie-to-pci bridge)
This highlights several problems in the kernel:

(1) The pcie-to-pci bridge doesn't get an IOMMU group. That's a bug in the
VIOT driver, where we use the wrong struct device when dealing with PCI
aliases. I'll send a fix.

(2) e1000 gets a different group from the pcie-to-pci bridge and from the
other devices on that bus. This wrongly enables assigning aliased devices to
different VMs. It's not specific to virtio-iommu; I can get the same result
with SMMUv3, both DT and ACPI:

qemu-system-aarch64 -M virt,gic-version=3,its=on,iommu=smmuv3 -cpu max -m 4G -smp 8 -kernel Image -initrd initrd -append console=ttyAMA0 -nographic \
 -device pcie-root-port,chassis=1,id=pcie.1,bus=pcie.0 \
 -device pcie-pci-bridge,id=pcie.2,bus=pcie.1 \
 -device e1000,netdev=net0,bus=pcie.2,addr=1.0 -netdev user,id=net0 \
 -device e1000,netdev=net1,bus=pcie.2,addr=2.0 -netdev user,id=net1

I think the logic in pci_device_group() expects the devices to be probed in a
specific top-down order, so that when we get to a device, all its parents
already have a group. Starting with arbitrary devices would require walking
down and across the tree to find aliases, which is too complex.

As you noted, the Intel IOMMU doesn't have this problem because probing
happens in the expected order: the driver loads at the right time, and
bus_iommu_probe() ends up calling pci_device_group() in the right order. For
virtio-iommu and SMMU, the drivers are subject to probe deferral, and device
drivers are bound kinda randomly by the driver core. In that case it's
acpi/of_iommu_configure() that calls pci_device_group().

I'm not sure what the best approach is to fix this. It seems wrong for
pci_device_group() to rely on previous driver probes at all, because even if
you probe in the right order, you could build a kernel without the PCIe port
driver and the aliased endpoints would still be put in different groups.
Maybe pci_device_group(), when creating a new group, should immediately
assign that group to all parents that alias the device (a rough sketch of
this follows below)? Or maybe we can reuse the RID alias infrastructure used
by quirks.

(3) With the SMMU, additional devices on the PCI bus can't get an IOMMU
group:

[    2.019587] e1000 0000:02:01.0: Adding to iommu group 0
[    2.389750] e1000 0000:02:02.0: stream 512 already in tree

Due to cdf315f907d4 ("iommu/arm-smmu-v3: Maintain a SID->device structure"),
the SMMUv3 driver doesn't support RID aliasing anymore. The structure needs
to be extended to hold multiple devices per SID, in which case the fault
handler will pick an arbitrary device associated with the faulting SID (also
sketched below).

(4) When everything else works, on Arm ACPI the PCIe root port is put in a
separate group because ACS is enabled, but on other platforms (including Arm
DT) the root port is in the same group. I haven't looked into that.
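To illustrate the pci_device_group() idea from (2), here is a rough,
untested sketch, assuming it would sit in drivers/iommu/iommu.c next to
pci_device_group(). The helper name is made up, and a real version would
also need to stop at the ACS boundary like the existing upstream walk does,
and cope with the bridge later being probed through the normal path:

/*
 * Hypothetical helper: once pci_device_group() has allocated a fresh group
 * for @pdev, walk up the bus hierarchy and attach the same group to the
 * bridges that alias @pdev, so that a later probe of those bridges (in
 * whatever order the driver core picks) finds and reuses this group
 * instead of allocating its own.
 */
static void pci_group_claim_aliasing_parents(struct pci_dev *pdev,
                                             struct iommu_group *group)
{
        struct pci_bus *bus;

        for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
                struct pci_dev *bridge = bus->self;
                struct iommu_group *existing;

                if (!bridge)
                        continue;

                existing = iommu_group_get(&bridge->dev);
                if (existing) {
                        /* Already grouped, presumably by its own probe */
                        iommu_group_put(existing);
                        break;
                }

                /* Record the alias so that probe order no longer matters */
                if (iommu_group_add_device(group, &bridge->dev))
                        break;
        }
}

pci_device_group() would call this right after iommu_group_alloc(), rather
than relying on the bridges having been probed first.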
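For (3), a minimal sketch of what "multiple devices per SID" could look
like, assuming the rbtree node added by cdf315f907d4 grows a list of masters
instead of a single back-pointer. The field names follow the existing
arm-smmu-v3 structures, but the change itself is only an assumption about
one possible shape:

/* Hypothetical: one node per SID, several masters aliasing to that SID */
struct arm_smmu_stream {
        u32                     id;
        struct list_head        masters; /* was: struct arm_smmu_master *master */
        struct rb_node          node;
};

The SID lookup used by the event handler would then return the first master
on the list, i.e. an arbitrary one of the aliased devices, which is the best
we can do when several RIDs share a SID.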
> with intel iommu we get the following topology:
> /sys/kernel/iommu_groups/2/devices/0000:02:00.0
> /sys/kernel/iommu_groups/2/devices/0000:01:00.0
> /sys/kernel/iommu_groups/2/devices/0000:00:02.0
>
> on aarch64 we get the same using those devices:
> -device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
> -device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
> -netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
> -device e1000,netdev=mynet0,bus=pcie_pci_bridge1
>
> Signed-off-by: Eric Auger <eric.au...@redhat.com>
> ---
>  hw/virtio/virtio-iommu.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 23c470977e..58c367b744 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -178,6 +178,21 @@ static IOMMUMemoryRegion *virtio_iommu_mr(VirtIOIOMMU *s, uint32_t sid)
>          dev = iommu_pci_bus->pbdev[devfn];
>          if (dev) {
>              return &dev->iommu_mr;
> +        } else { /* check possible aliasing */
> +            PCIBus *pbus = iommu_pci_bus->bus;
> +
> +            if (!pci_bus_is_express(pbus)) {
> +                PCIDevice *parent = pbus->parent_dev;
> +
> +                if (pci_is_express(parent) &&
> +                    pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> +                    devfn = PCI_DEVFN(0, 0);
> +                    dev = iommu_pci_bus->pbdev[devfn];
> +                    if (dev) {
> +                        return &dev->iommu_mr;
> +                    }
> +                }
> +            }

Yes, I think that makes sense. Maybe the comment should refer to
pci_device_iommu_address_space(), which explains what is happening.

Since this exactly mirrors what that function does, do we also need the case
where the parent is not PCIe? In that case the bus is the parent's bus and
the devfn is that of the bridge (rough, untested sketch appended below).

Thanks,
Jean
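For completeness, here is roughly what I mean for the non-PCIe parent case,
mirroring the corresponding branch of pci_device_iommu_address_space(). It
is only an untested sketch on top of your patch: I'm assuming the existing
bus-number lookup in virtio_iommu_mr() (iommu_find_iommu_pcibus(), or
whatever that helper is called) can be reused for the parent bus, and a full
version probably needs to keep walking up across several bridge levels, like
the loop in pci_device_iommu_address_space() does:

    if (!pci_bus_is_express(pbus)) {
        PCIDevice *parent = pbus->parent_dev;

        if (pci_is_express(parent) &&
            pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
            /* PCIe-to-PCI bridge: the alias is devfn 00.0 on this bus */
            devfn = PCI_DEVFN(0, 0);
            dev = iommu_pci_bus->pbdev[devfn];
        } else {
            /*
             * Conventional PCI bridge: transactions carry the bridge's
             * own RID, so retry with the parent bus and the bridge's
             * devfn.
             */
            IOMMUPciBus *parent_iommu_bus =
                iommu_find_iommu_pcibus(s, pci_bus_num(pci_get_bus(parent)));

            if (parent_iommu_bus) {
                dev = parent_iommu_bus->pbdev[parent->devfn];
            }
        }
        if (dev) {
            return &dev->iommu_mr;
        }
    }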