Hi Robin,

On 3/28/19 11:56 AM, Robin Murphy wrote:
> On 28/03/2019 10:38, Auger Eric wrote:
>> Hi Alex,
>>
>> [+ Robin]
>>
>> On 3/27/19 5:37 PM, Alex Williamson wrote:
>>> On Wed, 27 Mar 2019 14:25:00 +0800
>>> Peter Xu <pet...@redhat.com> wrote:
>>>
>>>> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
>>>>> Conventional PCI buses pre-date requester IDs. An IOMMU cannot
>>>>> distinguish by devfn & bus between devices in a conventional PCI
>>>>> topology, and therefore we cannot assign them separate AddressSpaces.
>>>>> By taking this requester ID aliasing into account, QEMU better
>>>>> matches the bare metal behavior and restrictions, and enables shared
>>>>> AddressSpace configurations that are otherwise not possible with
>>>>> guest IOMMU support.
>>>>>
>>>>> For the latter case, given any example where an IOMMU group on the
>>>>> host includes multiple devices:
>>>>>
>>>>>   $ ls /sys/kernel/iommu_groups/1/devices/
>>>>>   0000:00:01.0  0000:01:00.0  0000:01:00.1
>>>>
>>>> [1]
>>>>
>>>>> If we incorporate a vIOMMU into the VM configuration, we're
>>>>> restricted to assigning only one of the endpoints to the guest,
>>>>> because a second endpoint will attempt to use a different
>>>>> AddressSpace. VFIO only supports IOMMU-group granularity at the
>>>>> container level, preventing this second endpoint from being
>>>>> assigned:
>>>>>
>>>>>   qemu-system-x86_64 -machine q35... \
>>>>>     -device intel-iommu,intremap=on \
>>>>>     -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>>>>     -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>>>>     -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>>>>
>>>>>   qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: \
>>>>>     vfio 0000:01:00.1: group 1 used in multiple address spaces
>>>>>
>>>>> However, when QEMU incorporates proper aliasing, we can make use of a
>>>>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
>>>>> provides the downstream devices with the same AddressSpace, ex:
>>>>>
>>>>>   qemu-system-x86_64 -machine q35... \
>>>>>     -device intel-iommu,intremap=on \
>>>>>     -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>>>>     -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>>>>     -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>>>>>
>>>>> While the utility of this hack may be limited, this AddressSpace
>>>>> aliasing is the correct behavior for QEMU to emulate bare metal.
>>>>>
>>>>> Signed-off-by: Alex Williamson <alex.william...@redhat.com>
>>>>
>>>> The patch looks sane to me even as a bug fix, since otherwise the DMA
>>>> address spaces used under misc kinds of PCI bridges can be wrong, so:
>>>
>>> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
>>> I'd be cautious about this if so. Eric Auger noted that he's seen an
>>> SMMU VM hit a guest kernel BUG_ON, which needs further investigation.
>>> It's not clear if it's just an untested or unimplemented scenario for
>>> SMMU to see a conventional PCI bus, or if there's something wrong in
>>> QEMU. I also haven't tested AMD IOMMU, and only VT-d to a very limited
>>> degree, thus RFC.
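Before the details below, note that all of this comes down to requester
ID arithmetic: an IOMMU indexes its translations by the 16-bit RID
(bus << 8 | devfn), and a PCIe-to-PCI bridge re-tags the traffic of its
secondary bus. A toy sketch of why the two configurations above behave
differently (hypothetical helper, with guest bus numbers taken from this
thread; not QEMU code):

#include <stdio.h>
#include <stdint.h>

/* RID as the IOMMU sees it: bus << 8 | devfn, where devfn packs
 * slot and function as slot << 3 | fn. */
static uint16_t rid(uint8_t bus, uint8_t slot, uint8_t fn)
{
    return (uint16_t)((bus << 8) | (slot << 3) | fn);
}

int main(void)
{
    /* Behind the pcie-root-port (say guest bus 1): each function tags
     * DMA with its own RID, so the vIOMMU must give the two functions
     * distinct AddressSpaces. */
    printf("root port:  %#06x vs %#06x\n", rid(1, 0, 0), rid(1, 0, 1));

    /* Behind the pcie-pci-bridge (secondary bus 2 in the guest): the
     * bridge takes ownership of conventional PCI transactions, so both
     * functions alias to (secondary bus, devfn 0) -- one RID, hence
     * one shared AddressSpace. */
    printf("pci bridge: %#06x for both functions\n", rid(2, 0, 0));
    return 0;
}

The SIDs 520/521/512 in the SMMU log further down are exactly
rid(2, 1, 0), rid(2, 1, 1) and rid(2, 0, 0).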
>>
>> So I have tracked this further and here is what I can see.
>>
>> On guest side, the 2 assigned devices that I have put downstream of
>> the pcie-to-pci bridge get an iommu_fwspec handle with 2 ids: the
>> first one corresponding to the requester ID of the device itself, and
>> the second one corresponding to the RID with the same bus number and
>> devfn=0:
>>
>>   dev0 = 0000:02:01.0
>>          0000:02:00.0
>>
>>   dev1 = 0000:02:01.1
>>          0000:02:00.0
>>
>> Then iommu_probe_device is called for 0000:02:01.0 and 0000:02:01.1.
>> Each time it iterates over the associated ids, and we end up calling
>> add_device twice for 0000:02:00.0. The second time, the arm-smmu-v3
>> driver recognizes a context is already alive for 0000:02:00.0 and
>> triggers a BUG_ON().
>
> Hmm, aliasing bridges are supposed to be handled as of commit
> 563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs") -
> what's changed since then?
In our case, the problem is not that we have the same ids[] for a given
device, but that we have 2 functions attached to the pcie-to-pci bridge
whose fwspecs have an ids[] entry in common: the one with the
subordinate bus and devfn=0.

  -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
  -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
  -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1

[    2.509381] ixgbe 0000:02:01.0: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1
[    2.512095] arm_smmu_install_ste_for_dev sid[0]=520
[    2.513524] arm_smmu_write_strtab_ent sid=520 val=1
[    2.514872] arm_smmu_write_strtab_ent sid=520 ABORT
[    2.516266] arm_smmu_install_ste_for_dev sid[1]=512
[    2.517609] arm_smmu_write_strtab_ent sid=512 val=1
[    2.518958] arm_smmu_write_strtab_ent sid=512 ABORT
[    3.301316] ixgbe 0000:02:01.1: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1
[    3.302456] arm_smmu_install_ste_for_dev sid[0]=521
[    3.303420] arm_smmu_write_strtab_ent sid=521 val=1
[    3.304408] arm_smmu_write_strtab_ent sid=521 ABORT
[    3.305442] arm_smmu_install_ste_for_dev sid[1]=512
[    3.306812] arm_smmu_write_strtab_ent sid=512 val=3758686219
[    3.308390] arm_smmu_write_strtab_ent sid=512 S1 | S2 ste_live
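To make the failure mode concrete, here is a toy model of that sequence
(illustrative C only, not the arm-smmu-v3 code; a SID is just
bus << 8 | devfn, hence the 520/521/512 above):

#include <assert.h>

#define SID(bus, devfn)  (((bus) << 8) | (devfn))

static int ste_live[1 << 16];   /* per-StreamID "STE already valid?" */

static void install_ste(int sid)
{
    /* Rough equivalent of the check that fires: a live STE is found
     * for this SID while a new configuration is being installed. */
    assert(!ste_live[sid]);
    ste_live[sid] = 1;
}

int main(void)
{
    /* 0000:02:01.0 -> fwspec ids { 520, 512 } */
    install_ste(SID(2, 0x08));   /* the function's own RID */
    install_ste(SID(2, 0x00));   /* the bridge alias, first time: OK */

    /* 0000:02:01.1 -> fwspec ids { 521, 512 } */
    install_ste(SID(2, 0x09));
    install_ste(SID(2, 0x00));   /* alias again: trips the check,
                                  * like the "ste_live" BUG_ON above */
    return 0;
}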
Thanks

Eric

>
> Robin.
>
>> At the origin of the creation of the 2 ids for each device,
>> iort_iommu_configure is called on each downstream device, which calls
>> pci_for_each_dma_alias(). We enter the
>> pci_is_pcie(tmp)/PCI_EXP_TYPE_PCI_BRIDGE code path, and
>> iort_pci_iommu_init is called with bus number 2 and devfn=0.
>>
>> Thanks
>>
>> Eric
>>>
>>>> Reviewed-by: Peter Xu <pet...@redhat.com>
>>>>
>>>> Though I have a question that confused me even before: Alex, do you
>>>> know why all the context entries of the devices in the IOMMU root
>>>> table will be programmed even if the devices are under a pcie-to-pci
>>>> bridge? I'm giving an example with above [1] to be clear: in that
>>>> case IIUC we'll program context entries for all the three devices
>>>> (00:01.0, 01:00.0, 01:00.1), but they'll point to the same IOMMU
>>>> table. DMAs of devices 01:00.0 and 01:00.1 should always be tagged
>>>> with 01:00.0 on bare metal, so why do we bother to program the
>>>> context entry of 01:00.1? It seems never used.
>>>>
>>>> (It should be needed for current QEMU to work with pcie-to-pci
>>>> bridges without this patch, but I feel like I don't know the real
>>>> answer behind it.)
>>>
>>> We actually have two different scenarios that could be represented
>>> by [1]: the group can be formed by lack of isolation or by lack of
>>> visibility. In the group above, it's the former, isolation. The PCIe
>>> root port does not support ACS, so while the IOMMU has visibility of
>>> the individual devices, peer-to-peer between devices may also be
>>> possible. Native, trusted, in-kernel drivers for these devices could
>>> still make use of separate IOMMU domains per device, but in order to
>>> expose the devices to a userspace driver we need to consider them a
>>> non-isolated set to prevent side-channel attacks between devices. We
>>> therefore consider them a group within the IOMMU API, and it's
>>> required that each context entry maps to the same domain, since the
>>> IOMMU will see transactions from each requester ID.
>>>
>>> If we had the visibility case, such as if [1] represented a
>>> PCIe-to-PCI bridge subgroup, then the IOMMU really does only see the
>>> bridge's requester ID, and there may not be a reason to populate the
>>> context entries for the downstream aliased devices. Perhaps the IOMMU
>>> might still choose to do so, particularly if the bridge is actually a
>>> PCI-X bridge, as PCI-X does incorporate a requester ID but also has
>>> strange rules about the bridge being able to claim ownership of the
>>> transaction. So it might be paranoia or simplification that causes
>>> all the context entries to be programmed, or, for alias quirks,
>>> uncertainty as to whether a device exclusively uses a quirked
>>> requester ID or might sometimes use the proper one.
>>>
>>> In the example I present, we're taking [1], which could be either
>>> case above, and converting it into the visibility case in order to
>>> force the IOMMU to handle the devices within a single address space.
>>> Thanks,
>>>
>>> Alex
>>>
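As a footnote on where the duplicate ID comes from: the alias walk
described in the quoted text (pci_for_each_dma_alias() hitting the
PCI_EXP_TYPE_PCI_BRIDGE path) can be modeled roughly as below. This is
a simplified sketch with illustrative field names, not the kernel
source:

#include <stdio.h>

struct dev_model {
    int bus, devfn;
    int is_pcie_to_pci_bridge;    /* cf. PCI_EXP_TYPE_PCI_BRIDGE */
    int secondary_bus;            /* bus below the bridge        */
    struct dev_model *upstream;   /* towards the host bridge     */
};

/* Report the device's own RID, then one alias per bridge crossed on
 * the way up that re-tags the transaction. */
static void for_each_dma_alias(struct dev_model *dev,
                               void (*fn)(int bus, int devfn))
{
    fn(dev->bus, dev->devfn);
    for (struct dev_model *b = dev->upstream; b; b = b->upstream)
        if (b->is_pcie_to_pci_bridge)
            fn(b->secondary_bus, 0);   /* express-to-PCI: devfn 0 */
}

static void show(int bus, int devfn)
{
    printf("0000:%02x:%02x.%x\n", bus, devfn >> 3, devfn & 7);
}

int main(void)
{
    struct dev_model bridge = { .bus = 0, .devfn = 0x1e << 3,
                                .is_pcie_to_pci_bridge = 1,
                                .secondary_bus = 2 };
    struct dev_model fn0 = { .bus = 2, .devfn = 1 << 3,
                             .upstream = &bridge };

    /* Prints 0000:02:01.0 then 0000:02:00.0 -- the two ids[] entries
     * seen in each fwspec earlier in this mail. */
    for_each_dma_alias(&fn0, show);
    return 0;
}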