On Mon, Dec 18, 2017 at 10:22:18PM +1100, David Gibson wrote:
> On Mon, Dec 18, 2017 at 05:17:35PM +0800, Liu, Yi L wrote:
> > On Mon, Dec 18, 2017 at 05:14:42PM +1100, David Gibson wrote:
> > > On Thu, Nov 16, 2017 at 04:57:09PM +0800, Liu, Yi L wrote:
> > > > Hi David,
> > > >
> > > > On Tue, Nov 14, 2017 at 11:59:34AM +1100, David Gibson wrote:
> > > > > On Mon, Nov 13, 2017 at 04:28:45PM +0800, Peter Xu wrote:
> > > > > > On Mon, Nov 13, 2017 at 04:56:01PM +1100, David Gibson wrote:
> > > > > > > On Fri, Nov 03, 2017 at 08:01:52PM +0800, Liu, Yi L wrote:
> > > > > > > > From: Peter Xu <pet...@redhat.com>
> > > > > > > >
> > > > > > > > AddressSpaceOps is similar to MemoryRegionOps; it's just for address spaces to store arch-specific hooks.
> > > > > > > >
> > > > > > > > The first hook I would like to introduce is iommu_get(). It returns the IOMMUObject behind the AddressSpace.
> > > > > > > >
> > > > > > > > For systems that have IOMMUs, we will create a special address space per device which is different from the system default address space (please refer to pci_device_iommu_address_space()). Normally when that happens, there will be one specific IOMMU (or say, translation unit) standing right behind that new address space.
> > > > > > > >
> > > > > > > > This iommu_get() fetches that guy behind the address space. Here, the guy is defined as IOMMUObject, which includes a notifier_list so far and may be extended in the future. Along with IOMMUObject, a new IOMMU notifier mechanism is introduced. It would be used for virt-SVM. Also, IOMMUObject can further have an IOMMUObjectOps, which is similar to MemoryRegionOps. The difference is that IOMMUObjectOps does not rely on MemoryRegion.
> > > > > > > >
> > > > > > > > Signed-off-by: Peter Xu <pet...@redhat.com>
> > > > > > > > Signed-off-by: Liu, Yi L <yi.l....@linux.intel.com>
> > > > > > >
> > > > > > > Hi, sorry I didn't reply to the earlier postings of this after our discussion in China. I've been sick several times and very busy.
> > > > > > >
> > > > > > > I still don't feel like there's an adequate explanation of exactly what an IOMMUObject represents. Obviously it can represent more than a single translation window - since that's represented by the IOMMU MR. But what exactly do all the MRs - or whatever else - that are represented by the IOMMUObject have in common, from a functional point of view?
> > > > > > >
> > > > > > > Even understanding the SVM stuff better than I did, I don't really see why an AddressSpace is an obvious unit to have an IOMMUObject associated with it.
> > > > > >
> > > > > > Here's what I thought about it: IOMMUObject was planned to be the abstraction of the hardware translation unit, which sits at a higher level than the translated address spaces. Say, each PCI device can have its own translated address space. However, multiple PCI devices can share the same translation unit that handles the translation requests from the different devices. That's the case for Intel platforms. We introduced this IOMMUObject because sometimes we want to do something with that translation unit rather than with a specific device, in which case we need a general IOMMU device handle.
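
[Side note, to keep the data structures we are debating in view: below is roughly the shape the RFC proposes, as I read the commit message above. Only iommu_get() and the notifier list come from the actual patch; the other names, such as IOMMUObjectNotifier, are placeholders I made up for illustration.]

#include "qemu/osdep.h"
#include "qemu/queue.h"    /* QLIST_HEAD */
#include "exec/memory.h"   /* AddressSpace */

/* Illustrative sketch only, not the final patch. */
typedef struct IOMMUObject IOMMUObject;

typedef struct AddressSpaceOps {
    /* Return the translation unit behind this AddressSpace, or NULL if
     * the address space is not backed by an IOMMU. */
    IOMMUObject *(*iommu_get)(AddressSpace *as);
} AddressSpaceOps;

struct IOMMUObject {
    /* Notifiers for events that concern the whole translation unit
     * (e.g. virt-SVM bind/invalidate), not one translated region. */
    QLIST_HEAD(, IOMMUObjectNotifier) notifier_list;
};
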
> > > > > Ok, but what does "hardware translation unit" mean in practice? The guest neither knows nor cares which bits of IOMMU translation happen to be included in the same bundle of silicon. It only cares what the behaviour is. What behavioural characteristics does a single IOMMUObject have?
> > > > >
> > > > > > IIRC one issue left over from last time's discussion was that there could be more complicated IOMMU models. E.g., one device's DMA request can be translated nestedly by two or more IOMMUs, and the current proposal cannot really handle that complicated hierarchy. I'm just thinking whether we can start from a simple model (say, we don't allow nested IOMMUs, and actually we don't even allow multiple IOMMUs so far), and then we can evolve from that point in the future.
> > > > > >
> > > > > > Also, I thought there was something you mentioned about this approach not being correct for Power systems, but I can't really remember the details... Anyway, I think this is not the only approach to solve the problem, and I believe any new, better idea would be greatly welcomed as well. :)
> > > > >
> > > > > So, some of my initial comments were based on a misunderstanding of what was proposed here - since discussing this with Yi at LinuxCon Beijing, I have a better idea of what's going on.
> > > > >
> > > > > On POWER - or rather the "pseries" platform, which is paravirtualized - we can have multiple vIOMMU windows (usually 2) for a single virtual
> > > >
> > > > On POWER, is the DMA isolation done by allocating different DMA windows to different isolation domains? And may a single isolation domain include multiple DMA windows? So, with or without an IOMMU, is there only a single DMA address space shared by all the devices in the system? Is the isolation mechanism as described above?
> > >
> > > No, the multiple windows are completely unrelated to how things are isolated.
> >
> > I'm afraid I chose the wrong word with "DMA window". When I say "DMA window", I actually mean address ranges in an IOVA address space.
>
> Yes, so did I. By one window I mean one contiguous range of IOVA addresses.
>
> > Anyhow, let me reshape my understanding of the POWER IOMMU and make sure we are on the same page.
> >
> > > Just like on x86, each IOMMU domain has independent IOMMU mappings. The only difference is that IBM calls the domains "partitionable endpoints" (PEs), and they tend to be statically created at boot time rather than generated at runtime.
> >
> > Does the POWER IOMMU also have an IOVA concept? A device can use an IOVA to access memory, and the IOMMU translates the IOVA to an address within the system physical address space?
>
> Yes. When I say the "PCI address space" I mean the IOVA space.
>
> > > The windows are about what addresses in PCI space are translated by the IOMMU. If the device generates a PCI cycle, only certain addresses will be mapped by the IOMMU to DMA - other addresses will correspond to other devices' MMIOs, MSI vectors, maybe other things.
> >
> > I guess the windows you mentioned here are the address ranges within the system physical address space, since you also mentioned MMIOs etc.?
>
> No. I mean ranges within the PCI space == IOVA space. It's simplest to understand with traditional PCI. A cycle on the bus doesn't know whether the destination is a device or memory; it just has an address - a PCI memory address. Part of that address range is mapped to system RAM, optionally with an IOMMU translating it. Other parts of that address space are used for devices.
>
> With PCI-E things get more complicated, but the conceptual model is the same.
>
> > > The set of addresses translated by the IOMMU need not be contiguous.
> >
> > I suppose you mean the output addresses of the IOMMU need not be contiguous?
>
> No. I mean the input addresses of the IOMMU.
>
> > > Or, there could be two IOMMUs on the bus, each accepting different address ranges. These two situations are not distinguishable from the guest's point of view.
> > >
> > > So for a typical PAPR setup, the device can access system RAM either via DMA in the range 0..1GiB (the "32-bit window") or in the range 2^59..2^59+<something> (the "64-bit window"). Typically the 32-bit window has mappings dynamically created by drivers, and the 64-bit window has all of system RAM mapped 1:1, but that's entirely up to the OS; it can map each window however it wants.
> > >
> > > 32-bit devices (or "64-bit" devices which don't actually implement enough of the address bits) will only be able to use the 32-bit window, of course.
> > >
> > > MMIOs of other devices, the "magic" MSI-X addresses belonging to the host bridge, and other things exist outside those ranges. Those are just the ranges which are used to DMA to RAM.
> > >
> > > Each PE (domain) can see a different version of what's in each window.
> >
> > If I'm correct so far, a PE actually defines a mapping between an address range of an address space (i.e. an IOVA address space) and an address range of the system physical address space.
>
> No. A PE means several things, but basically it is an isolation domain, like an Intel IOMMU domain. Each PE has an independent set of IOMMU mappings which translate part of the PCI address space to system memory space.
>
> > Then my question is: does each PE define a separate IOVA address space which is flat from 0 to 2^AW - 1, where AW is the address width? As a reference, a VT-d domain defines a flat address space for each domain.
>
> Partly. Each PE has an address space which all devices in the PE see. Only some of that address space is mapped to system memory though; other parts are occupied by devices, others are unmapped.
>
> Only the parts mapped by the IOMMU vary between PEs - the other parts of the address space will be identical for all PEs on the host bridge.

Thanks, this comment clarified things for me. This is different from what we have on VT-d.
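
To double-check that I've got it, here is a rough sketch of how I now picture a PE's PCI address space being composed with the QEMU memory API. This is illustrative only - not actual spapr code - the window placement is taken from your example above and the names are made up:

#include "qemu/osdep.h"
#include "exec/memory.h"   /* MemoryRegion, AddressSpace */

/* Illustrative only: a PE's PCI (IOVA) address space. Only the two DMA
 * windows are backed by the IOMMU; everything else in the space (device
 * MMIO BARs, MSI doorbells, ...) is not translated by it. */
static MemoryRegion pe_root;
static AddressSpace pe_as;

static void pe_address_space_sketch(MemoryRegion *win32, MemoryRegion *win64)
{
    /* win32/win64 stand in for the IOMMU-backed window MRs that the
     * host bridge / TCE code would create elsewhere. */
    memory_region_init(&pe_root, NULL, "pe-pci-address-space", UINT64_MAX);

    /* 32-bit DMA window: IOVA 0..1GiB, mappings created by the guest at runtime */
    memory_region_add_subregion(&pe_root, 0, win32);

    /* 64-bit DMA window: IOVA starting at 2^59, often a 1:1 map of RAM */
    memory_region_add_subregion(&pe_root, 1ULL << 59, win64);

    /* Device MMIO, MSI addresses, etc. would be further subregions here,
     * outside the IOMMU-translated windows. */

    address_space_init(&pe_as, &pe_root, "pe-pci");
}

If that picture is right, then I see your point: the IOMMU is really attached to the window MRs, while the AddressSpace is the view of the whole bus.
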
> However, for POWER guests (not for hosts), there is exactly one PE for each virtual host bridge.
>
> > > In fact, if I understand the "IO hole" correctly, the situation on x86 isn't very different. It has a window below the IO hole and a second window above the IO hole. The addresses within the IO hole go to (32-bit) devices on the PCI bus, rather than being translated by the
> >
> > If you mean the "IO hole" within the system physical address space, I think the answer is yes.
>
> Well, really I mean the IO hole in the PCI address space. Because system address space and PCI memory space were traditionally identity mapped on x86, this is easy to confuse though.
>
> > > IOMMU to RAM addresses. Because the gap between the two windows is smaller, I think we get away without really modelling this detail in qemu though.
> > >
> > > > > PCI host bridge. Because of the paravirtualization, the mapping to hardware is fuzzy, but for passthrough devices they will both be implemented by the IOMMU built into the physical host bridge. That isn't important to the guest, though - all operations happen at the window level.
> > > >
> > > > On VT-d, with an IOMMU present, each isolation domain has its own address space. That's why we talked more at the address space level, and the IOMMU makes the difference. That's the behavioural characteristic a single IOMMU translation unit has, and thus what an IOMMUObject is going to have.
> > >
> > > Right, that's the same on POWER. But the IOMMU only translates *some* addresses within the address space, not all of them. The rest will go to other PCI devices or be unmapped, but won't go to RAM.
> > >
> > > That's why the IOMMU should really be associated with an MR (or several MRs), not an AddressSpace; it only translates some addresses.
> >
> > If I'm correct so far, I do believe the major difference between VT-d and the POWER IOMMU is that a VT-d isolation domain is a flat address space, while a PE on POWER is something different (I need your input here as I'm not sure about it). Maybe it's like this: there is one flat address space, and each PE takes some address ranges and maps those ranges to different system physical address ranges.
>
> No, it's really not that different. In both cases (without virt-SVM) there's a system memory address space, and a PCI address space for each domain / PE. There are one or more "outbound" windows in system memory space that map system memory cycles to PCI cycles (used by the CPU to access MMIO), and one or more "inbound" (DMA) windows in PCI memory space which map PCI cycles onto system memory cycles (used by devices to access system memory).
>
> On old-style PCs, both inbound and outbound windows were (mostly) identity maps. On POWER they are not.
>
> > > > > The other thing that bothers me here is the way it's attached to an AddressSpace.
> > > >
> > > > My consideration is that the IOMMU handles AddressSpaces, and the DMA address space is also an address space managed by the IOMMU.
> > >
> > > No, it's not. It's a region (or several) within the overall PCI address space. Other things in the address space, such as other devices' BARs, exist independently of the IOMMU.
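
[For reference, part of why we keep reaching for the AddressSpace level on the Intel side is that the per-device AddressSpace is already the handle a device model uses for DMA in QEMU today. A sketch of what I mean, using the existing API (error handling omitted, the function name is made up):]

#include "qemu/osdep.h"
#include "hw/pci/pci.h"    /* PCIDevice, pci_device_iommu_address_space() */
#include "sysemu/dma.h"    /* dma_addr_t, dma_memory_read() */

static void dma_read_sketch(PCIDevice *pdev, dma_addr_t iova,
                            void *buf, dma_addr_t len)
{
    /* On VT-d this is a per-domain flat IOVA space; on spapr it is the
     * PE's PCI space, of which only the DMA windows are IOMMU-translated. */
    AddressSpace *as = pci_device_iommu_address_space(pdev);

    /* All DMA from this device goes through 'as'; whether and how an
     * IOMMU translates it is hidden behind the memory regions inside. */
    dma_memory_read(as, iova, buf, len);
}

That per-device AddressSpace is the only DMA handle the device model sees, which is why hanging the translation-unit hook off it felt natural to us - though I take your point that the IOMMU only covers parts of that space.
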
> > > It's not something that could really work with PCI-E, I think, but with a more traditional PCI bus there's no reason you couldn't have multiple IOMMUs listening on different regions of the PCI address space.
> >
> > I think the point here is that on POWER the input addresses of the IOMMUs are actually in the same address space?
>
> I'm not sure what you mean, but I don't think so. Each PE has its own IOMMU input address space.
>
> > What the IOMMU does is map the different ranges to different system physical address ranges. So it's as you mentioned: multiple IOMMUs listen on different regions of a PCI address space.
>
> No. That could be the case in theory, but it's not the usual case.
>
> Or rather, it depends what you mean by "an IOMMU". For PAPR guests, both IOVA 0..1GiB and 2^59..(somewhere) are mapped to system memory, but with separate page tables. You could consider that two IOMMUs (we mostly treat it that way in qemu). However, all the mapping is handled by the same host bridge with 2 sets of page tables per PE, so you could also call it one IOMMU.
>
> This is what I'm getting at when I say that "one IOMMU" is not a clearly defined unit.
>
> > While for VT-d that's not the case: the input addresses of the IOMMUs may not be in the same address space. As I mentioned, each IOMMU domain on VT-d is a separate address space. So for VT-d, the IOMMUs are actually listening to different address spaces. That's why we want an address-space-level abstraction of the IOMMU.
> >
> > > > That's why we believe it is fine to associate the DMA address space with an IOMMUObject.
> > > >
> > > > > IIUC how SVM works, the whole point is that the device no longer writes into a specific PCI address space. Instead, it writes directly into a process address space. So it seems to me more that SVM should operate at the PCI level, and disassociate the device from the normal PCI address space entirely, rather than hooking up something via that address space.

After thinking about it more, I agree that it is not suitable to hook up something for the 1st level via the PCI address space. Once 1st- and 2nd-level translation are exposed to the guest, a device would write to multiple address spaces, and the PCI address space is only one of them. I think your reply in another email is a good starting point; let me put my thoughts under that email.

Regards,
Yi L

> > > > As Peter replied, we still need the PCI address space; it would be used to build up the 2nd-level page table, which would be used in nested translation.
> > > >
> > > > Thanks,
> > > > Yi L
> >
> > Regards,
> > Yi L
>
> --
> David Gibson                    | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
>                                 | _way_ _around_!
> http://www.ozlabs.org/~dgibson