On Tue, Jan 14, 2014 at 09:39:24AM -0700, Alex Williamson wrote: > On Tue, 2014-01-14 at 18:18 +0200, Michael S. Tsirkin wrote: > > On Tue, Jan 14, 2014 at 09:15:14AM -0700, Alex Williamson wrote: > > > On Tue, 2014-01-14 at 18:03 +0200, Michael S. Tsirkin wrote: > > > > On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote: > > > > > On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote: > > > > > > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote: > > > > > > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote: > > > > > > > > > > > > > > > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson > > > > > > > > > <alex.william...@redhat.com>: > > > > > > > > > > > > > > > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote: > > > > > > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin > > > > > > > > >>> <m...@redhat.com> wrote: > > > > > > > > >>> > > > > > > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson > > > > > > > > >>>> wrote: > > > > > > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin > > > > > > > > >>>>> wrote: > > > > > > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex > > > > > > > > >>>>>> Williamson wrote: > > > > > > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin > > > > > > > > >>>>>>> wrote: > > > > > > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex > > > > > > > > >>>>>>>> Williamson wrote: > > > > > > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson > > > > > > > > >>>>>>>>> wrote: > > > > > > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. > > > > > > > > >>>>>>>>>> Tsirkin wrote: > > > > > > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex > > > > > > > > >>>>>>>>>>> Williamson wrote: > > > > > > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. > > > > > > > > >>>>>>>>>>>> Tsirkin wrote: > > > > > > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonz...@redhat.com> > > > > > > > > >>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit > > > > > > > > >>>>>>>>>>>> system memory > > > > > > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address > > > > > > > > >>>>>>>>>>>> spaces 64-bit wide. > > > > > > > > >>>>>>>>>>>> This eliminates problems with phys_page_find > > > > > > > > >>>>>>>>>>>> ignoring bits above > > > > > > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and > > > > > > > > >>>>>>>>>>>> address_space_translate_internal > > > > > > > > >>>>>>>>>>>> consequently messing up the computations. > > > > > > > > >>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts > > > > > > > > >>>>>>>>>>>> to read from address > > > > > > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff > > > > > > > > >>>>>>>>>>>> inclusive. The region it gets > > > > > > > > >>>>>>>>>>>> is the newly introduced master abort region, which > > > > > > > > >>>>>>>>>>>> is as big as the PCI > > > > > > > > >>>>>>>>>>>> address space (see pci_bus_init). Due to a typo > > > > > > > > >>>>>>>>>>>> that's only 2^63-1, > > > > > > > > >>>>>>>>>>>> not 2^64. But we get it anyway because > > > > > > > > >>>>>>>>>>>> phys_page_find ignores the upper > > > > > > > > >>>>>>>>>>>> bits of the physical address. 
In > > > > > > > > >>>>>>>>>>>> address_space_translate_internal then > > > > > > > > >>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>> diff = int128_sub(section->mr->size, > > > > > > > > >>>>>>>>>>>> int128_make64(addr)); > > > > > > > > >>>>>>>>>>>> *plen = int128_get64(int128_min(diff, > > > > > > > > >>>>>>>>>>>> int128_make64(*plen))); > > > > > > > > >>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms. > > > > > > > > >>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>> The size of the PCI address space region should be > > > > > > > > >>>>>>>>>>>> fixed anyway. > > > > > > > > >>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino > > > > > > > > >>>>>>>>>>>> <lcapitul...@redhat.com> > > > > > > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> > > > > > > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com> > > > > > > > > >>>>>>>>>>>> --- > > > > > > > > >>>>>>>>>>>> exec.c | 8 ++------ > > > > > > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-) > > > > > > > > >>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c > > > > > > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644 > > > > > > > > >>>>>>>>>>>> --- a/exec.c > > > > > > > > >>>>>>>>>>>> +++ b/exec.c > > > > > > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry { > > > > > > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6) > > > > > > > > >>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables. */ > > > > > > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS > > > > > > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS > > > > > > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64 > > > > > > > > >>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>> #define P_L2_BITS 10 > > > > > > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS) > > > > > > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void > > > > > > > > >>>>>>>>>>>> memory_map_init(void) > > > > > > > > >>>>>>>>>>>> { > > > > > > > > >>>>>>>>>>>> system_memory = > > > > > > > > >>>>>>>>>>>> g_malloc(sizeof(*system_memory)); > > > > > > > > >>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>> - assert(ADDR_SPACE_BITS <= 64); > > > > > > > > >>>>>>>>>>>> - > > > > > > > > >>>>>>>>>>>> - memory_region_init(system_memory, NULL, > > > > > > > > >>>>>>>>>>>> "system", > > > > > > > > >>>>>>>>>>>> - ADDR_SPACE_BITS == 64 ? > > > > > > > > >>>>>>>>>>>> - UINT64_MAX : (0x1ULL << > > > > > > > > >>>>>>>>>>>> ADDR_SPACE_BITS)); > > > > > > > > >>>>>>>>>>>> + memory_region_init(system_memory, NULL, > > > > > > > > >>>>>>>>>>>> "system", UINT64_MAX); > > > > > > > > >>>>>>>>>>>> address_space_init(&address_space_memory, > > > > > > > > >>>>>>>>>>>> system_memory, "memory"); > > > > > > > > >>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>> system_io = g_malloc(sizeof(*system_io)); > > > > > > > > >>>>>>>>>>> > > > > > > > > >>>>>>>>>>> This seems to have some unexpected consequences > > > > > > > > >>>>>>>>>>> around sizing 64bit PCI > > > > > > > > >>>>>>>>>>> BARs that I'm not sure how to handle. > > > > > > > > >>>>>>>>>> > > > > > > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you > > > > > > > > >>>>>>>>>> don't detect BAR being disabled? > > > > > > > > >>>>>>>>> > > > > > > > > >>>>>>>>> See the trace below, the BARs are not disabled. QEMU > > > > > > > > >>>>>>>>> pci-core is doing > > > > > > > > >>>>>>>>> the sizing an memory region updates for the BARs, > > > > > > > > >>>>>>>>> vfio is just a > > > > > > > > >>>>>>>>> pass-through here. 
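For context on the crash described in the quoted commit message, here is a minimal sketch of the arithmetic, using plain __int128 rather than QEMU's Int128 helpers; the constants are the ones from Luiz's report (a master-abort region sized 2^63-1, a gdb read at 0xffffffffffffffe6):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* The master abort region is as big as the PCI address space, but due
     * to the typo its size is 2^63-1 rather than 2^64. */
    __int128 size = ((__int128)1 << 63) - 1;

    /* gdb reads from the very top of the 64-bit space; phys_page_find
     * ignores the upper bits, so this undersized section is returned. */
    uint64_t addr = 0xffffffffffffffe6ULL;

    /* diff = int128_sub(section->mr->size, int128_make64(addr)); */
    __int128 diff = size - (__int128)addr;
    printf("diff is %s\n", diff < 0 ? "negative" : "non-negative");

    /* int128_get64() insists the value fits in a uint64_t; a negative
     * diff trips that assertion, which is the "booms" above. */
    assert(diff >= 0 && diff <= (__int128)UINT64_MAX);
    return 0;
}

Making all address spaces 64-bit wide (or fixing the region size) keeps diff non-negative and avoids the assertion.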
> > > > > > > > >>>>>>>> > > > > > > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing > > > > > > > > >>>>>>>> seems to be happening > > > > > > > > >>>>>>>> while I/O & memory are enabled int he command > > > > > > > > >>>>>>>> register. Thanks, > > > > > > > > >>>>>>>> > > > > > > > > >>>>>>>> Alex > > > > > > > > >>>>>>> > > > > > > > > >>>>>>> OK then from QEMU POV this BAR value is not special at > > > > > > > > >>>>>>> all. > > > > > > > > >>>>>> > > > > > > > > >>>>>> Unfortunately > > > > > > > > >>>>>> > > > > > > > > >>>>>>>>>>> After this patch I get vfio > > > > > > > > >>>>>>>>>>> traces like this: > > > > > > > > >>>>>>>>>>> > > > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, > > > > > > > > >>>>>>>>>>> len=0x4) febe0004 > > > > > > > > >>>>>>>>>>> (save lower 32bits of BAR) > > > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, > > > > > > > > >>>>>>>>>>> 0xffffffff, len=0x4) > > > > > > > > >>>>>>>>>>> (write mask to BAR) > > > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff > > > > > > > > >>>>>>>>>>> (memory region gets unmapped) > > > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, > > > > > > > > >>>>>>>>>>> len=0x4) ffffc004 > > > > > > > > >>>>>>>>>>> (read size mask) > > > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, > > > > > > > > >>>>>>>>>>> 0xfebe0004, len=0x4) > > > > > > > > >>>>>>>>>>> (restore BAR) > > > > > > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff > > > > > > > > >>>>>>>>>>> [0x7fcf3654d000] > > > > > > > > >>>>>>>>>>> (memory region re-mapped) > > > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, > > > > > > > > >>>>>>>>>>> len=0x4) 0 > > > > > > > > >>>>>>>>>>> (save upper 32bits of BAR) > > > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, > > > > > > > > >>>>>>>>>>> 0xffffffff, len=0x4) > > > > > > > > >>>>>>>>>>> (write mask to BAR) > > > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff > > > > > > > > >>>>>>>>>>> (memory region gets unmapped) > > > > > > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - > > > > > > > > >>>>>>>>>>> fffffffffebe3fff [0x7fcf3654d000] > > > > > > > > >>>>>>>>>>> (memory region gets re-mapped with new address) > > > > > > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, > > > > > > > > >>>>>>>>>>> 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 > > > > > > > > >>>>>>>>>>> (Bad address) > > > > > > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit > > > > > > > > >>>>>>>>>>> physical addresses) > > > > > > > > >>>>>>>>>> > > > > > > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma > > > > > > > > >>>>>>>>>> in the iommu? > > > > > > > > >>>>>>>>> > > > > > > > > >>>>>>>>> Two reasons, first I can't tell the difference > > > > > > > > >>>>>>>>> between RAM and MMIO. > > > > > > > > >>>>>>> > > > > > > > > >>>>>>> Why can't you? Generally memory core let you find out > > > > > > > > >>>>>>> easily. > > > > > > > > >>>>>> > > > > > > > > >>>>>> My MemoryListener is setup for &address_space_memory and > > > > > > > > >>>>>> I then filter > > > > > > > > >>>>>> out anything that's not memory_region_is_ram(). This > > > > > > > > >>>>>> still gets > > > > > > > > >>>>>> through, so how do I easily find out? > > > > > > > > >>>>>> > > > > > > > > >>>>>>> But in this case it's vfio device itself that is sized > > > > > > > > >>>>>>> so for sure you > > > > > > > > >>>>>>> know it's MMIO. 
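To restate what the trace shows, here is a rough sketch of the 64-bit BAR sizing sequence as it happens here, with memory decode left enabled in the command register; pci_cfg_read()/pci_cfg_write() are hypothetical 32-bit config accessors for illustration, not a real API. The problem step is writing the all-ones mask to the upper dword after the lower dword has already been restored, which makes the BAR transiently decode at 0xfffffffffebe0000:

#include <stdint.h>

/* Hypothetical 32-bit config space accessors, for illustration only. */
uint32_t pci_cfg_read(void *dev, int off);
void pci_cfg_write(void *dev, int off, uint32_t val);

/* Size a 64-bit memory BAR at config offsets 0x10/0x14 while memory
 * decode is still enabled. */
static uint64_t size_64bit_bar(void *dev)
{
    uint32_t lo = pci_cfg_read(dev, 0x10);       /* save low half: 0xfebe0004 */
    pci_cfg_write(dev, 0x10, 0xffffffff);        /* write sizing mask         */
    uint32_t lo_mask = pci_cfg_read(dev, 0x10);  /* read back: 0xffffc004     */
    pci_cfg_write(dev, 0x10, lo);                /* restore low half          */

    uint32_t hi = pci_cfg_read(dev, 0x14);       /* save high half: 0x0       */
    pci_cfg_write(dev, 0x14, 0xffffffff);        /* BAR transiently decodes   */
                                                 /* at 0xfffffffffebe0000     */
    uint32_t hi_mask = pci_cfg_read(dev, 0x14);  /* read back: 0xffffffff     */
    pci_cfg_write(dev, 0x14, hi);                /* restore high half         */

    /* Clear the low type bits, invert, add one: 0x4000 (16KB) here. */
    return ~(((uint64_t)hi_mask << 32) | (lo_mask & ~0xfULL)) + 1;
}

Each of those config writes triggers a memory region update, which is why the MemoryListener sees the region_del/region_add pairs in the trace, including the one at the bogus high address.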
> > > > > > > > >>>>>> > > > > > > > > >>>>>> How so? I have a MemoryListener as described above and > > > > > > > > >>>>>> pass everything > > > > > > > > >>>>>> through to the IOMMU. I suppose I could look through > > > > > > > > >>>>>> all the > > > > > > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but > > > > > > > > >>>>>> that seems really > > > > > > > > >>>>>> ugly. > > > > > > > > >>>>>> > > > > > > > > >>>>>>> Maybe you will have same issue if there's another > > > > > > > > >>>>>>> device with a 64 bit > > > > > > > > >>>>>>> bar though, like ivshmem? > > > > > > > > >>>>>> > > > > > > > > >>>>>> Perhaps, I suspect I'll see anything that registers > > > > > > > > >>>>>> their BAR > > > > > > > > >>>>>> MemoryRegion from memory_region_init_ram or > > > > > > > > >>>>>> memory_region_init_ram_ptr. > > > > > > > > >>>>> > > > > > > > > >>>>> Must be a 64 bit BAR to trigger the issue though. > > > > > > > > >>>>> > > > > > > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, > > > > > > > > >>>>>>>>> which is something > > > > > > > > >>>>>>>>> that we might be able to take advantage of with GPU > > > > > > > > >>>>>>>>> passthrough. > > > > > > > > >>>>>>>>> > > > > > > > > >>>>>>>>>>> Prior to this change, there was no re-map with the > > > > > > > > >>>>>>>>>>> fffffffffebe0000 > > > > > > > > >>>>>>>>>>> address, presumably because it was beyond the > > > > > > > > >>>>>>>>>>> address space of the PCI > > > > > > > > >>>>>>>>>>> window. This address is clearly not in a PCI MMIO > > > > > > > > >>>>>>>>>>> space, so why are we > > > > > > > > >>>>>>>>>>> allowing it to be realized in the system address > > > > > > > > >>>>>>>>>>> space at this location? > > > > > > > > >>>>>>>>>>> Thanks, > > > > > > > > >>>>>>>>>>> > > > > > > > > >>>>>>>>>>> Alex > > > > > > > > >>>>>>>>>> > > > > > > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space? > > > > > > > > >>>>>>>>>> True, CPU can't access this address but other pci > > > > > > > > >>>>>>>>>> devices can. > > > > > > > > >>>>>>>>> > > > > > > > > >>>>>>>>> What happens on real hardware when an address like > > > > > > > > >>>>>>>>> this is programmed to > > > > > > > > >>>>>>>>> a device? The CPU doesn't have the physical bits to > > > > > > > > >>>>>>>>> access it. I have > > > > > > > > >>>>>>>>> serious doubts that another PCI device would be able > > > > > > > > >>>>>>>>> to access it > > > > > > > > >>>>>>>>> either. Maybe in some limited scenario where the > > > > > > > > >>>>>>>>> devices are on the > > > > > > > > >>>>>>>>> same conventional PCI bus. In the typical case, PCI > > > > > > > > >>>>>>>>> addresses are > > > > > > > > >>>>>>>>> always limited by some kind of aperture, whether > > > > > > > > >>>>>>>>> that's explicit in > > > > > > > > >>>>>>>>> bridge windows or implicit in hardware design (and > > > > > > > > >>>>>>>>> perhaps made explicit > > > > > > > > >>>>>>>>> in ACPI). Even if I wanted to filter these out as > > > > > > > > >>>>>>>>> noise in vfio, how > > > > > > > > >>>>>>>>> would I do it in a way that still allows real 64bit > > > > > > > > >>>>>>>>> MMIO to be > > > > > > > > >>>>>>>>> programmed. PCI has this knowledge, I hope. VFIO > > > > > > > > >>>>>>>>> doesn't. Thanks, > > > > > > > > >>>>>>>>> > > > > > > > > >>>>>>>>> Alex > > > > > > > > >>>>>>> > > > > > > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. 
PCI spec > > > > > > > > >>>>>>> is explicit that > > > > > > > > >>>>>>> full 64 bit addresses must be allowed and hardware > > > > > > > > >>>>>>> validation > > > > > > > > >>>>>>> test suites normally check that it actually does work > > > > > > > > >>>>>>> if it happens. > > > > > > > > >>>>>> > > > > > > > > >>>>>> Sure, PCI devices themselves, but the chipset typically > > > > > > > > >>>>>> has defined > > > > > > > > >>>>>> routing, that's more what I'm referring to. There are > > > > > > > > >>>>>> generally only > > > > > > > > >>>>>> fixed address windows for RAM vs MMIO. > > > > > > > > >>>>> > > > > > > > > >>>>> The physical chipset? Likely - in the presence of IOMMU. > > > > > > > > >>>>> Without that, devices can talk to each other without going > > > > > > > > >>>>> through chipset, and bridge spec is very explicit that > > > > > > > > >>>>> full 64 bit addressing must be supported. > > > > > > > > >>>>> > > > > > > > > >>>>> So as long as we don't emulate an IOMMU, > > > > > > > > >>>>> guest will normally think it's okay to use any address. > > > > > > > > >>>>> > > > > > > > > >>>>>>> Yes, if there's a bridge somewhere on the path that > > > > > > > > >>>>>>> bridge's > > > > > > > > >>>>>>> windows would protect you, but pci already does this > > > > > > > > >>>>>>> filtering: > > > > > > > > >>>>>>> if you see this address in the memory map this means > > > > > > > > >>>>>>> your virtual device is on root bus. > > > > > > > > >>>>>>> > > > > > > > > >>>>>>> So I think it's the other way around: if VFIO requires > > > > > > > > >>>>>>> specific > > > > > > > > >>>>>>> address ranges to be assigned to devices, it should > > > > > > > > >>>>>>> give this > > > > > > > > >>>>>>> info to qemu and qemu can give this to guest. > > > > > > > > >>>>>>> Then anything outside that range can be ignored by VFIO. > > > > > > > > >>>>>> > > > > > > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe > > > > > > > > >>>>>> VFIO. There's > > > > > > > > >>>>>> currently no way to find out the address width of the > > > > > > > > >>>>>> IOMMU. We've been > > > > > > > > >>>>>> getting by because it's safely close enough to the CPU > > > > > > > > >>>>>> address width to > > > > > > > > >>>>>> not be a concern until we start exposing things at the > > > > > > > > >>>>>> top of the 64bit > > > > > > > > >>>>>> address space. Maybe I can safely ignore anything above > > > > > > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now. Thanks, > > > > > > > > >>>>>> > > > > > > > > >>>>>> Alex > > > > > > > > >>>>> > > > > > > > > >>>>> I think it's not related to target CPU at all - it's a > > > > > > > > >>>>> host limitation. > > > > > > > > >>>>> So just make up your own constant, maybe depending on > > > > > > > > >>>>> host architecture. > > > > > > > > >>>>> Long term add an ioctl to query it. > > > > > > > > >>>> > > > > > > > > >>>> It's a hardware limitation which I'd imagine has some > > > > > > > > >>>> loose ties to the > > > > > > > > >>>> physical address bits of the CPU. > > > > > > > > >>>> > > > > > > > > >>>>> Also, we can add a fwcfg interface to tell bios that it > > > > > > > > >>>>> should avoid > > > > > > > > >>>>> placing BARs above some address. > > > > > > > > >>>> > > > > > > > > >>>> That doesn't help this case, it's a spurious mapping > > > > > > > > >>>> caused by sizing > > > > > > > > >>>> the BARs with them enabled. We may still want such a > > > > > > > > >>>> thing to feed into > > > > > > > > >>>> building ACPI tables though. 
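On the "make up your own constant, maybe depending on host architecture" point, a sketch of what that stopgap could look like; the function name, the #ifdef and the per-arch values are made up for illustration, with 48 bits taken from the IOMMU failure in the trace above:

/* Hypothetical stand-in for the missing IOMMU address-width query: a
 * host-architecture constant now, an ioctl-backed value once the IOMMU
 * driver can report its real limit. */
static int vfio_host_iommu_addr_bits(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return 48;   /* the limit the IOMMU hit in the trace above */
#else
    return 64;   /* no known limit; assume full 64-bit IOVA */
#endif
}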
> > > > > > > > >>> > > > > > > > > >>> Well the point is that if you want BIOS to avoid > > > > > > > > >>> specific addresses, you need to tell it what to avoid. > > > > > > > > >>> But neither BIOS nor ACPI actually cover the range above > > > > > > > > >>> 2^48 ATM so it's not a high priority. > > > > > > > > >>> > > > > > > > > >>>>> Since it's a vfio limitation I think it should be a vfio > > > > > > > > >>>>> API, along the > > > > > > > > >>>>> lines of vfio_get_addr_space_bits(void). > > > > > > > > >>>>> (Is this true btw? legacy assignment doesn't have this > > > > > > > > >>>>> problem?) > > > > > > > > >>>> > > > > > > > > >>>> It's an IOMMU hardware limitation, legacy assignment has > > > > > > > > >>>> the same > > > > > > > > >>>> problem. It looks like legacy will abort() in QEMU for > > > > > > > > >>>> the failed > > > > > > > > >>>> mapping and I'm planning to tighten vfio to also kill the > > > > > > > > >>>> VM for failed > > > > > > > > >>>> mappings. In the short term, I think I'll ignore any > > > > > > > > >>>> mappings above > > > > > > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS, > > > > > > > > >>> > > > > > > > > >>> That seems very wrong. It will still fail on an x86 host if > > > > > > > > >>> we are > > > > > > > > >>> emulating a CPU with full 64 bit addressing. The limitation > > > > > > > > >>> is on the > > > > > > > > >>> host side there's no real reason to tie it to the target. > > > > > > > > > > > > > > > > > > I doubt vfio would be the only thing broken in that case. > > > > > > > > > > > > > > > > > >>>> long term vfio already has an IOMMU info > > > > > > > > >>>> ioctl that we could use to return this information, but > > > > > > > > >>>> we'll need to > > > > > > > > >>>> figure out how to get it out of the IOMMU driver first. > > > > > > > > >>>> Thanks, > > > > > > > > >>>> > > > > > > > > >>>> Alex > > > > > > > > >>> > > > > > > > > >>> Short term, just assume 48 bits on x86. > > > > > > > > > > > > > > > > > > I hate to pick an arbitrary value since we have a very > > > > > > > > > specific mapping > > > > > > > > > we're trying to avoid. Perhaps a better option is to skip > > > > > > > > > anything > > > > > > > > > where: > > > > > > > > > > > > > > > > > > MemoryRegionSection.offset_within_address_space > > > > > > > > > > ~MemoryRegionSection.offset_within_address_space > > > > > > > > > > > > > > > > > >>> We need to figure out what's the limitation on ppc and arm - > > > > > > > > >>> maybe there's none and it can address full 64 bit range. > > > > > > > > >> > > > > > > > > >> IIUC on PPC and ARM you always have BAR windows where things > > > > > > > > >> can get mapped into. Unlike x86 where the full phyiscal > > > > > > > > >> address range can be overlayed by BARs. > > > > > > > > >> > > > > > > > > >> Or did I misunderstand the question? > > > > > > > > > > > > > > > > > > Sounds right, if either BAR mappings outside the window will > > > > > > > > > not be > > > > > > > > > realized in the memory space or the IOMMU has a full 64bit > > > > > > > > > address > > > > > > > > > space, there's no problem. Here we have an intermediate step > > > > > > > > > in the BAR > > > > > > > > > sizing producing a stray mapping that the IOMMU hardware > > > > > > > > > can't handle. > > > > > > > > > Even if we could handle it, it's not clear that we want to. > > > > > > > > > On AMD-Vi > > > > > > > > > the IOMMU pages tables can grow to 6-levels deep. 
A stray > > > > > > > > > mapping like > > > > > > > > > this then causes space and time overhead until the tables are > > > > > > > > > pruned > > > > > > > > > back down. Thanks, > > > > > > > > > > > > > > > > I thought sizing is hard defined as a set to > > > > > > > > -1? Can't we check for that one special case and treat it as > > > > > > > > "not mapped, but tell the guest the size in config space"? > > > > > > > > > > > > > > PCI doesn't want to handle this as anything special to > > > > > > > differentiate a > > > > > > > sizing mask from a valid BAR address. I agree though, I'd prefer > > > > > > > to > > > > > > > never see a spurious address like this in my MemoryListener. > > > > > > > > > > > > It's more a can't than doesn't want to: it's a 64 bit BAR, it's not > > > > > > set to all ones atomically. > > > > > > > > > > > > Also, while it doesn't address this fully (same issue can happen > > > > > > e.g. with ivshmem), do you think we should distinguish these BARs > > > > > > mapped > > > > > > from vfio / device assignment in qemu somehow? > > > > > > > > > > > > In particular, even when it has sane addresses: > > > > > > device really can not DMA into its own BAR, that's a spec violation > > > > > > so in theory can do anything including crashing the system. > > > > > > I don't know what happens in practice but > > > > > > if you are programming IOMMU to forward transactions back to > > > > > > device that originated it, you are not doing it any favors. > > > > > > > > > > I might concede that peer-to-peer is more trouble than it's worth if I > > > > > had a convenient way to ignore MMIO mappings in my MemoryListener, > > > > > but I > > > > > don't. > > > > > > > > Well for VFIO devices you are creating these mappings so we surely > > > > can find a way for you to check that. > > > > Doesn't each segment point back at the memory region that created it? > > > > Then you can just check that. > > > > > > It's a fairly heavy-weight search and it only avoid vfio devices, so it > > > feels like it's just delaying a real solution. > > > > Well there are several problems. > > > > That device get its own BAR programmed > > as a valid target in IOMMU is in my opinion a separate bug, > > and for *that* it's a real solution. > > Except the side-effect of that solution is that it also disables > peer-to-peer since we do not use separate IOMMU domains per device. In > fact, we can't guarantee that it's possible to use separate IOMMU > domains per device.
Interesting. I guess we can make it work if there's a single device; that would cover many users, though not all of them.

> So, the cure is worse than the disease.

Worth checking which is worse. Want to try making a device DMA into its own BAR and see what crashes? It's a spec violation, so all bets are off, but we can at least see how some systems behave.

> > > > > Self-DMA is really not the intent of doing the mapping, but
> > > > > peer-to-peer does have merit.
> > > > >
> > > > > > I also note that if someone tries zero copy transmit out of such an
> > > > > > address, get_user_pages will fail.
> > > > > > I think this means tun zero copy transmit needs to fall back
> > > > > > to copy-from-user on get_user_pages failure.
> > > > > >
> > > > > > Jason, what's your thinking on this?
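Putting the pieces discussed above together, a minimal sketch of what the filtering could look like in the vfio MemoryListener, combining the memory_region_is_ram() check described earlier with the offset_within_address_space heuristic proposed above; this is an illustration against the memory API of this era, not the actual vfio code:

#include "exec/memory.h"

static bool vfio_listener_skipped_section(MemoryRegionSection *section)
{
    hwaddr start = section->offset_within_address_space;

    /* Plain MMIO, ROM devices and the like.  Note that mmap'd BARs
     * (memory_region_init_ram_ptr(), e.g. vfio's own BARs or ivshmem)
     * still look like RAM, which is the gap discussed above. */
    if (!memory_region_is_ram(section->mr)) {
        return true;
    }

    /* The heuristic proposed above: nothing in the top half of the 64-bit
     * space can be RAM, so the transient 0xfffffffffebe0000 mapping from
     * BAR sizing gets skipped instead of being handed to the IOMMU. */
    return start > ~start;
}

static void vfio_listener_region_add(MemoryListener *listener,
                                     MemoryRegionSection *section)
{
    if (vfio_listener_skipped_section(section)) {
        return;
    }
    /* ... vfio_dma_map() the section as usual ... */
}

region_del would need the same check so that map and unmap stay symmetric; and none of this settles the peer-to-peer question, it only keeps the unmappable address away from the IOMMU.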