On Mon, Jan 13, 2014 at 02:39:04PM -0700, Alex Williamson wrote:
> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > On 12.01.2014, at 08:54, Michael S. Tsirkin <m...@redhat.com> wrote:
> > 
> > > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > >> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > >>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > >>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > >>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > >>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > >>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > >>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>> From: Paolo Bonzini <pbonz...@redhat.com>
> > >>>>>>>>>> 
> > >>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > >>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > >>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > >>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > >>>>>>>>>> consequently messing up the computations.
> > >>>>>>>>>> 
> > >>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from
> > >>>>>>>>>> address 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The
> > >>>>>>>>>> region it gets is the newly introduced master abort region, which
> > >>>>>>>>>> is as big as the PCI address space (see pci_bus_init).  Due to a
> > >>>>>>>>>> typo that's only 2^63-1, not 2^64.  But we get it anyway because
> > >>>>>>>>>> phys_page_find ignores the upper bits of the physical address.
> > >>>>>>>>>> In address_space_translate_internal then
> > >>>>>>>>>> 
> > >>>>>>>>>>     diff = int128_sub(section->mr->size, int128_make64(addr));
> > >>>>>>>>>>     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > >>>>>>>>>> 
> > >>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > >>>>>>>>>> 
> > >>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > >>>>>>>>>> 
> > >>>>>>>>>> Reported-by: Luiz Capitulino <lcapitul...@redhat.com>
> > >>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> > >>>>>>>>>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > >>>>>>>>>> ---
> > >>>>>>>>>>  exec.c | 8 ++------
> > >>>>>>>>>>  1 file changed, 2 insertions(+), 6 deletions(-)
> > >>>>>>>>>> 
> > >>>>>>>>>> diff --git a/exec.c b/exec.c
> > >>>>>>>>>> index 7e5ce93..f907f5f 100644
> > >>>>>>>>>> --- a/exec.c
> > >>>>>>>>>> +++ b/exec.c
> > >>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > >>>>>>>>>>  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > >>>>>>>>>>  
> > >>>>>>>>>>  /* Size of the L2 (and L3, etc) page tables.  */
> > >>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > >>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > >>>>>>>>>>  
> > >>>>>>>>>>  #define P_L2_BITS 10
> > >>>>>>>>>>  #define P_L2_SIZE (1 << P_L2_BITS)
> > >>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > >>>>>>>>>>  {
> > >>>>>>>>>>      system_memory = g_malloc(sizeof(*system_memory));
> > >>>>>>>>>>  
> > >>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > >>>>>>>>>> -
> > >>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > >>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > >>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > >>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > >>>>>>>>>>      address_space_init(&address_space_memory, system_memory, "memory");
> > >>>>>>>>>>  
> > >>>>>>>>>>      system_io = g_malloc(sizeof(*system_io));
> > >>>>>>>>> 
> > >>>>>>>>> This seems to have some unexpected consequences around sizing
> > >>>>>>>>> 64bit PCI BARs that I'm not sure how to handle.
> > >>>>>>>> 
> > >>>>>>>> BARs are often disabled during sizing.  Maybe you don't detect the
> > >>>>>>>> BAR being disabled?
> > >>>>>>> 
> > >>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is
> > >>>>>>> doing the sizing and memory region updates for the BARs, vfio is
> > >>>>>>> just a pass-through here.
> > >>>>>> 
> > >>>>>> Sorry, not in the trace below, but yes the sizing seems to be
> > >>>>>> happening while I/O & memory are enabled in the command register.
> > >>>>>> Thanks,
> > >>>>>> 
> > >>>>>> Alex
> > >>>>> 
> > >>>>> OK, then from QEMU's POV this BAR value is not special at all.
> > >>>> 
> > >>>> Unfortunately.
> > >>>> 
> > >>>>>>>>> After this patch I get vfio traces like this:
> > >>>>>>>>> 
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > >>>>>>>>> (save lower 32 bits of BAR)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > >>>>>>>>> (write mask to BAR)
> > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > >>>>>>>>> (read size mask)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > >>>>>>>>> (restore BAR)
> > >>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > >>>>>>>>> (memory region re-mapped)
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > >>>>>>>>> (save upper 32 bits of BAR)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > >>>>>>>>> (write mask to BAR)
> > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > >>>>>>>>> (memory region gets re-mapped with new address)
> > >>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > >>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > >>>>>>>> 
> > >>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > >>>>>>> 
> > >>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > >>>>> 
> > >>>>> Why can't you?  Generally the memory core lets you find out easily.
> > >>>> 
> > >>>> My MemoryListener is set up for &address_space_memory and I then
> > >>>> filter out anything that's not memory_region_is_ram().  This still
> > >>>> gets through, so how do I easily find out?
> > >>>> 
> > >>>>> But in this case it's the vfio device itself that is sized, so for
> > >>>>> sure you know it's MMIO.
> > >>>> 
> > >>>> How so?  I have a MemoryListener as described above and pass
> > >>>> everything through to the IOMMU.  I suppose I could look through all
> > >>>> the VFIODevices and check if the MemoryRegion matches, but that seems
> > >>>> really ugly.
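
For reference, roughly the shape of the filter being described - a
minimal sketch only, not the actual hw/misc/vfio.c listener - is below;
the catch is presumably that an mmap'd BAR registered through
memory_region_init_ram_ptr() also reports as RAM, so it passes this
check and still reaches the IOMMU map path:

    #include "exec/memory.h"

    static void vfio_listener_region_add(MemoryListener *listener,
                                         MemoryRegionSection *section)
    {
        /* Skip anything that is not guest RAM: MMIO, ROM devices, I/O ports. */
        if (!memory_region_is_ram(section->mr)) {
            return;
        }

        /*
         * ... otherwise pin the pages and issue VFIO_IOMMU_MAP_DMA for
         * [section->offset_within_address_space,
         *  section->offset_within_address_space + int128_get64(section->size)).
         */
    }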
> > >>>> 
> > >>>>> Maybe you will have the same issue if there's another device with a
> > >>>>> 64 bit BAR though, like ivshmem?
> > >>>> 
> > >>>> Perhaps, I suspect I'll see anything that registers its BAR
> > >>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > >>> 
> > >>> Must be a 64 bit BAR to trigger the issue though.
> > >>> 
> > >>>>>>> Second, it enables peer-to-peer DMA between devices, which is
> > >>>>>>> something that we might be able to take advantage of with GPU
> > >>>>>>> passthrough.
> > >>>>>>> 
> > >>>>>>>>> Prior to this change, there was no re-map with the
> > >>>>>>>>> fffffffffebe0000 address, presumably because it was beyond the
> > >>>>>>>>> address space of the PCI window.  This address is clearly not in
> > >>>>>>>>> a PCI MMIO space, so why are we allowing it to be realized in the
> > >>>>>>>>> system address space at this location?  Thanks,
> > >>>>>>>>> 
> > >>>>>>>>> Alex
> > >>>>>>>> 
> > >>>>>>>> Why do you think it is not in PCI MMIO space?
> > >>>>>>>> True, the CPU can't access this address but other PCI devices can.
> > >>>>>>> 
> > >>>>>>> What happens on real hardware when an address like this is
> > >>>>>>> programmed to a device?  The CPU doesn't have the physical bits to
> > >>>>>>> access it.  I have serious doubts that another PCI device would be
> > >>>>>>> able to access it either.  Maybe in some limited scenario where the
> > >>>>>>> devices are on the same conventional PCI bus.  In the typical case,
> > >>>>>>> PCI addresses are always limited by some kind of aperture, whether
> > >>>>>>> that's explicit in bridge windows or implicit in hardware design
> > >>>>>>> (and perhaps made explicit in ACPI).  Even if I wanted to filter
> > >>>>>>> these out as noise in vfio, how would I do it in a way that still
> > >>>>>>> allows real 64bit MMIO to be programmed?  PCI has this knowledge, I
> > >>>>>>> hope.  VFIO doesn't.  Thanks,
> > >>>>>>> 
> > >>>>>>> Alex
> > >>>>> 
> > >>>>> AFAIK PCI doesn't have that knowledge as such.  The PCI spec is
> > >>>>> explicit that full 64 bit addresses must be allowed, and hardware
> > >>>>> validation test suites normally check that it actually does work if
> > >>>>> it happens.
> > >>>> 
> > >>>> Sure, PCI devices themselves, but the chipset typically has defined
> > >>>> routing, that's more what I'm referring to.  There are generally only
> > >>>> fixed address windows for RAM vs MMIO.
> > >>> 
> > >>> The physical chipset?  Likely - in the presence of an IOMMU.  Without
> > >>> that, devices can talk to each other without going through the
> > >>> chipset, and the bridge spec is very explicit that full 64 bit
> > >>> addressing must be supported.
> > >>> 
> > >>> So as long as we don't emulate an IOMMU, the guest will normally
> > >>> think it's okay to use any address.
> > >>> 
> > >>>>> Yes, if there's a bridge somewhere on the path, that bridge's
> > >>>>> windows would protect you, but pci already does this filtering: if
> > >>>>> you see this address in the memory map, this means your virtual
> > >>>>> device is on the root bus.
> > >>>>> 
> > >>>>> So I think it's the other way around: if VFIO requires specific
> > >>>>> address ranges to be assigned to devices, it should give this info
> > >>>>> to qemu and qemu can give this to the guest.  Then anything outside
> > >>>>> that range can be ignored by VFIO.
> > >>>> 
> > >>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.
> > >>>> There's currently no way to find out the address width of the IOMMU.
> > >>>> We've been getting by because it's safely close enough to the CPU
> > >>>> address width not to be a concern until we start exposing things at
> > >>>> the top of the 64bit address space.  Maybe I can safely ignore
> > >>>> anything above TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > >>>> 
> > >>>> Alex
> > >>> 
> > >>> I think it's not related to the target CPU at all - it's a host
> > >>> limitation.  So just make up your own constant, maybe depending on
> > >>> the host architecture.  Long term, add an ioctl to query it.
> > >> 
> > >> It's a hardware limitation which I'd imagine has some loose ties to
> > >> the physical address bits of the CPU.
> > >> 
> > >>> Also, we can add a fwcfg interface to tell the bios that it should
> > >>> avoid placing BARs above some address.
> > >> 
> > >> That doesn't help this case, it's a spurious mapping caused by sizing
> > >> the BARs with them enabled.  We may still want such a thing to feed
> > >> into building ACPI tables though.
> > > 
> > > Well, the point is that if you want the BIOS to avoid specific
> > > addresses, you need to tell it what to avoid.  But neither BIOS nor
> > > ACPI actually cover the range above 2^48 ATM, so it's not a high
> > > priority.
> > > 
> > >>> Since it's a vfio limitation I think it should be a vfio API, along
> > >>> the lines of vfio_get_addr_space_bits(void).
> > >>> (Is this true btw?  legacy assignment doesn't have this problem?)
> > >> 
> > >> It's an IOMMU hardware limitation, legacy assignment has the same
> > >> problem.  It looks like legacy will abort() in QEMU for the failed
> > >> mapping and I'm planning to tighten vfio to also kill the VM for
> > >> failed mappings.  In the short term, I think I'll ignore any mappings
> > >> above TARGET_PHYS_ADDR_SPACE_BITS,
> > > 
> > > That seems very wrong.  It will still fail on an x86 host if we are
> > > emulating a CPU with full 64 bit addressing.  The limitation is on the
> > > host side; there's no real reason to tie it to the target.
> 
> I doubt vfio would be the only thing broken in that case.

A bit cryptic.

    target-s390x/cpu.h:#define TARGET_PHYS_ADDR_SPACE_BITS 64

So qemu does emulate at least one full 64-bit CPU.  It's possible that
something limits the PCI BAR address there; it might or might not be
architectural.

> > >> long term vfio already has an IOMMU info
> > >> ioctl that we could use to return this information, but we'll need to
> > >> figure out how to get it out of the IOMMU driver first.
> > >> Thanks,
> > >> 
> > >> Alex
> > > 
> > > Short term, just assume 48 bits on x86.
> 
> I hate to pick an arbitrary value since we have a very specific mapping
> we're trying to avoid.

Well, it's not a specific mapping really.  Any mapping outside the host
IOMMU would not work.  Guests happen to trigger it while sizing, but
again, they are allowed to write anything into BARs really.

> Perhaps a better option is to skip anything where:
> 
>     MemoryRegionSection.offset_within_address_space >
>         ~MemoryRegionSection.offset_within_address_space

This merely checks that the high bit is 1, doesn't it?  So this
equivalently assumes 63 bits on x86; if you prefer 63 and not 48, that's
fine with me.

> > > We need to figure out what's the limitation on ppc and arm -
> > > maybe there's none and it can address the full 64 bit range.
> > 
> > IIUC on PPC and ARM you always have BAR windows where things can get
> > mapped into.  Unlike x86, where the full physical address range can be
> > overlaid by BARs.
> > 
> > Or did I misunderstand the question?
> 
> Sounds right, if either BAR mappings outside the window will not be
> realized in the memory space or the IOMMU has a full 64bit address
> space, there's no problem.  Here we have an intermediate step in the BAR
> sizing producing a stray mapping that the IOMMU hardware can't handle.
> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> this then causes space and time overhead until the tables are pruned
> back down.  Thanks,
> 
> Alex

In the common case of a single VFIO device per IOMMU, you really should
not add its own BARs in the IOMMU.  That's not a complete fix, but it
addresses the overhead concern that you mention here.

-- 
MST
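
To spell out the equivalence above (the helper names here are made up
for illustration): for an unsigned 64-bit value x, ~x == UINT64_MAX - x,
so x > ~x holds exactly when bit 63 is set, and the proposed check
behaves like a 63-bit cutoff.  A width-parameterised test is what an
explicit 48-bit assumption would look like:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* The proposed test: true exactly when bit 63 of addr is set. */
    static bool above_high_bit(uint64_t addr)
    {
        return addr > ~addr;
    }

    /* Explicit variant: true when addr does not fit in 'bits' bits (e.g. 48). */
    static bool above_width(uint64_t addr, unsigned bits)
    {
        return bits < 64 && (addr >> bits) != 0;
    }

    int main(void)
    {
        uint64_t stray = 0xfffffffffebe0000ULL;  /* the mapping from the trace */
        uint64_t sane  = 0x00000000febe0000ULL;  /* the BAR's original address */

        assert(above_high_bit(stray) && above_width(stray, 48));
        assert(!above_high_bit(sane) && !above_width(sane, 48));
        assert(above_high_bit(stray) == (stray >> 63));  /* same as a bit-63 test */

        printf("stray filtered: 63-bit rule %d, 48-bit rule %d\n",
               above_high_bit(stray), above_width(stray, 48));
        return 0;
    }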