> On 13.01.2014 at 22:39, Alex Williamson <alex.william...@redhat.com> wrote:
>
>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <m...@redhat.com> wrote:
>>>
>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>> From: Paolo Bonzini <pbonz...@redhat.com>
>>>>>>>>>>>>
>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>>>>
>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from
>>>>>>>>>>>> address 0xffffffffffffffe6 to 0xffffffffffffffff inclusive. The
>>>>>>>>>>>> region it gets is the newly introduced master abort region, which
>>>>>>>>>>>> is as big as the PCI address space (see pci_bus_init). Due to a
>>>>>>>>>>>> typo that's only 2^63-1, not 2^64. But we get it anyway because
>>>>>>>>>>>> phys_page_find ignores the upper bits of the physical address.
>>>>>>>>>>>> In address_space_translate_internal then
>>>>>>>>>>>>
>>>>>>>>>>>> diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>>> *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>>>>
>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>>>>
>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>>>>
>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitul...@redhat.com>
>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> exec.c | 8 ++------
>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>>>> --- a/exec.c
>>>>>>>>>>>> +++ b/exec.c
>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>>>>
>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables. */
>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>>>>
>>>>>>>>>>>> #define P_L2_BITS 10
>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>>>> {
>>>>>>>>>>>> system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>>>>
>>>>>>>>>>>> - assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>>>> -
>>>>>>>>>>>> - memory_region_init(system_memory, NULL, "system",
>>>>>>>>>>>> - ADDR_SPACE_BITS == 64 ?
>>>>>>>>>>>> - UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>>>> + memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>>> address_space_init(&address_space_memory, system_memory, "memory");
>>>>>>>>>>>>
>>>>>>>>>>>> system_io = g_malloc(sizeof(*system_io));
>>>>>>>>>>>
>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit
>>>>>>>>>>> PCI BARs that I'm not sure how to handle.
>>>>>>>>>>
>>>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>>>> don't detect BAR being disabled?
>>>>>>>>>
>>>>>>>>> See the trace below, the BARs are not disabled. QEMU pci-core is
>>>>>>>>> doing the sizing and memory region updates for the BARs, vfio is
>>>>>>>>> just a pass-through here.
>>>>>>>>
>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be
>>>>>>>> happening while I/O & memory are enabled in the command register.
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>>
>>>>>>> OK then from QEMU POV this BAR value is not special at all.
>>>>>>
>>>>>> Unfortunately
>>>>>>
>>>>>>>>>>> After this patch I get vfio traces like this:
>>>>>>>>>>>
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>>>> (read size mask)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>>>> (restore BAR)
>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>>>> (memory region re-mapped)
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000,
>>>>>>>>>>> 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>>>
>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>>>>
>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
>>>>>>>
>>>>>>> Why can't you? Generally memory core lets you find out easily.
>>>>>>
>>>>>> My MemoryListener is set up for &address_space_memory and I then filter
>>>>>> out anything that's not memory_region_is_ram(). This still gets
>>>>>> through, so how do I easily find out?
>>>>>>
>>>>>>> But in this case it's the vfio device itself that is sized, so for
>>>>>>> sure you know it's MMIO.
>>>>>>
>>>>>> How so? I have a MemoryListener as described above and pass everything
>>>>>> through to the IOMMU. I suppose I could look through all the
>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems
>>>>>> really ugly.
>>>>>>
>>>>>>> Maybe you will have the same issue if there's another device with a
>>>>>>> 64 bit BAR though, like ivshmem?
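The MemoryListener filter that keeps coming up here would look roughly like the sketch below. This is only an illustration based on the early-2014 memory API (MemoryListener, MemoryRegionSection, memory_region_is_ram()); my_iommu_map() is a made-up placeholder, not the real vfio code path:

#include "exec/memory.h"

/* Hypothetical placeholder for the actual map operation (in vfio this
 * eventually becomes a VFIO_IOMMU_MAP_DMA ioctl). */
static void my_iommu_map(hwaddr iova, void *vaddr, uint64_t size);

/* Sketch of a region_add hook that forwards only RAM-backed sections
 * to the IOMMU.  A BAR whose MemoryRegion was created with
 * memory_region_init_ram()/memory_region_init_ram_ptr() still passes
 * the memory_region_is_ram() test, which is why the transient
 * 0xfffffffffebe0000 mapping above slips through this filter. */
static void sketch_region_add(MemoryListener *listener,
                              MemoryRegionSection *section)
{
    hwaddr iova = section->offset_within_address_space;
    uint64_t size = int128_get64(section->size);
    void *vaddr;

    if (!memory_region_is_ram(section->mr)) {
        return;                     /* pure MMIO, ROM devices, etc. */
    }

    vaddr = memory_region_get_ram_ptr(section->mr) +
            section->offset_within_region;
    my_iommu_map(iova, vaddr, size);
}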
>>>>>>
>>>>>> Perhaps, I suspect I'll see anything that registers its BAR
>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
>>>>>
>>>>> Must be a 64 bit BAR to trigger the issue though.
>>>>>
>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is
>>>>>>>>> something that we might be able to take advantage of with GPU
>>>>>>>>> passthrough.
>>>>>>>>>
>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>>>> address, presumably because it was beyond the address space of the
>>>>>>>>>>> PCI window. This address is clearly not in a PCI MMIO space, so why
>>>>>>>>>>> are we allowing it to be realized in the system address space at
>>>>>>>>>>> this location? Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>>>> True, the CPU can't access this address but other PCI devices can.
>>>>>>>>>
>>>>>>>>> What happens on real hardware when an address like this is programmed
>>>>>>>>> to a device? The CPU doesn't have the physical bits to access it. I
>>>>>>>>> have serious doubts that another PCI device would be able to access
>>>>>>>>> it either. Maybe in some limited scenario where the devices are on
>>>>>>>>> the same conventional PCI bus. In the typical case, PCI addresses are
>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made
>>>>>>>>> explicit in ACPI). Even if I wanted to filter these out as noise in
>>>>>>>>> vfio, how would I do it in a way that still allows real 64bit MMIO to
>>>>>>>>> be programmed? PCI has this knowledge, I hope. VFIO doesn't. Thanks,
>>>>>>>>>
>>>>>>>>> Alex
>>>>>>>
>>>>>>> AFAIK PCI doesn't have that knowledge as such. The PCI spec is explicit
>>>>>>> that full 64 bit addresses must be allowed, and hardware validation
>>>>>>> test suites normally check that it actually does work if it happens.
>>>>>>
>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>>>> routing; that's more what I'm referring to. There are generally only
>>>>>> fixed address windows for RAM vs MMIO.
>>>>>
>>>>> The physical chipset? Likely - in the presence of an IOMMU.
>>>>> Without that, devices can talk to each other without going through the
>>>>> chipset, and the bridge spec is very explicit that full 64 bit
>>>>> addressing must be supported.
>>>>>
>>>>> So as long as we don't emulate an IOMMU, the guest will normally think
>>>>> it's okay to use any address.
>>>>>
>>>>>>> Yes, if there's a bridge somewhere on the path, that bridge's windows
>>>>>>> would protect you, but PCI already does this filtering: if you see
>>>>>>> this address in the memory map, it means your virtual device is on
>>>>>>> the root bus.
>>>>>>>
>>>>>>> So I think it's the other way around: if VFIO requires specific
>>>>>>> address ranges to be assigned to devices, it should give this info to
>>>>>>> qemu and qemu can give this to the guest. Then anything outside that
>>>>>>> range can be ignored by VFIO.
>>>>>>
>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO. There's
>>>>>> currently no way to find out the address width of the IOMMU. We've been
>>>>>> getting by because it's safely close enough to the CPU address width to
>>>>>> not be a concern until we start exposing things at the top of the 64bit
>>>>>> address space. Maybe I can safely ignore anything above
>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now. Thanks,
>>>>>>
>>>>>> Alex
>>>>>
>>>>> I think it's not related to the target CPU at all - it's a host
>>>>> limitation. So just make up your own constant, maybe depending on host
>>>>> architecture. Long term, add an ioctl to query it.
>>>>
>>>> It's a hardware limitation which I'd imagine has some loose ties to the
>>>> physical address bits of the CPU.
>>>>
>>>>> Also, we can add a fwcfg interface to tell the BIOS that it should
>>>>> avoid placing BARs above some address.
>>>>
>>>> That doesn't help this case; it's a spurious mapping caused by sizing
>>>> the BARs with them enabled. We may still want such a thing to feed into
>>>> building ACPI tables though.
>>>
>>> Well, the point is that if you want the BIOS to avoid specific addresses,
>>> you need to tell it what to avoid. But neither BIOS nor ACPI actually
>>> cover the range above 2^48 ATM, so it's not a high priority.
>>>
>>>>> Since it's a vfio limitation, I think it should be a vfio API, along
>>>>> the lines of vfio_get_addr_space_bits(void).
>>>>> (Is this true btw? Legacy assignment doesn't have this problem?)
>>>>
>>>> It's an IOMMU hardware limitation; legacy assignment has the same
>>>> problem. It looks like legacy will abort() in QEMU for the failed
>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>>>> mappings. In the short term, I think I'll ignore any mappings above
>>>> TARGET_PHYS_ADDR_SPACE_BITS,
>>>
>>> That seems very wrong. It will still fail on an x86 host if we are
>>> emulating a CPU with full 64 bit addressing. The limitation is on the
>>> host side; there's no real reason to tie it to the target.
>
> I doubt vfio would be the only thing broken in that case.
>
>>>> long term vfio already has an IOMMU info ioctl that we could use to
>>>> return this information, but we'll need to figure out how to get it out
>>>> of the IOMMU driver first. Thanks,
>>>>
>>>> Alex
>>>
>>> Short term, just assume 48 bits on x86.
>
> I hate to pick an arbitrary value since we have a very specific mapping
> we're trying to avoid. Perhaps a better option is to skip anything where:
>
> MemoryRegionSection.offset_within_address_space > ~MemoryRegionSection.offset_within_address_space
>
>>> We need to figure out what's the limitation on ppc and arm -
>>> maybe there's none and it can address the full 64 bit range.
>>
>> IIUC on PPC and ARM you always have BAR windows that things can get
>> mapped into, unlike x86 where the full physical address range can be
>> overlaid by BARs.
>>
>> Or did I misunderstand the question?
>
> Sounds right. If either BAR mappings outside the window will not be
> realized in the memory space or the IOMMU has a full 64bit address
> space, there's no problem. Here we have an intermediate step in the BAR
> sizing producing a stray mapping that the IOMMU hardware can't handle.
> Even if we could handle it, it's not clear that we want to. On AMD-Vi
> the IOMMU page tables can grow to 6 levels deep. A stray mapping like
> this then causes space and time overhead until the tables are pruned
> back down. Thanks,
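The skip-the-top-of-the-address-space idea above could be expressed as something like the following sketch. The start > ~start comparison is true exactly when the top address bit is set, and the 48-bit fallback is only an assumed host IOMMU width for illustration; as discussed, there is currently no ioctl to query the real value:

/* Sketch only: decide whether a section is beyond what the host IOMMU
 * can presumably map.  ASSUMED_HOST_IOMMU_BITS is a guess, not a
 * queried value. */
#define ASSUMED_HOST_IOMMU_BITS 48

static bool section_beyond_iommu(MemoryRegionSection *section)
{
    hwaddr start = section->offset_within_address_space;

    if (start > ~start) {
        /* Top bit set, e.g. the stray 0xfffffffffebe0000 mapping. */
        return true;
    }
    return start >= (1ULL << ASSUMED_HOST_IOMMU_BITS);
}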
I thought sizing is hard-defined as setting the BAR to -1? Can't we check
for that one special case and treat it as "not mapped, but tell the guest
the size in config space"?

Alex

>
> Alex
>
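For reference, the sizing handshake that produces the transient value in the trace above looks roughly like this (an illustration, not QEMU code; cfg_read/cfg_write stand in for whatever config space accessors are in use):

#include <stdint.h>

/* Size one 32-bit half of a BAR: save it, write all 1s (the "set to -1"
 * probe), read back the mask of writable bits, then restore the saved
 * value.  While the upper half still holds 0xffffffff and the lower
 * half already holds 0xfebe0000 again, the BAR transiently decodes at
 * 0xfffffffffebe0000, which is the mapping the IOMMU rejects above. */
static uint32_t size_bar_half(int offset,
                              uint32_t (*cfg_read)(int),
                              void (*cfg_write)(int, uint32_t))
{
    uint32_t saved = cfg_read(offset);
    uint32_t mask;

    cfg_write(offset, 0xffffffff);
    mask = cfg_read(offset);
    cfg_write(offset, saved);
    return mask;
}

/* For the lower half read back in the trace (0xffffc004), clearing the
 * four type bits and taking the 32-bit two's complement gives
 * ~0xffffc000 + 1 = 0x4000, i.e. a 16 KB BAR.  A 64-bit BAR combines
 * both halves before doing the same computation in 64 bits. */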