On Mon, Jan 13, 2014 at 02:39:04PM -0700, Alex Williamson wrote:
> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > On 12.01.2014, at 08:54, Michael S. Tsirkin <m...@redhat.com> wrote:
> > 
> > > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > >> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > >>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > >>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > >>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > >>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > >>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > >>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>> From: Paolo Bonzini <pbonz...@redhat.com>
> > >>>>>>>>>> 
> > >>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > >>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > >>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > >>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > >>>>>>>>>> consequently messing up the computations.
> > >>>>>>>>>> 
> > >>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from
> > >>>>>>>>>> address 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The
> > >>>>>>>>>> region it gets is the newly introduced master abort region, which
> > >>>>>>>>>> is as big as the PCI address space (see pci_bus_init).  Due to a
> > >>>>>>>>>> typo that's only 2^63-1, not 2^64.  But we get it anyway because
> > >>>>>>>>>> phys_page_find ignores the upper bits of the physical address.
> > >>>>>>>>>> In address_space_translate_internal then
> > >>>>>>>>>> 
> > >>>>>>>>>>     diff = int128_sub(section->mr->size, int128_make64(addr));
> > >>>>>>>>>>     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > >>>>>>>>>> 
> > >>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > >>>>>>>>>> 
> > >>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > >>>>>>>>>> 
> > >>>>>>>>>> Reported-by: Luiz Capitulino <lcapitul...@redhat.com>
> > >>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> > >>>>>>>>>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > >>>>>>>>>> ---
> > >>>>>>>>>>  exec.c | 8 ++------
> > >>>>>>>>>>  1 file changed, 2 insertions(+), 6 deletions(-)
> > >>>>>>>>>> 
> > >>>>>>>>>> diff --git a/exec.c b/exec.c
> > >>>>>>>>>> index 7e5ce93..f907f5f 100644
> > >>>>>>>>>> --- a/exec.c
> > >>>>>>>>>> +++ b/exec.c
> > >>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > >>>>>>>>>>  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > >>>>>>>>>>  
> > >>>>>>>>>>  /* Size of the L2 (and L3, etc) page tables.  */
> > >>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > >>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > >>>>>>>>>>  
> > >>>>>>>>>>  #define P_L2_BITS 10
> > >>>>>>>>>>  #define P_L2_SIZE (1 << P_L2_BITS)
> > >>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > >>>>>>>>>>  {
> > >>>>>>>>>>      system_memory = g_malloc(sizeof(*system_memory));
> > >>>>>>>>>>  
> > >>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > >>>>>>>>>> -
> > >>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > >>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > >>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > >>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > >>>>>>>>>>      address_space_init(&address_space_memory, system_memory, "memory");
> > >>>>>>>>>>  
> > >>>>>>>>>>      system_io = g_malloc(sizeof(*system_io));
> > >>>>>>>>> 
> > >>>>>>>>> This seems to have some unexpected consequences around sizing
> > >>>>>>>>> 64bit PCI BARs that I'm not sure how to handle.
> > >>>>>>>> 
> > >>>>>>>> BARs are often disabled during sizing.  Maybe you don't detect the
> > >>>>>>>> BAR being disabled?
> > >>>>>>> 
> > >>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is
> > >>>>>>> doing the sizing and memory region updates for the BARs, vfio is
> > >>>>>>> just a pass-through here.
> > >>>>>> 
> > >>>>>> Sorry, not in the trace below, but yes the sizing seems to be
> > >>>>>> happening while I/O & memory are enabled in the command register.
> > >>>>>> Thanks,
> > >>>>>> 
> > >>>>>> Alex
> > >>>>> 
> > >>>>> OK, then from QEMU's POV this BAR value is not special at all.
> > >>>> 
> > >>>> Unfortunately.
> > >>>> 
> > >>>>>>>>> After this patch I get vfio traces like this:
> > >>>>>>>>> 
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > >>>>>>>>> (save lower 32 bits of BAR)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > >>>>>>>>> (write mask to BAR)
> > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > >>>>>>>>> (read size mask)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > >>>>>>>>> (restore BAR)
> > >>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > >>>>>>>>> (memory region re-mapped)
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > >>>>>>>>> (save upper 32 bits of BAR)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > >>>>>>>>> (write mask to BAR)
> > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > >>>>>>>>> (memory region gets re-mapped with new address)
> > >>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > >>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > >>>>>>>> 
> > >>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > >>>>>>> 
> > >>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > >>>>> 
> > >>>>> Why can't you?  Generally the memory core lets you find out easily.
> > >>>> 
> > >>>> My MemoryListener is set up for &address_space_memory and I then
> > >>>> filter out anything that's not memory_region_is_ram().  This still
> > >>>> gets through, so how do I easily find out?
> > >>>> 
> > >>>>> But in this case it's the vfio device itself that is sized, so for
> > >>>>> sure you know it's MMIO.
> > >>>> 
> > >>>> How so?  I have a MemoryListener as described above and pass
> > >>>> everything through to the IOMMU.  I suppose I could look through all
> > >>>> the VFIODevices and check if the MemoryRegion matches, but that seems
> > >>>> really ugly.
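
For reference, roughly the shape of the filter being described - a
minimal sketch only, not the actual hw/misc/vfio.c listener - is below;
the catch is presumably that an mmap'd BAR registered through
memory_region_init_ram_ptr() also reports as RAM, so it passes this
check and still reaches the IOMMU map path:

    #include "exec/memory.h"

    static void vfio_listener_region_add(MemoryListener *listener,
                                         MemoryRegionSection *section)
    {
        /* Skip anything that is not guest RAM: MMIO, ROM devices, I/O ports. */
        if (!memory_region_is_ram(section->mr)) {
            return;
        }

        /*
         * ... otherwise pin the pages and issue VFIO_IOMMU_MAP_DMA for
         * [section->offset_within_address_space,
         *  section->offset_within_address_space + int128_get64(section->size)).
         */
    }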
> > >>>> 
> > >>>>> Maybe you will have the same issue if there's another device with a
> > >>>>> 64 bit BAR though, like ivshmem?
> > >>>> 
> > >>>> Perhaps, I suspect I'll see anything that registers its BAR
> > >>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > >>> 
> > >>> Must be a 64 bit BAR to trigger the issue though.
> > >>> 
> > >>>>>>> Second, it enables peer-to-peer DMA between devices, which is
> > >>>>>>> something that we might be able to take advantage of with GPU
> > >>>>>>> passthrough.
> > >>>>>>> 
> > >>>>>>>>> Prior to this change, there was no re-map with the
> > >>>>>>>>> fffffffffebe0000 address, presumably because it was beyond the
> > >>>>>>>>> address space of the PCI window.  This address is clearly not in
> > >>>>>>>>> a PCI MMIO space, so why are we allowing it to be realized in the
> > >>>>>>>>> system address space at this location?  Thanks,
> > >>>>>>>>> 
> > >>>>>>>>> Alex
> > >>>>>>>> 
> > >>>>>>>> Why do you think it is not in PCI MMIO space?
> > >>>>>>>> True, the CPU can't access this address but other PCI devices can.
> > >>>>>>> 
> > >>>>>>> What happens on real hardware when an address like this is
> > >>>>>>> programmed to a device?  The CPU doesn't have the physical bits to
> > >>>>>>> access it.  I have serious doubts that another PCI device would be
> > >>>>>>> able to access it either.  Maybe in some limited scenario where the
> > >>>>>>> devices are on the same conventional PCI bus.  In the typical case,
> > >>>>>>> PCI addresses are always limited by some kind of aperture, whether
> > >>>>>>> that's explicit in bridge windows or implicit in hardware design
> > >>>>>>> (and perhaps made explicit in ACPI).  Even if I wanted to filter
> > >>>>>>> these out as noise in vfio, how would I do it in a way that still
> > >>>>>>> allows real 64bit MMIO to be programmed?  PCI has this knowledge, I
> > >>>>>>> hope.  VFIO doesn't.  Thanks,
> > >>>>>>> 
> > >>>>>>> Alex
> > >>>>> 
> > >>>>> AFAIK PCI doesn't have that knowledge as such.  The PCI spec is
> > >>>>> explicit that full 64 bit addresses must be allowed, and hardware
> > >>>>> validation test suites normally check that it actually does work if
> > >>>>> it happens.
> > >>>> 
> > >>>> Sure, PCI devices themselves, but the chipset typically has defined
> > >>>> routing, that's more what I'm referring to.  There are generally only
> > >>>> fixed address windows for RAM vs MMIO.
> > >>> 
> > >>> The physical chipset?  Likely - in the presence of an IOMMU.  Without
> > >>> that, devices can talk to each other without going through the
> > >>> chipset, and the bridge spec is very explicit that full 64 bit
> > >>> addressing must be supported.
> > >>> 
> > >>> So as long as we don't emulate an IOMMU, the guest will normally
> > >>> think it's okay to use any address.
> > >>> 
> > >>>>> Yes, if there's a bridge somewhere on the path, that bridge's
> > >>>>> windows would protect you, but pci already does this filtering: if
> > >>>>> you see this address in the memory map, this means your virtual
> > >>>>> device is on the root bus.
> > >>>>> 
> > >>>>> So I think it's the other way around: if VFIO requires specific
> > >>>>> address ranges to be assigned to devices, it should give this info
> > >>>>> to qemu and qemu can give this to the guest.  Then anything outside
> > >>>>> that range can be ignored by VFIO.
> > >>>> 
> > >>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.
> > >>>> There's currently no way to find out the address width of the IOMMU.
> > >>>> We've been getting by because it's safely close enough to the CPU
> > >>>> address width not to be a concern until we start exposing things at
> > >>>> the top of the 64bit address space.  Maybe I can safely ignore
> > >>>> anything above TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > >>>> 
> > >>>> Alex
> > >>> 
> > >>> I think it's not related to the target CPU at all - it's a host
> > >>> limitation.  So just make up your own constant, maybe depending on
> > >>> the host architecture.  Long term, add an ioctl to query it.
> > >> 
> > >> It's a hardware limitation which I'd imagine has some loose ties to
> > >> the physical address bits of the CPU.
> > >> 
> > >>> Also, we can add a fwcfg interface to tell the bios that it should
> > >>> avoid placing BARs above some address.
> > >> 
> > >> That doesn't help this case, it's a spurious mapping caused by sizing
> > >> the BARs with them enabled.  We may still want such a thing to feed
> > >> into building ACPI tables though.
> > > 
> > > Well, the point is that if you want the BIOS to avoid specific
> > > addresses, you need to tell it what to avoid.  But neither BIOS nor
> > > ACPI actually cover the range above 2^48 ATM, so it's not a high
> > > priority.
> > > 
> > >>> Since it's a vfio limitation I think it should be a vfio API, along
> > >>> the lines of vfio_get_addr_space_bits(void).
> > >>> (Is this true btw?  legacy assignment doesn't have this problem?)
> > >> 
> > >> It's an IOMMU hardware limitation, legacy assignment has the same
> > >> problem.  It looks like legacy will abort() in QEMU for the failed
> > >> mapping and I'm planning to tighten vfio to also kill the VM for
> > >> failed mappings.  In the short term, I think I'll ignore any mappings
> > >> above TARGET_PHYS_ADDR_SPACE_BITS,
> > > 
> > > That seems very wrong.  It will still fail on an x86 host if we are
> > > emulating a CPU with full 64 bit addressing.  The limitation is on the
> > > host side; there's no real reason to tie it to the target.
> 
> I doubt vfio would be the only thing broken in that case.

A bit cryptic.

    target-s390x/cpu.h:#define TARGET_PHYS_ADDR_SPACE_BITS 64

So qemu does emulate at least one full 64-bit CPU.  It's possible that
something limits the PCI BAR address there; it might or might not be
architectural.

> > >> long term vfio already has an IOMMU info
> > >> ioctl that we could use to return this information, but we'll need to
> > >> figure out how to get it out of the IOMMU driver first.
> > >> Thanks,
> > >> 
> > >> Alex
> > > 
> > > Short term, just assume 48 bits on x86.
> 
> I hate to pick an arbitrary value since we have a very specific mapping
> we're trying to avoid.

Well, it's not a specific mapping really.  Any mapping outside the host
IOMMU would not work.  Guests happen to trigger it while sizing, but
again, they are allowed to write anything into BARs really.

> Perhaps a better option is to skip anything where:
> 
>     MemoryRegionSection.offset_within_address_space >
>         ~MemoryRegionSection.offset_within_address_space

This merely checks that the high bit is 1, doesn't it?  So this
equivalently assumes 63 bits on x86; if you prefer 63 and not 48, that's
fine with me.

> > > We need to figure out what's the limitation on ppc and arm -
> > > maybe there's none and it can address the full 64 bit range.
> > 
> > IIUC on PPC and ARM you always have BAR windows where things can get
> > mapped into.  Unlike x86, where the full physical address range can be
> > overlaid by BARs.
> > 
> > Or did I misunderstand the question?
> 
> Sounds right, if either BAR mappings outside the window will not be
> realized in the memory space or the IOMMU has a full 64bit address
> space, there's no problem.  Here we have an intermediate step in the BAR
> sizing producing a stray mapping that the IOMMU hardware can't handle.
> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> this then causes space and time overhead until the tables are pruned
> back down.  Thanks,
> 
> Alex

In the common case of a single VFIO device per IOMMU, you really should
not add its own BARs in the IOMMU.  That's not a complete fix, but it
addresses the overhead concern that you mention here.

-- 
MST
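
To spell out the equivalence above (the helper names here are made up
for illustration): for an unsigned 64-bit value x, ~x == UINT64_MAX - x,
so x > ~x holds exactly when bit 63 is set, and the proposed check
behaves like a 63-bit cutoff.  A width-parameterised test is what an
explicit 48-bit assumption would look like:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* The proposed test: true exactly when bit 63 of addr is set. */
    static bool above_high_bit(uint64_t addr)
    {
        return addr > ~addr;
    }

    /* Explicit variant: true when addr does not fit in 'bits' bits (e.g. 48). */
    static bool above_width(uint64_t addr, unsigned bits)
    {
        return bits < 64 && (addr >> bits) != 0;
    }

    int main(void)
    {
        uint64_t stray = 0xfffffffffebe0000ULL;  /* the mapping from the trace */
        uint64_t sane  = 0x00000000febe0000ULL;  /* the BAR's original address */

        assert(above_high_bit(stray) && above_width(stray, 48));
        assert(!above_high_bit(sane) && !above_width(sane, 48));
        assert(above_high_bit(stray) == (stray >> 63));  /* same as a bit-63 test */

        printf("stray filtered: 63-bit rule %d, 48-bit rule %d\n",
               above_high_bit(stray), above_width(stray, 48));
        return 0;
    }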