On Wed, 2015-07-08 at 16:26 +1000, Alexey Kardashevskiy wrote:
> On 07/08/2015 02:24 AM, Alex Williamson wrote:
> > On Tue, 2015-07-07 at 22:11 +1000, Alexey Kardashevskiy wrote:
> >> On 07/07/2015 02:13 AM, Alex Williamson wrote:
> >>> On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote:
> >>>> On 07/06/2015 11:42 PM, Alex Williamson wrote:
> >>>>> On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote:
> >>>>>> This makes use of the new "memory registering" feature. The idea is
> >>>>>> to provide the userspace ability to notify the host kernel about pages
> >>>>>> which are going to be used for DMA. Having this information, the host
> >>>>>> kernel can pin them all once per user process, do locked pages
> >>>>>> accounting (once) and not spent time on doing that in real time with
> >>>>>> possible failures which cannot be handled nicely in some cases.
> >>>>>>
> >>>>>> This adds a guest RAM memory listener which notifies a VFIO container
> >>>>>> about memory which needs to be pinned/unpinned. VFIO MMIO regions
> >>>>>> (i.e. "skip dump" regions) are skipped.
> >>>>>>
> >>>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> >>>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this
> >>>>>> does not call it when v2 is detected and enabled.
> >>>>>>
> >>>>>> This does not change the guest visible interface.
> >>>>>>
> >>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> >>>>>> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
> >>>>>> ---
> >>>>>> Changes:
> >>>>>> v9:
> >>>>>> * since there is no more SPAPR-specific data in container::iommu_data,
> >>>>>> the memory preregistration fields are common and potentially can be
> >>>>>> used by other architectures
> >>>>>>
> >>>>>> v7:
> >>>>>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
> >>>>>> * s'ramlistener'register_listener'
> >>>>>>
> >>>>>> v6:
> >>>>>> * fixed commit log (s/guest/userspace/), added note about no guest
> >>>>>> visible change
> >>>>>> * fixed error checking if ram registration failed
> >>>>>> * added alignment check for section->offset_within_region
> >>>>>>
> >>>>>> v5:
> >>>>>> * simplified the patch
> >>>>>> * added trace points
> >>>>>> * added round_up() for the size
> >>>>>> * SPAPR IOMMU v2 used
> >>>>>> ---
> >>>>>>  hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
> >>>>>>  include/hw/vfio/vfio-common.h |   3 ++
> >>>>>>  trace-events                  |   1 +
> >>>>>>  3 files changed, 104 insertions(+), 9 deletions(-)
> >>>>>>
> >>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>>> index 8eacfd7..0c7ba8c 100644
> >>>>>> --- a/hw/vfio/common.c
> >>>>>> +++ b/hw/vfio/common.c
> >>>>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
> >>>>>>
> >>>>>>      memory_listener_unregister(&container->iommu_data.type1.listener);
> >>>>>>  }
> >>>>>>
> >>>>>> +static void vfio_ram_do_region(VFIOContainer *container,
> >>>>>> +                               MemoryRegionSection *section, unsigned long req)
> >>>>>> +{
> >>>>>> +    int ret;
> >>>>>> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
> >>>>>
> >>>>> This function is not as general as the name would imply, it's spapr
> >>>>> specific due to this. How about vfio_spapr_register_memory() with a
> >>>>> bool parameter toggling register vs unregister so we're not passing an
> >>>>> arbitrary ioctl number?
> >>>>
> >>>> Ok.
> >>>> Although I am quite often asked not to do such a thing and rather add 2
> >>>> helpers (reg/unreg, do/undo, etc) instead and reuse common bits.
> >>>
> >>> I'm not a fan of functions that do the reverse process based on a bool
> >>> arg either, but I dislike them less than passing an arbitrary ioctl
> >>> number for a parameter. The former is ugly, but the latter is difficult
> >>> to use and difficult to maintain because it would be subtle later to
> >>> spot an unsupported ioctl being passed to the function.
> >>>
> >>>>>> +
> >>>>>> +    if (!memory_region_is_ram(section->mr) ||
> >>>>>> +        memory_region_is_skip_dump(section->mr)) {
> >>>>>> +        return;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
> >>>>>
> >>>>> s/getpagesize()/qemu_real_host_page_size/?
> >>>>
> >>>> Oh, right, I guess it reached upstream now.
> >>>>
> >>>>>> +        error_report("%s received unaligned region", __func__);
> >>>>>> +        return;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> >>>>>> +        section->offset_within_region;
> >>>>>> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
> >>>>>> +
> >>>>>> +    ret = ioctl(container->fd, req, &reg);
> >>>>>> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
> >>>>>> +                            ret ? -errno : 0);
> >>>>>> +    if (!ret) {
> >>>>>> +        return;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    /*
> >>>>>> +     * On the initfn path, store the first error in the container so we
> >>>>>> +     * can gracefully fail. Runtime, there's not much we can do other
> >>>>>> +     * than throw a hardware error.
> >>>>>> +     */
> >>>>>> +    if (!container->iommu_data.ram_reg_initialized) {
> >>>>>> +        if (!container->iommu_data.ram_reg_error) {
> >>>>>> +            container->iommu_data.ram_reg_error = -errno;
> >>>>>> +        }
> >>>>>> +    } else {
> >>>>>> +        hw_error("vfio: RAM registering failed, unable to continue");
> >>>>>> +    }
> >>>>>
> >>>>> I'd rather see:
> >>>>>
> >>>>> if (ret) {
> >>>>>     if (!container...) {
> >>>>>         ...
> >>>>>     } else {
> >>>>>         ...
> >>>>>     }
> >>>>> }
> >>>>>
> >>>>> Exiting early on success and otherwise falling into error handling is a
> >>>>> strange code flow.
> >>>>
> >>>> Ok... vfio_dma_map() does not follow this rule so I thought it is not
> >>>> that strict :)
> >>>
> >>> It would be nice to clean it up there too.
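Putting those two comments together, the helper being asked for would look
something like the sketch below. This is only a sketch of the intended shape,
not the code that was eventually merged: the trace point is dropped, and the
ram_reg_initialized/ram_reg_error fields are simply the ones from the patch
above.

static void vfio_spapr_register_memory(VFIOContainer *container,
                                       MemoryRegionSection *section,
                                       bool do_reg)
{
    int ret;
    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };

    if (!memory_region_is_ram(section->mr) ||
        memory_region_is_skip_dump(section->mr)) {
        return;
    }

    if (unlikely(section->offset_within_region &
                 (qemu_real_host_page_size - 1))) {
        error_report("%s received unaligned region", __func__);
        return;
    }

    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
        section->offset_within_region;
    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);

    /* The bool selects the ioctl, so callers never pass a raw request number */
    ret = ioctl(container->fd,
                do_reg ? VFIO_IOMMU_SPAPR_REGISTER_MEMORY :
                         VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
    if (ret) {
        /*
         * During container setup, record the first error so the caller can
         * fail gracefully; at runtime there is little to do but stop.
         */
        if (!container->iommu_data.ram_reg_initialized) {
            if (!container->iommu_data.ram_reg_error) {
                container->iommu_data.ram_reg_error = -errno;
            }
        } else {
            hw_error("vfio: RAM registering failed, unable to continue");
        }
    }
}

The region_add/region_del callbacks then collapse to
vfio_spapr_register_memory(container, section, true/false), as suggested
further down.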
> >>>
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void vfio_ram_listener_region_add(MemoryListener *listener,
> >>>>>> +                                         MemoryRegionSection *section)
> >>>>>> +{
> >>>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >>>>>> +                                            iommu_data.register_listener);
> >>>>>> +    memory_region_ref(section->mr);
> >>>>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
> >>>>>
> >>>>> vfio_spapr_register_memory(container, section, true);
> >>>>>
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void vfio_ram_listener_region_del(MemoryListener *listener,
> >>>>>> +                                         MemoryRegionSection *section)
> >>>>>> +{
> >>>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >>>>>> +                                            iommu_data.register_listener);
> >>>>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
> >>>>>
> >>>>> vfio_spapr_register_memory(container, section, false);
> >>>>>
> >>>>>> +    memory_region_unref(section->mr);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static const MemoryListener vfio_ram_memory_listener = {
> >>>>>> +    .region_add = vfio_ram_listener_region_add,
> >>>>>> +    .region_del = vfio_ram_listener_region_del,
> >>>>>> +};
> >>>>>
> >>>>> These are all spapr specific, please reflect that in the name;
> >>>>> vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del.
> >>>>
> >>>> ok.
> >>>>
> >>>>> Actually, can't we determine what type of IOMMU we have and make the
> >>>>> existing MemoryListener handle either type1 or spapr or spapr-v2?
> >>>>
> >>>> Sorry, I do not follow you here. How? The existing listener listens on
> >>>> PCI address space (at least, on pseries), new one listens on RAM address
> >>>> space (address_space_memory). What do I miss?
> >>>
> >>> Isn't that simply a difference of the address space the listener is
> >>> attached to? Type1 maps RAM, spapr-v1 maps guest IOMMU space and these
> >>> are already both handled by the same listener.
> >>
> >> Ok, I tried merging 2 listeners and realized that the PCI listener works
> >> with TARGET_PAGE_SIZE granularity (which is 4K and actually it should be
> >> using an IOMMU page size which is not easily available there but this is a
> >> different story) and RAM listener with the qemu_real_host_page_size
> >> granularity (64K for my case) so depending on the address space type,
> >> vfio_listener_region_add() will have to use different page sizes. I like
> >> the idea of merging less now...
> >
> > Sounds like you're already solving something that needs to be fixed for
> > both. The type1 VFIO_IOMMU_GET_INFO ioctl does actually give us a
> > bitmap of supported iommu page sizes. It's really all but useless for
> > anything except determining the minimum page size.
>
> btw what sizes can really come from there?
I think it was originally intended to be a bitmap of native IOMMU page
sizes. I think AMD-Vi still does this, and reports essentially PAGE_MASK
minus a few bits that the hardware doesn't support for whatever reason.
Not to be outdone, VT-d reports PAGE_MASK even though their hardware
supports a small set of discrete page sizes. I think the theory there was
that software can break down any mapping to supported sizes. The result
is that we have no idea whether a bit in the bitmap means native support
or not, so we ignore it and assume host page size is the minimum
alignment.

> > For the most part we
> > just assume that it's the same as the host page size, so those existing
> > checks could actually change to host page alignment pretty safely. I
> > think we both actually want pages that are both host and target aligned,
> > don't we? What would you do on a 64k host if the guest tried to map a
> > region that only had 4k alignment?
>
> I will get_user_pages_fast(va & PAGE_MASK) and then write
> (gpa_to_hpa(va&PAGE_MASK)|(va & ~PAGE_MASK)) to the table, this is what we
> do now as our typical host uses 64k pages and default 32bit window always
> uses 4K (irrelevant to the guest page size).

Ok, so the windowed IOMMU with native 4k pages prevents you from allowing
access to more than the guest mapped.

> > Anyway, if that's the only problem,
> > it looks more like an opportunity than a barrier.
>
> Oh. Ok :)

Sorry ;)  Thanks,

Alex
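P.S. The only robust thing userspace can get out of that page size bitmap is
the smallest supported size, i.e. the lowest set bit. A rough sketch of
pulling that out (the function name is made up for illustration; it assumes
the vfio_iommu_type1_info layout from linux/vfio.h, and the fallback to
qemu_real_host_page_size mirrors the "just assume host page size" behaviour
described above rather than anything the kernel guarantees):

/* Needs <sys/ioctl.h>, <linux/vfio.h> and QEMU's qemu_real_host_page_size. */
static uint64_t vfio_minimum_iommu_pagesize(int container_fd)
{
    struct vfio_iommu_type1_info info = { .argsz = sizeof(info) };

    if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, &info) ||
        !(info.flags & VFIO_IOMMU_INFO_PGSIZES) || !info.iova_pgsizes) {
        /* No usable bitmap: fall back to the host page size assumption */
        return qemu_real_host_page_size;
    }

    /* Lowest set bit == smallest page size the IOMMU claims to support */
    return info.iova_pgsizes & -info.iova_pgsizes;
}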