On 08/15/2014 10:09 AM, David Gibson wrote:
> On Thu, Aug 14, 2014 at 03:38:45PM +0200, Alexander Graf wrote:
>>
>> On 13.08.14 02:18, Alexey Kardashevskiy wrote:
>>> On 08/13/2014 01:28 AM, Alexander Graf wrote:
>>>> On 12.08.14 17:10, Alexey Kardashevskiy wrote:
>>>>> On 08/12/2014 07:37 PM, Alexander Graf wrote:
>>>>>> On 12.08.14 02:03, Alexey Kardashevskiy wrote:
>>>>>>> On 08/12/2014 03:30 AM, Alexander Graf wrote:
>>>>>>>> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>>>>>>>>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>>>>>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>>>>>>>> This implements DDW for VFIO. Host kernel support is required for
>>>>>>>>>>> this.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>>>>>>>>> ---
>>>>>>>>>>>   hw/ppc/spapr_pci_vfio.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>   1 file changed, 75 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> index d3bddf2..dc443e2 100644
>>>>>>>>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> @@ -69,6 +69,77 @@ static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>>>>>>>>       /* Register default 32bit DMA window */
>>>>>>>>>>>       memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>>>>>>>>>>>                                   spapr_tce_get_iommu(tcet));
>>>>>>>>>>> +
>>>>>>>>>>> +    sphb->ddw_supported = !!(info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>>>>>>>>> +                                    uint32_t *windows_available,
>>>>>>>>>>> +                                    uint32_t *page_size_mask)
>>>>>>>>>>> +{
>>>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz = sizeof(query) };
>>>>>>>>>>> +    int ret;
>>>>>>>>>>> +
>>>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>>>>>>>>> +    if (ret) {
>>>>>>>>>>> +        return ret;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    *windows_available = query.windows_available;
>>>>>>>>>>> +    *page_size_mask = query.page_size_mask;
>>>>>>>>>>> +
>>>>>>>>>>> +    return ret;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t page_shift,
>>>>>>>>>>> +                                     uint32_t window_shift, uint32_t liobn,
>>>>>>>>>>> +                                     sPAPRTCETable **ptcet)
>>>>>>>>>>> +{
>>>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>>>>>>>>> +        .argsz = sizeof(create),
>>>>>>>>>>> +        .page_shift = page_shift,
>>>>>>>>>>> +        .window_shift = window_shift,
>>>>>>>>>>> +        .start_addr = 0
>>>>>>>>>>> +    };
>>>>>>>>>>> +    int ret;
>>>>>>>>>>> +
>>>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>>>>>>>>> +    if (ret) {
>>>>>>>>>>> +        return ret;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, create.start_addr,
>>>>>>>>>>> +                                 page_shift, 1 << (window_shift - page_shift),
>>>>>>>>>> I spot a 1 without ULL again - this time it might work out ok, but please
>>>>>>>>>> just always use ULL when you pass around addresses.
>>>>>>>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Please walk me through the abstraction levels on what honoring each
>>>>>>>>>> page size means. If I use THP, what page size granularity can I use for
>>>>>>>>>> TCE entries?
>>>>>>>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>>>>>>>>> support
>>>>>>>>>
>>>>>>>>> +    const struct { int shift; uint32_t mask; } masks[] = {
>>>>>>>>> +        { 12, DDW_PGSIZE_4K },
>>>>>>>>> +        { 16, DDW_PGSIZE_64K },
>>>>>>>>> +        { 24, DDW_PGSIZE_16M },
>>>>>>>>> +        { 25, DDW_PGSIZE_32M },
>>>>>>>>> +        { 26, DDW_PGSIZE_64M },
>>>>>>>>> +        { 27, DDW_PGSIZE_128M },
>>>>>>>>> +        { 28, DDW_PGSIZE_256M },
>>>>>>>>> +        { 34, DDW_PGSIZE_16G },
>>>>>>>>> +    };
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Supported page sizes are returned by the host kernel via "query". For 16MB
>>>>>>>>> pages, page shift will return DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>>>>>>>>> Or I did not understand the question...
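For reference, a minimal sketch (not part of the patch) of the lookup the
masks[] table above enables: scanning the mask returned by "query" and
picking the largest supported TCE page shift. The helper name is
hypothetical; ARRAY_SIZE is assumed from QEMU's headers; the DDW_PGSIZE_*
flags are the ones quoted above.

/* Hypothetical helper, not from the patch: pick the largest TCE page
 * shift whose DDW_PGSIZE_* bit is set in the mask returned by "query". */
static int ddw_best_page_shift(uint32_t page_size_mask)
{
    /* Same pairs as the quoted masks[] table, largest first */
    const struct { int shift; uint32_t mask; } masks[] = {
        { 34, DDW_PGSIZE_16G },
        { 28, DDW_PGSIZE_256M },
        { 27, DDW_PGSIZE_128M },
        { 26, DDW_PGSIZE_64M },
        { 25, DDW_PGSIZE_32M },
        { 24, DDW_PGSIZE_16M },
        { 16, DDW_PGSIZE_64K },
        { 12, DDW_PGSIZE_4K },
    };
    int i;

    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
        if (page_size_mask & masks[i].mask) {
            return masks[i].shift;
        }
    }
    return -1; /* no supported page size */
}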
>>>>>>>> Why do we care about the sizes? Anything bigger than what we support should
>>>>>>>> always work, no? What happens if the guest creates a 16MB map but my pages
>>>>>>>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
>>>>>>> It is DMA memory: if I split a "virtual" 16M page into a bunch of real 4K
>>>>>>> pages, I have to make sure these 16M are contiguous - there will be one TCE
>>>>>>> entry for it and no more translations besides the IOMMU. What do I miss now?
>>>>>> Who does the shadow translation where? Does it exist at all?
>>>>> IOMMU? I am not sure I am following you... This IOMMU will look like direct
>>>>> DMA for the guest but the real IOMMU table is sparse and it is populated
>>>>> via a bunch of H_PUT_TCE calls as the default small window.
>>>>>
>>>>> There is a direct mapping in the host called the "bypass window" but it is
>>>>> not used here as sPAPR does not define that for paravirtualization.
>>>> Ok, imagine I have 16MB of guest physical memory that is in reality backed
>>>> by 256 64k pages on the host. The guest wants to create a 16M TCE entry for
>>>> this (from its point of view contiguous) chunk of memory.
>>>>
>>>> Do we allow this?
>>> No, we do not. We tell the guest what it can use.
>>>
>>>> Or do we force the guest to create 64k TCE entries?
>>> 16MB TCE pages are only allowed if qemu is running with hugepages.
>>
>> That's unfortunate ;) but as long as we have to pin TCEd memory anyway, I
>> guess it doesn't hurt as badly.
>>
>>>
>>>
>>>> If we allow it, why would we ever put any restriction at the upper end of
>>>> TCE entry sizes? If we already implement enough logic to map things lazily
>>>> around, we could as well have the guest create a 256M TCE entry and just
>>>> split it on the host view into 64k TCE entries.
>>> Oh, thiiiiiis is what you meant...
>>>
>>> Well, we could; just for now current Linux guests support 4K/64K/16M only
>>> and they choose depending on what the hypervisor supports - look at
>>> enable_ddw() in the guest. What you suggest seems to be unnecessary code
>>> duplication for the 16MB page case. For bigger page sizes - for example,
>>> for a 64GB guest, a TCE table with 16MB TCEs will be 32KB, which is already
>>> awesome enough, no?
>>
>> In "normal" environments guests won't be backed by 16M pages, but by 64k
>> pages with the occasional THP huge page merge that you can't rely on.
>>
>> That's why I figured it'd be smart to support 16MB TCEs even when the
>> underlying memory is only backed by 64k pages.
>
> That could work for emulated PCI devices, but not for VFIO. With VFIO
> the TCEs get passed through to the hardware, and so the pages mapped
> must be physically contiguous, which can only happen if the guest is
> backed by hugepages.
>
> Well.. I guess you *could* fake it for VFIO, by making each guest
> H_PUT_TCE result in many real TCEs being created. But I think it's a
> bad idea, because it would trigger the guest to map all RAM when not
> hugepage backed, and that would mean the translated (host) TCE table
> would be inordinately large.
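A minimal sketch of the "fake it" idea described above - purely
illustrative, not proposed code. put_host_tce() is a hypothetical stand-in
for whatever would program one real host TCE, and the shift values are
assumptions taken from the 16M-over-64k example in the thread.

#include <stdint.h>

/* Hypothetical stand-in, not a real QEMU or kernel interface */
static int put_host_tce(uint64_t ioba, uint64_t gpa)
{
    (void)ioba;
    (void)gpa;
    return 0; /* a real implementation would write a hardware TCE here */
}

/* Sketch only: "faking" one huge guest TCE by creating many small host
 * TCEs. For a 16M guest page over 64K host pages, guest_shift = 24 and
 * host_shift = 16, i.e. 256 host TCEs per guest TCE. */
static int fake_huge_tce(uint64_t ioba, uint64_t gpa,
                         unsigned guest_shift, unsigned host_shift)
{
    uint64_t i, n = 1ULL << (guest_shift - host_shift); /* note the ULL */
    int ret;

    for (i = 0; i < n; ++i) {
        /* Each small piece gets its own translation, so the host pages
         * backing the guest-contiguous range need not be physically
         * contiguous. */
        ret = put_host_tce(ioba + (i << host_shift),
                           gpa + (i << host_shift));
        if (ret) {
            return ret;
        }
    }
    return 0;
}

int main(void)
{
    return fake_huge_tce(0x80000000ULL, 0x10000000ULL, 24, 16);
}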
Inordinately? :) 64GB of RAM, 64K pages = 1 million TCEs, 8 bytes each, so
the whole table would be 8MB (which is ~0.01% of 64GB). Well, allocating
8MB in one contiguous chunk might be a problem - but P8 allows splitting
tables into up to 4-level trees. Or it is just a single 16MB page.


-- 
Alexey
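As a standalone sanity check of the arithmetic above (all sizes taken from
the discussion, none measured):

#include <stdio.h>

int main(void)
{
    unsigned long long ram   = 64ULL << 30; /* 64GB of guest RAM */
    unsigned long long page  = 64ULL << 10; /* 64K IOMMU page size */
    unsigned long long tces  = ram / page;  /* 1048576 entries */
    unsigned long long table = tces * 8;    /* 8 bytes per TCE */

    /* Prints: 1048576 TCEs, table = 8 MB (0.0122% of RAM) */
    printf("%llu TCEs, table = %llu MB (%.4f%% of RAM)\n",
           tces, table >> 20, 100.0 * (double)table / (double)ram);
    return 0;
}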