On Fri, 2016-01-22 at 15:14 -0700, Alex Williamson wrote:
> On Thu, 2016-01-21 at 14:15 +0100, Pierre Morel wrote:
> >
> > On 01/20/2016 04:46 PM, Alex Williamson wrote:
> > > On Wed, 2016-01-20 at 16:14 +0100, Pierre Morel wrote:
> > > > On 01/12/2016 07:16 PM, Alex Williamson wrote:
> > > > > On Tue, 2016-01-12 at 16:11 +0100, Pierre Morel wrote:
> > > > > > In vfio_listener_region_add(), we try to validate that the region
> > > > > > is not zero sized and hasn't overflowed the address space.
> > > > > >
> > > > > > But the calculation uses the size of the region instead of
> > > > > > using the region's limit (size - 1).
> > > > > >
> > > > > > This leads to an Int128 overflow when the region has
> > > > > > been initialized to UINT64_MAX, because in this case
> > > > > > memory_region_init() transforms the size from UINT64_MAX
> > > > > > to int128_2_64().
> > > > > >
> > > > > > Let's really use the limit by subtracting one from the size,
> > > > > > and take care to use the limit for functions using limit
> > > > > > and size to call functions which need size.
> > > > > >
> > > > > > Signed-off-by: Pierre Morel <pmo...@linux.vnet.ibm.com>
> > > > > > ---
> > > > > >
> > > > > > Changes from v2:
> > > > > > - all, just ignore v2, sorry about this,
> > > > > >   this is built after v1
> > > > > >
> > > > > > Changes from v1:
> > > > > > - adjust the tests, knowing we already subtracted one from end.
> > > > > >
> > > > > >  hw/vfio/common.c |   14 +++++++-------
> > > > > >  1 files changed, 7 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > > > > index 6797208..a5f6643 100644
> > > > > > --- a/hw/vfio/common.c
> > > > > > +++ b/hw/vfio/common.c
> > > > > > @@ -348,12 +348,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > >      if (int128_ge(int128_make64(iova), llend)) {
> > > > > >          return;
> > > > > >      }
> > > > > > -    end = int128_get64(llend);
> > > > > > +    end = int128_get64(int128_sub(llend, int128_one()));
> > > > > >
> > > > > > -    if ((iova < container->min_iova) || ((end - 1) > container->max_iova)) {
> > > > > > +    if ((iova < container->min_iova) || (end > container->max_iova)) {
> > > > > >          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> > > > > >                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> > > > > > -                     container, iova, end - 1);
> > > > > > +                     container, iova, end);
> > > > > >          ret = -EFAULT;
> > > > > >          goto fail;
> > > > > >      }
> > > > > > @@ -363,7 +363,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > >      if (memory_region_is_iommu(section->mr)) {
> > > > > >          VFIOGuestIOMMU *giommu;
> > > > > >
> > > > > > -        trace_vfio_listener_region_add_iommu(iova, end - 1);
> > > > > > +        trace_vfio_listener_region_add_iommu(iova, end);
> > > > > >          /*
> > > > > >           * FIXME: We should do some checking to see if the
> > > > > >           * capabilities of the host VFIO IOMMU are adequate to model
> > > > > > @@ -394,13 +394,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > >              section->offset_within_region +
> > > > > >              (iova - section->offset_within_address_space);
> > > > > >
> > > > > > -    trace_vfio_listener_region_add_ram(iova, end - 1, vaddr);
> > > > > > +    trace_vfio_listener_region_add_ram(iova, end, vaddr);
> > > > > >
> > > > > > -    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
> > > > > > +    ret = vfio_dma_map(container, iova, end - iova + 1, vaddr, section->readonly);
> > > > > >      if (ret) {
> > > > > >          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> > > > > >                       "0x%"HWADDR_PRIx", %p) = %d (%m)",
> > > > > > -                     container, iova, end - iova, vaddr, ret);
> > > > > > +                     container, iova, end - iova + 1, vaddr, ret);
> > > > > >          goto fail;
> > > > > >      }
> > > > > >
> > > > > Hmm, did we just push the overflow from one place to another?  If we're
> > > > > mapping a full region of size int128_2_64() starting at iova zero, then
> > > > > this becomes (0xffff_ffff_ffff_ffff - 0 + 1) = 0.  So I think we need
> > > > > to calculate size with 128bit arithmetic too and let it assert if we
> > > > > overflow, ie:
> > > > >
> > > > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > > > index a5f6643..13ad90b 100644
> > > > > --- a/hw/vfio/common.c
> > > > > +++ b/hw/vfio/common.c
> > > > > @@ -321,7 +321,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > >                                       MemoryRegionSection *section)
> > > > >  {
> > > > >      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> > > > > -    hwaddr iova, end;
> > > > > +    hwaddr iova, end, size;
> > > > >      Int128 llend;
> > > > >      void *vaddr;
> > > > >      int ret;
> > > > > @@ -348,7 +348,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > >      if (int128_ge(int128_make64(iova), llend)) {
> > > > >          return;
> > > > >      }
> > > > > +
> > > > >      end = int128_get64(int128_sub(llend, int128_one()));
> > > > > +    size = int128_get64(int128_sub(llend, int128_make64(iova)));
> > > > here again, if iova is null, since llend is section->size (2^64) ...
> > > >
> > > > >
> > > > >      if ((iova < container->min_iova) || (end > container->max_iova)) {
> > > > >          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> > > > > @@ -396,11 +398,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > >
> > > > >      trace_vfio_listener_region_add_ram(iova, end, vaddr);
> > > > >
> > > > > -    ret = vfio_dma_map(container, iova, end - iova + 1, vaddr, section->readonly);
> > > > > +    ret = vfio_dma_map(container, iova, size, vaddr, section->readonly);
> > > > >      if (ret) {
> > > > >          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> > > > >                       "0x%"HWADDR_PRIx", %p) = %d (%m)",
> > > > > -                     container, iova, end - iova + 1, vaddr, ret);
> > > > > +                     container, iova, size, vaddr, ret);
> > > > >          goto fail;
> > > > >      }
> > > > >
> > > > >
> > > > > Does that still solve your scenario?  Perhaps vfio-iommu-type1 should
> > > > > have used first/last rather than start/size for mapping since we seem
> > > > > to have an off-by-one for mapping a full 64bit space.  Seems like we
> > > > > could do it with two calls to vfio_dma_map if we really wanted to.
> > > > > Thanks,
> > > > >
> > > > > Alex
> > > > >
> > > > You are right, every attempt to solve this will push the overflow
> > > > somewhere else.
> > > >
> > > > There is just no way to express 2^64 with 64 bits. We have the int128()
> > > > solution, but if we solve it here, we hit the same problem in the Linux
> > > > ioctl call anyway.
> > > >
> > > > Intuitively, making two calls does not seem right to me.
> > > >
> > > > But, what do you think of something like:
> > > >
> > > > - creating a new VFIO extension
> > > >
> > > > - and in ioctl(), since we have a flags entry in the
> > > >   vfio_iommu_type1_dma_map, maybe adding a new flag meaning
> > > >   "map all virtual memory"? or meaning "use first/last"?
> > > > I think this would break existing code unless we add a new VFIO
> > > > extension.
> > > Back up, is there ever a case where we actually need to map the entire
> > > 64bit address space?  This is fairly well impossible on x86.  I'm
> > > pointing out an issue, but I don't know that we need to solve it with
> > > more than an assert since it's never likely to happen.  Thanks,
> > >
> > > Alex
> > >
> > >
> > If I understood right, IOVA is the IO virtual address,
> > it is then possible to map the virtual address page 0xffff_ffff_ffff_f000
> > to something reasonable inside the real memory.
>
> It is.
>
> > Possibly we do not need to map the last virtual page, but
> > I think that in the general case all of the virtual memory, as viewed by
> > the device through the IOMMU, should be mapped to avoid any uninitialized
> > virtual memory access.
>
> When using vfio, a device only has access to the IOVA space which has
> been explicitly mapped.  This would be a security issue otherwise since
> kernel vfio can't rely on userspace to wipe the device IOVA space.
>
> > It is the same reason that makes us map all of the virtual memory for
> > the CPU MMU.
>
> We don't really do that either, CPU mapping works based on page tables
> and non-existent entries simply don't exist.  We don't fully populate
> the page tables in advance, this would be a horrible waste of memory.
>
> > Maybe I missed something, or maybe I worry too much,
> > but I see this as a restriction on the supported hardware
> > if we compare host and guest hardware support compatibility.
>
> I don't see the issue, there's arguably a bug in the API that doesn't
> allow us to map the full 64bit IOVA space of a device in a single
> mapping, but we can do it in two.  Besides, there's really no case
> where a device needs a fully populated IOTLB unless you're actually
> giving the device access to 16 EMB of memory.

s/EMB/EB/  Or I suppose technically EiB

> > We can live with it, because in fact you are right and today
> > I am not aware of any hardware wanting to access this page, but
> > hardware designers, knowing they have an IOMMU, may want to access
> > exactly this kind of strange virtual page for special features, and
> > this would work on the host but not inside of the guest.
>
> The API issue is not that we can't map 0xffff_ffff_ffff_f000, it's that
> we can't map 0x0 through 0xffff_ffff_ffff_ffff in a single mapping
> because we pass the size instead of the end address (where size here
> would be 2^64).  We can map 0x0 through 0xffff_ffff_ffff_efff, followed
> by 0xffff_ffff_ffff_f000 through 0xffff_ffff_ffff_ffff, but again, why
> would you ever need to do this?  Thanks,
>
> Alex
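
For illustration only, here is a minimal standalone C sketch of the
arithmetic discussed in this thread. It is not QEMU or kernel code: the
names span_size_u64(), MapRequest and split_inclusive_range() are
hypothetical, and the page size is passed in as an assumed parameter. It
shows that a 64-bit "last - first + 1" wraps to 0 for the full address
space, and how the range could be expressed as two start/size requests
instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of QEMU or VFIO. */

/*
 * Size of the inclusive range [first, last] computed in 64 bits.
 * For first == 0 and last == UINT64_MAX the true size is 2^64,
 * which does not fit, so the result wraps to 0.
 */
static uint64_t span_size_u64(uint64_t first, uint64_t last)
{
    return last - first + 1;
}

/* One start/size request, the form the thread says the mapping API takes. */
typedef struct {
    uint64_t start;
    uint64_t size;
} MapRequest;

/*
 * Express the inclusive range [first, last] as one or two start/size
 * requests so that no size has to hold 2^64.  Returns the number of
 * requests written to out[] (1 or 2), or 0 if first > last.
 * page_size is an assumption of this sketch and must be non-zero.
 */
static int split_inclusive_range(uint64_t first, uint64_t last,
                                 uint64_t page_size, MapRequest out[2])
{
    assert(page_size != 0);

    if (first > last) {
        return 0;
    }

    if (span_size_u64(first, last) != 0) {
        /* The size fits in 64 bits: a single request is enough. */
        out[0].start = first;
        out[0].size = span_size_u64(first, last);
        return 1;
    }

    /*
     * The size wrapped to 0, which only happens for the whole 64-bit
     * space.  Map everything below the last page, then the last page.
     */
    out[0].start = 0;
    out[0].size = UINT64_MAX - page_size + 1;   /* 2^64 - page_size */
    out[1].start = UINT64_MAX - page_size + 1;  /* start of last page */
    out[1].size = page_size;
    return 2;
}
```

With a page size of 0x1000, the full range comes out as 0x0 through
0xffff_ffff_ffff_efff followed by 0xffff_ffff_ffff_f000 through
0xffff_ffff_ffff_ffff, which is the two-mapping split described above.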