On 29/08/16 23:27, David Gibson wrote: > On Mon, Aug 29, 2016 at 04:35:15PM +1000, Alexey Kardashevskiy wrote: >> On 18/08/16 10:22, Alexey Kardashevskiy wrote: >>> On 17/08/16 13:17, David Gibson wrote: >>>> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote: >>>>> On Fri, 12 Aug 2016 15:46:01 +1000 >>>>> David Gibson <da...@gibson.dropbear.id.au> wrote: >>>>> >>>>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote: >>>>>>> On Wed, 10 Aug 2016 15:37:17 +1000 >>>>>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote: >>>>>>> >>>>>>>> On 09/08/16 22:16, Alex Williamson wrote: >>>>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000 >>>>>>>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote: >>>>>>>>> >>>>>>>>>> On 09/08/16 02:43, Alex Williamson wrote: >>>>>>>>>>> On Wed, 3 Aug 2016 18:40:55 +1000 >>>>>>>>>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote: >>>>>>>>>>> >>>>>>>>>>>> This exports helpers which are needed to keep a VFIO container in >>>>>>>>>>>> memory while there are external users such as KVM. >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> >>>>>>>>>>>> --- >>>>>>>>>>>> drivers/vfio/vfio.c | 30 >>>>>>>>>>>> ++++++++++++++++++++++++++++++ >>>>>>>>>>>> drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++- >>>>>>>>>>>> include/linux/vfio.h | 6 ++++++ >>>>>>>>>>>> 3 files changed, 51 insertions(+), 1 deletion(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c >>>>>>>>>>>> index d1d70e0..baf6a9c 100644 >>>>>>>>>>>> --- a/drivers/vfio/vfio.c >>>>>>>>>>>> +++ b/drivers/vfio/vfio.c >>>>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct >>>>>>>>>>>> vfio_group *group, unsigned long arg) >>>>>>>>>>>> EXPORT_SYMBOL_GPL(vfio_external_check_extension); >>>>>>>>>>>> >>>>>>>>>>>> /** >>>>>>>>>>>> + * External user API for containers, exported by symbols to be >>>>>>>>>>>> linked >>>>>>>>>>>> + * dynamically. >>>>>>>>>>>> + * >>>>>>>>>>>> + */ >>>>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep) >>>>>>>>>>>> +{ >>>>>>>>>>>> + struct vfio_container *container = filep->private_data; >>>>>>>>>>>> + >>>>>>>>>>>> + if (filep->f_op != &vfio_fops) >>>>>>>>>>>> + return ERR_PTR(-EINVAL); >>>>>>>>>>>> + >>>>>>>>>>>> + vfio_container_get(container); >>>>>>>>>>>> + >>>>>>>>>>>> + return container; >>>>>>>>>>>> +} >>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext); >>>>>>>>>>>> + >>>>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container) >>>>>>>>>>>> +{ >>>>>>>>>>>> + vfio_container_put(container); >>>>>>>>>>>> +} >>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext); >>>>>>>>>>>> + >>>>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container >>>>>>>>>>>> *container) >>>>>>>>>>>> +{ >>>>>>>>>>>> + return container->iommu_data; >>>>>>>>>>>> +} >>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext); >>>>>>>>>>>> + >>>>>>>>>>>> +/** >>>>>>>>>>>> * Sub-module support >>>>>>>>>>>> */ >>>>>>>>>>>> /* >>>>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c >>>>>>>>>>>> b/drivers/vfio/vfio_iommu_spapr_tce.c >>>>>>>>>>>> index 3594ad3..fceea3d 100644 >>>>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c >>>>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c >>>>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops >>>>>>>>>>>> tce_iommu_driver_ops = { >>>>>>>>>>>> .detach_group = tce_iommu_detach_group, >>>>>>>>>>>> }; >>>>>>>>>>>> >>>>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void >>>>>>>>>>>> *iommu_data, >>>>>>>>>>>> + u64 offset) >>>>>>>>>>>> +{ >>>>>>>>>>>> + struct tce_container *container = iommu_data; >>>>>>>>>>>> + struct iommu_table *tbl = NULL; >>>>>>>>>>>> + >>>>>>>>>>>> + if (tce_iommu_find_table(container, offset, &tbl) < 0) >>>>>>>>>>>> + return NULL; >>>>>>>>>>>> + >>>>>>>>>>>> + iommu_table_get(tbl); >>>>>>>>>>>> + >>>>>>>>>>>> + return tbl; >>>>>>>>>>>> +} >>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext); >>>>>>>>>>>> + >>>>>>>>>>>> static int __init tce_iommu_init(void) >>>>>>>>>>>> { >>>>>>>>>>>> return vfio_register_iommu_driver(&tce_iommu_driver_ops); >>>>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION); >>>>>>>>>>>> MODULE_LICENSE("GPL v2"); >>>>>>>>>>>> MODULE_AUTHOR(DRIVER_AUTHOR); >>>>>>>>>>>> MODULE_DESCRIPTION(DRIVER_DESC); >>>>>>>>>>>> - >>>>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h >>>>>>>>>>>> index 0ecae0b..1c2138a 100644 >>>>>>>>>>>> --- a/include/linux/vfio.h >>>>>>>>>>>> +++ b/include/linux/vfio.h >>>>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct >>>>>>>>>>>> vfio_group *group); >>>>>>>>>>>> extern int vfio_external_user_iommu_id(struct vfio_group *group); >>>>>>>>>>>> extern long vfio_external_check_extension(struct vfio_group >>>>>>>>>>>> *group, >>>>>>>>>>>> unsigned long arg); >>>>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file >>>>>>>>>>>> *filep); >>>>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container >>>>>>>>>>>> *container); >>>>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext( >>>>>>>>>>>> + struct vfio_container *container); >>>>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext( >>>>>>>>>>>> + void *iommu_data, u64 offset); >>>>>>>>>>>> >>>>>>>>>>>> /* >>>>>>>>>>>> * Sub-module helpers >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I think you need to take a closer look of the lifecycle of a >>>>>>>>>>> container, >>>>>>>>>>> having a reference means the container itself won't go away, but >>>>>>>>>>> only >>>>>>>>>>> having a group set within that container holds the actual IOMMU >>>>>>>>>>> references. container->iommu_data is going to be NULL once the >>>>>>>>>>> groups are lost. Thanks, >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Container owns the iommu tables and this is what I care about here, >>>>>>>>>> groups >>>>>>>>>> attached or not - this is handled separately via IOMMU group list in >>>>>>>>>> a >>>>>>>>>> specific iommu_table struct, these groups get detached from >>>>>>>>>> iommu_table >>>>>>>>>> when they are removed from a container. >>>>>>>>> >>>>>>>>> The container doesn't own anything, the container is privileged by the >>>>>>>>> groups being attached to it. When groups are closed, they detach from >>>>>>>>> the container and once the container group list is empty the iommu >>>>>>>>> backend is released and iommu_data is NULL. A container reference >>>>>>>>> doesn't give you what you're looking for. It implies nothing about >>>>>>>>> the >>>>>>>>> iommu backend. >>>>>>>> >>>>>>>> >>>>>>>> Well. Backend is a part of a container and since a backend owns >>>>>>>> tables, a >>>>>>>> container owns them too. >>>>>>> >>>>>>> The IOMMU backend is accessed through the container, but that backend >>>>>>> is privileged by the groups it contains. Once those groups are gone, >>>>>>> the IOMMU backend is released, regardless of whatever reference you >>>>>>> have to the container itself such as you're attempting to do here. In >>>>>>> that sense, the container does not own those tables. >>>>>> >>>>>> So, the thing is that what KVM fundamentally needs is a handle on the >>>>>> container. KVM is essentially modelling the DMA address space of a >>>>>> single guest bus, and the container is what's attached to that. >>>>>> >>>>>> The first part of the problem is that KVM wants to basically invoke >>>>>> vfio_dma_map() operations without bouncing via qemu. Because >>>>>> vfio_dma_map() works on the container level, that's the handle that >>>>>> KVM needs to hold. >>>>>> >>>>>> The second part of the problem is that in order to reduce overhead >>>>>> further, we want to operate in real mode, which means bypassing most >>>>>> of the usual VFIO structure and going directly(ish) from the KVM >>>>>> hcall emulation to the IOMMU backend behind VFIO. This complicates >>>>>> matters a fair bit. Because it is, explicitly, a performance hack, >>>>>> some degree of ugliness is probably inevitable. >>>>>> >>>>>> Alexey - actually implementing this in two stages might make this >>>>>> clearer. The first stage wouldn't allow real mode, and would call >>>>>> through the same vfio_dma_map() path as qemu calls through now. The >>>>>> second stage would then put in place the necessary hacks to add real >>>>>> mode support. >>>>>> >>>>>>>> The problem I am trying to solve here is when KVM may release the >>>>>>>> iommu_table objects. >>>>>>>> >>>>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really >>>>>>>> matter) makes a link between KVM-spapr-tce-table and container and KVM >>>>>>>> can >>>>>>>> start using tables (with referencing them). >>>>>>>> >>>>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it >>>>>>>> from region_del() and this works if QEMU removes a window. However if >>>>>>>> QEMU >>>>>>>> removes a vfio-pci device, region_del() is not called and KVM does not >>>>>>>> get >>>>>>>> notified that it can release the iommu_table's because the >>>>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is >>>>>>>> still used by emulated devices or other containers). >>>>>>>> >>>>>>>> So it was suggested that we could do such "unset" somehow later >>>>>>>> assuming, >>>>>>>> for example, on every "set" I could check if some of currently attached >>>>>>>> containers are no more used - and this is where being able to know if >>>>>>>> there >>>>>>>> is no backend helps - KVM remembers a container pointer and can check >>>>>>>> this >>>>>>>> via vfio_container_get_iommu_data_ext(). >>>>>>>> >>>>>>>> The other option would be changing vfio_container_get_ext() to take a >>>>>>>> callback+opaque which container would call when it destroys iommu_data. >>>>>>>> This looks more intrusive and not very intuitive how to make it right - >>>>>>>> container would have to keep track of all registered external users and >>>>>>>> vfio_container_put_ext() would have to pass the same callback+opaque to >>>>>>>> unregister the exact external user. >>>>>>> >>>>>>> I'm not in favor of anything resembling the code above or extensions >>>>>>> beyond it, the container is the wrong place to do this. >>>>>>> >>>>>>>> Or I could store container file* in KVM. Then iommu_data would never be >>>>>>>> released until KVM-spapr-tce-table is destroyed. >>>>>>> >>>>>>> See above, holding a file pointer to the container doesn't do squat. >>>>>>> The groups that are held by the container empower the IOMMU backend, >>>>>>> references to the container itself don't matter. Those references will >>>>>>> not maintain the IOMMU data. >>>>>>> >>>>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing >>>>>>>> its fd >>>>>>>> would "unset" container from KVM-spapr-tce-table) is not an option as >>>>>>>> there >>>>>>>> still may be devices using this KVM-spapr-tce-table. >>>>>>>> >>>>>>>> What obvious and nice solution am I missing here? Thanks. >>>>>>> >>>>>>> The interactions with the IOMMU backend that seem relevant are >>>>>>> vfio_iommu_drivers_ops.{detach_group,release}. The kvm-vfio pseudo >>>>>>> device is also used to tell kvm about groups as they come and go and >>>>>>> has a way to check extensions, and thus properties of the IOMMU >>>>>>> backend. All of these are available for your {ab}use. Thanks, >>>>>> >>>>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's >>>>>> a really bad fit. As noted above, fundamentally it's a container we >>>>>> need to attach to the kvm-spapr-tce-table object, since what that >>>>>> represents is a guest bus DMA address space, and by definition all the >>>>>> groups in a container must have the same DMA address space. >>>>> >>>>> That's all fine and good, but the point remains that a reference to the >>>>> container is no assurance of the iommu state. The iommu state is >>>>> maintained by the user and the groups attached to the container. If >>>>> the groups are removed, your container reference no long has any iommu >>>>> backing and iommu_data is worthless. The user can do this as well by >>>>> un-setting the iommu. I understand what you're trying to do, it's just >>>>> wrong. Thanks, >>>> >>>> I'm trying to figure out how to do this right, and it's not at all >>>> obvious. The container may be wrong, but that doesn't have the >>>> KVM-VFIO device any more useful. Attempting to do this at the group >>>> level is at least as wrong for the reasons I've mentioned elsewhere. >>>> >>> >>> I could create a new fd, one per iommu_table, the fd would reference the >>> iommu_table (not touching an iommu_table_group or a container), VFIO SPAPR >>> TCE backend would return it in VFIO_IOMMU_SPAPR_TCE_CREATE (ioctl which >>> creates windows) or I could add VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET; then >>> I'd pass this new fd to the KVM or KVM-spapr-tce-table to hook them up. To >>> release the reference, KVM-spapr-tce-table would have "unset" ioctl() >>> or/and on every "set" I would look if all attached tables have at least one >>> iommu_table_group attached, if none - release the table. >>> >>> This would make no change to generic VFIO code and very little change in >>> SPAPR TCE backend. Would that be acceptable or it is horrible again? Thanks. >> >> >> Ping? > > I'm still in Toronto after KVM Forum. I had a detailed discussion > about this with Alex W, which I'll write up once I get back. > > The short version is that Alex more-or-less convinced me that we do > need to go back to doing this with an interface based on linking > groups to LIOBNs. That leads to an interface that's kind of weird and > has some fairly counter-intuitive properties, but in the end it works > out better than doing it with containers. >
Soooo? :) -- Alexey
signature.asc
Description: OpenPGP digital signature