>-----Original Message----- >From: Joao Martins <joao.m.mart...@oracle.com> >Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain >creation > >On 17/07/2024 03:52, Duan, Zhenzhong wrote: >> >> >>> -----Original Message----- >>> From: Joao Martins <joao.m.mart...@oracle.com> >>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain >>> creation >>> >>> On 16/07/2024 17:44, Joao Martins wrote: >>>> On 16/07/2024 17:04, Eric Auger wrote: >>>>> Hi Joao, >>>>> >>>>> On 7/12/24 13:46, Joao Martins wrote: >>>>>> There's generally two modes of operation for IOMMUFD: >>>>>> >>>>>> * The simple user API which intends to perform relatively simple >things >>>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to >VFIO >>>>> >>>>> It generally creates? can you explicit what is "it" >>>>> >>>> 'It' here refers to the process/API-user >>>> >>>>> I am confused by this automatic terminology again (not your fault). the >>> doc says: >>>>> " >>>>> >>>>> * >>>>> >>>>> Automatic domain - refers to an iommu domain created >automatically >>>>> when attaching a device to an IOAS object. This is compatible to the >>>>> semantics of VFIO type1. >>>>> >>>>> * >>>>> >>>>> Manual domain - refers to an iommu domain designated by the user >as >>>>> the target pagetable to be attached to by a device. Though currently >>>>> there are no uAPIs to directly create such domain, the datastructure >>>>> and algorithms are ready for handling that use case. >>>>> >>>>> " >>>>> >>>>> >>>>> in 1) the device is attached to the ioas id (using the auto domain if I am >>> not wrong) >>>>> Here you attach to an hwpt id. Isn't it a manual domain? >>>>> >>>> >>>> Correct. >>>> >>>> The 'auto domains' generally refers to the kernel-equivalent own >>> automatic >>>> attaching to a new pagetable. >>>> >>>> Here I call 'auto domains' in the userspace version too because we are >>> doing the >>>> exact same but from userspace, using the manual API in IOMMUFD. >>>> >>>>>> and mainly performs IOAS_MAP and UNMAP. >>>>>> >>>>>> * The native IOMMUFD API where you have fine grained control of >the >>>>>> IOMMU domain and model it accordingly. This is where most new >>> feature >>>>>> are being steered to. >>>>>> >>>>>> For dirty tracking 2) is required, as it needs to ensure that >>>>>> the stage-2/parent IOMMU domain will only attach devices >>>>>> that support dirty tracking (so far it is all homogeneous in x86, likely >>>>>> not the case for smmuv3). Such invariant on dirty tracking provides a >>>>>> useful guarantee to VMMs that will refuse incompatible device >>>>>> attachments for IOMMU domains. >>>>>> >>>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is >>>>>> responsible for creating an IOMMU domain. This is contrast to the >>>>>> 'simple API' where the IOMMU domain is created by IOMMUFD >>> automatically >>>>>> when it attaches to VFIO (usually referred as autodomains) but it has >>>>>> the needed handling for mdevs. >>>>>> >>>>>> To support dirty tracking with the advanced IOMMUFD API, it needs >>>>>> similar logic, where IOMMU domains are created and devices >attached >>> to >>>>>> compatible domains. Essentially mimmicing kernel >>>>>> iommufd_device_auto_get_domain(). With mdevs given there's no >>> IOMMU domain >>>>>> it falls back to IOAS attach. >>>>>> >>>>>> The auto domain logic allows different IOMMU domains to be created >>> when >>>>>> DMA dirty tracking is not desired (and VF can provide it), and others >>> where >>>>>> it is. Here is not used in this way here given how VFIODevice >migration >>>>> >>>>> Here is not used in this way here ? >>>>> >>>> >>>> I meant, 'Here it is not used in this way given (...)' >>>> >>>>>> state is initialized after the device attachment. But such mixed mode >of >>>>>> IOMMU dirty tracking + device dirty tracking is an improvement that >>> can >>>>>> be added on. Keep the 'all of nothing' of type1 approach that we have >>>>>> been using so far between container vs device dirty tracking. >>>>>> >>>>>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com> >>>>>> --- >>>>>> include/hw/vfio/vfio-common.h | 9 ++++ >>>>>> include/sysemu/iommufd.h | 5 +++ >>>>>> backends/iommufd.c | 30 +++++++++++++ >>>>>> hw/vfio/iommufd.c | 82 >>> +++++++++++++++++++++++++++++++++++ >>>>>> backends/trace-events | 1 + >>>>>> 5 files changed, 127 insertions(+) >>>>>> >>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio- >>> common.h >>>>>> index 7419466bca92..2dd468ce3c02 100644 >>>>>> --- a/include/hw/vfio/vfio-common.h >>>>>> +++ b/include/hw/vfio/vfio-common.h >>>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow { >>>>>> >>>>>> typedef struct IOMMUFDBackend IOMMUFDBackend; >>>>>> >>>>>> +typedef struct VFIOIOASHwpt { >>>>>> + uint32_t hwpt_id; >>>>>> + QLIST_HEAD(, VFIODevice) device_list; >>>>>> + QLIST_ENTRY(VFIOIOASHwpt) next; >>>>>> +} VFIOIOASHwpt; >>>>>> + >>>>>> typedef struct VFIOIOMMUFDContainer { >>>>>> VFIOContainerBase bcontainer; >>>>>> IOMMUFDBackend *be; >>>>>> uint32_t ioas_id; >>>>>> + QLIST_HEAD(, VFIOIOASHwpt) hwpt_list; >>>>>> } VFIOIOMMUFDContainer; >>>>>> >>>>>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, >>> VFIO_IOMMU_IOMMUFD); >>>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice { >>>>>> HostIOMMUDevice *hiod; >>>>>> int devid; >>>>>> IOMMUFDBackend *iommufd; >>>>>> + VFIOIOASHwpt *hwpt; >>>>>> + QLIST_ENTRY(VFIODevice) hwpt_next; >>>>>> } VFIODevice; >>>>>> >>>>>> struct VFIODeviceOps { >>>>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h >>>>>> index 57d502a1c79a..e917e7591d05 100644 >>>>>> --- a/include/sysemu/iommufd.h >>>>>> +++ b/include/sysemu/iommufd.h >>>>>> @@ -50,6 +50,11 @@ int >>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t >ioas_id, >>>>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, >>> uint32_t devid, >>>>>> uint32_t *type, void *data, >>>>>> uint32_t len, >>>>>> uint64_t *caps, Error **errp); >>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, >uint32_t >>> dev_id, >>>>>> + uint32_t pt_id, uint32_t flags, >>>>>> + uint32_t data_type, uint32_t data_len, >>>>>> + void *data_ptr, uint32_t *out_hwpt, >>>>>> + Error **errp); >>>>>> >>>>>> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD >>> TYPE_HOST_IOMMU_DEVICE "-iommufd" >>>>>> #endif >>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c >>>>>> index 2b3d51af26d2..5d3dfa917415 100644 >>>>>> --- a/backends/iommufd.c >>>>>> +++ b/backends/iommufd.c >>>>>> @@ -208,6 +208,36 @@ int >>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t >ioas_id, >>>>>> return ret; >>>>>> } >>>>>> >>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, >uint32_t >>> dev_id, >>>>>> + uint32_t pt_id, uint32_t flags, >>>>>> + uint32_t data_type, uint32_t data_len, >>>>>> + void *data_ptr, uint32_t *out_hwpt, >>>>>> + Error **errp) >>>>>> +{ >>>>>> + int ret, fd = be->fd; >>>>>> + struct iommu_hwpt_alloc alloc_hwpt = { >>>>>> + .size = sizeof(struct iommu_hwpt_alloc), >>>>>> + .flags = flags, >>>>>> + .dev_id = dev_id, >>>>>> + .pt_id = pt_id, >>>>>> + .data_type = data_type, >>>>>> + .data_len = data_len, >>>>>> + .data_uptr = (uint64_t)data_ptr, >>>>>> + }; >>>>>> + >>>>>> + ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt); >>>>>> + trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, >>> data_type, >>>>>> + data_len, (uint64_t)data_ptr, >>>>>> + alloc_hwpt.out_hwpt_id, ret); >>>>>> + if (ret) { >>>>>> + error_setg_errno(errp, errno, "Failed to allocate hwpt"); >>>>>> + return false; >>>>>> + } >>>>>> + >>>>>> + *out_hwpt = alloc_hwpt.out_hwpt_id; >>>>>> + return true; >>>>>> +} >>>>>> + >>>>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, >>> uint32_t devid, >>>>>> uint32_t *type, void *data, >>>>>> uint32_t len, >>>>>> uint64_t *caps, Error **errp) >>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c >>>>>> index 077dea8f1b64..325c7598d5a1 100644 >>>>>> --- a/hw/vfio/iommufd.c >>>>>> +++ b/hw/vfio/iommufd.c >>>>>> @@ -212,10 +212,86 @@ static bool >>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp) >>>>>> return true; >>>>>> } >>>>>> >>>>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev, >>>>>> + VFIOIOMMUFDContainer >>>>>> *container, >>>>>> + Error **errp) >>>>>> +{ >>>>>> + IOMMUFDBackend *iommufd = vbasedev->iommufd; >>>>>> + uint32_t flags = 0; >>>>>> + VFIOIOASHwpt *hwpt; >>>>>> + uint32_t hwpt_id; >>>>>> + int ret; >>>>>> + >>>>>> + /* Try to find a domain */ >>>>>> + QLIST_FOREACH(hwpt, &container->hwpt_list, next) { >>>>>> + ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt- >>hwpt_id, >>> errp); >>>>>> + if (ret) { >>>>>> + /* -EINVAL means the domain is incompatible with the device. >>> */ >>>>>> + if (ret == -EINVAL) { >>>>>> + /* >>>>>> + * It is an expected failure and it just means we will >>>>>> try >>>>>> + * another domain, or create one if no existing >>>>>> compatible >>>>>> + * domain is found. Hence why the error is discarded >>>>>> below. >>>>>> + */ >>>>>> + error_free(*errp); >>>>>> + *errp = NULL; >>>>>> + continue; >>>>>> + } >>>>>> + >>>>>> + return false; >>>>>> + } else { >>>>>> + vbasedev->hwpt = hwpt; >>>>>> + QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, >hwpt_next); >>>>>> + return true; >>>>>> + } >>>>>> + } >>>>>> + >>>>>> + if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid, >>>>>> + container->ioas_id, flags, >>>>>> + IOMMU_HWPT_DATA_NONE, 0, NULL, >>>>>> + &hwpt_id, errp)) { >>>>>> + return false; >>>>>> + } >>>>>> + >>>>>> + hwpt = g_malloc0(sizeof(*hwpt)); >>>>>> + hwpt->hwpt_id = hwpt_id; >>>>>> + QLIST_INIT(&hwpt->device_list); >>>>>> + >>>>>> + ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt- >>hwpt_id, >>> errp); >>>>>> + if (ret) { >>>>>> + iommufd_backend_free_id(container->be, hwpt->hwpt_id); >>>>>> + g_free(hwpt); >>>>>> + return false; >>>>>> + } >>>>>> + >>>>>> + vbasedev->hwpt = hwpt; >>>>>> + QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next); >>>>>> + QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next); >>>>>> + return true; >>>>>> +} >>>>>> + >>>>>> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev, >>>>>> + VFIOIOMMUFDContainer >>>>>> *container) >>>>>> +{ >>>>>> + VFIOIOASHwpt *hwpt = vbasedev->hwpt; >>>>>> + >>>>>> + QLIST_REMOVE(vbasedev, hwpt_next); >>>>> don't you want to reset vbasedev->hwpt = NULL too? >>>>> >>>> Yeap, Thanks for catching that >>>> >>>>> >>>>>> + if (QLIST_EMPTY(&hwpt->device_list)) { >>>>>> + QLIST_REMOVE(hwpt, next); >>>>>> + iommufd_backend_free_id(container->be, hwpt->hwpt_id); >>>>>> + g_free(hwpt); >>>>>> + } >>>>>> +} >>>>>> + >>>>>> static bool iommufd_cdev_attach_container(VFIODevice *vbasedev, >>>>>> VFIOIOMMUFDContainer >>>>>> *container, >>>>>> Error **errp) >>>>>> { >>>>>> + /* mdevs aren't physical devices and will fail with auto domains >*/ >>>>>> + if (!vbasedev->mdev) { >>>>>> + return iommufd_cdev_autodomains_get(vbasedev, container, >>> errp); >>>>>> + } >>>>>> + >>>>>> return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container- >>>> ioas_id, errp); >>>>>> } >>>>>> >>>>>> @@ -224,6 +300,11 @@ static void >>> iommufd_cdev_detach_container(VFIODevice *vbasedev, >>>>>> { >>>>>> Error *err = NULL; >>>>>> >>>>>> + if (vbasedev->hwpt) { >>>>>> + iommufd_cdev_autodomains_put(vbasedev, container); >>>>>> + return; >>>>> Where do we detach the device from the hwpt? >>>>> >>>> In iommufd_backend_free_id() for auto domains >>>> >>> >>> to clarify here I meant *userspace* auto domains >>> >>> *kernel* auto domains (mdev) goes via DETACH_IOMMUFD_PT >> >> If the device is still attached to the hwpt, will iommufd_backend_free_id() >succeed? >> Have you tried the hot unplug? >> > >I have but I didn't see any errors. But I will check again for v5 as it could >also be my oversight. > >I was thinking about Eric's remark overnight and I think what I am doing is >not >correct regardless of the above. > >I should be calling DETACH_IOMMUFD_PT pairing with >ATTACH_IOMMUFD_PT, and the >iommufd_backend_free_id() is to drop the final reference pairing with >alloc_hwpt() when the device list is empty i.e. when there's no more devices >in >that vdev::hwpt. > >DETACH_IOMMUFD_PT decrement the hwpt refcount and it doesn't >differentiate >between auto domains vs manual domains.
Yes, missing DETACH_IOMMUFD_PT so ref count isn't decreased. My understanding is freeing hwpt will fails become device is still attached, such as return -EBUSY, But may be I understand wrong as you didn't see that failure. Thanks Zhenzhong > >The code is already there anyhow it just has the order of >iommufd_cdev_autodomains_put vs detach invocation reversed; I'll fix that >for >next version.