Hi Anatoly,

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.bura...@intel.com>
> Sent: Wednesday, July 7, 2021 8:18 PM
> To: Ding, Xuan <xuan.d...@intel.com>; Maxime Coquelin
> <maxime.coque...@redhat.com>; Xia, Chenbo <chenbo....@intel.com>;
> Thomas Monjalon <tho...@monjalon.net>; David Marchand
> <david.march...@redhat.com>
> Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
> <sunil.pa...@intel.com>; Richardson, Bruce <bruce.richard...@intel.com>;
> Van Haaren, Harry <harry.van.haa...@intel.com>; Liu, Yong
> <yong....@intel.com>; Ma, WenwuX <wenwux...@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
>
> On 07-Jul-21 7:25 AM, Ding, Xuan wrote:
> > Hi,
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin <maxime.coque...@redhat.com>
> >> Sent: Tuesday, July 6, 2021 5:32 PM
> >> To: Burakov, Anatoly <anatoly.bura...@intel.com>; Ding, Xuan
> >> <xuan.d...@intel.com>; Xia, Chenbo <chenbo....@intel.com>; Thomas
> >> Monjalon <tho...@monjalon.net>; David Marchand
> >> <david.march...@redhat.com>
> >> Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
> >> <sunil.pa...@intel.com>; Richardson, Bruce <bruce.richard...@intel.com>;
> >> Van Haaren, Harry <harry.van.haa...@intel.com>; Liu, Yong
> >> <yong....@intel.com>; Ma, WenwuX <wenwux...@intel.com>
> >> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
> >>
> >> On 7/6/21 11:16 AM, Burakov, Anatoly wrote:
> >>> On 06-Jul-21 9:31 AM, Ding, Xuan wrote:
> >>>> Hi Maxime,
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Maxime Coquelin <maxime.coque...@redhat.com>
> >>>>> Sent: Monday, July 5, 2021 8:46 PM
> >>>>> To: Burakov, Anatoly <anatoly.bura...@intel.com>; Ding, Xuan
> >>>>> <xuan.d...@intel.com>; Xia, Chenbo <chenbo....@intel.com>; Thomas
> >>>>> Monjalon <tho...@monjalon.net>; David Marchand
> >>>>> <david.march...@redhat.com>
> >>>>> Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
> >>>>> <sunil.pa...@intel.com>; Richardson, Bruce <bruce.richard...@intel.com>;
> >>>>> Van Haaren, Harry <harry.van.haa...@intel.com>; Liu, Yong
> >>>>> <yong....@intel.com>; Ma, WenwuX <wenwux...@intel.com>
> >>>>> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async
> >>>>> vhost
> >>>>>
> >>>>> On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
> >>>>>> On 05-Jul-21 9:40 AM, Xuan Ding wrote:
> >>>>>>> The use of an IOMMU has many advantages, such as isolation and address
> >>>>>>> translation. This patch extends the capability of the DMA engine to use
> >>>>>>> the IOMMU if the DMA device is bound to vfio.
> >>>>>>>
> >>>>>>> When the memory table is set, the guest memory will be mapped
> >>>>>>> into the default container of DPDK.
> >>>>>>>
> >>>>>>> Signed-off-by: Xuan Ding <xuan.d...@intel.com>
> >>>>>>> ---
> >>>>>>>   doc/guides/prog_guide/vhost_lib.rst |  9 ++++++
> >>>>>>>   lib/vhost/rte_vhost.h               |  1 +
> >>>>>>>   lib/vhost/socket.c                  |  9 ++++++
> >>>>>>>   lib/vhost/vhost.h                   |  1 +
> >>>>>>>   lib/vhost/vhost_user.c              | 46 ++++++++++++++++++++++++++++-
> >>>>>>>   5 files changed, 65 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>> b/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>> index 05c42c9b11..c3beda23d9 100644
> >>>>>>> --- a/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>> +++ b/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>> @@ -118,6 +118,15 @@ The following is an overview of some key Vhost
> >>>>>>> API functions:
> >>>>>>>       It is disabled by default.
> >>>>>>> +  - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
> >>>>>>> +
> >>>>>>> +    In the asynchronous data path, the vhost library is not aware of
> >>>>>>> +    which driver (igb_uio/vfio) the DMA device is bound to. The
> >>>>>>> +    application should pass this flag to tell the vhost library
> >>>>>>> +    whether the IOMMU should be programmed for guest memory.
> >>>>>>> +
> >>>>>>> +    It is disabled by default.
> >>>>>>> +
> >>>>>>>   - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
> >>>>>>>       Since v16.04, the vhost library forwards checksum and gso
> >>>>>>>       requests for
> >>>>>>> diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
> >>>>>>> index 8d875e9322..a766ea7b6b 100644
> >>>>>>> --- a/lib/vhost/rte_vhost.h
> >>>>>>> +++ b/lib/vhost/rte_vhost.h
> >>>>>>> @@ -37,6 +37,7 @@ extern "C" {
> >>>>>>>   #define RTE_VHOST_USER_LINEARBUF_SUPPORT (1ULL << 6)
> >>>>>>>   #define RTE_VHOST_USER_ASYNC_COPY (1ULL << 7)
> >>>>>>>   #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS (1ULL << 8)
> >>>>>>> +#define RTE_VHOST_USER_ASYNC_USE_VFIO (1ULL << 9)
> >>>>>>>   /* Features. */
> >>>>>>>   #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
> >>>>>>> diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
> >>>>>>> index 5d0d728d52..77c722c86b 100644
> >>>>>>> --- a/lib/vhost/socket.c
> >>>>>>> +++ b/lib/vhost/socket.c
> >>>>>>> @@ -42,6 +42,7 @@ struct vhost_user_socket {
> >>>>>>>       bool extbuf;
> >>>>>>>       bool linearbuf;
> >>>>>>>       bool async_copy;
> >>>>>>> +     bool async_use_vfio;
> >>>>>>>       bool net_compliant_ol_flags;
> >>>>>>>   /*
> >>>>>>> @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct
> >>>>>>> vhost_user_socket *vsocket)
> >>>>>>>               dev->async_copy = 1;
> >>>>>>>       }
> >>>>>>> +     if (vsocket->async_use_vfio) {
> >>>>>>> +             dev = get_device(vid);
> >>>>>>> +
> >>>>>>> +             if (dev)
> >>>>>>> +                     dev->async_use_vfio = 1;
> >>>>>>> +     }
> >>>>>>> +
> >>>>>>>       VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
> >>>>>>>       if (vsocket->notify_ops->new_connection) {
> >>>>>>> @@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path,
> >>>>>>> uint64_t flags)
> >>>>>>>       vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
> >>>>>>>       vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
> >>>>>>>       vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> >>>>>>> +     vsocket->async_use_vfio = flags & RTE_VHOST_USER_ASYNC_USE_VFIO;
> >>>>>>>       vsocket->net_compliant_ol_flags = flags &
> >>>>>>> RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
> >>>>>>>       if (vsocket->async_copy &&
> >>>>>>> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
> >>>>>>> index 8078ddff79..fb775ce4ed 100644
> >>>>>>> --- a/lib/vhost/vhost.h
> >>>>>>> +++ b/lib/vhost/vhost.h
> >>>>>>> @@ -370,6 +370,7 @@ struct virtio_net {
> >>>>>>>       int16_t broadcast_rarp;
> >>>>>>>       uint32_t nr_vring;
> >>>>>>>       int async_copy;
> >>>>>>> +     int async_use_vfio;
> >>>>>>>       int extbuf;
> >>>>>>>       int linearbuf;
> >>>>>>>       struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> >>>>>>> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> >>>>>>> index 8f0eba6412..f3703f2e72 100644
> >>>>>>> --- a/lib/vhost/vhost_user.c
> >>>>>>> +++ b/lib/vhost/vhost_user.c
> >>>>>>> @@ -45,6 +45,7 @@
> >>>>>>>   #include <rte_common.h>
> >>>>>>>   #include <rte_malloc.h>
> >>>>>>>   #include <rte_log.h>
> >>>>>>> +#include <rte_vfio.h>
> >>>>>>>   #include "iotlb.h"
> >>>>>>>   #include "vhost.h"
> >>>>>>> @@ -141,6 +142,36 @@ get_blk_size(int fd)
> >>>>>>>       return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
> >>>>>>>   }
> >>>>>>> +static int
> >>>>>>> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
> >>>>>>> +{
> >>>>>>> +     int ret = 0;
> >>>>>>> +     uint64_t host_iova;
> >>>>>>> +     host_iova = rte_mem_virt2iova((void *)(uintptr_t)region->host_user_addr);
> >>>>>>> +     if (do_map) {
> >>>>>>> +             /* Add mapped region into the default container of DPDK. */
> >>>>>>> +             ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> >>>>>>> +                                              region->host_user_addr,
> >>>>>>> +                                              host_iova,
> >>>>>>> +                                              region->size);
> >>>>>>> +             if (ret) {
> >>>>>>> +                     VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
> >>>>>>> +                     return ret;
> >>>>>>> +             }
> >>>>>>> +     } else {
> >>>>>>> +             /* Remove mapped region from the default container of DPDK.
> >>>>>>> +              */
> >>>>>>> +             ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> >>>>>>> +                                                region->host_user_addr,
> >>>>>>> +                                                host_iova,
> >>>>>>> +                                                region->size);
> >>>>>>> +             if (ret) {
> >>>>>>> +                     VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
> >>>>>>> +                     return ret;
> >>>>>>> +             }
> >>>>>>> +     }
> >>>>>>> +     return ret;
> >>>>>>> +}
> >>>>>>
> >>>>>> We've been discussing this off list with Xuan, and unfortunately this
> >>>>>> is a blocker for now.
> >>>>>>
> >>>>>> Currently, the x86 IOMMU does not support partial unmap - the segments
> >>>>>> have to be unmapped with exactly the same addr/len as they were mapped.
> >>>>>> We also concatenate adjacent mappings to prevent filling up the DMA
> >>>>>> mapping entry table with superfluous entries.
> >>>>>>
> >>>>>> This means that, when two unrelated mappings are contiguous in memory
> >>>>>> (e.g. if you map regions 1 and 2 independently, but they happen to be
> >>>>>> sitting right next to each other in virtual memory), we cannot later
> >>>>>> unmap one of them because, even though these are two separate mappings
> >>>>>> as far as the kernel VFIO infrastructure is concerned, the mapping gets
> >>>>>> compacted and looks like one single mapping to VFIO, so the DPDK API
> >>>>>> will not let us unmap region 1 without also unmapping region 2.
> >>>>>>
> >>>>>> The proper fix for this problem would be to always map memory
> >>>>>> page-by-page regardless of where it comes from (we already do that for
> >>>>>> internal memory, but not for external). However, the reason this works
> >>>>>> for internal memory is because when mapping internal memory segments,
> >>>>>> *we know the page size*. For external memory segments, there is no such
> >>>>>> guarantee, so we cannot deduce the page size for a given memory
> >>>>>> segment, and thus can't map things page-by-page.
> >>>>>>
> >>>>>> So, the proper fix for it would be to add page size to the VFIO DMA
> >>>>>> API. Unfortunately, it probably has to wait until 21.11 because it is
> >>>>>> an API change.
> >>>>>>
> >>>>>> The slightly hacky fix for this would be to forego user mem map
> >>>>>> concatenation and trust that the user is not going to do anything
> >>>>>> stupid, and will not spam the VFIO DMA API without reason. I would
> >>>>>> rather not go down this road, but this could be an option in this case.
> >>>>>>
> >>>>>> Thoughts?
> >>>>>>
> >>>>>
> >>>>> Thanks Anatoly for the detailed description of the issue.
> >>>>> It may be possible to either create a versioned symbol for this API
> >>>>> change, or maybe even to have a temporary internal API.
> >>>>>
> >>>>> But I think this series in its current form is not acceptable, so
> >>>>> waiting for v21.11 would be the best option (we may want to send the
> >>>>> deprecation notice in this release though).
> >>>>>
> >>>>> In this series, I don't like that the user application has to pass a
> >>>>> flag to state whether the DMA engine uses VFIO or not. AFAICT, this new
> >>>>> revision does not implement what was discussed in the previous one,
> >>>>> i.e. supporting both IOVA_AS_VA and IOVA_AS_PA.
> >>>>
> >>>> Thanks for your comments. Here I hope to explain some questions:
> >>>>
> >>>> 1. Are both IOVA_AS_VA and IOVA_AS_PA supported now?
> >>>> A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this version,
> >>>> the virtual address is replaced with the iova address of the mapped
> >>>> region, and the iova address is selected to program the IOMMU instead
> >>>> of the virtual address only.
> >>
> >> Good!
> >>
> >>>>
> >>>> 2. Why is a flag chosen to be passed by the application?
> >>>> A: Yes, as we discussed before, the rte_eal_iova_mode() API can be used
> >>>> to get the IOVA mode, so as to determine whether the IOMMU should be
> >>>> programmed. However, in the implementation process, I found a problem:
> >>>> how to distinguish VFIO PA from IGB_UIO PA.
> >>>> Because for VFIO cases, we should
> >>>> always program the IOMMU, while in IGB_UIO cases it depends on the
> >>>> IOMMU capability of the platform.
> >>>
> >>> How does one program the IOMMU with igb_uio? I was under the impression
> >>> that igb_uio (and uio_pci_generic, for that matter) does not provide
> >>> such facilities.
> >>
> >> +1
> >
> > Maybe there is some misunderstanding in this sentence.
> > In our design, if rte_eal_vfio_is_enabled("vfio") is true, the IOMMU will
> > be programmed. True means the vfio module is modprobed.
> >
> > But there is an exception here: even if the vfio module is modprobed,
> > the DPDK user can still bind all the devices to igb_uio.
> >
> > This situation can be distinguished in DPDK EAL initialization, because
> > the resource mapping is done according to the driver loaded by each
> > device (rte_pci_map_device).
> >
> > But in our scenario, this judgment is somewhat weak, because we cannot
> > get the device driver info in the vhost library, and I also think it is
> > unreasonable for vhost to do this. We can only trust that users will not
> > use it like this. Thoughts on this scenario?

> I don't see how igb_uio would make any difference at all. If you are
> using igb_uio, you *don't have DMA mapping at all* and will use raw
> physical addresses. Assuming your code supports this, that's all you're
> ever going to get. The point of VFIO is to have memory regions that are
> mapped for DMA *because real physical addresses are assumed to be not
> available*. When you're using igb_uio, you effectively do have DMA
> access to the entire memory, and thus can bypass the IOMMU altogether
> (assuming you're using passthrough mode).

My concern is exactly here. In igb_uio cases, although devices are not added
to the default container in EAL init, the "IOMMU programming" actually
happens when rte_vfio_container_dma_map() is called. It does no harm, but it
is also unnecessary.

> Bottom line: do VFIO DMA mapping unconditionally. If VFIO is active -
> great, the memory will be DMA mapped. If it's not active - no harm will
> ever be done by mapping the memory for DMA anyway.

By "do VFIO DMA mapping unconditionally", do you mean the
rte_eal_vfio_is_enabled() check is unnecessary? What if the platform does not
have an IOMMU?

Thanks very much.

Regards,
Xuan

> --
> Thanks,
> Anatoly