> -----Original Message-----
> From: Burakov, Anatoly <anatoly.bura...@intel.com>
> Sent: Wednesday, July 7, 2021 10:34 PM
> To: Ding, Xuan <xuan.d...@intel.com>; Maxime Coquelin
> <maxime.coque...@redhat.com>; Xia, Chenbo <chenbo....@intel.com>;
> Thomas Monjalon <tho...@monjalon.net>; David Marchand
> <david.march...@redhat.com>
> Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
> <sunil.pa...@intel.com>; Richardson, Bruce <bruce.richard...@intel.com>;
> Van Haaren, Harry <harry.van.haa...@intel.com>; Liu, Yong
> <yong....@intel.com>; Ma, WenwuX <wenwux...@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
>
> On 07-Jul-21 1:54 PM, Ding, Xuan wrote:
> > Hi Anatoly,
> >
> >> -----Original Message-----
> >> From: Burakov, Anatoly <anatoly.bura...@intel.com>
> >> Sent: Wednesday, July 7, 2021 8:18 PM
> >> To: Ding, Xuan <xuan.d...@intel.com>; Maxime Coquelin
> >> <maxime.coque...@redhat.com>; Xia, Chenbo <chenbo....@intel.com>;
> >> Thomas Monjalon <tho...@monjalon.net>; David Marchand
> >> <david.march...@redhat.com>
> >> Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
> >> <sunil.pa...@intel.com>; Richardson, Bruce <bruce.richard...@intel.com>;
> >> Van Haaren, Harry <harry.van.haa...@intel.com>; Liu, Yong
> >> <yong....@intel.com>; Ma, WenwuX <wenwux...@intel.com>
> >> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
> >>
> >> On 07-Jul-21 7:25 AM, Ding, Xuan wrote:
> >>> Hi,
> >>>
> >>>> -----Original Message-----
> >>>> From: Maxime Coquelin <maxime.coque...@redhat.com>
> >>>> Sent: Tuesday, July 6, 2021 5:32 PM
> >>>> To: Burakov, Anatoly <anatoly.bura...@intel.com>; Ding, Xuan
> >>>> <xuan.d...@intel.com>; Xia, Chenbo <chenbo....@intel.com>; Thomas
> >>>> Monjalon <tho...@monjalon.net>; David Marchand
> >>>> <david.march...@redhat.com>
> >>>> Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
> >>>> <sunil.pa...@intel.com>; Richardson, Bruce
> >>>> <bruce.richard...@intel.com>; Van Haaren, Harry
> >>>> <harry.van.haa...@intel.com>; Liu, Yong <yong....@intel.com>;
> >>>> Ma, WenwuX <wenwux...@intel.com>
> >>>> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
> >>>>
> >>>>
> >>>>
> >>>> On 7/6/21 11:16 AM, Burakov, Anatoly wrote:
> >>>>> On 06-Jul-21 9:31 AM, Ding, Xuan wrote:
> >>>>>> Hi Maxime,
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Maxime Coquelin <maxime.coque...@redhat.com>
> >>>>>>> Sent: Monday, July 5, 2021 8:46 PM
> >>>>>>> To: Burakov, Anatoly <anatoly.bura...@intel.com>; Ding, Xuan
> >>>>>>> <xuan.d...@intel.com>; Xia, Chenbo <chenbo....@intel.com>; Thomas
> >>>>>>> Monjalon <tho...@monjalon.net>; David Marchand
> >>>>>>> <david.march...@redhat.com>
> >>>>>>> Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
> >>>>>>> <sunil.pa...@intel.com>; Richardson, Bruce <bruce.richard...@intel.com>;
> >>>>>>> Van Haaren, Harry <harry.van.haa...@intel.com>; Liu, Yong
> >>>>>>> <yong....@intel.com>; Ma, WenwuX <wenwux...@intel.com>
> >>>>>>> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async
> >>>>>>> vhost
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
> >>>>>>>> On 05-Jul-21 9:40 AM, Xuan Ding wrote:
> >>>>>>>>> The use of IOMMU has many advantages, such as isolation and address
> >>>>>>>>> translation. This patch extends the capability of the DMA engine to
> >>>>>>>>> use the IOMMU if the DMA device is bound to vfio.
> >>>>>>>>>
> >>>>>>>>> When the memory table is set, the guest memory will be mapped
> >>>>>>>>> into the default container of DPDK.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Xuan Ding <xuan.d...@intel.com>
> >>>>>>>>> ---
> >>>>>>>>>   doc/guides/prog_guide/vhost_lib.rst |  9 ++++++
> >>>>>>>>>   lib/vhost/rte_vhost.h               |  1 +
> >>>>>>>>>   lib/vhost/socket.c                  |  9 ++++++
> >>>>>>>>>   lib/vhost/vhost.h                   |  1 +
> >>>>>>>>>   lib/vhost/vhost_user.c              | 46 ++++++++++++++++++++++++++++-
> >>>>>>>>>   5 files changed, 65 insertions(+), 1 deletion(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>>>> b/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>>>> index 05c42c9b11..c3beda23d9 100644
> >>>>>>>>> --- a/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>>>> +++ b/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>>>> @@ -118,6 +118,15 @@ The following is an overview of some key Vhost
> >>>>>>>>> API functions:
> >>>>>>>>>       It is disabled by default.
> >>>>>>>>> +  - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
> >>>>>>>>> +
> >>>>>>>>> +    In the asynchronous data path, the vhost library is not aware of
> >>>>>>>>> +    which driver (igb_uio/vfio) the DMA device is bound to. The
> >>>>>>>>> +    application should pass this flag to tell the vhost library
> >>>>>>>>> +    whether the IOMMU should be programmed for guest memory.
> >>>>>>>>> +
> >>>>>>>>> +    It is disabled by default.
> >>>>>>>>> +
> >>>>>>>>>     - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
> >>>>>>>>>       Since v16.04, the vhost library forwards checksum and gso
> >>>>>>>>>       requests for
> >>>>>>>>> diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
> >>>>>>>>> index 8d875e9322..a766ea7b6b 100644
> >>>>>>>>> --- a/lib/vhost/rte_vhost.h
> >>>>>>>>> +++ b/lib/vhost/rte_vhost.h
> >>>>>>>>> @@ -37,6 +37,7 @@ extern "C" {
> >>>>>>>>>   #define RTE_VHOST_USER_LINEARBUF_SUPPORT (1ULL << 6)
> >>>>>>>>>   #define RTE_VHOST_USER_ASYNC_COPY (1ULL << 7)
> >>>>>>>>>   #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS (1ULL << 8)
> >>>>>>>>> +#define RTE_VHOST_USER_ASYNC_USE_VFIO (1ULL << 9)
> >>>>>>>>>   /* Features.
> >>>>>>>>> */
> >>>>>>>>>   #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
> >>>>>>>>> diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
> >>>>>>>>> index 5d0d728d52..77c722c86b 100644
> >>>>>>>>> --- a/lib/vhost/socket.c
> >>>>>>>>> +++ b/lib/vhost/socket.c
> >>>>>>>>> @@ -42,6 +42,7 @@ struct vhost_user_socket {
> >>>>>>>>>       bool extbuf;
> >>>>>>>>>       bool linearbuf;
> >>>>>>>>>       bool async_copy;
> >>>>>>>>> +     bool async_use_vfio;
> >>>>>>>>>       bool net_compliant_ol_flags;
> >>>>>>>>>   /*
> >>>>>>>>> @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct
> >>>>>>>>> vhost_user_socket *vsocket)
> >>>>>>>>>               dev->async_copy = 1;
> >>>>>>>>>       }
> >>>>>>>>> +     if (vsocket->async_use_vfio) {
> >>>>>>>>> +             dev = get_device(vid);
> >>>>>>>>> +
> >>>>>>>>> +             if (dev)
> >>>>>>>>> +                     dev->async_use_vfio = 1;
> >>>>>>>>> +     }
> >>>>>>>>> +
> >>>>>>>>>       VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
> >>>>>>>>>       if (vsocket->notify_ops->new_connection) {
> >>>>>>>>> @@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path,
> >>>>>>>>> uint64_t flags)
> >>>>>>>>>       vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
> >>>>>>>>>       vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
> >>>>>>>>>       vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> >>>>>>>>> +     vsocket->async_use_vfio = flags & RTE_VHOST_USER_ASYNC_USE_VFIO;
> >>>>>>>>>       vsocket->net_compliant_ol_flags = flags &
> >>>>>>>>>               RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
> >>>>>>>>>       if (vsocket->async_copy &&
> >>>>>>>>> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
> >>>>>>>>> index 8078ddff79..fb775ce4ed 100644
> >>>>>>>>> --- a/lib/vhost/vhost.h
> >>>>>>>>> +++ b/lib/vhost/vhost.h
> >>>>>>>>> @@ -370,6 +370,7 @@ struct virtio_net {
> >>>>>>>>>       int16_t broadcast_rarp;
> >>>>>>>>>       uint32_t nr_vring;
> >>>>>>>>>       int async_copy;
> >>>>>>>>> +     int async_use_vfio;
> >>>>>>>>>       int extbuf;
> >>>>>>>>>       int linearbuf;
> >>>>>>>>>       struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS *
> >>>>>>>>> 2];
> >>>>>>>>> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> >>>>>>>>> index 8f0eba6412..f3703f2e72 100644
> >>>>>>>>> --- a/lib/vhost/vhost_user.c
> >>>>>>>>> +++ b/lib/vhost/vhost_user.c
> >>>>>>>>> @@ -45,6 +45,7 @@
> >>>>>>>>>   #include <rte_common.h>
> >>>>>>>>>   #include <rte_malloc.h>
> >>>>>>>>>   #include <rte_log.h>
> >>>>>>>>> +#include <rte_vfio.h>
> >>>>>>>>>   #include "iotlb.h"
> >>>>>>>>>   #include "vhost.h"
> >>>>>>>>> @@ -141,6 +142,36 @@ get_blk_size(int fd)
> >>>>>>>>>       return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
> >>>>>>>>>   }
> >>>>>>>>> +static int
> >>>>>>>>> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
> >>>>>>>>> +{
> >>>>>>>>> +     int ret = 0;
> >>>>>>>>> +     uint64_t host_iova;
> >>>>>>>>> +     host_iova = rte_mem_virt2iova((void *)(uintptr_t)region->host_user_addr);
> >>>>>>>>> +     if (do_map) {
> >>>>>>>>> +             /* Add mapped region into the default container of DPDK. */
> >>>>>>>>> +             ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> >>>>>>>>> +                                              region->host_user_addr,
> >>>>>>>>> +                                              host_iova,
> >>>>>>>>> +                                              region->size);
> >>>>>>>>> +             if (ret) {
> >>>>>>>>> +                     VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
> >>>>>>>>> +                     return ret;
> >>>>>>>>> +             }
> >>>>>>>>> +     } else {
> >>>>>>>>> +             /* Remove mapped region from the default container of DPDK. */
> >>>>>>>>> +             ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> >>>>>>>>> +                                                region->host_user_addr,
> >>>>>>>>> +                                                host_iova,
> >>>>>>>>> +                                                region->size);
> >>>>>>>>> +             if (ret) {
> >>>>>>>>> +                     VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
> >>>>>>>>> +                     return ret;
> >>>>>>>>> +             }
> >>>>>>>>> +     }
> >>>>>>>>> +
> >>>>>>>>> +     return ret;
> >>>>>>>>> +}
> >>>>>>>>
> >>>>>>>> We've been discussing this off list with Xuan, and unfortunately this
> >>>>>>>> is a blocker for now.
> >>>>>>>>
> >>>>>>>> Currently, the x86 IOMMU does not support partial unmap - the segments
> >>>>>>>> have to be unmapped with exactly the same addr/len as they were mapped.
> >>>>>>>> We also concatenate adjacent mappings to prevent filling up the DMA
> >>>>>>>> mapping entry table with superfluous entries.
> >>>>>>>>
> >>>>>>>> This means that, when two unrelated mappings are contiguous in memory
> >>>>>>>> (e.g. if you map regions 1 and 2 independently, but they happen to be
> >>>>>>>> sitting right next to each other in virtual memory), we cannot later
> >>>>>>>> unmap one of them because, even though these are two separate mappings
> >>>>>>>> as far as the kernel VFIO infrastructure is concerned, the mapping gets
> >>>>>>>> compacted and looks like one single mapping to VFIO, so the DPDK API
> >>>>>>>> will not let us unmap region 1 without also unmapping region 2.
> >>>>>>>>
> >>>>>>>> The proper fix for this problem would be to always map memory
> >>>>>>>> page-by-page regardless of where it comes from (we already do that for
> >>>>>>>> internal memory, but not for external). However, the reason this works
> >>>>>>>> for internal memory is because when mapping internal memory segments,
> >>>>>>>> *we know the page size*. For external memory segments, there is no such
> >>>>>>>> guarantee, so we cannot deduce the page size for a given memory
> >>>>>>>> segment, and thus can't map things page-by-page.
> >>>>>>>>
> >>>>>>>> So, the proper fix for it would be to add page size to the VFIO DMA
> >>>>>>>> API. Unfortunately, it probably has to wait until 21.11 because it is
> >>>>>>>> an API change.
> >>>>>>>>
> >>>>>>>> The slightly hacky fix for this would be to forego user mem map
> >>>>>>>> concatenation and trust that the user is not going to do anything
> >>>>>>>> stupid, and will not spam the VFIO DMA API without reason.
> >>>>>>>> I would rather not go down this road, but this could be an option in
> >>>>>>>> this case.
> >>>>>>>>
> >>>>>>>> Thoughts?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Thanks Anatoly for the detailed description of the issue.
> >>>>>>> It may be possible to either create a versioned symbol for this API
> >>>>>>> change, or maybe even to have a temporary internal API.
> >>>>>>>
> >>>>>>> But I think this series in its current form is not acceptable, so
> >>>>>>> waiting for v21.11 would be the best option (we may want to send the
> >>>>>>> deprecation notice in this release though).
> >>>>>>>
> >>>>>>> In this series, I don't like that the user application has to pass a
> >>>>>>> flag to state whether the DMA engine uses VFIO or not. AFAICT, this new
> >>>>>>> revision does not implement what was discussed in the previous one,
> >>>>>>> i.e. supporting both IOVA_AS_VA and IOVA_AS_PA.
> >>>>>>
> >>>>>> Thanks for your comments. Here I hope to explain some questions:
> >>>>>> 1. Are both IOVA_AS_VA and IOVA_AS_PA supported now?
> >>>>>> A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this version,
> >>>>>> the virtual address is replaced with the iova address of the mapped
> >>>>>> region, and the iova address is selected to program the IOMMU instead
> >>>>>> of the virtual address only.
> >>>>
> >>>> Good!
> >>>>
> >>>>>>
> >>>>>> 2. Why is a flag chosen to be passed by the application?
> >>>>>> A: Yes, as we discussed before, the rte_eal_iova_mode() API can be used
> >>>>>> to get the IOVA mode, so as to determine whether the IOMMU should be
> >>>>>> programmed. However, in the implementation process, I found a problem:
> >>>>>> how to distinguish the VFIO PA and IGB_UIO PA cases. In VFIO cases, we
> >>>>>> should always program the IOMMU, while in IGB_UIO cases it depends on
> >>>>>> the IOMMU capability of the platform.
> >>>>>
> >>>>> How does one program the IOMMU with igb_uio?
> >>>>> I was under the impression that igb_uio (and uio_pci_generic, for that
> >>>>> matter) does not provide such facilities.
> >>>>
> >>>> +1
> >>>
> >>> Maybe there is some misunderstanding in this sentence.
> >>> In our design, if rte_eal_vfio_is_enabled("vfio") is true, the IOMMU will
> >>> be programmed. True means the vfio module is modprobed.
> >>>
> >>> But there is an exception here: even if the vfio module is modprobed,
> >>> the DPDK user may still bind all the devices to igb_uio.
> >>>
> >>> This situation can be distinguished in DPDK EAL initialization, because
> >>> the resource mapping is done according to the driver loaded by each
> >>> device (rte_pci_map_device).
> >>>
> >>> But in our scenario, this judgment is somewhat weak, because we cannot
> >>> get the device driver info in the vhost library. I also think it is
> >>> unreasonable for vhost to do this. We can only trust that users will not
> >>> use it like this. Thoughts on this scenario?
> >>
> >> I don't see how igb_uio would make any difference at all. If you are
> >> using igb_uio, you *don't have DMA mapping at all* and will use raw
> >> physical addresses. Assuming your code supports this, that's all you're
> >> ever going to get. The point of VFIO is to have memory regions that are
> >> mapped for DMA *because real physical addresses are assumed to be not
> >> available*. When you're using igb_uio, you effectively do have DMA
> >> access to the entire memory, and thus can bypass the IOMMU altogether
> >> (assuming you're using passthrough mode).
> >
> > My concern is exactly here.
> > In igb_uio cases, although devices are not added to the default container
> > in EAL init, the "IOMMU programming" actually happens when
> > rte_vfio_container_dma_map() is called.
> > It is no harm, but it is also unnecessary.
>
> Yes, it is unnecessary, but it's also not actively harmful, which means
> you can still do it without any regard as to whether you do or don't
> have an IOMMU :)
>
> Think of a hybrid VFIO/igb_uio setup - some NICs will have VFIO, some
> will have igb_uio. The igb_uio-bound NICs will not care if you have
> mapped anything for DMA because they don't go through the IOMMU, things
> will "just work". The VFIO-bound NICs will get the memory mapped, because
> they are the ones who actually need the DMA mapping.
>
> So, what you get is, if you do VFIO DMA mapping unconditionally, 1) NICs
> with igb_uio won't care about this, and 2) NICs with VFIO will benefit.
> You're not "mapping" the NICs, you're mapping the memory you're
> accessing with those NICs. You need it to be accessible to both, but
> since you have no way of knowing whether 1) any of the current HW needs
> VFIO, and 2) any of *future hotplugged* HW needs VFIO, the easiest way
> to solve this problem is just to map things regardless, and live with
> the "unnecessary" but harmless mapping in the worst case.
Get your point! It's just that such a worst case bothers me. I have been
thinking about how to avoid programming the IOMMU in the igb_uio case, but
I cannot achieve this with a simple check. Since it is harmless in this
case (and a platform without an IOMMU won't do anything useful anyway),
I think it works to program the IOMMU unconditionally.

> >
> >>
> >> Bottom line: do VFIO DMA mapping unconditionally. If VFIO is active -
> >> great, the memory will be DMA mapped. If it's not active - no harm will
> >> ever be done by mapping the memory for DMA anyway.
> >
> > Do VFIO DMA mapping unconditionally, do you mean the
> > rte_eal_vfio_is_enabled() is unnecessary?
> > What if the platform does not have IOMMU?
> >
> > Thanks very much.
> >
>
> If the platform has no IOMMU, the API call will just not do anything
> useful, so no harm done.

So the only thing remaining is the API change for page-by-page mapping in
the next release.

Thanks,
Xuan

> > Regards,
> > Xuan
> >
> >>
> >> --
> >> Thanks,
> >> Anatoly
>
> --
> Thanks,
> Anatoly