On Wed, Jul 07, 2021 at 03:33:44PM +0100, Burakov, Anatoly wrote:
> On 07-Jul-21 1:54 PM, Ding, Xuan wrote:
> > Hi Anatoly,
> > 
> > > -----Original Message-----
> > > From: Burakov, Anatoly <anatoly.bura...@intel.com>
> > > Sent: Wednesday, July 7, 2021 8:18 PM
> > > To: Ding, Xuan <xuan.d...@intel.com>; Maxime Coquelin
> > > <maxime.coque...@redhat.com>; Xia, Chenbo <chenbo....@intel.com>;
> > > Thomas Monjalon <tho...@monjalon.net>; David Marchand
> > > <david.march...@redhat.com>
> > > Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
> > > <sunil.pa...@intel.com>; Richardson, Bruce <bruce.richard...@intel.com>; 
> > > Van
> > > Haaren, Harry <harry.van.haa...@intel.com>; Liu, Yong 
> > > <yong....@intel.com>;
> > > Ma, WenwuX <wenwux...@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
> > > 
> > > On 07-Jul-21 7:25 AM, Ding, Xuan wrote:
> > > > Hi,
> > > > 
> > > > > -----Original Message-----
> > > > > From: Maxime Coquelin <maxime.coque...@redhat.com>
> > > > > Sent: Tuesday, July 6, 2021 5:32 PM
> > > > > To: Burakov, Anatoly <anatoly.bura...@intel.com>; Ding, Xuan
> > > > > <xuan.d...@intel.com>; Xia, Chenbo <chenbo....@intel.com>; Thomas
> > > > > Monjalon <tho...@monjalon.net>; David Marchand
> > > > > <david.march...@redhat.com>
> > > > > Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
> > > > > <sunil.pa...@intel.com>; Richardson, Bruce 
> > > > > <bruce.richard...@intel.com>;
> > > Van
> > > > > Haaren, Harry <harry.van.haa...@intel.com>; Liu, Yong
> > > <yong....@intel.com>;
> > > > > Ma, WenwuX <wenwux...@intel.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async 
> > > > > vhost
> > > > > 
> > > > > 
> > > > > 
> > > > > On 7/6/21 11:16 AM, Burakov, Anatoly wrote:
> > > > > > On 06-Jul-21 9:31 AM, Ding, Xuan wrote:
> > > > > > > Hi Maxime,
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Maxime Coquelin <maxime.coque...@redhat.com>
> > > > > > > > Sent: Monday, July 5, 2021 8:46 PM
> > > > > > > > To: Burakov, Anatoly <anatoly.bura...@intel.com>; Ding, Xuan
> > > > > > > > <xuan.d...@intel.com>; Xia, Chenbo <chenbo....@intel.com>; 
> > > > > > > > Thomas
> > > > > > > > Monjalon <tho...@monjalon.net>; David Marchand
> > > > > > > > <david.march...@redhat.com>
> > > > > > > > Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
> > > > > > > > <sunil.pa...@intel.com>; Richardson, Bruce
> > > <bruce.richard...@intel.com>;
> > > > > > > > Van Haaren, Harry <harry.van.haa...@intel.com>; Liu, Yong
> > > > > > > > <yong....@intel.com>; Ma, WenwuX <wenwux...@intel.com>
> > > > > > > > Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for 
> > > > > > > > async
> > > > > > > > vhost
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
> > > > > > > > > On 05-Jul-21 9:40 AM, Xuan Ding wrote:
> > > > > > > > > > The use of IOMMU has many advantages, such as isolation and 
> > > > > > > > > > address
> > > > > > > > > > translation. This patch extends the capability of the DMA
> > > > > > > > > > engine to use
> > > > > > > > > > IOMMU if the DMA device is bound to vfio.
> > > > > > > > > > 
> > > > > > > > > > When the memory table is set, the guest memory will be
> > > > > > > > > > mapped into the default container of DPDK.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Xuan Ding <xuan.d...@intel.com>
> > > > > > > > > > ---
> > > > > > > > > >      doc/guides/prog_guide/vhost_lib.rst |  9 ++++++
> > > > > > > > > >      lib/vhost/rte_vhost.h               |  1 +
> > > > > > > > > >      lib/vhost/socket.c                  |  9 ++++++
> > > > > > > > > >      lib/vhost/vhost.h                   |  1 +
> > > > > > > > > >      lib/vhost/vhost_user.c              | 46
> > > > > > > > > > ++++++++++++++++++++++++++++-
> > > > > > > > > >      5 files changed, 65 insertions(+), 1 deletion(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/doc/guides/prog_guide/vhost_lib.rst
> > > > > > > > > > b/doc/guides/prog_guide/vhost_lib.rst
> > > > > > > > > > index 05c42c9b11..c3beda23d9 100644
> > > > > > > > > > --- a/doc/guides/prog_guide/vhost_lib.rst
> > > > > > > > > > +++ b/doc/guides/prog_guide/vhost_lib.rst
> > > > > > > > > > @@ -118,6 +118,15 @@ The following is an overview of some 
> > > > > > > > > > key
> > > Vhost
> > > > > > > > > > API functions:
> > > > > > > > > >            It is disabled by default.
> > > > > > > > > >      +  - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
> > > > > > > > > > +
> > > > > > > > > > +    In asynchronous data path, vhost library is not aware 
> > > > > > > > > > of which
> > > > > > > > > > driver
> > > > > > > > > > +    (igb_uio/vfio) the DMA device is bound to. The application 
> > > > > > > > > > should
> > > > > > > > > > pass
> > > > > > > > > > +    this flag to tell vhost library whether IOMMU should be
> > > > > > > > > > programmed
> > > > > > > > > > +    for guest memory.
> > > > > > > > > > +
> > > > > > > > > > +    It is disabled by default.
> > > > > > > > > > +
> > > > > > > > > >        - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
> > > > > > > > > >            Since v16.04, the vhost library forwards 
> > > > > > > > > > checksum and gso
> > > > > > > > > > requests for
> > > > > > > > > > diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
> > > > > > > > > > index 8d875e9322..a766ea7b6b 100644
> > > > > > > > > > --- a/lib/vhost/rte_vhost.h
> > > > > > > > > > +++ b/lib/vhost/rte_vhost.h
> > > > > > > > > > @@ -37,6 +37,7 @@ extern "C" {
> > > > > > > > > >      #define RTE_VHOST_USER_LINEARBUF_SUPPORT    (1ULL << 6)
> > > > > > > > > >      #define RTE_VHOST_USER_ASYNC_COPY    (1ULL << 7)
> > > > > > > > > >      #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS    (1ULL 
> > > > > > > > > > <<
> > > 8)
> > > > > > > > > > +#define RTE_VHOST_USER_ASYNC_USE_VFIO    (1ULL << 9)
> > > > > > > > > >        /* Features. */
> > > > > > > > > >      #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
> > > > > > > > > > diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
> > > > > > > > > > index 5d0d728d52..77c722c86b 100644
> > > > > > > > > > --- a/lib/vhost/socket.c
> > > > > > > > > > +++ b/lib/vhost/socket.c
> > > > > > > > > > @@ -42,6 +42,7 @@ struct vhost_user_socket {
> > > > > > > > > >          bool extbuf;
> > > > > > > > > >          bool linearbuf;
> > > > > > > > > >          bool async_copy;
> > > > > > > > > > +    bool async_use_vfio;
> > > > > > > > > >          bool net_compliant_ol_flags;
> > > > > > > > > >            /*
> > > > > > > > > > @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, 
> > > > > > > > > > struct
> > > > > > > > > > vhost_user_socket *vsocket)
> > > > > > > > > >                  dev->async_copy = 1;
> > > > > > > > > >          }
> > > > > > > > > >      +    if (vsocket->async_use_vfio) {
> > > > > > > > > > +        dev = get_device(vid);
> > > > > > > > > > +
> > > > > > > > > > +        if (dev)
> > > > > > > > > > +            dev->async_use_vfio = 1;
> > > > > > > > > > +    }
> > > > > > > > > > +
> > > > > > > > > >          VHOST_LOG_CONFIG(INFO, "new device, handle is 
> > > > > > > > > > %d\n", vid);
> > > > > > > > > >            if (vsocket->notify_ops->new_connection) {
> > > > > > > > > > @@ -879,6 +887,7 @@ rte_vhost_driver_register(const char 
> > > > > > > > > > *path,
> > > > > > > > > > uint64_t flags)
> > > > > > > > > >          vsocket->extbuf = flags & 
> > > > > > > > > > RTE_VHOST_USER_EXTBUF_SUPPORT;
> > > > > > > > > >          vsocket->linearbuf = flags &
> > > > > RTE_VHOST_USER_LINEARBUF_SUPPORT;
> > > > > > > > > >          vsocket->async_copy = flags & 
> > > > > > > > > > RTE_VHOST_USER_ASYNC_COPY;
> > > > > > > > > > +    vsocket->async_use_vfio = flags &
> > > > > > > > RTE_VHOST_USER_ASYNC_USE_VFIO;
> > > > > > > > > >          vsocket->net_compliant_ol_flags = flags &
> > > > > > > > > > RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
> > > > > > > > > >            if (vsocket->async_copy &&
> > > > > > > > > > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
> > > > > > > > > > index 8078ddff79..fb775ce4ed 100644
> > > > > > > > > > --- a/lib/vhost/vhost.h
> > > > > > > > > > +++ b/lib/vhost/vhost.h
> > > > > > > > > > @@ -370,6 +370,7 @@ struct virtio_net {
> > > > > > > > > >          int16_t            broadcast_rarp;
> > > > > > > > > >          uint32_t        nr_vring;
> > > > > > > > > >          int            async_copy;
> > > > > > > > > > +    int            async_use_vfio;
> > > > > > > > > >          int            extbuf;
> > > > > > > > > >          int            linearbuf;
> > > > > > > > > >          struct vhost_virtqueue    
> > > > > > > > > > *virtqueue[VHOST_MAX_QUEUE_PAIRS *
> > > > > > > > > > 2];
> > > > > > > > > > diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> > > > > > > > > > index 8f0eba6412..f3703f2e72 100644
> > > > > > > > > > --- a/lib/vhost/vhost_user.c
> > > > > > > > > > +++ b/lib/vhost/vhost_user.c
> > > > > > > > > > @@ -45,6 +45,7 @@
> > > > > > > > > >      #include <rte_common.h>
> > > > > > > > > >      #include <rte_malloc.h>
> > > > > > > > > >      #include <rte_log.h>
> > > > > > > > > > +#include <rte_vfio.h>
> > > > > > > > > >        #include "iotlb.h"
> > > > > > > > > >      #include "vhost.h"
> > > > > > > > > > @@ -141,6 +142,36 @@ get_blk_size(int fd)
> > > > > > > > > >          return ret == -1 ? (uint64_t)-1 : 
> > > > > > > > > > (uint64_t)stat.st_blksize;
> > > > > > > > > >      }
> > > > > > > > > >      +static int
> > > > > > > > > > +async_dma_map(struct rte_vhost_mem_region *region, bool 
> > > > > > > > > > do_map)
> > > > > > > > > > +{
> > > > > > > > > > +    int ret = 0;
> > > > > > > > > > +    uint64_t host_iova;
> > > > > > > > > > +    host_iova = rte_mem_virt2iova((void
> > > > > > > > > > *)(uintptr_t)region->host_user_addr);
> > > > > > > > > > +    if (do_map) {
> > > > > > > > > > +        /* Add mapped region into the default container of 
> > > > > > > > > > DPDK. */
> > > > > > > > > > +        ret =
> > > > > > > > rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > > > > > > > > > +                         region->host_user_addr,
> > > > > > > > > > +                         host_iova,
> > > > > > > > > > +                         region->size);
> > > > > > > > > > +        if (ret) {
> > > > > > > > > > +            VHOST_LOG_CONFIG(ERR, "DMA engine map 
> > > > > > > > > > failed\n");
> > > > > > > > > > +            return ret;
> > > > > > > > > > +        }
> > > > > > > > > > +    } else {
> > > > > > > > > > +        /* Remove mapped region from the default container 
> > > > > > > > > > of
> > > > > > > > > > DPDK. */
> > > > > > > > > > +        ret =
> > > > > > > > > > rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > > > > > > > > > +                           region->host_user_addr,
> > > > > > > > > > +                           host_iova,
> > > > > > > > > > +                           region->size);
> > > > > > > > > > +        if (ret) {
> > > > > > > > > > +            VHOST_LOG_CONFIG(ERR, "DMA engine unmap 
> > > > > > > > > > failed\n");
> > > > > > > > > > +            return ret;
> > > > > > > > > > +        }
> > > > > > > > > > +    }
> > > > > > > > > > +    return ret;
> > > > > > > > > > +}
> > > > > > > > > 
> > > > > > > > > We've been discussing this off list with Xuan, and 
> > > > > > > > > unfortunately
> > > > > > > > > this is
> > > > > > > > > a blocker for now.
> > > > > > > > > 
> > > > > > > > > Currently, the x86 IOMMU does not support partial unmap - the
> > > segments
> > > > > > > > > have to be unmapped exactly the same addr/len as they were 
> > > > > > > > > mapped.
> > > We
> > > > > > > > > also concatenate adjacent mappings to prevent filling up the 
> > > > > > > > > DMA
> > > > > > > > > mapping
> > > > > > > > > entry table with superfluous entries.
> > > > > > > > > 
> > > > > > > > > This means that, when two unrelated mappings are contiguous in
> > > memory
> > > > > > > > > (e.g. if you map regions 1 and 2 independently, but they 
> > > > > > > > > happen to be
> > > > > > > > > sitting right next to each other in virtual memory), we 
> > > > > > > > > cannot later
> > > > > > > > > unmap one of them because, even though these are two separate
> > > > > > > > mappings
> > > > > > > > > as far as kernel VFIO infrastructure is concerned, the 
> > > > > > > > > mapping gets
> > > > > > > > > compacted and looks like one single mapping to VFIO, so DPDK 
> > > > > > > > > API will
> > > > > > > > > not let us unmap region 1 without also unmapping region 2.
> > > > > > > > > 
> > > > > > > > > The proper fix for this problem would be to always map memory
> > > > > > > > > page-by-page regardless of where it comes from (we already do 
> > > > > > > > > that for
> > > > > > > > > internal memory, but not for external). However, the reason 
> > > > > > > > > this works
> > > > > > > > > for internal memory is because when mapping internal memory
> > > segments,
> > > > > > > > > *we know the page size*. For external memory segments, there 
> > > > > > > > > is no
> > > such
> > > > > > > > > guarantee, so we cannot deduce page size for a given memory 
> > > > > > > > > segment,
> > > > > > > > and
> > > > > > > > > thus can't map things page-by-page.
> > > > > > > > > 
> > > > > > > > > So, the proper fix for it would be to add page size to the 
> > > > > > > > > VFIO DMA
> > > > > > > > > API.
> > > > > > > > > Unfortunately, it probably has to wait until 21.11 because it 
> > > > > > > > > is an API
> > > > > > > > > change.
> > > > > > > > > 
> > > > > > > > > The slightly hacky fix for this would be to forego user mem 
> > > > > > > > > map
> > > > > > > > > concatenation and trust that user is not going to do anything 
> > > > > > > > > stupid,
> > > > > > > > > and will not spam the VFIO DMA API without reason. I would 
> > > > > > > > > rather
> > > > > > > > > not go
> > > > > > > > > down this road, but this could be an option in this case.
> > > > > > > > > 
> > > > > > > > > Thoughts?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Thanks Anatoly for the detailed description of the issue.
> > > > > > > > It may be possible to either create a versioned symbol for this 
> > > > > > > > API
> > > > > > > > change, or maybe even to have a temporary internal API.
> > > > > > > > 
> > > > > > > > But I think this series in its current form is not acceptable, 
> > > > > > > > so
> > > > > > > > waiting for v21.11 would be the best option (we may want to 
> > > > > > > > send the
> > > > > > > > deprecation notice in this release though).
> > > > > > > > 
> > > > > > > > In this series, I don't like the user application has to pass a 
> > > > > > > > flag to
> > > > > > > > state whether the DMA engine uses VFIO or not. AFAICT, this new 
> > > > > > > > revision
> > > > > > > > does not implement what was discussed in the previous one, i.e.
> > > > > > > > supporting both IOVA_AS_VA and IOVA_AS_PA.
> > > > > > > 
> > > > > > > Thanks for your comments. Here I hope to explain some questions:
> > > > > > > 1. Are both IOVA_AS_VA and IOVA_AS_PA supported now?
> > > > > > > A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this
> > > > > > > version, the virtual address is replaced with the IOVA address of
> > > > > > > the mapped region, and the IOVA address is used to program the
> > > > > > > IOMMU instead of the virtual address only.
> > > > > 
> > > > > Good!
> > > > > 
> > > > > > > 
> > > > > > > 2. Why is a flag chosen to be passed by the application?
> > > > > > > A: Yes, as we discussed before, the rte_eal_iova_mode() API can
> > > > > > > be used to get the IOVA mode, so as to determine whether the
> > > > > > > IOMMU should be programmed. However, during implementation I
> > > > > > > found a problem: how to distinguish VFIO PA from IGB_UIO PA.
> > > > > > > In VFIO cases we should always program the IOMMU, while in
> > > > > > > IGB_UIO cases it depends on the IOMMU capability of the platform.
> > > > > > 
> > > > > > How does one program IOMMU with igb_uio? I was under impression that
> > > > > > igb_uio (and uio_pci_generic for that matter) does not provide such
> > > > > > facilities.
> > > > > 
> > > > > +1
> > > > 
> > > > Maybe there is some misunderstanding in this sentence.
> > > > In our design, if rte_eal_vfio_is_enabled("vfio") is true, the IOMMU
> > > > will be programmed. True means the vfio module is loaded.
> > > > 
> > > > But there is an exception here: even if the vfio module is loaded,
> > > > the DPDK user may still bind all the devices to igb_uio.
> > > > 
> > > > This situation can be distinguished during DPDK EAL initialization,
> > > > because the resource mapping depends on the driver each device is
> > > > bound to (rte_pci_map_device).
> > > > 
> > > > But in our scenario this judgment is somewhat weak, because we cannot
> > > > get the device driver info in the vhost library, and I also think it
> > > > is unreasonable for vhost to do this. We can only trust that users
> > > > will not use it like this. Any thoughts on this scenario?
> > > 
> > > I don't see how igb_uio would make any difference at all. If you are
> > > using igb_uio, you *don't have DMA mapping at all* and will use raw
> > > physical addresses. Assuming your code supports this, that's all you're
> > > ever going to get. The point of VFIO is to have memory regions that are
> > > mapped for DMA *because real physical addresses are assumed to be not
> > > available*. When you're using igb_uio, you effectively do have DMA
> > > access to the entire memory, and thus can bypass IOMMU altogether
> > > (assuming you're using passthrough mode).
> > 
> > My concern is exactly here.
> > In igb_uio cases, although devices are not added to the default container
> > in EAL init, the "IOMMU programming" still happens when
> > rte_vfio_container_dma_map() is called.
> > It is harmless, but it is also unnecessary.
> 
> Yes, it is unnecessary, but it's also not actively harmful, which means you
> can still do it without any regard as to whether you do or don't have IOMMU
> :)
> 
> Think of a hybrid VFIO/igb_uio setup - some NICs will have VFIO, some will
> have igb_uio. The igb_uio-bound NICs will not care if you have mapped
> anything for DMA because they don't go through IOMMU, things will "just
> work". The VFIO-bound NICs will get the memory mapped, because they are the
> ones who actually need the DMA mapping.
> 
Do we even support a hybrid setup? I would have thought that was just
asking for problems and should be considered an unsupported configuration?
