Hi Maxime,

Replies are inline.
> -----Original Message-----
> From: Maxime Coquelin <maxime.coque...@redhat.com>
> Sent: Saturday, June 19, 2021 12:18 AM
> To: Ding, Xuan <xuan.d...@intel.com>; Xia, Chenbo <chenbo....@intel.com>
> Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
> <sunil.pa...@intel.com>; Richardson, Bruce <bruce.richard...@intel.com>; Van
> Haaren, Harry <harry.van.haa...@intel.com>; Liu, Yong <yong....@intel.com>
> Subject: Re: [PATCH v3] vhost: enable IOMMU for async vhost
>
> Hi Xuan,
>
> On 6/3/21 7:30 PM, xuan.d...@intel.com wrote:
> > From: Xuan Ding <xuan.d...@intel.com>
> >
> > For async copy, it is unsafe to directly use the physical address.
> > And current address translation from GPA to HPA via SW also takes
> > CPU cycles, these can all benefit from IOMMU.
> >
> > Since the existing DMA engine supports to use platform IOMMU,
> > this patch enables IOMMU for async vhost, which defines IOAT
> > devices to use virtual address instead of physical address.
>
> We have to keep in mind a generic DMA api is coming, and maybe we want
> a SW implementation of a dmadev based on memcpy at least for
> testing/debugging purpose.

I noticed the generic dmadev model is under discussion. To support a SW
implementation, VA mode support is needed, which is also the problem this
patch hopes to solve. Traditionally, a DMA engine in PA mode can only use
physical addresses.

>
> > When set memory table, the frontend's memory will be mapped
> > to the default container of DPDK where IOAT devices have been
> > added into. When DMA copy fails, the virtual address provided
> > to IOAT devices also allow us fallback to SW copy or PA copy.
> >
> > With IOMMU enabled, to use IOAT devices:
> > 1. IOAT devices must be binded to vfio-pci, rather than igb_uio.
> > 2. DPDK must use "--iova-mode=va".
>
> I think this is problematic, at least we need to check the right iova
> mode has been selected, but even with doing that it is limiting.

As a library, vhost is not aware of which address type (PA or VA) the device
selects, nor of the current DPDK IOVA mode. To some extent, this patch is a
proposal: with the device fed virtual addresses, SW fallback can be
supported, and a VA can also be translated to a PA through
rte_mem_virt2iova(). In the end, the address type the device uses is
determined by the callback, not the other way around. If the DMA callback
implementer follows this design, SW fallback can be supported (a rough
sketch of the idea is at the end of this mail). I would be very grateful if
you could provide some insights on this design. :)

>
> What prevent us to reuse add_guest_pages() alogrithm to implement
> IOVA_AS_PA?

If the IOVA is a PA, it is not easy to translate the PA back to a VA to
support a SW implementation. So far I don't have a good idea for supporting
IOVA_AS_PA and IOVA_AS_VA at the same time, because that would require the
vhost library to select PA or VA for the DMA device according to the DPDK
IOVA mode.

Thanks,
Xuan

>
> >
> > Signed-off-by: Xuan Ding <xuan.d...@intel.com>
> > ---
> >
> > v3:
> > * Fixed some typos.
> >
> > v2:
> > * Fixed a format issue.
> > * Added the dma unmap logic when device is closed.
> > ---
> >  doc/guides/prog_guide/vhost_lib.rst |  20 +++++
> >  lib/vhost/vhost_user.c              | 125 +++++++++-------------------
> >  lib/vhost/virtio_net.c              |  30 +++----
> >  3 files changed, 69 insertions(+), 106 deletions(-)
> >
> > diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
> > index d18fb98910..5777f0da96 100644
> > --- a/doc/guides/prog_guide/vhost_lib.rst
> > +++ b/doc/guides/prog_guide/vhost_lib.rst
> > @@ -420,3 +420,23 @@ Finally, a set of device ops is defined for device specific operations:
> >  * ``get_notify_area``
> >
> >    Called to get the notify area info of the queue.
> > +
> > +Vhost async data path
> > +---------------------
> > +
> > +* Address mode
> > +
> > +  Modern IOAT devices support to use the IOMMU, which can avoid using
> > +  the unsafe HPA. Besides, the CPU cycles took by SW to translate from
> > +  GPA to HPA can also be saved. So IOAT devices are defined to use
> > +  virtual address instead of physical address.
> > +
> > +  With IOMMU enabled, to use IOAT devices:
> > +  1. IOAT devices must be binded to vfio-pci, rather than igb_uio.
> > +  2. DPDK must use ``--iova-mode=va``.
> > +
> > +* Fallback
> > +
> > +  When the DMA copy fails, the user who implements the transfer_data
> > +  callback can fallback to SW copy or fallback to PA copy through
> > +  rte_mem_virt2iova().
> > diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> > index 8f0eba6412..c33fa784ff 100644
> > --- a/lib/vhost/vhost_user.c
> > +++ b/lib/vhost/vhost_user.c
> > @@ -45,6 +45,7 @@
> >  #include <rte_common.h>
> >  #include <rte_malloc.h>
> >  #include <rte_log.h>
> > +#include <rte_vfio.h>
> >
> >  #include "iotlb.h"
> >  #include "vhost.h"
> > @@ -141,6 +142,34 @@ get_blk_size(int fd)
> >  	return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
> >  }
> >
> > +static int
> > +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
> > +{
> > +	int ret = 0;
> > +	if (do_map) {
> > +		/* Add mapped region into the default container of DPDK. */
> > +		ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +						 region->host_user_addr,
> > +						 region->host_user_addr,
> > +						 region->size);
> > +		if (ret) {
> > +			VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
> > +			return ret;
> > +		}
> > +	} else {
> > +		/* Remove mapped region from the default container of DPDK. */
> > +		ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +						   region->host_user_addr,
> > +						   region->host_user_addr,
> > +						   region->size);
> > +		if (ret) {
> > +			VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
> > +			return ret;
> > +		}
> > +	}
> > +	return ret;
> > +}
> > +
> >  static void
> >  free_mem_region(struct virtio_net *dev)
> >  {
> > @@ -155,6 +184,9 @@ free_mem_region(struct virtio_net *dev)
> >  		if (reg->host_user_addr) {
> >  			munmap(reg->mmap_addr, reg->mmap_size);
> >  			close(reg->fd);
> > +
> > +			if (dev->async_copy)
> > +				async_dma_map(reg, false);
> >  		}
> >  	}
> >  }
> > @@ -866,87 +898,6 @@ vhost_user_set_vring_base(struct virtio_net **pdev,
> >  	return RTE_VHOST_MSG_RESULT_OK;
> >  }
> >
> > -static int
> > -add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
> > -		uint64_t host_phys_addr, uint64_t size)
> > -{
> > -	struct guest_page *page, *last_page;
> > -	struct guest_page *old_pages;
> > -
> > -	if (dev->nr_guest_pages == dev->max_guest_pages) {
> > -		dev->max_guest_pages *= 2;
> > -		old_pages = dev->guest_pages;
> > -		dev->guest_pages = rte_realloc(dev->guest_pages,
> > -					dev->max_guest_pages * sizeof(*page),
> > -					RTE_CACHE_LINE_SIZE);
> > -		if (dev->guest_pages == NULL) {
> > -			VHOST_LOG_CONFIG(ERR, "cannot realloc guest_pages\n");
> > -			rte_free(old_pages);
> > -			return -1;
> > -		}
> > -	}
> > -
> > -	if (dev->nr_guest_pages > 0) {
> > -		last_page = &dev->guest_pages[dev->nr_guest_pages - 1];
> > -		/* merge if the two pages are continuous */
> > -		if (host_phys_addr == last_page->host_phys_addr +
> > -				      last_page->size) {
> > -			last_page->size += size;
> > -			return 0;
> > -		}
> > -	}
> > -
> > -	page = &dev->guest_pages[dev->nr_guest_pages++];
> > -	page->guest_phys_addr = guest_phys_addr;
> > -	page->host_phys_addr = host_phys_addr;
> > -	page->size = size;
> > -
> > -	return 0;
> > -}
> > -
> > -static int
> > -add_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region *reg,
> > -		uint64_t page_size)
> > -{
> > -	uint64_t reg_size = reg->size;
> > -	uint64_t host_user_addr = reg->host_user_addr;
> > -	uint64_t guest_phys_addr = reg->guest_phys_addr;
> > -	uint64_t host_phys_addr;
> > -	uint64_t size;
> > -
> > -	host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)host_user_addr);
> > -	size = page_size - (guest_phys_addr & (page_size - 1));
> > -	size = RTE_MIN(size, reg_size);
> > -
> > -	if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size) < 0)
> > -		return -1;
> > -
> > -	host_user_addr += size;
> > -	guest_phys_addr += size;
> > -	reg_size -= size;
> > -
> > -	while (reg_size > 0) {
> > -		size = RTE_MIN(reg_size, page_size);
> > -		host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)
> > -						   host_user_addr);
> > -		if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr,
> > -				size) < 0)
> > -			return -1;
> > -
> > -		host_user_addr += size;
> > -		guest_phys_addr += size;
> > -		reg_size -= size;
> > -	}
> > -
> > -	/* sort guest page array if over binary search threshold */
> > -	if (dev->nr_guest_pages >= VHOST_BINARY_SEARCH_THRESH) {
> > -		qsort((void *)dev->guest_pages, dev->nr_guest_pages,
> > -			sizeof(struct guest_page), guest_page_addrcmp);
> > -	}
> > -
> > -	return 0;
> > -}
> > -
> >  #ifdef RTE_LIBRTE_VHOST_DEBUG
> >  /* TODO: enable it only in debug mode? */
> >  static void
> > @@ -1105,6 +1056,7 @@ vhost_user_mmap_region(struct virtio_net *dev,
> >  	uint64_t mmap_size;
> >  	uint64_t alignment;
> >  	int populate;
> > +	int ret;
> >
> >  	/* Check for memory_size + mmap_offset overflow */
> >  	if (mmap_offset >= -region->size) {
> > @@ -1158,12 +1110,13 @@ vhost_user_mmap_region(struct virtio_net *dev,
> >  	region->mmap_size = mmap_size;
> >  	region->host_user_addr = (uint64_t)(uintptr_t)mmap_addr + mmap_offset;
> >
> > -	if (dev->async_copy)
> > -		if (add_guest_pages(dev, region, alignment) < 0) {
> > -			VHOST_LOG_CONFIG(ERR,
> > -				"adding guest pages to region failed.\n");
> > +	if (dev->async_copy) {
> > +		ret = async_dma_map(region, true);
> > +		if (ret) {
> > +			VHOST_LOG_CONFIG(ERR, "Configure IOMMU for DMA engine failed\n");
> >  			return -1;
>
> Maybe we're too late in the init already, but I would think we may want
> to fallback to SW implementation insea
>
> > -		}
> > +		}
> > +	}
> >
> >  	VHOST_LOG_CONFIG(INFO,
> >  		"guest memory region size: 0x%" PRIx64 "\n"
> > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> > index 8da8a86a10..88110d2cb3 100644
> > --- a/lib/vhost/virtio_net.c
> > +++ b/lib/vhost/virtio_net.c
> > @@ -980,11 +980,9 @@ async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> >  	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
> >  	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
> >  	int error = 0;
> > -	uint64_t mapped_len;
> >
> >  	uint32_t tlen = 0;
> >  	int tvec_idx = 0;
> > -	void *hpa;
> >
> >  	if (unlikely(m == NULL)) {
> >  		error = -1;
> > @@ -1074,27 +1072,19 @@ async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> >
> >  		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
> >
> > -		while (unlikely(cpy_len && cpy_len >= cpy_threshold)) {
> > -			hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev,
> > -					buf_iova + buf_offset,
> > -					cpy_len, &mapped_len);
> > -
> > -			if (unlikely(!hpa || mapped_len < cpy_threshold))
> > -				break;
> > -
> > +		if (unlikely(cpy_len >= cpy_threshold)) {
> >  			async_fill_vec(src_iovec + tvec_idx,
> > -				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
> > -				mbuf_offset), (size_t)mapped_len);
> > +				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset), (size_t)cpy_len);
> >
> >  			async_fill_vec(dst_iovec + tvec_idx,
> > -				hpa, (size_t)mapped_len);
> > -
> > -			tlen += (uint32_t)mapped_len;
> > -			cpy_len -= (uint32_t)mapped_len;
> > -			mbuf_avail -= (uint32_t)mapped_len;
> > -			mbuf_offset += (uint32_t)mapped_len;
> > -			buf_avail -= (uint32_t)mapped_len;
> > -			buf_offset += (uint32_t)mapped_len;
> > +				(void *)((uintptr_t)(buf_addr + buf_offset)), (size_t)cpy_len);
> > +
> > +			tlen += cpy_len;
> > +			mbuf_avail -= cpy_len;
> > +			mbuf_offset += cpy_len;
> > +			buf_avail -= cpy_len;
> > +			buf_offset += cpy_len;
> > +			cpy_len = 0;
> >  			tvec_idx++;
> >  		}
> >
> >
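To make the fallback idea mentioned above a bit more concrete, here is a
rough, untested sketch of the per-segment logic a transfer_data implementer
could follow. dma_enqueue_va() is only a hypothetical placeholder for the
engine-specific enqueue; rte_eal_iova_mode(), rte_mem_virt2iova() and
memcpy() are the only existing APIs used:

/*
 * Rough sketch only, not part of the patch. dma_enqueue_va() is a
 * hypothetical stand-in for the engine-specific enqueue that consumes
 * virtual addresses.
 */
#include <stdbool.h>
#include <string.h>

#include <rte_eal.h>     /* rte_eal_iova_mode() */
#include <rte_memory.h>  /* rte_mem_virt2iova() */

int dma_enqueue_va(void *dst_va, void *src_va, size_t len); /* hypothetical */

static bool
va_mode_usable(void)
{
	/* The VA-based data path relies on IOVA being VA (e.g. vfio-pci). */
	return rte_eal_iova_mode() == RTE_IOVA_VA;
}

static int
copy_one_segment(void *dst_va, void *src_va, size_t len)
{
	/* Try the DMA engine first, feeding it virtual addresses. */
	if (va_mode_usable() && dma_enqueue_va(dst_va, src_va, len) == 0)
		return 0;

	/*
	 * SW fallback: the callback holds VAs, so a plain CPU copy always
	 * works. An engine that still needs physical addresses could
	 * instead translate with rte_mem_virt2iova(dst_va) / (src_va).
	 */
	memcpy(dst_va, src_va, len);
	return 0;
}

The point is that, because the callback is handed virtual addresses, both
the CPU-copy fallback and a VA-to-PA translation remain possible; if it were
handed HPAs, neither would be.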