> -----Original Message-----
> From: Hu, Jiayu <jiayu...@intel.com>
> Sent: Tuesday, March 17, 2020 5:31 PM
> To: Liu, Yong <yong....@intel.com>; dev@dpdk.org
> Cc: maxime.coque...@redhat.com; Ye, Xiaolong <xiaolong...@intel.com>;
> Wang, Zhihong <zhihong.w...@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> accelerate Tx operations
> 
> Hi Marvin,
> 
> Thanks for the comments. Replies are inline.
> 
> > -----Original Message-----
> > From: Liu, Yong <yong....@intel.com>
> > Sent: Tuesday, March 17, 2020 3:21 PM
> > To: Hu, Jiayu <jiayu...@intel.com>; dev@dpdk.org
> > Cc: maxime.coque...@redhat.com; Ye, Xiaolong <xiaolong...@intel.com>;
> > Wang, Zhihong <zhihong.w...@intel.com>; Hu, Jiayu
> <jiayu...@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > accelerate Tx operations
> >
> > Hi Jiayu,
> > Some comments are inline.
> >
> > Thanks,
> > Marvin
> >
> > > -----Original Message-----
> > > From: dev <dev-boun...@dpdk.org> On Behalf Of Jiayu Hu
> > > Sent: Tuesday, March 17, 2020 5:21 PM
> > > To: dev@dpdk.org
> > > Cc: maxime.coque...@redhat.com; Ye, Xiaolong
> <xiaolong...@intel.com>;
> > > Wang, Zhihong <zhihong.w...@intel.com>; Hu, Jiayu
> <jiayu...@intel.com>
> > > Subject: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > > accelerate Tx operations
> > >
> > >
> > >  int vhost_logtype;
> > > @@ -30,8 +34,12 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> > >  #define ETH_VHOST_IOMMU_SUPPORT          "iommu-support"
> > >  #define ETH_VHOST_POSTCOPY_SUPPORT       "postcopy-support"
> > >  #define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
> > > +#define ETH_VHOST_DMA_ARG                "dmas"
> > >  #define VHOST_MAX_PKT_BURST 32
> > >
> > > +/* ring size of I/OAT */
> > > +#define IOAT_RING_SIZE 1024
> > > +
> >
> > Jiayu,
> > The configured I/OAT ring size is 1024 here, but I do not see an in_flight
> > or nr_batching size check in the enqueue function.
> > Is there any possibility that the IOAT ring gets exhausted?
> 
> We will wait for the IOAT's copy completions when its ring is full.
> This is to guarantee that every enqueue to the IOAT can succeed.
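
A minimal sketch of that wait-on-full behaviour, mirroring the retry loop
used further down in this patch for the used-index copy (src_hpa, dst_hpa,
len and hdl are illustrative names, not the patch's):

	/* keep retrying the enqueue; reap completions while the ring is full */
	while (rte_ioat_enqueue_copy(dma_vr->dev_id, src_hpa, dst_hpa, len,
				     hdl, 0, 0) == 0) {
		while (free_dma_done(dev, dma_vr) <= 0)
			;
	}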
> 
> > > +struct dma_info_input {
> > > + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2];
> > > + uint16_t nr;
> > > +};
> > > +
> > > +static inline int
> > > +open_dma(const char *key __rte_unused, const char *value, void
> > > *extra_args)
> > > +{
> > > + struct dma_info_input *dma_info = extra_args;
> > > + char *input = strndup(value, strlen(value) + 1);
> > > + char *addrs = input;
> > > + char *ptrs[2];
> > > + char *start, *end, *substr;
> > > + int64_t qid, vring_id;
> > > + struct rte_ioat_rawdev_config config;
> > > + struct rte_rawdev_info info = { .dev_private = &config };
> > > + char name[32];
> > > + int dev_id;
> > > + int ret = 0;
> > > +
> > > + while (isblank(*addrs))
> > > +         addrs++;
> > > + if (addrs == '\0') {
> > > +         VHOST_LOG(ERR, "No input DMA addresses\n");
> > > +         ret = -1;
> > > +         goto out;
> > > + }
> > > +
> > > + /* process DMA devices within bracket. */
> > > + addrs++;
> > > + substr = strtok(addrs, ";]");
> > > + if (!substr) {
> > > +         VHOST_LOG(ERR, "No input DMA addresse\n");
> > > +         ret = -1;
> > > +         goto out;
> > > + }
> > > +
> > > + do {
> > > +         rte_strsplit(substr, strlen(substr), ptrs, 2, '@');
> > > +
> > Function rte_strsplit can fail. Need to check its return value.
> 
> Thanks. Will check it later.
> 
> >
> > > +         start = strstr(ptrs[0], "txq");
> > > +         if (start == NULL) {
> > > +                 VHOST_LOG(ERR, "Illegal queue\n");
> > > +                 ret = -1;
> > > +                 goto out;
> > > +         }
> > > +
> > > +         start += 3;
> >
> > It's better not to use a hardcoded value.
> >
> > > +         qid = strtol(start, &end, 0);
> > > +         if (end == start) {
> > > +                 VHOST_LOG(ERR, "No input queue ID\n");
> > > +                 ret = -1;
> > > +                 goto out;
> > > +         }
> > > +
> > > +         vring_id = qid * 2 + VIRTIO_RXQ;
> > > +         if (rte_pci_addr_parse(ptrs[1],
> > > +                                &dma_info->dmas[vring_id].addr) < 0) {
> > > +                 VHOST_LOG(ERR, "Invalid DMA address %s\n",
> > > ptrs[1]);
> > > +                 ret = -1;
> > > +                 goto out;
> > > +         }
> > > +
> > > +         rte_pci_device_name(&dma_info->dmas[vring_id].addr,
> > > +                             name, sizeof(name));
> > > +         dev_id = rte_rawdev_get_dev_id(name);
> > > +         if (dev_id == (uint16_t)(-ENODEV) ||
> > > +             dev_id == (uint16_t)(-EINVAL)) {
> > > +                 VHOST_LOG(ERR, "Cannot find device %s.\n", name);
> > > +                 ret = -1;
> > > +                 goto out;
> > > +         }
> > > +
> > Multiple queues can't share one IOAT device. A check should be added here,
> > as that is not allowed.
> 
> For now I only claim it in the doc. Will add the check later.
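
A minimal sketch of such a check, placed right after the rawdev lookup above;
the is_valid flag is an assumption for illustration, not a field the posted
struct dma_info is shown to have:

	uint16_t i;

	/* an IOAT instance must not be bound to more than one virtqueue */
	for (i = 0; i < RTE_MAX_QUEUES_PER_PORT * 2; i++) {
		if (i == vring_id || !dma_info->dmas[i].is_valid)
			continue;
		if (rte_pci_addr_cmp(&dma_info->dmas[i].addr,
				     &dma_info->dmas[vring_id].addr) == 0) {
			VHOST_LOG(ERR, "Device %s is already used by another queue\n",
				  name);
			ret = -1;
			goto out;
		}
	}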
> 
> > > +
> > > +/* notify front-end of enqueued packets */
> > > +static __rte_always_inline void
> > > +vhost_dma_vring_call(struct pmd_internal *dev, struct dma_vring
> > > *dma_vr)
> > > +{
> > > + vhost_vring_call_split(dev, dma_vr);
> > > +}
> > > +
> > > +int
> > > +free_dma_done(void *dev, void *dma_vr)
> > > +{
> > > + uintptr_t flags[255], tmps[255];
> >
> > Please add a meaningful macro for 255; not sure why the limitation is 255,
> > not 256.
> 
> The second parameter of rte_ioat_completed_copies() is a uint8_t, so the
> max value can only be 255. I can replace it with UINT8_MAX later.
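
For instance, a sketch of that rename (the macro name is made up here):

	/* rte_ioat_completed_copies() takes a uint8_t max-completions count */
	#define VHOST_MAX_DMA_COMPLETIONS UINT8_MAX

	uintptr_t flags[VHOST_MAX_DMA_COMPLETIONS], tmps[VHOST_MAX_DMA_COMPLETIONS];
	...
	dma_done = rte_ioat_completed_copies(dma_vring->dev_id,
					     VHOST_MAX_DMA_COMPLETIONS,
					     flags, tmps);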
> 
> >
> > > + int dma_done, i;
> > > + uint16_t used_idx;
> > > + struct pmd_internal *device = dev;
> > > + struct dma_vring *dma_vring = dma_vr;
> > > +
> > > + dma_done = rte_ioat_completed_copies(dma_vring->dev_id, 255,
> > > flags,
> > > +                                      tmps);
> > > + if (unlikely(dma_done <= 0))
> > > +         return dma_done;
> > > +
> > > + dma_vring->nr_inflight -= dma_done;
> >
> > Not sure whether the DMA engine will return completions in the same order
> > as the input sequence; the mbuf free should happen after the index update
> > is done.
> 
> IMO, a pktmbuf can be freed once the IOAT no longer occupies it.
> We don't need to wait for the update of the used index.
> 
> This is achieved by using the mbuf's refcnt. We increase the mbuf's refcnt
> by 1 once we submit its copy job to the IOAT. If it has N copies that are
> all offloaded to the IOAT, the refcnt is increased by N. On completion of
> an IOAT copy, if it's a pktmbuf copy, we decrease its refcnt via
> rte_pktmbuf_free_seg(). When the refcnt reaches 1, which means all its
> copies are completed, the mbuf will be freed.
> 

Thanks for the reply.
My concern is whether the IOAT can preserve completion order relative to the
input sequence. If the hardware can guarantee that, no more questions here.
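
For reference, a minimal sketch of the refcnt scheme described above; the
helper name, the dst_hpa argument and the error handling are illustrative
assumptions, and the rte_ioat_enqueue_copy() argument order simply mirrors
the usage shown in the patch:

	#include <rte_mbuf.h>
	#include <rte_ioat_rawdev.h>

	static int
	submit_pkt_copy(int dev_id, struct rte_mbuf *pkt, uintptr_t dst_hpa)
	{
		/* hold the segment until the IOAT reports this copy done */
		rte_mbuf_refcnt_update(pkt, 1);

		if (rte_ioat_enqueue_copy(dev_id, rte_pktmbuf_iova(pkt), dst_hpa,
					  rte_pktmbuf_data_len(pkt),
					  (uintptr_t)pkt /* completion handle */,
					  0, 0) == 0) {
			/* ring full: drop the extra reference again */
			rte_pktmbuf_free_seg(pkt);
			return -1;
		}
		return 0;
	}

	/*
	 * On completion, free_dma_done() calls rte_pktmbuf_free_seg(pkt), which
	 * only drops one reference; the segment is really released once every
	 * offloaded copy of it has finished and the sender's own free has run.
	 */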

> >
> > > + for (i = 0; i < dma_done; i++) {
> > > +         if ((uint64_t)flags[i] >= dma_vring->max_indices) {
> > > +                 struct rte_mbuf *pkt = (struct rte_mbuf *)flags[i];
> > > +
> > > +                 /**
> > > +                  * the DMA completes a packet copy job, we
> > > +                  * decrease the refcnt or free the mbuf segment.
> > > +                  */
> > > +                 rte_pktmbuf_free_seg(pkt);
> > > +         } else {
> > > +                 uint16_t id = flags[i];
> > > +
> > > +                 /**
> > > +                  * the DMA completes updating index of the
> > > +                  * used ring.
> > > +                  */
> > > +                 used_idx = dma_vring->indices[id].data;
> > > +                 VHOST_LOG(DEBUG, "The DMA finishes updating
> > > index %u "
> > > +                           "for the used ring.\n", used_idx);
> > > +
> > > +                 dma_vring->copy_done_used = used_idx;
> > > +                 vhost_dma_vring_call(device, dma_vring);
> > > +                 put_used_index(dma_vring->indices,
> > > +                                dma_vring->max_indices, id);
> > > +         }
> > > + }
> > > + return dma_done;
> > > +}
> > > +
> > > +static  __rte_always_inline bool
> > > +rxvq_is_mergeable(struct pmd_internal *dev)
> > > +{
> > > + return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
> > > +}
> > > +
> >
> > I'm not sure whether the shadow used ring can help in the DMA acceleration
> > scenario.
> > The vhost driver will wait until the DMA copy is done. An optimization of
> > the CPU move may not help overall performance and just adds convoluted code.
> 
> For small copies, we still use the CPU, as the IOAT is less efficient for
> small copies.
> Therefore, I think we still need the CPU optimization here.
> 

Thanks for the clarification, got it.
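
As a rough sketch of that hybrid policy (the 1024-byte threshold and the
helper name are made up for illustration; the actual cut-off is a tuning
decision of the patch):

	#include <rte_common.h>
	#include <rte_memcpy.h>
	#include <rte_ioat_rawdev.h>

	/* illustrative cut-off between CPU and IOAT copies */
	#define DMA_COPY_LENGTH_THRESHOLD 1024

	static int
	copy_to_guest(int dev_id, void *dst, rte_iova_t dst_iova,
		      const void *src, rte_iova_t src_iova,
		      uint32_t len, uintptr_t hdl)
	{
		if (len < DMA_COPY_LENGTH_THRESHOLD) {
			/* small copy: the CPU beats the per-descriptor IOAT cost */
			rte_memcpy(dst, src, len);
			return 1;
		}
		/* large copy: offload to the IOAT; 0 means its ring is full */
		return rte_ioat_enqueue_copy(dev_id, src_iova, dst_iova, len,
					     hdl, 0, 0);
	}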

> > > +static __rte_always_inline void
> > > +flush_shadow_used_ring_split(struct pmd_internal *dev,
> > > +                      struct dma_vring *dma_vr)
> > > +{
> > > + uint16_t used_idx = dma_vr->last_used_idx & (dma_vr->vr.size - 1);
> > > +
> > > + if (used_idx + dma_vr->shadow_used_idx <= dma_vr->vr.size) {
> > > +         do_flush_shadow_used_ring_split(dma_vr, used_idx, 0,
> > > +                                         dma_vr->shadow_used_idx);
> > > + } else {
> > > +         uint16_t size;
> > > +
> > > +         /* update used ring interval [used_idx, vr->size] */
> > > +         size = dma_vr->vr.size - used_idx;
> > > +         do_flush_shadow_used_ring_split(dma_vr, used_idx, 0,
> > size);
> > > +
> > > +         /* update the left half used ring interval [0, left_size] */
> > > +         do_flush_shadow_used_ring_split(dma_vr, 0, size,
> > > +                                         dma_vr->shadow_used_idx -
> > > +                                         size);
> > > + }
> > > + dma_vr->last_used_idx += dma_vr->shadow_used_idx;
> > > +
> > > + rte_smp_wmb();
> > > +
> > > + if (dma_vr->nr_inflight > 0) {
> > > +         struct ring_index *index;
> > > +
> > > +         index = get_empty_index(dma_vr->indices, dma_vr-
> > > >max_indices);
> > > +         index->data = dma_vr->last_used_idx;
> > > +         while (unlikely(rte_ioat_enqueue_copy(dma_vr->dev_id,
> > > +                                               index->pa,
> > > +                                               dma_vr->used_idx_hpa,
> > > +                                               sizeof(uint16_t),
> > > +                                               index->idx, 0, 0) ==
> > > +                         0)) {
> > > +                 int ret;
> > > +
> > > +                 do {
> > > +                         ret = dma_vr->dma_done_fn(dev, dma_vr);
> > > +                 } while (ret <= 0);
> > > +         }
> > > +         dma_vr->nr_batching++;
> > > +         dma_vr->nr_inflight++;
> > > + } else {
> > > +         /**
> > > +          * we update index of used ring when all previous copy
> > > +          * jobs are completed.
> > > +          *
> > > +          * When enabling DMA copy, if there are outstanding copy
> > > +          * jobs of the DMA, to avoid the DMA overwriting the
> > > +          * write of the CPU, the DMA is in charge of updating
> > > +          * the index of used ring.
> > > +          */
> >
> > According to the comments, this should be a DMA data move, but the
> > following code is a CPU data move. Is anything wrong here?
> 
> The update of the used index is done by the CPU if there are no in-flight
> IOAT copies; otherwise, it's done by the IOAT. The code in "else {}" is
> executed only when dma_vr->nr_inflight is 0, which means there are no
> in-flight IOAT copies, so the CPU is in charge of updating the used ring's
> index.
> 

So it looks like the comment can be separated into two parts; the second part
describes the behavior of the index update when there are in-flight copies,
as sketched below.
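
One way that split could read, purely as a wording sketch (one comment per
branch):

	/* In-flight IOAT copies exist: the IOAT is put in charge of updating
	 * the used ring's index, so the DMA write cannot be overwritten by a
	 * racing CPU write.
	 */
	...
	/* No in-flight IOAT copies: the CPU updates the used ring's index
	 * directly, as there is nothing its write could race with.
	 */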

> >
> > > +         *(volatile uint16_t *)&dma_vr->vr.used->idx +=
> > > +                 dma_vr->shadow_used_idx;
> > > +         dma_vr->copy_done_used += dma_vr->shadow_used_idx;
> > > + }
> > > +
> > > + dma_vr->shadow_used_idx = 0;
> > > +}
> > > +
> > > 2.7.4
