> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Thursday, October 13, 2016 1:33 PM
> To: Wang, Zhihong <zhihong.wang at intel.com>
> Cc: Jianbo Liu <jianbo.liu at linaro.org>; Thomas Monjalon
> <thomas.monjalon at 6wind.com>; Maxime Coquelin
> <maxime.coquelin at redhat.com>; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> On Wed, Oct 12, 2016 at 12:22:08PM +0000, Wang, Zhihong wrote:
> > > > >> > 3. What percentage of drop are you seeing?
> > > >
> > > > The testing result:
> > > >
> > > >   size (bytes)   improvement (%)
> > > >   64              3.92
> > > >   128            11.51
> > > >   256            24.16
> > > >   512           -13.79
> > > >   1024          -22.51
> > > >   1500          -12.22
> > > >
> > > > A correction is that performance drops if the byte size is larger
> > > > than 512.
> > >
> > > I have thought about this twice. Unfortunately, I think I may need to
> > > NACK this series.
> > >
> > > Merging the two code paths into one is really good: as you stated, it
> > > improves the maintainability. But only if we see no performance
> > > regression on either path after the refactoring. Unfortunately, that's
> > > not the case here: it hurts the performance of one code path
> > > (non-mrg Rx).
> > >
> > > That makes me think we maybe should not do the code path merge at all.
> > > I think that also aligns with what you said before (internally): we
> > > could do the merge if it gives comparable performance before and after
> > > the change.
> > >
> > > Besides that, I don't quite like what you did in patch 2 (rewrite
> > > enqueue): you made a lot of changes in one patch. That means if
> > > something goes wrong, it is hard to narrow down which change
> > > introduced the regression. Sadly, that's exactly what we met here.
> > > Weeks have passed, and I see no progress.
> > >
> > > That's the reason we like the idea of "one patch only does one thing,
> > > an atomic thing".
> >
> > Yuanhan, folks,
> >
> > Thanks for the analysis. I disagree here though.
> >
> > I analyze, develop, and benchmark on x86 platforms, where this patch
> > works great.
>
> Yes, that's great effort! With your hard work, we know what the
> bottleneck is and how it can be improved.
>
> However, you don't have to do the code refactoring (merging the two code
> paths into one) to apply those improvements. From what I know, there are
> two factors in this patchset that could improve the performance:
>
> - copy the hdr together with the packet data
>
> - shadow the used ring and update it at once
>
> The overall performance boost I got with your v6 patchset on the mrg-Rx
> code path is about 27% (in the PVP case). With just the 1st optimization
> applied, it yields about 20%. The rest could be covered if we apply the
> 2nd optimization (I guess).
>
> That would be a clean way to optimize the vhost mergeable Rx path:
>
> - you don't touch the non-mrg Rx path (well, you could perhaps apply the
>   shadow used ring trick to it as well)
>
>   This would at least make sure we have no performance regression like
>   the one reported by the ARM guys.
>
> - you don't refactor the code
>
>   A rewrite from scratch could introduce other issues besides the
>   performance regression. We may just not know about them yet.
>
> Make sense to you? If you agree, I think we could still make it in this
> release: they would be small changes after all. For example, below is
> the patch that applies the 1st optimization tip on top of
> dpdk-next-virtio.
Thanks for this great idea. I think it's a better way to do it.
I'll start to make the patch then.

>
> 	--yliu
>
> ---------------------------------------------------------------
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 8a151af..0ddb5af 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -379,7 +379,7 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  			    uint16_t end_idx, struct rte_mbuf *m,
>  			    struct buf_vector *buf_vec)
>  {
> -	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
> +	struct virtio_net_hdr_mrg_rxbuf *virtio_hdr;
>  	uint32_t vec_idx = 0;
>  	uint16_t start_idx = vq->last_used_idx;
>  	uint16_t cur_idx = start_idx;
> @@ -388,6 +388,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	uint32_t desc_offset, desc_avail;
>  	uint32_t cpy_len;
>  	uint16_t desc_idx, used_idx;
> +	uint16_t num_buffers = end_idx - start_idx;
> +	int hdr_copied = 0;
>  
>  	if (unlikely(m == NULL))
>  		return 0;
> @@ -399,16 +401,11 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	if (buf_vec[vec_idx].buf_len < dev->vhost_hlen || !desc_addr)
>  		return 0;
>  
> -	rte_prefetch0((void *)(uintptr_t)desc_addr);
> +	virtio_hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc_addr;
> +	rte_prefetch0(virtio_hdr);
>  
> -	virtio_hdr.num_buffers = end_idx - start_idx;
>  	LOG_DEBUG(VHOST_DATA, "(%d) RX: num merge buffers %d\n",
> -		dev->vid, virtio_hdr.num_buffers);
> -
> -	virtio_enqueue_offload(m, &virtio_hdr.hdr);
> -	copy_virtio_net_hdr(dev, desc_addr, virtio_hdr);
> -	vhost_log_write(dev, buf_vec[vec_idx].buf_addr, dev->vhost_hlen);
> -	PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
> +		dev->vid, num_buffers);
>  
>  	desc_avail = buf_vec[vec_idx].buf_len - dev->vhost_hlen;
>  	desc_offset = dev->vhost_hlen;
> @@ -450,6 +447,15 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  			mbuf_avail = rte_pktmbuf_data_len(m);
>  		}
>  
> +		if (hdr_copied == 0) {
> +			virtio_hdr->num_buffers = num_buffers;
> +			virtio_enqueue_offload(m, &virtio_hdr->hdr);
> +			vhost_log_write(dev, buf_vec[vec_idx].buf_addr,
> +					dev->vhost_hlen);
> +			PRINT_PACKET(dev, (uintptr_t)desc_addr,
> +				     dev->vhost_hlen, 0);
> +
> +			hdr_copied = 1;
> +		}
> +
>  		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
>  		rte_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)),
>  			rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
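
To illustrate the 2nd optimization (shadow used ring) mentioned above,
here is a rough sketch of the idea: stage used-ring updates in a private
buffer and copy them to the shared ring in one pass, publishing the used
index only once per burst. The structure and helper names below
(vhost_virtqueue_sketch, update_shadow_used_ring, flush_shadow_used_ring)
are made up for illustration, not the actual DPDK code:

#include <stdint.h>

struct vring_used_elem_sketch {
	uint32_t id;	/* head of the descriptor chain */
	uint32_t len;	/* number of bytes written */
};

struct vhost_virtqueue_sketch {
	struct vring_used_elem_sketch *used_ring; /* shared with the guest */
	volatile uint16_t *used_idx;              /* shared used index */
	uint16_t size;                            /* ring size, power of 2 */

	/* private shadow buffer, never seen by the guest */
	struct vring_used_elem_sketch shadow[256];
	uint16_t shadow_count;
	uint16_t last_used_idx;
};

/* Stage one used-ring update locally; no shared-memory write yet. */
static inline void
update_shadow_used_ring(struct vhost_virtqueue_sketch *vq,
			uint16_t desc_idx, uint32_t len)
{
	vq->shadow[vq->shadow_count].id  = desc_idx;
	vq->shadow[vq->shadow_count].len = len;
	vq->shadow_count++;
}

/* Copy all staged entries to the shared ring in one pass, then
 * publish them with a single update of the used index. */
static inline void
flush_shadow_used_ring(struct vhost_virtqueue_sketch *vq)
{
	uint16_t used = vq->last_used_idx & (vq->size - 1);
	uint16_t i;

	for (i = 0; i < vq->shadow_count; i++) {
		vq->used_ring[used] = vq->shadow[i];
		used = (used + 1) & (vq->size - 1);
	}

	/* make the entries visible before bumping the index */
	__sync_synchronize();
	*vq->used_idx = (uint16_t)(vq->last_used_idx + vq->shadow_count);

	vq->last_used_idx += vq->shadow_count;
	vq->shadow_count = 0;
}

This way the hot loop pays only a local store per packet, and the
expensive writes to guest-shared cache lines (plus the memory barrier)
happen once per burst instead of once per packet.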