> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Thursday, October 17, 2019 3:31 PM
> To: Liu, Yong <yong....@intel.com>; Bie, Tiwei <tiwei....@intel.com>; Wang,
> Zhihong <zhihong.w...@intel.com>; step...@networkplumber.org;
> gavin...@arm.com
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v6 00/13] vhost packed ring performance optimization
>
> Hi Marvin,
>
> This is almost good, just fix the small comments I made.
>
> Also, please rebase on top of next-virtio branch, because I applied
> below patch from Flavio that you need to take into account:
>
> http://patches.dpdk.org/patch/61284/

Thanks, Maxime. I will start the rebasing work.

>
> Regards,
> Maxime
>
> On 10/15/19 6:07 PM, Marvin Liu wrote:
> > Packed ring has a more compact ring format and thus can significantly
> > reduce the number of cache misses, which can lead to better performance.
> > This has been proven in the virtio user driver: on a normal E5 Xeon CPU,
> > single-core performance can rise by 12%.
> >
> > http://mails.dpdk.org/archives/dev/2018-April/095470.html
> >
> > However, vhost performance with packed ring decreased. Through
> > analysis, most of the extra cost came from calculating each
> > descriptor flag, which depends on the ring wrap counter. Moreover, both
> > frontend and backend need to write the same descriptors, which causes
> > cache contention. In particular, while vhost is running its enqueue
> > function, the virtio packed ring refill function may write the same
> > cache line. This extra cache cost reduces the benefit of having fewer
> > cache misses.
> >
> > To optimize vhost packed ring performance, the vhost enqueue and dequeue
> > functions will be split into a fast path and a normal path.
> >
> > Several methods will be used in the fast path:
> >   Handle descriptors in one cache line by batch.
> >   Split loop function into more pieces and unroll them.
> >   Check in advance whether I/O space can be copied directly into mbuf
> >     space and vice versa.
> >   Check in advance whether descriptor mapping is successful.
> >   Distinguish vhost used ring update function by enqueue and dequeue
> >     function.
> >   Buffer as many dequeue used descriptors as possible.
> >   Update enqueue used descriptors by cache line.
> >
> > With all these methods applied, single-core vhost PvP performance with
> > 64B packets on Xeon 8180 can improve by 35%.
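
For readers following the thread, the wrap-counter-dependent flag check and the "one cache line by batch" precondition described above can be sketched roughly as follows. This is only an illustration, not the code from the series: every sketch_* name is hypothetical, the flag bit positions are the AVAIL (bit 7) and USED (bit 15) bits from the virtio 1.1 packed ring layout, and a 64-byte cache line holding four 16-byte descriptors is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits of a packed ring descriptor (virtio 1.1 layout). */
#define SKETCH_DESC_F_AVAIL (1u << 7)
#define SKETCH_DESC_F_USED  (1u << 15)

/* Assumption: 64-byte cache line. */
#define SKETCH_CACHE_LINE 64u

/* Hypothetical stand-in for a 16-byte packed ring descriptor. */
struct sketch_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

/* Descriptors per cache line: 64 / 16 = 4. */
#define SKETCH_BATCH (SKETCH_CACHE_LINE / sizeof(struct sketch_desc))

/*
 * A descriptor is available to the device when its AVAIL bit equals the
 * current wrap counter and its USED bit differs from it.
 */
static bool
sketch_desc_is_avail(const struct sketch_desc *d, bool wrap)
{
	bool avail = !!(d->flags & SKETCH_DESC_F_AVAIL);
	bool used = !!(d->flags & SKETCH_DESC_F_USED);

	return avail == wrap && used != wrap;
}

/*
 * Fast-path precondition: the batch starts on a cache-line boundary,
 * does not run past the end of the ring, and all of its descriptors
 * are available. Otherwise the caller would use the single-descriptor
 * (normal) path.
 */
static bool
sketch_batch_ready(const struct sketch_desc *ring, uint16_t ring_size,
		   uint16_t idx, bool wrap)
{
	size_t i;

	assert(ring_size > 0);

	if (idx % SKETCH_BATCH != 0)
		return false;
	if (idx + SKETCH_BATCH > ring_size)
		return false;

	for (i = 0; i < SKETCH_BATCH; i++) {
		if (!sketch_desc_is_avail(&ring[idx + i], wrap))
			return false;
	}
	return true;
}
```

Checking the whole batch up front is what lets a fast path skip per-descriptor branching afterwards; anything that fails the check simply falls back to handling one descriptor at a time.
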
> >
> > v6:
> > - Fix dequeue zcopy result check
> >
> > v5:
> > - Remove disable sw prefetch as performance impact is small
> > - Change unroll pragma macro format
> > - Rename shadow counter elements names
> > - Clean dequeue update check condition
> > - Add inline functions replace of duplicated code
> > - Unify code style
> >
> > v4:
> > - Support meson build
> > - Remove memory region cache for no clear performance gain and ABI break
> > - Not assume ring size is power of two
> >
> > v3:
> > - Check available index overflow
> > - Remove dequeue remained descs number check
> > - Remove changes in split ring datapath
> > - Call memory write barriers once when updating used flags
> > - Rename some functions and macros
> > - Code style optimization
> >
> > v2:
> > - Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
> > - Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
> > - Optimize dequeue used ring update when in_order negotiated
> >
> >
> > Marvin Liu (13):
> >   vhost: add packed ring indexes increasing function
> >   vhost: add packed ring single enqueue
> >   vhost: try to unroll for each loop
> >   vhost: add packed ring batch enqueue
> >   vhost: add packed ring single dequeue
> >   vhost: add packed ring batch dequeue
> >   vhost: flush enqueue updates by batch
> >   vhost: flush batched enqueue descs directly
> >   vhost: buffer packed ring dequeue updates
> >   vhost: optimize packed ring enqueue
> >   vhost: add packed ring zcopy batch and single dequeue
> >   vhost: optimize packed ring dequeue
> >   vhost: optimize packed ring dequeue when in-order
> >
> >  lib/librte_vhost/Makefile     |  18 +
> >  lib/librte_vhost/meson.build  |   7 +
> >  lib/librte_vhost/vhost.h      |  57 +++
> >  lib/librte_vhost/virtio_net.c | 924 +++++++++++++++++++++++++++-------
> >  4 files changed, 812 insertions(+), 194 deletions(-)
> >
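
As a rough illustration of the "indexes increasing function" from the first patch title together with the v4 note about not assuming a power-of-two ring size, an index advance that compares and subtracts instead of masking might look like the sketch below. The struct and function names are hypothetical and are not taken from the series; the sketch also assumes the ring size does not exceed 32768, so the 16-bit addition cannot overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of the per-queue state needed here. */
struct sketch_vq {
	uint16_t size;              /* ring size, not necessarily 2^n */
	uint16_t last_avail_idx;    /* next descriptor to look at */
	bool avail_wrap_counter;    /* flips each time the index wraps */
};

/*
 * Advance the available index by num descriptors. A power-of-two ring
 * could use "idx & (size - 1)", but that is wrong for other sizes, so
 * the wrap is detected with a comparison and undone with a subtraction.
 */
static void
sketch_inc_last_avail(struct sketch_vq *vq, uint16_t num)
{
	assert(vq->size > 0 && vq->size <= 32768);
	assert(num <= vq->size);

	vq->last_avail_idx += num;
	if (vq->last_avail_idx >= vq->size) {
		vq->avail_wrap_counter = !vq->avail_wrap_counter;
		vq->last_avail_idx -= vq->size;
	}
}
```

For example, with a ring of 6 descriptors and the index at 4, advancing by 4 leaves the index at 2 and flips the wrap counter once.
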