> -----Original Message----- > From: Maxime Coquelin [mailto:maxime.coque...@redhat.com] > Sent: Thursday, October 24, 2019 2:50 PM > To: Liu, Yong <yong....@intel.com>; Bie, Tiwei <tiwei....@intel.com>; Wang, > Zhihong <zhihong.w...@intel.com>; step...@networkplumber.org; > gavin...@arm.com > Cc: dev@dpdk.org > Subject: Re: [PATCH v8 00/13] vhost packed ring performance optimization > > I get some checkpatch warnings, and build fails with clang. > Could you please fix these issues and send v9? >
Hi Maxime, Clang build fails will be fixed in v9. For checkpatch warning, it was due to pragma string inside. Previous version can avoid such warning, while format is a little messy as below. I prefer to keep code clean and more readable. How about your idea? +#ifdef UNROLL_PRAGMA_PARAM +#define VHOST_UNROLL_PRAGMA(param) _Pragma(param) +#else +#define VHOST_UNROLL_PRAGMA(param) do {} while (0); +#endif + VHOST_UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM) + for (i = 0; i < PACKED_BATCH_SIZE; i++) Regards, Marvin > Thanks, > Maxime > > ### [PATCH] vhost: try to unroll for each loop > > WARNING:CAMELCASE: Avoid CamelCase: <_Pragma> > #78: FILE: lib/librte_vhost/vhost.h:47: > +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll > 4") \ > > ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in > parenthesis > #78: FILE: lib/librte_vhost/vhost.h:47: > +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll > 4") \ > + for (iter = val; iter < size; iter++) > > ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in > parenthesis > #83: FILE: lib/librte_vhost/vhost.h:52: > +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ > + for (iter = val; iter < size; iter++) > > ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in > parenthesis > #88: FILE: lib/librte_vhost/vhost.h:57: > +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ > + for (iter = val; iter < size; iter++) > > total: 3 errors, 1 warnings, 67 lines checked > > 0/1 valid patch/tmp/dpdk_build/lib/librte_vhost/virtio_net.c:2065:1: > error: unused function 'free_zmbuf' [-Werror,-Wunused-function] > free_zmbuf(struct vhost_virtqueue *vq) > ^ > 1 error generated. > make[5]: *** [virtio_net.o] Error 1 > make[4]: *** [librte_vhost] Error 2 > make[4]: *** Waiting for unfinished jobs.... > make[3]: *** [lib] Error 2 > make[2]: *** [all] Error 2 > make[1]: *** [pre_install] Error 2 > make: *** [install] Error 2 > > > On 10/22/19 12:08 AM, Marvin Liu wrote: > > Packed ring has more compact ring format and thus can significantly > > reduce the number of cache miss. It can lead to better performance. > > This has been approved in virtio user driver, on normal E5 Xeon cpu > > single core performance can raise 12%. > > > > http://mails.dpdk.org/archives/dev/2018-April/095470.html > > > > However vhost performance with packed ring performance was decreased. > > Through analysis, mostly extra cost was from the calculating of each > > descriptor flag which depended on ring wrap counter. Moreover, both > > frontend and backend need to write same descriptors which will cause > > cache contention. Especially when doing vhost enqueue function, virtio > > refill packed ring function may write same cache line when vhost doing > > enqueue function. This kind of extra cache cost will reduce the benefit > > of reducing cache misses. > > > > For optimizing vhost packed ring performance, vhost enqueue and dequeue > > function will be splitted into fast and normal path. > > > > Several methods will be taken in fast path: > > Handle descriptors in one cache line by batch. > > Split loop function into more pieces and unroll them. > > Prerequisite check that whether I/O space can copy directly into mbuf > > space and vice versa. > > Prerequisite check that whether descriptor mapping is successful. > > Distinguish vhost used ring update function by enqueue and dequeue > > function. > > Buffer dequeue used descriptors as many as possible. > > Update enqueue used descriptors by cache line. > > > > After all these methods done, single core vhost PvP performance with 64B > > packet on Xeon 8180 can boost 35%. > > > > v8: > > - Allocate mbuf by virtio_dev_pktmbuf_alloc > > > > v7: > > - Rebase code > > - Rename unroll macro and definitions > > - Calculate flags when doing single dequeue > > > > v6: > > - Fix dequeue zcopy result check > > > > v5: > > - Remove disable sw prefetch as performance impact is small > > - Change unroll pragma macro format > > - Rename shadow counter elements names > > - Clean dequeue update check condition > > - Add inline functions replace of duplicated code > > - Unify code style > > > > v4: > > - Support meson build > > - Remove memory region cache for no clear performance gain and ABI break > > - Not assume ring size is power of two > > > > v3: > > - Check available index overflow > > - Remove dequeue remained descs number check > > - Remove changes in split ring datapath > > - Call memory write barriers once when updating used flags > > - Rename some functions and macros > > - Code style optimization > > > > v2: > > - Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc > > - Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST) > > - Optimize dequeue used ring update when in_order negotiated > > > > > > Marvin Liu (13): > > vhost: add packed ring indexes increasing function > > vhost: add packed ring single enqueue > > vhost: try to unroll for each loop > > vhost: add packed ring batch enqueue > > vhost: add packed ring single dequeue > > vhost: add packed ring batch dequeue > > vhost: flush enqueue updates by cacheline > > vhost: flush batched enqueue descs directly > > vhost: buffer packed ring dequeue updates > > vhost: optimize packed ring enqueue > > vhost: add packed ring zcopy batch and single dequeue > > vhost: optimize packed ring dequeue > > vhost: optimize packed ring dequeue when in-order > > > > lib/librte_vhost/Makefile | 18 + > > lib/librte_vhost/meson.build | 7 + > > lib/librte_vhost/vhost.h | 57 ++ > > lib/librte_vhost/virtio_net.c | 948 +++++++++++++++++++++++++++------- > > 4 files changed, 837 insertions(+), 193 deletions(-) > >