Re: [dpdk-dev] [PATCH v8 00/13] vhost packed ring performance optimization

Liu, Yong Thu, 24 Oct 2019 00:19:40 -0700


> -----Original Message-----
> From: Maxime Coquelin [mailto:[email protected]]
> Sent: Thursday, October 24, 2019 2:50 PM
> To: Liu, Yong <[email protected]>; Bie, Tiwei <[email protected]>; Wang,
> Zhihong <[email protected]>; [email protected];
> [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH v8 00/13] vhost packed ring performance optimization
> 
> I get some checkpatch warnings, and build fails with clang.
> Could you please fix these issues and send v9?
>



Hi Maxime,
Clang build fails will be fixed in v9. For checkpatch warning, it was due to 
pragma string inside.
Previous version can avoid such warning, while format is a little messy as 
below. 
I prefer to keep code clean and more readable. How about your idea?

+#ifdef UNROLL_PRAGMA_PARAM
+#define VHOST_UNROLL_PRAGMA(param) _Pragma(param)
+#else
+#define VHOST_UNROLL_PRAGMA(param) do {} while (0);
+#endif

+       VHOST_UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
+       for (i = 0; i < PACKED_BATCH_SIZE; i++)

Regards,
Marvin

> Thanks,
> Maxime
> 
> ### [PATCH] vhost: try to unroll for each loop
> 
> WARNING:CAMELCASE: Avoid CamelCase: <_Pragma>
> #78: FILE: lib/librte_vhost/vhost.h:47:
> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll
> 4") \
> 
> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
> parenthesis
> #78: FILE: lib/librte_vhost/vhost.h:47:
> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll
> 4") \
> +     for (iter = val; iter < size; iter++)
> 
> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
> parenthesis
> #83: FILE: lib/librte_vhost/vhost.h:52:
> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
> +     for (iter = val; iter < size; iter++)
> 
> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
> parenthesis
> #88: FILE: lib/librte_vhost/vhost.h:57:
> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
> +     for (iter = val; iter < size; iter++)
> 
> total: 3 errors, 1 warnings, 67 lines checked
> 
> 0/1 valid patch/tmp/dpdk_build/lib/librte_vhost/virtio_net.c:2065:1:
> error: unused function 'free_zmbuf' [-Werror,-Wunused-function]
> free_zmbuf(struct vhost_virtqueue *vq)
> ^
> 1 error generated.
> make[5]: *** [virtio_net.o] Error 1
> make[4]: *** [librte_vhost] Error 2
> make[4]: *** Waiting for unfinished jobs....
> make[3]: *** [lib] Error 2
> make[2]: *** [all] Error 2
> make[1]: *** [pre_install] Error 2
> make: *** [install] Error 2
> 
> 
> On 10/22/19 12:08 AM, Marvin Liu wrote:
> > Packed ring has more compact ring format and thus can significantly
> > reduce the number of cache miss. It can lead to better performance.
> > This has been approved in virtio user driver, on normal E5 Xeon cpu
> > single core performance can raise 12%.
> >
> > http://mails.dpdk.org/archives/dev/2018-April/095470.html
> >
> > However vhost performance with packed ring performance was decreased.
> > Through analysis, mostly extra cost was from the calculating of each
> > descriptor flag which depended on ring wrap counter. Moreover, both
> > frontend and backend need to write same descriptors which will cause
> > cache contention. Especially when doing vhost enqueue function, virtio
> > refill packed ring function may write same cache line when vhost doing
> > enqueue function. This kind of extra cache cost will reduce the benefit
> > of reducing cache misses.
> >
> > For optimizing vhost packed ring performance, vhost enqueue and dequeue
> > function will be splitted into fast and normal path.
> >
> > Several methods will be taken in fast path:
> >   Handle descriptors in one cache line by batch.
> >   Split loop function into more pieces and unroll them.
> >   Prerequisite check that whether I/O space can copy directly into mbuf
> >     space and vice versa.
> >   Prerequisite check that whether descriptor mapping is successful.
> >   Distinguish vhost used ring update function by enqueue and dequeue
> >     function.
> >   Buffer dequeue used descriptors as many as possible.
> >   Update enqueue used descriptors by cache line.
> >
> > After all these methods done, single core vhost PvP performance with 64B
> > packet on Xeon 8180 can boost 35%.
> >
> > v8:
> > - Allocate mbuf by virtio_dev_pktmbuf_alloc
> >
> > v7:
> > - Rebase code
> > - Rename unroll macro and definitions
> > - Calculate flags when doing single dequeue
> >
> > v6:
> > - Fix dequeue zcopy result check
> >
> > v5:
> > - Remove disable sw prefetch as performance impact is small
> > - Change unroll pragma macro format
> > - Rename shadow counter elements names
> > - Clean dequeue update check condition
> > - Add inline functions replace of duplicated code
> > - Unify code style
> >
> > v4:
> > - Support meson build
> > - Remove memory region cache for no clear performance gain and ABI break
> > - Not assume ring size is power of two
> >
> > v3:
> > - Check available index overflow
> > - Remove dequeue remained descs number check
> > - Remove changes in split ring datapath
> > - Call memory write barriers once when updating used flags
> > - Rename some functions and macros
> > - Code style optimization
> >
> > v2:
> > - Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
> > - Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
> > - Optimize dequeue used ring update when in_order negotiated
> >
> >
> > Marvin Liu (13):
> >   vhost: add packed ring indexes increasing function
> >   vhost: add packed ring single enqueue
> >   vhost: try to unroll for each loop
> >   vhost: add packed ring batch enqueue
> >   vhost: add packed ring single dequeue
> >   vhost: add packed ring batch dequeue
> >   vhost: flush enqueue updates by cacheline
> >   vhost: flush batched enqueue descs directly
> >   vhost: buffer packed ring dequeue updates
> >   vhost: optimize packed ring enqueue
> >   vhost: add packed ring zcopy batch and single dequeue
> >   vhost: optimize packed ring dequeue
> >   vhost: optimize packed ring dequeue when in-order
> >
> >  lib/librte_vhost/Makefile     |  18 +
> >  lib/librte_vhost/meson.build  |   7 +
> >  lib/librte_vhost/vhost.h      |  57 ++
> >  lib/librte_vhost/virtio_net.c | 948 +++++++++++++++++++++++++++-------
> >  4 files changed, 837 insertions(+), 193 deletions(-)
> >

Re: [dpdk-dev] [PATCH v8 00/13] vhost packed ring performance optimization

Reply via email to