Re: [dpdk-dev] [PATCH v9 00/13] vhost packed ring performance optimization

Maxime Coquelin Thu, 24 Oct 2019 03:19:21 -0700

On 10/24/19 6:08 PM, Marvin Liu wrote:
> Packed ring has a more compact ring format and thus can significantly
> reduce the number of cache misses, which can lead to better performance.
> This has been proven in the virtio user driver: on a normal E5 Xeon CPU,
> single-core performance can rise by 12%.
> 
> http://mails.dpdk.org/archives/dev/2018-April/095470.html
> 
> However, vhost performance with packed ring decreased.
> Analysis showed that most of the extra cost came from calculating each
> descriptor flag, which depends on the ring wrap counter. Moreover, both
> frontend and backend need to write the same descriptors, which causes
> cache contention. In particular, while vhost is running its enqueue
> function, the virtio packed ring refill function may write the same
> cache line. This extra cache cost reduces the benefit of having fewer
> cache misses.
> 
> To optimize vhost packed ring performance, the vhost enqueue and dequeue
> functions are split into a fast path and a normal path.
> 
> Several methods are used in the fast path:
>   Handle the descriptors in one cache line as a batch.
>   Split loop functions into smaller pieces and unroll them.
>   Check up front whether I/O space can be copied directly into mbuf
>     space and vice versa.
>   Check up front whether descriptor mapping is successful.
>   Use separate vhost used ring update functions for enqueue and
>     dequeue.
>   Buffer as many dequeue used descriptors as possible.
>   Update enqueue used descriptors by cache line.
> 
> With all these methods applied, single-core vhost PvP performance with
> 64B packets on Xeon 8180 improves by 35%.
> 
> v9:
> - Fix clang build error
> 
> v8:
> - Allocate mbuf by virtio_dev_pktmbuf_alloc
> 
> v7:
> - Rebase code
> - Rename unroll macro and definitions
> - Calculate flags when doing single dequeue
> 
> v6:
> - Fix dequeue zcopy result check
> 
> v5:
> - Remove disable sw prefetch as performance impact is small
> - Change unroll pragma macro format
> - Rename shadow counter elements names
> - Clean dequeue update check condition
> - Add inline functions to replace duplicated code
> - Unify code style
> 
> v4:
> - Support meson build
> - Remove memory region cache for no clear performance gain and ABI break
> - Do not assume ring size is a power of two
> 
> v3:
> - Check available index overflow
> - Remove dequeue remained descs number check
> - Remove changes in split ring datapath
> - Call memory write barriers once when updating used flags
> - Rename some functions and macros
> - Code style optimization
> 
> v2:
> - Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
> - Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
> - Optimize dequeue used ring update when in_order negotiated
> 
> 
> Marvin Liu (13):
>   vhost: add packed ring indexes increasing function
>   vhost: add packed ring single enqueue
>   vhost: try to unroll for each loop
>   vhost: add packed ring batch enqueue
>   vhost: add packed ring single dequeue
>   vhost: add packed ring batch dequeue
>   vhost: flush enqueue updates by cacheline
>   vhost: flush batched enqueue descs directly
>   vhost: buffer packed ring dequeue updates
>   vhost: optimize packed ring enqueue
>   vhost: add packed ring zcopy batch and single dequeue
>   vhost: optimize packed ring dequeue
>   vhost: optimize packed ring dequeue when in-order
> 
>  lib/librte_vhost/Makefile     |  18 +
>  lib/librte_vhost/meson.build  |   7 +
>  lib/librte_vhost/vhost.h      |  57 ++
>  lib/librte_vhost/virtio_net.c | 948 +++++++++++++++++++++++++++-------
>  4 files changed, 837 insertions(+), 193 deletions(-)
> 
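
For readers following the cover letter, here is a minimal sketch of the
two ideas it leans on: the per-descriptor availability check that depends
on the ring wrap counter, and the fast-path check that handles all
descriptors sharing one cache line as a batch. The descriptor layout and
the AVAIL/USED bit positions follow the virtio 1.1 packed ring format.
Everything else is an assumption made for illustration and is not taken
from the patches: the 64-byte cache line, the names packed_desc,
desc_is_avail, batch_is_avail and BATCH_SIZE, and the omission of memory
barriers and address translation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits of a packed ring descriptor (virtio 1.1 layout). */
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

/* Assumed cache line size; hypothetical constant for this sketch. */
#define CACHE_LINE_SIZE 64

/* Packed ring descriptor: 16 bytes, so four fit in a 64-byte line. */
struct packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

static_assert(sizeof(struct packed_desc) == 16,
	      "packed descriptor must be 16 bytes");

/* Number of descriptors that share one cache line. */
#define BATCH_SIZE (CACHE_LINE_SIZE / sizeof(struct packed_desc))

/*
 * Normal path: a descriptor is available to the device when its AVAIL
 * bit matches the wrap counter and its USED bit does not.
 */
static inline bool
desc_is_avail(const struct packed_desc *desc, bool wrap_counter)
{
	uint16_t flags = desc->flags;
	bool avail = !!(flags & DESC_F_AVAIL);
	bool used = !!(flags & DESC_F_USED);

	return avail == wrap_counter && used != wrap_counter;
}

/*
 * Fast path precondition: idx starts a cache line, the batch does not
 * run past the end of the ring, and every descriptor in the line is
 * available. A caller would fall back to the single-descriptor path
 * whenever this returns false.
 */
static inline bool
batch_is_avail(const struct packed_desc *ring, uint16_t ring_size,
	       uint16_t idx, bool wrap_counter)
{
	if (idx % BATCH_SIZE != 0 || idx + BATCH_SIZE > ring_size)
		return false;

	for (size_t i = 0; i < BATCH_SIZE; i++) {
		if (!desc_is_avail(&ring[idx + i], wrap_counter))
			return false;
	}
	return true;
}
```

Because the batch never crosses the ring end, the same wrap counter value
applies to every descriptor in it, which is what lets the flag comparison
be hoisted out of the per-descriptor work.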

Applied to dpdk-next-virtio/master.

Thanks,
Maxime