Re: [dpdk-dev] [PATCH v6 00/13] vhost packed ring performance optimization

Maxime Coquelin Thu, 17 Oct 2019 00:32:01 -0700

Hi Marvin,

This is almost good; please just address the small comments I made.


Also, please rebase on top of the next-virtio branch, because I applied
the patch below from Flavio, which you need to take into account:

http://patches.dpdk.org/patch/61284/

Regards,
Maxime

On 10/15/19 6:07 PM, Marvin Liu wrote:
> The packed ring has a more compact ring format and thus can
> significantly reduce the number of cache misses, which can lead to
> better performance. This has been proven in the virtio user driver: on
> a normal E5 Xeon CPU, single-core performance can rise by 12%.
> 
> http://mails.dpdk.org/archives/dev/2018-April/095470.html
> 
> However, vhost performance with the packed ring decreased. Analysis
> showed that most of the extra cost came from calculating each
> descriptor's flags, which depend on the ring wrap counter. Moreover,
> both frontend and backend need to write the same descriptors, which
> causes cache contention. In particular, while vhost is running its
> enqueue function, the virtio packed ring refill function may write to
> the same cache line. This extra cache cost reduces the benefit gained
> from fewer cache misses.
> 
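For readers following the thread, here is a minimal sketch of why the descriptor flags depend on the wrap counter. It is not code from this series: the flag bit positions follow the virtio 1.1 packed ring layout, while every name prefixed with sketch_ or SKETCH_ is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bit positions of a packed ring descriptor (virtio 1.1). */
#define SKETCH_DESC_F_AVAIL (1u << 7)
#define SKETCH_DESC_F_USED  (1u << 15)

/*
 * Hypothetical helper: a descriptor is available to the device when its
 * AVAIL bit matches the wrap counter and its USED bit does not.
 */
static inline bool
sketch_desc_is_avail(uint16_t flags, bool wrap_counter)
{
	bool avail = !!(flags & SKETCH_DESC_F_AVAIL);
	bool used = !!(flags & SKETCH_DESC_F_USED);

	return avail == wrap_counter && used != wrap_counter;
}

/*
 * Hypothetical helper: flags the device writes to mark a descriptor as
 * used. Both bits are set equal to the current wrap counter.
 */
static inline uint16_t
sketch_used_flags(bool wrap_counter)
{
	return wrap_counter ?
		(uint16_t)(SKETCH_DESC_F_AVAIL | SKETCH_DESC_F_USED) : 0;
}

/*
 * Hypothetical helper: advance a ring index by n and toggle the wrap
 * counter when the index passes the end of the ring. The ring size is
 * not assumed to be a power of two.
 */
static inline void
sketch_advance(uint16_t *idx, bool *wrap_counter, uint16_t ring_size,
	       uint16_t n)
{
	uint32_t next;

	assert(*idx < ring_size && n <= ring_size);

	next = (uint32_t)*idx + n;
	if (next >= ring_size) {
		next -= ring_size;
		*wrap_counter = !*wrap_counter;
	}
	*idx = (uint16_t)next;
}
```

Because the expected flag value flips every time the ring wraps, it has to be derived from the wrap counter for each descriptor, which is the per-descriptor cost described above.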
> To optimize vhost packed ring performance, the vhost enqueue and
> dequeue functions will be split into fast and normal paths.
> 
> Several methods will be used in the fast path:
>   Handle the descriptors in one cache line as a batch.
>   Split loop functions into smaller pieces and unroll them.
>   Check as a prerequisite whether I/O space can be copied directly
>     into mbuf space and vice versa.
>   Check as a prerequisite whether descriptor mapping is successful.
>   Use separate vhost used ring update functions for enqueue and
>     dequeue.
>   Buffer as many dequeue used descriptors as possible.
>   Update enqueue used descriptors by cache line.
> 
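The batch idea in the list above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the patches: it assumes a 64-byte cache line and the 16-byte packed descriptor layout, so one cache line holds four descriptors. All names prefixed with sketch_ or SKETCH_ are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE   64          /* assumed cache line size */
#define SKETCH_DESC_F_AVAIL (1u << 7)   /* virtio 1.1 flag bits */
#define SKETCH_DESC_F_USED  (1u << 15)

/* Packed ring descriptor layout: 16 bytes, no padding. */
struct sketch_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

/* Number of descriptors that share one cache line (4 here). */
#define SKETCH_BATCH \
	(SKETCH_CACHE_LINE / sizeof(struct sketch_packed_desc))

/*
 * Hypothetical fast-path precheck: the batch path is taken only when
 * the index starts a cache line, the batch does not cross the end of
 * the ring, and every descriptor in the line is available. Otherwise
 * the caller falls back to handling one descriptor at a time.
 */
static inline bool
sketch_batch_ready(const struct sketch_packed_desc *ring,
		   uint16_t ring_size, uint16_t avail_idx,
		   bool wrap_counter)
{
	size_t i;

	if (avail_idx % SKETCH_BATCH != 0)
		return false;
	if ((size_t)avail_idx + SKETCH_BATCH > ring_size)
		return false;

	for (i = 0; i < SKETCH_BATCH; i++) {
		uint16_t flags = ring[avail_idx + i].flags;
		bool avail = !!(flags & SKETCH_DESC_F_AVAIL);
		bool used = !!(flags & SKETCH_DESC_F_USED);

		if (avail != wrap_counter || used == wrap_counter)
			return false;
	}
	return true;
}
```

The loop has a fixed, small trip count, which is what makes it a candidate for unrolling. A real datapath would also need the appropriate memory barriers around the flag reads, which this sketch omits.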
> With all these methods applied, single-core vhost PvP performance with
> 64B packets on Xeon 8180 improves by 35%.
> 
> v6:
> - Fix dequeue zcopy result check
> 
> v5:
> - Remove disable sw prefetch as performance impact is small
> - Change unroll pragma macro format
> - Rename shadow counter elements names
> - Clean dequeue update check condition
> - Add inline functions to replace duplicated code
> - Unify code style
> 
> v4:
> - Support meson build
> - Remove memory region cache for no clear performance gain and ABI break
> - Do not assume ring size is a power of two
> 
> v3:
> - Check available index overflow
> - Remove dequeue remained descs number check
> - Remove changes in split ring datapath
> - Call memory write barriers once when updating used flags
> - Rename some functions and macros
> - Code style optimization
> 
> v2:
> - Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
> - Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
> - Optimize dequeue used ring update when in_order negotiated
> 
> 
> Marvin Liu (13):
>   vhost: add packed ring indexes increasing function
>   vhost: add packed ring single enqueue
>   vhost: try to unroll for each loop
>   vhost: add packed ring batch enqueue
>   vhost: add packed ring single dequeue
>   vhost: add packed ring batch dequeue
>   vhost: flush enqueue updates by batch
>   vhost: flush batched enqueue descs directly
>   vhost: buffer packed ring dequeue updates
>   vhost: optimize packed ring enqueue
>   vhost: add packed ring zcopy batch and single dequeue
>   vhost: optimize packed ring dequeue
>   vhost: optimize packed ring dequeue when in-order
> 
>  lib/librte_vhost/Makefile     |  18 +
>  lib/librte_vhost/meson.build  |   7 +
>  lib/librte_vhost/vhost.h      |  57 +++
>  lib/librte_vhost/virtio_net.c | 924 +++++++++++++++++++++++++++-------
>  4 files changed, 812 insertions(+), 194 deletions(-)
>
