On 10/24/19 6:08 PM, Marvin Liu wrote:
> The packed ring has a more compact ring format and can therefore
> significantly reduce the number of cache misses, which can lead to better
> performance. This has been proven in the virtio user driver: on a normal
> E5 Xeon CPU, single-core performance can rise by 12%.
>
> http://mails.dpdk.org/archives/dev/2018-April/095470.html
>
> However, vhost performance decreased with the packed ring. Analysis
> showed that most of the extra cost came from calculating each
> descriptor's flags, which depend on the ring wrap counter. Moreover, both
> frontend and backend need to write the same descriptors, which causes
> cache contention. In particular, while vhost is running its enqueue
> function, the virtio packed ring refill function may write to the same
> cache line. This extra cache cost reduces the benefit gained from fewer
> cache misses.
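
For readers less familiar with the packed ring layout, the flag calculation described above can be sketched roughly as follows. This is a minimal illustration and not code from the series: the helper names desc_is_avail() and used_flags() are hypothetical, and only the AVAIL (bit 7) and USED (bit 15) flag positions are taken from the virtio 1.1 packed ring layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bit positions from the virtio 1.1 packed ring layout. */
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

/*
 * Hypothetical helper: a descriptor is available to the device when its
 * AVAIL bit matches the device's wrap counter and its USED bit does not.
 */
static bool desc_is_avail(uint16_t flags, bool wrap_counter)
{
    bool avail = (flags & DESC_F_AVAIL) != 0;
    bool used = (flags & DESC_F_USED) != 0;

    return avail == wrap_counter && used != wrap_counter;
}

/*
 * Hypothetical helper: flags the device writes to mark a descriptor as
 * used. Both bits are set equal to the current wrap counter.
 */
static uint16_t used_flags(bool wrap_counter)
{
    return wrap_counter ? (uint16_t)(DESC_F_AVAIL | DESC_F_USED) : 0;
}
```

Because both results depend on the wrap counter, they have to be recomputed whenever the ring wraps, which is the per-descriptor cost the cover letter refers to.
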
>
> To optimize vhost packed ring performance, the vhost enqueue and dequeue
> functions are split into a fast path and a normal path.
>
> Several methods are used in the fast path:
> - Handle the descriptors in one cache line as a batch.
> - Split the loop functions into smaller pieces and unroll them.
> - Check up front whether I/O space can be copied directly into mbuf
>   space and vice versa.
> - Check up front whether descriptor mapping is successful.
> - Use separate vhost used ring update functions for enqueue and
>   dequeue.
> - Buffer as many dequeue used descriptors as possible.
> - Update enqueue used descriptors by cache line.
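
As a rough illustration of the first item in the list above, the batch precheck could look like the sketch below. It is not the code from the series: batch_is_avail(), struct packed_desc and the 64-byte cache line are assumptions made for the example, and the memory barriers a real datapath needs are left out. The 16-byte descriptor layout follows virtio 1.1, so four descriptors share one cache line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u   /* assumed cache line size */
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

/* 16-byte packed ring descriptor, following the virtio 1.1 layout. */
struct packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Number of descriptors that fit in one cache line (4 here). */
#define BATCH_SIZE (CACHE_LINE_SIZE / sizeof(struct packed_desc))

/*
 * Hypothetical fast-path precheck: the batch must start on a cache line
 * boundary, must not run past the end of the ring, and every descriptor
 * in it must be available. Otherwise the caller falls back to the
 * normal, one-descriptor-at-a-time path.
 */
static bool batch_is_avail(const struct packed_desc *ring, uint16_t ring_size,
                           uint16_t idx, bool wrap_counter)
{
    if (idx % BATCH_SIZE != 0 || idx + BATCH_SIZE > ring_size)
        return false;

    for (size_t i = 0; i < BATCH_SIZE; i++) {
        uint16_t flags = ring[idx + i].flags;
        bool avail = (flags & DESC_F_AVAIL) != 0;
        bool used = (flags & DESC_F_USED) != 0;

        if (avail != wrap_counter || used == wrap_counter)
            return false;
    }
    return true;
}
```

Doing all the checks before touching any data keeps the loop body free of branches on the success path, which is what makes unrolling worthwhile.
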
>
> With all these methods applied, single-core vhost PvP performance with
> 64B packets on a Xeon 8180 can improve by 35%.
>
> v9:
> - Fix clang build error
>
> v8:
> - Allocate mbuf by virtio_dev_pktmbuf_alloc
>
> v7:
> - Rebase code
> - Rename unroll macro and definitions
> - Calculate flags when doing single dequeue
>
> v6:
> - Fix dequeue zcopy result check
>
> v5:
> - Remove the disabling of sw prefetch, as its performance impact is small
> - Change unroll pragma macro format
> - Rename shadow counter elements
> - Clean dequeue update check condition
> - Replace duplicated code with inline functions
> - Unify code style
>
> v4:
> - Support meson build
> - Remove memory region cache, as it brings no clear performance gain and
>   breaks the ABI
> - Do not assume ring size is a power of two
>
> v3:
> - Check available index overflow
> - Remove the check on the number of remaining dequeue descs
> - Remove changes in split ring datapath
> - Call memory write barriers once when updating used flags
> - Rename some functions and macros
> - Code style optimization
>
> v2:
> - Use compiler pragmas to unroll loops, distinguishing clang/icc/gcc
> - Change the number of buffered dequeue used descs to
>   (RING_SZ - PKT_BURST)
> - Optimize dequeue used ring update when in_order is negotiated
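
The v4 note about not assuming a power-of-two ring size and the v3 note about index overflow both concern how the ring index is advanced. A minimal sketch of that idea, using a hypothetical struct ring_pos and ring_pos_advance() rather than the actual function added in the series, and assuming num never exceeds ring_size:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for a ring position and its wrap counter. */
struct ring_pos {
    uint16_t idx;
    bool wrap_counter;
};

/*
 * Hypothetical helper: advance the position by num descriptors. When the
 * index passes the end of the ring, subtract the ring size and flip the
 * wrap counter. No masking is used, so the ring size does not have to be
 * a power of two. Assumes num <= ring_size.
 */
static void ring_pos_advance(struct ring_pos *pos, uint16_t ring_size,
                             uint16_t num)
{
    /* Widen before adding so the sum cannot overflow uint16_t. */
    uint32_t next = (uint32_t)pos->idx + num;

    if (next >= ring_size) {
        next -= ring_size;
        pos->wrap_counter = !pos->wrap_counter;
    }
    pos->idx = (uint16_t)next;
}
```

With a ring of 6 entries, advancing from index 4 by 4 lands on index 2 and flips the wrap counter, which a mask-based `idx & (size - 1)` could not express.
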
>
>
> Marvin Liu (13):
> vhost: add packed ring indexes increasing function
> vhost: add packed ring single enqueue
> vhost: try to unroll for each loop
> vhost: add packed ring batch enqueue
> vhost: add packed ring single dequeue
> vhost: add packed ring batch dequeue
> vhost: flush enqueue updates by cacheline
> vhost: flush batched enqueue descs directly
> vhost: buffer packed ring dequeue updates
> vhost: optimize packed ring enqueue
> vhost: add packed ring zcopy batch and single dequeue
> vhost: optimize packed ring dequeue
> vhost: optimize packed ring dequeue when in-order
>
> lib/librte_vhost/Makefile | 18 +
> lib/librte_vhost/meson.build | 7 +
> lib/librte_vhost/vhost.h | 57 ++
> lib/librte_vhost/virtio_net.c | 948 +++++++++++++++++++++++++++-------
> 4 files changed, 837 insertions(+), 193 deletions(-)
>
Applied to dpdk-next-virtio/master.
Thanks,
Maxime