Hi Marvin,

This is almost good, just fix the small comments I made.
Also, please rebase on top of the next-virtio branch, because I applied the
patch below from Flavio, which you need to take into account:
http://patches.dpdk.org/patch/61284/

Regards,
Maxime

On 10/15/19 6:07 PM, Marvin Liu wrote:
> Packed ring has a more compact ring format and thus can significantly
> reduce the number of cache misses. It can lead to better performance.
> This has been proven in the virtio user driver: on a normal E5 Xeon CPU,
> single core performance can rise by 12%.
>
> http://mails.dpdk.org/archives/dev/2018-April/095470.html
>
> However, vhost performance with packed ring decreased.
> Through analysis, most of the extra cost came from calculating each
> descriptor flag, which depends on the ring wrap counter. Moreover, both
> frontend and backend need to write the same descriptors, which causes
> cache contention. In particular, while vhost is running its enqueue
> function, the virtio refill packed ring function may write the same
> cache line. This kind of extra cache cost reduces the benefit
> of reducing cache misses.
>
> To optimize vhost packed ring performance, the vhost enqueue and dequeue
> functions will be split into a fast path and a normal path.
>
> Several methods will be used in the fast path:
>   Handle descriptors in one cache line by batch.
>   Split loop function into more pieces and unroll them.
>   Prerequisite check whether I/O space can be copied directly into mbuf
>     space and vice versa.
>   Prerequisite check whether descriptor mapping is successful.
>   Distinguish vhost used ring update function by enqueue and dequeue
>     function.
>   Buffer as many dequeue used descriptors as possible.
>   Update enqueue used descriptors by cache line.
>
> With all these methods applied, single core vhost PvP performance with 64B
> packets on Xeon 8180 can improve by 35%.
>
> v6:
> - Fix dequeue zcopy result check
>
> v5:
> - Remove disable sw prefetch as performance impact is small
> - Change unroll pragma macro format
> - Rename shadow counter elements names
> - Clean dequeue update check condition
> - Add inline functions replace of duplicated code
> - Unify code style
>
> v4:
> - Support meson build
> - Remove memory region cache for no clear performance gain and ABI break
> - Not assume ring size is power of two
>
> v3:
> - Check available index overflow
> - Remove dequeue remained descs number check
> - Remove changes in split ring datapath
> - Call memory write barriers once when updating used flags
> - Rename some functions and macros
> - Code style optimization
>
> v2:
> - Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
> - Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
> - Optimize dequeue used ring update when in_order negotiated
>
>
> Marvin Liu (13):
>   vhost: add packed ring indexes increasing function
>   vhost: add packed ring single enqueue
>   vhost: try to unroll for each loop
>   vhost: add packed ring batch enqueue
>   vhost: add packed ring single dequeue
>   vhost: add packed ring batch dequeue
>   vhost: flush enqueue updates by batch
>   vhost: flush batched enqueue descs directly
>   vhost: buffer packed ring dequeue updates
>   vhost: optimize packed ring enqueue
>   vhost: add packed ring zcopy batch and single dequeue
>   vhost: optimize packed ring dequeue
>   vhost: optimize packed ring dequeue when in-order
>
>  lib/librte_vhost/Makefile     |  18 +
>  lib/librte_vhost/meson.build  |   7 +
>  lib/librte_vhost/vhost.h      |  57 +++
>  lib/librte_vhost/virtio_net.c | 924 +++++++++++++++++++++++++++-------
>  4 files changed, 812 insertions(+), 194 deletions(-)
>
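
For readers following the thread, the "handle descriptors in one cache line by batch" idea from the cover letter can be sketched roughly as below. This is an illustration only, not code from the series: the names packed_desc, desc_is_avail and batch_is_avail are hypothetical, a 64-byte cache line is assumed, and the ring is assumed to start on a cache-line boundary. The flag bits and the 16-byte descriptor layout follow the virtio 1.1 packed ring format.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits of a packed ring descriptor (virtio 1.1). */
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

#define CACHE_LINE_SIZE 64u /* assumption: 64-byte cache lines */

/* Packed ring descriptor: 16 bytes, so four fit in one cache line. */
struct packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

#define BATCH_SIZE (CACHE_LINE_SIZE / sizeof(struct packed_desc))

/*
 * A descriptor is available to the device when its AVAIL bit matches
 * the ring wrap counter and its USED bit does not.
 */
static inline bool
desc_is_avail(uint16_t flags, bool wrap_counter)
{
	bool avail = !!(flags & DESC_F_AVAIL);
	bool used = !!(flags & DESC_F_USED);

	return avail == wrap_counter && used != wrap_counter;
}

/*
 * Hypothetical helper: true when the BATCH_SIZE descriptors starting at
 * idx share one cache line, do not cross the ring end, and are all
 * available. Otherwise the caller would fall back to the single path.
 */
static inline bool
batch_is_avail(const struct packed_desc *ring, uint16_t ring_size,
	       uint16_t idx, bool wrap_counter)
{
	size_t i;

	if (idx % BATCH_SIZE != 0 || (uint32_t)idx + BATCH_SIZE > ring_size)
		return false;

	for (i = 0; i < BATCH_SIZE; i++)
		if (!desc_is_avail(ring[idx + i].flags, wrap_counter))
			return false;

	return true;
}
```

The point of checking the whole batch up front is that the fast path can then process all four descriptors with unrolled loops and a single flags update per cache line, while anything that fails the check goes through the normal per-descriptor path.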