On 10/24/19 9:18 AM, Liu, Yong wrote:
>
>
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
>> Sent: Thursday, October 24, 2019 2:50 PM
>> To: Liu, Yong <yong....@intel.com>; Bie, Tiwei <tiwei....@intel.com>; Wang,
>> Zhihong <zhihong.w...@intel.com>; step...@networkplumber.org;
>> gavin...@arm.com
>> Cc: dev@dpdk.org
>> Subject: Re: [PATCH v8 00/13] vhost packed ring performance optimization
>>
>> I get some checkpatch warnings, and build fails with clang.
>> Could you please fix these issues and send v9?
>>
>
>
> Hi Maxime,
> Clang build fails will be fixed in v9. For checkpatch warning, it was due to
> pragma string inside.
> Previous version can avoid such warning, while format is a little messy as
> below.
> I prefer to keep code clean and more readable. How about your idea?
>
> +#ifdef UNROLL_PRAGMA_PARAM
> +#define VHOST_UNROLL_PRAGMA(param) _Pragma(param)
> +#else
> +#define VHOST_UNROLL_PRAGMA(param) do {} while (0);
> +#endif
>
> + VHOST_UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
> + for (i = 0; i < PACKED_BATCH_SIZE; i++)
That's less clean indeed. I agree to waive the checkpatch errors.
just fix the Clang build for patch 8 and we're good.
Thanks,
Maxime
> Regards,
> Marvin
>
>> Thanks,
>> Maxime
>>
>> ### [PATCH] vhost: try to unroll for each loop
>>
>> WARNING:CAMELCASE: Avoid CamelCase: <_Pragma>
>> #78: FILE: lib/librte_vhost/vhost.h:47:
>> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll
>> 4") \
>>
>> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
>> parenthesis
>> #78: FILE: lib/librte_vhost/vhost.h:47:
>> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll
>> 4") \
>> + for (iter = val; iter < size; iter++)
>>
>> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
>> parenthesis
>> #83: FILE: lib/librte_vhost/vhost.h:52:
>> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
>> + for (iter = val; iter < size; iter++)
>>
>> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
>> parenthesis
>> #88: FILE: lib/librte_vhost/vhost.h:57:
>> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
>> + for (iter = val; iter < size; iter++)
>>
>> total: 3 errors, 1 warnings, 67 lines checked
>>
>> 0/1 valid patch/tmp/dpdk_build/lib/librte_vhost/virtio_net.c:2065:1:
>> error: unused function 'free_zmbuf' [-Werror,-Wunused-function]
>> free_zmbuf(struct vhost_virtqueue *vq)
>> ^
>> 1 error generated.
>> make[5]: *** [virtio_net.o] Error 1
>> make[4]: *** [librte_vhost] Error 2
>> make[4]: *** Waiting for unfinished jobs....
>> make[3]: *** [lib] Error 2
>> make[2]: *** [all] Error 2
>> make[1]: *** [pre_install] Error 2
>> make: *** [install] Error 2
>>
>>
>> On 10/22/19 12:08 AM, Marvin Liu wrote:
>>> Packed ring has more compact ring format and thus can significantly
>>> reduce the number of cache miss. It can lead to better performance.
>>> This has been approved in virtio user driver, on normal E5 Xeon cpu
>>> single core performance can raise 12%.
>>>
>>> http://mails.dpdk.org/archives/dev/2018-April/095470.html
>>>
>>> However vhost performance with packed ring performance was decreased.
>>> Through analysis, mostly extra cost was from the calculating of each
>>> descriptor flag which depended on ring wrap counter. Moreover, both
>>> frontend and backend need to write same descriptors which will cause
>>> cache contention. Especially when doing vhost enqueue function, virtio
>>> refill packed ring function may write same cache line when vhost doing
>>> enqueue function. This kind of extra cache cost will reduce the benefit
>>> of reducing cache misses.
>>>
>>> For optimizing vhost packed ring performance, vhost enqueue and dequeue
>>> function will be splitted into fast and normal path.
>>>
>>> Several methods will be taken in fast path:
>>> Handle descriptors in one cache line by batch.
>>> Split loop function into more pieces and unroll them.
>>> Prerequisite check that whether I/O space can copy directly into mbuf
>>> space and vice versa.
>>> Prerequisite check that whether descriptor mapping is successful.
>>> Distinguish vhost used ring update function by enqueue and dequeue
>>> function.
>>> Buffer dequeue used descriptors as many as possible.
>>> Update enqueue used descriptors by cache line.
>>>
>>> After all these methods done, single core vhost PvP performance with 64B
>>> packet on Xeon 8180 can boost 35%.
>>>
>>> v8:
>>> - Allocate mbuf by virtio_dev_pktmbuf_alloc
>>>
>>> v7:
>>> - Rebase code
>>> - Rename unroll macro and definitions
>>> - Calculate flags when doing single dequeue
>>>
>>> v6:
>>> - Fix dequeue zcopy result check
>>>
>>> v5:
>>> - Remove disable sw prefetch as performance impact is small
>>> - Change unroll pragma macro format
>>> - Rename shadow counter elements names
>>> - Clean dequeue update check condition
>>> - Add inline functions replace of duplicated code
>>> - Unify code style
>>>
>>> v4:
>>> - Support meson build
>>> - Remove memory region cache for no clear performance gain and ABI break
>>> - Not assume ring size is power of two
>>>
>>> v3:
>>> - Check available index overflow
>>> - Remove dequeue remained descs number check
>>> - Remove changes in split ring datapath
>>> - Call memory write barriers once when updating used flags
>>> - Rename some functions and macros
>>> - Code style optimization
>>>
>>> v2:
>>> - Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
>>> - Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
>>> - Optimize dequeue used ring update when in_order negotiated
>>>
>>>
>>> Marvin Liu (13):
>>> vhost: add packed ring indexes increasing function
>>> vhost: add packed ring single enqueue
>>> vhost: try to unroll for each loop
>>> vhost: add packed ring batch enqueue
>>> vhost: add packed ring single dequeue
>>> vhost: add packed ring batch dequeue
>>> vhost: flush enqueue updates by cacheline
>>> vhost: flush batched enqueue descs directly
>>> vhost: buffer packed ring dequeue updates
>>> vhost: optimize packed ring enqueue
>>> vhost: add packed ring zcopy batch and single dequeue
>>> vhost: optimize packed ring dequeue
>>> vhost: optimize packed ring dequeue when in-order
>>>
>>> lib/librte_vhost/Makefile | 18 +
>>> lib/librte_vhost/meson.build | 7 +
>>> lib/librte_vhost/vhost.h | 57 ++
>>> lib/librte_vhost/virtio_net.c | 948 +++++++++++++++++++++++++++-------
>>> 4 files changed, 837 insertions(+), 193 deletions(-)
>>>
>