Re: [dpdk-dev] [PATCH v8 00/13] vhost packed ring performance optimization

Maxime Coquelin Thu, 24 Oct 2019 01:25:08 -0700

On 10/24/19 9:18 AM, Liu, Yong wrote:
> 
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:[email protected]]
>> Sent: Thursday, October 24, 2019 2:50 PM
>> To: Liu, Yong <[email protected]>; Bie, Tiwei <[email protected]>; Wang,
>> Zhihong <[email protected]>; [email protected];
>> [email protected]
>> Cc: [email protected]
>> Subject: Re: [PATCH v8 00/13] vhost packed ring performance optimization
>>
>> I get some checkpatch warnings, and build fails with clang.
>> Could you please fix these issues and send v9?
>>
> 
> 
> Hi Maxime,
> Clang build fails will be fixed in v9. For checkpatch warning, it was due to 
> pragma string inside.
> Previous version can avoid such warning, while format is a little messy as 
> below. 
> I prefer to keep code clean and more readable. How about your idea?
> 
> +#ifdef UNROLL_PRAGMA_PARAM
> +#define VHOST_UNROLL_PRAGMA(param) _Pragma(param)
> +#else
> +#define VHOST_UNROLL_PRAGMA(param) do {} while (0);
> +#endif
> 
> +       VHOST_UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
> +       for (i = 0; i < PACKED_BATCH_SIZE; i++)

That's less clean indeed. I agree to waive the checkpatch errors.
just fix the Clang build for patch 8 and we're good.

Thanks,
Maxime

> Regards,
> Marvin
> 
>> Thanks,
>> Maxime
>>
>> ### [PATCH] vhost: try to unroll for each loop
>>
>> WARNING:CAMELCASE: Avoid CamelCase: <_Pragma>
>> #78: FILE: lib/librte_vhost/vhost.h:47:
>> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll
>> 4") \
>>
>> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
>> parenthesis
>> #78: FILE: lib/librte_vhost/vhost.h:47:
>> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll
>> 4") \
>> +    for (iter = val; iter < size; iter++)
>>
>> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
>> parenthesis
>> #83: FILE: lib/librte_vhost/vhost.h:52:
>> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
>> +    for (iter = val; iter < size; iter++)
>>
>> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
>> parenthesis
>> #88: FILE: lib/librte_vhost/vhost.h:57:
>> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
>> +    for (iter = val; iter < size; iter++)
>>
>> total: 3 errors, 1 warnings, 67 lines checked
>>
>> 0/1 valid patch/tmp/dpdk_build/lib/librte_vhost/virtio_net.c:2065:1:
>> error: unused function 'free_zmbuf' [-Werror,-Wunused-function]
>> free_zmbuf(struct vhost_virtqueue *vq)
>> ^
>> 1 error generated.
>> make[5]: *** [virtio_net.o] Error 1
>> make[4]: *** [librte_vhost] Error 2
>> make[4]: *** Waiting for unfinished jobs....
>> make[3]: *** [lib] Error 2
>> make[2]: *** [all] Error 2
>> make[1]: *** [pre_install] Error 2
>> make: *** [install] Error 2
>>
>>
>> On 10/22/19 12:08 AM, Marvin Liu wrote:
>>> Packed ring has more compact ring format and thus can significantly
>>> reduce the number of cache miss. It can lead to better performance.
>>> This has been approved in virtio user driver, on normal E5 Xeon cpu
>>> single core performance can raise 12%.
>>>
>>> http://mails.dpdk.org/archives/dev/2018-April/095470.html
>>>
>>> However vhost performance with packed ring performance was decreased.
>>> Through analysis, mostly extra cost was from the calculating of each
>>> descriptor flag which depended on ring wrap counter. Moreover, both
>>> frontend and backend need to write same descriptors which will cause
>>> cache contention. Especially when doing vhost enqueue function, virtio
>>> refill packed ring function may write same cache line when vhost doing
>>> enqueue function. This kind of extra cache cost will reduce the benefit
>>> of reducing cache misses.
>>>
>>> For optimizing vhost packed ring performance, vhost enqueue and dequeue
>>> function will be splitted into fast and normal path.
>>>
>>> Several methods will be taken in fast path:
>>>   Handle descriptors in one cache line by batch.
>>>   Split loop function into more pieces and unroll them.
>>>   Prerequisite check that whether I/O space can copy directly into mbuf
>>>     space and vice versa.
>>>   Prerequisite check that whether descriptor mapping is successful.
>>>   Distinguish vhost used ring update function by enqueue and dequeue
>>>     function.
>>>   Buffer dequeue used descriptors as many as possible.
>>>   Update enqueue used descriptors by cache line.
>>>
>>> After all these methods done, single core vhost PvP performance with 64B
>>> packet on Xeon 8180 can boost 35%.
>>>
>>> v8:
>>> - Allocate mbuf by virtio_dev_pktmbuf_alloc
>>>
>>> v7:
>>> - Rebase code
>>> - Rename unroll macro and definitions
>>> - Calculate flags when doing single dequeue
>>>
>>> v6:
>>> - Fix dequeue zcopy result check
>>>
>>> v5:
>>> - Remove disable sw prefetch as performance impact is small
>>> - Change unroll pragma macro format
>>> - Rename shadow counter elements names
>>> - Clean dequeue update check condition
>>> - Add inline functions replace of duplicated code
>>> - Unify code style
>>>
>>> v4:
>>> - Support meson build
>>> - Remove memory region cache for no clear performance gain and ABI break
>>> - Not assume ring size is power of two
>>>
>>> v3:
>>> - Check available index overflow
>>> - Remove dequeue remained descs number check
>>> - Remove changes in split ring datapath
>>> - Call memory write barriers once when updating used flags
>>> - Rename some functions and macros
>>> - Code style optimization
>>>
>>> v2:
>>> - Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
>>> - Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
>>> - Optimize dequeue used ring update when in_order negotiated
>>>
>>>
>>> Marvin Liu (13):
>>>   vhost: add packed ring indexes increasing function
>>>   vhost: add packed ring single enqueue
>>>   vhost: try to unroll for each loop
>>>   vhost: add packed ring batch enqueue
>>>   vhost: add packed ring single dequeue
>>>   vhost: add packed ring batch dequeue
>>>   vhost: flush enqueue updates by cacheline
>>>   vhost: flush batched enqueue descs directly
>>>   vhost: buffer packed ring dequeue updates
>>>   vhost: optimize packed ring enqueue
>>>   vhost: add packed ring zcopy batch and single dequeue
>>>   vhost: optimize packed ring dequeue
>>>   vhost: optimize packed ring dequeue when in-order
>>>
>>>  lib/librte_vhost/Makefile     |  18 +
>>>  lib/librte_vhost/meson.build  |   7 +
>>>  lib/librte_vhost/vhost.h      |  57 ++
>>>  lib/librte_vhost/virtio_net.c | 948 +++++++++++++++++++++++++++-------
>>>  4 files changed, 837 insertions(+), 193 deletions(-)
>>>
>
Re: [dpdk-dev] [PATCH v8 00/13] vhost packed ring performance optimization

Reply via email to