On 2016/1/18 11:05, Zhihong Wang wrote:
> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> utilization of hardware resources and deliver high performance.
>
> In current DPDK, memcpy holds a large proportion of execution time in
> libs like Vhost, especially for large packets, and this patch can bring
> considerable benefits.
>
> The implementation is based on the current DPDK memcpy framework, some
> background introduction can be found in these threads:
> http://dpdk.org/ml/archives/dev/2014-November/008158.html
> http://dpdk.org/ml/archives/dev/2015-January/011800.html
>
> Code changes are:
>
>   1. Read CPUID to check if AVX512 is supported by CPU
>
>   2. Predefine AVX512 macro if AVX512 is enabled by compiler
>
>   3. Implement AVX512 memcpy and choose the right implementation based on
>      predefined macros
>
>   4. Decide alignment unit for memcpy perf test based on predefined macros
>
> --------------
> Changes in v2:
>
>   1. Tune performance for prior platforms
>
> Zhihong Wang (5):
>   lib/librte_eal: Identify AVX512 CPU flag
>   mk: Predefine AVX512 macro for compiler
>   lib/librte_eal: Optimize memcpy for AVX512 platforms
>   app/test: Adjust alignment unit for memcpy perf test
>   lib/librte_eal: Tune memcpy for prior platforms
>
>  app/test/test_memcpy_perf.c                        |   6 +
>  .../common/include/arch/x86/rte_cpuflags.h         |   2 +
>  .../common/include/arch/x86/rte_memcpy.h           | 269 ++++++++++++++++++++-
>  mk/rte.cpuflags.mk                                 |   4 +
>  4 files changed, 268 insertions(+), 13 deletions(-)
>
Hi Zhihong Wang,

I tested the AVX512 rte_memcpy and found that the performance of OVS-DPDK
is lower than with the AVX2 rte_memcpy.

Results of the VM loop test with OVS-DPDK:

AVX512 is *15* Gbps. perf data:

  0.52 │       vmovdq (%r8,%r10,1),%zmm0
 95.33 │       sub    $0x40,%r9
  0.45 │       add    $0x40,%r8
  0.60 │       vmovdq %zmm0,-0x40(%r8)
  1.84 │       cmp    $0x3f,%r9
       │     ↓ ja     f20
       │       lea    -0x40(%rsi),%r8
  0.15 │       or     $0xffffffffffffffc0,%rsi
  0.21 │       and    $0xffffffffffffffc0,%r8
  0.00 │       lea    0x40(%rsi,%r8,1),%rsi
  0.00 │       vmovdq (%rcx,%rsi,1),%zmm0
  0.22 │       vmovdq %zmm0,(%rdx,%rsi,1)
  0.67 │     ↓ jmpq   c78
       │       mov    -0x128(%rbp),%rdi
       │       rex.R
       │       .byte  0x89
       │       popfq

AVX2 is *18.8* Gbps. perf data:

  0.96 │       add    %r9,%r13
 66.04 │       vmovdq (%rdx),%ymm0
  1.20 │       sub    $0x40,%rdi
  1.53 │       add    $0x40,%rdx
 10.83 │       vmovdq %ymm0,-0x40(%rdx,%r15,1)
  8.64 │       vmovdq -0x20(%rdx),%ymm0
  7.58 │       vmovdq %ymm0,-0x40(%rdx,%r13,1)

Test environment:

dpdk version:   v17.05
ovs version:    2.8.90
qemu version:   QEMU emulator version 2.9.94 (v2.10.0-rc4-dirty)
gcc version:    gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
kernel version: 3.10.0

compile dpdk:

CONFIG_RTE_ENABLE_AVX512=y
export DPDK_DIR=$PWD
export DPDK_TARGET=x86_64-native-linuxapp-gcc
export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
make install T=$DPDK_TARGET DESTDIR=install

compile ovs:

sh boot.sh
./configure CFLAGS="-g -O2" --with-dpdk=$DPDK_BUILD --prefix=/usr --localstatedir=/var --sysconfdir=/etc
make -j
make install

Results of the DPDK test_memcpy_perf test:

avx2:

** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
======= ============== ============== ============== ==============
   Size Cache to cache   Cache to mem   Mem to cache     Mem to mem
(bytes)        (ticks)        (ticks)        (ticks)        (ticks)
------- -------------- -------------- -------------- --------------
========================== 32B aligned ============================
     64      6 -    10     27 -    52     30 -    39     56 -    97
    512     24 -    44    251 -   271    145 -   217    396 -   447
   1024     35 -    78    394 -   433    252 -   319    609 -   670
------- -------------- -------------- -------------- --------------
C    64      3 -     9     28 -    31     29 -    40     55 -    66
C   512     25 -    55    253 -   268    139 -   268    397 -   410
C  1024     32 -    83    394 -   416    250 -   396    612 -   687
=========================== Unaligned =============================
     64      8 -     9     85 -    71     45 -    45    125 -   121
    512     33 -    49    282 -   305    153 -   252    420 -   478
   1024     42 -    83    409 -   491    259 -   389    640 -   748
------- -------------- -------------- -------------- --------------
C    64      4 -     9     42 -    46     39 -    46     76 -    90
C   512     33 -    55    280 -   272    153 -   281    421 -   415
C  1024     41 -    83    407 -   427    258 -   405    578 -   701
======= ============== ============== ============== ==============

avx512:

** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
======= ============== ============== ============== ==============
   Size Cache to cache   Cache to mem   Mem to cache     Mem to mem
(bytes)        (ticks)        (ticks)        (ticks)        (ticks)
------- -------------- -------------- -------------- --------------
========================== 64B aligned ============================
     64      6 -     9     18 -    33     24 -    38     40 -    65
    512     18 -    44    178 -   262    138 -   218    309 -   429
   1024     27 -    79    338 -   430    250 -   322    560 -   674
------- -------------- -------------- -------------- --------------
C    64      3 -     9     18 -    20     23 -    41     39 -    50
C   512     15 -    54    205 -   270    134 -   268    304 -   409
C  1024     24 -    83    371 -   414    242 -   400    550 -   692
=========================== Unaligned =============================
     64      8 -     9     87 -    74     45 -    48    125 -   118
    512     23 -    49    298 -   311    150 -   250    437 -   482
   1024     36 -    83    427 -   505    259 -   406    633 -   754
------- -------------- -------------- -------------- --------------
C    64      4 -     9     42 -    46     39 -    46     76 -    94
C   512     23 -    55    246 -   277    152 -   290    349 -   426
C  1024     38 -    83    398 -   431    258 -   416    634 -   725
======= ============== ============== ============== ==============
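
To put a number on the gap between the two VM loop results quoted earlier (15 Gbps with AVX512, 18.8 Gbps with AVX2), the relative drop can be worked out as in the sketch below. It is only illustrative arithmetic on those two figures; the helper name relative_drop_percent is made up for this example and is not part of DPDK or OVS.

```python
def relative_drop_percent(baseline_gbps, measured_gbps):
    """Return how far `measured_gbps` falls below `baseline_gbps`, in percent.

    Hypothetical helper for illustration only.
    """
    if baseline_gbps <= 0:
        raise ValueError("baseline must be positive")
    return (baseline_gbps - measured_gbps) / baseline_gbps * 100.0


# Throughput figures taken from the VM loop test results in this mail.
avx2_gbps = 18.8
avx512_gbps = 15.0

drop = relative_drop_percent(avx2_gbps, avx512_gbps)
print(f"AVX512 throughput is {drop:.1f}% below AVX2")
```

On these figures the AVX512 build comes out roughly 20% below the AVX2 build in the OVS-DPDK path, even though the test_memcpy_perf tick counts for AVX512 are mostly equal or lower.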