On 6/26/24 16:58, Stephen Hemminger wrote:
On Wed, 26 Jun 2024 10:37:31 +0200
Maxime Coquelin <maxime.coque...@redhat.com> wrote:

On 6/25/24 21:27, Mattias Rönnblom wrote:
On Tue, Jun 25, 2024 at 05:29:35PM +0200, Maxime Coquelin wrote:
Hi Mattias,

On 6/20/24 19:57, Mattias Rönnblom wrote:
This patch set makes DPDK library, driver, and application code use the
compiler/libc memcpy() by default when functions in <rte_memcpy.h> are
invoked.

The various custom DPDK rte_memcpy() implementations may be retained
by means of a build-time option.

This patch set only makes a difference on x86, PPC and ARM. Loongarch
and RISC-V already use the compiler/libc memcpy().
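The switch described above might look like the following sketch. This is illustrative only: the series exposes a `use_cc_memcpy` build setting (visible in the benchmark table later in this thread), but the C macro name below is an assumption, not the actual DPDK code.

```c
#include <stddef.h>
#include <string.h>

/* Hedged sketch: when the build-time option selects the compiler/libc
 * copy, rte_memcpy() becomes a thin inline wrapper around memcpy();
 * otherwise the custom per-architecture implementation is retained.
 * The macro name RTE_USE_CC_MEMCPY is assumed for illustration. */
#define RTE_USE_CC_MEMCPY 1

#ifdef RTE_USE_CC_MEMCPY
static inline void *rte_memcpy(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}
#else
/* hand-optimized, architecture-specific rte_memcpy() retained here */
#endif
```

The wrapper compiles away entirely, so the choice between the two worlds is made once at build time, with no runtime dispatch cost.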

It indeed makes a difference on x86!

Just tested latest main with and without your series on
Intel(R) Xeon(R) Gold 6438N.

The test is a simple IO loop between a Vhost PMD and a Virtio-user PMD:
# dpdk-testpmd -l 4-6 --file-prefix=virtio1 --no-pci \
    --vdev 'net_virtio_user0,mac=00:01:02:03:04:05,path=./vhost-net,server=1,mrg_rxbuf=1,in_order=1' \
    --single-file-segments -- -i
testpmd> start

# dpdk-testpmd -l 8-10 --file-prefix=vhost1 --no-pci \
    --vdev 'net_vhost0,iface=vhost-net,client=1' \
    --single-file-segments -- -i
testpmd> start tx_first 32

Latest main: 14.5Mpps
Latest main + this series: 10Mpps

I ran the above benchmark on my Raptor Lake desktop (locked to 3.2
GHz). GCC 12.3.0.

Core use_cc_memcpy Mpps
E    false         9.5
E    true          9.7
P    false         16.4
P    true          13.5

On the P-cores, there's a significant performance regression, although
not as bad as the one you see on your Sapphire Rapids Xeon. On the
E-cores, there's actually a slight performance gain.

The virtio PMD does not directly invoke rte_memcpy() or anything else
from <rte_memcpy.h>, but rather uses memcpy(), so I'm not sure I
understand what's going on here. Does the virtio driver delegate some
performance-critical task to some module that in turn uses
rte_memcpy()?

This is because Vhost is the bottleneck here, not the Virtio driver.
Indeed, the virtqueue memory belongs to the Virtio driver and the
descriptor buffers are Virtio's mbufs, so not many memcpy operations
happen there.

Vhost, however, is a heavy memcpy user, as all the descriptor buffers
are copied to/from its mbufs.
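That bottleneck can be pictured with a minimal sketch (illustrative only, not the actual vhost code; the struct and function names here are made up): on the enqueue path, each guest descriptor buffer is copied into a host mbuf, so vhost throughput tracks memcpy performance directly.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a vhost descriptor pointing at a guest
 * buffer; names are not DPDK's. */
struct guest_desc {
	const void *addr;   /* guest buffer (already translated to host VA) */
	uint32_t len;       /* payload length, e.g. 64 for 64B packets */
};

/* Copy one descriptor's payload into mbuf data. In DPDK proper this
 * copy is done with rte_memcpy(), which is exactly the call the
 * series redirects to the compiler/libc memcpy(). */
static uint32_t copy_desc_to_mbuf(const struct guest_desc *d,
				  uint8_t *mbuf_data)
{
	memcpy(mbuf_data, d->addr, d->len);
	return d->len;
}
```

With one such copy per packet in each direction, even a small per-copy cost difference is multiplied by the packet rate, which is consistent with the Mpps deltas reported above.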

It would be good to know the copy size (for small copies, inlining is
what matters most; alignment may also matter), and to have test results
for multiple compiler versions. Ideally, feed the results back and get
GCC and Clang updated.

I was testing with GCC 11 on RHEL-9:
gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)

I was using the default packet size, 64B.
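For copies this small, the inlining question above is the crux: with a compile-time-constant length, the compiler typically expands memcpy() into a handful of loads/stores, while a runtime length usually becomes a library call whose dispatch overhead can show up at 64B packet rates. A hedged illustration (generic C, not DPDK code):

```c
#include <stddef.h>
#include <string.h>

/* Constant 64B length: compilers generally inline this into a few
 * vector or scalar moves, with no call overhead. */
void copy_64b(void *dst, const void *src)
{
	memcpy(dst, src, 64);
}

/* Runtime length: usually compiled as a call into the libc memcpy(),
 * which must branch on size/alignment before copying. */
void copy_n(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
}
```

This is one reason results can differ between compiler versions and core types: how aggressively the constant-size case is inlined, and which libc dispatch path the variable-size case hits, both vary.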

I don't have time to perform these tests, but if you are willing to do
it I'll be happy to review the results.

DPDK doesn't need to be in the business of optimizing the C library.

Certainly, but we already have an optimized version currently, so not
much to do now on our side. When the C library implementations are on
par, we should definitely use them by default.

Maxime
