On Wed, 26 Jun 2024 10:37:31 +0200
Maxime Coquelin <maxime.coque...@redhat.com> wrote:

> On 6/25/24 21:27, Mattias Rönnblom wrote:
> > On Tue, Jun 25, 2024 at 05:29:35PM +0200, Maxime Coquelin wrote:  
> >> Hi Mattias,
> >>
> >> On 6/20/24 19:57, Mattias Rönnblom wrote:  
> >>> This patch set makes DPDK library, driver, and application code use the
> >>> compiler/libc memcpy() by default when functions in <rte_memcpy.h> are
> >>> invoked.
> >>>
> >>> The various custom DPDK rte_memcpy() implementations may be retained
> >>> by means of a build-time option.
> >>>
> >>> This patch set only makes a difference on x86, PPC and ARM. Loongarch
> >>> and RISCV already use the compiler/libc memcpy().  
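
For context, a minimal sketch of how such a build-time switch in <rte_memcpy.h>
could look; RTE_USE_CC_MEMCPY is an assumed macro name standing in for whatever
the use_cc_memcpy build option maps to, not necessarily what the series
actually does:

    /* Hypothetical sketch: fall back to the compiler/libc memcpy() unless
     * the custom implementation is requested at build time. */
    #include <string.h>

    #ifdef RTE_USE_CC_MEMCPY
    static inline void *
    rte_memcpy(void *dst, const void *src, size_t n)
    {
            /* The compiler can inline small, constant-size copies and
             * call libc memcpy() for the rest. */
            return memcpy(dst, src, n);
    }
    #else
    /* The hand-written, architecture-specific copy routines (e.g. the
     * SSE/AVX variants on x86) would remain here, as before. */
    #endif
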
> >>
> >> It indeed makes a difference on x86!
> >>
> >> Just tested latest main with and without your series on
> >> Intel(R) Xeon(R) Gold 6438N.
> >>
> >> The test is a simple IO loop between a Vhost PMD and a Virtio-user PMD:
> >> # dpdk-testpmd -l 4-6   --file-prefix=virtio1 --no-pci --vdev 
> >> 'net_virtio_user0,mac=00:01:02:03:04:05,path=./vhost-net,server=1,mrg_rxbuf=1,in_order=1'
> >> --single-file-segments -- -i  
> >> testpmd> start  
> >>
> >> # dpdk-testpmd -l 8-10   --file-prefix=vhost1 --no-pci --vdev
> >> 'net_vhost0,iface=vhost-net,client=1'   --single-file-segments -- -i  
> >> testpmd> start tx_first 32  
> >>
> >> Latest main: 14.5Mpps
> >> Latest main + this series: 10Mpps
> >>  
> > 
> > I ran the above benchmark on my Raptor Lake desktop (locked to 3.2
> > GHz). GCC 12.3.0.
> > 
> > Core use_cc_memcpy Mpps
> > E    false         9.5
> > E    true          9.7
> > P    false         16.4
> > P    true          13.5
> > 
> > On the P-cores, there's a significant performance regression, although
> > not as bad as the one you see on your Sapphire Rapids Xeon. On the
> > E-cores, there's actually a slight performance gain.
> > 
> > The virtio PMD does not directly invoke rte_memcpy() or anything else
> > from <rte_memcpy.h>, but rather uses memcpy(), so I'm not sure I
> > understand what's going on here. Does the virtio driver delegate some
> > performance-critical task to some module that in turn uses
> > rte_memcpy()?  
> 
> This is because Vhost is the bottleneck here, not the Virtio driver.
> Indeed, the virtqueue memory belongs to the Virtio driver and the
> descriptor buffers are Virtio's mbufs, so not many memcpy calls are done
> there.
> 
> Vhost, however, is a heavy memcpy user, as all the descriptor buffers
> are copied to/from its mbufs.
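
To make the copy pattern concrete, here is a simplified, illustrative sketch of
the per-descriptor copy the vhost datapath performs on the RX side (not the
actual lib/vhost code; copy_desc_to_mbuf, desc_addr and desc_len are
stand-ins):

    /* Illustrative only: each guest descriptor buffer is copied into a host
     * mbuf via rte_memcpy(), which is the call this series redirects to the
     * compiler/libc memcpy(). */
    #include <stdint.h>
    #include <rte_memcpy.h>
    #include <rte_mbuf.h>

    static void
    copy_desc_to_mbuf(struct rte_mbuf *m, const void *desc_addr,
                      uint16_t desc_len)
    {
            rte_memcpy(rte_pktmbuf_mtod(m, void *), desc_addr, desc_len);
            m->data_len = desc_len;        /* single-mbuf case assumed */
            m->pkt_len = desc_len;
    }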

Would be good to know the copy sizes (if they are small, it is inlining that
matters; otherwise maybe alignment matters), and to have test results for
multiple compiler versions. Ideally, feed the results back and get GCC and
Clang updated.
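
One way to get those size numbers would be to temporarily wrap the copy path
with a small histogram, along these lines (a hypothetical instrumentation
sketch, nothing that exists in the tree):

    /* Hypothetical instrumentation: bucket copy sizes by power of two so we
     * can see whether the hot copies are small (inlining-sensitive) or
     * large (alignment/throughput-sensitive). */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static uint64_t copy_size_hist[16]; /* bucket i counts copies of ~2^i bytes */

    static inline void *
    traced_memcpy(void *dst, const void *src, size_t n)
    {
            unsigned int bucket = 0;
            size_t v = n;

            while ((v >>= 1) != 0 && bucket < 15)
                    bucket++;
            copy_size_hist[bucket]++;

            return memcpy(dst, src, n);
    }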

DPDK doesn't need to be in the business of optimizing the C library.
