On Thu, 27 Jun 2024 13:06:22 +0200 Mattias Rönnblom <hof...@lysator.liu.se> wrote:
> On Wed, Jun 26, 2024 at 10:16:06PM +0200, Morten Brørup wrote:
> > > From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> > > Sent: Wednesday, 26 June 2024 20.48
> > >
> > > On Wed, Jun 26, 2024 at 05:24:04PM +0200, Maxime Coquelin wrote:
> > > >
> > > > On 6/26/24 16:58, Stephen Hemminger wrote:
> > > > > On Wed, 26 Jun 2024 10:37:31 +0200
> > > > > Maxime Coquelin <maxime.coque...@redhat.com> wrote:
> > > > >
> > > > > > On 6/25/24 21:27, Mattias Rönnblom wrote:
> > > > > > > On Tue, Jun 25, 2024 at 05:29:35PM +0200, Maxime Coquelin wrote:
> > > > > > > > Hi Mattias,
> > > > > > > >
> > > > > > > > On 6/20/24 19:57, Mattias Rönnblom wrote:
> > > > > > > > > This patch set makes DPDK library, driver, and application
> > > > > > > > > code use the compiler/libc memcpy() by default when
> > > > > > > > > functions in <rte_memcpy.h> are invoked.
> > > > > > > > >
> > > > > > > > > The various custom DPDK rte_memcpy() implementations may be
> > > > > > > > > retained by means of a build-time option.
> > > > > > > > >
> > > > > > > > > This patch set only makes a difference on x86, PPC and ARM.
> > > > > > > > > Loongarch and RISCV already use compiler/libc memcpy().
> > > > > > > >
> > > > > > > > It indeed makes a difference on x86!
> > > > > > > >
> > > > > > > > Just tested latest main with and without your series on an
> > > > > > > > Intel(R) Xeon(R) Gold 6438N.
> > > > > > > > The test is a simple IO loop between a Vhost PMD and a Virtio-user PMD:
> > > > > > > > # dpdk-testpmd -l 4-6 --file-prefix=virtio1 --no-pci --vdev 'net_virtio_user0,mac=00:01:02:03:04:05,path=./vhost-net,server=1,mrg_rxbuf=1,in_order=1' --single-file-segments -- -i
> > > > > > > > testpmd> start
> > > > > > > >
> > > > > > > > # dpdk-testpmd -l 8-10 --file-prefix=vhost1 --no-pci --vdev 'net_vhost0,iface=vhost-net,client=1' --single-file-segments -- -i
> > > > > > > > testpmd> start tx_first 32
> > > > > > > >
> > > > > > > > Latest main: 14.5Mpps
> > > > > > > > Latest main + this series: 10Mpps
> > > > > > >
> > > > > > > I ran the above benchmark on my Raptor Lake desktop (locked to
> > > > > > > 3.2 GHz). GCC 12.3.0.
> > > > > > >
> > > > > > > Core  use_cc_memcpy  Mpps
> > > > > > > E     false           9.5
> > > > > > > E     true            9.7
> > > > > > > P     false          16.4
> > > > > > > P     true           13.5
> > > > > > >
> > > > > > > On the P-cores, there's a significant performance regression,
> > > > > > > although not as bad as the one you see on your Sapphire Rapids
> > > > > > > Xeon. On the E-cores, there's actually a slight performance
> > > > > > > gain.
> > > > > > >
> > > > > > > The virtio PMD does not directly invoke rte_memcpy() or
> > > > > > > anything else from <rte_memcpy.h>, but rather uses memcpy(),
> > > > > > > so I'm not sure I understand what's going on here. Does the
> > > > > > > virtio driver delegate some performance-critical task to some
> > > > > > > module that in turn uses rte_memcpy()?
> > > > > >
> > > > > > This is because Vhost is the bottleneck here, not the Virtio
> > > > > > driver. Indeed, the virtqueue memory belongs to the Virtio
> > > > > > driver and the descriptor buffers are Virtio's mbufs, so not
> > > > > > many memcpy's are done there.
> > > > > >
> > > > > > Vhost, however, is a heavy memcpy user, as all the descriptor
> > > > > > buffers are copied to/from its mbufs.
> > > > >
> > > > > Would be good to know the size (if small, it is inlining that
> > > > > matters, or maybe alignment matters), and to have test results
> > > > > for multiple compiler versions. Ideally, feed results back and
> > > > > update GCC and Clang.
> > > >
> > > > I was testing with GCC 11 on RHEL-9:
> > > > gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
> > > >
> > > > I was using the default one, 64B packets.
> > > >
> > > > I don't have time to perform these tests, but if you are willing
> > > > to do it I'll be happy to review the results.
> > > >
> > > > > DPDK doesn't need to be in the optimized C library space.
> > > >
> > > > Certainly, but we already have an optimized version currently, so
> > > > not much to do now on our side. When C library implementations are
> > > > on par, we should definitely use them by default.
> > >
> > > I think it's not so much about optimized versus non-optimized at
> > > this point. It's just that cc/libc memcpy sometimes performs better
> > > than RTE memcpy, and sometimes doesn't.
> > >
> > > For virtio, a single memory copy in
> > > lib/vhost/virtio_net.c:do_data_copy_enqueue()
> > > is responsible for >95% of the performance regression introduced by
> > > the cc memcpy patch for small packets on Intel P-cores.
> > >
> > > I'm not so sure this performance regression will go away in newer
> > > compilers. PGO would certainly help, but PGO is a hassle.
> > >
> > > One way to fix this issue would be to introduce a custom,
> > > memcpy()-based packet copying routine.
> > > I tried the below patch, with the following results:
> > >
> > > Raptor Lake @ 3.2 GHz
> > > GCC 12
> > >
> > > 64 byte packets
> > > Core  Mode              Mpps
> > > ----------------------------
> > > E     RTE memcpy         9.5
> > > E     cc memcpy          9.7
> > > E     cc memcpy+pktcpy   9.0
> > >
> > > P     RTE memcpy        16.4
> > > P     cc memcpy         13.5
> > > P     cc memcpy+pktcpy  16.2
> > >
> > > 1500 byte packets
> > > Core  Mode              Mpps
> > > ----------------------------
> > > P     RTE memcpy         5.8
> > > P     cc memcpy          5.9
> > > P     cc memcpy+pktcpy   5.9
> > >
> > > As you can see, most of the regression is eliminated, at the cost of
> > > worse E-core performance. I didn't look at the generated code, but
> > > one could suspect heavy use of wide SIMD is to blame, which E-cores
> > > don't necessarily benefit from.
> > >
> > > The below prototype assumes the source and destination buffers are
> > > 16-byte aligned. Does that always hold?
> >
> > Perhaps always for this specific function; I don't know.
> > Not generally *always*, but I guess in many cases packet copies would
> > have 64-byte aligned pointers, but not sizes.
> > Unless explicitly stated by the developer, it is unsafe to make
> > assumptions about alignment.
>
> I meant always (for every packet) in DPDK virtio net.
>
> > A future rte_memcpy() function might take flags with explicit
> > alignment information for optimized copying, as was part of my
> > non-temporal memcpy(). (The development on this is still on hold.)
>
> There is already a mechanism to express alignment in GCC-compatible
> compilers. See the below patch.
>
> > > I'm sure one could further improve performance using
> > > context-specific information, such as packets always being >= 64
> > > bytes. One could also consider having special cases, maybe for 64
> > > byte and MTU-sized packets. Such special cases are always a hassle
> > > when you try to characterize performance, though.
> >
> > Absolutely!
> >
> > This got me thinking:
> > These tests are run with 64 byte packets only.
> > Perhaps branch prediction pollutes the results, by optimizing
> > branches in the copy routine for all packets being 64 bytes.
> >
> > You really should be testing with IMIX or random packet sizes.
>
> Sure, and it should also be a real app, not testpmd, on top of virtio.
>
> > In my experience, most internet packets are large (> 1024 bytes, but
> > not 1514 bytes, due to QUIC's conservative max packet size), closely
> > followed by 64 byte (excl. any VLAN tags) packets; only a minority of
> > packets are medium-sized.
>
> > > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> > > index 370402d849..7b595a6622 100644
> > > --- a/lib/vhost/virtio_net.c
> > > +++ b/lib/vhost/virtio_net.c
> > > @@ -231,6 +231,26 @@ vhost_async_dma_check_completed(struct virtio_net *dev, int16_t dma_id, uint16_t
> > >  	return nr_copies;
> > >  }
> > >
> > > +static inline void
> > > +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
> > > +{
> > > +	void *dst = __builtin_assume_aligned(in_dst, 16);
> > > +	const void *src = __builtin_assume_aligned(in_src, 16);
> > > +
> > > +	if (len <= 256) {
> > > +		size_t left;
> > > +
> > > +		for (left = len; left >= 32; left -= 32) {
> > > +			memcpy(dst, src, 32);
> > > +			dst = RTE_PTR_ADD(dst, 32);
> > > +			src = RTE_PTR_ADD(src, 32);
> > > +		}
> > > +
> > > +		memcpy(dst, src, left);
> > > +	} else
> > > +		memcpy(dst, src, len);
> > > +}
> > > +

Is alignment an optimization or a requirement? I can think of cases in
tunneling where the packet was received into an mbuf (aligned), but then
the application prepends a header, making the packet unaligned before
sending.