> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Wednesday, 26 June 2024 20.48
>
> On Wed, Jun 26, 2024 at 05:24:04PM +0200, Maxime Coquelin wrote:
> >
> >
> > On 6/26/24 16:58, Stephen Hemminger wrote:
> > > On Wed, 26 Jun 2024 10:37:31 +0200
> > > Maxime Coquelin <maxime.coque...@redhat.com> wrote:
> > >
> > > > On 6/25/24 21:27, Mattias Rönnblom wrote:
> > > > > On Tue, Jun 25, 2024 at 05:29:35PM +0200, Maxime Coquelin wrote:
> > > > > > Hi Mattias,
> > > > > >
> > > > > > On 6/20/24 19:57, Mattias Rönnblom wrote:
> > > > > > > This patch set makes DPDK library, driver, and application code
> > > > > > > use the compiler/libc memcpy() by default when functions in
> > > > > > > <rte_memcpy.h> are invoked.
> > > > > > >
> > > > > > > The various custom DPDK rte_memcpy() implementations may be
> > > > > > > retained by means of a build-time option.
> > > > > > >
> > > > > > > This patch set only makes a difference on x86, PPC and ARM.
> > > > > > > Loongarch and RISCV already use compiler/libc memcpy().
> > > > > >
> > > > > > It indeed makes a difference on x86!
> > > > > >
> > > > > > Just tested latest main with and without your series on an
> > > > > > Intel(R) Xeon(R) Gold 6438N.
> > > > > >
> > > > > > The test is a simple IO loop between a Vhost PMD and a Virtio-user PMD:
> > > > > > # dpdk-testpmd -l 4-6 --file-prefix=virtio1 --no-pci --vdev 'net_virtio_user0,mac=00:01:02:03:04:05,path=./vhost-net,server=1,mrg_rxbuf=1,in_order=1' --single-file-segments -- -i
> > > > > > testpmd> start
> > > > > >
> > > > > > # dpdk-testpmd -l 8-10 --file-prefix=vhost1 --no-pci --vdev 'net_vhost0,iface=vhost-net,client=1' --single-file-segments -- -i
> > > > > > testpmd> start tx_first 32
> > > > > >
> > > > > > Latest main: 14.5 Mpps
> > > > > > Latest main + this series: 10 Mpps
> > > > >
> > > > > I ran the above benchmark on my Raptor Lake desktop (locked to
> > > > > 3.2 GHz). GCC 12.3.0.
> > > > >
> > > > > Core  use_cc_memcpy  Mpps
> > > > > E     false           9.5
> > > > > E     true            9.7
> > > > > P     false          16.4
> > > > > P     true           13.5
> > > > >
> > > > > On the P-cores, there's a significant performance regression,
> > > > > although not as bad as the one you see on your Sapphire Rapids
> > > > > Xeon. On the E-cores, there's actually a slight performance gain.
> > > > >
> > > > > The virtio PMD does not directly invoke rte_memcpy() or anything
> > > > > else from <rte_memcpy.h>, but rather uses memcpy(), so I'm not
> > > > > sure I understand what's going on here. Does the virtio driver
> > > > > delegate some performance-critical task to some module that in
> > > > > turn uses rte_memcpy()?
> > > >
> > > > This is because Vhost is the bottleneck here, not the Virtio driver.
> > > > Indeed, the virtqueue memory belongs to the Virtio driver and the
> > > > descriptor buffers are Virtio's mbufs, so not many memcpy's are done
> > > > there.
> > > >
> > > > Vhost, however, is a heavy memcpy user, as all the descriptor
> > > > buffers are copied to/from its mbufs.
> > >
> > > Would be good to know the size (if small, it is inlining that matters,
> > > or maybe alignment matters), and to have test results for multiple
> > > compiler versions. Ideally, feed results back and update GCC and Clang.
> >
> > I was testing with GCC 11 on RHEL-9:
> > gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
> >
> > I was using the default packet size, 64B packets.
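
As a side note on the size question above: a quick and dirty way to answer it
would be to instrument the batched copy loop with a length histogram for a
test run, roughly like the sketch below. This is purely illustrative and not
part of any patch; it only assumes that the per-copy length (the elem[i].len
field visible in the patch quoted further down) is available at the call site.

#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Throwaway instrumentation sketch: 64-byte-wide buckets; the last bucket
 * collects everything that does not fit in the earlier ones. */
#define COPY_LEN_BUCKETS 10

static uint64_t copy_len_hist[COPY_LEN_BUCKETS];

static inline void
log_copy_len(size_t len)
{
        size_t bucket = len / 64;

        if (bucket >= COPY_LEN_BUCKETS)
                bucket = COPY_LEN_BUCKETS - 1;

        copy_len_hist[bucket]++;
}

void
dump_copy_len_hist(void)
{
        size_t i;

        for (i = 0; i < COPY_LEN_BUCKETS; i++)
                printf("bucket %zu (from %zu bytes): %" PRIu64 " copies\n",
                       i, i * 64, copy_len_hist[i]);
}

Calling log_copy_len(elem[i].len) next to the copy and dumping the histogram
at shutdown would show whether the 64-byte case really dominates outside of
this synthetic test.
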
> >
> > I don't have time to perform these tests, but if you are willing to do
> > it, I'll be happy to review the results.
> >
> > > DPDK doesn't need to be in the optimized C library space.
> >
> > Certainly, but we already have an optimized version currently, so not
> > much to do now on our side. When the C library implementations are on
> > par, we should definitely use them by default.
> >
>
> I think it's not so much about optimized versus non-optimized at this
> point. It's just that cc/libc memcpy sometimes performs better than
> RTE memcpy, and sometimes doesn't.
>
> For virtio, a single memory copy in
> lib/vhost/virtio_net.c:do_data_copy_enqueue()
> is responsible for >95% of the performance regression introduced by
> the cc memcpy patch for small packets on Intel P-cores.
>
> I'm not so sure this performance regression will go away in newer
> compilers. PGO would certainly help, but PGO is a hassle.
>
> One way to fix this issue would be to introduce a custom,
> memcpy()-based packet copying routine. I tried the below patch, with
> the following results:
>
> Raptor Lake @ 3.2 GHz
> GCC 12
>
> 64-byte packets
> Core  Mode              Mpps
> ------------------------------
> E     RTE memcpy         9.5
> E     cc memcpy          9.7
> E     cc memcpy+pktcpy   9.0
>
> P     RTE memcpy        16.4
> P     cc memcpy         13.5
> P     cc memcpy+pktcpy  16.2
>
> 1500-byte packets
> Core  Mode              Mpps
> ------------------------------
> P     RTE memcpy         5.8
> P     cc memcpy          5.9
> P     cc memcpy+pktcpy   5.9
>
> As you can see, most of the regression is eliminated, at the cost of
> worse E-core performance. I didn't look at the generated code, but one
> could suspect heavy use of wide SIMD is to blame, which E-cores don't
> necessarily benefit from.
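
If wide SIMD really is the culprit, one more data point could be a
deliberately narrow variant of the same idea, copying in 8-byte chunks so
the compiler has little incentive to emit wide vector loads/stores. The
following is only a rough, untested sketch with the same calling convention
as the pktcpy() in the patch below; I have not benchmarked it.

#include <stddef.h>
#include <string.h>

/* Untested sketch: same structure as pktcpy() below, but limited to 8-byte
 * chunks, purely as an E-core data point. */
static inline void
pktcpy_narrow(void *restrict in_dst, const void *restrict in_src, size_t len)
{
        unsigned char *dst = in_dst;
        const unsigned char *src = in_src;

        /* Bulk of the copy in fixed-size 8-byte chunks, which the compiler
         * can turn into single 64-bit loads/stores. */
        while (len >= 8) {
                memcpy(dst, src, 8);
                dst += 8;
                src += 8;
                len -= 8;
        }

        /* Tail (0-7 bytes). */
        memcpy(dst, src, len);
}

Comparing it against pktcpy() on the E-cores would help confirm or refute
the wide-SIMD theory.
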
>
> The below prototype assumes the source and destination buffers are
> 16-byte aligned. Does that always hold?

Perhaps always for this specific function; I don't know. Not generally
*always*, but I would guess that in many cases packet copies have 64-byte
aligned pointers, though not sizes that are multiples of 64 bytes. Unless
explicitly stated by the developer, it is unsafe to make assumptions about
alignment.

A future rte_memcpy() function might take flags with explicit alignment
information for optimized copying, as was part of my non-temporal memcpy()
proposal. (The development on this is still on hold.)

> I'm sure one could further improve performance using context-specific
> information, such as packets always being >= 64 bytes. One could also
> consider having special cases, maybe for 64 bytes and MTU-sized
> packets. Such special cases are always a hassle when you try to
> characterize performance, though.

Absolutely!

This got me thinking: these tests are run with 64-byte packets only.
Perhaps branch prediction pollutes the results, by training the branches in
the copy routine for all packets being 64 bytes. You really should be
testing with IMIX or random packet sizes.

In my experience, most internet packets are large (> 1024 bytes, but not
1514 bytes, due to QUIC's conservative max packet size), closely followed
by 64-byte packets (excl. any VLAN tags); only a minority of packets are
medium-sized.

> diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> index 370402d849..7b595a6622 100644
> --- a/lib/vhost/virtio_net.c
> +++ b/lib/vhost/virtio_net.c
> @@ -231,6 +231,26 @@ vhost_async_dma_check_completed(struct virtio_net *dev, int16_t dma_id, uint16_t
>         return nr_copies;
>  }
>
> +static inline void
> +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
> +{
> +        void *dst = __builtin_assume_aligned(in_dst, 16);
> +        const void *src = __builtin_assume_aligned(in_src, 16);
> +
> +        if (len <= 256) {
> +                size_t left;
> +
> +                for (left = len; left >= 32; left -= 32) {
> +                        memcpy(dst, src, 32);
> +                        dst = RTE_PTR_ADD(dst, 32);
> +                        src = RTE_PTR_ADD(src, 32);
> +                }
> +
> +                memcpy(dst, src, left);
> +        } else
> +                memcpy(dst, src, len);
> +}
> +
>  static inline void
>  do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
>         __rte_shared_locks_required(&vq->iotlb_lock)
> @@ -240,7 +260,7 @@ do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
>         int i;
>
>         for (i = 0; i < count; i++) {
> -                rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
> +                pktcpy(elem[i].dst, elem[i].src, elem[i].len);
>                 vhost_log_cache_write_iova(dev, vq, elem[i].log_addr,
>                                            elem[i].len);
>                 PRINT_PACKET(dev, (uintptr_t)elem[i].dst, elem[i].len, 0);
> @@ -257,7 +277,7 @@ do_data_copy_dequeue(struct vhost_virtqueue *vq)
>         int i;
>
>         for (i = 0; i < count; i++)
> -                rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
> +                pktcpy(elem[i].dst, elem[i].src, elem[i].len);
>
>         vq->batch_copy_nb_elems = 0;
>  }
>
> >
> > Maxime
> >
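
PS: To make the flags idea mentioned above a little more concrete, a
flag-taking copy function could look roughly like the sketch below. The
names are invented purely for illustration; no such API exists in DPDK
today, and an eventual design may well look different.

#include <stddef.h>
#include <string.h>

/* Illustration only: hypothetical alignment hints a future copy API could
 * accept, letting the implementation pick a specialized path. */
#define COPY_HINT_SRC_ALIGN16 (1u << 0) /* caller guarantees 16-byte aligned src */
#define COPY_HINT_DST_ALIGN16 (1u << 1) /* caller guarantees 16-byte aligned dst */

static inline void
copy_with_hints(void *dst, const void *src, size_t len, unsigned int hints)
{
        /* Promote the caller's guarantees into something the compiler can
         * exploit when inlining the copy (GCC/Clang builtin, as used in the
         * patch above). */
        if (hints & COPY_HINT_SRC_ALIGN16)
                src = __builtin_assume_aligned(src, 16);
        if (hints & COPY_HINT_DST_ALIGN16)
                dst = __builtin_assume_aligned(dst, 16);

        memcpy(dst, src, len);
}

A call site like the vhost batched copy, which knows (or can arrange)
16-byte alignment, would pass both hints; generic callers would pass 0 and
get plain memcpy() behavior.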