On 7/25/23 08:55, Jason Wang wrote: > On Thu, Jul 20, 2023 at 9:26 PM Ilya Maximets <i.maxim...@ovn.org> wrote: >> >> On 7/20/23 09:37, Jason Wang wrote: >>> On Thu, Jul 6, 2023 at 4:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote: >>>> >>>> AF_XDP is a network socket family that allows communication directly >>>> with the network device driver in the kernel, bypassing most or all >>>> of the kernel networking stack. In essence, the technology is >>>> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native >>>> and works with any network interfaces without driver modifications. >>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't >>>> require access to character devices or unix sockets. Only access to >>>> the network interface itself is necessary. >>>> >>>> This patch implements a network backend that communicates with the >>>> kernel by creating an AF_XDP socket. A chunk of userspace memory >>>> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx, >>>> Fill and Completion) are placed in that memory along with a pool of >>>> memory buffers for the packet data. Data transmission is done by >>>> allocating one of the buffers, copying packet data into it and >>>> placing the pointer into Tx ring. After transmission, device will >>>> return the buffer via Completion ring. On Rx, device will take >>>> a buffer from a pre-populated Fill ring, write the packet data into >>>> it and place the buffer into Rx ring. >>>> >>>> AF_XDP network backend takes on the communication with the host >>>> kernel and the network interface and forwards packets to/from the >>>> peer device in QEMU. >>>> >>>> Usage example: >>>> >>>> -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C >>>> -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1 >>>> >>>> XDP program bridges the socket with a network interface. It can be >>>> attached to the interface in 2 different modes: >>>> >>>> 1. 
skb - this mode should work for any interface and doesn't require >>>> driver support. With a caveat of lower performance. >>>> >>>> 2. native - this does require support from the driver and allows to >>>> bypass skb allocation in the kernel and potentially use >>>> zero-copy while getting packets in/out userspace. >>>> >>>> By default, QEMU will try to use native mode and fall back to skb. >>>> Mode can be forced via 'mode' option. To force 'copy' even in native >>>> mode, use 'force-copy=on' option. This might be useful if there is >>>> some issue with the driver. >>>> >>>> Option 'queues=N' allows to specify how many device queues should >>>> be open. Note that all the queues that are not open are still >>>> functional and can receive traffic, but it will not be delivered to >>>> QEMU. So, the number of device queues should generally match the >>>> QEMU configuration, unless the device is shared with something >>>> else and the traffic re-direction to appropriate queues is correctly >>>> configured on a device level (e.g. with ethtool -N). >>>> 'start-queue=M' option can be used to specify from which queue id >>>> QEMU should start configuring 'N' queues. It might also be necessary >>>> to use this option with certain NICs, e.g. MLX5 NICs. See the docs >>>> for examples. >>>> >>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN >>>> capabilities in order to load default XSK/XDP programs to the >>>> network interface and configure BPF maps. It is possible, however, >>>> to run with no capabilities. For that to work, an external process >>>> with admin capabilities will need to pre-load default XSK program, >>>> create AF_XDP sockets and pass their file descriptors to QEMU process >>>> on startup via 'sock-fds' option. Network backend will need to be >>>> configured with 'inhibit=on' to avoid loading of the program. >>>> QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue >>>> or CAP_IPC_LOCK. 
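For reference, the 32 MB per-queue figure quoted above appears to follow directly from the UMEM sizing in af_xdp_umem_create() further down in the patch. Below is a minimal sketch of that arithmetic. The SKETCH_* constants and sketch_* function names are made up for illustration, and the numeric values are my assumption of what the libxdp defaults in <xdp/xsk.h> are (2048 descriptors per ring, 4096-byte frames), hard-coded so the snippet builds without libxdp:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed values of the libxdp defaults from <xdp/xsk.h>.  They are
 * hard-coded here (hypothetical SKETCH_* names) so that the sketch
 * compiles without libxdp installed.
 */
#define SKETCH_RING_PROD_NUM_DESCS 2048u /* XSK_RING_PROD__DEFAULT_NUM_DESCS */
#define SKETCH_RING_CONS_NUM_DESCS 2048u /* XSK_RING_CONS__DEFAULT_NUM_DESCS */
#define SKETCH_FRAME_SIZE          4096u /* XSK_UMEM__DEFAULT_FRAME_SIZE */

/* Number of descriptors if all 4 rings (rx, tx, cq, fq) are full. */
static uint64_t sketch_umem_descs(void)
{
    return (uint64_t)(SKETCH_RING_PROD_NUM_DESCS
                      + SKETCH_RING_CONS_NUM_DESCS) * 2;
}

/* Size in bytes of the packet buffer area allocated for one queue. */
static uint64_t sketch_umem_bytes(void)
{
    return sketch_umem_descs() * SKETCH_FRAME_SIZE;
}

/* Sanity check: with the assumed defaults this is exactly 32 MiB. */
static void sketch_check(void)
{
    assert(sketch_umem_descs() == 8192);
    assert(sketch_umem_bytes() == 32u * 1024 * 1024);
}
```

If the defaults differ in a given libxdp build, the locked memory requirement would scale accordingly; the rings themselves may add a little on top of this.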
>>>> >>>> Alternatively, the file descriptor for 'xsks_map' can be passed via >>>> 'xsks-map-fd=N' option instead of passing socket file descriptors. >>>> That will additionally require CAP_NET_RAW on QEMU side. This is >>>> useful, because 'sock-fds' may not be available with older libxdp. >>>> 'sock-fds' requires libxdp >= 1.4.0. >>>> >>>> There are few performance challenges with the current network backends. >>>> >>>> First is that they do not support IO threads. This means that data >>>> path is handled by the main thread in QEMU and may slow down other >>>> work or may be slowed down by some other work. This also means that >>>> taking advantage of multi-queue is generally not possible today. >>>> >>>> Another thing is that data path is going through the device emulation >>>> code, which is not really optimized for performance. The fastest >>>> "frontend" device is virtio-net. But it's not optimized for heavy >>>> traffic either, because it expects such use-cases to be handled via >>>> some implementation of vhost (user, kernel, vdpa). In practice, we >>>> have virtio notifications and rcu lock/unlock on a per-packet basis >>>> and not very efficient accesses to the guest memory. Communication >>>> channels between backend and frontend devices do not allow passing >>>> more than one packet at a time as well. >>>> >>>> Some of these challenges can be avoided in the future by adding better >>>> batching into device emulation or by implementing vhost-af-xdp variant. >>>> >>>> There are also a few kernel limitations. AF_XDP sockets do not >>>> support any kinds of checksum or segmentation offloading. Buffers >>>> are limited to a page size (4K), i.e. MTU is limited. Multi-buffer >>>> support implementation for AF_XDP is in progress, but not ready yet. >>>> Also, transmission in all non-zero-copy modes is synchronous, i.e. >>>> done in a syscall. That doesn't allow high packet rates on virtual >>>> interfaces. 
>>>> >>>> However, keeping in mind all of these challenges, current implementation >>>> of the AF_XDP backend shows a decent performance while running on top >>>> of a physical NIC with zero-copy support. >>>> >>>> Test setup: >>>> >>>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card. >>>> Network backend is configured to open the NIC directly in native mode. >>>> The driver supports zero-copy. NIC is configured to use 1 queue. >>>> >>>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd >>>> for PPS testing. >>>> >>>> iperf3 result: >>>> TCP stream : 19.1 Gbps >>>> >>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results: >>>> Tx only : 3.4 Mpps >>>> Rx only : 2.0 Mpps >>>> L2 FWD Loopback : 1.5 Mpps >>>> >>>> In skb mode the same setup shows much lower performance, similar to >>>> the setup where pair of physical NICs is replaced with veth pair: >>>> >>>> iperf3 result: >>>> TCP stream : 9 Gbps >>>> >>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results: >>>> Tx only : 1.2 Mpps >>>> Rx only : 1.0 Mpps >>>> L2 FWD Loopback : 0.7 Mpps >>>> >>>> Results in skb mode or over the veth are close to results of a tap >>>> backend with vhost=on and disabled segmentation offloading bridged >>>> with a NIC. >>>> >>>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org> >>> >>> Looks good overall, see few comments inline. >> >> Thanks for review! >> >>> >>>> --- >>>> >>>> Version 2: >>>> >>>> - Added support for running with no capabilities by passing >>>> pre-created AF_XDP socket file descriptors via 'sock-fds' option. >>>> Conditionally compiled because it requires unreleased libxdp 1.4.0. >>>> The last restriction is having 32 MB of RLIMIT_MEMLOCK per queue. >>>> >>>> - Refined and extended documentation. 
>>>> >>>> >>>> MAINTAINERS | 4 + >>>> hmp-commands.hx | 2 +- >>>> meson.build | 19 + >>>> meson_options.txt | 2 + >>>> net/af-xdp.c | 570 ++++++++++++++++++ >>>> net/clients.h | 5 + >>>> net/meson.build | 3 + >>>> net/net.c | 6 + >>>> qapi/net.json | 60 +- >>>> qemu-options.hx | 83 ++- >>>> .../ci/org.centos/stream/8/x86_64/configure | 1 + >>>> scripts/meson-buildoptions.sh | 3 + >>>> tests/docker/dockerfiles/debian-amd64.docker | 1 + >>>> 13 files changed, 756 insertions(+), 3 deletions(-) >>>> create mode 100644 net/af-xdp.c >>>> >>>> diff --git a/MAINTAINERS b/MAINTAINERS >>>> index 7164cf55a1..80d4ba4004 100644 >>>> --- a/MAINTAINERS >>>> +++ b/MAINTAINERS >>>> @@ -2929,6 +2929,10 @@ W: http://info.iet.unipi.it/~luigi/netmap/ >>>> S: Maintained >>>> F: net/netmap.c >>>> >>>> +AF_XDP network backend >>>> +R: Ilya Maximets <i.maxim...@ovn.org> >>>> +F: net/af-xdp.c >>>> + >>>> Host Memory Backends >>>> M: David Hildenbrand <da...@redhat.com> >>>> M: Igor Mammedov <imamm...@redhat.com> >>>> diff --git a/hmp-commands.hx b/hmp-commands.hx >>>> index 2cbd0f77a0..af9ffe4681 100644 >>>> --- a/hmp-commands.hx >>>> +++ b/hmp-commands.hx >>>> @@ -1295,7 +1295,7 @@ ERST >>>> { >>>> .name = "netdev_add", >>>> .args_type = "netdev:O", >>>> - .params = >>>> "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|vhost-user" >>>> + .params = >>>> "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|af-xdp|vhost-user" >>>> #ifdef CONFIG_VMNET >>>> "|vmnet-host|vmnet-shared|vmnet-bridged" >>>> #endif >>>> diff --git a/meson.build b/meson.build >>>> index a9ba0bfab3..1f8772ea5d 100644 >>>> --- a/meson.build >>>> +++ b/meson.build >>>> @@ -1891,6 +1891,18 @@ if libbpf.found() and not cc.links(''' >>>> endif >>>> endif >>>> >>>> +# libxdp >>>> +libxdp = dependency('libxdp', required: get_option('af_xdp'), method: >>>> 'pkg-config') >>>> +if libxdp.found() and \ >>>> + not (libbpf.found() and libbpf.version().version_compare('>=0.7')) >>>> + libxdp = not_found >>>> + if 
get_option('af_xdp').enabled() >>>> + error('af-xdp support requires libbpf version >= 0.7') >>> >>> Can we simply limit this to 1.4? >> >> This is a check for libbpf, not libxdp. Or do you think there is no need >> to check libbpf version if we request libxdp version high enough? >> Users may still break the build by installing old libbpf manually even if >> distributions ship more modern versions. >> >> Or do you mean limit the libxdp version to 1.4 in order to avoid conditional >> on HAVE_XSK_UMEM__CREATE_WITH_FD ? > > Yes. > >> The problem with that is that libxdp 1.4 is a week+ old, so not available in >> any distribution, AFAIK. Not sure how big of a problem that is though. > > It doesn't matter as this is a brand new backend, it would simplify > future maintenance if we can get rid of any HAVE_XXX macros.
OK, makes sense. I'll require libxdp 1.4 and remove all ifdefs. I'll also remove xsks-map-fd configuration, since it will always be possible to just use sock-fds instead. We may add it back later, but it requires extra privileges (NET_RAW), so I'm not sure there is much value in it.
> >> >>> >>> >>>> + else >>>> + warning('af-xdp support requires libbpf version >= 0.7, disabling') >>>> + endif >>>> +endif >>>> + >>>> # libdw >>>> libdw = not_found >>>> if not get_option('libdw').auto() or \ >>>> @@ -2112,6 +2124,12 @@ config_host_data.set('CONFIG_HEXAGON_IDEF_PARSER', >>>> get_option('hexagon_idef_pars >>>> config_host_data.set('CONFIG_LIBATTR', have_old_libattr) >>>> config_host_data.set('CONFIG_LIBCAP_NG', libcap_ng.found()) >>>> config_host_data.set('CONFIG_EBPF', libbpf.found()) >>>> +config_host_data.set('CONFIG_AF_XDP', libxdp.found()) >>>> +if libxdp.found() >>>> + config_host_data.set('HAVE_XSK_UMEM__CREATE_WITH_FD', >>>> + cc.has_function('xsk_umem__create_with_fd', >>>> + dependencies: libxdp)) >>>> +endif >>>> config_host_data.set('CONFIG_LIBDAXCTL', libdaxctl.found()) >>>> config_host_data.set('CONFIG_LIBISCSI', libiscsi.found()) >>>> config_host_data.set('CONFIG_LIBNFS', libnfs.found()) >>>> @@ -4285,6 +4303,7 @@ summary_info += {'PVRDMA support': have_pvrdma} >>>> summary_info += {'fdt support': fdt_opt == 'disabled' ? 
false : >>>> fdt_opt} >>>> summary_info += {'libcap-ng support': libcap_ng} >>>> summary_info += {'bpf support': libbpf} >>>> +summary_info += {'AF_XDP support': libxdp} >>>> summary_info += {'rbd support': rbd} >>>> summary_info += {'smartcard support': cacard} >>>> summary_info += {'U2F support': u2f} >>>> diff --git a/meson_options.txt b/meson_options.txt >>>> index bbb5c7e886..f4e950ce6a 100644 >>>> --- a/meson_options.txt >>>> +++ b/meson_options.txt >>>> @@ -120,6 +120,8 @@ option('avx512bw', type: 'feature', value: 'auto', >>>> option('keyring', type: 'feature', value: 'auto', >>>> description: 'Linux keyring support') >>>> >>>> +option('af_xdp', type : 'feature', value : 'auto', >>>> + description: 'AF_XDP network backend support') >>>> option('attr', type : 'feature', value : 'auto', >>>> description: 'attr/xattr support') >>>> option('auth_pam', type : 'feature', value : 'auto', >>>> diff --git a/net/af-xdp.c b/net/af-xdp.c >>>> new file mode 100644 >>>> index 0000000000..265ba6b12e >>>> --- /dev/null >>>> +++ b/net/af-xdp.c >>>> @@ -0,0 +1,570 @@ >>>> +/* >>>> + * AF_XDP network backend. >>>> + * >>>> + * Copyright (c) 2023 Red Hat, Inc. >>>> + * >>>> + * Authors: >>>> + * Ilya Maximets <i.maxim...@ovn.org> >>>> + * >>>> + * This work is licensed under the terms of the GNU GPL, version 2 or >>>> later. >>>> + * See the COPYING file in the top-level directory. 
>>>> + */ >>>> + >>>> + >>>> +#include "qemu/osdep.h" >>>> +#include <bpf/bpf.h> >>>> +#include <inttypes.h> >>>> +#include <linux/if_link.h> >>>> +#include <linux/if_xdp.h> >>>> +#include <net/if.h> >>>> +#include <xdp/xsk.h> >>>> + >>>> +#include "clients.h" >>>> +#include "monitor/monitor.h" >>>> +#include "net/net.h" >>>> +#include "qapi/error.h" >>>> +#include "qemu/cutils.h" >>>> +#include "qemu/error-report.h" >>>> +#include "qemu/iov.h" >>>> +#include "qemu/main-loop.h" >>>> +#include "qemu/memalign.h" >>>> + >>>> + >>>> +typedef struct AFXDPState { >>>> + NetClientState nc; >>>> + >>>> + struct xsk_socket *xsk; >>>> + struct xsk_ring_cons rx; >>>> + struct xsk_ring_prod tx; >>>> + struct xsk_ring_cons cq; >>>> + struct xsk_ring_prod fq; >>>> + >>>> + char ifname[IFNAMSIZ]; >>>> + int ifindex; >>>> + bool read_poll; >>>> + bool write_poll; >>>> + uint32_t outstanding_tx; >>>> + >>>> + uint64_t *pool; >>>> + uint32_t n_pool; >>>> + char *buffer; >>>> + struct xsk_umem *umem; >>>> + >>>> + uint32_t n_queues; >>>> + uint32_t xdp_flags; >>>> + bool inhibit; >>>> +} AFXDPState; >>>> + >>>> +#define AF_XDP_BATCH_SIZE 64 >>>> + >>>> +static void af_xdp_send(void *opaque); >>>> +static void af_xdp_writable(void *opaque); >>>> + >>>> +/* Set the event-loop handlers for the af-xdp backend. */ >>>> +static void af_xdp_update_fd_handler(AFXDPState *s) >>>> +{ >>>> + qemu_set_fd_handler(xsk_socket__fd(s->xsk), >>>> + s->read_poll ? af_xdp_send : NULL, >>>> + s->write_poll ? af_xdp_writable : NULL, >>>> + s); >>>> +} >>>> + >>>> +/* Update the read handler. */ >>>> +static void af_xdp_read_poll(AFXDPState *s, bool enable) >>>> +{ >>>> + if (s->read_poll != enable) { >>>> + s->read_poll = enable; >>>> + af_xdp_update_fd_handler(s); >>>> + } >>>> +} >>>> + >>>> +/* Update the write handler. 
*/ >>>> +static void af_xdp_write_poll(AFXDPState *s, bool enable) >>>> +{ >>>> + if (s->write_poll != enable) { >>>> + s->write_poll = enable; >>>> + af_xdp_update_fd_handler(s); >>>> + } >>>> +} >>>> + >>>> +static void af_xdp_poll(NetClientState *nc, bool enable) >>>> +{ >>>> + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); >>>> + >>>> + if (s->read_poll != enable || s->write_poll != enable) { >>>> + s->write_poll = enable; >>>> + s->read_poll = enable; >>>> + af_xdp_update_fd_handler(s); >>>> + } >>>> +} >>>> + >>>> +static void af_xdp_complete_tx(AFXDPState *s) >>>> +{ >>>> + uint32_t idx = 0; >>>> + uint32_t done, i; >>>> + uint64_t *addr; >>>> + >>>> + done = xsk_ring_cons__peek(&s->cq, XSK_RING_CONS__DEFAULT_NUM_DESCS, >>>> &idx); >>>> + >>>> + for (i = 0; i < done; i++) { >>>> + addr = (void *) xsk_ring_cons__comp_addr(&s->cq, idx++); >>>> + s->pool[s->n_pool++] = *addr; >>>> + s->outstanding_tx--; >>>> + } >>>> + >>>> + if (done) { >>>> + xsk_ring_cons__release(&s->cq, done); >>>> + } >>>> +} >>>> + >>>> +/* >>>> + * The fd_write() callback, invoked if the fd is marked as writable >>>> + * after a poll. >>>> + */ >>>> +static void af_xdp_writable(void *opaque) >>>> +{ >>>> + AFXDPState *s = opaque; >>>> + >>>> + /* Try to recover buffers that are already sent. */ >>>> + af_xdp_complete_tx(s); >>>> + >>>> + /* >>>> + * Unregister the handler, unless we still have packets to transmit >>>> + * and kernel needs a wake up. >>>> + */ >>>> + if (!s->outstanding_tx || !xsk_ring_prod__needs_wakeup(&s->tx)) { >>>> + af_xdp_write_poll(s, false); >>>> + } >>>> + >>>> + /* Flush any buffered packets. */ >>>> + qemu_flush_queued_packets(&s->nc); >>>> +} >>>> + >>>> +static ssize_t af_xdp_receive(NetClientState *nc, >>>> + const uint8_t *buf, size_t size) >>>> +{ >>>> + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); >>>> + struct xdp_desc *desc; >>>> + uint32_t idx; >>>> + void *data; >>>> + >>>> + /* Try to recover buffers that are already sent. 
*/ >>>> + af_xdp_complete_tx(s); >>>> + >>>> + if (size > XSK_UMEM__DEFAULT_FRAME_SIZE) { >>>> + /* We can't transmit packet this size... */ >>>> + return size; >>>> + } >>>> + >>>> + if (!s->n_pool || !xsk_ring_prod__reserve(&s->tx, 1, &idx)) { >>>> + /* >>>> + * Out of buffers or space in tx ring. Poll until we can write. >>>> + * This will also kick the Tx, if it was waiting on CQ. >>>> + */ >>>> + af_xdp_write_poll(s, true); >>>> + return 0; >>>> + } >>>> + >>>> + desc = xsk_ring_prod__tx_desc(&s->tx, idx); >>>> + desc->addr = s->pool[--s->n_pool]; >>>> + desc->len = size; >>>> + >>>> + data = xsk_umem__get_data(s->buffer, desc->addr); >>>> + memcpy(data, buf, size); >>>> + >>>> + xsk_ring_prod__submit(&s->tx, 1); >>>> + s->outstanding_tx++; >>>> + >>>> + if (xsk_ring_prod__needs_wakeup(&s->tx)) { >>>> + af_xdp_write_poll(s, true); >>>> + } >>>> + >>>> + return size; >>>> +} >>>> + >>>> +/* >>>> + * Complete a previous send (backend --> guest) and enable the >>>> + * fd_read callback. >>>> + */ >>>> +static void af_xdp_send_completed(NetClientState *nc, ssize_t len) >>>> +{ >>>> + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); >>>> + >>>> + af_xdp_read_poll(s, true); >>>> +} >>>> + >>>> +static void af_xdp_fq_refill(AFXDPState *s, uint32_t n) >>>> +{ >>>> + uint32_t i, idx = 0; >>>> + >>>> + /* Leave one packet for Tx, just in case. */ >>>> + if (s->n_pool < n + 1) { >>>> + n = s->n_pool; >>>> + } >>>> + >>>> + if (!n || !xsk_ring_prod__reserve(&s->fq, n, &idx)) { >>>> + return; >>>> + } >>>> + >>>> + for (i = 0; i < n; i++) { >>>> + *xsk_ring_prod__fill_addr(&s->fq, idx++) = s->pool[--s->n_pool]; >>>> + } >>>> + xsk_ring_prod__submit(&s->fq, n); >>>> + >>>> + if (xsk_ring_prod__needs_wakeup(&s->fq)) { >>>> + /* Receive was blocked by not having enough buffers. Wake it up. 
>>>> */ >>>> + af_xdp_read_poll(s, true); >>>> + } >>>> +} >>>> + >>>> +static void af_xdp_send(void *opaque) >>>> +{ >>>> + uint32_t i, n_rx, idx = 0; >>>> + AFXDPState *s = opaque; >>>> + >>>> + n_rx = xsk_ring_cons__peek(&s->rx, AF_XDP_BATCH_SIZE, &idx); >>>> + if (!n_rx) { >>>> + return; >>>> + } >>>> + >>>> + for (i = 0; i < n_rx; i++) { >>>> + const struct xdp_desc *desc; >>>> + struct iovec iov; >>>> + >>>> + desc = xsk_ring_cons__rx_desc(&s->rx, idx++); >>>> + >>>> + iov.iov_base = xsk_umem__get_data(s->buffer, desc->addr); >>>> + iov.iov_len = desc->len; >>>> + >>>> + s->pool[s->n_pool++] = desc->addr; >>>> + >>>> + if (!qemu_sendv_packet_async(&s->nc, &iov, 1, >>>> + af_xdp_send_completed)) { >>>> + /* >>>> + * The peer does not receive anymore. Packet is queued, stop >>>> + * reading from the backend until af_xdp_send_completed(). >>>> + */ >>>> + af_xdp_read_poll(s, false); >>>> + >>>> + /* Re-peek the descriptors to not break the ring cache. */ >>>> + xsk_ring_cons__cancel(&s->rx, n_rx); >>>> + n_rx = xsk_ring_cons__peek(&s->rx, i + 1, &idx); >>> >>> The code turns out to be hard to read here. >>> >>> 1) This seems to undo the peek (usually peek doesn't touch the >>> prod/consumer but it seems not what xsk_ring_cons__peek()) did: >> >> Yeah, it's unfortunate that the peek() function changes the internal >> state, but that is what we have... >> >>> >>> static inline __u32 xsk_ring_cons__peek(struct xsk_ring_cons *cons, >>> __u32 nb, __u32 *idx) >>> { >>> __u32 entries = xsk_cons_nb_avail(cons, nb); >>> >>> if (entries > 0) { >>> *idx = cons->cached_cons; >>> cons->cached_cons += entries; >>> } >>> >>> return entries; >>> } >>> >>> 2) It looks to me a partial rollback is sufficient? >>> >>> xsk_ring_cons__cancel(n_rx - i + 1)? >> >> Good point. Should work. It should be n_rx - i - 1 though, if I'm not >> mistaken. So: >> >> xsk_ring_cons__cancel(n_rx - i - 1); >> n_rx = i + 1; >> >> I'm not sure if that is much easier to read, but that's OK. 
Should be >> a touch faster as well. What do you think? > > Let's do that please.
OK, sure. Seems to work fine. I'll post v3 soon with this and other discussed changes.
> >> >>> >>>> + g_assert(n_rx == i + 1); >>>> + break; >>>> + } >>>> + } >>>> + >>>> + /* Release actually sent descriptors and try to re-fill. */ >>>> + xsk_ring_cons__release(&s->rx, n_rx); >>>> + af_xdp_fq_refill(s, AF_XDP_BATCH_SIZE); >>>> +} >>>> + >>>> +/* Flush and close. */ >>>> +static void af_xdp_cleanup(NetClientState *nc) >>>> +{ >>>> + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); >>>> + >>>> + qemu_purge_queued_packets(nc); >>>> + >>>> + af_xdp_poll(nc, false); >>>> + >>>> + xsk_socket__delete(s->xsk); >>>> + s->xsk = NULL; >>>> + g_free(s->pool); >>>> + s->pool = NULL; >>>> + xsk_umem__delete(s->umem); >>>> + s->umem = NULL; >>>> + qemu_vfree(s->buffer); >>>> + s->buffer = NULL; >>>> + >>>> + /* Remove the program if it's the last open queue. */ >>>> + if (!s->inhibit && nc->queue_index == s->n_queues - 1 && s->xdp_flags >>>> + && bpf_xdp_detach(s->ifindex, s->xdp_flags, NULL) != 0) { >>>> + fprintf(stderr, >>>> + "af-xdp: unable to remove XDP program from '%s', ifindex: >>>> %d\n", >>>> + s->ifname, s->ifindex); >>>> + } >>>> +} >>>> + >>>> +static int af_xdp_umem_create(AFXDPState *s, int sock_fd, Error **errp) >>>> +{ >>>> + struct xsk_umem_config config = { >>>> + .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS, >>>> + .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, >>>> + .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE, >>>> + .frame_headroom = 0, >>>> + }; >>>> + uint64_t n_descs; >>>> + uint64_t size; >>>> + int64_t i; >>>> + int ret; >>>> + >>>> + /* Number of descriptors if all 4 queues (rx, tx, cq, fq) are full. 
*/ >>>> + n_descs = (XSK_RING_PROD__DEFAULT_NUM_DESCS >>>> + + XSK_RING_CONS__DEFAULT_NUM_DESCS) * 2; >>>> + size = n_descs * XSK_UMEM__DEFAULT_FRAME_SIZE; >>>> + >>>> + s->buffer = qemu_memalign(qemu_real_host_page_size(), size); >>>> + memset(s->buffer, 0, size); >>>> + >>>> + if (sock_fd < 0) { >>>> + ret = xsk_umem__create(&s->umem, s->buffer, size, >>>> + &s->fq, &s->cq, &config); >>>> + } else { >>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD >>>> + ret = xsk_umem__create_with_fd(&s->umem, sock_fd, s->buffer, size, >>>> + &s->fq, &s->cq, &config); >>>> +#else >>> >>> So sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD won't work. We'd better >>> >>> 1) disable sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD >> >> The qapi property is conditionally defined, so users will not be able to >> set sock-fds if not supported. And qemu will complain about unknown >> property. That should be enough? > > Yes. > >> >>> >>> or >>> >>> 2) disable af_xdp without HAVE_XSK_UMEM__CREATE_WITH_FD >> >> If we require libxdp 1.4 that will be the case. >> >>> >>>> + ret = -1; >>>> + errno = EINVAL; >>>> +#endif >>>> + } >>>> + >>>> + if (ret) { >>>> + qemu_vfree(s->buffer); >>>> + error_setg_errno(errp, errno, >>>> + "failed to create umem for %s queue_index: %d", >>>> + s->ifname, s->nc.queue_index); >>>> + return -1; >>>> + } >>>> + >>>> + s->pool = g_new(uint64_t, n_descs); >>>> + /* Fill the pool in the opposite order, because it's a LIFO queue. 
*/ >>>> + for (i = n_descs; i >= 0; i--) { >>>> + s->pool[i] = i * XSK_UMEM__DEFAULT_FRAME_SIZE; >>>> + } >>>> + s->n_pool = n_descs; >>>> + >>>> + af_xdp_fq_refill(s, XSK_RING_PROD__DEFAULT_NUM_DESCS); >>>> + >>>> + return 0; >>>> +} >>>> + >>>> +static int af_xdp_socket_create(AFXDPState *s, >>>> + const NetdevAFXDPOptions *opts, >>>> + int xsks_map_fd, Error **errp) >>>> +{ >>>> + struct xsk_socket_config cfg = { >>>> + .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, >>>> + .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS, >>>> + .libxdp_flags = 0, >>>> + .bind_flags = XDP_USE_NEED_WAKEUP, >>>> + .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST, >>>> + }; >>>> + int queue_id, error = 0; >>>> + >>>> + s->inhibit = opts->has_inhibit && opts->inhibit; >>>> + if (s->inhibit) { >>>> + cfg.libxdp_flags |= XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD; >>>> + } >>>> + >>>> + if (opts->has_force_copy && opts->force_copy) { >>>> + cfg.bind_flags |= XDP_COPY; >>>> + } >>>> + >>>> + queue_id = s->nc.queue_index; >>>> + if (opts->has_start_queue && opts->start_queue > 0) { >>>> + queue_id += opts->start_queue; >>>> + } >>>> + >>>> + if (opts->has_mode) { >>>> + /* Specific mode requested. */ >>>> + cfg.xdp_flags |= (opts->mode == AFXDP_MODE_NATIVE) >>>> + ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE; >>>> + if (xsk_socket__create(&s->xsk, s->ifname, queue_id, >>>> + s->umem, &s->rx, &s->tx, &cfg)) { >>>> + error = errno; >>>> + } >>>> + } else { >>>> + /* No mode requested, try native first. */ >>>> + cfg.xdp_flags |= XDP_FLAGS_DRV_MODE; >>>> + >>>> + if (xsk_socket__create(&s->xsk, s->ifname, queue_id, >>>> + s->umem, &s->rx, &s->tx, &cfg)) { >>>> + /* Can't use native mode, try skb. 
*/ >>>> + cfg.xdp_flags &= ~XDP_FLAGS_DRV_MODE; >>>> + cfg.xdp_flags |= XDP_FLAGS_SKB_MODE; >>>> + >>>> + if (xsk_socket__create(&s->xsk, s->ifname, queue_id, >>>> + s->umem, &s->rx, &s->tx, &cfg)) { >>>> + error = errno; >>>> + } >>>> + } >>>> + } >>>> + >>>> + if (error) { >>>> + error_setg_errno(errp, error, >>>> + "failed to create AF_XDP socket for %s queue_id: >>>> %d", >>>> + s->ifname, queue_id); >>>> + return -1; >>>> + } >>>> + >>>> + if (s->inhibit && xsks_map_fd >= 0) { >>>> + int xsk_fd = xsk_socket__fd(s->xsk); >>>> + >>>> + /* Need to update the map manually, libxdp skipped that step. */ >>>> + error = bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0); >>>> + if (error) { >>>> + error_setg_errno(errp, error, >>>> + "failed to update xsks map for %s queue_id: >>>> %d", >>>> + s->ifname, queue_id); >>>> + return -1; >>>> + } >>>> + } >>>> + >>>> + s->xdp_flags = cfg.xdp_flags; >>>> + >>>> + return 0; >>>> +} >>>> + >>>> +/* NetClientInfo methods. */ >>>> +static NetClientInfo net_af_xdp_info = { >>>> + .type = NET_CLIENT_DRIVER_AF_XDP, >>>> + .size = sizeof(AFXDPState), >>>> + .receive = af_xdp_receive, >>>> + .poll = af_xdp_poll, >>>> + .cleanup = af_xdp_cleanup, >>>> +}; >>>> + >>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD >>>> +static int *parse_socket_fds(const char *sock_fds_str, >>>> + int64_t n_expected, Error **errp) >>>> +{ >>>> + gchar **substrings = g_strsplit(sock_fds_str, ":", -1); >>>> + int64_t i, n_sock_fds = g_strv_length(substrings); >>>> + int *sock_fds = NULL; >>>> + >>>> + if (n_sock_fds != n_expected) { >>>> + error_setg(errp, "expected %"PRIi64" socket fds, got %"PRIi64, >>>> + n_expected, n_sock_fds); >>>> + goto exit; >>>> + } >>>> + >>>> + sock_fds = g_new(int, n_sock_fds); >>>> + >>>> + for (i = 0; i < n_sock_fds; i++) { >>>> + sock_fds[i] = monitor_fd_param(monitor_cur(), substrings[i], >>>> errp); >>>> + if (sock_fds[i] < 0) { >>>> + g_free(sock_fds); >>>> + sock_fds = NULL; >>>> + goto exit; >>>> + } >>>> + } >>>> + 
>>>> +exit: >>>> + g_strfreev(substrings); >>>> + return sock_fds; >>>> +} >>>> +#endif >>>> + >>>> +/* >>>> + * The exported init function. >>>> + * >>>> + * ... -net af-xdp,ifname="..." >>> >>> This is the legacy command line, let's say -netdev af-xdp,... >> >> Sure. >> >>> >>>> + */ >>>> +int net_init_af_xdp(const Netdev *netdev, >>>> + const char *name, NetClientState *peer, Error **errp) >>>> +{ >>>> + const NetdevAFXDPOptions *opts = &netdev->u.af_xdp; >>>> + NetClientState *nc, *nc0 = NULL; >>>> + unsigned int ifindex; >>>> + uint32_t prog_id = 0; >>>> + int *sock_fds = NULL; >>>> + int xsks_map_fd = -1; >>>> + int64_t i, queues; >>>> + Error *err = NULL; >>>> + AFXDPState *s; >>>> + >>>> + ifindex = if_nametoindex(opts->ifname); >>>> + if (!ifindex) { >>>> + error_setg_errno(errp, errno, "failed to get ifindex for '%s'", >>>> + opts->ifname); >>>> + return -1; >>>> + } >>>> + >>>> + queues = opts->has_queues ? opts->queues : 1; >>>> + if (queues < 1) { >>>> + error_setg(errp, "invalid number of queues (%" PRIi64 ") for >>>> '%s'", >>>> + queues, opts->ifname); >>>> + return -1; >>>> + } >>>> + >>>> +#ifndef HAVE_XSK_UMEM__CREATE_WITH_FD >>>> + if ((opts->has_inhibit && opts->inhibit) != !!opts->xsks_map_fd) { >>>> + error_setg(errp, "expected 'inhibit=on' and 'xsks-map-fd' >>>> together"); >>>> + return -1; >>>> + } >>>> +#else >>>> + if ((opts->has_inhibit && opts->inhibit) >>>> + != (opts->xsks_map_fd || opts->sock_fds)) { >>>> + error_setg(errp, "'inhibit=on' should be used together with " >>>> + "'sock-fds' or 'xsks-map-fd'"); >>>> + return -1; >>>> + } >>>> + >>>> + if (opts->xsks_map_fd && opts->sock_fds) { >>>> + error_setg(errp, "'sock-fds' and 'xsks-map-fd' are mutually >>>> exclusive"); >>>> + return -1; >>>> + } >>>> + >>>> + if (opts->sock_fds) { >>>> + sock_fds = parse_socket_fds(opts->sock_fds, queues, errp); >>>> + if (!sock_fds) { >>>> + return -1; >>>> + } >>>> + } >>>> +#endif >>>> + >>>> + if (opts->xsks_map_fd) { >>>> + xsks_map_fd = 
monitor_fd_param(monitor_cur(), opts->xsks_map_fd, >>>> errp); >>>> + if (xsks_map_fd < 0) { >>>> + goto err; >>>> + } >>>> + } >>>> + >>>> + for (i = 0; i < queues; i++) { >>>> + nc = qemu_new_net_client(&net_af_xdp_info, peer, "af-xdp", name); >>>> + qemu_set_info_str(nc, "af-xdp%"PRIi64" to %s", i, opts->ifname); >>>> + nc->queue_index = i; >>>> + >>>> + if (!nc0) { >>>> + nc0 = nc; >>>> + } >>>> + >>>> + s = DO_UPCAST(AFXDPState, nc, nc); >>>> + >>>> + pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname); >>>> + s->ifindex = ifindex; >>>> + s->n_queues = queues; >>>> + >>>> + if (af_xdp_umem_create(s, sock_fds ? sock_fds[i] : -1, errp) >>>> + || af_xdp_socket_create(s, opts, xsks_map_fd, errp)) { >>>> + /* Make sure the XDP program will be removed. */ >>>> + s->n_queues = i; >>>> + error_propagate(errp, err); >>>> + goto err; >>>> + } >>>> + } >>>> + >>>> + if (nc0) { >>>> + s = DO_UPCAST(AFXDPState, nc, nc0); >>>> + if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || >>>> !prog_id) { >>>> + error_setg_errno(errp, errno, >>>> + "no XDP program loaded on '%s', ifindex: %d", >>>> + s->ifname, s->ifindex); >>>> + goto err; >>>> + } >>>> + } >>>> + >>>> + af_xdp_read_poll(s, true); /* Initially only poll for reads. 
*/ >>>> + >>>> + return 0; >>>> + >>>> +err: >>>> + g_free(sock_fds); >>>> + if (nc0) { >>>> + qemu_del_net_client(nc0); >>>> + } >>>> + >>>> + return -1; >>>> +} >>>> diff --git a/net/clients.h b/net/clients.h >>>> index ed8bdfff1e..be53794582 100644 >>>> --- a/net/clients.h >>>> +++ b/net/clients.h >>>> @@ -64,6 +64,11 @@ int net_init_netmap(const Netdev *netdev, const char >>>> *name, >>>> NetClientState *peer, Error **errp); >>>> #endif >>>> >>>> +#ifdef CONFIG_AF_XDP >>>> +int net_init_af_xdp(const Netdev *netdev, const char *name, >>>> + NetClientState *peer, Error **errp); >>>> +#endif >>>> + >>>> int net_init_vhost_user(const Netdev *netdev, const char *name, >>>> NetClientState *peer, Error **errp); >>>> >>>> diff --git a/net/meson.build b/net/meson.build >>>> index bdf564a57b..61628d4684 100644 >>>> --- a/net/meson.build >>>> +++ b/net/meson.build >>>> @@ -36,6 +36,9 @@ system_ss.add(when: vde, if_true: files('vde.c')) >>>> if have_netmap >>>> system_ss.add(files('netmap.c')) >>>> endif >>>> + >>>> +system_ss.add(when: libxdp, if_true: files('af-xdp.c')) >>>> + >>>> if have_vhost_net_user >>>> system_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: >>>> files('vhost-user.c'), if_false: files('vhost-user-stub.c')) >>>> system_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-user-stub.c')) >>>> diff --git a/net/net.c b/net/net.c >>>> index 6492ad530e..127f70932b 100644 >>>> --- a/net/net.c >>>> +++ b/net/net.c >>>> @@ -1082,6 +1082,9 @@ static int (* const >>>> net_client_init_fun[NET_CLIENT_DRIVER__MAX])( >>>> #ifdef CONFIG_NETMAP >>>> [NET_CLIENT_DRIVER_NETMAP] = net_init_netmap, >>>> #endif >>>> +#ifdef CONFIG_AF_XDP >>>> + [NET_CLIENT_DRIVER_AF_XDP] = net_init_af_xdp, >>>> +#endif >>>> #ifdef CONFIG_NET_BRIDGE >>>> [NET_CLIENT_DRIVER_BRIDGE] = net_init_bridge, >>>> #endif >>>> @@ -1186,6 +1189,9 @@ void show_netdevs(void) >>>> #ifdef CONFIG_NETMAP >>>> "netmap", >>>> #endif >>>> +#ifdef CONFIG_AF_XDP >>>> + "af-xdp", >>>> +#endif >>>> #ifdef CONFIG_POSIX >>>> 
"vhost-user", >>>> #endif >>>> diff --git a/qapi/net.json b/qapi/net.json >>>> index db67501308..88f2c982c2 100644 >>>> --- a/qapi/net.json >>>> +++ b/qapi/net.json >>>> @@ -408,6 +408,62 @@ >>>> 'ifname': 'str', >>>> '*devname': 'str' } } >>>> >>>> +## >>>> +# @AFXDPMode: >>>> +# >>>> +# Attach mode for a default XDP program >>>> +# >>>> +# @skb: generic mode, no driver support necessary >>>> +# >>>> +# @native: DRV mode, program is attached to a driver, packets are passed >>>> to >>>> +# the socket without allocation of skb. >>>> +# >>>> +# Since: 8.1 >>> >>> I'd make it for 8.2. >> >> OK. >> >>> >>>> +## >>>> +{ 'enum': 'AFXDPMode', >>>> + 'data': [ 'native', 'skb' ] } >>>> + >>>> +## >>>> +# @NetdevAFXDPOptions: >>>> +# >>>> +# AF_XDP network backend >>>> +# >>>> +# @ifname: The name of an existing network interface. >>>> +# >>>> +# @mode: Attach mode for a default XDP program. If not specified, then >>>> +# 'native' will be tried first, then 'skb'. >>>> +# >>>> +# @force-copy: Force XDP copy mode even if device supports zero-copy. >>>> +# (default: false) >>>> +# >>>> +# @queues: number of queues to be used for multiqueue interfaces >>>> (default: 1). >>>> +# >>>> +# @start-queue: Use @queues starting from this queue number (default: 0). >>>> +# >>>> +# @inhibit: Don't load a default XDP program, use one already loaded to >>>> +# the interface (default: false). Requires @sock-fds or @xsks-map-fd. >>>> +# >>>> +# @sock-fds: A colon (:) separated list of file descriptors for already >>>> open >>>> +# but not bound AF_XDP sockets in the queue order. One fd per queue. >>>> +# These descriptors should already be added into XDP socket map for >>>> +# corresponding queues. Requires @inhibit. >>>> +# >>>> +# @xsks-map-fd: A file descriptor for an already open XDP socket map in >>>> +# the already loaded XDP program. Requires @inhibit. 
>>>> +# >>>> +# Since: 8.1 >>>> +## >>>> +{ 'struct': 'NetdevAFXDPOptions', >>>> + 'data': { >>>> + 'ifname': 'str', >>>> + '*mode': 'AFXDPMode', >>>> + '*force-copy': 'bool', >>>> + '*queues': 'int', >>>> + '*start-queue': 'int', >>>> + '*inhibit': 'bool', >>>> + '*sock-fds': { 'type': 'str', 'if': >>>> 'HAVE_XSK_UMEM__CREATE_WITH_FD' }, >> >> The parameter is defined conditionally here and it will not be >> compiled in if HAVE_XSK_UMEM__CREATE_WITH_FD is not defined. > > Right, I missed that. > >> >>>> + '*xsks-map-fd': 'str' } } >>>> + >>>> ## >>>> # @NetdevVhostUserOptions: >>>> # >>>> @@ -642,13 +698,14 @@ >>>> # @vmnet-bridged: since 7.1 >>>> # @stream: since 7.2 >>>> # @dgram: since 7.2 >>>> +# @af-xdp: since 8.1 >>>> # >>>> # Since: 2.7 >>>> ## >>>> { 'enum': 'NetClientDriver', >>>> 'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'stream', >>>> 'dgram', 'vde', 'bridge', 'hubport', 'netmap', 'vhost-user', >>>> - 'vhost-vdpa', >>>> + 'vhost-vdpa', 'af-xdp', >>>> { 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' }, >>>> { 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' }, >>>> { 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] } >>>> @@ -680,6 +737,7 @@ >>>> 'bridge': 'NetdevBridgeOptions', >>>> 'hubport': 'NetdevHubPortOptions', >>>> 'netmap': 'NetdevNetmapOptions', >>>> + 'af-xdp': 'NetdevAFXDPOptions', >>>> 'vhost-user': 'NetdevVhostUserOptions', >>>> 'vhost-vdpa': 'NetdevVhostVDPAOptions', >>>> 'vmnet-host': { 'type': 'NetdevVmnetHostOptions', >>>> diff --git a/qemu-options.hx b/qemu-options.hx >>>> index b57489d7ca..d91610701c 100644 >>>> --- a/qemu-options.hx >>>> +++ b/qemu-options.hx >>>> @@ -2856,6 +2856,25 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev, >>>> " VALE port (created on the fly) called 'name' >>>> ('nmname' is name of the \n" >>>> " netmap device, defaults to '/dev/netmap')\n" >>>> #endif >>>> +#ifdef CONFIG_AF_XDP >>>> + "-netdev >>>> af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n" >>>> + " >>>> 
[,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k]\n" >>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD >>>> + " [,sock-fds=x:y:...:z]\n" >>>> +#endif >>>> + " attach to the existing network interface 'name' with >>>> AF_XDP socket\n" >>>> + " use 'mode=MODE' to specify an XDP program attach >>>> mode\n" >>>> + " use 'force-copy=on|off' to force XDP copy mode even >>>> if device supports zero-copy (default: off)\n" >>>> + " use 'inhibit=on|off' to inhibit loading of a default >>>> XDP program (default: off)\n" >>>> + " with inhibit=on,\n" >>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD >>>> + " use 'sock-fds' to provide file descriptors for >>>> already open XDP sockets\n" >>>> + " added to a socket map in XDP program. One socket >>>> per queue. Or\n" >>>> +#endif >>>> + " use 'xsks-map-fd=k' to provide a file descriptor >>>> for xsks map\n" >>>> + " use 'queues=n' to specify how many queues of a >>>> multiqueue interface should be used\n" >>>> + " use 'start-queue=m' to specify the first queue that >>>> should be used\n" >>>> +#endif >>>> #ifdef CONFIG_POSIX >>>> "-netdev vhost-user,id=str,chardev=dev[,vhostforce=on|off]\n" >>>> " configure a vhost-user network, backed by a chardev >>>> 'dev'\n" >>>> @@ -2901,6 +2920,9 @@ DEF("nic", HAS_ARG, QEMU_OPTION_nic, >>>> #ifdef CONFIG_NETMAP >>>> "netmap|" >>>> #endif >>>> +#ifdef CONFIG_AF_XDP >>>> + "af-xdp|" >>>> +#endif >>>> #ifdef CONFIG_POSIX >>>> "vhost-user|" >>>> #endif >>>> @@ -2929,6 +2951,9 @@ DEF("net", HAS_ARG, QEMU_OPTION_net, >>>> #ifdef CONFIG_NETMAP >>>> "netmap|" >>>> #endif >>>> +#ifdef CONFIG_AF_XDP >>>> + "af-xdp|" >>>> +#endif >>>> #ifdef CONFIG_VMNET >>>> "vmnet-host|vmnet-shared|vmnet-bridged|" >>>> #endif >>>> @@ -2936,7 +2961,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net, >>>> " old way to initialize a host network interface\n" >>>> " (use the -netdev option if possible instead)\n", >>>> QEMU_ARCH_ALL) >>>> SRST >>>> -``-nic >>>> 
[tap|bridge|user|l2tpv3|vde|netmap|vhost-user|socket][,...][,mac=macaddr][,model=mn]`` >>>> +``-nic >>>> [tap|bridge|user|l2tpv3|vde|netmap|af-xdp|vhost-user|socket][,...][,mac=macaddr][,model=mn]`` >>>> This option is a shortcut for configuring both the on-board >>>> (default) guest NIC hardware and the host network backend in one go. >>>> The host backend options are the same as with the corresponding >>>> @@ -3350,6 +3375,62 @@ SRST >>>> # launch QEMU instance >>>> |qemu_system| linux.img -nic vde,sock=/tmp/myswitch >>>> >>>> +``-netdev >>>> af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off][,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k][,sock-fds=x:y:...:z]`` >>>> + Configure AF_XDP backend to connect to a network interface 'name' >>>> + using AF_XDP socket. A specific program attach mode for a default >>>> + XDP program can be forced with 'mode', defaults to best-effort, >>>> + where the likely most performant mode will be in use. Number of >>>> queues >>>> + 'n' should generally match the number of queues in the interface, >>>> + defaults to 1. Traffic arriving on non-configured device queues will >>>> + not be delivered to the network backend. >>>> + >>>> + .. parsed-literal:: >>>> + >>>> + # set number of queues to 4 >>>> + ethtool -L eth0 combined 4 >>>> + # launch QEMU instance >>>> + |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\ >>>> + -netdev af-xdp,id=n1,ifname=eth0,queues=4 >>>> + >>>> + 'start-queue' option can be specified if a particular range of queues >>>> + [m, m + n] should be in use. For example, this is necessary in order >>>> + to use MLX NICs in native mode. The driver will create a separate set >>>> + of queues on top of regular ones, and only these queues can be used >>>> + for AF_XDP sockets. MLX NICs will also require an additional traffic >>>> + redirection with ethtool to these queues. >>> >>> Let's avoid mentioning any vendor name unless the emulation is done >>> for a specific one. 
>> >> AFAIK, MLX/NVIDIA is the only vendor that implements XDP queues in this >> strange fashion, so these are the only NICs that will not work without >> start-queue configuration and the extra traffic re-direction. Unfortunately, >> this is also not documented anywhere, so you just have to know that it works >> this way... >> >> So, on one hand, I understand that it's not the place for vendor-specific >> docs. On the other, it gives qemu users a chance to not waste a lot of time >> trying to figure out why everything is configured correctly, but the traffic >> doesn't flow. >> >> Maybe something like this: >> >> 'start-queue' option can be specified if a particular range of queues >> [m, m + n] should be in use. For example, this may be necessary in >> order to use certain NICs in native mode. Kernel allows the driver to >> create a separate set of XDP queues on top of regular ones, and only >> these queues can be used for AF_XDP sockets. NICs that work this way >> may also require an additional traffic redirection with ethtool to these >> special queues. >> >> .. parsed-literal:: >> >> # set number of queues to 1 >> ethtool -L eth0 combined 1 >> # redirect all the traffic to the second queue (id: 1) >> # note: drivers may require non-empty key/mask pair. >> ethtool -N eth0 flow-type ether \\ >> dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1 >> ethtool -N eth0 flow-type ether \\ >> dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1 >> # launch QEMU instance >> |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\ >> -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1 >> >> What do you think? > > This is great. > >> >>> >>>> + E.g.: >>>> + >>>> + .. parsed-literal:: >>>> + >>>> + # set number of queues to 1 >>>> + ethtool -L eth0 combined 1 >>>> + # redirect all the traffic to the second queue (id: 1) >>>> + # note: mlx5 driver requires non-empty key/mask pair. 
>>>> + ethtool -N eth0 flow-type ether \\ >>>> + dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1 >>>> + ethtool -N eth0 flow-type ether \\ >>>> + dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1 >>>> + # launch QEMU instance >>>> + |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\ >>>> + -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1 >>>> + >>>> + XDP program can also be loaded externally. In this case 'inhibit' >>>> option >>>> + should be set to 'on' and 'xsks-map-fd' provided with a file >>>> descriptor >>>> + for an open XDP socket map of that program, or 'sock-fds' with file >>>> + descriptors for already open but not bound XDP sockets already added >>>> to a >>>> + socket map for corresponding queues. One socket per queue. >>>> + >>>> + .. parsed-literal:: >>>> + >>>> + |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\ >>>> + -netdev >>>> af-xdp,id=n1,ifname=eth0,queues=3,inhibit=on,sock-fds=15:16:17 >>>> + >>>> + With 'inhibit=on' and 'sock-fds', QEMU process will only require 32 MB >>>> + of locked memory (RLIMIT_MEMLOCK) per queue or CAP_IPC_LOCK >>>> capability. >>>> + With 'inhibit=on' and 'xsks-map-fd' it will additionally require >>>> + CAP_NET_RAW capability. With 'inhibit=off', CAP_SYS/NET_ADMIN should >>>> be >>>> + added as well. >>> >>> This requires the synchronization with the code changes, I suggest not >>> to describe: >>> >>> 1) actual numbers of memory requirement >>> 2) capabilities required for each mode, we don't do that for other >>> netdev like tap. >> >> Makes sense. So, should we just strike the paragraph above entirely? > > I think so. > > Thanks > >> >>> >>> Thanks >>> >>> >>> >>> >>>> + >>>> + >>>> ``-netdev vhost-user,chardev=id[,vhostforce=on|off][,queues=n]`` >>>> Establish a vhost-user netdev, backed by a chardev id. The chardev >>>> should be a unix domain socket backed one. 
The vhost-user uses a >>>> diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure >>>> b/scripts/ci/org.centos/stream/8/x86_64/configure >>>> index d02b09a4b9..7585c4c4ed 100755 >>>> --- a/scripts/ci/org.centos/stream/8/x86_64/configure >>>> +++ b/scripts/ci/org.centos/stream/8/x86_64/configure >>>> @@ -35,6 +35,7 @@ >>>> --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \ >>>> --with-coroutine=ucontext \ >>>> --tls-priority=@QEMU,SYSTEM \ >>>> +--disable-af-xdp \ >>>> --disable-attr \ >>>> --disable-auth-pam \ >>>> --disable-avx2 \ >>>> diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh >>>> index 7dd5709ef4..162e455309 100644 >>>> --- a/scripts/meson-buildoptions.sh >>>> +++ b/scripts/meson-buildoptions.sh >>>> @@ -74,6 +74,7 @@ meson_options_help() { >>>> printf "%s\n" 'disabled with --disable-FEATURE, default is enabled if >>>> available' >>>> printf "%s\n" '(unless built with --without-default-features):' >>>> printf "%s\n" '' >>>> + printf "%s\n" ' af-xdp AF_XDP network backend support' >>>> printf "%s\n" ' alsa ALSA sound support' >>>> printf "%s\n" ' attr attr/xattr support' >>>> printf "%s\n" ' auth-pam PAM access control' >>>> @@ -207,6 +208,8 @@ meson_options_help() { >>>> } >>>> _meson_option_parse() { >>>> case $1 in >>>> + --enable-af-xdp) printf "%s" -Daf_xdp=enabled ;; >>>> + --disable-af-xdp) printf "%s" -Daf_xdp=disabled ;; >>>> --enable-alsa) printf "%s" -Dalsa=enabled ;; >>>> --disable-alsa) printf "%s" -Dalsa=disabled ;; >>>> --enable-attr) printf "%s" -Dattr=enabled ;; >>>> diff --git a/tests/docker/dockerfiles/debian-amd64.docker >>>> b/tests/docker/dockerfiles/debian-amd64.docker >>>> index e39871c7bb..207f7adfb9 100644 >>>> --- a/tests/docker/dockerfiles/debian-amd64.docker >>>> +++ b/tests/docker/dockerfiles/debian-amd64.docker >>>> @@ -97,6 +97,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \ >>>> libvirglrenderer-dev \ >>>> libvte-2.91-dev \ >>>> libxen-dev \ >>>> + libxdp-dev \ >>>> libzstd-dev \ >>>> 
llvm \ >>>> locales \ >>>> -- >>>> 2.40.1 >>>> >>> >> >
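
For reference, the 'sock-fds' format discussed above (a colon-separated list, one fd per queue, in queue order) could be validated roughly as in the sketch below. This is only an illustration and not the code from the patch: parse_sock_fds() is a hypothetical helper name, and it assumes that the number of descriptors has to match 'queues' exactly.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not taken from the patch): parse a colon-separated
 * list of file descriptors, e.g. "15:16:17", into fds[0 .. n_queues - 1].
 * Assumes exactly one descriptor per queue, in queue order.
 * Returns 0 on success, -1 on malformed input or a count mismatch.
 */
static int parse_sock_fds(const char *str, int *fds, int n_queues)
{
    const char *p = str;
    int i;

    assert(str != NULL && fds != NULL);

    if (n_queues <= 0) {
        return -1;
    }

    for (i = 0; i < n_queues; i++) {
        char *end;
        long val;

        errno = 0;
        val = strtol(p, &end, 10);
        if (end == p || errno != 0 || val < 0 || val > INT_MAX) {
            return -1;      /* empty field, overflow or negative value */
        }
        fds[i] = (int)val;

        if (i + 1 < n_queues) {
            if (*end != ':') {
                return -1;  /* fewer descriptors than queues */
            }
            p = end + 1;
        } else if (*end != '\0') {
            return -1;      /* trailing garbage or too many descriptors */
        }
    }

    return 0;
}
```

With queues=3 and sock-fds=15:16:17, as in the documentation example, this would fill fds with 15, 16 and 17. A list with fewer or more entries than queues is rejected.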