On 7/25/23 08:55, Jason Wang wrote:
> On Thu, Jul 20, 2023 at 9:26 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 7/20/23 09:37, Jason Wang wrote:
>>> On Thu, Jul 6, 2023 at 4:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>
>>>> AF_XDP is a network socket family that allows communication directly
>>>> with the network device driver in the kernel, bypassing most or all
>>>> of the kernel networking stack.  In essence, the technology is
>>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>>>> and works with any network interfaces without driver modifications.
>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>>> require access to character devices or unix sockets.  Only access to
>>>> the network interface itself is necessary.
>>>>
>>>> This patch implements a network backend that communicates with the
>>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>>>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
>>>> Fill and Completion) are placed in that memory along with a pool of
>>>> memory buffers for the packet data.  Data transmission is done by
>>>> allocating one of the buffers, copying packet data into it and
>>>> placing the pointer into Tx ring.  After transmission, device will
>>>> return the buffer via Completion ring.  On Rx, device will take
>>>> a buffer from a pre-populated Fill ring, write the packet data into
>>>> it and place the buffer into Rx ring.
>>>>
>>>> AF_XDP network backend takes on the communication with the host
>>>> kernel and the network interface and forwards packets to/from the
>>>> peer device in QEMU.
>>>>
>>>> Usage example:
>>>>
>>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>>
>>>> XDP program bridges the socket with a network interface.  It can be
>>>> attached to the interface in 2 different modes:
>>>>
>>>> 1. skb - this mode should work for any interface and doesn't require
>>>>          driver support.  With a caveat of lower performance.
>>>>
>>>> 2. native - this does require support from the driver and allows to
>>>>             bypass skb allocation in the kernel and potentially use
>>>>             zero-copy while getting packets in/out userspace.
>>>>
>>>> By default, QEMU will try to use native mode and fall back to skb.
>>>> Mode can be forced via 'mode' option.  To force 'copy' even in native
>>>> mode, use 'force-copy=on' option.  This might be useful if there is
>>>> some issue with the driver.
>>>>
>>>> Option 'queues=N' allows to specify how many device queues should
>>>> be open.  Note that all the queues that are not open are still
>>>> functional and can receive traffic, but it will not be delivered to
>>>> QEMU.  So, the number of device queues should generally match the
>>>> QEMU configuration, unless the device is shared with something
>>>> else and the traffic re-direction to appropriate queues is correctly
>>>> configured on a device level (e.g. with ethtool -N).
>>>> 'start-queue=M' option can be used to specify from which queue id
>>>> QEMU should start configuring 'N' queues.  It might also be necessary
>>>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
>>>> for examples.
>>>>
>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>>>> capabilities in order to load default XSK/XDP programs to the
>>>> network interface and configure BPF maps.  It is possible, however,
>>>> to run with no capabilities.  For that to work, an external process
>>>> with admin capabilities will need to pre-load default XSK program,
>>>> create AF_XDP sockets and pass their file descriptors to QEMU process
>>>> on startup via 'sock-fds' option.  Network backend will need to be
>>>> configured with 'inhibit=on' to avoid loading of the program.
>>>> QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
>>>> or CAP_IPC_LOCK.
>>>>
>>>> Alternatively, the file descriptor for 'xsks_map' can be passed via
>>>> 'xsks-map-fd=N' option instead of passing socket file descriptors.
>>>> That will additionally require CAP_NET_RAW on QEMU side.  This is
>>>> useful, because 'sock-fds' may not be available with older libxdp.
>>>> 'sock-fds' requires libxdp >= 1.4.0.
>>>>
>>>> There are a few performance challenges with the current network backends.
>>>>
>>>> First is that they do not support IO threads.  This means that data
>>>> path is handled by the main thread in QEMU and may slow down other
>>>> work or may be slowed down by some other work.  This also means that
>>>> taking advantage of multi-queue is generally not possible today.
>>>>
>>>> Another thing is that data path is going through the device emulation
>>>> code, which is not really optimized for performance.  The fastest
>>>> "frontend" device is virtio-net.  But it's not optimized for heavy
>>>> traffic either, because it expects such use-cases to be handled via
>>>> some implementation of vhost (user, kernel, vdpa).  In practice, we
>>>> have virtio notifications and rcu lock/unlock on a per-packet basis
>>>> and not very efficient accesses to the guest memory.  Communication
>>>> channels between backend and frontend devices do not allow passing
>>>> more than one packet at a time as well.
>>>>
>>>> Some of these challenges can be avoided in the future by adding better
>>>> batching into device emulation or by implementing vhost-af-xdp variant.
>>>>
>>>> There are also a few kernel limitations.  AF_XDP sockets do not
>>>> support any kinds of checksum or segmentation offloading.  Buffers
>>>> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
>>>> support implementation for AF_XDP is in progress, but not ready yet.
>>>> Also, transmission in all non-zero-copy modes is synchronous, i.e.
>>>> done in a syscall.  That doesn't allow high packet rates on virtual
>>>> interfaces.
>>>>
>>>> However, keeping in mind all of these challenges, current implementation
>>>> of the AF_XDP backend shows a decent performance while running on top
>>>> of a physical NIC with zero-copy support.
>>>>
>>>> Test setup:
>>>>
>>>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
>>>> Network backend is configured to open the NIC directly in native mode.
>>>> The driver supports zero-copy.  NIC is configured to use 1 queue.
>>>>
>>>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
>>>> for PPS testing.
>>>>
>>>> iperf3 result:
>>>>  TCP stream      : 19.1 Gbps
>>>>
>>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>>>  Tx only         : 3.4 Mpps
>>>>  Rx only         : 2.0 Mpps
>>>>  L2 FWD Loopback : 1.5 Mpps
>>>>
>>>> In skb mode the same setup shows much lower performance, similar to
>>>> the setup where pair of physical NICs is replaced with veth pair:
>>>>
>>>> iperf3 result:
>>>>   TCP stream      : 9 Gbps
>>>>
>>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>>>   Tx only         : 1.2 Mpps
>>>>   Rx only         : 1.0 Mpps
>>>>   L2 FWD Loopback : 0.7 Mpps
>>>>
>>>> Results in skb mode or over the veth are close to results of a tap
>>>> backend with vhost=on and disabled segmentation offloading bridged
>>>> with a NIC.
>>>>
>>>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>>>
>>> Looks good overall, see few comments inline.
>>
>> Thanks for review!
>>
>>>
>>>> ---
>>>>
>>>> Version 2:
>>>>
>>>>   - Added support for running with no capabilities by passing
>>>>     pre-created AF_XDP socket file descriptors via 'sock-fds' option.
>>>>     Conditionally compiled because it requires unreleased libxdp 1.4.0.
>>>>     The last restriction is having 32 MB of RLIMIT_MEMLOCK per queue.
>>>>
>>>>   - Refined and extended documentation.
>>>>
>>>>
>>>>  MAINTAINERS                                   |   4 +
>>>>  hmp-commands.hx                               |   2 +-
>>>>  meson.build                                   |  19 +
>>>>  meson_options.txt                             |   2 +
>>>>  net/af-xdp.c                                  | 570 ++++++++++++++++++
>>>>  net/clients.h                                 |   5 +
>>>>  net/meson.build                               |   3 +
>>>>  net/net.c                                     |   6 +
>>>>  qapi/net.json                                 |  60 +-
>>>>  qemu-options.hx                               |  83 ++-
>>>>  .../ci/org.centos/stream/8/x86_64/configure   |   1 +
>>>>  scripts/meson-buildoptions.sh                 |   3 +
>>>>  tests/docker/dockerfiles/debian-amd64.docker  |   1 +
>>>>  13 files changed, 756 insertions(+), 3 deletions(-)
>>>>  create mode 100644 net/af-xdp.c
>>>>
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index 7164cf55a1..80d4ba4004 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -2929,6 +2929,10 @@ W: http://info.iet.unipi.it/~luigi/netmap/
>>>>  S: Maintained
>>>>  F: net/netmap.c
>>>>
>>>> +AF_XDP network backend
>>>> +R: Ilya Maximets <i.maxim...@ovn.org>
>>>> +F: net/af-xdp.c
>>>> +
>>>>  Host Memory Backends
>>>>  M: David Hildenbrand <da...@redhat.com>
>>>>  M: Igor Mammedov <imamm...@redhat.com>
>>>> diff --git a/hmp-commands.hx b/hmp-commands.hx
>>>> index 2cbd0f77a0..af9ffe4681 100644
>>>> --- a/hmp-commands.hx
>>>> +++ b/hmp-commands.hx
>>>> @@ -1295,7 +1295,7 @@ ERST
>>>>      {
>>>>          .name       = "netdev_add",
>>>>          .args_type  = "netdev:O",
>>>> -        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|vhost-user"
>>>> +        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|af-xdp|vhost-user"
>>>>  #ifdef CONFIG_VMNET
>>>>                        "|vmnet-host|vmnet-shared|vmnet-bridged"
>>>>  #endif
>>>> diff --git a/meson.build b/meson.build
>>>> index a9ba0bfab3..1f8772ea5d 100644
>>>> --- a/meson.build
>>>> +++ b/meson.build
>>>> @@ -1891,6 +1891,18 @@ if libbpf.found() and not cc.links('''
>>>>    endif
>>>>  endif
>>>>
>>>> +# libxdp
>>>> +libxdp = dependency('libxdp', required: get_option('af_xdp'), method: 'pkg-config')
>>>> +if libxdp.found() and \
>>>> +      not (libbpf.found() and libbpf.version().version_compare('>=0.7'))
>>>> +  libxdp = not_found
>>>> +  if get_option('af_xdp').enabled()
>>>> +    error('af-xdp support requires libbpf version >= 0.7')
>>>
>>> Can we simply limit this to 1.4?
>>
>> This is a check for libbpf, not libxdp.  Or do you think there is no need
>> to check libbpf version if we request libxdp version high enough?
>> Users may still break the build by installing old libbpf manually even if
>> distributions ship more modern versions.
>>
>> Or do you mean limit the libxdp version to 1.4 in order to avoid conditional
>> on HAVE_XSK_UMEM__CREATE_WITH_FD ?
> 
> Yes.
> 
>> The problem with that is that libxdp 1.4 is a week+ old, so not available in
>> any distribution, AFAIK.  Not sure how big of a problem that is though.
> 
> It doesn't matter as this is a brand new backend, it would simplify
> future maintenance if we can get rid of any HAVE_XXX macros.

OK, makes sense.  I'll require libxdp 1.4 and remove all ifdefs.
I'll also remove the xsks-map-fd configuration, since it will always be
possible to just use sock-fds instead.  We may add it back later,
but it requires extra privileges (NET_RAW), so I'm not sure there
is much value in it.
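
With xsks-map-fd gone and no ifdefs, the option validation should
collapse to a single check.  A rough sketch of what I have in mind,
using a trimmed-down hypothetical struct instead of the real
NetdevAFXDPOptions (the names opts_sketch and check_opts are made up
for illustration, and the final v3 code may differ):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down view of the options; not the QAPI type. */
struct opts_sketch {
    bool has_inhibit;
    bool inhibit;
    const char *sock_fds;   /* NULL when 'sock-fds' is not given. */
};

/* Returns NULL if the combination is valid, an error message otherwise. */
static const char *check_opts(const struct opts_sketch *o)
{
    bool inhibit = o->has_inhibit && o->inhibit;

    /* 'inhibit=on' and 'sock-fds' only make sense together. */
    if (inhibit != (o->sock_fds != NULL)) {
        return "'inhibit=on' requires 'sock-fds' and vice versa";
    }
    return NULL;
}
```

The point is only that the two options are either both present or both
absent; the fd parsing itself stays as in parse_socket_fds().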

> 
>>
>>>
>>>
>>>> +  else
>>>> +    warning('af-xdp support requires libbpf version >= 0.7, disabling')
>>>> +  endif
>>>> +endif
>>>> +
>>>>  # libdw
>>>>  libdw = not_found
>>>>  if not get_option('libdw').auto() or \
>>>> @@ -2112,6 +2124,12 @@ config_host_data.set('CONFIG_HEXAGON_IDEF_PARSER', get_option('hexagon_idef_pars
>>>>  config_host_data.set('CONFIG_LIBATTR', have_old_libattr)
>>>>  config_host_data.set('CONFIG_LIBCAP_NG', libcap_ng.found())
>>>>  config_host_data.set('CONFIG_EBPF', libbpf.found())
>>>> +config_host_data.set('CONFIG_AF_XDP', libxdp.found())
>>>> +if libxdp.found()
>>>> +  config_host_data.set('HAVE_XSK_UMEM__CREATE_WITH_FD',
>>>> +                       cc.has_function('xsk_umem__create_with_fd',
>>>> +                                       dependencies: libxdp))
>>>> +endif
>>>>  config_host_data.set('CONFIG_LIBDAXCTL', libdaxctl.found())
>>>>  config_host_data.set('CONFIG_LIBISCSI', libiscsi.found())
>>>>  config_host_data.set('CONFIG_LIBNFS', libnfs.found())
>>>> @@ -4285,6 +4303,7 @@ summary_info += {'PVRDMA support':    have_pvrdma}
>>>>  summary_info += {'fdt support':       fdt_opt == 'disabled' ? false : fdt_opt}
>>>>  summary_info += {'libcap-ng support': libcap_ng}
>>>>  summary_info += {'bpf support':       libbpf}
>>>> +summary_info += {'AF_XDP support':    libxdp}
>>>>  summary_info += {'rbd support':       rbd}
>>>>  summary_info += {'smartcard support': cacard}
>>>>  summary_info += {'U2F support':       u2f}
>>>> diff --git a/meson_options.txt b/meson_options.txt
>>>> index bbb5c7e886..f4e950ce6a 100644
>>>> --- a/meson_options.txt
>>>> +++ b/meson_options.txt
>>>> @@ -120,6 +120,8 @@ option('avx512bw', type: 'feature', value: 'auto',
>>>>  option('keyring', type: 'feature', value: 'auto',
>>>>         description: 'Linux keyring support')
>>>>
>>>> +option('af_xdp', type : 'feature', value : 'auto',
>>>> +       description: 'AF_XDP network backend support')
>>>>  option('attr', type : 'feature', value : 'auto',
>>>>         description: 'attr/xattr support')
>>>>  option('auth_pam', type : 'feature', value : 'auto',
>>>> diff --git a/net/af-xdp.c b/net/af-xdp.c
>>>> new file mode 100644
>>>> index 0000000000..265ba6b12e
>>>> --- /dev/null
>>>> +++ b/net/af-xdp.c
>>>> @@ -0,0 +1,570 @@
>>>> +/*
>>>> + * AF_XDP network backend.
>>>> + *
>>>> + * Copyright (c) 2023 Red Hat, Inc.
>>>> + *
>>>> + * Authors:
>>>> + *  Ilya Maximets <i.maxim...@ovn.org>
>>>> + *
>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>> + * See the COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include <bpf/bpf.h>
>>>> +#include <inttypes.h>
>>>> +#include <linux/if_link.h>
>>>> +#include <linux/if_xdp.h>
>>>> +#include <net/if.h>
>>>> +#include <xdp/xsk.h>
>>>> +
>>>> +#include "clients.h"
>>>> +#include "monitor/monitor.h"
>>>> +#include "net/net.h"
>>>> +#include "qapi/error.h"
>>>> +#include "qemu/cutils.h"
>>>> +#include "qemu/error-report.h"
>>>> +#include "qemu/iov.h"
>>>> +#include "qemu/main-loop.h"
>>>> +#include "qemu/memalign.h"
>>>> +
>>>> +
>>>> +typedef struct AFXDPState {
>>>> +    NetClientState       nc;
>>>> +
>>>> +    struct xsk_socket    *xsk;
>>>> +    struct xsk_ring_cons rx;
>>>> +    struct xsk_ring_prod tx;
>>>> +    struct xsk_ring_cons cq;
>>>> +    struct xsk_ring_prod fq;
>>>> +
>>>> +    char                 ifname[IFNAMSIZ];
>>>> +    int                  ifindex;
>>>> +    bool                 read_poll;
>>>> +    bool                 write_poll;
>>>> +    uint32_t             outstanding_tx;
>>>> +
>>>> +    uint64_t             *pool;
>>>> +    uint32_t             n_pool;
>>>> +    char                 *buffer;
>>>> +    struct xsk_umem      *umem;
>>>> +
>>>> +    uint32_t             n_queues;
>>>> +    uint32_t             xdp_flags;
>>>> +    bool                 inhibit;
>>>> +} AFXDPState;
>>>> +
>>>> +#define AF_XDP_BATCH_SIZE 64
>>>> +
>>>> +static void af_xdp_send(void *opaque);
>>>> +static void af_xdp_writable(void *opaque);
>>>> +
>>>> +/* Set the event-loop handlers for the af-xdp backend. */
>>>> +static void af_xdp_update_fd_handler(AFXDPState *s)
>>>> +{
>>>> +    qemu_set_fd_handler(xsk_socket__fd(s->xsk),
>>>> +                        s->read_poll ? af_xdp_send : NULL,
>>>> +                        s->write_poll ? af_xdp_writable : NULL,
>>>> +                        s);
>>>> +}
>>>> +
>>>> +/* Update the read handler. */
>>>> +static void af_xdp_read_poll(AFXDPState *s, bool enable)
>>>> +{
>>>> +    if (s->read_poll != enable) {
>>>> +        s->read_poll = enable;
>>>> +        af_xdp_update_fd_handler(s);
>>>> +    }
>>>> +}
>>>> +
>>>> +/* Update the write handler. */
>>>> +static void af_xdp_write_poll(AFXDPState *s, bool enable)
>>>> +{
>>>> +    if (s->write_poll != enable) {
>>>> +        s->write_poll = enable;
>>>> +        af_xdp_update_fd_handler(s);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void af_xdp_poll(NetClientState *nc, bool enable)
>>>> +{
>>>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>>>> +
>>>> +    if (s->read_poll != enable || s->write_poll != enable) {
>>>> +        s->write_poll = enable;
>>>> +        s->read_poll  = enable;
>>>> +        af_xdp_update_fd_handler(s);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void af_xdp_complete_tx(AFXDPState *s)
>>>> +{
>>>> +    uint32_t idx = 0;
>>>> +    uint32_t done, i;
>>>> +    uint64_t *addr;
>>>> +
>>>> +    done = xsk_ring_cons__peek(&s->cq, XSK_RING_CONS__DEFAULT_NUM_DESCS, &idx);
>>>> +
>>>> +    for (i = 0; i < done; i++) {
>>>> +        addr = (void *) xsk_ring_cons__comp_addr(&s->cq, idx++);
>>>> +        s->pool[s->n_pool++] = *addr;
>>>> +        s->outstanding_tx--;
>>>> +    }
>>>> +
>>>> +    if (done) {
>>>> +        xsk_ring_cons__release(&s->cq, done);
>>>> +    }
>>>> +}
>>>> +
>>>> +/*
>>>> + * The fd_write() callback, invoked if the fd is marked as writable
>>>> + * after a poll.
>>>> + */
>>>> +static void af_xdp_writable(void *opaque)
>>>> +{
>>>> +    AFXDPState *s = opaque;
>>>> +
>>>> +    /* Try to recover buffers that are already sent. */
>>>> +    af_xdp_complete_tx(s);
>>>> +
>>>> +    /*
>>>> +     * Unregister the handler, unless we still have packets to transmit
>>>> +     * and kernel needs a wake up.
>>>> +     */
>>>> +    if (!s->outstanding_tx || !xsk_ring_prod__needs_wakeup(&s->tx)) {
>>>> +        af_xdp_write_poll(s, false);
>>>> +    }
>>>> +
>>>> +    /* Flush any buffered packets. */
>>>> +    qemu_flush_queued_packets(&s->nc);
>>>> +}
>>>> +
>>>> +static ssize_t af_xdp_receive(NetClientState *nc,
>>>> +                              const uint8_t *buf, size_t size)
>>>> +{
>>>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>>>> +    struct xdp_desc *desc;
>>>> +    uint32_t idx;
>>>> +    void *data;
>>>> +
>>>> +    /* Try to recover buffers that are already sent. */
>>>> +    af_xdp_complete_tx(s);
>>>> +
>>>> +    if (size > XSK_UMEM__DEFAULT_FRAME_SIZE) {
>>>> +        /* We can't transmit packet this size... */
>>>> +        return size;
>>>> +    }
>>>> +
>>>> +    if (!s->n_pool || !xsk_ring_prod__reserve(&s->tx, 1, &idx)) {
>>>> +        /*
>>>> +         * Out of buffers or space in tx ring.  Poll until we can write.
>>>> +         * This will also kick the Tx, if it was waiting on CQ.
>>>> +         */
>>>> +        af_xdp_write_poll(s, true);
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    desc = xsk_ring_prod__tx_desc(&s->tx, idx);
>>>> +    desc->addr = s->pool[--s->n_pool];
>>>> +    desc->len = size;
>>>> +
>>>> +    data = xsk_umem__get_data(s->buffer, desc->addr);
>>>> +    memcpy(data, buf, size);
>>>> +
>>>> +    xsk_ring_prod__submit(&s->tx, 1);
>>>> +    s->outstanding_tx++;
>>>> +
>>>> +    if (xsk_ring_prod__needs_wakeup(&s->tx)) {
>>>> +        af_xdp_write_poll(s, true);
>>>> +    }
>>>> +
>>>> +    return size;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Complete a previous send (backend --> guest) and enable the
>>>> + * fd_read callback.
>>>> + */
>>>> +static void af_xdp_send_completed(NetClientState *nc, ssize_t len)
>>>> +{
>>>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>>>> +
>>>> +    af_xdp_read_poll(s, true);
>>>> +}
>>>> +
>>>> +static void af_xdp_fq_refill(AFXDPState *s, uint32_t n)
>>>> +{
>>>> +    uint32_t i, idx = 0;
>>>> +
>>>> +    /* Leave one packet for Tx, just in case. */
>>>> +    if (s->n_pool < n + 1) {
>>>> +        n = s->n_pool;
>>>> +    }
>>>> +
>>>> +    if (!n || !xsk_ring_prod__reserve(&s->fq, n, &idx)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    for (i = 0; i < n; i++) {
>>>> +        *xsk_ring_prod__fill_addr(&s->fq, idx++) = s->pool[--s->n_pool];
>>>> +    }
>>>> +    xsk_ring_prod__submit(&s->fq, n);
>>>> +
>>>> +    if (xsk_ring_prod__needs_wakeup(&s->fq)) {
>>>> +        /* Receive was blocked by not having enough buffers.  Wake it up. */
>>>> +        af_xdp_read_poll(s, true);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void af_xdp_send(void *opaque)
>>>> +{
>>>> +    uint32_t i, n_rx, idx = 0;
>>>> +    AFXDPState *s = opaque;
>>>> +
>>>> +    n_rx = xsk_ring_cons__peek(&s->rx, AF_XDP_BATCH_SIZE, &idx);
>>>> +    if (!n_rx) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    for (i = 0; i < n_rx; i++) {
>>>> +        const struct xdp_desc *desc;
>>>> +        struct iovec iov;
>>>> +
>>>> +        desc = xsk_ring_cons__rx_desc(&s->rx, idx++);
>>>> +
>>>> +        iov.iov_base = xsk_umem__get_data(s->buffer, desc->addr);
>>>> +        iov.iov_len = desc->len;
>>>> +
>>>> +        s->pool[s->n_pool++] = desc->addr;
>>>> +
>>>> +        if (!qemu_sendv_packet_async(&s->nc, &iov, 1,
>>>> +                                     af_xdp_send_completed)) {
>>>> +            /*
>>>> +             * The peer does not receive anymore.  Packet is queued, stop
>>>> +             * reading from the backend until af_xdp_send_completed().
>>>> +             */
>>>> +            af_xdp_read_poll(s, false);
>>>> +
>>>> +            /* Re-peek the descriptors to not break the ring cache. */
>>>> +            xsk_ring_cons__cancel(&s->rx, n_rx);
>>>> +            n_rx = xsk_ring_cons__peek(&s->rx, i + 1, &idx);
>>>
>>> The code turns out to be hard to read here.
>>>
>>> 1) This seems to undo the peek (usually a peek doesn't touch the
>>> producer/consumer state, but that is not what xsk_ring_cons__peek() does):
>>
>> Yeah, it's unfortunate that the peek() function changes the internal
>> state, but that is what we have...
>>
>>>
>>> static inline __u32 xsk_ring_cons__peek(struct xsk_ring_cons *cons,
>>>                                         __u32 nb, __u32 *idx)
>>> {
>>>         __u32 entries = xsk_cons_nb_avail(cons, nb);
>>>
>>>         if (entries > 0) {
>>>                 *idx = cons->cached_cons;
>>>                 cons->cached_cons += entries;
>>>         }
>>>
>>>         return entries;
>>> }
>>>
>>> 2) It looks to me a partial rollback is sufficient?
>>>
>>> xsk_ring_cons__cancel(n_rx - i + 1)?
>>
>> Good point.  Should work.  It should be n_rx - i - 1 though, if I'm not
>> mistaken.  So:
>>
>>     xsk_ring_cons__cancel(&s->rx, n_rx - i - 1);
>>     n_rx = i + 1;
>>
>> I'm not sure if that is much easier to read, but that's OK.  Should be
>> a touch faster as well.  What do you think?
> 
> Let's do that please.

OK, Sure.  Seems to work fine.

I'll post v3 soon with this and other discussed changes.
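
For the record, the arithmetic behind the partial rollback can be
checked with a small stand-alone model.  This is only a sketch: the
mock_cons type and the mock_* helpers below are hypothetical stand-ins
that mimic the cached consumer index, not the libxdp structures, and
'stop_at' simulates the packet at which the peer stops receiving (that
packet is still queued, so it counts as consumed):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the consumer side of an xsk ring. */
struct mock_cons {
    uint32_t cached_prod;   /* Entries made available by the producer. */
    uint32_t cached_cons;   /* Entries handed out by peek(). */
    uint32_t consumer;      /* Entries released back to the producer. */
};

/* Like xsk_ring_cons__peek(): advances the cached consumer index. */
static uint32_t mock_peek(struct mock_cons *r, uint32_t nb, uint32_t *idx)
{
    uint32_t avail = r->cached_prod - r->cached_cons;
    uint32_t entries = avail < nb ? avail : nb;

    if (entries > 0) {
        *idx = r->cached_cons;
        r->cached_cons += entries;
    }
    return entries;
}

/* Give back 'nb' peeked but unprocessed entries. */
static void mock_cancel(struct mock_cons *r, uint32_t nb)
{
    r->cached_cons -= nb;
}

/* Mark 'nb' entries as fully consumed. */
static void mock_release(struct mock_cons *r, uint32_t nb)
{
    r->consumer += nb;
}

/* Process one batch; the peer stops accepting at packet 'stop_at'. */
static uint32_t drain(struct mock_cons *r, uint32_t batch, uint32_t stop_at)
{
    uint32_t i, idx = 0;
    uint32_t n_rx = mock_peek(r, batch, &idx);

    for (i = 0; i < n_rx; i++) {
        if (i == stop_at) {
            /* Packets 0..i are consumed, roll back only the rest. */
            mock_cancel(r, n_rx - i - 1);
            n_rx = i + 1;
            break;
        }
    }

    mock_release(r, n_rx);
    /* Nothing peeked may be left unreleased after the rollback. */
    assert(r->cached_cons == r->consumer);
    return n_rx;
}
```

With 10 entries available, a batch of 8 and the peer stopping at packet
index 2, the peek hands out 8, the cancel returns 8 - 2 - 1 = 5, and 3
entries are released, which leaves cached_cons equal to the released
count.  With 'n_rx - i + 1' the rollback would return two entries too
many.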

> 
>>
>>>
>>>> +            g_assert(n_rx == i + 1);
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* Release actually sent descriptors and try to re-fill. */
>>>> +    xsk_ring_cons__release(&s->rx, n_rx);
>>>> +    af_xdp_fq_refill(s, AF_XDP_BATCH_SIZE);
>>>> +}
>>>> +
>>>> +/* Flush and close. */
>>>> +static void af_xdp_cleanup(NetClientState *nc)
>>>> +{
>>>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>>>> +
>>>> +    qemu_purge_queued_packets(nc);
>>>> +
>>>> +    af_xdp_poll(nc, false);
>>>> +
>>>> +    xsk_socket__delete(s->xsk);
>>>> +    s->xsk = NULL;
>>>> +    g_free(s->pool);
>>>> +    s->pool = NULL;
>>>> +    xsk_umem__delete(s->umem);
>>>> +    s->umem = NULL;
>>>> +    qemu_vfree(s->buffer);
>>>> +    s->buffer = NULL;
>>>> +
>>>> +    /* Remove the program if it's the last open queue. */
>>>> +    if (!s->inhibit && nc->queue_index == s->n_queues - 1 && s->xdp_flags
>>>> +        && bpf_xdp_detach(s->ifindex, s->xdp_flags, NULL) != 0) {
>>>> +        fprintf(stderr,
>>>> +                "af-xdp: unable to remove XDP program from '%s', ifindex: %d\n",
>>>> +                s->ifname, s->ifindex);
>>>> +    }
>>>> +}
>>>> +
>>>> +static int af_xdp_umem_create(AFXDPState *s, int sock_fd, Error **errp)
>>>> +{
>>>> +    struct xsk_umem_config config = {
>>>> +        .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
>>>> +        .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
>>>> +        .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
>>>> +        .frame_headroom = 0,
>>>> +    };
>>>> +    uint64_t n_descs;
>>>> +    uint64_t size;
>>>> +    int64_t i;
>>>> +    int ret;
>>>> +
>>>> +    /* Number of descriptors if all 4 queues (rx, tx, cq, fq) are full. */
>>>> +    n_descs = (XSK_RING_PROD__DEFAULT_NUM_DESCS
>>>> +               + XSK_RING_CONS__DEFAULT_NUM_DESCS) * 2;
>>>> +    size = n_descs * XSK_UMEM__DEFAULT_FRAME_SIZE;
>>>> +
>>>> +    s->buffer = qemu_memalign(qemu_real_host_page_size(), size);
>>>> +    memset(s->buffer, 0, size);
>>>> +
>>>> +    if (sock_fd < 0) {
>>>> +        ret = xsk_umem__create(&s->umem, s->buffer, size,
>>>> +                               &s->fq, &s->cq, &config);
>>>> +    } else {
>>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>>>> +        ret = xsk_umem__create_with_fd(&s->umem, sock_fd, s->buffer, size,
>>>> +                                       &s->fq, &s->cq, &config);
>>>> +#else
>>>
>>> So sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD won't work. We'd better
>>>
>>> 1) disable sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD
>>
>> The qapi property is conditionally defined, so users will not be able to
>> set sock-fds if not supported.  And qemu will complain about unknown
>> property.  That should be enough?
> 
> Yes.
> 
>>
>>>
>>> or
>>>
>>> 2) disable af_xdp without HAVE_XSK_UMEM__CREATE_WITH_FD
>>
>> If we require libxdp 1.4 that will be the case.
>>
>>>
>>>> +        ret = -1;
>>>> +        errno = EINVAL;
>>>> +#endif
>>>> +    }
>>>> +
>>>> +    if (ret) {
>>>> +        qemu_vfree(s->buffer);
>>>> +        error_setg_errno(errp, errno,
>>>> +                         "failed to create umem for %s queue_index: %d",
>>>> +                         s->ifname, s->nc.queue_index);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    s->pool = g_new(uint64_t, n_descs);
>>>> +    /* Fill the pool in the opposite order, because it's a LIFO queue. */
>>>> +    for (i = n_descs - 1; i >= 0; i--) {
>>>> +        s->pool[i] = i * XSK_UMEM__DEFAULT_FRAME_SIZE;
>>>> +    }
>>>> +    s->n_pool = n_descs;
>>>> +
>>>> +    af_xdp_fq_refill(s, XSK_RING_PROD__DEFAULT_NUM_DESCS);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int af_xdp_socket_create(AFXDPState *s,
>>>> +                                const NetdevAFXDPOptions *opts,
>>>> +                                int xsks_map_fd, Error **errp)
>>>> +{
>>>> +    struct xsk_socket_config cfg = {
>>>> +        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
>>>> +        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
>>>> +        .libxdp_flags = 0,
>>>> +        .bind_flags = XDP_USE_NEED_WAKEUP,
>>>> +        .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
>>>> +    };
>>>> +    int queue_id, error = 0;
>>>> +
>>>> +    s->inhibit = opts->has_inhibit && opts->inhibit;
>>>> +    if (s->inhibit) {
>>>> +        cfg.libxdp_flags |= XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD;
>>>> +    }
>>>> +
>>>> +    if (opts->has_force_copy && opts->force_copy) {
>>>> +        cfg.bind_flags |= XDP_COPY;
>>>> +    }
>>>> +
>>>> +    queue_id = s->nc.queue_index;
>>>> +    if (opts->has_start_queue && opts->start_queue > 0) {
>>>> +        queue_id += opts->start_queue;
>>>> +    }
>>>> +
>>>> +    if (opts->has_mode) {
>>>> +        /* Specific mode requested. */
>>>> +        cfg.xdp_flags |= (opts->mode == AFXDP_MODE_NATIVE)
>>>> +                         ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE;
>>>> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
>>>> +                               s->umem, &s->rx, &s->tx, &cfg)) {
>>>> +            error = errno;
>>>> +        }
>>>> +    } else {
>>>> +        /* No mode requested, try native first. */
>>>> +        cfg.xdp_flags |= XDP_FLAGS_DRV_MODE;
>>>> +
>>>> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
>>>> +                               s->umem, &s->rx, &s->tx, &cfg)) {
>>>> +            /* Can't use native mode, try skb. */
>>>> +            cfg.xdp_flags &= ~XDP_FLAGS_DRV_MODE;
>>>> +            cfg.xdp_flags |= XDP_FLAGS_SKB_MODE;
>>>> +
>>>> +            if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
>>>> +                                   s->umem, &s->rx, &s->tx, &cfg)) {
>>>> +                error = errno;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (error) {
>>>> +        error_setg_errno(errp, error,
>>>> +                         "failed to create AF_XDP socket for %s queue_id: %d",
>>>> +                         s->ifname, queue_id);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (s->inhibit && xsks_map_fd >= 0) {
>>>> +        int xsk_fd = xsk_socket__fd(s->xsk);
>>>> +
>>>> +        /* Need to update the map manually, libxdp skipped that step. */
>>>> +        error = bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0);
>>>> +        if (error) {
>>>> +            error_setg_errno(errp, error,
>>>> +                             "failed to update xsks map for %s queue_id: %d",
>>>> +                             s->ifname, queue_id);
>>>> +            return -1;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    s->xdp_flags = cfg.xdp_flags;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +/* NetClientInfo methods. */
>>>> +static NetClientInfo net_af_xdp_info = {
>>>> +    .type = NET_CLIENT_DRIVER_AF_XDP,
>>>> +    .size = sizeof(AFXDPState),
>>>> +    .receive = af_xdp_receive,
>>>> +    .poll = af_xdp_poll,
>>>> +    .cleanup = af_xdp_cleanup,
>>>> +};
>>>> +
>>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>>>> +static int *parse_socket_fds(const char *sock_fds_str,
>>>> +                             int64_t n_expected, Error **errp)
>>>> +{
>>>> +    gchar **substrings = g_strsplit(sock_fds_str, ":", -1);
>>>> +    int64_t i, n_sock_fds = g_strv_length(substrings);
>>>> +    int *sock_fds = NULL;
>>>> +
>>>> +    if (n_sock_fds != n_expected) {
>>>> +        error_setg(errp, "expected %"PRIi64" socket fds, got %"PRIi64,
>>>> +                   n_expected, n_sock_fds);
>>>> +        goto exit;
>>>> +    }
>>>> +
>>>> +    sock_fds = g_new(int, n_sock_fds);
>>>> +
>>>> +    for (i = 0; i < n_sock_fds; i++) {
>>>> +        sock_fds[i] = monitor_fd_param(monitor_cur(), substrings[i], errp);
>>>> +        if (sock_fds[i] < 0) {
>>>> +            g_free(sock_fds);
>>>> +            sock_fds = NULL;
>>>> +            goto exit;
>>>> +        }
>>>> +    }
>>>> +
>>>> +exit:
>>>> +    g_strfreev(substrings);
>>>> +    return sock_fds;
>>>> +}
>>>> +#endif
>>>> +
>>>> +/*
>>>> + * The exported init function.
>>>> + *
>>>> + * ... -net af-xdp,ifname="..."
>>>
>>> This is the legacy command line, let's say -netdev af-xdp,...
>>
>> Sure.
>>
>>>
>>>> + */
>>>> +int net_init_af_xdp(const Netdev *netdev,
>>>> +                    const char *name, NetClientState *peer, Error **errp)
>>>> +{
>>>> +    const NetdevAFXDPOptions *opts = &netdev->u.af_xdp;
>>>> +    NetClientState *nc, *nc0 = NULL;
>>>> +    unsigned int ifindex;
>>>> +    uint32_t prog_id = 0;
>>>> +    int *sock_fds = NULL;
>>>> +    int xsks_map_fd = -1;
>>>> +    int64_t i, queues;
>>>> +    Error *err = NULL;
>>>> +    AFXDPState *s;
>>>> +
>>>> +    ifindex = if_nametoindex(opts->ifname);
>>>> +    if (!ifindex) {
>>>> +        error_setg_errno(errp, errno, "failed to get ifindex for '%s'",
>>>> +                         opts->ifname);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    queues = opts->has_queues ? opts->queues : 1;
>>>> +    if (queues < 1) {
>>>> +        error_setg(errp, "invalid number of queues (%" PRIi64 ") for '%s'",
>>>> +                   queues, opts->ifname);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +#ifndef HAVE_XSK_UMEM__CREATE_WITH_FD
>>>> +    if ((opts->has_inhibit && opts->inhibit) != !!opts->xsks_map_fd) {
>>>> +        error_setg(errp, "expected 'inhibit=on' and 'xsks-map-fd' together");
>>>> +        return -1;
>>>> +    }
>>>> +#else
>>>> +    if ((opts->has_inhibit && opts->inhibit)
>>>> +        != (opts->xsks_map_fd || opts->sock_fds)) {
>>>> +        error_setg(errp, "'inhibit=on' should be used together with "
>>>> +                         "'sock-fds' or 'xsks-map-fd'");
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (opts->xsks_map_fd && opts->sock_fds) {
>>>> +        error_setg(errp, "'sock-fds' and 'xsks-map-fd' are mutually 
>>>> exclusive");
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (opts->sock_fds) {
>>>> +        sock_fds = parse_socket_fds(opts->sock_fds, queues, errp);
>>>> +        if (!sock_fds) {
>>>> +            return -1;
>>>> +        }
>>>> +    }
>>>> +#endif
>>>> +
>>>> +    if (opts->xsks_map_fd) {
>>>> +        xsks_map_fd = monitor_fd_param(monitor_cur(), opts->xsks_map_fd, 
>>>> errp);
>>>> +        if (xsks_map_fd < 0) {
>>>> +            goto err;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    for (i = 0; i < queues; i++) {
>>>> +        nc = qemu_new_net_client(&net_af_xdp_info, peer, "af-xdp", name);
>>>> +        qemu_set_info_str(nc, "af-xdp%"PRIi64" to %s", i, opts->ifname);
>>>> +        nc->queue_index = i;
>>>> +
>>>> +        if (!nc0) {
>>>> +            nc0 = nc;
>>>> +        }
>>>> +
>>>> +        s = DO_UPCAST(AFXDPState, nc, nc);
>>>> +
>>>> +        pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname);
>>>> +        s->ifindex = ifindex;
>>>> +        s->n_queues = queues;
>>>> +
>>>> +        if (af_xdp_umem_create(s, sock_fds ? sock_fds[i] : -1, errp)
>>>> +            || af_xdp_socket_create(s, opts, xsks_map_fd, errp)) {
>>>> +            /* Make sure the XDP program will be removed. */
>>>> +            s->n_queues = i;
>>>> +            error_propagate(errp, err);
>>>> +            goto err;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (nc0) {
>>>> +        s = DO_UPCAST(AFXDPState, nc, nc0);
>>>> +        if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || 
>>>> !prog_id) {
>>>> +            error_setg_errno(errp, errno,
>>>> +                             "no XDP program loaded on '%s', ifindex: %d",
>>>> +                             s->ifname, s->ifindex);
>>>> +            goto err;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    af_xdp_read_poll(s, true); /* Initially only poll for reads. */
>>>> +
>>>> +    return 0;
>>>> +
>>>> +err:
>>>> +    g_free(sock_fds);
>>>> +    if (nc0) {
>>>> +        qemu_del_net_client(nc0);
>>>> +    }
>>>> +
>>>> +    return -1;
>>>> +}
>>>> diff --git a/net/clients.h b/net/clients.h
>>>> index ed8bdfff1e..be53794582 100644
>>>> --- a/net/clients.h
>>>> +++ b/net/clients.h
>>>> @@ -64,6 +64,11 @@ int net_init_netmap(const Netdev *netdev, const char 
>>>> *name,
>>>>                      NetClientState *peer, Error **errp);
>>>>  #endif
>>>>
>>>> +#ifdef CONFIG_AF_XDP
>>>> +int net_init_af_xdp(const Netdev *netdev, const char *name,
>>>> +                    NetClientState *peer, Error **errp);
>>>> +#endif
>>>> +
>>>>  int net_init_vhost_user(const Netdev *netdev, const char *name,
>>>>                          NetClientState *peer, Error **errp);
>>>>
>>>> diff --git a/net/meson.build b/net/meson.build
>>>> index bdf564a57b..61628d4684 100644
>>>> --- a/net/meson.build
>>>> +++ b/net/meson.build
>>>> @@ -36,6 +36,9 @@ system_ss.add(when: vde, if_true: files('vde.c'))
>>>>  if have_netmap
>>>>    system_ss.add(files('netmap.c'))
>>>>  endif
>>>> +
>>>> +system_ss.add(when: libxdp, if_true: files('af-xdp.c'))
>>>> +
>>>>  if have_vhost_net_user
>>>>    system_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: 
>>>> files('vhost-user.c'), if_false: files('vhost-user-stub.c'))
>>>>    system_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-user-stub.c'))
>>>> diff --git a/net/net.c b/net/net.c
>>>> index 6492ad530e..127f70932b 100644
>>>> --- a/net/net.c
>>>> +++ b/net/net.c
>>>> @@ -1082,6 +1082,9 @@ static int (* const 
>>>> net_client_init_fun[NET_CLIENT_DRIVER__MAX])(
>>>>  #ifdef CONFIG_NETMAP
>>>>          [NET_CLIENT_DRIVER_NETMAP]    = net_init_netmap,
>>>>  #endif
>>>> +#ifdef CONFIG_AF_XDP
>>>> +        [NET_CLIENT_DRIVER_AF_XDP]    = net_init_af_xdp,
>>>> +#endif
>>>>  #ifdef CONFIG_NET_BRIDGE
>>>>          [NET_CLIENT_DRIVER_BRIDGE]    = net_init_bridge,
>>>>  #endif
>>>> @@ -1186,6 +1189,9 @@ void show_netdevs(void)
>>>>  #ifdef CONFIG_NETMAP
>>>>          "netmap",
>>>>  #endif
>>>> +#ifdef CONFIG_AF_XDP
>>>> +        "af-xdp",
>>>> +#endif
>>>>  #ifdef CONFIG_POSIX
>>>>          "vhost-user",
>>>>  #endif
>>>> diff --git a/qapi/net.json b/qapi/net.json
>>>> index db67501308..88f2c982c2 100644
>>>> --- a/qapi/net.json
>>>> +++ b/qapi/net.json
>>>> @@ -408,6 +408,62 @@
>>>>      'ifname':     'str',
>>>>      '*devname':    'str' } }
>>>>
>>>> +##
>>>> +# @AFXDPMode:
>>>> +#
>>>> +# Attach mode for a default XDP program
>>>> +#
>>>> +# @skb: generic mode, no driver support necessary
>>>> +#
>>>> +# @native: DRV mode, program is attached to a driver, packets are passed 
>>>> to
>>>> +#     the socket without allocation of skb.
>>>> +#
>>>> +# Since: 8.1
>>>
>>> I'd make it for 8.2.
>>
>> OK.
>>
>>>
>>>> +##
>>>> +{ 'enum': 'AFXDPMode',
>>>> +  'data': [ 'native', 'skb' ] }
>>>> +
>>>> +##
>>>> +# @NetdevAFXDPOptions:
>>>> +#
>>>> +# AF_XDP network backend
>>>> +#
>>>> +# @ifname: The name of an existing network interface.
>>>> +#
>>>> +# @mode: Attach mode for a default XDP program.  If not specified, then
>>>> +#     'native' will be tried first, then 'skb'.
>>>> +#
>>>> +# @force-copy: Force XDP copy mode even if device supports zero-copy.
>>>> +#     (default: false)
>>>> +#
>>>> +# @queues: number of queues to be used for multiqueue interfaces 
>>>> (default: 1).
>>>> +#
>>>> +# @start-queue: Use @queues starting from this queue number (default: 0).
>>>> +#
>>>> +# @inhibit: Don't load a default XDP program, use one already loaded to
>>>> +#     the interface (default: false).  Requires @sock-fds or @xsks-map-fd.
>>>> +#
>>>> +# @sock-fds: A colon (:) separated list of file descriptors for already 
>>>> open
>>>> +#     but not bound AF_XDP sockets in the queue order.  One fd per queue.
>>>> +#     These descriptors should already be added into XDP socket map for
>>>> +#     corresponding queues.  Requires @inhibit.
>>>> +#
>>>> +# @xsks-map-fd: A file descriptor for an already open XDP socket map in
>>>> +#     the already loaded XDP program.  Requires @inhibit.
>>>> +#
>>>> +# Since: 8.1
>>>> +##
>>>> +{ 'struct': 'NetdevAFXDPOptions',
>>>> +  'data': {
>>>> +    'ifname':       'str',
>>>> +    '*mode':        'AFXDPMode',
>>>> +    '*force-copy':  'bool',
>>>> +    '*queues':      'int',
>>>> +    '*start-queue': 'int',
>>>> +    '*inhibit':     'bool',
>>>> +    '*sock-fds':    { 'type': 'str', 'if': 
>>>> 'HAVE_XSK_UMEM__CREATE_WITH_FD' },
>>
>> The parameter is defined conditionally here and it will not be
>> compiled in if HAVE_XSK_UMEM__CREATE_WITH_FD is not defined.
> 
> Right, I missed that.
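
For readers following the option handling, here is a minimal standalone
sketch of the 'inhibit' / 'xsks-map-fd' / 'sock-fds' consistency checks
from net_init_af_xdp() quoted above.  The helper name check_inhibit_opts()
and its string-returning interface are hypothetical, chosen only for
illustration; the patch itself reports errors through error_setg().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: returns NULL when the
 * option combination is accepted, or a message describing the problem.
 * 'sock_fds' is only taken into account when
 * HAVE_XSK_UMEM__CREATE_WITH_FD is defined, mirroring the conditional
 * definition of the parameter in qapi/net.json.
 */
static const char *check_inhibit_opts(bool inhibit, const char *xsks_map_fd,
                                      const char *sock_fds)
{
    bool have_fd = xsks_map_fd != NULL;

#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
    have_fd = have_fd || sock_fds != NULL;
    if (xsks_map_fd && sock_fds) {
        return "'sock-fds' and 'xsks-map-fd' are mutually exclusive";
    }
#else
    (void)sock_fds; /* the option does not exist in this build */
#endif

    /* 'inhibit=on' and a provided fd must come together or not at all. */
    if (inhibit != have_fd) {
        return "'inhibit=on' requires an fd option, and vice versa";
    }
    return NULL;
}

/* A few combinations that behave the same in both build variants. */
static void check_examples(void)
{
    assert(check_inhibit_opts(false, NULL, NULL) == NULL);
    assert(check_inhibit_opts(true, "3", NULL) == NULL);
    assert(check_inhibit_opts(true, NULL, NULL) != NULL);
    assert(check_inhibit_opts(false, "3", NULL) != NULL);
}
```

The mutual-exclusion test is done first in this sketch so that the
message is specific; the patch performs the two checks in the opposite
order, which accepts and rejects the same combinations.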
> 
>>
>>>> +    '*xsks-map-fd': 'str' } }
>>>> +
>>>>  ##
>>>>  # @NetdevVhostUserOptions:
>>>>  #
>>>> @@ -642,13 +698,14 @@
>>>>  # @vmnet-bridged: since 7.1
>>>>  # @stream: since 7.2
>>>>  # @dgram: since 7.2
>>>> +# @af-xdp: since 8.1
>>>>  #
>>>>  # Since: 2.7
>>>>  ##
>>>>  { 'enum': 'NetClientDriver',
>>>>    'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'stream',
>>>>              'dgram', 'vde', 'bridge', 'hubport', 'netmap', 'vhost-user',
>>>> -            'vhost-vdpa',
>>>> +            'vhost-vdpa', 'af-xdp',
>>>>              { 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' },
>>>>              { 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' },
>>>>              { 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] }
>>>> @@ -680,6 +737,7 @@
>>>>      'bridge':   'NetdevBridgeOptions',
>>>>      'hubport':  'NetdevHubPortOptions',
>>>>      'netmap':   'NetdevNetmapOptions',
>>>> +    'af-xdp':   'NetdevAFXDPOptions',
>>>>      'vhost-user': 'NetdevVhostUserOptions',
>>>>      'vhost-vdpa': 'NetdevVhostVDPAOptions',
>>>>      'vmnet-host': { 'type': 'NetdevVmnetHostOptions',
>>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>>> index b57489d7ca..d91610701c 100644
>>>> --- a/qemu-options.hx
>>>> +++ b/qemu-options.hx
>>>> @@ -2856,6 +2856,25 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
>>>>      "                VALE port (created on the fly) called 'name' 
>>>> ('nmname' is name of the \n"
>>>>      "                netmap device, defaults to '/dev/netmap')\n"
>>>>  #endif
>>>> +#ifdef CONFIG_AF_XDP
>>>> +    "-netdev 
>>>> af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n"
>>>> +    "         
>>>> [,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k]\n"
>>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>>>> +    "         [,sock-fds=x:y:...:z]\n"
>>>> +#endif
>>>> +    "                attach to the existing network interface 'name' with 
>>>> AF_XDP socket\n"
>>>> +    "                use 'mode=MODE' to specify an XDP program attach 
>>>> mode\n"
>>>> +    "                use 'force-copy=on|off' to force XDP copy mode even 
>>>> if device supports zero-copy (default: off)\n"
>>>> +    "                use 'inhibit=on|off' to inhibit loading of a default 
>>>> XDP program (default: off)\n"
>>>> +    "                with inhibit=on,\n"
>>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>>>> +    "                  use 'sock-fds' to provide file descriptors for 
>>>> already open XDP sockets\n"
>>>> +    "                  added to a socket map in XDP program.  One socket 
>>>> per queue.  Or\n"
>>>> +#endif
>>>> +    "                  use 'xsks-map-fd=k' to provide a file descriptor 
>>>> for xsks map\n"
>>>> +    "                use 'queues=n' to specify how many queues of a 
>>>> multiqueue interface should be used\n"
>>>> +    "                use 'start-queue=m' to specify the first queue that 
>>>> should be used\n"
>>>> +#endif
>>>>  #ifdef CONFIG_POSIX
>>>>      "-netdev vhost-user,id=str,chardev=dev[,vhostforce=on|off]\n"
>>>>      "                configure a vhost-user network, backed by a chardev 
>>>> 'dev'\n"
>>>> @@ -2901,6 +2920,9 @@ DEF("nic", HAS_ARG, QEMU_OPTION_nic,
>>>>  #ifdef CONFIG_NETMAP
>>>>      "netmap|"
>>>>  #endif
>>>> +#ifdef CONFIG_AF_XDP
>>>> +    "af-xdp|"
>>>> +#endif
>>>>  #ifdef CONFIG_POSIX
>>>>      "vhost-user|"
>>>>  #endif
>>>> @@ -2929,6 +2951,9 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>>>>  #ifdef CONFIG_NETMAP
>>>>      "netmap|"
>>>>  #endif
>>>> +#ifdef CONFIG_AF_XDP
>>>> +    "af-xdp|"
>>>> +#endif
>>>>  #ifdef CONFIG_VMNET
>>>>      "vmnet-host|vmnet-shared|vmnet-bridged|"
>>>>  #endif
>>>> @@ -2936,7 +2961,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>>>>      "                old way to initialize a host network interface\n"
>>>>      "                (use the -netdev option if possible instead)\n", 
>>>> QEMU_ARCH_ALL)
>>>>  SRST
>>>> -``-nic 
>>>> [tap|bridge|user|l2tpv3|vde|netmap|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
>>>> +``-nic 
>>>> [tap|bridge|user|l2tpv3|vde|netmap|af-xdp|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
>>>>      This option is a shortcut for configuring both the on-board
>>>>      (default) guest NIC hardware and the host network backend in one go.
>>>>      The host backend options are the same as with the corresponding
>>>> @@ -3350,6 +3375,62 @@ SRST
>>>>          # launch QEMU instance
>>>>          |qemu_system| linux.img -nic vde,sock=/tmp/myswitch
>>>>
>>>> +``-netdev 
>>>> af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off][,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k][,sock-fds=x:y:...:z]``
>>>> +    Configure AF_XDP backend to connect to a network interface 'name'
>>>> +    using AF_XDP socket.  A specific program attach mode for a default
>>>> +    XDP program can be forced with 'mode', defaults to best-effort,
>>>> +    where the likely most performant mode will be in use.  Number of 
>>>> queues
>>>> +    'n' should generally match the number of queues in the interface,
>>>> +    defaults to 1.  Traffic arriving on non-configured device queues will
>>>> +    not be delivered to the network backend.
>>>> +
>>>> +    .. parsed-literal::
>>>> +
>>>> +        # set number of queues to 4
>>>> +        ethtool -L eth0 combined 4
>>>> +        # launch QEMU instance
>>>> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>>>> +            -netdev af-xdp,id=n1,ifname=eth0,queues=4
>>>> +
>>>> +    'start-queue' option can be specified if a particular range of queues
>>>> +    [m, m + n] should be in use.  For example, this is necessary in order
>>>> +    to use MLX NICs in native mode.  The driver will create a separate set
>>>> +    of queues on top of regular ones, and only these queues can be used
>>>> +    for AF_XDP sockets.  MLX NICs will also require an additional traffic
>>>> +    redirection with ethtool to these queues.
>>>
>>> Let's avoid mentioning any vendor name unless the emulation is done
>>> for a specific one.
>>
>> AFAIK, MLX/NVIDIA is the only vendor that implements XDP queues in this
>> strange fashion, so these are the only NICs that will not work without
>> start-queue configuration and the extra traffic re-direction.  Unfortunately,
>> this is also not documented anywhere, so you just have to know that it works
>> this way...
>>
>> So, on one hand, I understand that it's not the place for vendor-specific
>> docs.  On the other, it gives qemu users a chance to not waste a lot of time
>> trying to figure out why everything is configured correctly, but the traffic
>> doesn't flow.
>>
>> Maybe something like this:
>>
>>     'start-queue' option can be specified if a particular range of queues
>>     [m, m + n] should be in use.  For example, this may be necessary in
>>     order to use certain NICs in native mode.  Kernel allows the driver to
>>     create a separate set of XDP queues on top of regular ones, and only
>>     these queues can be used for AF_XDP sockets.  NICs that work this way
>>     may also require an additional traffic redirection with ethtool to these
>>     special queues.
>>
>>     .. parsed-literal::
>>
>>         # set number of queues to 1
>>         ethtool -L eth0 combined 1
>>         # redirect all the traffic to the second queue (id: 1)
>>         # note: drivers may require non-empty key/mask pair.
>>         ethtool -N eth0 flow-type ether \\
>>             dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
>>         ethtool -N eth0 flow-type ether \\
>>             dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
>>         # launch QEMU instance
>>         |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>>             -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
>>
>> What do you think?
> 
> This is great.
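
As a side note on why the example needs two ethtool rules: the sketch
below assumes ethtool's 'm' convention for this flow type is that mask
bits set to 1 are ignored, so FF:FF:FF:FF:FF:FE leaves only the lowest
bit of the destination MAC significant.  Under that assumption, the two
keys (lowest bit 0 and lowest bit 1) together match every address.  The
helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical matcher: a MAC matches a rule when it equals the key in
 * every bit that is NOT set in 'ignore' (assumed ethtool 'm' semantics).
 */
static bool rule_matches(const uint8_t mac[6], const uint8_t key[6],
                         const uint8_t ignore[6])
{
    for (int i = 0; i < 6; i++) {
        uint8_t care = (uint8_t)~ignore[i];

        if ((mac[i] & care) != (key[i] & care)) {
            return false;
        }
    }
    return true;
}

/* True if at least one of the two rules from the example matches. */
static bool covered_by_two_rules(const uint8_t mac[6])
{
    static const uint8_t ignore[6] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFE };
    static const uint8_t key0[6]   = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
    static const uint8_t key1[6]   = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x01 };

    return rule_matches(mac, key0, ignore) || rule_matches(mac, key1, ignore);
}

/* Both an even and an odd last octet are covered. */
static void self_check(void)
{
    const uint8_t even[6] = { 0x00, 0x16, 0x35, 0xAF, 0xAA, 0x5C };
    const uint8_t odd[6]  = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };

    assert(covered_by_two_rules(even));
    assert(covered_by_two_rules(odd));
}
```

A single rule with an all-ones mask would express the same thing, but
as the comment in the example notes, some drivers may reject an empty
key/mask pair, hence the split into two rules.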
> 
>>
>>>
>>>> + E.g.:
>>>> +
>>>> +    .. parsed-literal::
>>>> +
>>>> +        # set number of queues to 1
>>>> +        ethtool -L eth0 combined 1
>>>> +        # redirect all the traffic to the second queue (id: 1)
>>>> +        # note: mlx5 driver requires non-empty key/mask pair.
>>>> +        ethtool -N eth0 flow-type ether \\
>>>> +            dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
>>>> +        ethtool -N eth0 flow-type ether \\
>>>> +            dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
>>>> +        # launch QEMU instance
>>>> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>>>> +            -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
>>>> +
>>>> +    XDP program can also be loaded externally.  In this case 'inhibit' 
>>>> option
>>>> +    should be set to 'on' and 'xsks-map-fd' provided with a file 
>>>> descriptor
>>>> +    for an open XDP socket map of that program, or 'sock-fds' with file
>>>> +    descriptors for already open but not bound XDP sockets already added 
>>>> to a
>>>> +    socket map for corresponding queues.  One socket per queue.
>>>> +
>>>> +    .. parsed-literal::
>>>> +
>>>> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>>>> +            -netdev 
>>>> af-xdp,id=n1,ifname=eth0,queues=3,inhibit=on,sock-fds=15:16:17
>>>> +
>>>> +    With 'inhibit=on' and 'sock-fds', QEMU process will only require 32 MB
>>>> +    of locked memory (RLIMIT_MEMLOCK) per queue or CAP_IPC_LOCK 
>>>> capability.
>>>> +    With 'inhibit=on' and 'xsks-map-fd' it will additionally require
>>>> +    CAP_NET_RAW capability.  With 'inhibit=off', CAP_SYS/NET_ADMIN should 
>>>> be
>>>> +    added as well.
>>>
>>> This requires the synchronization with the code changes, I suggest not
>>> to describe:
>>>
>>> 1) actual numbers of memory requirement
>>> 2) capabilities required for each mode, we don't do that for other
>>> netdev like tap.
>>
>> Makes sense.  So, should we just strike the paragraph above entirely?
> 
> I think so.
> 
> Thanks
> 
>>
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>> +
>>>> +
>>>>  ``-netdev vhost-user,chardev=id[,vhostforce=on|off][,queues=n]``
>>>>      Establish a vhost-user netdev, backed by a chardev id. The chardev
>>>>      should be a unix domain socket backed one. The vhost-user uses a
>>>> diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure 
>>>> b/scripts/ci/org.centos/stream/8/x86_64/configure
>>>> index d02b09a4b9..7585c4c4ed 100755
>>>> --- a/scripts/ci/org.centos/stream/8/x86_64/configure
>>>> +++ b/scripts/ci/org.centos/stream/8/x86_64/configure
>>>> @@ -35,6 +35,7 @@
>>>>  --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
>>>>  --with-coroutine=ucontext \
>>>>  --tls-priority=@QEMU,SYSTEM \
>>>> +--disable-af-xdp \
>>>>  --disable-attr \
>>>>  --disable-auth-pam \
>>>>  --disable-avx2 \
>>>> diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
>>>> index 7dd5709ef4..162e455309 100644
>>>> --- a/scripts/meson-buildoptions.sh
>>>> +++ b/scripts/meson-buildoptions.sh
>>>> @@ -74,6 +74,7 @@ meson_options_help() {
>>>>    printf "%s\n" 'disabled with --disable-FEATURE, default is enabled if 
>>>> available'
>>>>    printf "%s\n" '(unless built with --without-default-features):'
>>>>    printf "%s\n" ''
>>>> +  printf "%s\n" '  af-xdp          AF_XDP network backend support'
>>>>    printf "%s\n" '  alsa            ALSA sound support'
>>>>    printf "%s\n" '  attr            attr/xattr support'
>>>>    printf "%s\n" '  auth-pam        PAM access control'
>>>> @@ -207,6 +208,8 @@ meson_options_help() {
>>>>  }
>>>>  _meson_option_parse() {
>>>>    case $1 in
>>>> +    --enable-af-xdp) printf "%s" -Daf_xdp=enabled ;;
>>>> +    --disable-af-xdp) printf "%s" -Daf_xdp=disabled ;;
>>>>      --enable-alsa) printf "%s" -Dalsa=enabled ;;
>>>>      --disable-alsa) printf "%s" -Dalsa=disabled ;;
>>>>      --enable-attr) printf "%s" -Dattr=enabled ;;
>>>> diff --git a/tests/docker/dockerfiles/debian-amd64.docker 
>>>> b/tests/docker/dockerfiles/debian-amd64.docker
>>>> index e39871c7bb..207f7adfb9 100644
>>>> --- a/tests/docker/dockerfiles/debian-amd64.docker
>>>> +++ b/tests/docker/dockerfiles/debian-amd64.docker
>>>> @@ -97,6 +97,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
>>>>                        libvirglrenderer-dev \
>>>>                        libvte-2.91-dev \
>>>>                        libxen-dev \
>>>> +                      libxdp-dev \
>>>>                        libzstd-dev \
>>>>                        llvm \
>>>>                        locales \
>>>> --
>>>> 2.40.1
>>>>
>>>
>>
> 

