On 5/30/25 3:35 PM, Daniel Borkmann wrote: > Hi Ilya, > > On 5/27/25 10:51 PM, Ilya Maximets wrote: >> On 5/12/25 6:06 PM, Daniel Borkmann wrote: >>> Extend 'inhibit=on' setting with the option to specify a pinned XSK map >>> path along with a starting index (default 0) to push the created XSK >>> sockets into. Example usage: >>> >>> # ./build/qemu-system-x86_64 [...] \ >>> -netdev >>> af-xdp,ifname=enp2s0f0np0,id=net0,mode=native,queues=2,start-queue=14,inhibit=on,map-path=/sys/fs/bpf/xsks_map,map-start-index=14 >>> \ >>> -device virtio-net-pci,netdev=net0 [...] >>> >>> This is useful for the case where an existing XDP program with XSK map >>> is present on the AF_XDP supported phys device and the XSK map is not >>> yet populated. For example, the former could have been pre-loaded onto >>> the netdevice by a control plane, which later launches qemu to populate >>> it with XSK sockets. >>> >>> Normally, the main idea behind 'inhibit=on' is that the qemu instance >>> doesn't need to have a lot of privileges to use the pre-loaded program >>> and the pre-created sockets, but this mentioned use-case here is different >>> where qemu still needs privileges to create the sockets. >>> >>> The 'map-start-index' parameter is optional and defaults to 0. It allows >>> flexible placement of the XSK sockets, and is up to the user to specify >>> when the XDP program with XSK map was already preloaded. In the simplest >>> case the queue-to-map-slot mapping is just 1:1 based on ctx->rx_queue_index >>> but the user might as well have a different scheme (or smaller map size, >>> e.g. ctx->rx_queue_index % max_size) to push the inbound traffic to one >>> of the XSK sockets. >>> >>> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net> >>> Cc: Ilya Maximets <i.maxim...@ovn.org> >>> Cc: Anton Protopopov <as...@isovalent.com> >>> --- >>> net/af-xdp.c | 65 +++++++++++++++++++++++++++++++++++++++++++------ >>> qapi/net.json | 24 ++++++++++++------ >>> qemu-options.hx | 3 +++ >>> 3 files changed, 77 insertions(+), 15 deletions(-) >> >> Hi, Daniel. Thanks for v2 and sorry for delayed response. >> >> I tried it out for myself and the change does indeed work as expected. >> And it's quite nice! So I looked closely at the code this time. See some >> comments below. > > Thanks! > >>> diff --git a/net/af-xdp.c b/net/af-xdp.c >>> index 01c5fb914e..494b894706 100644 >>> --- a/net/af-xdp.c >>> +++ b/net/af-xdp.c >>> @@ -51,6 +51,8 @@ typedef struct AFXDPState { >>> >>> uint32_t n_queues; >>> uint32_t xdp_flags; >>> + char *map_path; >>> + uint32_t map_start_index; >>> bool inhibit; >>> } AFXDPState; >>> >>> @@ -261,6 +263,7 @@ static void af_xdp_send(void *opaque) >>> static void af_xdp_cleanup(NetClientState *nc) >>> { >>> AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); >>> + int pin_fd, idx; >>> >>> qemu_purge_queued_packets(nc); >>> >>> @@ -282,6 +285,22 @@ static void af_xdp_cleanup(NetClientState *nc) >>> "af-xdp: unable to remove XDP program from '%s', ifindex: >>> %d\n", >>> s->ifname, s->ifindex); >>> } >> >> I'd add an empty line here. The statements above and below are not >> really related. > > Ack, will do. > >>> + if (s->map_path) { >>> + pin_fd = bpf_obj_get(s->map_path); >>> + if (pin_fd < 0) { >>> + fprintf(stderr, >>> + "af-xdp: unable to remove %s's XSK sockets from '%s', >>> ifindex: %d\n", >>> + s->ifname, s->map_path, s->ifindex); >> >> Maybe it makes more sense to just not close it? i.e. keep the open file >> descriptor in the AFXDPState and use and close it on cleanup. This way we >> can be sure that the map doesn't just go away underneath us and so we can >> guarantee a proper cleanup. WDYT? >> >> Also, the ifindex isn't really part of this operation, so it may be >> reasonable to not print it here saving some line length. > > Ok, ack, I thought we don't need to waste / occupy an extra fd given VM can > keep > running for a long time, but I don't have a strong opinion. I'll keep it > around > for v3.
IMO, the fact that qemu process will be running for a long time is exactly the reason to keep this file open, as there is a higher chance for it to disappear at some point. Shouldn't be too bad, we have separate file descriptors open per queue, so one more shouldn't be a big problem. > >>> + } else { >>> + idx = nc->queue_index; >>> + if (s->map_start_index > 0) { >> >> Should probably check the has_ just in case. > > I don't think we need the extra test since we already do it earlier, but for > v3 > I reworked the logic a bit anyway and removed the test. > >>> + idx += s->map_start_index; >>> + } >>> + bpf_map_delete_elem(pin_fd, &idx); >> >> The uAPI says that this call can also fail. We should warn the user about >> this failure. > > Nothing we can do about it aside from warning, ok, will add. > >>> + close(pin_fd); >>> + } >>> + } >>> + g_free(s->map_path); >>> } >>> >>> static int af_xdp_umem_create(AFXDPState *s, int sock_fd, Error **errp) >>> @@ -343,7 +362,7 @@ static int af_xdp_socket_create(AFXDPState *s, >>> .bind_flags = XDP_USE_NEED_WAKEUP, >>> .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST, >>> }; >>> - int queue_id, error = 0; >>> + int queue_id, pin_fd, xsk_fd, idx, error = 0; >>> >>> s->inhibit = opts->has_inhibit && opts->inhibit; >>> if (s->inhibit) { >>> @@ -384,6 +403,23 @@ static int af_xdp_socket_create(AFXDPState *s, >>> } >>> } >>> >>> + if (!error && s->map_path) { >>> + pin_fd = bpf_obj_get(s->map_path); >>> + if (pin_fd < 0) { >>> + error = errno; >>> + } else { >>> + xsk_fd = xsk_socket__fd(s->xsk); >>> + idx = s->nc.queue_index; >>> + if (s->map_start_index) { >>> + idx += s->map_start_index; >>> + } >>> + if (bpf_map_update_elem(pin_fd, &idx, &xsk_fd, 0)) { >>> + error = errno; >>> + } >>> + close(pin_fd); >>> + } >>> + } >>> + >> >> This part doesn't seem to belong in this function. It may be better if we >> just create sockets here and create a separate function af_xdp_update_bpf_map >> or something like that and call it from the main net_init_af_xdp. >> >> Main reasons for that are mostly related to the error path: >> >> 1. The error message below is a bit misleading and doesn't really tell what >> happened. For example, if the map_path is incorrect, then we get: >> "failed to create AF_XDP socket ... : No such path or directory". >> There is no mention of the map path that we tried to open and failed and >> the information is kinda wrong as we did actually create the socket. >> >> 2. It's better if have xdp_flags properly initialized when the socket is >> actually created and not error out of this function early. This may not >> matter today, but seems cleaner and may save us from some headaches in >> the future. > > Ok, will refactor and move it out. > >>> if (error) { >>> error_setg_errno(errp, error, >>> "failed to create AF_XDP socket for %s queue_id: >>> %d", >>> @@ -465,8 +501,12 @@ int net_init_af_xdp(const Netdev *netdev, >>> return -1; >>> } >>> >>> - if ((opts->has_inhibit && opts->inhibit) != !!opts->sock_fds) { >>> - error_setg(errp, "'inhibit=on' requires 'sock-fds' and vice >>> versa"); >>> + if ((opts->has_inhibit && opts->inhibit) != !!(opts->sock_fds || >>> opts->map_path)) { >>> + if (opts->has_inhibit && opts->inhibit) { >>> + error_setg(errp, "'inhibit=on' requires setting 'sock-fds' or >>> 'map-path'"); >>> + } else { >>> + error_setg(errp, "'sock-fds' or 'map-path' requires setting >>> 'inhibit=on'"); >>> + } >>> return -1; >>> } >> >> This is getting a little hard to navigate even after the re-wording, I think. >> And we also need to cover more cases now. So, I'd suggest to just bite the >> bullet and spell out every case separately, e.g.: >> >> inhibit = opts->has_inhibit && opts->inhibit; >> if (inhibit && !opts->sock_fds && !opts->map_path) { >> error_setg(errp, "'inhibit=on' requires 'sock-fds' or 'map-path'"); >> return -1; >> } >> if (!inhibit && (opts->sock_fds || opts->map_path)) { >> error_setg(errp, "'sock-fds' and 'map-path' require 'inhibit=on'"); >> return -1; >> } >> if (opts->sock_fds && opts->map_path) { >> error_setg(errp, "'sock-fds' and 'map-path' are mutually >> exclusive"); >> return -1; >> } >> >> This should be easier to understand and also saves some line lengths. WDYT? >> >> The exclusivity check here is because we expect the file descriptors in the >> sock-fds to be already in the map. It would be an awkward configuration if >> we have the fds and the map, so it might be better to just prohibit this. > > The combination of both will semlessly work however, so even if you pass in > the sockets, then the internal logic will skip creation and we can still push > them into the specified map. So technically there isn't really a reason to > restrict them to mutually exclusive. But I can adapt the code to the above > either way if that's the preference since we don't have a use-case today for > having both passed-in. The problem I have with combining the options is that we expect sockets to be already filled in the map when passed. This is explicitly documented. So, there is no point in having the map when the sockets are already in the map. If we allow them to not be in the map, then we need to re-work the docuemntation as well. And sice there is no real use case for this, it seems better to just mark them as mutually exclusive for now. > >> We'll also need a check that map-start-index requires the map-path. > > Will add. > >>> @@ -491,6 +531,12 @@ int net_init_af_xdp(const Netdev *netdev, >>> pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname); >>> s->ifindex = ifindex; >>> s->n_queues = queues; >> >> I'd put an empty line here. >> >>> + if (opts->map_path) { >> >> This check doesn't seem necessary, g_strdup is NULL-safe. >> >>> + s->map_path = g_strdup(opts->map_path); >>> + if (opts->has_map_start_index && opts->map_start_index > 0) { >>> + s->map_start_index = opts->map_start_index; >>> + } >>> + } >>> >>> if (af_xdp_umem_create(s, sock_fds ? sock_fds[i] : -1, errp) >>> || af_xdp_socket_create(s, opts, errp)) { >>> @@ -504,10 +550,15 @@ int net_init_af_xdp(const Netdev *netdev, >>> if (nc0) { >>> s = DO_UPCAST(AFXDPState, nc, nc0); >>> if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || >>> !prog_id) { >>> - error_setg_errno(errp, errno, >>> - "no XDP program loaded on '%s', ifindex: %d", >>> - s->ifname, s->ifindex); >>> - goto err; >>> + if (!s->inhibit) { >> >> The s->inhibit is set as a side effect of af_xdp_socket_create(). It was a >> right >> decision when the socket creation and the cleanup were the only users. But >> since >> we're using it now here, we should move setting of this value to the loop >> above. > > Yeap, ack, coming in v3. Btw, I'm also considering to actually remove the warn > below and only leave the error case for !s->inhibit as this decision is > completely > delegated to users. The XDP loading can just as well happen afterwards. Maybe convert it to info? > >>> + error_setg_errno(errp, errno, >>> + "no XDP program loaded on '%s', ifindex: >>> %d", >>> + s->ifname, s->ifindex); >>> + goto err; >>> + } else { >>> + warn_report("no XDP program loaded yet on '%s', ifindex: >>> %d", >>> + s->ifname, s->ifindex); >>> + } >>> } >>> } >>> >>> diff --git a/qapi/net.json b/qapi/net.json >>> index 310cc4fd19..66a1fcf6ae 100644 >>> --- a/qapi/net.json >>> +++ b/qapi/net.json >>> @@ -454,7 +454,7 @@ >>> # (default: 0). >>> # >>> # @inhibit: Don't load a default XDP program, use one already loaded >>> -# to the interface (default: false). Requires @sock-fds. >>> +# to the interface (default: false). Requires @sock-fds or @map-path. >>> # >>> # @sock-fds: A colon (:) separated list of file descriptors for >>> # already open but not bound AF_XDP sockets in the queue order. >>> @@ -462,17 +462,25 @@ >>> # into XDP socket map for corresponding queues. Requires >>> # @inhibit. >>> # >>> +# @map-path: The path to a pinned xsk map to push file descriptors >>> +# for bound AF_XDP sockets into. Requires @inhibit. (Since 10.1) >>> +# >>> +# @map-start-index: Use @map-path to insert xsk sockets starting from >>> +# this index number (default: 0). Requires @map-path. (Since 10.1) >> >> The doc uses double spaces between sentences. Also, may mention that >> map-path and sock-fds are mutually exclusive. For both options. > > Ok. > >>> +# >>> # Since: 8.2 >>> ## >>> { 'struct': 'NetdevAFXDPOptions', >>> 'data': { >>> - 'ifname': 'str', >>> - '*mode': 'AFXDPMode', >>> - '*force-copy': 'bool', >>> - '*queues': 'int', >>> - '*start-queue': 'int', >>> - '*inhibit': 'bool', >>> - '*sock-fds': 'str' }, >>> + 'ifname': 'str', >>> + '*mode': 'AFXDPMode', >>> + '*force-copy': 'bool', >>> + '*queues': 'int', >>> + '*start-queue': 'int', >>> + '*inhibit': 'bool', >>> + '*sock-fds': 'str', >>> + '*map-path': 'str', >>> + '*map-start-index': 'int' }, >>> 'if': 'CONFIG_AF_XDP' } >>> >>> ## >>> diff --git a/qemu-options.hx b/qemu-options.hx >>> index dc694a99a3..50fc592f5d 100644 >>> --- a/qemu-options.hx >>> +++ b/qemu-options.hx >>> @@ -2909,6 +2909,7 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev, >>> #ifdef CONFIG_AF_XDP >>> "-netdev >>> af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n" >>> " >>> [,queues=n][,start-queue=m][,inhibit=on|off][,sock-fds=x:y:...:z]\n" >>> + " [,map-path=/path/to/socket/map][,map-start-index=i]\n" >>> " attach to the existing network interface 'name' with >>> AF_XDP socket\n" >>> " use 'mode=MODE' to specify an XDP program attach >>> mode\n" >>> " use 'force-copy=on|off' to force XDP copy mode even >>> if device supports zero-copy (default: off)\n" >>> @@ -2916,6 +2917,8 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev, >>> " with inhibit=on,\n" >>> " use 'sock-fds' to provide file descriptors for >>> already open AF_XDP sockets\n" >>> " added to a socket map in XDP program. One socket >>> per queue.\n" >>> + " use 'map-path' to provide the socket map location >>> to populate AF_XDP sockets with\n" >>> + " beginning from the 'map-start-index' specified >>> index (default: 0) (Since 10.1)\n" >>> " use 'queues=n' to specify how many queues of a >>> multiqueue interface should be used\n" >>> " use 'start-queue=m' to specify the first queue that >>> should be used\n" >>> #endif >> >> There is another section below with all the options listed and some >> examples. It will need an update. Not sure if we need an example > > Thanks, will add! > >> for this new use case, but if you can make a small one with just >> using bpftool, that maight be useful. > > Full blown example is a bit more involved, so likely won't fit since we also > need > all the other pieces (example C code, build, BPF skeleton) which mostly > !inhibit > does in the background. Example may load the defaul xsk_def_xdp_prog.o that is shipped with libxdp. Should be no need for C code or compiling in this case. > > Thanks, > Daniel