Hi Ilya,
On 5/27/25 10:51 PM, Ilya Maximets wrote:
On 5/12/25 6:06 PM, Daniel Borkmann wrote:
Extend 'inhibit=on' setting with the option to specify a pinned XSK map
path along with a starting index (default 0) to push the created XSK
sockets into. Example usage:
# ./build/qemu-system-x86_64 [...] \
-netdev
af-xdp,ifname=enp2s0f0np0,id=net0,mode=native,queues=2,start-queue=14,inhibit=on,map-path=/sys/fs/bpf/xsks_map,map-start-index=14
\
-device virtio-net-pci,netdev=net0 [...]
This is useful for the case where an existing XDP program with XSK map
is present on the AF_XDP supported phys device and the XSK map is not
yet populated. For example, the former could have been pre-loaded onto
the netdevice by a control plane, which later launches qemu to populate
it with XSK sockets.
Normally, the main idea behind 'inhibit=on' is that the qemu instance
doesn't need to have a lot of privileges to use the pre-loaded program
and the pre-created sockets, but this mentioned use-case here is different
where qemu still needs privileges to create the sockets.
The 'map-start-index' parameter is optional and defaults to 0. It allows
flexible placement of the XSK sockets, and is up to the user to specify
when the XDP program with XSK map was already preloaded. In the simplest
case the queue-to-map-slot mapping is just 1:1 based on ctx->rx_queue_index
but the user might as well have a different scheme (or smaller map size,
e.g. ctx->rx_queue_index % max_size) to push the inbound traffic to one
of the XSK sockets.
Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
Cc: Ilya Maximets <i.maxim...@ovn.org>
Cc: Anton Protopopov <as...@isovalent.com>
---
net/af-xdp.c | 65 +++++++++++++++++++++++++++++++++++++++++++------
qapi/net.json | 24 ++++++++++++------
qemu-options.hx | 3 +++
3 files changed, 77 insertions(+), 15 deletions(-)
Hi, Daniel. Thanks for v2 and sorry for delayed response.
I tried it out for myself and the change does indeed work as expected.
And it's quite nice! So I looked closely at the code this time. See some
comments below.
Thanks!
diff --git a/net/af-xdp.c b/net/af-xdp.c
index 01c5fb914e..494b894706 100644
--- a/net/af-xdp.c
+++ b/net/af-xdp.c
@@ -51,6 +51,8 @@ typedef struct AFXDPState {
uint32_t n_queues;
uint32_t xdp_flags;
+ char *map_path;
+ uint32_t map_start_index;
bool inhibit;
} AFXDPState;
@@ -261,6 +263,7 @@ static void af_xdp_send(void *opaque)
static void af_xdp_cleanup(NetClientState *nc)
{
AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
+ int pin_fd, idx;
qemu_purge_queued_packets(nc);
@@ -282,6 +285,22 @@ static void af_xdp_cleanup(NetClientState *nc)
"af-xdp: unable to remove XDP program from '%s', ifindex:
%d\n",
s->ifname, s->ifindex);
}
I'd add an empty line here. The statements above and below are not
really related.
Ack, will do.
+ if (s->map_path) {
+ pin_fd = bpf_obj_get(s->map_path);
+ if (pin_fd < 0) {
+ fprintf(stderr,
+ "af-xdp: unable to remove %s's XSK sockets from '%s', ifindex:
%d\n",
+ s->ifname, s->map_path, s->ifindex);
Maybe it makes more sense to just not close it? i.e. keep the open file
descriptor in the AFXDPState and use and close it on cleanup. This way we
can be sure that the map doesn't just go away underneath us and so we can
guarantee a proper cleanup. WDYT?
Also, the ifindex isn't really part of this operation, so it may be
reasonable to not print it here saving some line length.
Ok, ack, I thought we don't need to waste / occupy an extra fd given VM can keep
running for a long time, but I don't have a strong opinion. I'll keep it around
for v3.
+ } else {
+ idx = nc->queue_index;
+ if (s->map_start_index > 0) {
Should probably check the has_ just in case.
I don't think we need the extra test since we already do it earlier, but for v3
I reworked the logic a bit anyway and removed the test.
+ idx += s->map_start_index;
+ }
+ bpf_map_delete_elem(pin_fd, &idx);
The uAPI says that this call can also fail. We should warn the user about
this failure.
Nothing we can do about it aside from warning, ok, will add.
+ close(pin_fd);
+ }
+ }
+ g_free(s->map_path);
}
static int af_xdp_umem_create(AFXDPState *s, int sock_fd, Error **errp)
@@ -343,7 +362,7 @@ static int af_xdp_socket_create(AFXDPState *s,
.bind_flags = XDP_USE_NEED_WAKEUP,
.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
};
- int queue_id, error = 0;
+ int queue_id, pin_fd, xsk_fd, idx, error = 0;
s->inhibit = opts->has_inhibit && opts->inhibit;
if (s->inhibit) {
@@ -384,6 +403,23 @@ static int af_xdp_socket_create(AFXDPState *s,
}
}
+ if (!error && s->map_path) {
+ pin_fd = bpf_obj_get(s->map_path);
+ if (pin_fd < 0) {
+ error = errno;
+ } else {
+ xsk_fd = xsk_socket__fd(s->xsk);
+ idx = s->nc.queue_index;
+ if (s->map_start_index) {
+ idx += s->map_start_index;
+ }
+ if (bpf_map_update_elem(pin_fd, &idx, &xsk_fd, 0)) {
+ error = errno;
+ }
+ close(pin_fd);
+ }
+ }
+
This part doesn't seem to belong in this function. It may be better if we
just create sockets here and create a separate function af_xdp_update_bpf_map
or something like that and call it from the main net_init_af_xdp.
Main reasons for that are mostly related to the error path:
1. The error message below is a bit misleading and doesn't really tell what
happened. For example, if the map_path is incorrect, then we get:
"failed to create AF_XDP socket ... : No such path or directory".
There is no mention of the map path that we tried to open and failed and
the information is kinda wrong as we did actually create the socket.
2. It's better if have xdp_flags properly initialized when the socket is
actually created and not error out of this function early. This may not
matter today, but seems cleaner and may save us from some headaches in
the future.
Ok, will refactor and move it out.
if (error) {
error_setg_errno(errp, error,
"failed to create AF_XDP socket for %s queue_id: %d",
@@ -465,8 +501,12 @@ int net_init_af_xdp(const Netdev *netdev,
return -1;
}
- if ((opts->has_inhibit && opts->inhibit) != !!opts->sock_fds) {
- error_setg(errp, "'inhibit=on' requires 'sock-fds' and vice versa");
+ if ((opts->has_inhibit && opts->inhibit) != !!(opts->sock_fds ||
opts->map_path)) {
+ if (opts->has_inhibit && opts->inhibit) {
+ error_setg(errp, "'inhibit=on' requires setting 'sock-fds' or
'map-path'");
+ } else {
+ error_setg(errp, "'sock-fds' or 'map-path' requires setting
'inhibit=on'");
+ }
return -1;
}
This is getting a little hard to navigate even after the re-wording, I think.
And we also need to cover more cases now. So, I'd suggest to just bite the
bullet and spell out every case separately, e.g.:
inhibit = opts->has_inhibit && opts->inhibit;
if (inhibit && !opts->sock_fds && !opts->map_path) {
error_setg(errp, "'inhibit=on' requires 'sock-fds' or 'map-path'");
return -1;
}
if (!inhibit && (opts->sock_fds || opts->map_path)) {
error_setg(errp, "'sock-fds' and 'map-path' require 'inhibit=on'");
return -1;
}
if (opts->sock_fds && opts->map_path) {
error_setg(errp, "'sock-fds' and 'map-path' are mutually exclusive");
return -1;
}
This should be easier to understand and also saves some line lengths. WDYT?
The exclusivity check here is because we expect the file descriptors in the
sock-fds to be already in the map. It would be an awkward configuration if
we have the fds and the map, so it might be better to just prohibit this.
The combination of both will semlessly work however, so even if you pass in
the sockets, then the internal logic will skip creation and we can still push
them into the specified map. So technically there isn't really a reason to
restrict them to mutually exclusive. But I can adapt the code to the above
either way if that's the preference since we don't have a use-case today for
having both passed-in.
We'll also need a check that map-start-index requires the map-path.
Will add.
@@ -491,6 +531,12 @@ int net_init_af_xdp(const Netdev *netdev,
pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname);
s->ifindex = ifindex;
s->n_queues = queues;
I'd put an empty line here.
+ if (opts->map_path) {
This check doesn't seem necessary, g_strdup is NULL-safe.
+ s->map_path = g_strdup(opts->map_path);
+ if (opts->has_map_start_index && opts->map_start_index > 0) {
+ s->map_start_index = opts->map_start_index;
+ }
+ }
if (af_xdp_umem_create(s, sock_fds ? sock_fds[i] : -1, errp)
|| af_xdp_socket_create(s, opts, errp)) {
@@ -504,10 +550,15 @@ int net_init_af_xdp(const Netdev *netdev,
if (nc0) {
s = DO_UPCAST(AFXDPState, nc, nc0);
if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || !prog_id)
{
- error_setg_errno(errp, errno,
- "no XDP program loaded on '%s', ifindex: %d",
- s->ifname, s->ifindex);
- goto err;
+ if (!s->inhibit) {
The s->inhibit is set as a side effect of af_xdp_socket_create(). It was a
right
decision when the socket creation and the cleanup were the only users. But
since
we're using it now here, we should move setting of this value to the loop above.
Yeap, ack, coming in v3. Btw, I'm also considering to actually remove the warn
below and only leave the error case for !s->inhibit as this decision is
completely
delegated to users. The XDP loading can just as well happen afterwards.
+ error_setg_errno(errp, errno,
+ "no XDP program loaded on '%s', ifindex: %d",
+ s->ifname, s->ifindex);
+ goto err;
+ } else {
+ warn_report("no XDP program loaded yet on '%s', ifindex: %d",
+ s->ifname, s->ifindex);
+ }
}
}
diff --git a/qapi/net.json b/qapi/net.json
index 310cc4fd19..66a1fcf6ae 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -454,7 +454,7 @@
# (default: 0).
#
# @inhibit: Don't load a default XDP program, use one already loaded
-# to the interface (default: false). Requires @sock-fds.
+# to the interface (default: false). Requires @sock-fds or @map-path.
#
# @sock-fds: A colon (:) separated list of file descriptors for
# already open but not bound AF_XDP sockets in the queue order.
@@ -462,17 +462,25 @@
# into XDP socket map for corresponding queues. Requires
# @inhibit.
#
+# @map-path: The path to a pinned xsk map to push file descriptors
+# for bound AF_XDP sockets into. Requires @inhibit. (Since 10.1)
+#
+# @map-start-index: Use @map-path to insert xsk sockets starting from
+# this index number (default: 0). Requires @map-path. (Since 10.1)
The doc uses double spaces between sentences. Also, may mention that
map-path and sock-fds are mutually exclusive. For both options.
Ok.
+#
# Since: 8.2
##
{ 'struct': 'NetdevAFXDPOptions',
'data': {
- 'ifname': 'str',
- '*mode': 'AFXDPMode',
- '*force-copy': 'bool',
- '*queues': 'int',
- '*start-queue': 'int',
- '*inhibit': 'bool',
- '*sock-fds': 'str' },
+ 'ifname': 'str',
+ '*mode': 'AFXDPMode',
+ '*force-copy': 'bool',
+ '*queues': 'int',
+ '*start-queue': 'int',
+ '*inhibit': 'bool',
+ '*sock-fds': 'str',
+ '*map-path': 'str',
+ '*map-start-index': 'int' },
'if': 'CONFIG_AF_XDP' }
##
diff --git a/qemu-options.hx b/qemu-options.hx
index dc694a99a3..50fc592f5d 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2909,6 +2909,7 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
#ifdef CONFIG_AF_XDP
"-netdev
af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n"
"
[,queues=n][,start-queue=m][,inhibit=on|off][,sock-fds=x:y:...:z]\n"
+ " [,map-path=/path/to/socket/map][,map-start-index=i]\n"
" attach to the existing network interface 'name' with AF_XDP
socket\n"
" use 'mode=MODE' to specify an XDP program attach mode\n"
" use 'force-copy=on|off' to force XDP copy mode even if device
supports zero-copy (default: off)\n"
@@ -2916,6 +2917,8 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
" with inhibit=on,\n"
" use 'sock-fds' to provide file descriptors for already open
AF_XDP sockets\n"
" added to a socket map in XDP program. One socket per
queue.\n"
+ " use 'map-path' to provide the socket map location to
populate AF_XDP sockets with\n"
+ " beginning from the 'map-start-index' specified index
(default: 0) (Since 10.1)\n"
" use 'queues=n' to specify how many queues of a multiqueue
interface should be used\n"
" use 'start-queue=m' to specify the first queue that should be
used\n"
#endif
There is another section below with all the options listed and some
examples. It will need an update. Not sure if we need an example
Thanks, will add!
for this new use case, but if you can make a small one with just
using bpftool, that maight be useful.
Full blown example is a bit more involved, so likely won't fit since we also
need
all the other pieces (example C code, build, BPF skeleton) which mostly !inhibit
does in the background.
Thanks,
Daniel