On Wed, Jun 28, 2023 at 7:14 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 6/28/23 05:27, Jason Wang wrote:
> > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> >>
> >> On 6/27/23 04:54, Jason Wang wrote:
> >>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> >>>>
> >>>> On 6/26/23 08:32, Jason Wang wrote:
> >>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>
> >>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> >>>>>>>
> >>>>>>> AF_XDP is a network socket family that allows communication directly with the network device driver in the kernel, bypassing most or all of the kernel networking stack. In essence, the technology is pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native and works with any network interface without driver modifications. Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't require access to character devices or unix sockets. Only access to the network interface itself is necessary.
> >>>>>>>
> >>>>>>> This patch implements a network backend that communicates with the kernel by creating an AF_XDP socket. A chunk of userspace memory is shared between QEMU and the host kernel. Four ring buffers (Tx, Rx, Fill and Completion) are placed in that memory along with a pool of memory buffers for the packet data. Data transmission is done by allocating one of the buffers, copying packet data into it and placing the pointer into the Tx ring. After transmission, the device returns the buffer via the Completion ring. On Rx, the device takes a buffer from a pre-populated Fill ring, writes the packet data into it and places the buffer into the Rx ring.
> >>>>>>>
> >>>>>>> The AF_XDP network backend takes on the communication with the host kernel and the network interface and forwards packets to/from the peer device in QEMU.
> >>>>>>>
> >>>>>>> Usage example:
> >>>>>>>
> >>>>>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
> >>>>>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
> >>>>>>>
> >>>>>>> An XDP program bridges the socket with a network interface. It can be attached to the interface in 2 different modes:
> >>>>>>>
> >>>>>>> 1. skb - this mode should work for any interface and doesn't require driver support, with the caveat of lower performance.
> >>>>>>>
> >>>>>>> 2. native - this does require support from the driver and allows bypassing skb allocation in the kernel and potentially using zero-copy while getting packets in/out of userspace.
> >>>>>>>
> >>>>>>> By default, QEMU will try to use native mode and fall back to skb. The mode can be forced via the 'mode' option. To force 'copy' even in native mode, use the 'force-copy=on' option. This might be useful if there is some issue with the driver.
> >>>>>>>
> >>>>>>> The 'queues=N' option allows specifying how many device queues should be open. Note that all the queues that are not open are still functional and can receive traffic, but it will not be delivered to QEMU.
> >>>>>>> So, the number of device queues should generally match the QEMU configuration, unless the device is shared with something else and the traffic re-direction to appropriate queues is correctly configured on a device level (e.g. with ethtool -N). The 'start-queue=M' option can be used to specify from which queue id QEMU should start configuring 'N' queues. It might also be necessary to use this option with certain NICs, e.g. MLX5 NICs. See the docs for examples.
> >>>>>>>
> >>>>>>> In the general case, QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN capabilities in order to load default XSK/XDP programs to the network interface and configure BTF maps.
> >>>>>>
> >>>>>> I think you mean "BPF" actually?
> >>>>
> >>>> "BPF Type Format maps" kind of makes some sense, but yes. :)
> >>>>
> >>>>>>
> >>>>>>> It is possible, however, to run only with CAP_NET_RAW.
> >>>>>>
> >>>>>> Qemu often runs without any privileges, so we need to fix it.
> >>>>>>
> >>>>>> I think adding support for SCM_RIGHTS via monitor would be a way to go.
> >>>>
> >>>> I looked through the code and it seems like we can run completely non-privileged as far as the kernel is concerned. We'll need an API modification in libxdp though.
> >>>>
> >>>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is the base socket creation. Binding and other configuration doesn't require any privileges. So, we could create a socket externally and pass it to QEMU.
> >>>
> >>> That's the way TAP works, for example.
> >>>
> >>>> Should work, unless it's an oversight from the kernel side that needs to be patched. :) libxdp doesn't have a way to specify an externally created socket today, so we'll need to change that. Should be easy to do though. I can explore.
> >>>
> >>> Please do that.
> >>
> >> I have a prototype:
> >>
> >> https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3
> >>
> >> Need to test it out and then submit a PR to the xdp-tools project.
> >>
> >>>
> >>>>
> >>>> In case the bind syscall actually needs CAP_NET_RAW for some reason, we could change the kernel and allow non-privileged bind by utilizing, e.g., SO_BINDTODEVICE, i.e., let the privileged process bind the socket to a particular device, so QEMU can't bind it to a random one. Might be a good use case to allow even if not strictly necessary.
> >>>
> >>> Yes.
> >>
> >> Will propose something for the kernel as well. We might want something more granular though, e.g. bind to a queue instead of a device, in case we want better control in the device sharing scenario.
> >
> > I may have missed something, but the bind is already done at dev plus queue right now, isn't it?
>
> Yes, the bind() syscall will bind the socket to the dev+queue. I was talking about SO_BINDTODEVICE, which only ties the socket to a particular device, but not a queue.
>
> Assuming SO_BINDTODEVICE is implemented for AF_XDP sockets and assuming a privileged process does:
>
>   fd = socket(AF_XDP, ...);
>   setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, <device>);
>
> And sends the fd to a non-privileged process. That non-privileged process will be able to call:
>
>   bind(fd, <device>, <random queue>);
>
> It will have to use the same device, but can choose any queue, if that queue is not already busy with another socket.
>
> So, I was thinking maybe implementing something like an XDP_BINDTOQID option.
> This way the privileged process may call:
>
>   setsockopt(fd, SOL_XDP, XDP_BINDTOQID, <device>, <queue>);
>
> And later the kernel will be able to refuse bind() for any other queue for this particular socket.
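
For illustration, a minimal sketch (in C) of the privileged/unprivileged split being discussed: a privileged helper creates the AF_XDP socket (the only step that needs CAP_NET_RAW) and passes the fd to QEMU over a unix socket via SCM_RIGHTS; the unprivileged side later binds it to a dev+queue. The function names and the pre-connected unix_fd are made up for the sketch, and the proposed SO_BINDTODEVICE / XDP_BINDTOQID handling is intentionally left out since it does not exist yet:

#include <linux/if_xdp.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>

/* Privileged helper: create the AF_XDP socket and pass it over an
 * already-connected unix-domain socket 'unix_fd'. */
static int send_xsk_fd(int unix_fd)
{
    int xsk_fd = socket(AF_XDP, SOCK_RAW, 0);     /* needs CAP_NET_RAW */
    char dummy = 'x';
    char cbuf[CMSG_SPACE(sizeof(int))] = { 0 };
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    if (xsk_fd < 0) {
        return -1;
    }
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &xsk_fd, sizeof(int));

    return sendmsg(unix_fd, &msg, 0) < 0 ? -1 : 0;
}

/* Unprivileged side: once the UMEM and rings have been configured on the
 * received fd (libxdp normally does this), binding it to a specific
 * device queue does not require extra privileges. */
static int bind_xsk(int xsk_fd, const char *ifname, unsigned int queue_id)
{
    struct sockaddr_xdp sxdp = {
        .sxdp_family = AF_XDP,
        .sxdp_ifindex = if_nametoindex(ifname),
        .sxdp_queue_id = queue_id,
    };

    return bind(xsk_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}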
Not sure; if file descriptor passing works, we probably don't need another way.

>
> Not sure if that is necessary though. Since we're allocating the socket in the privileged process, that process may add the socket to the BPF map on the correct queue id. This way the non-privileged process will not be able to receive any packets from any other queue on this socket, even if bound to it. And no other AF_XDP socket will be able to be bound to that other queue as well.

I think that's by design, or is anything wrong with this model?

> So, the rogue QEMU will be able to hog one extra queue, but it will not be able to intercept any traffic from it, AFAICT. May not be a huge problem after all.
>
> SO_BINDTODEVICE would still be nice to have. Especially for cases where we give the whole device to one VM.

Then we need to use AF_XDP in the guest, which seems to be a different topic. Alibaba is working on AF_XDP support for virtio-net.

Thanks

>
> Best regards, Ilya Maximets.
>
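
For illustration, a minimal sketch (in C, using libbpf) of the BPF-map approach mentioned above: the privileged process inserts the AF_XDP socket fd into the XDP program's XSKMAP at the intended queue id, so packets from other queues are never redirected to this socket regardless of which queue the unprivileged process later binds it to. The map fd and function name are assumptions for the sketch, not code from the patch or from libxdp; the XSKMAP fd could, for example, be obtained with bpf_obj_get() on a pinned map:

#include <bpf/bpf.h>

/* Privileged process: key is the rx queue index, value is the AF_XDP
 * socket fd.  The default XDP program's bpf_redirect_map() will then only
 * deliver packets from this queue to this socket. */
static int pin_xsk_to_queue(int xsks_map_fd, __u32 queue_id, int xsk_fd)
{
    return bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, BPF_ANY);
}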