On Mon, Jul 10, 2023 at 11:21 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
>
> On Mon, 10 Jul 2023 at 06:55, Ilya Maximets <i.maxim...@ovn.org> wrote:
> >
> > On 7/10/23 05:51, Jason Wang wrote:
> > > On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > >>
> > >> On 7/7/23 03:43, Jason Wang wrote:
> > >>> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > >>>>
> > >>>> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>
> > >>>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > >>>>>>
> > >>>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>
> > >>>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
> > >>>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
> > >>>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > >>>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in terms of PPS. So, that might be one case. Taking into account that just the RCU lock and unlock in the virtio-net code take more time than a packet copy, some batching on the QEMU side should improve performance significantly. And it shouldn't be too hard to implement.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved by creating a kernel thread for async Tx, similarly to what io_uring allows. Currently, Tx on non-zero-copy interfaces is synchronous, and that doesn't allow it to scale well.
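As a concrete illustration of the batching idea quoted above: the per-packet Tx costs can be amortized over the AF_XDP TX producer ring by reserving, filling, and submitting descriptors in batches and issuing at most one wakeup syscall per batch. A rough sketch against libxdp's xsk API (not QEMU code; the umem frame management and the frame_addrs/frame_lens arrays are assumed to exist elsewhere):

    /*
     * Hypothetical helper, not QEMU code: push a batch of already-filled
     * umem frames into the AF_XDP TX ring and kick the kernel once per
     * batch instead of once per packet.  frame_addrs/frame_lens describe
     * frames the caller has already copied packets into.
     */
    #include <stdint.h>
    #include <sys/socket.h>
    #include <xdp/xsk.h>            /* libxdp; older setups use <bpf/xsk.h> */

    static int xsk_tx_batch(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
                            const uint64_t *frame_addrs,
                            const uint32_t *frame_lens, uint32_t n)
    {
        uint32_t idx, i;

        /* Reserve n TX descriptors in one go; 0 means the ring is full. */
        if (xsk_ring_prod__reserve(tx, n, &idx) != n) {
            return -1;              /* retry after the completion ring drains */
        }
        for (i = 0; i < n; i++) {
            struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx + i);
            desc->addr = frame_addrs[i];    /* offset of the frame inside umem */
            desc->len  = frame_lens[i];
        }
        xsk_ring_prod__submit(tx, n);

        /* At most one wakeup syscall per batch (need-wakeup mode only). */
        if (xsk_ring_prod__needs_wakeup(tx)) {
            sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
        }
        return 0;
    }

The descriptor fill, the submit, and the wakeup are all amortized over the batch, which is where much of the per-packet overhead on the copy path tends to go.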
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Interestingly, actually, there is a lot of "duplication" between io_uring and AF_XDP:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 1) both have a similar memory model (user-registered memory)
> > >>>>>>>>>>>>>>>>> 2) both use rings for communication
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I wonder if we can let io_uring talk directly to AF_XDP.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Well, if we submit poll() in the QEMU main loop via io_uring, then we can avoid the cost of synchronous Tx for non-zero-copy modes, i.e. for virtual interfaces. The io_uring thread in the kernel will be able to perform the transmission for us.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> It would be nice if we could use an iothread/vhost rather than the main loop, even if io_uring can use kthreads. We can avoid the memory translation cost.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working on patches to re-enable it and will probably send them in July. The patches also add an API to submit arbitrary io_uring operations so that you can do stuff besides file descriptor monitoring. Both the main loop and IOThreads will be able to use io_uring on Linux hosts.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Just to make sure I understand: if we still need a copy from the guest to an io_uring buffer, we still need to go via the memory API for the GPA, which seems expensive.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Vhost seems to be a shortcut for this.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Simply using io_uring for the event loop (file descriptor monitoring) doesn't involve an extra buffer, but the packet payload still needs to reside in the AF_XDP umem, so there is a copy between guest memory and the umem.
> > >>>>>>>>>>>
> > >>>>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring supports 2 stages) which needs to go via the QEMU memory core. And that part seemed to be very expensive according to my tests in the past.
> > >>>>>>>>>>
> > >>>>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a QEMU netdev, there is already QEMU device emulation (e.g. virtio-net) happening. So the GPA to HVA translation will happen anyway in device emulation.
> > >>>>>>>>>
> > >>>>>>>>> Just to make sure we're on the same page.
> > >>>>>>>>>
> > >>>>>>>>> I meant that AF_XDP can do more than, e.g., 10 Mpps. So if we still use the QEMU netdev, it would be very hard to achieve that if we stick to using the QEMU memory core translations, which need to take care of too much extra stuff.
> > >>>>>>>>> That's why I suggest using vhost in io threads, which only cares about RAM, so the translation can be very fast.
> > >>>>>>>>
> > >>>>>>>> What does using "vhost in io threads" mean?
> > >>>>>>>
> > >>>>>>> It means a vhost userspace dataplane that is implemented via io threads.
> > >>>>>>
> > >>>>>> AFAIK this does not exist today. QEMU's built-in devices that use IOThreads don't use vhost code. QEMU vhost code is for vhost kernel, vhost-user, or vDPA, but not built-in devices that use IOThreads. The built-in devices implement VirtioDeviceClass callbacks directly and use AioContext APIs to run in IOThreads.
> > >>>>>
> > >>>>> Yes.
> > >>>>>
> > >>>>>>
> > >>>>>> Do you have an idea for using vhost code for built-in devices? Maybe it's fastest if you explain your idea and its advantages instead of me guessing.
> > >>>>>
> > >>>>> It's something like what I proposed in [1]:
> > >>>>>
> > >>>>> 1) a vhost that is implemented via IOThreads
> > >>>>> 2) memory translation is done via the vhost memory table/IOTLB
> > >>>>>
> > >>>>> The advantages are:
> > >>>>>
> > >>>>> 1) No 3rd-party application like a DPDK application
> > >>>>> 2) The attack surface is reduced
> > >>>>> 3) Better understanding of/interaction with the device model for things like RSS and IOMMU
> > >>>>>
> > >>>>> There could be some disadvantages, but they're not obvious to me :)
> > >>>>
> > >>>> Why is QEMU's native device emulation API not the natural choice for writing built-in devices? I don't understand why the vhost interface is desirable for built-in devices.
> > >>>
> > >>> Unless the memory helpers (like address translation) are fully optimized to satisfy this 10M+ PPS.
> > >>>
> > >>> Not sure if this is too hard, but the last time I benchmarked, perf told me most of the time was spent in the translation.
> > >>>
> > >>> Using vhost is a workaround since its memory model is much simpler, so it can skip lots of memory sections like I/O and ROM, etc.
> > >>
> > >> So, we can have a thread running as part of the QEMU process that implements vhost functionality for a virtio-net device, and this thread has an optimized way to access memory. What prevents the current virtio-net emulation code from accessing memory in the same optimized way?
> > >
> > > The current emulation uses memory core accessors, which need to take care of a lot of stuff like MMIO or even P2P. Such stuff has not been a concern since day 0 of vhost. You can do some experiments on this, e.g. just dropping packets after fetching them from the TX ring.
> >
> > If I'm reading that right, the virtio implementation is using address space caching by utilizing a memory listener and pre-translated addresses of interesting memory regions. Then it's performing address_space_read_cached(), which bypasses all the memory address translation logic on a cache hit. That sounds pretty similar to how the memory table is prepared for vhost.
>
> Exactly, but only for the vring memory structures (avail, used, and descriptor rings in the Split Virtqueue Layout).
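As an aside on why that comparison holds: a vhost-style memory table is essentially a flat list of RAM-only regions, so a GPA-to-HVA lookup is just a bounds check over a handful of entries rather than a full memory-API dispatch that also has to consider MMIO, ROM, P2P, and so on. A minimal sketch with invented names (not actual vhost or QEMU code):

    /*
     * Illustrative sketch, not QEMU or vhost code; all names are invented.
     * A vhost-style memory table lists only plain RAM regions, so a
     * GPA->HVA lookup is a short scan with bounds checks instead of a
     * full memory-API dispatch.
     */
    #include <stddef.h>
    #include <stdint.h>

    struct ram_region {
        uint64_t gpa;       /* guest-physical start of the region */
        uint64_t size;      /* length in bytes */
        void    *hva;       /* host virtual address it is mapped at */
    };

    static void *gpa_to_hva(const struct ram_region *table, size_t nregions,
                            uint64_t gpa, uint64_t len)
    {
        for (size_t i = 0; i < nregions; i++) {
            const struct ram_region *r = &table[i];

            if (gpa >= r->gpa && len <= r->size &&
                gpa - r->gpa <= r->size - len) {
                return (uint8_t *)r->hva + (gpa - r->gpa);
            }
        }
        return NULL;    /* not plain RAM: would need the slow, checked path */
    }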
Yes. It should speed things up somewhat.

>
> The packet headers and payloads are still translated using the uncached virtqueue_pop() -> dma_memory_map() -> address_space_map() API.
>
> Running a tx packet drop benchmark as Jason suggested and checking if memory translation is a bottleneck seems worthwhile. Improving dma_memory_map() performance would speed up all built-in QEMU devices.

+1

>
> Jason: When you noticed this bottleneck, were you using a normal virtio-net-pci device without vIOMMU?

A normal virtio-net-pci device without vIOMMU.

Thanks

>
> Stefan
>
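For anyone who wants to try the experiment, here is roughly what the "drop packets after fetching them from the TX ring" benchmark could look like. This is only a sketch under assumptions: the function name is invented, and in practice the drop would replace the send call inside virtio-net's tx flush path in hw/net/virtio-net.c rather than live in a separate function. What remains in the profile is then virtqueue handling plus the GPA-to-HVA translation done when popping the element.

    /*
     * Sketch only, not an actual patch: complete each popped element
     * immediately instead of handing it to the net backend, so the
     * remaining cost is virtqueue handling plus the GPA->HVA translation
     * done by virtqueue_pop().  The function name is invented; in
     * practice this logic would replace the send call in virtio-net's
     * tx flush routine.
     */
    #include "qemu/osdep.h"
    #include "hw/virtio/virtio.h"

    static void virtio_net_drop_tx(VirtIODevice *vdev, VirtQueue *tx_vq)
    {
        for (;;) {
            VirtQueueElement *elem;

            elem = virtqueue_pop(tx_vq, sizeof(VirtQueueElement));
            if (!elem) {
                break;                  /* TX ring is empty */
            }
            /* Drop: no copy to the netdev, just mark the request used. */
            virtqueue_push(tx_vq, elem, 0);
            g_free(elem);
        }
        virtio_notify(vdev, tx_vq);
    }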