On Mon, Jul 10, 2023 at 11:21 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
>
> On Mon, 10 Jul 2023 at 06:55, Ilya Maximets <i.maxim...@ovn.org> wrote:
> >
> > On 7/10/23 05:51, Jason Wang wrote:
> > > On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > >>
> > >> On 7/7/23 03:43, Jason Wang wrote:
> > >>> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefa...@gmail.com> 
> > >>> wrote:
> > >>>>
> > >>>> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>
> > >>>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefa...@gmail.com> 
> > >>>>> wrote:
> > >>>>>>
> > >>>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasow...@redhat.com> wrote:
> > >>>>>>>
> > >>>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi 
> > >>>>>>> <stefa...@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasow...@redhat.com> 
> > >>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi 
> > >>>>>>>>> <stefa...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasow...@redhat.com> 
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi 
> > >>>>>>>>>>> <stefa...@gmail.com> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasow...@redhat.com> 
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi 
> > >>>>>>>>>>>>> <stefa...@gmail.com> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang 
> > >>>>>>>>>>>>>> <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets 
> > >>>>>>>>>>>>>>> <i.maxim...@ovn.org> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
> > >>>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
> > >>>>>>>>>>>>>>>>> <i.maxim...@ovn.org> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
> > >>>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
> > >>>>>>>>>>>>>>>>>>> <jasow...@redhat.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
> > >>>>>>>>>>>>>>>>>>>> <i.maxim...@ovn.org> wrote:
> > >>>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with 
> > >>>>>>>>>>>>>>>>>> vhost=on in terms of PPS.
> > >>>>>>>>>>>>>>>>>> So, that might be one case.  Taking into account that just the
> > >>>>>>>>>>>>>>>>>> RCU lock and unlock in the virtio-net code take more time than
> > >>>>>>>>>>>>>>>>>> a packet copy, some batching on the QEMU side should improve
> > >>>>>>>>>>>>>>>>>> performance significantly.  And it shouldn't be too hard to
> > >>>>>>>>>>>>>>>>>> implement.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved
> > >>>>>>>>>>>>>>>>>> by creating a kernel thread for async Tx, similar to what
> > >>>>>>>>>>>>>>>>>> io_uring allows.  Currently, Tx on non-zero-copy interfaces is
> > >>>>>>>>>>>>>>>>>> synchronous, and that doesn't scale well.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Interestingly, there is actually a lot of "duplication" between
> > >>>>>>>>>>>>>>>>> io_uring and AF_XDP:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 1) both have a similar memory model (user-registered memory)
> > >>>>>>>>>>>>>>>>> 2) both use rings for communication
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I wonder if we can let io_uring talk directly to AF_XDP.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Well, if we submit poll() in the QEMU main loop via io_uring,
> > >>>>>>>>>>>>>>>> then we can avoid the cost of the synchronous Tx for
> > >>>>>>>>>>>>>>>> non-zero-copy modes, i.e. for virtual interfaces.  The io_uring
> > >>>>>>>>>>>>>>>> thread in the kernel will be able to perform the transmission
> > >>>>>>>>>>>>>>>> for us.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> It would be nice if we could use an iothread/vhost rather than
> > >>>>>>>>>>>>>>> the main loop, even if io_uring can use kthreads, so that we can
> > >>>>>>>>>>>>>>> avoid the memory translation cost.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code
> > >>>>>>>>>>>>>> (util/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > >>>>>>>>>>>>>> I'm working
> > >>>>>>>>>>>>>> on patches to re-enable it and will probably send them in 
> > >>>>>>>>>>>>>> July. The
> > >>>>>>>>>>>>>> patches also add an API to submit arbitrary io_uring 
> > >>>>>>>>>>>>>> operations so
> > >>>>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. 
> > >>>>>>>>>>>>>> Both the
> > >>>>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on 
> > >>>>>>>>>>>>>> Linux hosts.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Just to make sure I understand: if we still need a copy from the
> > >>>>>>>>>>>>> guest to an io_uring buffer, we still need to go via the memory
> > >>>>>>>>>>>>> API for the GPA, which seems expensive.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Vhost seems to be a shortcut for this.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Simply using io_uring for the event loop (file descriptor 
> > >>>>>>>>>>>> monitoring)
> > >>>>>>>>>>>> doesn't involve an extra buffer, but the packet payload still 
> > >>>>>>>>>>>> needs to
> > >>>>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory 
> > >>>>>>>>>>>> and
> > >>>>>>>>>>>> umem.
> > >>>>>>>>>>>
> > >>>>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring
> > >>>>>>>>>>> supports 2 stages), which needs to go via the QEMU memory core.
> > >>>>>>>>>>> And this part seemed to be very expensive in my past tests.
> > >>>>>>>>>>
> > >>>>>>>>>> Yes, but in the current approach where AF_XDP is implemented as 
> > >>>>>>>>>> a QEMU
> > >>>>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net)
> > >>>>>>>>>> happening. So the GPA to HVA translation will happen anyway in 
> > >>>>>>>>>> device
> > >>>>>>>>>> emulation.
> > >>>>>>>>>
> > >>>>>>>>> Just to make sure we're on the same page.
> > >>>>>>>>>
> > >>>>>>>>> I meant that AF_XDP can do more than, e.g., 10 Mpps. So if we
> > >>>>>>>>> still use the QEMU netdev, it would be very hard to achieve that
> > >>>>>>>>> if we stick to the QEMU memory core translations, which need to
> > >>>>>>>>> take care of too much extra stuff. That's why I suggest using
> > >>>>>>>>> vhost in io threads, which only cares about RAM, so the
> > >>>>>>>>> translation can be very fast.
> > >>>>>>>>
> > >>>>>>>> What does using "vhost in io threads" mean?
> > >>>>>>>
> > >>>>>>> It means a vhost userspace dataplane that is implemented via io 
> > >>>>>>> threads.
> > >>>>>>
> > >>>>>> AFAIK this does not exist today. QEMU's built-in devices that use
> > >>>>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> > >>>>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The
> > >>>>>> built-in devices implement VirtioDeviceClass callbacks directly and
> > >>>>>> use AioContext APIs to run in IOThreads.
> > >>>>>
> > >>>>> Yes.
> > >>>>>
> > >>>>>>
> > >>>>>> Do you have an idea for using vhost code for built-in devices? Maybe
> > >>>>>> it's fastest if you explain your idea and its advantages instead of 
> > >>>>>> me
> > >>>>>> guessing.
> > >>>>>
> > >>>>> It's something like what I proposed in [1]:
> > >>>>>
> > >>>>> 1) a vhost that is implemented via IOThreads
> > >>>>> 2) memory translation is done via the vhost memory table/IOTLB
> > >>>>>
> > >>>>> The advantages are:
> > >>>>>
> > >>>>> 1) No 3rd-party application like a DPDK application
> > >>>>> 2) A reduced attack surface
> > >>>>> 3) Better understanding of / interaction with the device model for
> > >>>>> things like RSS and the IOMMU
> > >>>>>
> > >>>>> There could be some disadvantages, but they're not obvious to me :)
> > >>>>
> > >>>> Why is QEMU's native device emulation API not the natural choice for
> > >>>> writing built-in devices? I don't understand why the vhost interface
> > >>>> is desirable for built-in devices.
> > >>>
> > >>> Unless the memory helpers (like address translation) are fully
> > >>> optimized to satisfy this 10M+ PPS.
> > >>>
> > >>> Not sure if this is too hard, but the last time I benchmarked, perf
> > >>> told me most of the time was spent in the translation.
> > >>>
> > >>> Using vhost is a workaround since its memory model is much simpler,
> > >>> so it can skip lots of memory sections like I/O, ROM, etc.
> > >>
> > >> So, we can have a thread running as part of the QEMU process that
> > >> implements vhost functionality for a virtio-net device.  And this
> > >> thread has an optimized way to access memory.  What prevents the
> > >> current virtio-net emulation code from accessing memory in the same
> > >> optimized way?
> > >
> > > The current emulation uses memory core accessors, which need to take
> > > care of a lot of stuff like MMIO or even P2P. None of that has been a
> > > concern for vhost since day 0. You can do some experiments on this,
> > > e.g. just dropping packets after fetching them from the TX ring.
> >
> > If I'm reading that right, the virtio implementation is using address
> > space caching by utilizing a memory listener and pre-translated
> > addresses of interesting memory regions.  Then it performs
> > address_space_read_cached, which bypasses all the memory address
> > translation logic on a cache hit.  That sounds pretty similar to how
> > the memory table is prepared for vhost.
>
> Exactly, but only for the vring memory structures (avail, used, and
> descriptor rings in the Split Virtqueue Layout).

Yes. It should speed things up somewhat.
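
For reference, the cached path Ilya describes looks roughly like the
sketch below. This is not the actual virtio code (which keeps
per-virtqueue MemoryRegionCache structures alive and invalidates them
from a memory listener); it's only a minimal illustration of how a
pre-translated cache lets the hot path skip the generic address space
translation:

#include "qemu/osdep.h"
#include "qemu/bswap.h"
#include "exec/memory.h"

/* Minimal sketch only: translate the avail ring once, then read the
 * idx field through the cache so repeated accesses avoid the full
 * address_space_* translation path. */
static uint16_t read_avail_idx_cached(AddressSpace *as, hwaddr avail_gpa)
{
    MemoryRegionCache cache = MEMORY_REGION_CACHE_INVALID;
    uint16_t idx = 0;

    /* One-time translation; fails for MMIO or other non-RAM backing. */
    if (address_space_cache_init(&cache, as, avail_gpa, 4, false) < 0) {
        return 0; /* real code would fall back to the slow path */
    }

    /* Hot path: offset 2 is the idx field after the 16-bit flags. */
    address_space_read_cached(&cache, 2, &idx, sizeof(idx));
    address_space_cache_destroy(&cache);
    return le16_to_cpu(idx);
}

The real win comes from keeping such a cache alive across requests
rather than re-initializing it per access, which is what the vring
caching does.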

>
> The packet headers and payloads are still translated using the
> uncached virtqueue_pop() -> dma_memory_map() -> address_space_map()
> API.
>
> Running a tx packet drop benchmark as Jason suggested and checking if
> memory translation is a bottleneck seems worthwhile. Improving
> dma_memory_map() performance would speed up all built-in QEMU devices.

+1
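
To make the experiment concrete, below is the kind of untested hack I
had in mind for the virtio-net TX path (the function name is made up;
the virtqueue calls are the existing API). It completes TX buffers
right after popping them, without copying anything into the netdev
backend, so a perf profile mostly shows the translation/mapping cost:

/* Untested sketch, in the context of hw/net/virtio-net.c:
 * complete TX buffers immediately after popping them. virtqueue_pop()
 * still walks the descriptor chain and maps each buffer (GPA->HVA),
 * so what remains is mostly translation and mapping cost. */
static void virtio_net_tx_drop_all(VirtIODevice *vdev, VirtQueue *vq)
{
    VirtQueueElement *elem;

    while ((elem = virtqueue_pop(vq, sizeof(VirtQueueElement)))) {
        virtqueue_push(vq, elem, 0); /* mark used, zero bytes written */
        g_free(elem);
    }
    virtio_notify(vdev, vq);
}

Comparing the PPS with and without a hack like this should show how
much of the per-packet cost is in translation and device emulation
versus the backend.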

>
> Jason: When you noticed this bottleneck, were you using a normal
> virtio-net-pci device without vIOMMU?

Normal virtio-net-pci device without vIOMMU.
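
And to illustrate why the vhost-style translation is so much cheaper:
the dataplane only sees a small, flat table of guest RAM regions, so
GPA->HVA is a bounds check per region. The sketch below uses made-up
types rather than the real vhost structures, but the shape is the same:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical flat memory table, similar in spirit to what vhost
 * builds from the memory listener: guest RAM only, no MMIO/ROM. */
struct mem_region {
    uint64_t gpa;      /* guest physical start */
    uint64_t size;
    void    *hva;      /* host virtual start */
};

/* GPA->HVA lookup: a linear scan over a handful of RAM regions. */
static void *gpa_to_hva(const struct mem_region *regions, size_t n,
                        uint64_t gpa, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        const struct mem_region *r = &regions[i];
        if (gpa >= r->gpa && gpa + len <= r->gpa + r->size) {
            return (uint8_t *)r->hva + (gpa - r->gpa);
        }
    }
    return NULL; /* not plain RAM; slow path or error */
}

Anything that is not plain RAM (MMIO, ROM devices, P2P) simply never
appears in the table, which is exactly the stuff the generic memory
API has to consider on every access.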

Thanks

>
> Stefan
>

