On 27.03.25 16:35, Stefan Hajnoczi wrote:
On Tue, Mar 25, 2025 at 05:06:51PM +0100, Hanna Czenczek wrote:
Manually read requests from the /dev/fuse FD and process them, without
using libfuse.  This allows us to safely add parallel request processing
in coroutines later, without having to worry about libfuse internals.
(Technically, we already have exactly that problem with
read_from_fuse_export()/read_from_fuse_fd() nesting.)

We will continue to use libfuse for mounting the filesystem; fusermount3
is effectively a helper program of libfuse, so it should know best how
to interact with it.  (Doing it manually without libfuse, while doable,
is a bit of a pain, and it is not clear to me how stable the "protocol"
actually is.)

Take the opportunity of this rather major rewrite to update the Copyright
line with corrected information that has surfaced in the meantime.

Here are some benchmarks from before this patch (4k, iodepth=16, libaio;
except for the 'sync' rows, which use iodepth=1 and pvsync2):

file:
   read:
     seq aio:   78.6k ±1.3k IOPS
     rand aio:  39.3k ±2.9k
     seq sync:  32.5k ±0.7k
     rand sync:  9.9k ±0.1k
   write:
     seq aio:   61.9k ±0.5k
     rand aio:  61.2k ±0.6k
     seq sync:  27.9k ±0.2k
     rand sync: 27.6k ±0.4k
null:
   read:
     seq aio:   214.0k ±5.9k
     rand aio:  212.7k ±4.5k
     seq sync:   90.3k ±6.5k
     rand sync:  89.7k ±5.1k
   write:
     seq aio:   203.9k ±1.5k
     rand aio:  201.4k ±3.6k
     seq sync:   86.1k ±6.2k
     rand sync:  84.9k ±5.3k

And with this patch applied:

file:
   read:
     seq aio:   76.6k ±1.8k (- 3 %)
     rand aio:  26.7k ±0.4k (-32 %)
     seq sync:  47.7k ±1.2k (+47 %)
     rand sync: 10.1k ±0.2k (+ 2 %)
   write:
     seq aio:   58.1k ±0.5k (- 6 %)
     rand aio:  58.1k ±0.5k (- 5 %)
     seq sync:  36.3k ±0.3k (+30 %)
     rand sync: 36.1k ±0.4k (+31 %)
null:
   read:
     seq aio:   268.4k ±3.4k (+25 %)
     rand aio:  265.3k ±2.1k (+25 %)
     seq sync:  134.3k ±2.7k (+49 %)
     rand sync: 132.4k ±1.4k (+48 %)
   write:
     seq aio:   275.3k ±1.7k (+35 %)
     rand aio:  272.3k ±1.9k (+35 %)
     seq sync:  130.7k ±1.6k (+52 %)
     rand sync: 127.4k ±2.4k (+50 %)

So clearly the AIO file results are actually not good, and random reads
are indeed quite terrible.  On the other hand, we can see from the sync
and null results that request handling should in theory be quicker.  How
does this fit together?

I believe the bad AIO results are an artifact of the accidental parallel
request processing we have due to nested polling: Depending on how the
actual request processing is structured and how long request processing
takes, more or less requests will be submitted in parallel.  So because
of the restructuring, I think this patch accidentally changes how many
requests end up being submitted in parallel, which decreases
performance.

(I have seen something like this before: In RSD, without having
implemented a polling mode, the debug build tended to have better
performance than the more optimized release build, because the debug
build, taking longer to submit requests, ended up processing more
requests in parallel.)

In any case, once we use coroutines throughout the code, performance
will improve again across the board.

Signed-off-by: Hanna Czenczek <hre...@redhat.com>
---
  block/export/fuse.c | 793 +++++++++++++++++++++++++++++++-------------
  1 file changed, 567 insertions(+), 226 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 3dd50badb3..407b101018 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c

[...]

+/**
+ * Check the FUSE FD for whether it is readable or not.  Because we cannot
+ * reasonably do this without reading a request at the same time, also read and
+ * process that request if any.
+ * (To be used as a poll handler for the FUSE FD.)
+ */
+static bool poll_fuse_fd(void *opaque)
+{
+    return read_from_fuse_fd(opaque);
+}

The other io_poll() callbacks in QEMU peek at memory whereas this one
invokes the read(2) syscall. Two reasons why this is a problem:
1. Syscall latency is too high. Other fd handlers will be delayed by
    microseconds.
2. This doesn't scale. If every component in QEMU does this then the
    event loop degrades to O(n) of non-blocking read(2) syscalls where n
    is the number of fds.

Also, handling the request inside the io_poll() callback skews
AioContext's time accounting because time spent handling the request
will be accounted as "polling time". The adaptive polling calculation
will think it polled for longer than it did.

If there is no way to peek at memory, please don't implement the
io_poll() callback.

Got it, thanks!

Hanna

