This patch series refactors QEMU's FUSE export module to use coroutines for read/write operations, addressing concurrency limitations and aligning the module with QEMU's asynchronous I/O model. The changes yield a modest (about 4%) performance improvement while simplifying resource management.
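The general idea of moving request handling into coroutines can be illustrated with a minimal, hypothetical sketch in plain C. It does not use QEMU's coroutine API (qemu_coroutine_create(), qemu_coroutine_enter(), qemu_coroutine_yield()) and is not code from this patch; it assumes POSIX ucontext (getcontext/makecontext/swapcontext, as provided by glibc on Linux), and Request, co_enter(), co_yield(), request_co() and run_requests() are illustrative names only. Each request runs in its own coroutine, yields while its simulated I/O is outstanding, and is resumed by a simple event loop, so several requests can be in flight on a single thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)
#define MAX_REQS 8

/* Hypothetical per-request state; not QEMU's FUSE export structures. */
typedef struct Request {
    ucontext_t ctx;   /* saved coroutine context */
    char *stack;      /* private coroutine stack */
    int io_pending;   /* nonzero while the simulated I/O is outstanding */
    int done;         /* set once the request has finished */
} Request;

static Request reqs[MAX_REQS];
static ucontext_t loop_ctx;  /* context of the event loop (the caller) */
static Request *current;     /* coroutine currently running, if any */
static long enter_count;     /* how many times a coroutine was entered */
static int completed;        /* how many requests have finished */

/* Switch from the event loop into a coroutine until it yields or returns. */
static void co_enter(Request *r)
{
    enter_count++;
    current = r;
    swapcontext(&loop_ctx, &r->ctx);
    current = NULL;
}

/* Switch from the running coroutine back to the event loop. */
static void co_yield(void)
{
    swapcontext(&current->ctx, &loop_ctx);
}

/* Coroutine body: submit I/O, yield instead of blocking, then finish. */
static void request_co(int idx)
{
    Request *r = &reqs[idx];

    r->io_pending = 1;   /* pretend an asynchronous read was submitted */
    co_yield();          /* let the event loop run other requests */

    /* Resumed by the event loop after the I/O "completed". */
    r->done = 1;
    completed++;
    /* Returning here switches to uc_link, i.e. back to the event loop. */
}

/* Start nreqs requests, then complete them; returns the number of entries. */
static long run_requests(int nreqs)
{
    assert(nreqs >= 0 && nreqs <= MAX_REQS);
    enter_count = 0;
    completed = 0;

    for (int i = 0; i < nreqs; i++) {
        Request *r = &reqs[i];
        r->io_pending = 0;
        r->done = 0;
        r->stack = malloc(STACK_SIZE);
        if (r->stack == NULL || getcontext(&r->ctx) == -1) {
            abort();
        }
        r->ctx.uc_stack.ss_sp = r->stack;
        r->ctx.uc_stack.ss_size = STACK_SIZE;
        r->ctx.uc_link = &loop_ctx;
        makecontext(&r->ctx, (void (*)(void))request_co, 1, i);
        co_enter(r);     /* runs until the request yields on its I/O */
    }

    /* Every request is now in flight; complete them in submission order. */
    for (int i = 0; i < nreqs; i++) {
        if (reqs[i].io_pending) {
            reqs[i].io_pending = 0;
            co_enter(&reqs[i]);   /* resume the coroutine to finish it */
        }
    }

    for (int i = 0; i < nreqs; i++) {
        free(reqs[i].stack);
    }
    return enter_count;
}
```

Calling run_requests(4) starts four requests, all of which are in flight before any of them completes, and returns 8: each request costs one entry to start and one to resume after its I/O completes. Counting entries per request in this way is one way to reason about the qemu_coroutine_enter call counts reported in the profile.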
1. Implementation

Following Stefan's suggestion, I moved the processing logic of
read_from_fuse_export() into a coroutine for buffer management, and
changed fuse_getattr() to call bdrv_co_get_allocated_file_size().

2. Performance Summary

The average results of the patched build (coroutine_integration_fuse)
compared with the unpatched baseline (origin) for iodepth=1 and
iodepth=64 are as follows:

-------------------------------------------------------------------
Average results for iodepth=1:

Metric      coroutine_integration_fuse  origin          Improvement
Read_IOPS   4492.88                     4309.39         4.25%
Write_IOPS  4500.68                     4318.68         4.21%
Read_BW     17971.00 KB/s               17237.30 KB/s   4.26%
Write_BW    18002.50 KB/s               17274.30 KB/s   4.23%
-------------------------------------------------------------------
Average results for iodepth=64:

Metric      coroutine_integration_fuse  origin          Improvement
Read_IOPS   5576.93                     5347.13         4.29%
Write_IOPS  5569.55                     5337.33         4.33%
Read_BW     22311.40 KB/s               21392.20 KB/s   4.31%
Write_BW    22282.20 KB/s               21353.20 KB/s   4.34%
-------------------------------------------------------------------

Although all metrics improve, the gains are confined to the 4.2%–4.3%
range, which is lower than expected. Profiling with gprof points to the
reasons for this limited improvement.

3. Performance Bottlenecks Identified via gprof

I ran fio with the following command and analyzed the execution profile
with gprof:

  fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 \
      --rw=randrw --bs=4k --time_based=1 --name=job1 \
      --filename=/mnt/qemu-fuse --iodepth=64

The following issues were identified:

3.1 Increased Share of Execution Time in the Write Path

In the original implementation, fuse_write + blk_pwrite accounted for
8.7% of total execution time (6.0% + 2.7%).
After refactoring, fuse_write_coroutine + blk_co_pwrite account for
43.1% (22.9% + 20.2%), which suggests that coroutine overhead contributes
significantly to execution time.

3.2 Increased Read and Write Calls

- fuse_write calls increased from 173,400 to 333,232.
- fuse_read calls increased from 173,526 to 332,931.

Since IOPS rose by only about 4%, this near-doubling of calls suggests
that the coroutine-based approach introduces redundant I/O calls, likely
due to unnecessary coroutine switches.

3.3 Significant Coroutine Overhead

qemu_coroutine_enter is now called 1,572,803 times, compared to ~476,057
previously. This frequent coroutine switching adds overhead that limits
the expected performance improvement.

saz97 (1):
  Integration coroutines into fuse export

 block/export/fuse.c | 190 +++++++++++++++++++++++++++++---------------
 1 file changed, 126 insertions(+), 64 deletions(-)

-- 
2.34.1