Stefan Hajnoczi writes:
> On Thu, Mar 14, 2019 at 06:31:34PM +0100, Sergio Lopez wrote:
>> Our current AIO path does a great job at unloading the work from the
>> VM, and combined with IOThreads provides good performance in most
>> scenarios. But it also comes with its costs, in both a longer
>> execution path and the need for the intervention of the scheduler at
>> various points.
>>
>> There's one particular workload that suffers from this cost, and
>> that's when you have just 1 or 2 cores on the Guest issuing
>> synchronous requests. This happens to be a pretty common workload for
>> some DBs and, in a general sense, on small VMs.
>>
>> I did a quick'n'dirty implementation on top of virtio-blk to get some
>> numbers. This comes from a VM with 4 CPUs running on an idle server,
>> with a secondary virtio-blk disk backed by a null_blk device with a
>> simulated latency of 30us.
>
> Can you describe the implementation in more detail? Does "synchronous"
> mean that hw/block/virtio_blk.c makes a blocking preadv()/pwritev()
> call instead of calling blk_aio_preadv/pwritev()? If so, then you are
> also bypassing the QEMU block layer (coroutines, request tracking,
> etc.) and that might explain some of the latency.

The first implementation, the one I used for getting these numbers, is
just preadv()/pwritev() from virtio_blk.c, as you correctly guessed. I
know it's unfair, but I wanted to take a look at the best possible
scenario first, and then measure the cost of the other layers.

I'm working now on writing non-coroutine counterparts for
blk_co_[preadv|pwritev], so we have SIO without bypassing the block
layer.
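To make the idea concrete, the synchronous path boils down to something
like this, in plain POSIX terms (sio_rw and its signature are made up
for illustration, the real thing sits in the virtio-blk request handler
and runs in the iothread context):

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <errno.h>
    #include <stdbool.h>

    /* Issue a request synchronously: the iothread blocks in the
     * syscall instead of submitting the request to the AIO engine and
     * going back to the event loop. Short reads/writes would need to
     * be handled in real code. */
    static bool sio_rw(int fd, struct iovec *iov, int iovcnt,
                       off_t offset, bool is_write)
    {
        ssize_t ret;

        do {
            ret = is_write ? pwritev(fd, iov, iovcnt, offset)
                           : preadv(fd, iov, iovcnt, offset);
        } while (ret < 0 && errno == EINTR);

        return ret >= 0;
    }

The whole saving comes from skipping the AIO submission and completion
machinery; the notification from the vCPU still goes through ioeventfd.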
> It's important for this discussion that we understand what you tried
> out. "Synchronous" can mean different things. Since iothread is in
> play the code path is still asynchronous from the vcpu thread's
> perspective (thanks ioeventfd!). The guest CPU is not stuck during I/O
> (good for quality of service) - however SIO+iothread may need to be
> woken up and scheduled on a host CPU (bad for latency).

I've tried SIO with ioeventfd=off, to make it fully synchronous, but
the performance is significantly worse. I'm not sure if this is due to
cache pollution, or simply that the guest CPU is able to move on early
and be ready to process the IRQ when it's signalled. Or maybe both.

>> - Average latency (us)
>>
>> ----------------------------------------
>> |        | AIO+iothread | SIO+iothread |
>> |  1 job |           70 |           55 |
>> | 2 jobs |           83 |           82 |
>> | 4 jobs |           90 |          159 |
>> ----------------------------------------
>
> BTW recently I've found that the latency distribution can contain
> important clues that a simple average doesn't show (e.g. multiple
> peaks, outliers, etc.). If you continue to investigate this it might
> be interesting to plot the distribution.

Interesting, noted.

>> In this case the intuition matches the reality, and synchronous I/O
>> wins when there's just 1 job issuing the requests, while it loses
>> hard when there are 4.
>
> Have you looked at the overhead of AIO+event loop? ppoll()/epoll(),
> read()ing the eventfd to clear it, and Linux AIO io_submit().

Not in a while, and that reminds me I wanted to check whether we could
improve the poll-max-ns heuristics.

> I had some small patches that try to reorder/optimize these operations
> but never got around to benchmarking and publishing them. They do not
> reduce latency as low as SIO but they shave off a handful of
> microseconds.
>
> Resuming this work might be useful. Let me know if you'd like me to
> dig out the old patches.

I would definitely like to take a look at those patches.

>> While my first thought was implementing this as a tunable, it turns
>> out we have a hint about the nature of the workload in the number of
>> requests in the VQ. So I updated the code to use SIO if there's just
>> 1 request and AIO otherwise, with these results:
>
> Nice, avoiding tunables is good. That way it can automatically adjust
> depending on the current workload and we don't need to educate users
> on tweaking a tunable.
>
>> -----------------------------------------------------------
>> |        | AIO+iothread | SIO+iothread | AIO+SIO+iothread |
>> |  1 job |           70 |           55 |               55 |
>> | 2 jobs |           83 |           82 |               78 |
>> | 4 jobs |           90 |          159 |               90 |
>> -----------------------------------------------------------
>>
>> This data makes me think this is something worth pursuing, but I'd
>> like to hear your opinion on it.
>
> I think it's definitely worth experimenting with more. One thing to
> consider: the iothread is a shared resource when multiple devices are
> assigned to a single iothread. In that case we probably do not want
> SIO since it would block the other emulated devices from processing
> requests.

Good point.

> On a related note, there is a summer internship project to implement
> support for the new io_uring API (successor to Linux AIO):
> https://wiki.qemu.org/Google_Summer_of_Code_2019#io_uring_AIO_engine
>
> So please *don't* implement io_uring support right now ;-).

Heh, you got me. That was my initial idea, but luckily I took a look at
the GSoC page first ;-)

Thanks,
Sergio.
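P.S. In case it makes the discussion easier, the 1-request heuristic
above boils down to something like this (a sketch with stand-in types
and helpers, not the actual patch):

    /* Stand-in type and helpers, for illustration only. */
    typedef struct Request Request;

    void do_sio(Request *req);  /* blocking preadv()/pwritev()        */
    void do_aio(Request *req);  /* blk_aio_preadv()/blk_aio_pwritev() */

    /* Called for each batch of requests popped from the virtqueue. */
    static void submit_batch(Request **reqs, unsigned int nreq)
    {
        if (nreq == 1) {
            /* A single in-flight request hints at a guest issuing
             * synchronous I/O, where the shorter blocking path
             * wins. */
            do_sio(reqs[0]);
        } else {
            /* Blocking would serialize multiple queued requests, so
             * keep the asynchronous path for them. */
            for (unsigned int i = 0; i < nreq; i++) {
                do_aio(reqs[i]);
            }
        }
    }

Your point about shared iothreads could probably be folded in as one
more condition here.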