On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
> Recent performance investigation work done by Karl Rister shows that the
> guest->host notification takes around 20 us. This is more than the "overhead"
> of QEMU itself (e.g. block layer).
>
> One way to avoid the costly exit is to use polling instead of notification.
> The main drawback of polling is that it consumes CPU resources. In order to
> benefit performance the host must have extra CPU cycles available on physical
> CPUs that aren't used by the guest.
>
> This is an experimental AioContext polling implementation. It adds a polling
> callback into the event loop. Polling functions are implemented for virtio-blk
> virtqueue guest->host kick and Linux AIO completion.
>
> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of nanoseconds
> to poll before entering the usual blocking poll(2) syscall. Try setting this
> variable to the time from old request completion to new virtqueue kick.
>
> By default no polling is done. The QEMU_AIO_POLL_MAX_NS variable must be set
> to get any polling!
>
> Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
> values. If you don't find a good value we should double-check the tracing
> data to see if this experimental code can be improved.
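(For context, the mechanism described above amounts to: spin in the event loop
calling the registered poll callbacks for up to QEMU_AIO_POLL_MAX_NS
nanoseconds, and only fall back to the blocking poll(2) syscall if none of them
makes progress. The sketch below only illustrates that idea; the names
PollHandler, PollFn, and aio_wait_sketch are made up here and are not the
actual patch code.)

/* Illustrative sketch only: busy-poll registered callbacks for up to
 * QEMU_AIO_POLL_MAX_NS nanoseconds before falling back to poll(2). */

#include <poll.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef bool (*PollFn)(void *opaque);   /* returns true if progress was made */

typedef struct {
    PollFn poll;
    void *opaque;
} PollHandler;

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Poll the handlers until the deadline; if none makes progress, block. */
static void aio_wait_sketch(PollHandler *handlers, int nhandlers,
                            struct pollfd *fds, int nfds)
{
    const char *env = getenv("QEMU_AIO_POLL_MAX_NS");
    int64_t max_ns = env ? strtoll(env, NULL, 10) : 0;  /* default: no polling */
    int64_t deadline = now_ns() + max_ns;

    while (max_ns > 0 && now_ns() < deadline) {
        for (int i = 0; i < nhandlers; i++) {
            if (handlers[i].poll(handlers[i].opaque)) {
                return;   /* found work without entering a syscall */
            }
        }
    }

    poll(fds, nfds, -1);  /* nothing turned up within max_ns, block as usual */
}

(In the series itself the callbacks would be the virtio-blk virtqueue
guest->host kick check and the Linux AIO completion ring check mentioned
above.)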
Stefan

I ran some quick tests with your patches and got some pretty good gains,
but also some seemingly odd behavior.

These results are for a 5 minute test doing sequential 4KB requests from
fio using O_DIRECT, libaio, and an IO depth of 1. The requests are
performed directly against the virtio-blk device (no filesystem), which
is backed by a 400GB NVMe card. (A rough sketch of what this workload
boils down to is included at the end of this mail.)

QEMU_AIO_POLL_MAX_NS      IOPS
unset                   31,383
    1                   46,860
    2                   46,440
    4                   35,246
    8                   34,973
   16                   46,794
   32                   46,729
   64                   35,520
  128                   45,902

I found the results for 4, 8, and 64 odd, so I re-ran some tests to check
for consistency. I used values of 2 and 4 and ran each 5 times. Here is
what I got:

Iteration    QEMU_AIO_POLL_MAX_NS=2    QEMU_AIO_POLL_MAX_NS=4
    1                46,972                    35,434
    2                46,939                    35,719
    3                47,005                    35,584
    4                47,016                    35,615
    5                47,267                    35,474

So the results seem consistent.

I saw some discussion on the patches which makes me think you'll be making
some changes, is that right? If so, I may wait for the updates and then we
can run the much more exhaustive set of workloads (sequential read and
write, random read and write) at various block sizes (4, 8, 16, 32, 64,
128, and 256 KB) and multiple IO depths (1 and 32) that we were doing when
we started looking at this.

Karl

>
> Stefan Hajnoczi (3):
>   aio-posix: add aio_set_poll_handler()
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
>
>  aio-posix.c         | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  block/linux-aio.c   |  17 +++++++
>  hw/virtio/virtio.c  |  19 ++++++++
>  include/block/aio.h |  16 +++++++
>  4 files changed, 185 insertions(+)
>

-- 
Karl Rister <kris...@redhat.com>
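P.S. For clarity on what the quick test was doing, the fio workload above
boils down to roughly the following loop (4KB sequential requests, O_DIRECT,
libaio, IO depth 1, raw block device). This is just an illustration: fio
produced the actual numbers, /dev/vdb and the request count are placeholders,
and it is shown here as reads. Build with something like
"gcc -O2 -o seqread seqread.c -laio".

/* Illustrative only: one 4KB request in flight at a time against the
 * raw virtio-blk device, bypassing the page cache via O_DIRECT. */

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/vdb", O_RDONLY | O_DIRECT);   /* placeholder device */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    io_context_t ctx = 0;
    if (io_setup(1, &ctx) < 0) {                      /* IO depth of 1 */
        fprintf(stderr, "io_setup failed\n");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) {           /* O_DIRECT needs alignment */
        return 1;
    }

    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;

    for (long long off = 0; off < 4096LL * 100000; off += 4096) {
        io_prep_pread(&cb, fd, buf, 4096, off);       /* next sequential 4KB block */
        if (io_submit(ctx, 1, cbs) != 1) {            /* in a guest this ends up kicking the virtqueue */
            break;
        }
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { /* wait for the completion */
            break;
        }
    }

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}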