On 11/14/2016 09:26 AM, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
>>> Recent performance investigation work done by Karl Rister shows that the
>>> guest->host notification takes around 20 us.  This is more than the
>>> "overhead" of QEMU itself (e.g. block layer).
>>>
>>> One way to avoid the costly exit is to use polling instead of notification.
>>> The main drawback of polling is that it consumes CPU resources.  In order
>>> to benefit performance, the host must have extra CPU cycles available on
>>> physical CPUs that aren't used by the guest.
>>>
>>> This is an experimental AioContext polling implementation.  It adds a
>>> polling callback into the event loop.  Polling functions are implemented
>>> for the virtio-blk virtqueue guest->host kick and Linux AIO completion.
>>>
>>> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of
>>> nanoseconds to poll before entering the usual blocking poll(2) syscall.
>>> Try setting this variable to the time from old request completion to new
>>> virtqueue kick.
>>>
>>> By default no polling is done.  QEMU_AIO_POLL_MAX_NS must be set to get
>>> any polling!
>>>
>>> Karl: I hope you can try this patch series with several
>>> QEMU_AIO_POLL_MAX_NS values.  If you don't find a good value we should
>>> double-check the tracing data to see if this experimental code can be
>>> improved.
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5-minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem) which
>> is backed by a 400GB NVMe card.
>>
>> QEMU_AIO_POLL_MAX_NS    IOPS
>>                unset  31,383
>>                    1  46,860
>>                    2  46,440
>>                    4  35,246
>>                    8  34,973
>>                   16  46,794
>>                   32  46,729
>>                   64  35,520
>>                  128  45,902
>
> The environment variable is in nanoseconds.  The range of values you
> tried is very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.
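
To make the mechanism above concrete, below is a minimal standalone sketch of
the busy-poll-then-block pattern Stefan describes: spin for up to
QEMU_AIO_POLL_MAX_NS nanoseconds before falling back to a blocking poll(2).
This is not the actual patch code; wait_for_events(), work_available(), and
the use of stdin are hypothetical stand-ins for the AioContext event loop,
the virtio-blk virtqueue kick / Linux AIO completion polling callbacks, and
the real event file descriptor.

/* poll_sketch.c -- illustrative only, not the QEMU implementation. */
#include <poll.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

static int64_t poll_max_ns;  /* 0 means polling disabled (the default) */

static void poll_init(void)
{
    const char *s = getenv("QEMU_AIO_POLL_MAX_NS");

    poll_max_ns = s ? strtoll(s, NULL, 10) : 0;
}

static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical polling callback: returns true when there is work to do,
 * e.g. a new virtqueue kick or a completed Linux AIO request. */
static bool work_available(void)
{
    return false;
}

/* Busy-poll for up to poll_max_ns, then fall back to blocking poll(2). */
static int wait_for_events(int fd)
{
    if (poll_max_ns > 0) {
        int64_t deadline = now_ns() + poll_max_ns;

        while (now_ns() < deadline) {
            if (work_available()) {
                return 1;  /* handled without a blocking syscall */
            }
        }
    }

    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    return poll(&pfd, 1, -1);  /* the usual blocking path */
}

int main(void)
{
    poll_init();
    /* stdin stands in for an AioContext file descriptor. */
    return wait_for_events(0) >= 0 ? 0 : 1;
}

The trade-off is the one already noted: while the loop spins it burns a host
CPU, so it only helps when spare physical CPU cycles are available, and
leaving QEMU_AIO_POLL_MAX_NS unset keeps the plain blocking behavior.
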
Here are some more numbers with higher values.  I continued the powers of
2 and added your suggested values as well:

QEMU_AIO_POLL_MAX_NS    IOPS
                 256  46,929
                 512  35,627
               1,024  46,477
               2,000  35,247
               2,048  46,322
               4,000  46,540
               4,096  46,368
               8,000  47,054
               8,192  46,671
              16,000  46,466
              16,384  32,504
              32,000  20,620
              32,768  20,807

>
> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.
>
>> I found the results for 4, 8, and 64 odd so I re-ran some tests to check
>> for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
>> is what I got:
>>
>> Iteration  QEMU_AIO_POLL_MAX_NS=2  QEMU_AIO_POLL_MAX_NS=4
>>         1                  46,972                  35,434
>>         2                  46,939                  35,719
>>         3                  47,005                  35,584
>>         4                  47,016                  35,615
>>         5                  47,267                  35,474
>>
>> So the results seem consistent.
>
> That is interesting.  I don't have an explanation for the consistent
> difference between 2 and 4 ns polling time.  The time difference is so
> small yet the IOPS difference is clear.
>
> Comparing traces could shed light on the cause for this difference.
>
>> I saw some discussion on the patches which makes me think you'll be
>> making some changes, is that right?  If so, I may wait for the updates
>> and then we can run the much more exhaustive set of workloads
>> (sequential read and write, random read and write) at various block
>> sizes (4, 8, 16, 32, 64, 128, and 256 KB) and multiple IO depths (1 and
>> 32) that we were doing when we started looking at this.
>
> I'll send an updated version of the patches.
>
> Stefan
>

-- 
Karl Rister <kris...@redhat.com>