On Mon, Dec 05, 2016 at 06:25:49PM +0800, Fam Zheng wrote:
> On Thu, 12/01 19:26, Stefan Hajnoczi wrote:
> > One way to avoid the costly exit is to use polling instead of notification.
>
> Testing with fio on virtio-blk backed by NVMe device, I can see significant
> performance improvement with this series:
>
> poll-max-ns    iodepth=1    iodepth=32
> --------------------------------------
>           0     24806.94     151430.36
>           1     24435.02     155285.59
>         100     24702.41     149937.2
>        1000     24457.08     152936.3
>       10000     24605.05     156158.12
>       13000     24412.95     154908.73
>       16000     30654.24     185731.58
>       18000     36365.69     185538.78
>       20000     35640.48     188640.61
>       30000     37411.05     184198.39
>       60000     35230.66     190427.42
>
> So this definitely helps synchronous I/O with queue depth = 1. Great!

Nice.  Even with iodepth=32 the improvement is significant.
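(The exact fio job file isn't included in the report above.  For anyone who
wants to reproduce the comparison, an invocation along the following lines
inside the guest would cover the iodepth=1 and iodepth=32 columns.  The block
size, access pattern, runtime, and the /dev/vdb path are my assumptions, not
details taken from Fam's setup:)

    # Assumed workload: 4k random reads against the virtio-blk disk in the guest
    fio --name=polltest --filename=/dev/vdb --direct=1 --ioengine=libaio \
        --rw=randread --bs=4k --runtime=60 --time_based \
        --iodepth=1        # repeat with --iodepth=32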
> I have a small concern with high queue depth, though. The more frequently we
> check the virtqueue, the less likely the requests can be batched, and the
> more submissions (both from guest to QEMU and from QEMU to host) are needed
> to achieve the same bandwidth, because we'd do less plug/unplug. This could
> be a problem under heavy workload. We may want to consider the driver's
> transient state when it is appending requests to the virtqueue. For example,
> the virtio-blk driver in Linux updates the avail idx after adding each
> request. If QEMU looks at the index in the middle, it will miss the
> opportunities to plug/unplug and merge requests. On the other hand, though
> the virtio-blk driver doesn't have an I/O scheduler, it does have some batch
> submission semantics passed down by blk-mq (see the "notify" condition in
> drivers/block/virtio_blk.c:virtio_queue_rq()). So I'm wondering whether it
> makes sense to wait for the whole batch of requests to be added to the
> virtqueue before processing it? This can be done by changing the driver to
> only update the "avail" index after all requests are added to the queue, or
> even by adding a flag in the virtio ring descriptor to suppress busy polling.

Only benchmarking can tell.  It would be interesting to extract the
vq->vring.avail->idx update from virtqueue_add().

I don't think a new flag is necessary.

> > The main drawback of polling is that it consumes CPU resources.  In order
> > to benefit performance the host must have extra CPU cycles available on
> > physical CPUs that aren't used by the guest.
> >
> > This is an experimental AioContext polling implementation.  It adds a
> > polling callback into the event loop.  Polling functions are implemented
> > for virtio-blk virtqueue guest->host kick and Linux AIO completion.
> >
> > The -object iothread,poll-max-ns=NUM parameter sets the number of
> > nanoseconds to poll before entering the usual blocking poll(2) syscall.
> > Try setting this parameter to the time from old request completion to new
> > virtqueue kick.  By default no polling is done, so you must set this
> > parameter to get busy polling.
>
> Given the self-tuning, should we document the best practice for setting the
> value? Is it okay for the user or even QEMU to use a relatively large value
> by default and expect it to tune itself sensibly?

Karl: do you have time to run a bigger suite of benchmarks to identify a
reasonable default poll-max-ns value?  Both aio=native and aio=threads are
important.

If there is a sweet spot that improves performance without pathological
cases, then we could even enable polling by default in QEMU.

Otherwise we'd just document the recommended polling duration as a starting
point for users.

Stefan
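P.S. To make the polling idea concrete, here is a minimal sketch of what the
virtqueue kick poll check boils down to.  The struct and function names are
hypothetical, chosen for illustration only; they are not the API added by
this series.

    /* Minimal sketch -- hypothetical names, not the actual QEMU code.
     * A poll callback re-reads the vring avail index published by the
     * guest driver and reports whether new requests have appeared since
     * the virtqueue was last processed.  Returning true lets the event
     * loop service the ring immediately instead of sleeping in poll(2)
     * and waiting for an ioeventfd kick.
     */
    #include <stdbool.h>
    #include <stdint.h>

    /* Standard split-virtqueue avail ring layout (simplified). */
    struct vring_avail {
        uint16_t flags;
        uint16_t idx;       /* bumped by the guest after adding requests */
        uint16_t ring[];
    };

    struct vq_poll_state {
        volatile struct vring_avail *avail; /* mapped guest memory */
        uint16_t last_avail_idx;            /* last index we processed */
    };

    /* true == "work is pending, skip the blocking poll(2)".
     * A real implementation also needs an acquire barrier before
     * trusting the descriptors that avail->idx points at.
     */
    static bool virtqueue_kick_poll(struct vq_poll_state *vq)
    {
        return vq->avail->idx != vq->last_avail_idx;
    }

Note that this check cooperates naturally with Fam's batching suggestion: if
the guest driver only publishes avail->idx once per batch, the poll callback
sees the whole batch at once rather than individual requests.

For experimenting with the parameter itself, a command line along these lines
should work once the series is applied (the 20000ns value is just one of the
points from the table above, and the drive path is a placeholder):

    qemu-system-x86_64 ... \
        -object iothread,id=iothread0,poll-max-ns=20000 \
        -drive if=none,id=drive0,file=/dev/nvme0n1,format=raw,cache=none,aio=native \
        -device virtio-blk-pci,drive=drive0,iothread=iothread0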