On 09/10/2012 12:52, Avi Kivity wrote:
> On 10/09/2012 12:36 PM, Paolo Bonzini wrote:
>> On 09/10/2012 11:26, Avi Kivity wrote:
>>> On 10/09/2012 11:08 AM, Stefan Hajnoczi wrote:
>>>> Here are the steps that have been mentioned:
>>>>
>>>> 1. aio fastpath - for raw-posix and other aio block drivers, can we
>>>>    reduce I/O request latency by skipping block layer coroutines?
>>>
>>> Is coroutine overhead noticeable?
>>
>> I'm thinking more about throughput than latency.  If the iothread
>> becomes CPU-bound, then everything is noticeable.
>
> That's not strictly a coroutine issue.  Switching to ordinary threads
> may make the problem worse, since there will clearly be contention.

The point is that you don't need either coroutines or userspace threads if
you use native AIO.  longjmp/setjmp is probably a smaller overhead than the
many syscalls involved in poll + eventfd reads + io_submit + io_getevents,
but it's not cheap either.  Also, if you process AIO completions in batches
you risk overflowing the pool of free coroutines, which gets expensive very
fast (allocating/freeing the stack, doing the expensive getcontext/swapcontext
instead of the cheaper longjmp/setjmp, and so on).  It seems better to
sidestep the issue completely; it's a small amount of work.
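Something like this (untested; the aio_fastpath_* and request_done() names
are made up for illustration, only the libaio/eventfd syscall sequence is
the point):

#include <libaio.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <stdint.h>

#define MAX_EVENTS 128

static io_context_t ctx;
static int efd;                      /* registered with the iothread poll loop */

/* hypothetical completion callback, provided by the caller */
void request_done(struct iocb *iocb, long res);

static void aio_fastpath_init(void)
{
    efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    io_setup(MAX_EVENTS, &ctx);
}

/* Submit one read directly, without bouncing through a coroutine. */
static int aio_fastpath_pread(int fd, void *buf, size_t len, off_t offset,
                              struct iocb *iocb)
{
    struct iocb *iocbs[1] = { iocb };

    io_prep_pread(iocb, fd, buf, len, offset);
    io_set_eventfd(iocb, efd);       /* completion bumps the eventfd counter */
    return io_submit(ctx, 1, iocbs); /* one syscall per submitted batch */
}

/* Called when the poll loop sees the eventfd become readable. */
static void aio_fastpath_complete(void)
{
    struct io_event ev[MAX_EVENTS];
    uint64_t count;
    int i, n;

    if (read(efd, &count, sizeof(count)) < 0) {  /* drain the counter */
        return;
    }
    n = io_getevents(ctx, 1, MAX_EVENTS, ev, NULL);
    for (i = 0; i < n; i++) {
        /* complete the request right here, no coroutine re-entry */
        request_done(ev[i].obj, ev[i].res);
    }
}

Even here every completed batch costs an eventfd read plus io_getevents on
top of io_submit, which is the overhead I'm comparing against a single
setjmp/longjmp switch above.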
> What is the I/O processing time we have?  If it's say 10 microseconds,
> then we'll have 100,000 context switches per second assuming a device
> lock and a saturated iothread (split into multiple threads).

Hopefully with a saturated, dedicated iothread you would not have any
context switches, and a single CPU would simply be dedicated to virtio
processing.

> The coroutine work may have laid the groundwork for fine-grained
> locking.  I'm doubtful we should use qcow when we want >100K IOPS though.

Yep.  Going away from coroutines is a solution in search of a problem; it
would introduce several new variables (kernel scheduling, more expensive
lock contention, starving the thread pool with locked threads, ...), all
for a case where performance hardly matters.

>>>> I'm also curious about virtqueue_pop()/virtqueue_push() outside the
>>>> QEMU mutex, although that might be blocked by the current work around
>>>> MMIO/PIO dispatch outside the global mutex.
>>>
>>> It is, yes.
>>
>> It should only require unlocked memory map/unmap, not MMIO dispatch.
>> The MMIO/PIO bits are taken care of by ioeventfd.
>
> The ring, or indirect descriptors, or the data, can all be on mmio.
> IIRC the virtio spec forbids that, but the APIs have to be general.  We
> don't have cpu_physical_memory_map_nommio() (or
> address_space_map_nommio(), as soon as the coding style committee
> ratifies struct literals).

cpu_physical_memory_map() could still take the QEMU lock in the slow
bounce-buffer case, so only the common RAM path needs to run unlocked.

BTW, the block layer has been using struct literals for a long time and
we're just as happy as you are about them. :)

Paolo
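P.S.: a rough, untested sketch of what I mean by taking the lock only on
the bounce-buffer path.  The helper names (map_guest_memory,
address_space_translate_nolock, qemu_ram_ptr, bounce_map) are made up for
illustration; only the locking split matters:

static void *map_guest_memory(hwaddr addr, hwaddr *len, bool is_write)
{
    MemoryRegion *mr;
    hwaddr xlat;
    void *p;

    /* Fast path: plain guest RAM can be mapped without any lock. */
    mr = address_space_translate_nolock(&address_space_memory,
                                        addr, &xlat, len, is_write);
    if (memory_region_is_ram(mr)) {
        return qemu_ram_ptr(mr, xlat);
    }

    /* Slow path: MMIO and friends go through a bounce buffer, still
     * serialized by the global mutex just like today. */
    qemu_mutex_lock_iothread();
    p = bounce_map(mr, xlat, len, is_write);
    qemu_mutex_unlock_iothread();
    return p;
}

That keeps virtqueue_pop()/virtqueue_push() lock-free in the common case
while leaving the rare bounce-buffer case exactly as safe as it is now.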