On Wed, Aug 13, 2014 at 9:16 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 13/08/2014 11:54, Kevin Wolf wrote:
>> On 12.08.2014 at 21:08, Paolo Bonzini wrote:
>>> On 12/08/2014 10:12, Ming Lei wrote:
>>>>>> The below patch is basically the minimal change to bypass coroutines.
>>>>>> Of course the block.c part is not acceptable as is (the change to
>>>>>> refresh_total_sectors is broken, the others are just ugly), but it is
>>>>>> a start.  Please run it with your fio workloads, or write an aio-based
>>>>>> version of a qemu-img/qemu-io *I/O* benchmark.
>>>>
>>>> Could you explain why the new change is introduced?
>>>
>>> It provides a fast path for bdrv_aio_readv/writev whenever there is
>>> nothing to do after the driver routine returns.  In this case there is
>>> no need to wrap the AIOCB returned by the driver routine.
>>>
>>> It doesn't go all the way, and in particular it doesn't reverse
>>> completely the roles of bdrv_co_readv/writev vs. bdrv_aio_readv/writev.
>>
>> That's actually why I think it's an option.  Remember that, like you say
>> below, we're optimising for an extreme case here, and I certainly don't
>> want to hurt the common case for it.  I can't imagine a way of reversing
>> the roles without multiplying the cost for the coroutine path.
>
> I'm not that worried about it.  Perhaps it's enough to add a
> !qemu_in_coroutine() check to the AIO fast path, and let the driver
> provide optimized coroutine paths like in your patches that allocate
> AIOCBs on the stack.
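Just to make sure I am reading the fast-path idea right, is it roughly the
sketch below?  (Hypothetical helper only; it ignores the request tracking,
alignment and throttling work that the real bdrv_aio_readv() does before
reaching the driver.)

/* Hypothetical bypass: call the driver's native AIO callback directly
 * when we are not already inside a coroutine, otherwise fall back to
 * the normal coroutine-based path.
 */
#include "block/block_int.h"   /* BlockDriverState, BlockDriver */
#include "block/coroutine.h"   /* qemu_in_coroutine(), 2014-era path */

static BlockDriverAIOCB *bdrv_aio_readv_bypass(BlockDriverState *bs,
                                               int64_t sector_num,
                                               QEMUIOVector *qiov,
                                               int nb_sectors,
                                               BlockDriverCompletionFunc *cb,
                                               void *opaque)
{
    BlockDriver *drv = bs->drv;

    /* Fast path: not in a coroutine, the driver has native AIO, and
     * there is nothing to do after the driver routine returns.
     */
    if (!qemu_in_coroutine() && drv && drv->bdrv_aio_readv) {
        return drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
                                   cb, opaque);
    }

    /* Slow path: the existing coroutine-based implementation. */
    return bdrv_aio_readv(bs, sector_num, qiov, nb_sectors, cb, opaque);
}

If that is the shape of it, the nice part is that callers and drivers keep
exactly the same prototypes; only the wrapper in block.c changes.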
IMO, it will not be an extreme case: as SSDs and other high-performance
storage become more popular, coroutines start to affect performance once
IOPS goes beyond 100K, as computed earlier in the thread.

>> Or do you have a clever solution how you'd go about it without having an
>> impact on the common case?
>
> I don't really have any ace up my sleeve, but there are some things that
> bother me in the block layer's AIO API and in block.c in general.
>
> One is that block.c can do all the pre-processing it wants before
> issuing AIO, but nothing before calling the callback.  This means that
> my patches break bdrv_drain_all (they cannot call tracked_request_end).
>
> Another is all the similar structs that we have (RwCo,
> BdrvTrackedRequest, BlockRequest, etc.).
>
> Perhaps it would help if we had a single "real" block request object,
> which is an extension of the BlockDriverAIOCB and includes enough data
> to subsume all these request structs.  That should help commonizing
> stuff between the coroutine and AIO paths, for the common case where a
> single yield is enough.  I think the single-yield case is the one that
> is really worth optimizing for.  If done properly, I think this can
> simplify a lot of block.c code, but it is really difficult to get it
> right, and unless the design is sound the code is going to come out
> really ugly. :(
>
> Another thing to evaluate is the performance gap (if there is any)
> between aio=threads and aio=native.  The only advantage of aio=native,
> AFAIU, is the batch submission of requests (plug/unplug).  But
> aio=threads often ends up having better performance because the kernel
> folks have optimized VFS a lot.  So, in the aio=threads case, we might
> as well move the format code out of the iothread and into the worker
> thread, and get rid of the coroutine cost simply by making everything
> synchronous.  Looking within QEMU, this worked out very well for migration.

From my tests, aio=native is much better than aio=threads, at least for
reads.  For aio=threads, it may be possible to use batch submission too,
with readv/writev, to decrease the number of syscalls.

> (We could do batch submission of requests to the thread pool if there
> were a variant of sem_post that can add multiple signals to the same
> semaphore, similar to ReleaseSemaphore on Windows).
>
>>> The problem is that your patches do touch too much code and subtly
>>> break too much stuff.  The one I wrote does have a little breakage
>>> because I don't understand bs->growable 100% and I didn't really put
>>> much effort into it (my deadline being basically "be done as soon as the
>>> shower is free"), and it is ugly as hell, _but_ it should be compatible
>>> with the way the block layer works.
>>
>> Yes, your patch is definitely much more palatable than Ming's.  The part
>> that I still don't like about it is that it would be stating "in the
>> common case, we're only doing the second best thing".  I'm not yet
>> convinced that coroutines perform necessarily worse than state-passing
>> callbacks.
>
> Coroutines lump all the allocation costs together at the time you
> allocate the stack, but have (much) more expensive context switching.

Yes, I agree.  In my tests, one pair of malloc(128)/free(128) only takes
57ns, while two enters and one yield take about 240ns (the pattern I timed
is sketched below my signature), not to mention the dcache reloads and
misses caused by switching stacks, and the allocation fallback.

Ming
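PS: to make "two enters and one yield" concrete, here is a hypothetical
sketch of the measured pattern, in the spirit of the perf tests in
tests/test-coroutine.c (if I remember correctly), using the 2014-era
coroutine API from block/coroutine.h:

/* One request on the coroutine path costs two enters and one yield;
 * the comparison point is a single 128-byte malloc/free pair.
 */
#include <stdlib.h>
#include "block/coroutine.h"

static void coroutine_fn one_yield_entry(void *opaque)
{
    /* Simulates a driver request that needs a single yield. */
    qemu_coroutine_yield();
}

static void one_request_coroutine_cost(void)
{
    Coroutine *co = qemu_coroutine_create(one_yield_entry);

    qemu_coroutine_enter(co, NULL);   /* 1st enter: run until the yield */
    qemu_coroutine_enter(co, NULL);   /* 2nd enter: resume and terminate */
}

static void one_request_malloc_cost(void)
{
    void *p = malloc(128);            /* the malloc/free comparison point */
    free(p);
}

Wrapping either helper in a timed loop is one way to get per-iteration
numbers like the ones quoted above.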
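PS2: about the ReleaseSemaphore-style batch post mentioned above, a rough
POSIX emulation (hypothetical helper, not anything in the tree) could be a
counting semaphore built on a mutex plus condition variable, so a batch of
n requests is published with one lock round-trip and one broadcast instead
of n sem_post() calls:

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    unsigned count;
} BatchSem;

static void batch_sem_init(BatchSem *s)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
    s->count = 0;
}

/* Post n signals at once, like ReleaseSemaphore(sem, n, NULL). */
static void batch_sem_post(BatchSem *s, unsigned n)
{
    pthread_mutex_lock(&s->lock);
    s->count += n;
    pthread_mutex_unlock(&s->lock);
    pthread_cond_broadcast(&s->cond);   /* wake the idle workers once */
}

static void batch_sem_wait(BatchSem *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->count == 0) {
        pthread_cond_wait(&s->cond, &s->lock);
    }
    s->count--;
    pthread_mutex_unlock(&s->lock);
}

Whether that is actually cheaper than n sem_post() calls would of course
need measuring.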