On Tue, Aug 12, 2014 at 3:37 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 10/08/2014 05:46, Ming Lei wrote:
>> Hi Kevin, Paolo, Stefan and all,
>>
>> On Wed, 6 Aug 2014 10:48:55 +0200
>> Kevin Wolf <kw...@redhat.com> wrote:
>>
>>> On 06.08.2014 at 07:33, Ming Lei wrote:
>>>
>>> Anyhow, the coroutine version of your benchmark is buggy: it leaks all
>>> coroutines instead of exiting them, so it can't make any use of the
>>> coroutine pool. On my laptop, I get this (where "fixed coro" is a
>>> version that simply removes the yield at the end):
>>>
>>>                 | bypass        | fixed coro    | buggy coro
>>> ----------------+---------------+---------------+---------------
>>> time            | 1.09s         | 1.10s         | 1.62s
>>> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>>> insns per cycle | 2.39          | 2.39          | 1.90
>>>
>>> That begs the question of whether you see a similar effect on a real
>>> qemu because the coroutine pool is still not big enough. With correct
>>> use of coroutines, the difference seems to be barely measurable even
>>> without any I/O involved.
>>
>> Now I have fixed the coroutine leak bug. The previous crypt benchmark
>> had a rather heavy load per iteration, which kept operations per second
>> very low (~40K/sec), so I wrote a new and simple one that can generate
>> hundreds of thousands of operations per second, a rate that should
>> match some fast storage devices, and it does show a non-trivial effect
>> from coroutines.
>>
>> In the extreme case, where just a getppid() syscall is run in each
>> iteration, only 3M operations/sec can be reached with coroutines, while
>> without coroutines the number can reach 16M/sec, more than a 4x
>> difference!
>
> I should be on vacation, but I'm following a couple of threads on the
> mailing list, and I'm a bit tired of hearing the same argument again and
> again...
I am sorry to interrupt your vacation and make you tired, but the discussion
is not simply the same argument again and again; something new comes up every
time, or at least most of the time.

>
> The different characteristics of asynchronous I/O vs. any synchronous
> workload are such that it is hard to be sure that microbenchmarks make
> sense.

I don't think this is related to asynchronous vs. synchronous I/O. There is
no sleep (or wait for completion) at all, so we can treat the benchmark as
AIO by regarding the completion as a nop in this case (AIO model: submit and
complete).

IMO the getppid() benchmark is a simple simulation of bdrv_aio_readv/writev()
with I/O plug/unplug, as far as coroutine usage is concerned.

BTW, do you agree with the computation of the coroutine cost in my previous
mail? I don't think that computation is related to the I/O type either. (The
rough arithmetic from the numbers quoted above, and a standalone sketch of
such a microbenchmark, are appended after this mail.)

>
> The below patch is basically the minimal change to bypass coroutines. Of
> course the block.c part is not acceptable as is (the change to
> refresh_total_sectors is broken, the others are just ugly), but it is a
> start. Please run it with your fio workloads, or write an aio-based
> version of a qemu-img/qemu-io *I/O* benchmark.

Could you explain why this new change is introduced? I will hold off on it
until we can agree on the coroutine cost computation, because that is very
important for the discussion.

Thank you again for taking the time to discuss this.

Thanks,
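
Appendix: the back-of-the-envelope arithmetic implied by the numbers quoted
above. This is derived only from the 3M/sec and 16M/sec figures in this
thread, not from the computation in the earlier mail (which is not shown
here), and it assumes one coroutine enter/yield round trip per iteration:

    without coroutine: 1 / 16,000,000 ops/sec ~ 62.5 ns/op
    with coroutine:    1 /  3,000,000 ops/sec ~ 333.3 ns/op
    implied overhead:  333.3 ns - 62.5 ns     ~ 271 ns per round trip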
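
Appendix: a minimal, self-contained sketch of this kind of microbenchmark,
for anyone who wants to reproduce the comparison. This is not Ming Lei's
actual test and not QEMU's coroutine implementation: it uses POSIX ucontext,
it measures only the enter/yield switch pair (QEMU's coroutine pool amortizes
allocation), and glibc's swapcontext() performs a rt_sigprocmask() syscall on
every switch, so it should overstate the cost relative to QEMU's
sigsetjmp()-based switching. The file name and iteration count are arbitrary.

/* coro-bench.c: compare a plain getppid() loop with a loop that does one
 * ucontext switch pair per getppid() call.  A rough approximation of the
 * benchmark discussed above, not QEMU code. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <ucontext.h>

#define ITERATIONS 1000000
#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, coro_ctx;

/* Body of the "coroutine": one syscall, then yield back to main. */
static void coro_fn(void)
{
    for (;;) {
        getppid();
        swapcontext(&coro_ctx, &main_ctx);      /* yield */
    }
}

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    double t0, t1;
    int i;

    /* Baseline: the bare syscall, no context switch at all. */
    t0 = now_sec();
    for (i = 0; i < ITERATIONS; i++) {
        getppid();
    }
    t1 = now_sec();
    printf("direct:    %.0f ops/sec\n", ITERATIONS / (t1 - t0));

    /* Set up a single reusable context running coro_fn on its own stack. */
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = stack;
    coro_ctx.uc_stack.ss_size = STACK_SIZE;
    coro_ctx.uc_link = &main_ctx;
    makecontext(&coro_ctx, coro_fn, 0);

    /* Coroutine-style loop: each iteration enters the context, which does
     * the syscall and switches straight back (two swapcontext() calls). */
    t0 = now_sec();
    for (i = 0; i < ITERATIONS; i++) {
        swapcontext(&main_ctx, &coro_ctx);      /* enter */
    }
    t1 = now_sec();
    printf("coroutine: %.0f ops/sec\n", ITERATIONS / (t1 - t0));

    free(stack);
    return 0;
}

Build with something like "gcc -O2 coro-bench.c -o coro-bench" (older glibc
may also need -lrt for clock_gettime()).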