On 11.08.2014 at 21:37, Paolo Bonzini wrote:
> On 10/08/2014 05:46, Ming Lei wrote:
> > Hi Kevin, Paolo, Stefan and all,
> >
> > On Wed, 6 Aug 2014 10:48:55 +0200
> > Kevin Wolf <kw...@redhat.com> wrote:
> >
> >> On 06.08.2014 at 07:33, Ming Lei wrote:
> >>
> >> Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> >> coroutines instead of exiting them, so it can't make any use of the
> >> coroutine pool. On my laptop, I get this (where fixed coroutine is a
> >> version that simply removes the yield at the end):
> >>
> >>                 | bypass        | fixed coro    | buggy coro
> >> ----------------+---------------+---------------+--------------
> >> time            | 1.09s         | 1.10s         | 1.62s
> >> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> >> insns per cycle | 2.39          | 2.39          | 1.90
> >>
> >> Begs the question whether you see a similar effect on a real qemu and
> >> the coroutine pool is still not big enough? With correct use of
> >> coroutines, the difference seems to be barely measurable even without
> >> any I/O involved.
> >
> > Now I fixes the coroutine leak bug, and previous crypt bench is a bit high
> > loading, and cause operations per sec very low(~40K/sec), finally I write a
> > new and simple one which can generate hundreds of kilo operations per sec
> > and the number should match with some fast storage devices, and it does
> > show there is not small effect from coroutine.
> >
> > Extremely if just getppid() syscall is run in each iteration, with using
> > coroutine, only 3M operations/sec can be got, and without using coroutine,
> > the number can reach 16M/sec, and there is more than 4 times difference!!!
>
> I should be on vacation, but I'm following a couple threads in the mailing
> list and I'm a bit tired to hear the same argument again and again...
>
> The different characteristics of asynchronous I/O vs. any synchronous
> workload are such that it is hard to be sure that microbenchmarks make sense.
>
> The below patch is basically the minimal change to bypass coroutines. Of
> course the block.c part is not acceptable as is (the change to
> refresh_total_sectors is broken, the others are just ugly), but it is a
> start. Please run it with your fio workloads, or write an aio-based version
> of a qemu-img/qemu-io *I/O* benchmark.

So to finally reply with some numbers...

I'm running fio tests based on Ming's configuration on a loop-mounted
tmpfs image using dataplane. I've extended the tests to not only test
random reads, but also sequential reads. I have not tested writes yet and
have run almost no tests with block sizes larger than 4k, so I'm not
including those here.

The "base" case is with Ming's patches applied, but the set_bypass(true)
call commented out in the virtio-blk code. All other cases are patches
applied on top of this.

                | Random throughput | Sequential throughput
----------------+-------------------+-----------------------
master          | 442 MB/s          | 730 MB/s
base            | 453 MB/s          | 757 MB/s
bypass (Ming)   | 461 MB/s          | 734 MB/s
coroutine       | 468 MB/s          | 716 MB/s
bypass (Paolo)  | 476 MB/s          | 682 MB/s

So while your patches look pretty good in Ming's test case of random
reads, I think the sequential case is worrying. The same is true for my
latest coroutine optimisations, even though the degradation is smaller
there.

This needs some more investigation.

Kevin
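
PS: To make the trade-off easier to see, here is a rough sketch that turns
the throughput table into relative changes against the "base" case. It is
plain Python and only uses the numbers from the table; the dictionary and
the helper name percent_vs_base are illustrative and not part of any patch.

```python
# Throughput figures in MB/s, copied from the table: (random, sequential).
results = {
    "master": (442, 730),
    "base": (453, 757),
    "bypass (Ming)": (461, 734),
    "coroutine": (468, 716),
    "bypass (Paolo)": (476, 682),
}


def percent_vs_base(value, base):
    """Relative change of value against base, in percent."""
    return (value - base) / base * 100.0


base_random, base_seq = results["base"]

# Map each case to (random change in %, sequential change in %).
deltas = {
    name: (percent_vs_base(rnd, base_random), percent_vs_base(seq, base_seq))
    for name, (rnd, seq) in results.items()
}

for name, (d_rnd, d_seq) in deltas.items():
    print(f"{name:<15} random {d_rnd:+5.1f}%   sequential {d_seq:+5.1f}%")
```

Read this way, each step that gains on random reads loses more on
sequential reads: Paolo's bypass is roughly 5% faster than base for random
reads but roughly 10% slower for sequential reads. These are single runs on
one setup, so small differences should not be over-interpreted.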