On Thu, Aug 14, 2014 at 6:46 PM, Kevin Wolf <kw...@redhat.com> wrote:
> Am 11.08.2014 um 21:37 hat Paolo Bonzini geschrieben:
>> Il 10/08/2014 05:46, Ming Lei ha scritto:
>> > Hi Kevin, Paolo, Stefan and all,
>> >
>> > On Wed, 6 Aug 2014 10:48:55 +0200
>> > Kevin Wolf <kw...@redhat.com> wrote:
>> >
>> >> Am 06.08.2014 um 07:33 hat Ming Lei geschrieben:
>> >>
>> >> Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>> >> coroutines instead of exiting them, so it can't make any use of the
>> >> coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> >> version that simply removes the yield at the end):
>> >>
>> >>                 | bypass        | fixed coro    | buggy coro
>> >> ----------------+---------------+---------------+--------------
>> >> time            | 1.09s         | 1.10s         | 1.62s
>> >> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> >> insns per cycle | 2.39          | 2.39          | 1.90
>> >>
>> >> Begs the question whether you see a similar effect on a real qemu and
>> >> the coroutine pool is still not big enough? With correct use of
>> >> coroutines, the difference seems to be barely measurable even without
>> >> any I/O involved.
>> >
>> > Now I have fixed the coroutine leak bug. The previous crypt benchmark
>> > did quite a lot of work per iteration, so its operations per second
>> > were very low (~40K/sec). I have therefore written a new, simple
>> > benchmark that generates several hundred thousand operations per
>> > second, a number that should match some fast storage devices, and it
>> > does show a noticeable effect from coroutines.
>> >
>> > In the extreme case where only a getppid() syscall is run in each
>> > iteration, about 3M operations/sec can be reached with coroutines,
>> > while without coroutines the number reaches 16M/sec, a difference of
>> > more than 4 times!
>>
>> I should be on vacation, but I'm following a couple of threads in the
>> mailing list and I'm a bit tired of hearing the same argument again and
>> again...
>>
>> The different characteristics of asynchronous I/O vs. any synchronous
>> workload are such that it is hard to be sure that microbenchmarks make
>> sense.
>>
>> The below patch is basically the minimal change to bypass coroutines. Of
>> course the block.c part is not acceptable as is (the change to
>> refresh_total_sectors is broken, the others are just ugly), but it is a
>> start. Please run it with your fio workloads, or write an aio-based
>> version of a qemu-img/qemu-io *I/O* benchmark.
>
> So to finally reply with some numbers... I'm running fio tests based on
> Ming's configuration on a loop-mounted tmpfs image using dataplane. I've
> extended the tests to not only test random reads, but also sequential
> reads. I have not tested writes yet, and ran almost no tests for block
> sizes larger than 4k, so I'm not including those here.
>
> The "base" case is with Ming's patches applied, but the set_bypass(true)
> call commented out in the virtio-blk code. All other cases are patches
> applied on top of this.
>
>                 | Random throughput | Sequential throughput
> ----------------+-------------------+-----------------------
> master          | 442 MB/s          | 730 MB/s
> base            | 453 MB/s          | 757 MB/s
> bypass (Ming)   | 461 MB/s          | 734 MB/s
> coroutine       | 468 MB/s          | 716 MB/s
> bypass (Paolo)  | 476 MB/s          | 682 MB/s
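For reference, the random and sequential cases being compared above boil
down to an fio job file along the lines of the sketch below. This is only
an illustration of the kind of workload, not Kevin's actual job file; the
ioengine, queue depth, runtime and guest device name are all assumptions.

; -- start job file (sketch) --
[global]
# Assumptions: libaio with O_DIRECT at 4k blocks, queue depth 64,
# 30s per case; the real job file is not shown in this thread.
ioengine=libaio
direct=1
bs=4k
iodepth=64
runtime=30
time_based
# Assumed guest device name for the dataplane virtio-blk disk
# backed by the loop device over a file in tmpfs on the host.
filename=/dev/vdb

[randread]
rw=randread

[seqread]
# stonewall makes this job wait for randread to finish,
# so the two cases do not run concurrently.
stonewall
rw=read
; -- end job file --

The stonewall option keeps the two jobs from running at the same time, so
each throughput number is measured on its own.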
It looks like the difference between random read and sequential read is
quite big, which shouldn't be the case since the whole file is cached in
RAM.

>
> So while your patches look pretty good in Ming's test case of random
> reads, I think the sequential case is worrying. The same is true for my
> latest coroutine optimisations, even though the degradation is smaller
> there.

In my VM test, the random read and sequential read results are basically
the same, and the I/O thread's CPU utilization is more than 93% with
Paolo's patch, over both null_blk and a loop device on a file in tmpfs.
I am using a 3.16 kernel.

>
> This needs some more investigation.

Maybe it is caused by your test setup and environment, or by your VM
kernel; I am not sure.

Thanks,
--
Ming Lei