On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kw...@redhat.com> wrote:
> Am 06.08.2014 um 13:28 hat Ming Lei geschrieben:
>> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> > Am 06.08.2014 um 11:37 hat Ming Lei geschrieben:
>> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> >> > Am 06.08.2014 um 07:33 hat Ming Lei geschrieben:
>> >> >> Hi Kevin,
>> >> >>
>> >> >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> >> >> > Am 05.08.2014 um 15:48 hat Stefan Hajnoczi geschrieben:
>> >> >> >> I have been wondering how to prove that the root cause is the
>> >> >> >> ucontext coroutine mechanism (stack switching). Here is an idea:
>> >> >> >>
>> >> >> >> Hack your "bypass" code path to run the request inside a coroutine.
>> >> >> >> That way you can compare "bypass without coroutine" against
>> >> >> >> "bypass with coroutine".
>> >> >> >>
>> >> >> >> Right now I think there are doubts because the bypass code path is
>> >> >> >> indeed a different (and not 100% correct) code path. So this
>> >> >> >> approach might prove that the coroutines are adding the overhead
>> >> >> >> and not something that you bypassed.
>> >> >> >
>> >> >> > My doubts aren't only that the overhead might not come from the
>> >> >> > coroutines, but also whether any coroutine-related overhead is
>> >> >> > really unavoidable. If we can optimise coroutines, I'd strongly
>> >> >> > prefer to do just that instead of introducing additional code paths.
>> >> >>
>> >> >> OK, thank you for taking a look at the problem, and I hope we can
>> >> >> figure out the root cause, :-)
>> >> >>
>> >> >> >
>> >> >> > Another thought I had was this: if the performance difference is
>> >> >> > indeed only coroutines, then that is completely inside the block
>> >> >> > layer and we don't actually need a VM to test it. We could instead
>> >> >> > have something like a simple qemu-img based benchmark and should
>> >> >> > be observing the same.
>> >> >>
>> >> >> It is even simpler to run a coroutine-only benchmark, so I just
>> >> >> wrote a rough one, and it looks like coroutines do decrease
>> >> >> performance a lot. Please see the attached patch, and thanks for
>> >> >> your template, which helped me add the 'co_bench' command to
>> >> >> qemu-img.
>> >> >
>> >> > Yes, we can look at coroutine microbenchmarks in isolation. I
>> >> > actually did do that yesterday with the yield test from
>> >> > tests/test-coroutine.c. And in fact profiling immediately showed
>> >> > something to optimise: pthread_getspecific() was quite high;
>> >> > replacing it with __thread on systems where it works is more
>> >> > efficient and helped the numbers a bit. Also, a lot of time seems
>> >> > to be spent in pthread_mutex_lock/unlock (even in qemu-img bench),
>> >> > so maybe there's even something that can be done here.
>> >>
>> >> The lock/unlock in dataplane is often from memory_region_find(), and
>> >> Paolo has already done lots of work on that.
>
> qemu-img bench doesn't run that code. We have a few more locks that are
> taken, and one of them (the coroutine pool lock) is avoided by your
> bypass patches.
>
>> >> >
>> >> > However, I just wasn't sure whether a change on this level would be
>> >> > relevant in a realistic environment. This is the reason why I wanted
>> >> > to get a benchmark involving the block layer and some I/O.
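(As an aside, for anyone following the pthread_getspecific() point above:
the idea is roughly the following. This is an illustration only, not
Kevin's actual patch.)

#include <pthread.h>

typedef struct Coroutine Coroutine;

/* before: key created once with pthread_key_create() at startup;
 * every lookup of the current coroutine is a function call */
static pthread_key_t current_key;

static Coroutine *get_current_slow(void)
{
    return pthread_getspecific(current_key);
}

/* after: plain thread-local variable, which compiles down to a single
 * TLS-relative load on toolchains that support __thread */
static __thread Coroutine *current;

static Coroutine *get_current_fast(void)
{
    return current;
}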
>> >> >
>> >> >> From the profiling data in the link below:
>> >> >>
>> >> >> http://pastebin.com/YwH2uwbq
>> >> >>
>> >> >> With coroutines, the running time for the same load is increased by
>> >> >> ~50% (1.325s vs. 0.903s), dcache load events are increased by ~35%
>> >> >> (693M vs. 512M), and insns per cycle are decreased by ~17%
>> >> >> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
>> >> >>
>> >> >> The bypass code in the benchmark is very similar to the approach
>> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
>> >> >> blocks in the kernel I/O path.
>> >> >>
>> >> >> Maybe the benchmark is a bit extreme, but given that modern storage
>> >> >> devices may reach millions of IOPS, it is very easy for coroutines
>> >> >> to slow down the I/O.
>> >> >
>> >> > I think in order to optimise coroutines, such benchmarks are fair
>> >> > game. It's just not guaranteed that the effects are exactly the same
>> >> > on real workloads, so we should take the results with a grain of
>> >> > salt.
>> >> >
>> >> > Anyhow, the coroutine version of your benchmark is buggy: it leaks
>> >> > all coroutines instead of exiting them, so it can't make any use of
>> >> > the coroutine pool. On my laptop, I get this (where "fixed coro" is
>> >> > a version that simply removes the yield at the end):
>> >> >
>> >> >                 | bypass        | fixed coro    | buggy coro
>> >> > ----------------+---------------+---------------+--------------
>> >> > time            | 1.09s         | 1.10s         | 1.62s
>> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> >> > insns per cycle | 2.39          | 2.39          | 1.90
>> >> >
>> >> > This begs the question whether you see a similar effect on a real
>> >> > qemu because the coroutine pool is still not big enough? With
>> >> > correct use of coroutines, the difference seems to be barely
>> >> > measurable even without any I/O involved.
>> >>
>> >> When I comment out qemu_coroutine_yield(), the results for bypass and
>> >> fixed coro look very similar to your test, and I am just wondering if
>> >> the stack is always switched in qemu_coroutine_enter(), even without
>> >> calling qemu_coroutine_yield().
>> >
>> > Yes, definitely. qemu_coroutine_enter() always involves calling
>> > qemu_coroutine_switch(), which is the stack switch.
>> >
>> >> Without the yield, the benchmark can't emulate coroutine usage in the
>> >> bdrv_aio_readv/writev() path any more, and the bypass in the patchset
>> >> skips two qemu_coroutine_enter() calls and one qemu_coroutine_yield()
>> >> for each bdrv_aio_readv/writev().
>> >
>> > It's not completely comparable anyway because you're not going through
>> > a main loop and callbacks from there in your benchmark.
>> >
>> > But fair enough: keep the yield, but enter the coroutine twice then.
>> > You get slightly worse results then, but that's more like doubling the
>> > very small difference between "bypass" and "fixed coro" (1.11s /
>> > 946,434,327 / 2.37), not like the horrible performance of the buggy
>> > version.
>>
>> Yes, I compared that too; there looks to be no big difference.
>>
>> >
>> > Actually, that's within the error of measurement for time and
>> > insns/cycle, so running it for a bit longer:
>> >
>> >                 | bypass    | coro      | + yield   | buggy coro
>> > ----------------+-----------+-----------+-----------+--------------
>> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
>> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
>> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
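To make sure I read the bug right: the difference between the buggy and
the fixed version boils down to something like this sketch (do_io() is a
placeholder for the per-request work, not a real function):

#include "block/coroutine.h"   /* Coroutine, qemu_coroutine_* (2.1 tree) */

void do_io(void *opaque);      /* placeholder for the benchmarked work */

static void coroutine_fn buggy_co_entry(void *opaque)
{
    do_io(opaque);
    /* Yielding as the last action means nobody ever re-enters this
     * coroutine: it never terminates, is never freed, and never goes
     * back into the coroutine pool, so every request allocates a brand
     * new coroutine. */
    qemu_coroutine_yield();
}

static void coroutine_fn fixed_co_entry(void *opaque)
{
    do_io(opaque);
    /* Simply returning terminates the coroutine, so it can be recycled
     * through the pool by the next qemu_coroutine_create(). */
}

static void run_one_request(void *opaque)
{
    Coroutine *co = qemu_coroutine_create(fixed_co_entry);

    qemu_coroutine_enter(co, opaque);
}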
>> >> >> > I played a bit with the following, I hope it's not too naive. I
>> >> >> > couldn't see a difference with your patches, but at least one
>> >> >> > reason for this is probably that my laptop SSD isn't fast enough
>> >> >> > to make the CPU the bottleneck. I haven't tried a ramdisk yet,
>> >> >> > that would probably be the next thing. (I actually wrote the
>> >> >> > patch up just for some profiling of my own, not for comparing
>> >> >> > throughput, but it should be usable for that as well.)
>> >> >>
>> >> >> This might not be good for the test since it is basically a
>> >> >> sequential read test, which can be optimized a lot by the kernel.
>> >> >> And I always use a randread benchmark.
>> >> >
>> >> > Yes, I briefly pondered whether I should implement random offsets
>> >> > instead. But then I realised that a quicker kernel operation would
>> >> > only help the benchmark, because we want it to test the CPU
>> >> > consumption in userspace. So the faster the kernel gets, the better
>> >> > for us, because it should make the impact of coroutines bigger.
>> >>
>> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
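(On the random offsets point above: if a randread mode is ever wanted in
the benchmark, a small helper like the hypothetical one below would be
enough. This is just an illustration, not part of any posted patch.)

#include <stdint.h>
#include <stdlib.h>

/* Pick a 4 KiB-aligned offset uniformly at random within dev_size.
 * rand_r() only returns ~31 bits, which still covers a few TB at 4 KiB
 * granularity; good enough for a benchmark sketch. */
static uint64_t random_offset(uint64_t dev_size, unsigned int *seed)
{
    uint64_t nb_blocks = dev_size / 4096;

    return ((uint64_t)rand_r(seed) % nb_blocks) * 4096;
}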
>>
>> I use the /dev/nullb0 block device to test, which is available in Linux
>> kernel 3.13+. The difference is below, and it looks not very big (< 10%):
>
> Sounds useful. I'm running on an older kernel, so I used a loop-mounted
> file on tmpfs instead for my tests.

Actually loop is a slow device; I recently used kernel AIO and blk-mq to
speed it up a lot.

> Anyway, at some point today I figured I should take a different approach
> and not try to minimise the problems that coroutines introduce, but
> rather make the most use of them when we have them. After all, the
> raw-posix driver is still very callback-oriented and does things that
> aren't really necessary with coroutines (such as AIOCB allocation).
>
> The qemu-img bench time I ended up with looked quite nice. Maybe you
> want to take a look and see if you can reproduce these results, both
> with qemu-img bench and your real benchmark.
>
>
> $ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
> Sending 2000000 requests, 4096 bytes each, 64 in parallel
>
>         bypass (base) | bypass (patch) | coro (base) | coro (patch)
> ----------------------+----------------+-------------+---------------
> run 1        0m5.966s |       0m5.687s |    0m6.224s |      0m5.362s
> run 2        0m5.826s |       0m5.831s |    0m5.994s |      0m5.541s
> run 3        0m6.145s |       0m5.495s |    0m6.253s |      0m5.408s
> run 4        0m5.683s |       0m5.527s |    0m6.045s |      0m5.293s
> run 5        0m5.904s |       0m5.607s |    0m6.238s |      0m5.207s

I suggest running the test a bit longer.

>
> You can find my working tree at:
>
>     git://repo.or.cz/qemu/kevin.git perf-bypass

I just tried your working tree. qemu-img works well with your linux-aio
coroutine patches, but unfortunately there is little improvement observed
on my server; the result is basically the same as without bypass. On my
laptop the improvement can be observed, but it is still at least 5% less
than with bypass.

Here is the result on my server:

ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 /dev/nullb5
Sending 6400000 requests, 4096 bytes each, 64 in parallel
read time: 38351ms, 166.000000K IOPS
ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 -b /dev/nullb5
Sending 6400000 requests, 4096 bytes each, 64 in parallel
read time: 35241ms, 181.000000K IOPS

Also, there are some problems with your patches which prevent booting a
VM in my environment:

- __thread patch: it looks like no '__thread' is actually used, and the
  patch basically makes bypass not workable.

- the bdrv_co_writev callback isn't set for raw-posix; it looks like my
  rootfs needs to write during boot.

- another problem, which I am still investigating: laio isn't accessible
  in qemu_laio_process_completion() sometimes.

Actually I do care about the performance boost with multi-queue, since
multi-queue can improve performance a lot compared with QEMU 2.0. Once I
have fixed these problems, I will run a VM to test mq performance with
the linux-aio coroutine. Or could you give suggestions about these
problems?

> Please note that I added an even worse and even wronger hack to keep the
> bypass working so I can compare it (raw-posix now exposes both
> bdrv_aio_* and bdrv_co_*, and enabling the bypass also switches). Also,
> once the AIO code that I kept for the bypass mode is gone, we can make
> the coroutine path even nicer.

This approach looks nice since it saves the intermediate callback.

Basically, the current bypass approach bypasses coroutines in the block
layer, while your linux-aio takes a new coroutine, so they are two
different paths. And linux-aio's coroutine can still be bypassed easily
too, :-)

Thanks,
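P.S. Just to confirm I understand the new linux-aio path: conceptually it
looks something like the sketch below, right? (Hand-written to check my
understanding, not code from your tree; submit_iocb() stands in for the
io_submit() plumbing, and LaioRequest is a made-up name.)

#include "block/coroutine.h"

typedef struct LaioRequest {
    Coroutine *co;   /* the coroutine waiting for this request */
    int ret;         /* filled in by the completion handler */
} LaioRequest;

void submit_iocb(LaioRequest *req);   /* hypothetical io_submit() wrapper */

/* bdrv_co_readv/writev path: submit, then yield until woken up; no AIOCB
 * allocation and no intermediate callback */
static int coroutine_fn laio_co_rw(LaioRequest *req)
{
    req->co = qemu_coroutine_self();

    submit_iocb(req);

    /* The completion handler below runs from the event loop, which
     * cannot run until we yield, so there is no completion-before-yield
     * race in a single-threaded event loop. */
    qemu_coroutine_yield();

    return req->ret;
}

/* called from the aio fd handler once io_getevents() reports completion */
static void laio_complete(LaioRequest *req, int ret)
{
    req->ret = ret;
    qemu_coroutine_enter(req->co, NULL);
}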