On 06.08.2014 at 13:28, Ming Lei wrote:
> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kw...@redhat.com> wrote:
> > On 06.08.2014 at 11:37, Ming Lei wrote:
> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kw...@redhat.com> wrote:
> >> > On 06.08.2014 at 07:33, Ming Lei wrote:
> >> >> Hi Kevin,
> >> >>
> >> >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kw...@redhat.com> wrote:
> >> >> > On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
> >> >> >> I have been wondering how to prove that the root cause is the
> >> >> >> ucontext coroutine mechanism (stack switching). Here is an idea:
> >> >> >>
> >> >> >> Hack your "bypass" code path to run the request inside a coroutine.
> >> >> >> That way you can compare "bypass without coroutine" against
> >> >> >> "bypass with coroutine".
> >> >> >>
> >> >> >> Right now I think there are doubts because the bypass code path is
> >> >> >> indeed a different (and not 100% correct) code path. So this approach
> >> >> >> might prove that the coroutines are adding the overhead and not
> >> >> >> something that you bypassed.
> >> >> >
> >> >> > My doubts aren't only that the overhead might not come from the
> >> >> > coroutines, but also whether any coroutine-related overhead is really
> >> >> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> >> >> > just that instead of introducing additional code paths.
> >> >>
> >> >> OK, thank you for taking a look at the problem; I hope we can
> >> >> figure out the root cause. :-)
> >> >>
> >> >> >
> >> >> > Another thought I had was this: If the performance difference is
> >> >> > indeed only coroutines, then that is completely inside the block
> >> >> > layer and we don't actually need a VM to test it. We could instead
> >> >> > have something like a simple qemu-img based benchmark and should be
> >> >> > observing the same.
> >> >>
> >> >> It is even simpler to run a coroutine-only benchmark, and I just
> >> >> wrote a raw one; it looks like coroutines do decrease performance
> >> >> a lot. Please see the attached patch, and thanks for your template,
> >> >> which helped me add the 'co_bench' command to qemu-img.
> >> >
> >> > Yes, we can look at coroutine microbenchmarks in isolation. I actually
> >> > did do that yesterday with the yield test from tests/test-coroutine.c.
> >> > And in fact profiling immediately showed something to optimise:
> >> > pthread_getspecific() was quite high; replacing it with __thread on
> >> > systems where it works is more efficient and helped the numbers a bit.
> >> > Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
> >> > in qemu-img bench), so maybe there's something that can be done here too.
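
(As an illustration of the TLS difference mentioned above: a minimal
sketch, not the actual QEMU patch; the variable and function names here
are invented.)

#include <pthread.h>

static pthread_key_t co_key;        /* old style: pthread TLS key */
static __thread void *co_self;      /* new style: compiler-level TLS */

/* pthread_getspecific() costs an out-of-line library call per lookup */
static void *get_current_pthread(void)
{
    return pthread_getspecific(co_key);
}

/* __thread access typically compiles to a single TLS-relative load,
 * on systems where the compiler and libc support it */
static void *get_current_tls(void)
{
    return co_self;
}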
> >>
> >> The lock/unlock in dataplane is often from memory_region_find(), and
> >> Paolo has already done lots of work on that.

qemu-img bench doesn't run that code. We have a few more locks that are
taken, and one of them (the coroutine pool lock) is avoided by your
bypass patches.

> >> > >
> >> > However, I just wasn't sure whether a change on this level would be
> >> > relevant in a realistic environment. This is the reason why I wanted to
> >> > get a benchmark involving the block layer and some I/O.
> >> >
> >> >> From the profiling data in the link below:
> >> >>
> >> >> http://pastebin.com/YwH2uwbq
> >> >>
> >> >> With coroutines, the running time for the same load is increased by
> >> >> ~50% (1.325s vs. 0.903s), dcache load events are increased by
> >> >> ~35% (693M vs. 512M), and insns per cycle are decreased by ~50%
> >> >> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
> >> >>
> >> >> The bypass code in the benchmark is very similar to the approach
> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
> >> >> blocks in the kernel I/O path.
> >> >>
> >> >> Maybe the benchmark is a bit extreme, but given that modern storage
> >> >> devices may reach millions of IOPS, it is very easy for coroutines
> >> >> to slow down the I/O.
> >> >
> >> > I think in order to optimise coroutines, such benchmarks are fair game.
> >> > It's just not guaranteed that the effects are exactly the same on real
> >> > workloads, so we should take the results with a grain of salt.
> >> >
> >> > Anyhow, the coroutine version of your benchmark is buggy: it leaks all
> >> > coroutines instead of exiting them, so it can't make any use of the
> >> > coroutine pool. On my laptop, I get this (where "fixed coro" is a
> >> > version that simply removes the yield at the end):
> >> >
> >> >                 |    bypass     |  fixed coro   |  buggy coro
> >> > ----------------+---------------+---------------+--------------
> >> > time            |     1.09s     |     1.10s     |     1.62s
> >> > L1-dcache-loads |   921,836,360 |   932,781,747 | 1,298,067,438
> >> > insns per cycle |      2.39     |      2.39     |      1.90
> >> >
> >> > This begs the question of whether you see a similar effect on a real
> >> > qemu because the coroutine pool is still not big enough. With correct
> >> > use of coroutines, the difference seems to be barely measurable even
> >> > without any I/O involved.
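
(A minimal sketch of the bug being described; bench_do_request() is a
made-up placeholder, not code from the actual benchmark patch.)

/* Buggy version: the final yield parks the coroutine forever, so it never
 * terminates and can never be recycled through the coroutine pool; every
 * iteration has to allocate a fresh coroutine and stack. */
static void coroutine_fn co_bench_buggy(void *opaque)
{
    bench_do_request(opaque);    /* hypothetical per-iteration work */
    qemu_coroutine_yield();      /* BUG: the coroutine is leaked here */
}

/* Fixed version: simply returning terminates the coroutine, which allows
 * the next qemu_coroutine_create() to take it from the pool again. */
static void coroutine_fn co_bench_fixed(void *opaque)
{
    bench_do_request(opaque);
}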
> >>
> >> When I comment out qemu_coroutine_yield(), the results of bypass and
> >> fixed coro look very similar to your test, and I am just wondering if
> >> the stack is always switched in qemu_coroutine_enter(), even without
> >> calling qemu_coroutine_yield().
> >
> > Yes, definitely. qemu_coroutine_enter() always involves calling
> > qemu_coroutine_switch(), which is the stack switch.
> >
> >> Without the yield, the benchmark can't emulate coroutine usage in the
> >> bdrv_aio_readv/writev() path any more, and the bypass in the patchset
> >> skips two qemu_coroutine_enter() and one qemu_coroutine_yield()
> >> for each bdrv_aio_readv/writev().
> >
> > It's not completely comparable anyway, because your benchmark isn't
> > going through a main loop and callbacks from there.
> >
> > But fair enough: keep the yield, but then enter the coroutine twice. You
> > get slightly worse results that way, but that's more like doubling the
> > very small difference between "bypass" and "fixed coro" (1.11s /
> > 946,434,327 / 2.37), not like the horrible performance of the buggy
> > version.
>
> Yes, I compared that too; it looks like there is no big difference.
>
> >
> > Actually, that's within the error of measurement for time and
> > insns/cycle, so running it for a bit longer:
> >
> >                 |   bypass  |    coro   |  + yield  |  buggy coro
> > ----------------+-----------+-----------+-----------+--------------
> > time            |   21.45s  |   21.68s  |   21.83s  |    97.05s
> > L1-dcache-loads |  18,049 M |  18,387 M |  18,618 M |   26,062 M
> > insns per cycle |    2.42   |    2.40   |    2.41   |     1.75
> >
> >> >> > I played a bit with the following; I hope it's not too naive. I
> >> >> > couldn't see a difference with your patches, but at least one reason
> >> >> > for this is probably that my laptop SSD isn't fast enough to make
> >> >> > the CPU the bottleneck. Haven't tried a ramdisk yet, that would
> >> >> > probably be the next thing. (I actually wrote the patch up just for
> >> >> > some profiling on my own, not for comparing throughput, but it
> >> >> > should be usable for that as well.)
> >> >>
> >> >> This might not be good for the test since it is basically a sequential
> >> >> read test, which can be optimized a lot by the kernel. And I always
> >> >> use a randread benchmark.
> >> >
> >> > Yes, I briefly pondered whether I should implement random offsets
> >> > instead. But then I realised that a quicker kernel operation would only
> >> > help the benchmark, because we want it to test the CPU consumption in
> >> > userspace. So the faster the kernel gets, the better for us, because it
> >> > should make the impact of coroutines bigger.
> >>
> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
>
> I use the /dev/nullb0 block device to test, which is available in Linux
> kernel 3.13+. The difference follows, and it looks not very big (< 10%):

Sounds useful. I'm running on an older kernel, so I used a loop-mounted
file on tmpfs instead for my tests.

Anyway, at some point today I figured I should take a different approach
and not try to minimise the problems that coroutines introduce, but
rather make the most use of them when we have them. After all, the
raw-posix driver is still very callback-oriented and does things that
aren't really necessary with coroutines (such as AIOCB allocation).

The qemu-img bench time I ended up with looked quite nice. Maybe you want
to take a look and see if you can reproduce these results, both with
qemu-img bench and your real benchmark.

$ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
Sending 2000000 requests, 4096 bytes each, 64 in parallel

        bypass (base) | bypass (patch) | coro (base) | coro (patch)
----------------------+----------------+-------------+---------------
run 1        0m5.966s |       0m5.687s |    0m6.224s |      0m5.362s
run 2        0m5.826s |       0m5.831s |    0m5.994s |      0m5.541s
run 3        0m6.145s |       0m5.495s |    0m6.253s |      0m5.408s
run 4        0m5.683s |       0m5.527s |    0m6.045s |      0m5.293s
run 5        0m5.904s |       0m5.607s |    0m6.238s |      0m5.207s

You can find my working tree at:

    git://repo.or.cz/qemu/kevin.git perf-bypass

Please note that I added an even worse and even more wrong hack to keep
the bypass working so I can compare it (raw-posix now exposes both
bdrv_aio_* and bdrv_co_*, and enabling the bypass also switches between
them). Also, once the AIO code that I kept for the bypass mode is gone,
we can make the coroutine path even nicer.

Kevin
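
(For reference, a rough sketch of the coroutine-native request path
described above, assuming QEMU's 2014-era coroutine API; CoRawRequest,
raw_co_complete() and raw_submit_aio() are invented stand-ins, not names
from the perf-bypass tree.)

#include "block/block_int.h"   /* QEMU-internal headers, 2014-era layout */
#include "block/coroutine.h"

/* Invented container tying a request to the coroutine that waits for it */
typedef struct CoRawRequest {
    Coroutine *co;    /* coroutine parked in qemu_coroutine_yield() */
    int ret;          /* filled in by the completion callback */
} CoRawRequest;

/* Completion callback: instead of completing a heap-allocated AIOCB,
 * just re-enter the coroutine that submitted the request */
static void raw_co_complete(void *opaque, int ret)
{
    CoRawRequest *req = opaque;

    req->ret = ret;
    qemu_coroutine_enter(req->co, NULL);
}

/* Hypothetical submission helper: queues a linux-aio read and arranges
 * for cb(opaque, ret) to be called from the event loop on completion */
void raw_submit_aio(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
                    QEMUIOVector *qiov, BlockDriverCompletionFunc *cb,
                    void *opaque);

static int coroutine_fn raw_co_readv_sketch(BlockDriverState *bs,
                                            int64_t sector_num,
                                            int nb_sectors,
                                            QEMUIOVector *qiov)
{
    CoRawRequest req = {
        .co = qemu_coroutine_self(),
    };

    raw_submit_aio(bs, sector_num, nb_sectors, qiov, raw_co_complete, &req);
    qemu_coroutine_yield();    /* resumed by raw_co_complete() */
    return req.ret;
}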