On Thu, Aug 7, 2014 at 9:51 PM, Kevin Wolf <kw...@redhat.com> wrote:
> Am 07.08.2014 um 12:27 hat Ming Lei geschrieben:
>> On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> > Am 06.08.2014 um 13:28 hat Ming Lei geschrieben:
>> >> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> >> > Am 06.08.2014 um 11:37 hat Ming Lei geschrieben:
>> >> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> >> >> > However, I just wasn't sure whether a change on this level would be
>> >> >> > relevant in a realistic environment. This is the reason why I wanted
>> >> >> > to get a benchmark involving the block layer and some I/O.
>> >> >> >
>> >> >> >> From the profiling data in the link below:
>> >> >> >>
>> >> >> >> http://pastebin.com/YwH2uwbq
>> >> >> >>
>> >> >> >> With coroutines, the running time for the same load is increased by
>> >> >> >> ~50% (1.325s vs. 0.903s), dcache load events are increased by
>> >> >> >> ~35% (693M vs. 512M), and insns per cycle is decreased by ~50%
>> >> >> >> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
>> >> >> >>
>> >> >> >> The bypass code in the benchmark is very similar to the approach
>> >> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
>> >> >> >> blocks in the kernel I/O path.
>> >> >> >>
>> >> >> >> Maybe the benchmark is a bit extreme, but given that modern storage
>> >> >> >> devices may reach millions of IOPS, it is very easy to slow down
>> >> >> >> the I/O with coroutines.
>> >> >> >
>> >> >> > I think in order to optimise coroutines, such benchmarks are fair
>> >> >> > game. It's just not guaranteed that the effects are exactly the same
>> >> >> > on real workloads, so we should take the results with a grain of salt.
>> >> >> >
>> >> >> > Anyhow, the coroutine version of your benchmark is buggy, it leaks
>> >> >> > all coroutines instead of exiting them, so it can't make any use of
>> >> >> > the coroutine pool. On my laptop, I get this (where fixed coroutine
>> >> >> > is a version that simply removes the yield at the end):
>> >> >> >
>> >> >> >                 | bypass        | fixed coro    | buggy coro
>> >> >> > ----------------+---------------+---------------+--------------
>> >> >> > time            | 1.09s         | 1.10s         | 1.62s
>> >> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> >> >> > insns per cycle | 2.39          | 2.39          | 1.90
>> >> >> >
>> >> >> > Begs the question whether you see a similar effect on a real qemu and
>> >> >> > the coroutine pool is still not big enough? With correct use of
>> >> >> > coroutines, the difference seems to be barely measurable even without
>> >> >> > any I/O involved.
>> >> >>
>> >> >> When I comment out qemu_coroutine_yield(), the results of
>> >> >> bypass and fixed coro look very similar to your test, and I am just
>> >> >> wondering if the stack is always switched in qemu_coroutine_enter()
>> >> >> even without calling qemu_coroutine_yield().
>> >> >
>> >> > Yes, definitely. qemu_coroutine_enter() always involves calling
>> >> > qemu_coroutine_switch(), which is the stack switch.
>> >> >
>> >> >> Without the yield, the benchmark can't emulate the coroutine usage in
>> >> >> the bdrv_aio_readv/writev() path any more, and the bypass in the
>> >> >> patchset skips two qemu_coroutine_enter() calls and one
>> >> >> qemu_coroutine_yield() for each bdrv_aio_readv/writev().
>> >> >
>> >> > It's not completely comparable anyway because you're not going through a
>> >> > main loop and callbacks from there for your benchmark.
>> >> >
>> >> > But fair enough: Keep the yield, but enter the coroutine twice then. You
>> >> > get slightly worse results then, but that's more like doubling the very
>> >> > small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
>> >> > / 2.37), not like the horrible performance of the buggy version.
>> >>
>> >> Yes, I compared that too; it looks like there is no big difference.
>> >>
>> >> >
>> >> > Actually, that's within the error of measurement for time and
>> >> > insns/cycle, so running it for a bit longer:
>> >> >
>> >> >                 | bypass    | coro      | + yield   | buggy coro
>> >> > ----------------+-----------+-----------+-----------+--------------
>> >> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
>> >> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
>> >> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
>> >> >
>> >> >> >> > I played a bit with the following, I hope it's not too naive. I
>> >> >> >> > couldn't see a difference with your patches, but at least one
>> >> >> >> > reason for this is probably that my laptop SSD isn't fast enough
>> >> >> >> > to make the CPU the bottleneck. Haven't tried ramdisk yet, that
>> >> >> >> > would probably be the next thing. (I actually wrote the patch up
>> >> >> >> > just for some profiling on my own, not for comparing throughput,
>> >> >> >> > but it should be usable for that as well.)
>> >> >> >>
>> >> >> >> This might not be good for the test since it is basically a
>> >> >> >> sequential read test, which can be optimized a lot by the kernel.
>> >> >> >> And I always use a randread benchmark.
>> >> >> >
>> >> >> > Yes, I briefly pondered whether I should implement random offsets
>> >> >> > instead. But then I realised that a quicker kernel operation would
>> >> >> > only help the benchmark because we want it to test the CPU
>> >> >> > consumption in userspace. So the faster the kernel gets, the better
>> >> >> > for us, because it should make the impact of coroutines bigger.
>> >> >>
>> >> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
>> >>
>> >> I use the /dev/nullb0 block device to test, which is available in Linux
>> >> kernel 3.13+, and the difference follows; it looks not very big (< 10%):
>> >
>> > Sounds useful. I'm running on an older kernel, so I used a loop-mounted
>> > file on tmpfs instead for my tests.
>>
>> Actually loop is a slow device, and recently I used kernel AIO and blk-mq
>> to speed it up a lot.
>
> Yes, I have no doubts that it's slower than a proper ramdisk, but it
> should still be way faster than my normal disk.
>
>> > Anyway, at some point today I figured I should take a different approach
>> > and not try to minimise the problems that coroutines introduce, but
>> > rather make the most use of them when we have them. After all, the
>> > raw-posix driver is still very callback-oriented and does things that
>> > aren't really necessary with coroutines (such as AIOCB allocation).
>> >
>> > The qemu-img bench time I ended up with looked quite nice. Maybe you
>> > want to take a look if you can reproduce these results, both with
>> > qemu-img bench and your real benchmark.
>> >
>> >
>> > $ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000
>> > /dev/loop0; done
>> > Sending 2000000 requests, 4096 bytes each, 64 in parallel
>> >
>> >         bypass (base) | bypass (patch) | coro (base) | coro (patch)
>> > ----------------------+----------------+-------------+---------------
>> > run 1   0m5.966s      | 0m5.687s       | 0m6.224s    | 0m5.362s
>> > run 2   0m5.826s      | 0m5.831s       | 0m5.994s    | 0m5.541s
>> > run 3   0m6.145s      | 0m5.495s       | 0m6.253s    | 0m5.408s
>> > run 4   0m5.683s      | 0m5.527s       | 0m6.045s    | 0m5.293s
>> > run 5   0m5.904s      | 0m5.607s       | 0m6.238s    | 0m5.207s
>>
>> I suggest running the test a bit longer.
>
> Okay, ran it again with -c 10000000 this time. I also used the updated
> branch for the patched version. This means that the __thread patch is
> not enabled; this is probably why the improvement for the bypass has
> disappeared and the coroutine based version only approaches, but doesn't
> beat it this time.
>
>         bypass (base) | bypass (patch) | coro (base) | coro (patch)
> ----------------------+----------------+-------------+---------------
> run 1   28.255s       | 28.615s        | 30.364s     | 28.318s
> run 2   28.190s       | 28.926s        | 30.096s     | 28.437s
> run 3   28.079s       | 29.603s        | 30.084s     | 28.567s
> run 4   28.888s       | 28.581s        | 31.343s     | 28.605s
> run 5   28.196s       | 28.924s        | 30.033s     | 27.935s
Your result is quite good (> 300K IOPS), much better than my result with
/dev/nullb0 (less than 200K IOPS). I also tried loop over a file in tmpfs,
which looks a bit quicker than /dev/nullb0 (still ~200K IOPS on my server),
so I guess your machine is very fast.

It is a bit similar to my observation:

- on my laptop (CPU: 2.6GHz), your coro patch improved things a lot, and is
  only less than 5% behind bypass
- on my server (CPU: 1.6GHz, same L1/L2 cache as the laptop, bigger L3
  cache), your coro patch improved little, and it is less than 10% behind
  bypass

So it looks like coroutines behave better on fast CPUs than on slow ones?

I would appreciate it if you could run the test in a VM, especially with 2
or 4 virtqueues and 2/4 jobs, to see what IOPS can be reached.

>> > You can find my working tree at:
>> >
>> >     git://repo.or.cz/qemu/kevin.git perf-bypass
>>
>> I just tried your working tree, and it looks like qemu-img works well
>> with your linux-aio coro patches, but unfortunately there is little
>> improvement observed on my server; basically the result is the same
>> without bypass. On my laptop, the improvement can be observed, but it
>> is still at least 5% less than bypass.
>>
>> Let's see the result on my server:
>>
>> ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000
>> /dev/nullb5
>> Sending 6400000 requests, 4096 bytes each, 64 in parallel
>> read time: 38351ms, 166.000000K IOPS
>> ming@:~/git/qemu$
>> ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 -b
>> /dev/nullb5
>> Sending 6400000 requests, 4096 bytes each, 64 in parallel
>> read time: 35241ms, 181.000000K IOPS
>
> Hm, interesting. Apparently our environments are different enough to
> come to opposite conclusions.

Yes, it looks like coroutines behave better on a fast CPU than on a slow
one; as you can see, my result is much worse than yours.

ming@:~/git/qemu$ sudo losetup -a
/dev/loop0: [0014]:64892 (/run/shm/dd.img)
ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -n -t off -c 2000000 -b /dev/loop0
Sending 2000000 requests, 4096 bytes each, 64 in parallel
read time: 9692ms, 206.000000K IOPS
ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -n -t off -c 2000000 /dev/loop0
Sending 2000000 requests, 4096 bytes each, 64 in parallel
read time: 10683ms, 187.000000K IOPS

>
> I also tried running some fio benchmarks based on the configuration you
> had in the cover letter (just a bit downsized to fit it in the ramdisk)
> and came to completely different results: For me, git master is a lot
> better than qemu 2.0. The optimisation branch showed small, but
> measurable additional improvements, with coroutines consistently being a
> bit ahead of the bypass mode.
>
>> > Please note that I added an even worse and even wronger hack to keep the
>> > bypass working so I can compare it (raw-posix now exposes both bdrv_aio*
>> > and bdrv_co_*, and enabling the bypass also switches). Also, once the
>> > AIO code that I kept for the bypass mode is gone, we can make the
>> > coroutine path even nicer.
>>
>> This approach looks nice since it saves the intermediate callback.
>>
>> Basically the current bypass approach is to bypass coroutines in the block
>> layer, while linux-aio takes a new coroutine, so they are two different
>> paths. And linux-aio's coroutine can still be bypassed easily too, :-)
>
> The patched linux-aio doesn't create a new coroutine, it simply stays
> in the one coroutine that we have and in which we already are. Bypassing
> it by making the yield conditional would still be possible, of course
> (for testing anyway; I don't think anything like that can be merged
> easily).

Thanks,
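
For anyone following along, below is a minimal sketch of the
enter/yield/re-enter pattern the numbers above compare. It is written
against the coroutine API of this era; the include path, the two-argument
qemu_coroutine_enter() and the helper names (bench_co_entry,
run_one_request) are assumptions for illustration, not code from any of
the patches discussed:

#include "block/coroutine.h"

/* Stand-in for one benchmark request. */
static void coroutine_fn bench_co_entry(void *opaque)
{
    /* Models the point where a request submits I/O and waits. */
    qemu_coroutine_yield();
    /* Falling off the end terminates the coroutine, so its stack can
     * be recycled by the coroutine pool. */
}

static void run_one_request(void)
{
    Coroutine *co = qemu_coroutine_create(bench_co_entry);

    qemu_coroutine_enter(co, NULL);   /* runs up to the yield */
    qemu_coroutine_enter(co, NULL);   /* "completion": lets it terminate */

    /* The buggy benchmark stopped after the first enter: every coroutine
     * stayed parked at its yield and was never freed, so the pool could
     * not be reused; that is what the "buggy coro" column measures. */
}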