On Thu, Aug 7, 2014 at 9:51 PM, Kevin Wolf <kw...@redhat.com> wrote:
> Am 07.08.2014 um 12:27 hat Ming Lei geschrieben:
>> On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> > Am 06.08.2014 um 13:28 hat Ming Lei geschrieben:
>> >> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> >> > Am 06.08.2014 um 11:37 hat Ming Lei geschrieben:
>> >> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> >> >> > However, I just wasn't sure whether a change on this level would be
>> >> >> > relevant in a realistic environment. This is the reason why I wanted
>> >> >> > to get a benchmark involving the block layer and some I/O.
>> >> >> >
>> >> >> >> From the profiling data in the link below:
>> >> >> >>
>> >> >> >> http://pastebin.com/YwH2uwbq
>> >> >> >>
>> >> >> >> With coroutines, the running time for the same load is increased by
>> >> >> >> ~50% (1.325s vs. 0.903s), dcache load events are increased by
>> >> >> >> ~35% (693M vs. 512M), and insns per cycle is decreased by ~50%
>> >> >> >> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
>> >> >> >>
>> >> >> >> The bypass code in the benchmark is very similar to the approach
>> >> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
>> >> >> >> blocks in the kernel I/O path.
>> >> >> >>
>> >> >> >> Maybe the benchmark is a bit extreme, but given that modern storage
>> >> >> >> devices may reach millions of IOPS, it is very easy to slow down
>> >> >> >> the I/O with coroutines.
>> >> >> >
>> >> >> > I think in order to optimise coroutines, such benchmarks are fair
>> >> >> > game. It's just not guaranteed that the effects are exactly the same
>> >> >> > on real workloads, so we should take the results with a grain of salt.
>> >> >> >
>> >> >> > Anyhow, the coroutine version of your benchmark is buggy, it leaks
>> >> >> > all coroutines instead of exiting them, so it can't make any use of
>> >> >> > the coroutine pool. On my laptop, I get this (where fixed coroutine
>> >> >> > is a version that simply removes the yield at the end):
>> >> >> >
>> >> >> >                 | bypass        | fixed coro    | buggy coro
>> >> >> > ----------------+---------------+---------------+--------------
>> >> >> > time            | 1.09s         | 1.10s         | 1.62s
>> >> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> >> >> > insns per cycle | 2.39          | 2.39          | 1.90
>> >> >> >
>> >> >> > Begs the question whether you see a similar effect on a real qemu and
>> >> >> > the coroutine pool is still not big enough? With correct use of
>> >> >> > coroutines, the difference seems to be barely measurable even without
>> >> >> > any I/O involved.
>> >> >>
>> >> >> When I comment out qemu_coroutine_yield(), the results of
>> >> >> bypass and fixed coro look very similar to your test, and I am just
>> >> >> wondering if the stack is always switched in qemu_coroutine_enter()
>> >> >> even without calling qemu_coroutine_yield().
>> >> >
>> >> > Yes, definitely. qemu_coroutine_enter() always involves calling
>> >> > qemu_coroutine_switch(), which is the stack switch.
>> >> >
>> >> >> Without the yield, the benchmark can't emulate the coroutine usage in
>> >> >> the bdrv_aio_readv/writev() path any more, and the bypass in the
>> >> >> patchset skips two qemu_coroutine_enter() calls and one
>> >> >> qemu_coroutine_yield() for each bdrv_aio_readv/writev().
>> >> >
>> >> > It's not completely comparable anyway because you're not going through a
>> >> > main loop and callbacks from there for your benchmark.
>> >> >
>> >> > But fair enough: Keep the yield, but enter the coroutine twice then. You
>> >> > get slightly worse results then, but that's more like doubling the very
>> >> > small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
>> >> > / 2.37), not like the horrible performance of the buggy version.
>> >>
>> >> Yes, I compared that too; it looks like there is no big difference.
>> >>
>> >> >
>> >> > Actually, that's within the error of measurement for time and
>> >> > insns/cycle, so running it for a bit longer:
>> >> >
>> >> >                 | bypass    | coro      | + yield   | buggy coro
>> >> > ----------------+-----------+-----------+-----------+--------------
>> >> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
>> >> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
>> >> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
>> >> >
>> >> >> >> > I played a bit with the following, I hope it's not too naive. I
>> >> >> >> > couldn't see a difference with your patches, but at least one
>> >> >> >> > reason for this is probably that my laptop SSD isn't fast enough
>> >> >> >> > to make the CPU the bottleneck. Haven't tried ramdisk yet, that
>> >> >> >> > would probably be the next thing. (I actually wrote the patch up
>> >> >> >> > just for some profiling on my own, not for comparing throughput,
>> >> >> >> > but it should be usable for that as well.)
>> >> >> >>
>> >> >> >> This might not be good for the test since it is basically a
>> >> >> >> sequential read test, which can be optimized a lot by the kernel.
>> >> >> >> And I always use a randread benchmark.
>> >> >> >
>> >> >> > Yes, I briefly pondered whether I should implement random offsets
>> >> >> > instead. But then I realised that a quicker kernel operation would
>> >> >> > only help the benchmark because we want it to test the CPU
>> >> >> > consumption in userspace. So the faster the kernel gets, the better
>> >> >> > for us, because it should make the impact of coroutines bigger.
>> >> >>
>> >> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
>> >>
>> >> I use the /dev/nullb0 block device to test, which is available in Linux
>> >> kernel 3.13+, and the difference follows; it looks not very big (< 10%):
>> >
>> > Sounds useful. I'm running on an older kernel, so I used a loop-mounted
>> > file on tmpfs instead for my tests.
>>
>> Actually loop is a slow device, and recently I used kernel AIO and blk-mq
>> to speed it up a lot.
>
> Yes, I have no doubts that it's slower than a proper ramdisk, but it
> should still be way faster than my normal disk.
>
>> > Anyway, at some point today I figured I should take a different approach
>> > and not try to minimise the problems that coroutines introduce, but
>> > rather make the most use of them when we have them. After all, the
>> > raw-posix driver is still very callback-oriented and does things that
>> > aren't really necessary with coroutines (such as AIOCB allocation).
>> >
>> > The qemu-img bench time I ended up with looked quite nice. Maybe you
>> > want to take a look if you can reproduce these results, both with
>> > qemu-img bench and your real benchmark.
>> >
>> >
>> > $ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000
>> > /dev/loop0; done
>> > Sending 2000000 requests, 4096 bytes each, 64 in parallel
>> >
>> >         bypass (base) | bypass (patch) | coro (base) | coro (patch)
>> > ----------------------+----------------+-------------+---------------
>> > run 1   0m5.966s      | 0m5.687s       | 0m6.224s    | 0m5.362s
>> > run 2   0m5.826s      | 0m5.831s       | 0m5.994s    | 0m5.541s
>> > run 3   0m6.145s      | 0m5.495s       | 0m6.253s    | 0m5.408s
>> > run 4   0m5.683s      | 0m5.527s       | 0m6.045s    | 0m5.293s
>> > run 5   0m5.904s      | 0m5.607s       | 0m6.238s    | 0m5.207s
>>
>> I suggest running the test a bit longer.
>
> Okay, ran it again with -c 10000000 this time. I also used the updated
> branch for the patched version. This means that the __thread patch is
> not enabled; this is probably why the improvement for the bypass has
> disappeared and the coroutine based version only approaches, but doesn't
> beat it this time.
>
>         bypass (base) | bypass (patch) | coro (base) | coro (patch)
> ----------------------+----------------+-------------+---------------
> run 1   28.255s       | 28.615s        | 30.364s     | 28.318s
> run 2   28.190s       | 28.926s        | 30.096s     | 28.437s
> run 3   28.079s       | 29.603s        | 30.084s     | 28.567s
> run 4   28.888s       | 28.581s        | 31.343s     | 28.605s
> run 5   28.196s       | 28.924s        | 30.033s     | 27.935s
Your result is quite good (> 300K IOPS), much better than my result with
/dev/nullb0 (less than 200K IOPS). I also tried loop over a file in tmpfs,
which looks a bit quicker than /dev/nullb0 (still ~200K IOPS on my server),
so I guess your machine is very fast.

It is a bit similar to my observation:

- on my laptop (CPU: 2.6GHz), your coro patch improved things a lot, and is
  only less than 5% behind bypass
- on my server (CPU: 1.6GHz, same L1/L2 cache as the laptop, bigger L3
  cache), your coro patch improved little, and it is less than 10% behind
  bypass

So it looks like coroutines behave better on fast CPUs than on slow ones?

I would appreciate it if you could run the test in a VM, especially with 2
or 4 virtqueues and 2/4 jobs, to see what IOPS can be reached.

>> > You can find my working tree at:
>> >
>> >     git://repo.or.cz/qemu/kevin.git perf-bypass
>>
>> I just tried your working tree, and it looks like qemu-img works well
>> with your linux-aio coro patches, but unfortunately there is little
>> improvement observed on my server; basically the result is the same
>> without bypass. On my laptop, the improvement can be observed, but it
>> is still at least 5% less than bypass.
>>
>> Let's see the result on my server:
>>
>> ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000
>> /dev/nullb5
>> Sending 6400000 requests, 4096 bytes each, 64 in parallel
>> read time: 38351ms, 166.000000K IOPS
>> ming@:~/git/qemu$
>> ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 -b
>> /dev/nullb5
>> Sending 6400000 requests, 4096 bytes each, 64 in parallel
>> read time: 35241ms, 181.000000K IOPS
>
> Hm, interesting. Apparently our environments are different enough to
> come to opposite conclusions.

Yes, it looks like coroutines behave better on a fast CPU than on a slow
one; as you can see, my result is much worse than yours.

ming@:~/git/qemu$ sudo losetup -a
/dev/loop0: [0014]:64892 (/run/shm/dd.img)
ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -n -t off -c 2000000 -b /dev/loop0
Sending 2000000 requests, 4096 bytes each, 64 in parallel
read time: 9692ms, 206.000000K IOPS
ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -n -t off -c 2000000 /dev/loop0
Sending 2000000 requests, 4096 bytes each, 64 in parallel
read time: 10683ms, 187.000000K IOPS

>
> I also tried running some fio benchmarks based on the configuration you
> had in the cover letter (just a bit downsized to fit it in the ramdisk)
> and came to completely different results: For me, git master is a lot
> better than qemu 2.0. The optimisation branch showed small, but
> measurable additional improvements, with coroutines consistently being a
> bit ahead of the bypass mode.
>
>> > Please note that I added an even worse and even wronger hack to keep the
>> > bypass working so I can compare it (raw-posix now exposes both bdrv_aio*
>> > and bdrv_co_*, and enabling the bypass also switches). Also, once the
>> > AIO code that I kept for the bypass mode is gone, we can make the
>> > coroutine path even nicer.
>>
>> This approach looks nice since it saves the intermediate callback.
>>
>> Basically the current bypass approach is to bypass coroutines in the block
>> layer, while linux-aio takes a new coroutine, so they are two different
>> paths. And linux-aio's coroutine can still be bypassed easily too, :-)
>
> The patched linux-aio doesn't create a new coroutine, it simply stays
> in the one coroutine that we have and in which we already are. Bypassing
> it by making the yield conditional would still be possible, of course
> (for testing anyway; I don't think anything like that can be merged
> easily).

Thanks,
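
For anyone following along, below is a minimal sketch of the
enter/yield/re-enter pattern the numbers above compare. It is written
against the coroutine API of this era; the include path, the two-argument
qemu_coroutine_enter() and the helper names (bench_co_entry,
run_one_request) are assumptions for illustration, not code from any of
the patches discussed:

#include "block/coroutine.h"

/* Stand-in for one benchmark request. */
static void coroutine_fn bench_co_entry(void *opaque)
{
    /* Models the point where a request submits I/O and waits. */
    qemu_coroutine_yield();
    /* Falling off the end terminates the coroutine, so its stack can
     * be recycled by the coroutine pool. */
}

static void run_one_request(void)
{
    Coroutine *co = qemu_coroutine_create(bench_co_entry);

    qemu_coroutine_enter(co, NULL);   /* runs up to the yield */
    qemu_coroutine_enter(co, NULL);   /* "completion": lets it terminate */

    /* The buggy benchmark stopped after the first enter: every coroutine
     * stayed parked at its yield and was never freed, so the pool could
     * not be reused; that is what the "buggy coro" column measures. */
}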