On 06.08.2014 at 13:28, Ming Lei wrote:
> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kw...@redhat.com> wrote:
> > On 06.08.2014 at 11:37, Ming Lei wrote:
> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kw...@redhat.com> wrote:
> >> > On 06.08.2014 at 07:33, Ming Lei wrote:
> >> >> Hi Kevin,
> >> >>
> >> >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kw...@redhat.com> wrote:
> >> >> > On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
> >> >> >> I have been wondering how to prove that the root cause is the
> >> >> >> ucontext coroutine mechanism (stack switching). Here is an idea:
> >> >> >>
> >> >> >> Hack your "bypass" code path to run the request inside a coroutine.
> >> >> >> That way you can compare "bypass without coroutine" against
> >> >> >> "bypass with coroutine".
> >> >> >>
> >> >> >> Right now I think there are doubts because the bypass code path is
> >> >> >> indeed a different (and not 100% correct) code path. So this approach
> >> >> >> might prove that the coroutines are adding the overhead and not
> >> >> >> something that you bypassed.
> >> >> >
> >> >> > My doubts aren't only that the overhead might not come from the
> >> >> > coroutines, but also whether any coroutine-related overhead is really
> >> >> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> >> >> > just that instead of introducing additional code paths.
> >> >>
> >> >> OK, thank you for taking a look at the problem; I hope we can
> >> >> figure out the root cause. :-)
> >> >>
> >> >> >
> >> >> > Another thought I had was this: If the performance difference is
> >> >> > indeed only coroutines, then that is completely inside the block
> >> >> > layer and we don't actually need a VM to test it. We could instead
> >> >> > have something like a simple qemu-img based benchmark and should be
> >> >> > observing the same.
> >> >>
> >> >> It is even simpler to run a coroutine-only benchmark, and I just
> >> >> wrote a raw one; it looks like coroutines do decrease performance
> >> >> a lot. Please see the attached patch, and thanks for your template,
> >> >> which helped me add the 'co_bench' command to qemu-img.
> >> >
> >> > Yes, we can look at coroutine microbenchmarks in isolation. I actually
> >> > did do that yesterday with the yield test from tests/test-coroutine.c.
> >> > And in fact profiling immediately showed something to optimise:
> >> > pthread_getspecific() was quite high; replacing it with __thread on
> >> > systems where it works is more efficient and helped the numbers a bit.
> >> > Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
> >> > in qemu-img bench), so maybe there's something that can be done here too.
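
(As an illustration of the TLS difference mentioned above: a minimal
sketch, not the actual QEMU patch; the variable and function names here
are invented.)

#include <pthread.h>

static pthread_key_t co_key;        /* old style: pthread TLS key */
static __thread void *co_self;      /* new style: compiler-level TLS */

/* pthread_getspecific() costs an out-of-line library call per lookup */
static void *get_current_pthread(void)
{
    return pthread_getspecific(co_key);
}

/* __thread access typically compiles to a single TLS-relative load,
 * on systems where the compiler and libc support it */
static void *get_current_tls(void)
{
    return co_self;
}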
> >>
> >> The lock/unlock in dataplane is often from memory_region_find(), and
> >> Paolo has already done lots of work on that.

qemu-img bench doesn't run that code. We have a few more locks that are
taken, and one of them (the coroutine pool lock) is avoided by your
bypass patches.

> >> > >
> >> > However, I just wasn't sure whether a change on this level would be
> >> > relevant in a realistic environment. This is the reason why I wanted to
> >> > get a benchmark involving the block layer and some I/O.
> >> >
> >> >> From the profiling data in the link below:
> >> >>
> >> >> http://pastebin.com/YwH2uwbq
> >> >>
> >> >> With coroutines, the running time for the same load is increased by
> >> >> ~50% (1.325s vs. 0.903s), dcache load events are increased by
> >> >> ~35% (693M vs. 512M), and insns per cycle are decreased by ~50%
> >> >> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
> >> >>
> >> >> The bypass code in the benchmark is very similar to the approach
> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
> >> >> blocks in the kernel I/O path.
> >> >>
> >> >> Maybe the benchmark is a bit extreme, but given that modern storage
> >> >> devices may reach millions of IOPS, it is very easy for coroutines
> >> >> to slow down the I/O.
> >> >
> >> > I think in order to optimise coroutines, such benchmarks are fair game.
> >> > It's just not guaranteed that the effects are exactly the same on real
> >> > workloads, so we should take the results with a grain of salt.
> >> >
> >> > Anyhow, the coroutine version of your benchmark is buggy: it leaks all
> >> > coroutines instead of exiting them, so it can't make any use of the
> >> > coroutine pool. On my laptop, I get this (where "fixed coro" is a
> >> > version that simply removes the yield at the end):
> >> >
> >> >                 |    bypass     |  fixed coro   |  buggy coro
> >> > ----------------+---------------+---------------+--------------
> >> > time            |     1.09s     |     1.10s     |     1.62s
> >> > L1-dcache-loads |   921,836,360 |   932,781,747 | 1,298,067,438
> >> > insns per cycle |      2.39     |      2.39     |      1.90
> >> >
> >> > This begs the question of whether you see a similar effect on a real
> >> > qemu because the coroutine pool is still not big enough. With correct
> >> > use of coroutines, the difference seems to be barely measurable even
> >> > without any I/O involved.
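
(A minimal sketch of the bug being described; bench_do_request() is a
made-up placeholder, not code from the actual benchmark patch.)

/* Buggy version: the final yield parks the coroutine forever, so it never
 * terminates and can never be recycled through the coroutine pool; every
 * iteration has to allocate a fresh coroutine and stack. */
static void coroutine_fn co_bench_buggy(void *opaque)
{
    bench_do_request(opaque);    /* hypothetical per-iteration work */
    qemu_coroutine_yield();      /* BUG: the coroutine is leaked here */
}

/* Fixed version: simply returning terminates the coroutine, which allows
 * the next qemu_coroutine_create() to take it from the pool again. */
static void coroutine_fn co_bench_fixed(void *opaque)
{
    bench_do_request(opaque);
}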
> >>
> >> When I comment out qemu_coroutine_yield(), the results of bypass and
> >> fixed coro look very similar to your test, and I am just wondering if
> >> the stack is always switched in qemu_coroutine_enter(), even without
> >> calling qemu_coroutine_yield().
> >
> > Yes, definitely. qemu_coroutine_enter() always involves calling
> > qemu_coroutine_switch(), which is the stack switch.
> >
> >> Without the yield, the benchmark can't emulate coroutine usage in the
> >> bdrv_aio_readv/writev() path any more, and the bypass in the patchset
> >> skips two qemu_coroutine_enter() and one qemu_coroutine_yield()
> >> for each bdrv_aio_readv/writev().
> >
> > It's not completely comparable anyway, because your benchmark isn't
> > going through a main loop and callbacks from there.
> >
> > But fair enough: keep the yield, but then enter the coroutine twice. You
> > get slightly worse results that way, but that's more like doubling the
> > very small difference between "bypass" and "fixed coro" (1.11s /
> > 946,434,327 / 2.37), not like the horrible performance of the buggy
> > version.
>
> Yes, I compared that too; it looks like there is no big difference.
>
> >
> > Actually, that's within the error of measurement for time and
> > insns/cycle, so running it for a bit longer:
> >
> >                 |   bypass  |    coro   |  + yield  |  buggy coro
> > ----------------+-----------+-----------+-----------+--------------
> > time            |   21.45s  |   21.68s  |   21.83s  |    97.05s
> > L1-dcache-loads |  18,049 M |  18,387 M |  18,618 M |   26,062 M
> > insns per cycle |    2.42   |    2.40   |    2.41   |     1.75
> >
> >> >> > I played a bit with the following; I hope it's not too naive. I
> >> >> > couldn't see a difference with your patches, but at least one reason
> >> >> > for this is probably that my laptop SSD isn't fast enough to make
> >> >> > the CPU the bottleneck. Haven't tried a ramdisk yet, that would
> >> >> > probably be the next thing. (I actually wrote the patch up just for
> >> >> > some profiling on my own, not for comparing throughput, but it
> >> >> > should be usable for that as well.)
> >> >>
> >> >> This might not be good for the test since it is basically a sequential
> >> >> read test, which can be optimized a lot by the kernel. And I always
> >> >> use a randread benchmark.
> >> >
> >> > Yes, I briefly pondered whether I should implement random offsets
> >> > instead. But then I realised that a quicker kernel operation would only
> >> > help the benchmark, because we want it to test the CPU consumption in
> >> > userspace. So the faster the kernel gets, the better for us, because it
> >> > should make the impact of coroutines bigger.
> >>
> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
>
> I use the /dev/nullb0 block device to test, which is available in Linux
> kernel 3.13+. The difference follows, and it looks not very big (< 10%):

Sounds useful. I'm running on an older kernel, so I used a loop-mounted
file on tmpfs instead for my tests.

Anyway, at some point today I figured I should take a different approach
and not try to minimise the problems that coroutines introduce, but
rather make the most use of them when we have them. After all, the
raw-posix driver is still very callback-oriented and does things that
aren't really necessary with coroutines (such as AIOCB allocation).

The qemu-img bench time I ended up with looked quite nice. Maybe you want
to take a look and see if you can reproduce these results, both with
qemu-img bench and your real benchmark.

$ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
Sending 2000000 requests, 4096 bytes each, 64 in parallel

        bypass (base) | bypass (patch) | coro (base) | coro (patch)
----------------------+----------------+-------------+---------------
run 1        0m5.966s |       0m5.687s |    0m6.224s |      0m5.362s
run 2        0m5.826s |       0m5.831s |    0m5.994s |      0m5.541s
run 3        0m6.145s |       0m5.495s |    0m6.253s |      0m5.408s
run 4        0m5.683s |       0m5.527s |    0m6.045s |      0m5.293s
run 5        0m5.904s |       0m5.607s |    0m6.238s |      0m5.207s

You can find my working tree at:

    git://repo.or.cz/qemu/kevin.git perf-bypass

Please note that I added an even worse and even more wrong hack to keep
the bypass working so I can compare it (raw-posix now exposes both
bdrv_aio_* and bdrv_co_*, and enabling the bypass also switches between
them). Also, once the AIO code that I kept for the bypass mode is gone,
we can make the coroutine path even nicer.

Kevin
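
(For reference, a rough sketch of the coroutine-native request path
described above, assuming QEMU's 2014-era coroutine API; CoRawRequest,
raw_co_complete() and raw_submit_aio() are invented stand-ins, not names
from the perf-bypass tree.)

#include "block/block_int.h"   /* QEMU-internal headers, 2014-era layout */
#include "block/coroutine.h"

/* Invented container tying a request to the coroutine that waits for it */
typedef struct CoRawRequest {
    Coroutine *co;    /* coroutine parked in qemu_coroutine_yield() */
    int ret;          /* filled in by the completion callback */
} CoRawRequest;

/* Completion callback: instead of completing a heap-allocated AIOCB,
 * just re-enter the coroutine that submitted the request */
static void raw_co_complete(void *opaque, int ret)
{
    CoRawRequest *req = opaque;

    req->ret = ret;
    qemu_coroutine_enter(req->co, NULL);
}

/* Hypothetical submission helper: queues a linux-aio read and arranges
 * for cb(opaque, ret) to be called from the event loop on completion */
void raw_submit_aio(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
                    QEMUIOVector *qiov, BlockDriverCompletionFunc *cb,
                    void *opaque);

static int coroutine_fn raw_co_readv_sketch(BlockDriverState *bs,
                                            int64_t sector_num,
                                            int nb_sectors,
                                            QEMUIOVector *qiov)
{
    CoRawRequest req = {
        .co = qemu_coroutine_self(),
    };

    raw_submit_aio(bs, sector_num, nb_sectors, qiov, raw_co_complete, &req);
    qemu_coroutine_yield();    /* resumed by raw_co_complete() */
    return req.ret;
}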