On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kw...@redhat.com> wrote:
> Am 06.08.2014 um 07:33 hat Ming Lei geschrieben:
>> Hi Kevin,
>>
>> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> > Am 05.08.2014 um 15:48 hat Stefan Hajnoczi geschrieben:
>> >> I have been wondering how to prove that the root cause is the ucontext
>> >> coroutine mechanism (stack switching). Here is an idea:
>> >>
>> >> Hack your "bypass" code path to run the request inside a coroutine.
>> >> That way you can compare "bypass without coroutine" against "bypass
>> >> with coroutine".
>> >>
>> >> Right now I think there are doubts because the bypass code path is
>> >> indeed a different (and not 100% correct) code path. So this approach
>> >> might prove that the coroutines are adding the overhead and not
>> >> something that you bypassed.
>> >
>> > My doubts aren't only that the overhead might not come from the
>> > coroutines, but also whether any coroutine-related overhead is really
>> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
>> > just that instead of introducing additional code paths.
>>
>> OK, thank you for taking a look at the problem; I hope we can
>> figure out the root cause. :-)
>>
>> > Another thought I had was this: if the performance difference is indeed
>> > only the coroutines, then that is completely inside the block layer and
>> > we don't actually need a VM to test it. We could instead have something
>> > like a simple qemu-img based benchmark and should observe the same.
>>
>> It is even simpler to run a coroutine-only benchmark, and I just
>> wrote a rough one; it looks like coroutines do decrease performance
>> a lot. Please see the attached patch, and thanks for your template,
>> which helped me add the 'co_bench' command to qemu-img.
>
> Yes, we can look at coroutine microbenchmarks in isolation. I actually
> did that yesterday with the yield test from tests/test-coroutine.c.
> In fact, profiling immediately showed something to optimise:
> pthread_getspecific() was quite high; replacing it with __thread on
> systems where it works is more efficient and helped the numbers a bit.
> Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
> in qemu-img bench), so maybe something can be done there as well.
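Nice. If I understand the __thread idea correctly, the change is roughly
the following (just a standalone sketch with made-up names to illustrate
the difference, not your actual patch):

/* Rough sketch, illustrative names only -- not the actual patch. */
#include <pthread.h>
#include <stdio.h>

typedef struct Coroutine Coroutine;

/* Before: the per-thread "current coroutine" pointer lives behind a
 * POSIX TLS key, so every lookup is a call into libpthread. */
static pthread_key_t current_key;

static Coroutine *coroutine_self_pthread(void)
{
    return pthread_getspecific(current_key);
}

/* After: compiler-supported TLS; the lookup is a plain thread-local
 * memory load (typically one %fs-relative instruction on x86-64). */
static __thread Coroutine *current;

static Coroutine *coroutine_self_tls(void)
{
    return current;
}

int main(void)
{
    pthread_key_create(&current_key, NULL);
    /* both are NULL here; the point is only the cost of the lookup */
    printf("%p %p\n", (void *)coroutine_self_pthread(),
           (void *)coroutine_self_tls());
    return 0;
}

Since the __thread access avoids the libpthread call completely, it is
easy to believe it shows up in such a tight loop.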
The lock/unlock in dataplane is often from memory_region_find(), and I
believe Paolo has already done a lot of work on that.

>
> However, I just wasn't sure whether a change on this level would be
> relevant in a realistic environment. This is the reason why I wanted to
> get a benchmark involving the block layer and some I/O.
>
>> From the profiling data at the link below:
>>
>> http://pastebin.com/YwH2uwbq
>>
>> With coroutines, the running time for the same load increases by ~50%
>> (1.325s vs. 0.903s), dcache load events increase by ~35% (693M vs.
>> 512M), and insns per cycle drop from 1.63 to 1.35, compared with
>> bypassing coroutines (-b parameter).
>>
>> The bypass code in the benchmark is very similar to the approach used
>> in the bypass patch, since linux-aio with O_DIRECT seldom blocks in
>> the kernel I/O path.
>>
>> Maybe the benchmark is a bit extreme, but modern storage devices may
>> reach millions of IOPS, and it is very easy for coroutines to slow
>> down the I/O.
>
> I think in order to optimise coroutines, such benchmarks are fair game.
> It's just not guaranteed that the effects are exactly the same on real
> workloads, so we should take the results with a grain of salt.
>
> Anyhow, the coroutine version of your benchmark is buggy: it leaks all
> coroutines instead of exiting them, so it can't make any use of the
> coroutine pool. On my laptop, I get this (where "fixed coro" is a
> version that simply removes the yield at the end):
>
>                 | bypass        | fixed coro    | buggy coro
> ----------------+---------------+---------------+--------------
> time            | 1.09s         | 1.10s         | 1.62s
> L1-dcache-loads |   921,836,360 |   932,781,747 | 1,298,067,438
> insns per cycle |          2.39 |          2.39 |          1.90
>
> This begs the question whether you see a similar effect on a real qemu
> and the coroutine pool is still not big enough. With correct use of
> coroutines, the difference seems to be barely measurable even without
> any I/O involved.

When I comment out qemu_coroutine_yield(), the results for bypass and
fixed coro are very similar, as in your test, and I am wondering whether
the stack is still switched in qemu_coroutine_enter() even when
qemu_coroutine_yield() is never called.

Without the yield, though, the benchmark no longer emulates the coroutine
usage in the bdrv_aio_readv/writev() path, and the bypass in the patchset
skips two qemu_coroutine_enter() calls and one qemu_coroutine_yield() for
each bdrv_aio_readv/writev().

>
>> > I played a bit with the following, and I hope it's not too naive. I
>> > couldn't see a difference with your patches, but at least one reason
>> > for this is probably that my laptop SSD isn't fast enough to make the
>> > CPU the bottleneck. I haven't tried a ramdisk yet; that would probably
>> > be the next thing. (I actually wrote the patch just for some profiling
>> > of my own, not for comparing throughput, but it should be usable for
>> > that as well.)
>>
>> This might not be good for the test since it is basically a sequential
>> read test, which can be optimized a lot by the kernel. I always use a
>> randread benchmark.
>
> Yes, I briefly pondered whether I should implement random offsets
> instead. But then I realised that a quicker kernel operation would only
> help the benchmark, because we want it to test the CPU consumption in
> userspace. So the faster the kernel gets, the better for us, because it
> should make the impact of coroutines bigger.

OK, I will compare coroutine vs. bypass-co with the benchmark.

Thanks,
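P.S. To make sure I understood the bug you pointed out, the difference
between the leaking and the fixed entry point is basically the following
(illustrative fragment against the coroutine API, not the exact co_bench
code):

/* Buggy variant: the trailing yield means the coroutine never runs to
 * completion, so it is never terminated and never returned to the
 * coroutine pool; every iteration allocates a fresh coroutine/stack. */
static void coroutine_fn bench_entry_leaky(void *opaque)
{
    unsigned long *done = opaque;
    (*done)++;
    qemu_coroutine_yield();     /* nobody re-enters it -> leaked */
}

/* Fixed variant: simply returning lets the coroutine terminate, so the
 * next qemu_coroutine_create() should be able to reuse it from the pool. */
static void coroutine_fn bench_entry_fixed(void *opaque)
{
    unsigned long *done = opaque;
    (*done)++;
}

With the fixed variant, the benchmark loop still creates and enters one
coroutine per iteration, but keeps recycling the same pooled coroutine,
which would explain why its numbers are so close to the bypass case.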