On Tue, Aug 12, 2014 at 3:37 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 10/08/2014 05:46, Ming Lei wrote:
>> Hi Kevin, Paolo, Stefan and all,
>>
>> On Wed, 6 Aug 2014 10:48:55 +0200
>> Kevin Wolf <kw...@redhat.com> wrote:
>>
>>> On 06.08.2014 at 07:33, Ming Lei wrote:
>>>
>>> Anyhow, the coroutine version of your benchmark is buggy: it leaks all
>>> coroutines instead of exiting them, so it can't make any use of the
>>> coroutine pool. On my laptop, I get this (where "fixed coro" is a
>>> version that simply removes the yield at the end):
>>>
>>>                 | bypass        | fixed coro    | buggy coro
>>> ----------------+---------------+---------------+---------------
>>> time            | 1.09s         | 1.10s         | 1.62s
>>> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>>> insns per cycle | 2.39          | 2.39          | 1.90
>>>
>>> That begs the question of whether you see a similar effect on a real
>>> qemu because the coroutine pool is still not big enough. With correct
>>> use of coroutines, the difference seems to be barely measurable even
>>> without any I/O involved.
>>
>> Now I have fixed the coroutine leak bug. The previous crypt benchmark
>> had a rather heavy load per iteration, which kept operations per second
>> very low (~40K/sec), so I wrote a new and simple one that can generate
>> hundreds of thousands of operations per second, a rate that should
>> match some fast storage devices, and it does show a non-trivial effect
>> from coroutines.
>>
>> In the extreme case, where just a getppid() syscall is run in each
>> iteration, only 3M operations/sec can be reached with coroutines, while
>> without coroutines the number can reach 16M/sec, more than a 4x
>> difference!
>
> I should be on vacation, but I'm following a couple of threads on the
> mailing list, and I'm a bit tired of hearing the same argument again and
> again...
I am sorry to interrupt your vacation and make you tired, but the discussion
is not simply the same argument again and again; something new comes up every
time, or at least most of the time.

>
> The different characteristics of asynchronous I/O vs. any synchronous
> workload are such that it is hard to be sure that microbenchmarks make
> sense.

I don't think this is related to asynchronous vs. synchronous I/O. There is
no sleep (or wait for completion) at all, so we can treat the benchmark as
AIO by regarding the completion as a nop in this case (AIO model: submit and
complete).

IMO the getppid() benchmark is a simple simulation of bdrv_aio_readv/writev()
with I/O plug/unplug, as far as coroutine usage is concerned.

BTW, do you agree with the computation of the coroutine cost in my previous
mail? I don't think that computation is related to the I/O type either. (The
rough arithmetic from the numbers quoted above, and a standalone sketch of
such a microbenchmark, are appended after this mail.)

>
> The below patch is basically the minimal change to bypass coroutines. Of
> course the block.c part is not acceptable as is (the change to
> refresh_total_sectors is broken, the others are just ugly), but it is a
> start. Please run it with your fio workloads, or write an aio-based
> version of a qemu-img/qemu-io *I/O* benchmark.

Could you explain why this new change is introduced? I will hold off on it
until we can agree on the coroutine cost computation, because that is very
important for the discussion.

Thank you again for taking the time to discuss this.

Thanks,
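
Appendix: the back-of-the-envelope arithmetic implied by the numbers quoted
above. This is derived only from the 3M/sec and 16M/sec figures in this
thread, not from the computation in the earlier mail (which is not shown
here), and it assumes one coroutine enter/yield round trip per iteration:

    without coroutine: 1 / 16,000,000 ops/sec ~ 62.5 ns/op
    with coroutine:    1 /  3,000,000 ops/sec ~ 333.3 ns/op
    implied overhead:  333.3 ns - 62.5 ns     ~ 271 ns per round trip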
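
Appendix: a minimal, self-contained sketch of this kind of microbenchmark,
for anyone who wants to reproduce the comparison. This is not Ming Lei's
actual test and not QEMU's coroutine implementation: it uses POSIX ucontext,
it measures only the enter/yield switch pair (QEMU's coroutine pool amortizes
allocation), and glibc's swapcontext() performs a rt_sigprocmask() syscall on
every switch, so it should overstate the cost relative to QEMU's
sigsetjmp()-based switching. The file name and iteration count are arbitrary.

/* coro-bench.c: compare a plain getppid() loop with a loop that does one
 * ucontext switch pair per getppid() call.  A rough approximation of the
 * benchmark discussed above, not QEMU code. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <ucontext.h>

#define ITERATIONS 1000000
#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, coro_ctx;

/* Body of the "coroutine": one syscall, then yield back to main. */
static void coro_fn(void)
{
    for (;;) {
        getppid();
        swapcontext(&coro_ctx, &main_ctx);      /* yield */
    }
}

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    double t0, t1;
    int i;

    /* Baseline: the bare syscall, no context switch at all. */
    t0 = now_sec();
    for (i = 0; i < ITERATIONS; i++) {
        getppid();
    }
    t1 = now_sec();
    printf("direct:    %.0f ops/sec\n", ITERATIONS / (t1 - t0));

    /* Set up a single reusable context running coro_fn on its own stack. */
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = stack;
    coro_ctx.uc_stack.ss_size = STACK_SIZE;
    coro_ctx.uc_link = &main_ctx;
    makecontext(&coro_ctx, coro_fn, 0);

    /* Coroutine-style loop: each iteration enters the context, which does
     * the syscall and switches straight back (two swapcontext() calls). */
    t0 = now_sec();
    for (i = 0; i < ITERATIONS; i++) {
        swapcontext(&main_ctx, &coro_ctx);      /* enter */
    }
    t1 = now_sec();
    printf("coroutine: %.0f ops/sec\n", ITERATIONS / (t1 - t0));

    free(stack);
    return 0;
}

Build with something like "gcc -O2 coro-bench.c -o coro-bench" (older glibc
may also need -lrt for clock_gettime()).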