On 11/14/2016 09:26 AM, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
>>> Recent performance investigation work done by Karl Rister shows that the
>>> guest->host notification takes around 20 us.  This is more than the
>>> "overhead" of QEMU itself (e.g. block layer).
>>>
>>> One way to avoid the costly exit is to use polling instead of notification.
>>> The main drawback of polling is that it consumes CPU resources.  In order
>>> to benefit performance, the host must have extra CPU cycles available on
>>> physical CPUs that aren't used by the guest.
>>>
>>> This is an experimental AioContext polling implementation.  It adds a
>>> polling callback into the event loop.  Polling functions are implemented
>>> for the virtio-blk virtqueue guest->host kick and Linux AIO completion.
>>>
>>> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of
>>> nanoseconds to poll before entering the usual blocking poll(2) syscall.
>>> Try setting this variable to the time from old request completion to new
>>> virtqueue kick.
>>>
>>> By default no polling is done.  QEMU_AIO_POLL_MAX_NS must be set to get
>>> any polling!
>>>
>>> Karl: I hope you can try this patch series with several
>>> QEMU_AIO_POLL_MAX_NS values.  If you don't find a good value we should
>>> double-check the tracing data to see if this experimental code can be
>>> improved.
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5-minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem) which
>> is backed by a 400GB NVMe card.
>>
>> QEMU_AIO_POLL_MAX_NS    IOPS
>>                unset  31,383
>>                    1  46,860
>>                    2  46,440
>>                    4  35,246
>>                    8  34,973
>>                   16  46,794
>>                   32  46,729
>>                   64  35,520
>>                  128  45,902
>
> The environment variable is in nanoseconds.  The range of values you
> tried is very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.
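
To make the mechanism above concrete, below is a minimal standalone sketch of
the busy-poll-then-block pattern Stefan describes: spin for up to
QEMU_AIO_POLL_MAX_NS nanoseconds before falling back to a blocking poll(2).
This is not the actual patch code; wait_for_events(), work_available(), and
the use of stdin are hypothetical stand-ins for the AioContext event loop,
the virtio-blk virtqueue kick / Linux AIO completion polling callbacks, and
the real event file descriptor.

/* poll_sketch.c -- illustrative only, not the QEMU implementation. */
#include <poll.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

static int64_t poll_max_ns;  /* 0 means polling disabled (the default) */

static void poll_init(void)
{
    const char *s = getenv("QEMU_AIO_POLL_MAX_NS");

    poll_max_ns = s ? strtoll(s, NULL, 10) : 0;
}

static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical polling callback: returns true when there is work to do,
 * e.g. a new virtqueue kick or a completed Linux AIO request. */
static bool work_available(void)
{
    return false;
}

/* Busy-poll for up to poll_max_ns, then fall back to blocking poll(2). */
static int wait_for_events(int fd)
{
    if (poll_max_ns > 0) {
        int64_t deadline = now_ns() + poll_max_ns;

        while (now_ns() < deadline) {
            if (work_available()) {
                return 1;  /* handled without a blocking syscall */
            }
        }
    }

    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    return poll(&pfd, 1, -1);  /* the usual blocking path */
}

int main(void)
{
    poll_init();
    /* stdin stands in for an AioContext file descriptor. */
    return wait_for_events(0) >= 0 ? 0 : 1;
}

The trade-off is the one already noted: while the loop spins it burns a host
CPU, so it only helps when spare physical CPU cycles are available, and
leaving QEMU_AIO_POLL_MAX_NS unset keeps the plain blocking behavior.
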
Here are some more numbers with higher values.  I continued the powers of
2 and added your suggested values as well:

QEMU_AIO_POLL_MAX_NS    IOPS
                 256  46,929
                 512  35,627
               1,024  46,477
               2,000  35,247
               2,048  46,322
               4,000  46,540
               4,096  46,368
               8,000  47,054
               8,192  46,671
              16,000  46,466
              16,384  32,504
              32,000  20,620
              32,768  20,807

>
> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.
>
>> I found the results for 4, 8, and 64 odd so I re-ran some tests to check
>> for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
>> is what I got:
>>
>> Iteration  QEMU_AIO_POLL_MAX_NS=2  QEMU_AIO_POLL_MAX_NS=4
>>         1                  46,972                  35,434
>>         2                  46,939                  35,719
>>         3                  47,005                  35,584
>>         4                  47,016                  35,615
>>         5                  47,267                  35,474
>>
>> So the results seem consistent.
>
> That is interesting.  I don't have an explanation for the consistent
> difference between 2 and 4 ns polling time.  The time difference is so
> small yet the IOPS difference is clear.
>
> Comparing traces could shed light on the cause for this difference.
>
>> I saw some discussion on the patches which makes me think you'll be
>> making some changes, is that right?  If so, I may wait for the updates
>> and then we can run the much more exhaustive set of workloads
>> (sequential read and write, random read and write) at various block
>> sizes (4, 8, 16, 32, 64, 128, and 256 KB) and multiple IO depths (1 and
>> 32) that we were doing when we started looking at this.
>
> I'll send an updated version of the patches.
>
> Stefan
>

-- 
Karl Rister <kris...@redhat.com>