On 02/26/2010 04:36 PM, Anthony Liguori wrote:
On 02/26/2010 02:47 AM, Avi Kivity wrote:
qcow2 is still not fully asynchronous. All the other format drivers
(except raw) are fully synchronous. If we had a threaded
infrastructure, we could convert them all in a day. As it is, you
can only use the other block format drivers in 'qemu-img convert'.
I've got a healthy amount of scepticism that it's that easy. But I'm
happy to consider patches :-)
I'd be happy to have time to write them.
If the device models are re-entrant, that removes much of the demand
on the qemu_mutex, which means the IO thread can run
uncontended. While we have evidence that the VCPU threads and IO
threads are competing with each other today, I don't think we have
any evidence to suggest that the IO thread is starving itself
with long-running events.
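As a rough illustration of the re-entrancy point (this is a sketch only,
not existing QEMU code, and the names are made up): a re-entrant device
model keeps a per-device lock instead of taking the global qemu_mutex
around every MMIO handler.

#include <pthread.h>
#include <stdint.h>

/* Sketch only: a re-entrant device protects its own state with a
 * per-device lock, so its MMIO handler no longer needs to hold the
 * global qemu_mutex. */
typedef struct Device {
    pthread_mutex_t lock;   /* covers only this device's state */
    uint32_t reg;
} Device;

static void device_mmio_write(Device *d, uint32_t val)
{
    pthread_mutex_lock(&d->lock);
    d->reg = val;           /* touch device-local state only */
    pthread_mutex_unlock(&d->lock);
}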
I agree we have no evidence and that this is all speculation. But
consider a 64-vcpu guest: it has a 1:64 ratio of vcpu time
(initiations) to iothread time (completions). If each vcpu generates
5000 initiations per second, the iothread needs to handle 320,000
completions per second. At that rate you will see some internal
competition. That thread will also have a hard time shuffling data
since every completion's data will reside in the wrong cpu cache.
Ultimately, it depends on what you're optimizing for. If you've got a
64-vcpu guest on a 128-way box, then sure, we want to have 64 IO
threads because that will absolutely increase throughput.
But realistically, it's more likely that if you've got a 64-vcpu
guest, you're on a 1024-way box and you've got 64 guests running at
once. Having 64 IO threads per VM means you've got 4k threads
floating around. It's still just as likely that one completion will get
delayed by something less important. Now with all of these threads on
a box like this, you get nasty NUMA interactions too.
I'm not suggesting scaling out - the number of vcpus (across all
guests) will usually be higher than the number of cpus. But if you have
multiple device threads, the scheduler has the flexibility to place them
around and fill bubbles. A single heavily loaded iothread is harder to
place well.
The difference between the two models is that with threads, we rely on
pre-emption to enforce fairness and on the Linux scheduler to make the
scheduling decisions. With a single IO thread, we're determining execution
order and priority ourselves.
We could define priorities with multiple threads as well (using thread
priorities), and we'd never have a short task delayed behind a long
task, unless the host is out of resources.
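A minimal sketch of what that could look like with POSIX thread
priorities (illustrative only; none of these names exist in QEMU, and
SCHED_FIFO normally requires CAP_SYS_NICE or a suitable RLIMIT_RTPRIO):

#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: start a completion-handling thread with an
 * explicit real-time priority so a short task is not delayed behind
 * long-running work on other threads. */
static pthread_t spawn_prio_thread(void *(*fn)(void *), void *arg, int prio)
{
    pthread_t tid;
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = prio };

    pthread_attr_init(&attr);
    /* Don't inherit the creator's policy, or the attributes below are
     * silently ignored. */
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);

    pthread_create(&tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return tid;
}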
A lot of main loops have a notion of priority for timer and idle
callbacks. For something that is latency sensitive, you absolutely
could introduce the concept of priority for bottom halves. It would
ensure that a +1 priority bottom half gets scheduled before any
lower-priority I/O or bottom halves are handled.
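As a hedged sketch of that idea (this is not the existing QEMU
bottom-half API; the list-based dispatch and names are illustrative
only):

/* Bottom halves carrying a priority; the pending list is kept sorted
 * so higher-priority BHs are dispatched first. */
typedef struct PrioBH PrioBH;
struct PrioBH {
    void (*cb)(void *opaque);
    void *opaque;
    int prio;               /* higher runs first */
    int scheduled;
    PrioBH *next;
};

static PrioBH *bh_list;     /* sorted, highest priority at the head */

static void prio_bh_schedule(PrioBH *bh)
{
    PrioBH **p = &bh_list;

    if (bh->scheduled) {
        return;
    }
    bh->scheduled = 1;
    while (*p && (*p)->prio >= bh->prio) {
        p = &(*p)->next;
    }
    bh->next = *p;
    *p = bh;
}

static void prio_bh_poll(void)
{
    while (bh_list) {
        PrioBH *bh = bh_list;
        bh_list = bh->next;
        bh->scheduled = 0;
        bh->cb(bh->opaque);
    }
}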
What if it becomes available after the low prio task has started to run?
Note that an alternative to multiple iothreads is to move completion
handling back to the vcpus, provided we can steer the handler close to
the guest's completion handler.
Looking at something like linux-aio, I think we might actually want to
do that. We can submit the request from the VCPU thread and we can
certainly program the signal to get delivered to that VCPU thread.
Maintaining affinity for the request is likely a benefit.
Likely to benefit when we have multiqueue virtio.
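As one hedged illustration of the affinity idea (using a per-vcpu
eventfd rather than a signal, with made-up names and no error
handling; not actual QEMU code): the VCPU thread that submits the
request also polls the eventfd, so the completion is reaped on the
same thread and cache that initiated the I/O.

#include <libaio.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <stdint.h>

struct vcpu_aio {
    io_context_t ctx;
    int efd;                /* polled from this VCPU thread's loop */
};

static void vcpu_aio_init(struct vcpu_aio *v)
{
    v->ctx = 0;
    io_setup(128, &v->ctx);
    v->efd = eventfd(0, 0);
}

static void vcpu_submit_read(struct vcpu_aio *v, int fd, void *buf,
                             size_t len, off_t offset)
{
    struct iocb cb, *cbs[1] = { &cb };

    io_prep_pread(&cb, fd, buf, len, offset);
    io_set_eventfd(&cb, v->efd);   /* completion pokes our eventfd */
    io_submit(v->ctx, 1, cbs);
}

static void vcpu_reap_completions(struct vcpu_aio *v)
{
    uint64_t n;
    struct io_event ev[16];

    if (read(v->efd, &n, sizeof(n)) == sizeof(n)) {
        int got = io_getevents(v->ctx, 1, 16, ev, NULL);
        /* run the per-request completion callbacks here */
        (void)got;
    }
}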
For host services though, it's much more difficult to isolate them
like this.
What do you mean by host services?
Things like VNC and live migration - things that aren't directly
related to a guest's activity. One model I can imagine is to continue
to relegate these things to a single IO thread, but then move
device-driven callbacks either back to the originating thread or to a
dedicated device callback thread. Host services generally have a much
lower priority.
Or just 'a thread'. Nothing prevents vnc or live migration from running
in a thread, using the current code.
I'm not necessarily claiming that this will never be the right thing
to do, but I don't think we really have the evidence today to
suggest that we should focus on this in the short term.
Agreed. We will start to see evidence (one way or the other) as
fully loaded 64-vcpu guests are benchmarked. Another driver may be
real-time guests; if a timer can be deferred by some block device
initiation or completion, then we can say goodbye to any realtime
guarantees we want to make.
I'm wary of making decisions based on performance of a 64-vcpu guest.
It's an important workload to characterize because it's an extreme
case but I think 64 1-vcpu guests will continue to be significantly
more important than 1 64-vcpu guest.
Agreed. 64-vcpu guests will make the headlines and marketing
checklists, though.
--
Do not meddle in the internals of kernels, for they are subtle and quick to
panic.