On 03/30/2010 04:13 PM, Anthony Liguori wrote:
On 03/30/2010 05:24 AM, Avi Kivity wrote:
On 03/30/2010 12:23 AM, Anthony Liguori wrote:
It's not sufficient. If you have a single thread that runs both
live migrations and timers, then timers will be backlogged behind
live migration, or you'll have to yield often. This is regardless
of the locking model (and of course having threads without fixing
the locking is insufficient as well, live migration accesses guest
memory so it needs the big qemu lock).
But what's the solution? Running every timer in its own thread?
We'll hit the same problem if we implement an arbitrary limit on the
number of threads.
A completion that's expected to take a couple of microseconds at most
can live in the iothread. A completion that's expected to take a
couple of milliseconds wants its own thread. We'll have to think
about anything in between.
vnc and migration can perform large amounts of work in a single
completion; they're limited only by the socket send rate and our
internal rate-limiting which are both outside our control. Most
device timers are O(1). virtio completions probably fall into the
annoying "have to think about it" department.
I think it may make more sense to have vcpu completions vs. io thread
completions and make vcpu completions target short-lived operations.
vcpu completions make sense when you can tell that a completion will
cause an interrupt injection and you have a good idea which cpu will be
interrupted.
What I'm skeptical of is whether converting virtio-9p or qcow2 to
handle each request in a separate thread is really going to
improve things.
Currently qcow2 isn't even fully asynchronous, so it can't fail to
improve things.
Unless it introduces more data corruptions which is my concern with
any significant change to qcow2.
It's possible to move qcow2 to a thread without any significant
change to it (simply run the current code in its own thread,
protected by a mutex). Further changes would be very incremental.
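(A minimal sketch of that first step, with made-up names and without
any of the real qcow2 code: the whole driver is serialized by a single
mutex and runs on its own thread, so nothing inside it has to change
yet.)

    #include <pthread.h>

    /* Sketch only: one lock serializing every (hypothetical) qcow2
     * entry point, so today's code can run unmodified on a dedicated
     * thread while vcpus and the iothread keep running. */
    static pthread_mutex_t qcow2_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Placeholder standing in for the existing synchronous write path. */
    static int qcow2_do_write_sync(long sector, const void *buf, int nb)
    {
        (void)sector; (void)buf; (void)nb;
        return 0;
    }

    /* What the qcow2 thread would run for each queued request. */
    int qcow2_threaded_write(long sector, const void *buf, int nb)
    {
        int ret;
        pthread_mutex_lock(&qcow2_lock);   /* whole driver, one lock */
        ret = qcow2_do_write_sync(sector, buf, nb);
        pthread_mutex_unlock(&qcow2_lock);
        return ret;
    }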
But that offers no advantage over what we have, which fails the
proof-by-example that threading makes the situation better.
It has an advantage, qcow2 is currently synchronous in parts:
block/qcow2-cluster.c: ret = bdrv_write(s->hd, (cluster_offset >> 9) + n_start,
block/qcow2.c: bdrv_write(s->hd, (meta.cluster_offset >> 9) + num - 1, buf, 1);
block/qcow2.c: bdrv_write(bs, sector_num, buf, s->cluster_sectors);
block/qcow2-cluster.c: ret = bdrv_read(bs->backing_hd, sector_num, buf, n1);
block/qcow2-cluster.c: ret = bdrv_read(s->hd, coffset >> 9, s->cluster_data, nb_csectors);
To convert qcow2 to be threaded, I think you would have to wrap the
whole thing in a lock, then convert the current asynchronous functions
to synchronous (that's the whole point, right?). At this point, you've
regressed performance because you can only handle one read/write
outstanding at a given time. So now you have to make the locking more
granular, but because we do layered block devices, you've got to make
most of the core block driver functions thread safe.
Not at all. The first conversion will be to keep the current code as
is, operating asynchronously, but running in its own thread. It will
still support multiple outstanding requests using the current state
machine code; the synchronous parts will remain synchronous relative
to the block device, but async relative to everything else. The second
stage will convert the state machine code to threaded code. This is
more difficult but not overly so: turn every dependency list into a mutex.
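(Roughly, and again with invented names: where the state machine today
parks a conflicting request on a dependency list and restarts it when
the blocking request completes, the threaded version would simply take
a mutex and sleep until the earlier allocation is done.)

    #include <pthread.h>

    /* Hypothetical per-image allocation state for the second stage. */
    struct alloc_state {
        pthread_mutex_t lock;   /* replaces the dependency list */
    };

    /* Placeholder for the real cluster allocation / metadata update. */
    static int do_cluster_allocation(struct alloc_state *s)
    {
        (void)s;
        return 0;
    }

    int allocate_cluster(struct alloc_state *s)
    {
        int ret;
        /* Old code: if a conflicting allocation is in flight, append
         * this request to its dependency list and come back when it
         * finishes.  Threaded code: just block on the mutex instead. */
        pthread_mutex_lock(&s->lock);
        ret = do_cluster_allocation(s);
        pthread_mutex_unlock(&s->lock);
        return ret;
    }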
Once you get basic data operations concurrent, which I expect won't be
so bad, then to get an improvement over the current code you have to
allow simultaneous access to metadata, which is where I think the vast
majority of the complexity will come from.
I have no plans to do that, all I want is qcow2 not to block vcpus.
btw, I don't think it's all that complicated; it's simple to lock
individual L2 blocks and the L1 block.
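(Purely illustrative, with invented names and data layout: one lock for
the L1 table and one per cached L2 table, so allocations that hit
different L2 tables can proceed in parallel. Locks are assumed to be
initialized elsewhere with pthread_mutex_init.)

    #include <pthread.h>

    #define L2_CACHE_SIZE 16        /* arbitrary for this sketch */

    /* Hypothetical fine-grained metadata locking: the L1 table has its
     * own lock, and each cached L2 table carries one too, so writes
     * that touch different L2 tables do not serialize on each other. */
    struct l2_table {
        unsigned long offset;       /* which on-disk L2 table is cached */
        pthread_mutex_t lock;
        /* ... cached entries would live here ... */
    };

    struct qcow2_meta {
        pthread_mutex_t l1_lock;    /* taken only to grow/update L1 */
        struct l2_table l2_cache[L2_CACHE_SIZE];
    };

    /* Find and lock the L2 table covering a cluster; the caller updates
     * the entry and unlocks it when done. */
    struct l2_table *lock_l2_for(struct qcow2_meta *m, unsigned long l2_offset)
    {
        struct l2_table *t = &m->l2_cache[l2_offset % L2_CACHE_SIZE];
        pthread_mutex_lock(&t->lock);
        return t;
    }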
You could argue that we stick qcow2 into a thread and stop there, and
that fixes the problems with synchronous data access. If that's the
argument, then let's not even bother doing it at the qcow layer; let's
just switch the block aio emulation to use a dedicated thread.
That's certainly the plan for vmdk and friends, which are useless today.
qcow2 deserves better treatment.
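(A rough sketch of that aio-emulation-in-a-thread idea, with invented
names: a driver that only has a synchronous read path gets each request
run on a worker thread, and the completion callback fires when it
returns. A real version would reuse a thread pool and signal the
iothread instead of calling the callback from the worker.)

    #include <pthread.h>
    #include <stdlib.h>

    /* Sketch of aio emulation for format drivers that are synchronous
     * today (vmdk and friends): run the driver's blocking read on a
     * worker thread and complete the request when it returns. */
    struct emul_req {
        int (*sync_read)(long sector, void *buf, int nb);  /* driver op */
        long sector;
        void *buf;
        int nb;
        void (*cb)(void *opaque, int ret);                 /* completion */
        void *opaque;
    };

    static void *emul_worker(void *arg)
    {
        struct emul_req *r = arg;
        int ret = r->sync_read(r->sector, r->buf, r->nb);  /* may block */
        r->cb(r->opaque, ret);   /* real code: bounce back to iothread */
        free(r);
        return NULL;
    }

    int emul_aio_read(const struct emul_req *tmpl)
    {
        struct emul_req *r = malloc(sizeof(*r));
        pthread_t tid;

        if (!r)
            return -1;
        *r = *tmpl;
        if (pthread_create(&tid, NULL, emul_worker, r)) {
            free(r);
            return -1;
        }
        pthread_detach(tid);
        return 0;
    }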
Sticking the VNC server in its own thread would be fine. Trying to
make the VNC server multithreaded, though, would be problematic.
Why would it be problematic? Each client gets its own threads, they
don't interact at all do they?
Dealing with locking of the core display, which each client uses for
rendering. Things like CopyRect will get ugly quickly. Ultimately,
this comes down to a question of lock granularity and thread
granularity. I don't think it's a good idea to start with the
assumption that we want extremely fine granularity. There's certainly
very low-hanging fruit with respect to threading.
Not familiar with the code, but doesn't vnc access the display core
through an API? Slap a lock onto that.
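(Sketch with invented names: client threads never touch the shared
surface directly; every read goes through a small accessor that takes a
display-wide lock, which is also where something like CopyRect would be
serialized against guest-side updates.)

    #include <pthread.h>
    #include <string.h>

    /* Hypothetical locked accessor around the shared display core.
     * VNC client threads read pixels only through calls like this, so
     * concurrent clients and guest updates serialize on one lock. */
    struct display_core {
        pthread_mutex_t lock;
        unsigned char *framebuffer;
        int stride;             /* bytes per scanline */
        int bpp;                /* bytes per pixel */
    };

    void display_read_rect(struct display_core *d, int x, int y,
                           int w, int h, unsigned char *out)
    {
        pthread_mutex_lock(&d->lock);
        for (int row = 0; row < h; row++) {
            memcpy(out + (size_t)row * w * d->bpp,
                   d->framebuffer + (size_t)(y + row) * d->stride
                                  + (size_t)x * d->bpp,
                   (size_t)w * d->bpp);
        }
        pthread_mutex_unlock(&d->lock);
    }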
I meant, exposing qemu core to the threads instead of pretending they
aren't there. I'm not familiar with 9p so I don't hold much of an
opinion, but didn't you say you need threads in order to handle async
syscalls? That may not be the deep threading we're discussing here.
btw, IIUC currently disk hotunplug will stall a guest, no? We need
async aio_flush().
But aio_flush() never takes a very long time, right :-)
We had this discussion in the past re: live migration because we do an
aio_flush() in the critical stage.
Live migration will stall a guest anyway. It doesn't matter if
aio_flush blocks for a few ms, since the final stage will dominate it.
--
Do not meddle in the internals of kernels, for they are subtle and quick to
panic.