On 03/30/2010 04:13 PM, Anthony Liguori wrote:
On 03/30/2010 05:24 AM, Avi Kivity wrote:
On 03/30/2010 12:23 AM, Anthony Liguori wrote:
It's not sufficient. If you have a single thread that runs both
live migrations and timers, then timers will be backlogged behind
live migration, or you'll have to yield often. This is regardless
of the locking model (and of course having threads without fixing
the locking is insufficient as well, live migration accesses guest
memory so it needs the big qemu lock).
But what's the solution? Running every timer in its own thread?
We'll hit the same problem if we implement an arbitrary limit on the
number of threads.
A completion that's expected to take a couple of microseconds at most
can live in the iothread. A completion that's expected to take a
couple of milliseconds wants its own thread. We'll have to think
about anything in between.
vnc and migration can perform large amounts of work in a single
completion; they're limited only by the socket send rate and our
internal rate-limiting which are both outside our control. Most
device timers are O(1). virtio completions probably fall into the
annoying "have to think about it" department.
I think it may make more sense to have vcpu completions vs. io thread
completions and make vcpu completions target short-lived operations.
vcpu completions make sense when you can tell that a completion will
cause an interrupt injection and you have a good idea which cpu will be
interrupted.
What I'm skeptical of is whether converting virtio-9p or qcow2 to
handle each request in a separate thread is really going to
improve things.
Currently qcow2 isn't even fully asynchronous, so it can't fail to
improve things.
Unless it introduces more data corruptions which is my concern with
any significant change to qcow2.
It's possible to move qcow2 to a thread without any significant
change to it (simply run the current code in its own thread,
protected by a mutex). Further changes would be very incremental.
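(A minimal sketch of that first step, with made-up names and without
any of the real qcow2 code: the whole driver is serialized by a single
mutex and runs on its own thread, so nothing inside it has to change
yet.)

    #include <pthread.h>

    /* Sketch only: one lock serializing every (hypothetical) qcow2
     * entry point, so today's code can run unmodified on a dedicated
     * thread while vcpus and the iothread keep running. */
    static pthread_mutex_t qcow2_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Placeholder standing in for the existing synchronous write path. */
    static int qcow2_do_write_sync(long sector, const void *buf, int nb)
    {
        (void)sector; (void)buf; (void)nb;
        return 0;
    }

    /* What the qcow2 thread would run for each queued request. */
    int qcow2_threaded_write(long sector, const void *buf, int nb)
    {
        int ret;
        pthread_mutex_lock(&qcow2_lock);   /* whole driver, one lock */
        ret = qcow2_do_write_sync(sector, buf, nb);
        pthread_mutex_unlock(&qcow2_lock);
        return ret;
    }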
But that offers no advantage over what we have, which fails the
proof-by-example that threading makes the situation better.
It has an advantage, qcow2 is currently synchronous in parts:
block/qcow2-cluster.c: ret = bdrv_write(s->hd, (cluster_offset >> 9) + n_start,
block/qcow2.c: bdrv_write(s->hd, (meta.cluster_offset >> 9) + num - 1, buf, 1);
block/qcow2.c: bdrv_write(bs, sector_num, buf, s->cluster_sectors);
block/qcow2-cluster.c: ret = bdrv_read(bs->backing_hd, sector_num, buf, n1);
block/qcow2-cluster.c: ret = bdrv_read(s->hd, coffset >> 9, s->cluster_data, nb_csectors);
To convert qcow2 to be threaded, I think you would have to wrap the
whole thing in a lock, then convert the current asynchronous functions
to synchronous (that's the whole point, right?). At this point, you've
regressed performance because you can only handle one read/write
outstanding at a given time. So now you have to make the locking more
granular, but because we do layered block devices, you've got to make
most of the core block driver functions thread safe.
Not at all. The first conversion will be to keep the current code as
is, operating asynchronously, but running in its own thread. It will
still support multiple outstanding requests using the current state
machine code; the synchronous parts will remain synchronous relative
to the block device, but async relative to everything else. The second
stage will convert the state machine code to threaded code. This is
more difficult but not overly so: turn every dependency list into a mutex.
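(Roughly, and again with invented names: where the state machine today
parks a conflicting request on a dependency list and restarts it when
the blocking request completes, the threaded version would simply take
a mutex and sleep until the earlier allocation is done.)

    #include <pthread.h>

    /* Hypothetical per-image allocation state for the second stage. */
    struct alloc_state {
        pthread_mutex_t lock;   /* replaces the dependency list */
    };

    /* Placeholder for the real cluster allocation / metadata update. */
    static int do_cluster_allocation(struct alloc_state *s)
    {
        (void)s;
        return 0;
    }

    int allocate_cluster(struct alloc_state *s)
    {
        int ret;
        /* Old code: if a conflicting allocation is in flight, append
         * this request to its dependency list and come back when it
         * finishes.  Threaded code: just block on the mutex instead. */
        pthread_mutex_lock(&s->lock);
        ret = do_cluster_allocation(s);
        pthread_mutex_unlock(&s->lock);
        return ret;
    }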
Once you get basic data operations concurrent, which I expect won't be
so bad, then to get an improvement over the current code you have to
allow simultaneous access to metadata, which is where I think the vast
majority of the complexity will come from.
I have no plans to do that, all I want is qcow2 not to block vcpus.
btw, I don't think it's all that complicated; it's simple to lock
individual L2 blocks and the L1 block.
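(Purely illustrative, with invented names and data layout: one lock for
the L1 table and one per cached L2 table, so allocations that hit
different L2 tables can proceed in parallel. Locks are assumed to be
initialized elsewhere with pthread_mutex_init.)

    #include <pthread.h>

    #define L2_CACHE_SIZE 16        /* arbitrary for this sketch */

    /* Hypothetical fine-grained metadata locking: the L1 table has its
     * own lock, and each cached L2 table carries one too, so writes
     * that touch different L2 tables do not serialize on each other. */
    struct l2_table {
        unsigned long offset;       /* which on-disk L2 table is cached */
        pthread_mutex_t lock;
        /* ... cached entries would live here ... */
    };

    struct qcow2_meta {
        pthread_mutex_t l1_lock;    /* taken only to grow/update L1 */
        struct l2_table l2_cache[L2_CACHE_SIZE];
    };

    /* Find and lock the L2 table covering a cluster; the caller updates
     * the entry and unlocks it when done. */
    struct l2_table *lock_l2_for(struct qcow2_meta *m, unsigned long l2_offset)
    {
        struct l2_table *t = &m->l2_cache[l2_offset % L2_CACHE_SIZE];
        pthread_mutex_lock(&t->lock);
        return t;
    }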
You could argue that we stick qcow2 into a thread and stop there, and
that fixes the problems with synchronous data access. If that's the
argument, then let's not even bother doing it at the qcow layer; let's
just switch the block aio emulation to use a dedicated thread.
That's certainly the plan for vmdk and friends, which are useless today.
qcow2 deserves better treatment.
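(A rough sketch of that aio-emulation-in-a-thread idea, with invented
names: a driver that only has a synchronous read path gets each request
run on a worker thread, and the completion callback fires when it
returns. A real version would reuse a thread pool and signal the
iothread instead of calling the callback from the worker.)

    #include <pthread.h>
    #include <stdlib.h>

    /* Sketch of aio emulation for format drivers that are synchronous
     * today (vmdk and friends): run the driver's blocking read on a
     * worker thread and complete the request when it returns. */
    struct emul_req {
        int (*sync_read)(long sector, void *buf, int nb);  /* driver op */
        long sector;
        void *buf;
        int nb;
        void (*cb)(void *opaque, int ret);                 /* completion */
        void *opaque;
    };

    static void *emul_worker(void *arg)
    {
        struct emul_req *r = arg;
        int ret = r->sync_read(r->sector, r->buf, r->nb);  /* may block */
        r->cb(r->opaque, ret);   /* real code: bounce back to iothread */
        free(r);
        return NULL;
    }

    int emul_aio_read(const struct emul_req *tmpl)
    {
        struct emul_req *r = malloc(sizeof(*r));
        pthread_t tid;

        if (!r)
            return -1;
        *r = *tmpl;
        if (pthread_create(&tid, NULL, emul_worker, r)) {
            free(r);
            return -1;
        }
        pthread_detach(tid);
        return 0;
    }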
Sticking the VNC server in its own thread would be fine. Trying to
make the VNC server multithreaded, though, would be problematic.
Why would it be problematic? Each client gets its own threads, they
don't interact at all do they?
Dealing with locking of the core display, which each client uses for
rendering. Things like CopyRect will get ugly quickly. Ultimately,
this comes down to a question of lock granularity and thread
granularity. I don't think it's a good idea to start with the
assumption that we want extremely fine granularity. There's certainly
very low-hanging fruit with respect to threading.
Not familiar with the code, but doesn't vnc access the display core
through an API? Slap a lock onto that.
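(Sketch with invented names: client threads never touch the shared
surface directly; every read goes through a small accessor that takes a
display-wide lock, which is also where something like CopyRect would be
serialized against guest-side updates.)

    #include <pthread.h>
    #include <string.h>

    /* Hypothetical locked accessor around the shared display core.
     * VNC client threads read pixels only through calls like this, so
     * concurrent clients and guest updates serialize on one lock. */
    struct display_core {
        pthread_mutex_t lock;
        unsigned char *framebuffer;
        int stride;             /* bytes per scanline */
        int bpp;                /* bytes per pixel */
    };

    void display_read_rect(struct display_core *d, int x, int y,
                           int w, int h, unsigned char *out)
    {
        pthread_mutex_lock(&d->lock);
        for (int row = 0; row < h; row++) {
            memcpy(out + (size_t)row * w * d->bpp,
                   d->framebuffer + (size_t)(y + row) * d->stride
                                  + (size_t)x * d->bpp,
                   (size_t)w * d->bpp);
        }
        pthread_mutex_unlock(&d->lock);
    }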
I meant, exposing qemu core to the threads instead of pretending they
aren't there. I'm not familiar with 9p so I don't hold much of an
opinion, but didn't you say you need threads in order to handle async
syscalls? That may not be the deep threading we're discussing here.
btw, IIUC currently disk hotunplug will stall a guest, no? We need
async aio_flush().
But aio_flush() never takes a very long time, right :-)
We had this discussion in the past re: live migration because we do an
aio_flush() in the critical stage.
Live migration will stall a guest anyway. It doesn't matter if
aio_flush blocks for a few ms, since the final stage will dominate it.
--
Do not meddle in the internals of kernels, for they are subtle and quick to
panic.