On 19/10/22 2:30 am, Peter Xu wrote:
On Tue, Oct 18, 2022 at 10:51:12AM -0400, Peter Xu wrote:
On Tue, Oct 18, 2022 at 09:18:28AM +0100, Daniel P. Berrangé wrote:
On Mon, Oct 17, 2022 at 05:15:35PM -0400, Peter Xu wrote:
On Mon, Oct 17, 2022 at 12:38:30PM +0100, Daniel P. Berrangé wrote:
On Mon, Oct 17, 2022 at 01:06:00PM +0530, manish.mishra wrote:
Hi Daniel,
I was thinking of some solutions for this, so I wanted to discuss them
before going ahead. Also added Juan and Peter in the loop.
1. Earlier I was thinking: on the destination side, as of now, the first
data sent on the default and multifd channels is the MAGIC_NUMBER and
VERSION, so maybe we can decide the mapping based on that. But that does
not work for the newly added postcopy preempt channel, as it does not
send any magic number. Also, even for multifd, the magic number alone does
not tell which multifd channel number it is, though as per my thinking
that does not matter. So the magic number should be good enough for
identifying the default vs multifd channels?
Yep, you don't need to know more than the MAGIC value.
In migration_io_process_incoming, we need to use MSG_PEEK to look at
the first 4 bytes pending on the wire. If those bytes are 'QEVM', that's
the primary channel; if those bytes are big endian 0x11223344, that's
a multifd channel. Using MSG_PEEK avoids the need to modify the later
code that actually reads this data.
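A minimal sketch of that peek, assuming a plain (non-TLS) connected
socket fd; the constants mirror the values above, but this is an
illustration, not the actual patch:

#include <stdint.h>
#include <sys/socket.h>
#include <arpa/inet.h>

#define VM_FILE_MAGIC  0x5145564d   /* 'QEVM': primary channel */
#define MULTIFD_MAGIC  0x11223344   /* multifd channel */

/* Peek at the first 4 bytes without consuming them, so the existing
 * readers still see the magic afterwards. */
static int classify_channel(int fd)
{
    uint32_t magic;
    ssize_t n = recv(fd, &magic, sizeof(magic), MSG_PEEK | MSG_WAITALL);

    if (n != sizeof(magic)) {
        return -1;                /* error or short read */
    }
    magic = ntohl(magic);         /* wire values are big endian */
    if (magic == VM_FILE_MAGIC) {
        return 0;                 /* main migration channel */
    }
    if (magic == MULTIFD_MAGIC) {
        return 1;                 /* multifd channel */
    }
    return -1;                    /* unknown */
}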
The challenge is how long to wait with the MSG_PEEK. If we do it
in a blocking mode, it's fine for the main channel and multifd, but
IIUC for the post-copy pre-empt channel we'd be waiting for
something that will never arrive.
Having suggested MSG_PEEK though, this may well not work if the
channel has TLS present. In fact it almost definitely won't work.
To cope with TLS, migration_io_process_incoming would need to
actually read the data off the wire, and the later methods would need
to be taught to skip reading the magic.
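For the TLS case the peek would have to become a real read on the
decrypted channel, with the magic handed to the caller rather than
re-read later. A hypothetical sketch using the existing
qio_channel_read_all() helper (the surrounding function is made up):

/* Consume the magic from the (decrypted) channel; the later readers
 * must then be taught that these 4 bytes are already gone. */
static int read_channel_magic(QIOChannel *ioc, uint32_t *magic,
                              Error **errp)
{
    if (qio_channel_read_all(ioc, (char *)magic, sizeof(*magic),
                             errp) < 0) {
        return -1;
    }
    *magic = be32_to_cpu(*magic);
    return 0;
}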
2. For postcopy preempt, maybe we can initiate this channel only
after we have received a request from the remote, e.g. a remote page
fault. This to me looks safest, considering the postcopy recovery case
too. I can not think of any dependency on the postcopy preempt channel
which requires it to be initialised very early. Maybe Peter can confirm
this.
I guess that could work
Currently all the preempt code still assumes that when postcopy is
activated it's in preempt mode. IIUC such a change will bring an extra
phase of postcopy with no preempt before preempt is enabled. We may need
to teach qemu to understand that if it's needed.
Meanwhile the initial page requests will not be able to benefit from the
new preempt channel either.
3. Another thing we can do is to have a 2-way handshake on every
channel creation with some additional metadata. This to me looks
like the cleanest and most durable approach. I understand that can
break migration to/from old qemu, but then that can come as a
migration capability?
The benefit of (1) is that the fix can be deployed for all existing
QEMU releases by backporting it. (3) will meanwhile need mgmt app
updates to make it work, which is much more work to deploy.
We really should have had a more formal handshake, and I've described
ways to achieve this in the past, but it is quite a lot of work.
I don't know whether (1) is a valid option if there are use cases that it
cannot cover (with either TLS or preempt). The handshake is definitely the
clean approach.
What's the outcome of such wrongly ordered connections? Will migration
fail immediately and safely?
For multifd, I think it should fail immediately after the connection
established.
For preempt, I'd also expect the same thing, because the only wrong order
that can happen right now is the preempt channel being taken as the
migration channel; then it should also fail immediately on the first
qemu_get_byte().
Hopefully that's still not too bad - I mean, if we can fail consistently
and safely (never fail during postcopy), we can always retry, and as long
as the connections are created successfully we can start the migration
safely. But please correct me if that's not the case.
It should typically fail as the magic bytes are different, which will not
pass validation. The exception being the postcopy pre-empt channel which
may well cause migration to stall as nothing will be sent initially by
the src.
Hmm right..
Actually, if the preempt channel is special, we can fix it alone. As both
of you discussed, we can postpone the preempt channel setup, maybe not as
late as when we receive the 1st page request, but:
(1) For newly established migration, we can postpone preempt channel
setup (postcopy_preempt_setup, resume=false) to the entrance of
postcopy_start().
(2) For a postcopy recovery process, we can postpone preempt channel
setup (postcopy_preempt_setup, resume=true) to postcopy_do_resume(),
maybe between qemu_savevm_state_resume_prepare() and the final
handshake of postcopy_resume_handshake().
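A rough sketch of (1), assuming postcopy_preempt_setup() keeps a
signature along these lines (the exact arguments are illustrative, and
error handling is abbreviated):

static int postcopy_start(MigrationState *ms)
{
    Error *local_err = NULL;

    /* Establish the preempt channel only now, once the main channel
     * is already known to the dest, so it cannot be misordered. */
    if (migrate_postcopy_preempt() &&
        postcopy_preempt_setup(ms, &local_err)) {
        error_report_err(local_err);
        return -1;
    }

    /* ... existing postcopy_start() logic continues here ... */
    return 0;
}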
Yes Peter, agree postcopy_start and postcopy_do_resume should also work, as
by then we already have some 2-way communication; e.g. for the non-recovery
case we send a ping cmd, so probably we can block in postcopy_start till we
get the pong reply. Also for postcopy_do_resume, probably after the response
to MIG_CMD_POSTCOPY_RESUME.
I need to try and test the above idea a bit. But the same trick may not
play well with multifd even if it works.
I had one concern: during recovery we do not send any magic. As of now we do
not support multifd with postcopy, so it should be fine; we can do explicit
checking for the non-recovery case. But I remember from some discussion that
in the future there may be support for multifd with postcopy, or multiple
postcopy preempt channels, and then a proper handshake will be required? So
at some point we want to take that path? For now I agree approach 1 will be
good; as suggested by Daniel, it can be backported easily to older qemu's
too.
The sender side is relatively easy, because the migration thread can move on
without the preempt channel; the main thread will keep taking care of it, and
when it's connected it can notify the migration thread.
It seems trickier on the dest node, where the migration loading thread is
only a coroutine of the main thread, so while loading the VM I don't really
see how further socket connections can be established. Right now it's okay
with the thread being shared, because we only do migration_incoming_process()
and enter the coroutine once all channels are ready.
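For reference, that dest-side gating could look roughly like this - a
hypothetical sketch, with channels_ready/channels_expected made-up fields
(the real code keys off migration_has_all_channels()):

/* Count channels as they connect and only enter the loading
 * coroutine once every expected channel is ready. */
static void incoming_channel_connected(MigrationIncomingState *mis)
{
    if (++mis->channels_ready == mis->channels_expected) {
        /* All channels present: safe to start loading the VM. */
        migration_incoming_process();
    }
}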