* Jag Raman (jag.ra...@oracle.com) wrote:
> Hi,

Hi Jag,
> I'd like to check about the behavior of a QEMU instance when live
> migration is cancelled.
>
> If the migration of a guest OS from source QEMU instance to destination
> instance is cancelled, the destination instance exits with a failure
> code. Could you please explain why this design decision was taken?

There isn't really any communication of the cancellation - the source
just stops, and the destination is left to conclude it's got an
incomplete migration stream.

> I'm wondering if it's OK to change the behavior of the destination
> instance in this case. Would it be OK for the destination instance to
> not exit with the failure code, and instead retry processing incoming
> migration? I'm dealing with an internal bug report that's asking whether
> it would make more sense for the destination process to hang around for
> another attempt at migration than to be killed.

Can you explain why you'd want that to happen in the case of a
cancellation? I can see why you might want to do it in the event of a
network failure (which is the case Peter is dealing with for postcopy),
but why after a cancellation?

> We came across Peter's planned changes to migration postcopy[1] which
> indicate that migration-cancellation is planned to be enhanced during
> the postcopy phase. Are there any such enhancements planned for the
> active phase as well?

Not planned; as far as I know this is the first time anyone has asked
for it. It's probably possible to reuse some of Peter's code for it.
Essentially you have many of the same problems; in particular, you don't
quite know how much of the data sent was actually received by the
destination (hmm, I wonder if we can modify Peter's code to allow it
before it goes in :-)

There may be some gotchas to do with the exact point at which the
cancellation happened (e.g. restarting after you've started serialising
the device state may be trickier).

> I'd also like to know the difference between qemu_fclose() &
> qemu_file_shutdown(). The source instance currently uses the shutdown
> function to terminate the connection between source & destination. But
> it seems to disconnect the connection abruptly. Whereas fclose function
> seems to disconnect it more gracefully. When I dug deeper, I couldn't
> specifically tell the difference between the two. I'd like to know if I
> could substitute the shutdown function with fclose function in
> migrate_fd_cancel().

close() closes the file descriptor, and at that point it's no longer
valid and could get reallocated. So, for example, if another thread is
still using it then that thread (e.g. the migration thread) is suddenly
trying to use a non-existent fd; even worse, that fd could have been
reallocated and it ends up writing migration data to a disk or network
socket. As you say, 'close' is graceful - the downside is that it waits
until all its data has been sent and for some part of the TCP socket
close to happen (I don't know the details).

shutdown() doesn't deallocate the fd; it just forces all operations on
it to error and return. The nice part of this is that if the networking
to the destination host has broken and you have a hung socket, you can
still perform the shutdown() and then restart a migration to a different
host. With close() you might have to wait tens of minutes for the TCP
socket to eventually give up before erroring.
Because shutdown() doesn't deallocate the fd, the migration threads just
take the failure paths and don't do anything illegal; they drop out at the
end with a failed migration, whichever part of the code they happened to
have been stuck in.

Dave

> [1]: https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg05892.html
>
> Thanks!
> --
> Jag

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK