* Jag Raman (jag.ra...@oracle.com) wrote:
> Hi,

Hi Jag,
> I'd like to check about the behavior of a QEMU instance when live
> migration is cancelled.
>
> If the migration of a guest OS from source QEMU instance to destination
> instance is cancelled, the destination instance exits with a failure
> code. Could you please explain why this design decision was taken?

There isn't really any communication of the cancellation - the source
just stops, and the destination is left to conclude it's got an
incomplete migration stream.

> I'm wondering if it's OK to change the behavior of the destination
> instance in this case. Would it be OK for the destination instance to
> not exit with the failure code, and instead retry processing incoming
> migration? I'm dealing with an internal bug report that's asking whether
> it would make more sense for the destination process to hang around for
> another attempt at migration than to be killed.

Can you explain why you'd want that to happen in the case of a
cancellation? I can see why you might want to do it in the event of a
network failure (which is the case Peter is dealing with for postcopy),
but why after a cancellation?

> We came across Peter's planned changes to migration postcopy[1] which
> indicate that migration-cancellation is planned to be enhanced during
> the postcopy phase. Are there any such enhancements planned for the
> active phase as well?

Not planned; as far as I know this is the first time anyone has asked
for it. It's probably possible to reuse some of Peter's code for it.
Essentially you have many of the same problems; in particular, you don't
quite know how much of the data sent was actually received by the
destination (hmm, I wonder if we can modify Peter's code to allow it
before it goes in :-)

There may be some gotchas to do with the exact point at which the
cancellation happened (e.g. restarting after you've started serialising
the device state may be trickier).

> I'd also like to know the difference between qemu_fclose() &
> qemu_file_shutdown(). The source instance currently uses the shutdown
> function to terminate the connection between source & destination. But
> it seems to disconnect the connection abruptly. Whereas fclose function
> seems to disconnect it more gracefully. When I dug deeper, I couldn't
> specifically tell the difference between the two. I'd like to know if I
> could substitute the shutdown function with fclose function in
> migrate_fd_cancel().

close() closes the file descriptor, and at that point it's no longer
valid and could get reallocated. So, for example, if another thread is
still using it then that thread (e.g. the migration thread) is suddenly
trying to use a non-existent fd; even worse, that fd could have been
reallocated and it ends up writing migration data to a disk or network
socket. As you say, 'close' is graceful - the downside is that it waits
until all its data has been sent and for some part of the TCP socket
close to happen (I don't know the details).

shutdown() doesn't deallocate the fd; it just forces all operations on
it to error and return. The nice part of this is that if the networking
to the destination host has broken and you have a hung socket, you can
still perform the shutdown() and then restart a migration to a different
host. With close() you might have to wait tens of minutes for the TCP
socket to eventually give up before erroring.
Because shutdown() doesn't deallocate the fd, the migration threads just
take the failure paths and don't do anything illegal; they drop out at the
end with a failed migration, whichever part of the code they happened to
have been stuck in.

Dave

> [1]: https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg05892.html
>
> Thanks!
> --
> Jag

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK