From: "Daniel P. Berrange" <berra...@redhat.com> There are two common cases where migrate_cancel is intended to be used
1. When migration is not converging due to an overactive guest and insufficient network bandwidth 2. When migration is stuck due a network outage, waiting for the TCP transmit timeout to occurr & return an I/O error for send() In the second case, if you attempt to use 'migrate_cancel' it will also get stuck. This can be seen by attempting to migrate to a QEMU which has been SIGSTOP'd $ ./x86_64-softmmu/qemu-system-x86_64 -cdrom ~/boot.iso -m 600 \ -monitor stdio -vnc :2 -incoming tcp:localhost:9000 QEMU 0.14.1 monitor - type 'help' for more information (qemu) <Ctrl-Z> [1]+ Stopped And in another shell $ ./x86_64-softmmu/qemu-system-x86_64 -cdrom ~/boot.iso -m 600 \ -monitor stdio -vnc :1 QEMU 0.14.1 monitor - type 'help' for more information (qemu) migrate -d tcp:localhost:9000 (qemu) info migrate Migration status: active transferred ram: 416 kbytes remaining ram: 621624 kbytes total ram: 623040 kbytes (qemu) migrate_cancel This last command will never return, until the first QEMU is resumed. Looking at the stack trace in GDB you see #0 0x0000003a8320e4c2 in __libc_send (fd=10, buf=0x1bc7c70, n=19777, flags=0) at ../sysdeps/unix/sysv/linux/x86_64/send.c:28 #1 0x000000000048fb1e in socket_write (s=<optimized out>, buf=<optimized out>, size=<optimized out>) at migration-tcp.c:39 #2 0x000000000048eba4 in migrate_fd_put_buffer (opaque=0x1b76ad0, data=0x1bc7c70, size=19777) at migration.c:324 #3 0x000000000048e442 in buffered_flush (s=0x1b76b90) at buffered_file.c:87 #4 0x000000000048e4cf in buffered_close (opaque=0x1b76b90) at buffered_file.c:177 #5 0x0000000000496d57 in qemu_fclose (f=0x1bbfc10) at savevm.c:479 #6 0x000000000048f4ca in migrate_fd_cleanup (s=0x1b76ad0) at migration.c:291 #7 0x000000000048f035 in do_migrate_cancel (mon=<optimized out>, qdict=<optimized out>, ret_data=<optimized out>) at migration.c:136[snip] [snip] The migration_fd_cleanup method is where the problem really starts. Specifically it does if (s->file) { DPRINTF("closing file\n"); if (qemu_fclose(s->file) != 0) { ret = -1; } s->file = NULL; } if (s->fd != -1) close(s->fd); And gets stuck in the qemu_fclose() bit because that method (rightly) tries to flush all outstanding buffers before closing. Unfortunately while this is desirable when migration ends successfully, it is undesirable when we are failing/cancelling migration. It is hard to tell qemu_fclose() that it shouldn't flush buffers directly, so the alternative is to ensure that this method fails quickly when it attempts I/O. This is easily achieved, simply by closing 's->fd' before calling qemu_fclose. Signed-off-by: Daniel P. Berrange <berra...@redhat.com> --- migration.c | 7 +++++++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/migration.c b/migration.c index f5959b4..a432c3b 100644 --- a/migration.c +++ b/migration.c @@ -286,6 +286,13 @@ int migrate_fd_cleanup(FdMigrationState *s) qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL); + if ((s->state == MIG_STATE_ERROR || + s->state == MIG_STATE_CANCELLED) && + s->fd != -1) { + close(s->fd); + s->fd = -1; + } + if (s->file) { DPRINTF("closing file\n"); if (qemu_fclose(s->file) != 0) { -- 1.7.6