On 2011-08-04 22:00, Michael Tokarev wrote: > 04.08.2011 23:52, Michael Tokarev пишет: >> 04.08.2011 23:24, Luiz Capitulino пишет: >>> On Wed, 3 Aug 2011 18:51:44 +0400 >>> Michael Tokarev <m...@tls.msk.ru> wrote: >>> >>>> If we do, it results in double monitor_resume() (second being called >>>> from migrate_fd_cleanup() anyway) and monitor suspend count becoming >>>> negative. >>> >>> Are you hitting an specific issue or did you find this by code inspection? >> >> The specific case I'm talking about can trivially be triggered. >> Run migrate "exec:nosuchfile" and your monitor will be stuck >> forever. >> >>> IIRC, I asked Marcelo to add the monitor_resume() call in the fix for >>> e447b1a603 because the monitor wasn't being resumed in some cases. Don't >>> remember which though, do you Marcelo? >> >> And this (e447b1a603) is actually exactly the same thing: when the >> child terminates before qemu completes sending stuff to it. >> >>> I see two possibilities here: >>> >>> 1. After e447b1a603 there was some change that made the monitor_resume() >>> call >>> in migrate_fd_put_buffer() unnecessary >>> >>> 2. We're calling it in the wrong place >>> >>> Taking a quick look at the code I see that migrate_fd_cleanup() doesn't >>> seem to be called when qemu_savevm_state_iterate() fails, for example. >> >> Yes indeed, but that will lead to a havoc in other places, like >> leaving callback notifier registered even after migration gets >> cancelled, or keeping the file open. >> >> Note that qemu_savevm_state_iterate() gets called from a callback >> routine. >> >> What I observe here is that when the exec: process exits prematurely, >> qemu will continue writing stuff to pipe up till migrate_fd_put_buffer() >> hits error (EPIPE iirc), at which point we'll call mon->resume() twice. >> The same happens in case of tcp migration when the other end closes >> the connection. >> >> I don't know when qemu_savevm_state_iterate() may fail, or why >> qemu_file_has_error() in there does not return true. >> >> Maybe the problem here is much more severe actually, -- that's why >> I initially asked what it is all about, -- before offering the patch. > > This looks like a race condition: which of the two places > will detect the error first. > > Sometimes I see this (-monitor stdio): > > (qemu) migrate "exec:nothing" > sh: nothing: not found > > (qemu) > > and after this, monitor is stuck as I described. But sometimes > I see this: > > (qemu) migrate "exec:nothing" > sh: nothing: not found > (qemu) > > And it continues working. Note the extra newline. > If the error gets detected sooner (like in case of > exec:nothing, or even better, after removing > /bin/sh so that system(3) will fail), the chance > to have stuck monitor is much higher. If the > error gets detected later, it will survive most > likely. Or something like that anyway.
Resume on error in migrate_fd_put_buffer raced with migrate_fd_cleanup triggered via migrate_fd_put_ready called from migrate_fd_connect. This migration code is a horrible maze. Patch below tries to move monitor_resume a bit out of this. Please check if it works for you, then I'll send it out properly. Thanks, Jan --- diff --git a/migration.c b/migration.c index 2a15b98..756fa62 100644 --- a/migration.c +++ b/migration.c @@ -292,18 +292,17 @@ int migrate_fd_cleanup(FdMigrationState *s) ret = -1; } s->file = NULL; + } else { + if (s->mon) { + monitor_resume(s->mon); + } } - if (s->fd != -1) + if (s->fd != -1) { close(s->fd); - - /* Don't resume monitor until we've flushed all of the buffers */ - if (s->mon) { - monitor_resume(s->mon); + s->fd = -1; } - s->fd = -1; - return ret; } @@ -330,9 +329,6 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) if (ret == -EAGAIN) { qemu_set_fd_handler2(s->fd, NULL, NULL, migrate_fd_put_notify, s); } else if (ret < 0) { - if (s->mon) { - monitor_resume(s->mon); - } s->state = MIG_STATE_ERROR; notifier_list_notify(&migration_state_notifiers, NULL); } @@ -458,6 +454,9 @@ int migrate_fd_close(void *opaque) { FdMigrationState *s = opaque; + if (s->mon) { + monitor_resume(s->mon); + } qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL); return s->close(s); }
signature.asc
Description: OpenPGP digital signature