04.08.2011 23:52, Michael Tokarev пишет: > 04.08.2011 23:24, Luiz Capitulino пишет: >> On Wed, 3 Aug 2011 18:51:44 +0400 >> Michael Tokarev <m...@tls.msk.ru> wrote: >> >>> If we do, it results in double monitor_resume() (second being called >>> from migrate_fd_cleanup() anyway) and monitor suspend count becoming >>> negative. >> >> Are you hitting an specific issue or did you find this by code inspection? > > The specific case I'm talking about can trivially be triggered. > Run migrate "exec:nosuchfile" and your monitor will be stuck > forever. > >> IIRC, I asked Marcelo to add the monitor_resume() call in the fix for >> e447b1a603 because the monitor wasn't being resumed in some cases. Don't >> remember which though, do you Marcelo? > > And this (e447b1a603) is actually exactly the same thing: when the > child terminates before qemu completes sending stuff to it. > >> I see two possibilities here: >> >> 1. After e447b1a603 there was some change that made the monitor_resume() >> call >> in migrate_fd_put_buffer() unnecessary >> >> 2. We're calling it in the wrong place >> >> Taking a quick look at the code I see that migrate_fd_cleanup() doesn't >> seem to be called when qemu_savevm_state_iterate() fails, for example. > > Yes indeed, but that will lead to a havoc in other places, like > leaving callback notifier registered even after migration gets > cancelled, or keeping the file open. > > Note that qemu_savevm_state_iterate() gets called from a callback > routine. > > What I observe here is that when the exec: process exits prematurely, > qemu will continue writing stuff to pipe up till migrate_fd_put_buffer() > hits error (EPIPE iirc), at which point we'll call mon->resume() twice. > The same happens in case of tcp migration when the other end closes > the connection. > > I don't know when qemu_savevm_state_iterate() may fail, or why > qemu_file_has_error() in there does not return true. > > Maybe the problem here is much more severe actually, -- that's why > I initially asked what it is all about, -- before offering the patch.
This looks like a race condition: which of the two places will detect the error first. Sometimes I see this (-monitor stdio): (qemu) migrate "exec:nothing" sh: nothing: not found (qemu) and after this, monitor is stuck as I described. But sometimes I see this: (qemu) migrate "exec:nothing" sh: nothing: not found (qemu) And it continues working. Note the extra newline. If the error gets detected sooner (like in case of exec:nothing, or even better, after removing /bin/sh so that system(3) will fail), the chance to have stuck monitor is much higher. If the error gets detected later, it will survive most likely. Or something like that anyway. /mjt