After some debugging I found a bug in the migration error handling, but I'm not sure how to fix it.
When migration starts, the monitor gets suspended by a call to monitor_suspend(), which increments the associated suspend_cnt counter. At the end of migration, migrate_fd_cleanup() calls monitor_resume(), which decrements the counter.

But monitor_resume() is also called from another place, migrate_fd_put_buffer(), when we encounter a write error. So suppose a tcp endpoint has disconnected, or the exec: program terminated due to an error, or whatnot. In all these cases the write will fail, and we'll call monitor_resume() twice as a result: once in migrate_fd_put_buffer(), and once more at the end in migrate_fd_cleanup().

This results in suspend_cnt being decremented twice, leaving it at -1. monitor_can_read() compares suspend_cnt with 0, so it will return 0 from now on, and hence the monitor will stop working.

To me it looks like the monitor_resume() call should be removed from migrate_fd_put_buffer(), but I'm not sure _why_ it was there in the first place.

There's more: monitor_suspend() gets called from within the protocol handlers (via the migrate_fd_monitor_suspend() routine). Are we sure that all current and future protocol handlers will call this function?

Thanks!

/mjt
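
P.S. In case it helps to see the counter arithmetic in isolation, here is a minimal standalone model of the sequence described above. It is not the QEMU code: the toy_* names and the write_fails flag are made up for illustration, and only the counter is modelled, assuming resume is an unconditional decrement and can_read is a plain comparison with 0.

```c
/* Toy stand-in for the monitor; only the suspend counter is modelled. */
struct toy_monitor {
    int suspend_cnt;
};

/* Models monitor_suspend(): one increment per call. */
void toy_suspend(struct toy_monitor *m)
{
    m->suspend_cnt++;
}

/* Models monitor_resume(): one decrement per call, no lower bound. */
void toy_resume(struct toy_monitor *m)
{
    m->suspend_cnt--;
}

/* Models monitor_can_read(): readable only when the counter is exactly 0. */
int toy_can_read(const struct toy_monitor *m)
{
    return m->suspend_cnt == 0;
}

/*
 * Models one migration attempt. write_fails (hypothetical flag) selects
 * the error path. Returns the counter value after cleanup.
 */
int toy_migrate(struct toy_monitor *m, int write_fails)
{
    toy_suspend(m);        /* migration starts */
    if (write_fails) {
        toy_resume(m);     /* extra resume, as in migrate_fd_put_buffer() */
    }
    toy_resume(m);         /* final resume, as in migrate_fd_cleanup() */
    return m->suspend_cnt;
}
```

Starting from suspend_cnt == 0, toy_migrate(&m, 0) leaves the counter at 0 and toy_can_read() returns 1. With write_fails set, the counter ends at -1 and toy_can_read() returns 0 from then on, which matches the hang I'm seeing.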