When postcopy migration starts, the source side sends all non-postcopiable device data in one package command and immediately transitions to the "postcopy-active" state. However, if the destination side fails to load the device data or crashes while loading it, the source side stays paused indefinitely with no way to recover.

This series introduces a new "postcopy-setup" state, during which the destination side is guaranteed not to have been started yet, so the source side can recover and resume, and the destination side can exit gracefully. The key element of this feature is isolating the postcopy-run command from the non-postcopiable data and sending it only after the destination side acknowledges that it has loaded all devices and is ready to be started. This is necessary because, once the postcopy-run command is sent, the source side cannot be sure whether the destination is running, and therefore whether it can safely resume in case of a failure.

Reusing the existing ping/pong messages was also considered, as PING 3 is sent right before the postcopy-run command. However, there are two reasons why the PING 3 message might not be delivered to the source side:

 1. The destination machine failed; it is not running, and the source side can resume.
 2. There is a network failure, so PING 3 delivery fails, but until TCP or another transport times out, the destination could process the postcopy-run command and start, in which case the source side cannot resume.

The source side cannot tell these two cases apart, so PING 3 cannot serve as the acknowledgement.

Furthermore, this series contains two more patches required for the implementation of this feature, which make the listen thread joinable for graceful cleanup and detach it explicitly otherwise, and one patch fixing state transitions inside postcopy_start().

A similar feature could also be useful for normal (precopy-only) migration with a return path, to prevent issues when a network failure happens just as the destination side shuts the return path. When I tested such a scenario (by filtering out the SHUT command), the destination started and reported a successful migration, while the source side reported a failed migration and tried to resume, but exited because it failed to acquire the disk image file lock.
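
To illustrate the recovery rule described above, here is a minimal sketch in C of how the source side could decide whether it may resume after a failure. It is only a simplified model, not code from this series: the SrcState and SrcEvent enums and the src_can_resume() and src_step() helpers are hypothetical names, and the real implementation works with QEMU's migration states and return-path messages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical source-side states; a simplified model, not QEMU's enum. */
typedef enum {
    SRC_DEVICE,           /* sending non-postcopiable device data */
    SRC_POSTCOPY_SETUP,   /* device data sent, waiting for destination ack */
    SRC_POSTCOPY_ACTIVE,  /* postcopy-run sent, destination may be running */
    SRC_RESUMED,          /* migration failed, source VM is running again */
    SRC_PAUSED,           /* migration failed, source must not resume */
} SrcState;

/* Hypothetical events observed by the source side. */
typedef enum {
    EV_DEVICE_DATA_SENT,  /* package with device data was sent */
    EV_DEST_ACK,          /* destination loaded all devices and is ready */
    EV_FAILURE,           /* destination error or channel failure */
} SrcEvent;

/*
 * The source may resume only while postcopy-run has not been sent,
 * because until then the destination cannot have been started.
 */
static bool src_can_resume(SrcState s)
{
    return s == SRC_DEVICE || s == SRC_POSTCOPY_SETUP;
}

/* Advance the model by one event and return the new state. */
static SrcState src_step(SrcState s, SrcEvent ev)
{
    assert(s >= SRC_DEVICE && s <= SRC_PAUSED);

    if (ev == EV_FAILURE) {
        if (s == SRC_RESUMED || s == SRC_PAUSED) {
            return s; /* already in a terminal state of this model */
        }
        return src_can_resume(s) ? SRC_RESUMED : SRC_PAUSED;
    }

    switch (s) {
    case SRC_DEVICE:
        return ev == EV_DEVICE_DATA_SENT ? SRC_POSTCOPY_SETUP : s;
    case SRC_POSTCOPY_SETUP:
        /* postcopy-run is sent only here, after the acknowledgement */
        return ev == EV_DEST_ACK ? SRC_POSTCOPY_ACTIVE : s;
    default:
        return s;
    }
}
```

In this model, a failure seen in SRC_POSTCOPY_SETUP leads to SRC_RESUMED, while the same failure after the acknowledgement (and thus after postcopy-run) leaves the source in SRC_PAUSED, which is the distinction that the PING 3 approach cannot provide.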

Another suggestion from Peter that I would like to discuss is that, instead of introducing a new state, we could move the boundary between the "device" and "postcopy-active" states to the point when the postcopy-run command is actually sent (in this series, the boundary between "postcopy-setup" and "postcopy-active"). However, I am not sure whether such a change would have unwanted implications.

Juraj Marcin (4):
  qemu-thread: Introduce qemu_thread_detach()
  migration: Fix state transition in postcopy_start() error handling
  migration: Make listen thread joinable
  migration: Introduce postcopy-setup capability and state

 include/qemu/thread.h                  |  1 +
 migration/migration.c                  | 77 +++++++++++++++++++++++---
 migration/migration.h                  |  7 +++
 migration/options.c                    | 16 ++++++
 migration/options.h                    |  1 +
 migration/postcopy-ram.c               |  7 +++
 migration/savevm.c                     | 53 ++++++++++++++++--
 qapi/migration.json                    | 19 ++++++-
 tests/qtest/migration/postcopy-tests.c | 55 ++++++++++++++++++
 tests/qtest/migration/precopy-tests.c  |  3 +-
 util/qemu-thread-posix.c               |  8 +++
 util/qemu-thread-win32.c               | 10 ++++
 12 files changed, 241 insertions(+), 16 deletions(-)

-- 
2.50.1