Peter Xu <pet...@redhat.com> writes: > Normally the postcopy recover phase should only exist for a super short > period, that's the duration when QEMU is trying to recover from an > interrupted postcopy migration, during which handshake will be carried out > for continuing the procedure with state changes from PAUSED -> RECOVER -> > POSTCOPY_ACTIVE again. > > Here RECOVER phase should be super small, that happens right after the > admin specified a new but working network link for QEMU to reconnect to > dest QEMU. > > However there can still be case where the channel is broken in this small > RECOVER window. > > If it happens, with current code there's no way the src QEMU can got kicked > out of RECOVER stage. No way either to retry the recover in another channel > when established. > > This patch allows the RECOVER phase to fail itself too - we're mostly > ready, just some small things missing, e.g. properly kick the main > migration thread out when sleeping on rp_sem when we found that we're at > RECOVER stage. When this happens, it fails the RECOVER itself, and > rollback to PAUSED stage. Then the user can retry another round of > recovery. > > To make it even stronger, teach QMP command migrate-pause to explicitly > kick src/dst QEMU out when needed, so even if for some reason the migration > thread didn't got kicked out already by a failing rethrn-path thread, the > admin can also kick it out. > > This will be an super, super corner case, but still try to cover that. > > One can try to test this with two proxy channels for migration: > > (a) socat unix-listen:/tmp/src.sock,reuseaddr,fork tcp:localhost:10000 > (b) socat tcp-listen:10000,reuseaddr,fork unix:/tmp/dst.sock > > So the migration channel will be: > > (a) (b) > src -> /tmp/src.sock -> tcp:10000 -> /tmp/dst.sock -> dst > > Then to make QEMU hang at RECOVER stage, one can do below: > > (1) stop the postcopy using QMP command postcopy-pause > (2) kill the 2nd proxy (b) > (3) try to recover the postcopy using /tmp/src.sock on src > (4) src QEMU will go into RECOVER stage but won't be able to continue > from there, because the channel is actually broken at (b) > > Before this patch, step (4) will make src QEMU stuck in RECOVER stage, > without a way to kick the QEMU out or continue the postcopy again. After > this patch, (4) will quickly fail qemu and bounce back to PAUSED stage. > > Admin can also kick QEMU from (4) into PAUSED when needed using > migrate-pause when needed. > > After bouncing back to PAUSED stage, one can recover again. > > Reported-by: Xiaohui Li <xiao...@redhat.com> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2111332 > Signed-off-by: Peter Xu <pet...@redhat.com>
Reviewed-by: Fabiano Rosas <faro...@suse.de>