Peter Xu <pet...@redhat.com> writes:

> This patch adds a migration state on src called "postcopy-recover-setup".
> The new state will describe the intermediate step starting from when the
> src QEMU received a postcopy recovery request, until the migration channels
> are properly established, but before the recovery process take place.
>
> The request came from Libvirt where Libvirt currently rely on the migration
> state events to detect migration state changes.  That works for most of the
> migration process but except postcopy recovery failures at the beginning.
>
> Currently postcopy recovery only has two major states:
>
>   - postcopy-paused: this is the state that both sides of QEMU will be in
>     for a long time as long as the migration channel was interrupted.
>
>   - postcopy-recover: this is the state where both sides of QEMU handshake
>     with each other, preparing for a continuation of postcopy which used to
>     be interrupted.
>
> The issue here is when the recovery port is invalid, the src QEMU will take
> the URI/channels, noticing the ports are not valid, and it'll silently keep
> in the postcopy-paused state, with no event sent to Libvirt.  In this case,
> the only thing Libvirt can do is to poll the migration status with a proper
> interval, however that's less optimal.
>
> Considering that this is the only case where Libvirt won't get a
> notification from QEMU on such events, let's add postcopy-recover-setup
> state to mimic what we have with the "setup" state of a newly initialized
> migration, describing the phase of connection establishment.
>
> With that, postcopy recovery will have two paths to go now, and either path
> will guarantee an event generated.  Now the events will look like this
> during a recovery process on src QEMU:
>
>   - Initially when the recovery is initiated on src, QEMU will go from
>     "postcopy-paused" -> "postcopy-recover-setup".  Old QEMUs don't have
>     this event.
>
>   - Depending on whether the channel re-establishment is succeeded:
>
>     - In succeeded case, src QEMU will move from "postcopy-recover-setup"
>       to "postcopy-recover".  Old QEMUs also have this event.
>
>     - In failure case, src QEMU will move from "postcopy-recover-setup" to
>       "postcopy-paused" again.  Old QEMUs don't have this event.
>
> This guarantees that Libvirt will always receive a notification for
> recovery process properly.
>
> One thing to mention is, such new status is only needed on src QEMU not
> both.  On dest QEMU, the state machine doesn't change.  Hence the events
> don't change either.  It's done like so because dest QEMU may not have an
> explicit point of setup start.  E.g., it can happen that when dest QEMUs
> doesn't use migrate-recover command to use a new URI/channel, but the old
> URI/channels can be reused in recovery, in which case the old ports simply
> can work again after the network routes are fixed up.
>
> Add a new helper postcopy_is_paused() detecting whether postcopy is still
> paused, taking RECOVER_SETUP into account too.  When using it on both
> src/dst, a slight change is done altogether to always wait for the
> semaphore before checking the status, because for both sides a sem_post()
> will be required for a recovery.
>
> Cc: Jiri Denemark <jdene...@redhat.com>
> Cc: Fabiano Rosas <faro...@suse.de>
> Cc: Prasad Pandit <ppan...@redhat.com>
> Buglink: https://issues.redhat.com/browse/RHEL-38485
> Signed-off-by: Peter Xu <pet...@redhat.com>

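For readers following the thread, the src-side transitions described in the commit message can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the code from the patch: the MigState enum, src_recover() and the event counter are hypothetical names made up for the example. Only postcopy_is_paused() is named in the commit message, and its body here is an assumption about what "taking RECOVER_SETUP into account" means.

```c
#include <stdbool.h>

/* Hypothetical, reduced set of states; not QEMU's real enum. */
typedef enum {
    MIG_POSTCOPY_PAUSED,
    MIG_POSTCOPY_RECOVER_SETUP,
    MIG_POSTCOPY_RECOVER,
} MigState;

/*
 * Assumed shape of the helper: postcopy still counts as paused while the
 * src is only setting up the recovery channels.
 */
bool postcopy_is_paused(MigState s)
{
    return s == MIG_POSTCOPY_PAUSED || s == MIG_POSTCOPY_RECOVER_SETUP;
}

/*
 * Hypothetical src-side recovery attempt.  Every state change bumps
 * *events, standing in for a migration state event sent to Libvirt.
 */
MigState src_recover(MigState cur, bool channel_ok, int *events)
{
    if (cur != MIG_POSTCOPY_PAUSED) {
        return cur;                     /* nothing to recover from */
    }

    /* Recovery request received: paused -> recover-setup (new event). */
    cur = MIG_POSTCOPY_RECOVER_SETUP;
    (*events)++;

    /* Channel (re-)establishment decides which path is taken. */
    cur = channel_ok ? MIG_POSTCOPY_RECOVER    /* handshake can start   */
                     : MIG_POSTCOPY_PAUSED;    /* e.g. invalid port     */
    (*events)++;

    return cur;
}
```

In this sketch both the success and the failure path produce two state changes, which is the property the commit message relies on: a management layer watching events sees the failed attempt instead of having to poll.
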
Reviewed-by: Fabiano Rosas <faro...@suse.de>