* Hailiang Zhang (zhang.zhanghaili...@huawei.com) wrote:
> Hi,
>
> On 2016/12/6 23:24, Dr. David Alan Gilbert wrote:
> > * Kevin Wolf (kw...@redhat.com) wrote:
> > > On 19.11.2016 at 12:43, zhanghailiang wrote:
> > > > commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
> > > > migration aborted QEMU because it didn't regain control of the images
> > > > when some errors happened.
> > > >
> > > > Actually, there is another case in that error path that aborts QEMU
> > > > for the same reason:
> > > >
> > > > migration_thread()
> > > > migration_completion()
> > > >    bdrv_inactivate_all() ----------------> inactivate images
> > > >    qemu_savevm_state_complete_precopy()
> > > >       socket_writev_buffer() ------------> error because destination fails
> > > >         qemu_fflush() -------------------> set error on migration stream
> > > > qemu_mutex_unlock_iothread() ------------> unlock
> > > > qmp_migrate_cancel() --------------------> user cancelled migration
> > > >    migrate_set_state() ------------------> set migrate CANCELLING
> > >
> > > Important to note here: qmp_migrate_cancel() is executed by a concurrent
> > > thread, it doesn't depend on any code paths in migration_completion().
> > >
> > > > migration_completion() ------------------> go on to fail_invalidate
> > > >    if (s->state == MIGRATION_STATUS_ACTIVE) -> skip this branch
> > > > migration_thread() ----------------------> break migration loop
> > > > vm_start() ------------------------------> restart guest with inactive images
> > > >
> > > > We fail to regain control of the images because we only regain it
> > > > while the migration state is "active", but here the user cancelled the
> > > > migration after noticing that errors happened (for example, the libvirtd
> > > > daemon was shut down on the destination unexpectedly).
> > > >
> > > > Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
> > > > ---
> > > >  migration/migration.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index f498ab8..0c1ee6d 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -1752,7 +1752,8 @@ fail_invalidate:
> > > >      /* If not doing postcopy, vm_start() will be called: let's regain
> > > >       * control on images.
> > > >       */
> > > > -    if (s->state == MIGRATION_STATUS_ACTIVE) {
> > >
> > > This if condition tries to check whether we ran the code path that
> > > called bdrv_inactivate_all(), so that we only try to reactivate images
> > > if we really inactivated them first.
> > >
> > > The problem with it is that it ignores a possible concurrent
> > > modification of s->state.
> > >
> > > > +    if (s->state == MIGRATION_STATUS_ACTIVE ||
> > > > +        s->state == MIGRATION_STATUS_CANCELLING) {
> > >
> > > This adds another state that we could end up with after a concurrent
> > > modification, so that even in this case we undo the inactivation.
> > >
> > > However, it is no longer limited to the cases where we inactivated the
> > > images. It also applies to other code paths (like the postcopy one)
> > > where we didn't inactivate images.
> > >
> > > What saves the patch is that bdrv_invalidate_cache() is a no-op for
> > > block devices that aren't inactivated, so calling it more often than
> > > necessary is okay.
> > >
> > > But then, if we're going to rely on this, it would be much better to
> > > just remove the if altogether. I can't say whether there are any other
> > > possible values of s->state that we should consider, and by removing
> > > the if we would be guaranteed to catch all of them.
> > >
> > > If we don't want to rely on it, just keep a local bool that remembers
> > > whether we inactivated images and check that here.
> > >
> > > >          Error *local_err = NULL;
> > > >
> > > >          bdrv_invalidate_cache_all(&local_err);
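
For concreteness, that "local bool" variant might look roughly like this in
migration_completion(); this is only an untested sketch, with the rest of the
function reduced to comments, the signature copied approximately from the
current code, and "inactivated" just an illustrative name:

static void migration_completion(MigrationState *s, int current_active_state,
                                 bool *old_vm_running, int64_t *start_time)
{
    /* Sketch only: signature and surrounding structure approximated from
     * the current migration.c, not a real patch.
     */
    bool inactivated = false;   /* did this code path hand the images over? */
    int ret;

    if (s->state == MIGRATION_STATUS_ACTIVE) {
        qemu_mutex_lock_iothread();
        /* ... stop the VM and store the global state, as before ... */
        ret = bdrv_inactivate_all();
        if (ret >= 0) {
            inactivated = true;
            /* ... send the final device state, as before ... */
        }
        qemu_mutex_unlock_iothread();

        if (ret < 0) {
            goto fail;
        }
    }

    /* ... postcopy path and return-path handling unchanged; error paths
     * in them still use goto fail_invalidate / goto fail ... */

    migrate_set_state(&s->state, current_active_state,
                      MIGRATION_STATUS_COMPLETED);
    return;

fail_invalidate:
    /* If not doing postcopy, vm_start() will be called: let's regain
     * control on images.  Key on what this thread actually did, not on
     * s->state, which a concurrent qmp_migrate_cancel() may have moved
     * to CANCELLING in the meantime.
     */
    if (inactivated) {
        Error *local_err = NULL;

        bdrv_invalidate_cache_all(&local_err);
        if (local_err) {
            error_report_err(local_err);
        }
    }

fail:
    migrate_set_state(&s->state, current_active_state,
                      MIGRATION_STATUS_FAILED);
}

That keys the reactivation on what this thread actually did rather than on
s->state, so a concurrent cancel can't confuse it; it does nothing about the
wider locking questions discussed below, though.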
> > >
> > > So in summary, this is a horrible patch because it checks the wrong
> > > thing, and I can't really say whether it covers everything it needs to
> > > cover, but arguably it happens to correctly fix the outcome of a
> > > previously failing case.
> > >
> > > Normally I would reject such a patch and require a clean solution, but
> > > then we're on the day of -rc3, so if you can't send v2 right away, we
> > > might not have the time for it.
> > >
> > > Tough call...
> >
> > Hmm, this case is messy; I created this function having split it out
> > of the main loop a couple of years back, but it did get messier with
> > more s->state checks. As far as I can tell it has always done the
> > transition to COMPLETED at the end, well after the locked section, so
> > there has always been the chance that a cancellation sneaks in just
> > before or just after the locked section.
> >
> > Some of the bad cases that can happen:
> >   a) A cancel sneaks in after the ACTIVE check but before or after
> >      the locked section; should we reactivate the disks? Well, that
> >      depends on whether the destination actually got the full migration
> >      stream - we don't know!
> >      If the destination actually starts running, we must not reactivate
> >      the disks on the source even if the CPU is stopped.
>
> Yes, we don't have a mechanism to know exactly whether or not the
> destination received the VM state completely.
>
> >   b) If bdrv_inactivate_all() fails for one device, but the others
> >      are fine, we go down the fail: label and don't reactivate, so
> >      the source dies even though it might have been mostly OK.
> >
> > We can move the _lock to before the check of s->state at the top,
> > which would stop the cancel sneaking in early.
> > In the case where postcopy was never enabled (!migrate_postcopy_ram())
> > we can move the COMPLETED transition into the lock as well; I think
> > we then become more or less safe.
> > In the case where postcopy was enabled, I think we can do the COMPLETED
> > transition before waiting for the return path to close - but I need to
> > think more about that.
> > And there seem to be some dodgy cases where we call the invalidate
> > there after a late postcopy failure; that's bad, we shouldn't reactivate
> > the source disks after going into postcopy.
> >
> > So, in summary: this function is a mess - it needs a much bigger fix
> > than this patch.
>
> So what's the conclusion?
> Will you send a patch to fix it? Or shall we fix it step by step?
> I think Kevin's suggestion to just remove the *if* check is OK.
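
For reference, the locking rearrangement described above might look roughly
like the fragment below inside migration_completion(); again this is only an
untested sketch of the idea, with the actual stop/inactivate/send steps
elided, and it still doesn't answer whether the destination really received
the full stream:

    /* Sketch only: a fragment of migration_completion(); "ret" stands for
     * the combined result of the elided steps, as in the existing code.
     */
    int ret = 0;

    qemu_mutex_lock_iothread();
    if (s->state == MIGRATION_STATUS_ACTIVE) {
        /* The idea: with the lock taken before the state check,
         * qmp_migrate_cancel(), which runs in the main thread under the
         * same lock, can no longer flip s->state between the check and
         * the work below.
         */

        /* ... stop the VM, bdrv_inactivate_all(), send the final device
         * state; any failure leaves ret < 0 ... */

        if (ret >= 0 && !migrate_postcopy_ram()) {
            /* Pure precopy: declare COMPLETED while still holding the
             * lock, so a concurrent cancel cannot turn an already-finished
             * migration into CANCELLING afterwards.
             */
            migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
                              MIGRATION_STATUS_COMPLETED);
        }
    }
    qemu_mutex_unlock_iothread();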
I don't think any of the simple solutions are easy; so I plan to look at it
properly, but I'm not sure when; if you have time to do it then feel free.

Dave

> Thanks,
> Hailiang
>
> > Dave
> >
> > > Kevin
> >
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK