On 03/03/2017 14:26, Dr. David Alan Gilbert wrote: > * Paolo Bonzini (pbonz...@redhat.com) wrote: >> >> >> On 03/03/2017 14:11, Dr. David Alan Gilbert wrote: >>> * Paolo Bonzini (pbonz...@redhat.com) wrote: >>>> >>>> >>>> On 03/03/2017 13:00, Dr. David Alan Gilbert wrote: >>>>> Ouch that's pretty nasty; I remember Paolo explaining to me a while ago >>>>> that >>>>> their were times when run_on_cpu would have to drop the BQL and I worried >>>>> about it, >>>>> but this is the 1st time I've seen an error due to it. >>>>> >>>>> Do you know what the migration state was at that point? Was it >>>>> MIGRATION_STATUS_CANCELLING? >>>>> I'm thinking perhaps we should stop 'cont' from continuing while >>>>> migration is in >>>>> MIGRATION_STATUS_CANCELLING. Do we send an event when we hit CANCELLED - >>>>> so that >>>>> perhaps libvirt could avoid sending the 'cont' until then? >>>> >>>> No, there's no event, though I thought libvirt would poll until >>>> "query-migrate" returns the cancelled state. Of course that is a small >>>> consolation, because a segfault is unacceptable. >>> >>> I think you might get an event if you set the new migrate capability called >>> 'events' on! >>> >>> void migrate_set_state(int *state, int old_state, int new_state) >>> { >>> if (atomic_cmpxchg(state, old_state, new_state) == old_state) { >>> trace_migrate_set_state(new_state); >>> migrate_generate_event(new_state); >>> } >>> } >>> >>> static void migrate_generate_event(int new_state) >>> { >>> if (migrate_use_events()) { >>> qapi_event_send_migration(new_state, &error_abort); >>> } >>> } >>> >>> That event feature went in sometime after 2.3.0. >>> >>>> One possibility is to suspend the monitor in qmp_migrate_cancel and >>>> resume it (with add_migration_state_change_notifier) when we hit the >>>> CANCELLED state. I'm not sure what the latency would be between the end >>>> of migrate_fd_cancel and finally reaching CANCELLED. >>> >>> I don't like suspending monitors; it can potentially take quite a >>> significant >>> time to do a cancel. >>> How about making 'cont' fail if we're in CANCELLING? >> >> Actually I thought that would be the case already (in fact CANCELLING is >> internal only; the outside world sees it as "active" in query-migrate). >> >> Lei, what is the runstate? (That is, why did cont succeed at all)? > > I suspect it's RUN_STATE_FINISH_MIGRATE - we set that before we do the device > save, and that's what we get at the end of a migrate and it's legal to restart > from there.
Yeah, but I think we get there at the end of a failed migrate only. So perhaps we can introduce a new state RUN_STATE_FAILED_MIGRATE and forbid "cont" from finish-migrate (only allow it from failed-migrate)? Paolo >> Paolo >> >>> I'd really love to see the 'run_on_cpu' being more careful about the BQL; >>> we really need all of the rest of the devices to stay quiesced at times. >> >> That's not really possible, because of how condition variables work. :( > > *Really* we need to find a solution to that - there's probably lots of > other things that can spring up in that small window other than the > 'cont'. > > Dave > > -- > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK >