* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote: > For PVM, if there is failover request from users. > The colo thread will exit the loop while the failover BH does the > cleanup work and resumes VM. > > Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com> > Signed-off-by: Li Zhijian <lizhij...@cn.fujitsu.com> > --- > v11: > - Don't call migration_end() in primary_vm_do_failover(), > The cleanup work will be done in migration_thread(). > - Remove vm_start() in primary_vm_do_failover() which also been done > in migraiton_thread() > v10: > - Call migration_end() in primary_vm_do_failover() > --- > include/migration/colo.h | 3 +++ > include/migration/failover.h | 1 + > migration/colo-failover.c | 7 +++++- > migration/colo.c | 56 > ++++++++++++++++++++++++++++++++++++++++++-- > 4 files changed, 64 insertions(+), 3 deletions(-) > > diff --git a/include/migration/colo.h b/include/migration/colo.h > index ba27719..0b02e95 100644 > --- a/include/migration/colo.h > +++ b/include/migration/colo.h > @@ -32,4 +32,7 @@ void *colo_process_incoming_thread(void *opaque); > bool migration_incoming_in_colo_state(void); > > COLOMode get_colo_mode(void); > + > +/* failover */ > +void colo_do_failover(MigrationState *s); > #endif > diff --git a/include/migration/failover.h b/include/migration/failover.h > index 882c625..fba3931 100644 > --- a/include/migration/failover.h > +++ b/include/migration/failover.h > @@ -26,5 +26,6 @@ void failover_init_state(void); > int failover_set_state(int old_state, int new_state); > int failover_get_state(void); > void failover_request_active(Error **errp); > +bool failover_request_is_active(void); > > #endif > diff --git a/migration/colo-failover.c b/migration/colo-failover.c > index 1b1be24..0c525da 100644 > --- a/migration/colo-failover.c > +++ b/migration/colo-failover.c > @@ -32,7 +32,7 @@ static void colo_failover_bh(void *opaque) > error_report("Unkown error for failover, old_state=%d", old_state); > return; > } > - /*TODO: Do failover work */ > + colo_do_failover(NULL); > } > > void failover_request_active(Error **errp) > @@ -67,6 +67,11 @@ int failover_get_state(void) > return atomic_read(&failover_state); > } > > +bool failover_request_is_active(void) > +{ > + return ((failover_get_state() != FAILOVER_STATUS_NONE)); > +} > + > void qmp_x_colo_lost_heartbeat(Error **errp) > { > if (get_colo_mode() == COLO_MODE_UNKNOWN) { > diff --git a/migration/colo.c b/migration/colo.c > index cedfc63..7a42fc6 100644 > --- a/migration/colo.c > +++ b/migration/colo.c > @@ -41,6 +41,42 @@ bool migration_incoming_in_colo_state(void) > return mis && (mis->state == MIGRATION_STATUS_COLO); > } > > +static bool colo_runstate_is_stopped(void) > +{ > + return runstate_check(RUN_STATE_COLO) || !runstate_is_running(); > +} > + > +static void primary_vm_do_failover(void) > +{ > + MigrationState *s = migrate_get_current(); > + int old_state; > + > + if (s->state != MIGRATION_STATUS_FAILED) { > + migrate_set_state(&s->state, MIGRATION_STATUS_COLO, > + MIGRATION_STATUS_COMPLETED); > + }
That's a little odd; it will only move to completed if the current state is MIGRATION_STATUS_COLO, but you only do it if the state isn't FAILED. You could remove the if and just call migrate_set_state like that and it would be safe as long as you really only expect to have to deal with MIGRATION_STATUS_COLO. > + old_state = failover_set_state(FAILOVER_STATUS_HANDLING, > + FAILOVER_STATUS_COMPLETED); > + if (old_state != FAILOVER_STATUS_HANDLING) { > + error_report("Serious error while do failover for Primary VM," > + "old_state: %d", old_state); It's generally better to state the reason rather than say it's 'serious'; so something like: 'Incorrect state (%d) while doing failover for Primary VM' tells you more, and looks less scary! > + return; > + } > +} > + > +void colo_do_failover(MigrationState *s) > +{ > + /* Make sure vm stopped while failover */ > + if (!colo_runstate_is_stopped()) { > + vm_stop_force_state(RUN_STATE_COLO); > + } That feels like a race? couldn't we be just at the end of taking a checkpoint and about to restart when you do the if, so it reads that it's currently stopped but then it restarts it by the time you have a chance to do anything? I see in patch 13 (Save PVM....) you have: qemu_mutex_lock_iothread(); vm_start(); qemu_mutex_unlock_iothread(); So maybe if that code is executed just before the failover, then it would stop at the _lock_, we would run here but then as soon as we finish wouldn't it vm_start there? > + if (get_colo_mode() == COLO_MODE_PRIMARY) { > + primary_vm_do_failover(); > + } > +} > + > /* colo checkpoint control helper */ > static int colo_ctl_put(QEMUFile *f, uint32_t cmd, uint64_t value) > { > @@ -122,9 +158,22 @@ static int colo_do_checkpoint_transaction(MigrationState > *s, > } > > qemu_mutex_lock_iothread(); > + if (failover_request_is_active()) { > + qemu_mutex_unlock_iothread(); > + ret = -1; > + goto out; > + } > vm_stop_force_state(RUN_STATE_COLO); > qemu_mutex_unlock_iothread(); > trace_colo_vm_state_change("run", "stop"); > + /* > + * failover request bh could be called after > + * vm_stop_force_state so we check failover_request_is_active() again. > + */ > + if (failover_request_is_active()) { > + ret = -1; > + goto out; > + } I'm confused about why the check is needed specifically here; for example can't that happen at any point where we're ousite of the iothread lock? e.g. couldn't we set failover just a couple of lines lower, lets say just after the s->params.blk= 0 ? > /* Disable block migration */ > s->params.blk = 0; > @@ -221,6 +270,11 @@ static void colo_process_checkpoint(MigrationState *s) > trace_colo_vm_state_change("stop", "run"); > > while (s->state == MIGRATION_STATUS_COLO) { > + if (failover_request_is_active()) { > + error_report("failover request"); > + goto out; > + } > + > current_time = qemu_clock_get_ms(QEMU_CLOCK_HOST); > if (current_time - checkpoint_time < > s->parameters[MIGRATION_PARAMETER_CHECKPOINT_DELAY]) { > @@ -242,8 +296,6 @@ out: > if (ret < 0) { > error_report("%s: %s", __func__, strerror(-ret)); > } > - migrate_set_state(&s->state, MIGRATION_STATUS_COLO, > - MIGRATION_STATUS_COMPLETED); I had to think about this; but yes, I guess the only way out is via the failover which does the completed above. Dave > qsb_free(buffer); > buffer = NULL; > -- > 1.8.3.1 > > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK