25.06.2018 21:03, John Snow wrote:

On 06/25/2018 01:50 PM, Dr. David Alan Gilbert wrote:
* Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
* Vladimir Sementsov-Ogievskiy (vsement...@virtuozzo.com) wrote:
15.06.2018 15:06, Dr. David Alan Gilbert wrote:
* Vladimir Sementsov-Ogievskiy (vsement...@virtuozzo.com) wrote:
Invalidate cache before source start in case of failed migration.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
Why doesn't the code at the bottom of migration_completion,
fail_invalidate:   and the code in migrate_fd_cancel   handle this?

What case did you see it in that those didn't handle?
(Also I guess it probably should set s->block_inactive = false)
on source I see:

81392@1529065750.766289:migrate_set_state new state 7
81392@1529065750.766330:migration_thread_file_err
81392@1529065750.766332:migration_thread_after_loop

so, we are leaving loop on
         if (qemu_file_get_error(s->to_dst_file)) {
             migrate_set_state(&s->state, current_active_state,
MIGRATION_STATUS_FAILED);
trace_migration_thread_file_err();
break;
         }

and skip migration_completion()
Yeh, OK; I'd seen soemthing else a few days ago, where a cancellation
test that had previously ended with a 'cancelled' state has now ended up
in 'failed' (which is the state 7 you have above).
I suspect there's something else going on as well; I think what is
supposed to happen in the case of 'cancel' is that it spins in 'cancelling' for
a while in migrate_fd_cancel and then at the bottom of migrate_fd_cancel
it does the recovery, but because it's going to failed instead, then
it's jumping over that recovery.
Going back and actually looking at the patch again;
can I ask for 1 small change;
    Can you set s->block_inactive = false   in the case where you
don't get the local_err (Like we do at the bottom of migrate_fd_cancel)


Does that make sense?

Thanks,

Dave

Vladimir, one more question for you because I'm not as familiar with
this code:

In the normal case we need to invalidate the qcow2 cache as a way to
re-engage the disk (yes?) when we have failed during the late-migration
steps.

In this case, we seem to be observing a failure during the bulk transfer
loop. Why is it important to invalidate the cache at this step -- would
the disk have been inactivated yet? It shouldn't, because it's in the
bulk transfer phase -- or am I missing something?

I feel like this code is behaving in a way that's fairly surprising for
a casual reader so I was hoping you could elaborate for me.

--js

In my case, source is already in postcopy state, when error occured, so it is inactivated.

--
Best regards,
Vladimir


Reply via email to