Cédric Le Goater <c...@redhat.com> writes:

> In case of error, close_return_path_on_source() can perform a shutdown
> to exit the return-path thread.  However, in migrate_fd_cleanup(),
> 'to_dst_file' is closed before calling close_return_path_on_source()
> and the shutdown fails, leaving the source and destination waiting for
> an event to occur.

Hi, Cédric

Are you sure this is not caused by patch 13? That 'if (ms->to_dst_file'
was there to avoid this sort of thing happening.

Is there some reordering possibility that I'm not spotting in the code
below? I think the data dependency on to_dst_file shouldn't allow it.

migrate_fd_cleanup:
        qemu_mutex_lock(&s->qemu_file_lock);
        tmp = s->to_dst_file;
        s->to_dst_file = NULL;
        qemu_mutex_unlock(&s->qemu_file_lock);
        ...
        qemu_fclose(tmp);

close_return_path_on_source:
    WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
            qemu_file_get_error(ms->to_dst_file)) {
            qemu_file_shutdown(ms->rp_state.from_dst_file);
        }
    }

I'm thinking maybe the culprit is the close_return_path_on_source() at
migration_completion(). It might be possible for it to race with the
migrate_fd_cleanup_bh from migration_iteration_finish().

If that's the case, then I think that one possible fix would be to hold
the BQL at migration_completion() so the BH doesn't get dispatched until
we properly close the return path.

>
> Close the file after calling close_return_path_on_source() so that the
> shutdown succeeds and the return-path thread exits.
>
> Signed-off-by: Cédric Le Goater <c...@redhat.com>
> ---
>
>  This is an RFC because the correct fix implies reworking the QEMUFile
>  construct, built on top of the QEMU I/O channel.
>
>  migration/migration.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 
> 5f55af3d7624750ca416c4177781241b3e291e5d..de329f2c553288935d824748286e79e535929b8b
>  100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1313,6 +1313,8 @@ void migrate_set_state(int *state, int old_state, int 
> new_state)
>  
>  static void migrate_fd_cleanup(MigrationState *s)
>  {
> +    QEMUFile *tmp = NULL;
> +
>      g_free(s->hostname);
>      s->hostname = NULL;
>      json_writer_free(s->vmdesc);
> @@ -1321,8 +1323,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>      qemu_savevm_state_cleanup();
>  
>      if (s->to_dst_file) {
> -        QEMUFile *tmp;
> -
>          trace_migrate_fd_cleanup();
>          bql_unlock();
>          if (s->migration_thread_running) {
> @@ -1341,15 +1341,14 @@ static void migrate_fd_cleanup(MigrationState *s)
>           * critical section won't block for long.
>           */
>          migration_ioc_unregister_yank_from_file(tmp);
> -        qemu_fclose(tmp);
>      }
>  
> -    /*
> -     * We already cleaned up to_dst_file, so errors from the return
> -     * path might be due to that, ignore them.
> -     */
>      close_return_path_on_source(s);
>  
> +    if (tmp) {
> +        qemu_fclose(tmp);
> +    }
> +
>      assert(!migration_is_active(s));
>  
>      if (s->state == MIGRATION_STATUS_CANCELLING) {

Reply via email to