On 2/23/24 15:05, Fabiano Rosas wrote:
Peter Xu <pet...@redhat.com> writes:

On Fri, Feb 16, 2024 at 02:35:26PM -0300, Fabiano Rosas wrote:
Cédric Le Goater <c...@redhat.com> writes:

Hello Fabiano

On 2/14/24 21:35, Fabiano Rosas wrote:
Cédric Le Goater <c...@redhat.com> writes:

Hello Fabiano

On 2/8/24 14:29, Fabiano Rosas wrote:
Cédric Le Goater <c...@redhat.com> writes:

In case of error, close_return_path_on_source() can perform a shutdown
to exit the return-path thread.  However, in migrate_fd_cleanup(),
'to_dst_file' is closed before calling close_return_path_on_source()
and the shutdown fails, leaving the source and destination waiting for
an event to occur.

Hi, Cédric

Are you sure this is not caused by patch 13?

It happens with upstream QEMU without any patch.

I might have taken that "shutdown fails" in the commit message too
literaly. Anyway, I have a proposed solution:

-->8--
  From 729aa7b5b7f130f756d41649fdd0862bd2e90430 Mon Sep 17 00:00:00 2001
From: Fabiano Rosas <faro...@suse.de>
Date: Wed, 14 Feb 2024 16:45:43 -0300
Subject: [PATCH] migration: Join the return path thread before releasing
   to_dst_file
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The return path thread might hang at a blocking system call. Before
joining the thread we might need to issue a shutdown() on the socket
file descriptor to release it. To determine whether the shutdown() is
necessary we look at the QEMUFile error.

Make sure we only clean up the QEMUFile after the return path has been
waited for.

Yes. That's the important part.

This fixes a hang when qemu_savevm_state_setup() produced an error
that was detected by migration_detect_error(). That skips
migration_completion() so close_return_path_on_source() would get
stuck waiting for the RP thread to terminate.

At migrate_fd_cleanup() I'm keeping the relative order of joining the
migration thread and the return path just in case.

That doesn't look necessary.

Indeed. But I don't trust the migration code, it's full of undocumented
dependencies like that.

What was the reason to join the migration thread only when
s->to_dst_file is valid ?

I didn't find any explicit reason looking through the history. It seems
we used to rely on to_dst_file before migration_thread_running was
introduced.

I wouldn't mind keeping that 'if' there.

Let's see what Peter thinks about it.

Frankly I don't have a strong opinion on current patch 14 or the new
proposal, but it seems we reached a consensus.

Fabiano, would you repost with a formal patch, with the proper tags?

Yes, I'll post it soon.


One thing I am still not sure is whether we should still have patch 13
altogether? Please see my other reply on whether it's possible to have
migrate_get_error() == true but qemu_file_get_error() == false.

I'll include it then.

Thanks for taking over.

I have included :

 [PATCH] migration: Join the return path thread before releasing to_dst_file

in my series and dropped 13-14. I hope to send a follow up on :

  https://lore.kernel.org/qemu-devel/20240207133347.1115903-1-...@redhat.com/

before we reach soft freeze. It's growing quite a lot.

C.


Reply via email to