Cédric Le Goater <c...@redhat.com> writes:

> Hello Fabiano
>
> On 2/14/24 21:35, Fabiano Rosas wrote:
>> Cédric Le Goater <c...@redhat.com> writes:
>> 
>>> Hello Fabiano
>>>
>>> On 2/8/24 14:29, Fabiano Rosas wrote:
>>>> Cédric Le Goater <c...@redhat.com> writes:
>>>>
>>>>> In case of error, close_return_path_on_source() can perform a shutdown
>>>>> to exit the return-path thread.  However, in migrate_fd_cleanup(),
>>>>> 'to_dst_file' is closed before calling close_return_path_on_source()
>>>>> and the shutdown fails, leaving the source and destination waiting for
>>>>> an event to occur.
>>>>
>>>> Hi, Cédric
>>>>
>>>> Are you sure this is not caused by patch 13?
>>>
>>> It happens with upstream QEMU without any patch.
>> 
>> I might have taken that "shutdown fails" in the commit message too
>> literaly. Anyway, I have a proposed solution:
>> 
>> -->8--
>>  From 729aa7b5b7f130f756d41649fdd0862bd2e90430 Mon Sep 17 00:00:00 2001
>> From: Fabiano Rosas <faro...@suse.de>
>> Date: Wed, 14 Feb 2024 16:45:43 -0300
>> Subject: [PATCH] migration: Join the return path thread before releasing
>>   to_dst_file
>> MIME-Version: 1.0
>> Content-Type: text/plain; charset=UTF-8
>> Content-Transfer-Encoding: 8bit
>> 
>> The return path thread might hang at a blocking system call. Before
>> joining the thread we might need to issue a shutdown() on the socket
>> file descriptor to release it. To determine whether the shutdown() is
>> necessary we look at the QEMUFile error.
>> 
>> Make sure we only clean up the QEMUFile after the return path has been
>> waited for.
>
> Yes. That's the important part.
>
>> This fixes a hang when qemu_savevm_state_setup() produced an error
>> that was detected by migration_detect_error(). That skips
>> migration_completion() so close_return_path_on_source() would get
>> stuck waiting for the RP thread to terminate.
>> 
>> At migrate_fd_cleanup() I'm keeping the relative order of joining the
>> migration thread and the return path just in case.
>
> That doesn't look necessary.

Indeed. But I don't trust the migration code, it's full of undocumented
dependencies like that.

> What was the reason to join the migration thread only when
> s->to_dst_file is valid ?

I didn't find any explicit reason looking through the history. It seems
we used to rely on to_dst_file before migration_thread_running was
introduced.

I wouldn't mind keeping that 'if' there.

Let's see what Peter thinks about it.


Reply via email to