On 2/5/24 04:37, Peter Xu wrote:
On Fri, Feb 02, 2024 at 12:11:09PM -0300, Fabiano Rosas wrote:
Cédric Le Goater <c...@redhat.com> writes:
On 2/2/24 15:42, Fabiano Rosas wrote:
Cédric Le Goater <c...@redhat.com> writes:
In case of error, close_return_path_on_source() can perform a shutdown
to exit the return-path thread. However, in migrate_fd_cleanup(),
'to_dst_file' is closed before calling close_return_path_on_source()
and the shutdown fails, leaving the source and destination waiting for
an event to occur.
In close_return_path_on_source(), both qemu_file_shutdown() and the
ms->to_dst_file check are done under the qemu_file_lock, so how could
migrate_fd_cleanup() have cleared the pointer and the ms->to_dst_file
check still have passed?
This is not a locking issue, it's much simpler. migrate_fd_cleanup()
clears the ms->to_dst_file pointer and closes the QEMUFile and then
calls close_return_path_on_source() which then tries to use resources
which are not available anymore.
I'm missing something here. Which resources? I assume you're talking
about this:
    WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
            qemu_file_get_error(ms->to_dst_file)) {
            qemu_file_shutdown(ms->rp_state.from_dst_file);
        }
    }
How do we get past the 'if (ms->to_dst_file)'?
We don't; migrate_fd_cleanup() will release ms->to_dst_file, then call
close_return_path_on_source(), which finds that to_dst_file == NULL and
skips the shutdown().
One other option might be to call close_return_path_on_source() before
the chunk that releases to_dst_file.
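
To make the ordering concrete, here is a minimal standalone sketch. It is
not QEMU code: MockFile, MockState and every mock_* function are
hypothetical stand-ins, and the lock is left out because, as said above,
this is not a locking issue. It only shows that the guarded shutdown is
skipped when to_dst_file is released first, and reached when the return
path is closed first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for QEMU's types; not the real definitions. */
typedef struct MockFile {
    bool error;      /* models qemu_file_get_error() returning non-zero */
    bool shut_down;  /* set when the mock shutdown is reached */
} MockFile;

typedef struct MockState {
    MockFile *to_dst_file;
    MockFile *from_dst_file;
} MockState;

/* Mirrors the guarded shutdown quoted above, minus the lock. */
static void mock_close_return_path(MockState *ms)
{
    if (ms->to_dst_file && ms->from_dst_file && ms->to_dst_file->error) {
        ms->from_dst_file->shut_down = true;
    }
}

/* Models the part of migrate_fd_cleanup() that releases to_dst_file. */
static void mock_release_to_dst_file(MockState *ms)
{
    ms->to_dst_file = NULL;
}

/* Returns true if the return path was shut down for the given ordering. */
static bool shutdown_issued(bool close_rp_first)
{
    MockFile to = { .error = true, .shut_down = false };
    MockFile from = { .error = false, .shut_down = false };
    MockState ms = { .to_dst_file = &to, .from_dst_file = &from };

    if (close_rp_first) {
        mock_close_return_path(&ms);
        mock_release_to_dst_file(&ms);
    } else {
        mock_release_to_dst_file(&ms);
        mock_close_return_path(&ms);
    }
    return from.shut_down;
}

/* Sanity check of both orderings. */
static void check_orderings(void)
{
    assert(!shutdown_issued(false)); /* release first: shutdown skipped */
    assert(shutdown_issued(true));   /* close return path first: reached */
}
```

In this model the reordering is enough to reach the shutdown, but it says
nothing about the two QEMUFiles sharing one ioc, which is the underlying
issue.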
This "two qemufiles share the same ioc" issue has bitten us before, IIRC,
and the only concern with that workaround is that we keep postponing the
resolution of the real issue, and so we keep getting bitten by it.
Maybe we can wait a few days to see if Dan can join the conversation and if
we can reach a consensus on a complete solution. Otherwise I think we can
still work around this, but that may require a comment block explaining
the ordering after such a move.
Yes. The series should have been sent as an RFC.
I changed PATCH 1 to use migrate_has_error() instead of
qemu_file_get_error(ms->to_dst_file). I will keep PATCH 2 as it is for
the time being and wait for more feedback.
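
A rough standalone sketch of what the PATCH 1 change means for the check,
under the assumption that the migration error is recorded on the migration
state rather than only on the QEMUFile. MockFile, MockState and the mock_*
helpers are hypothetical; only the names migrate_has_error() and
qemu_file_get_error() come from this thread, and the real condition in the
patch may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not QEMU's real structures. */
typedef struct MockFile {
    int last_error;  /* models what qemu_file_get_error() would return */
    bool shut_down;
} MockFile;

typedef struct MockState {
    MockFile *to_dst_file;
    MockFile *from_dst_file;
    bool has_error;  /* assumption: error kept on the state itself */
} MockState;

/* Models migrate_has_error(): asks the state, not the file. */
static bool mock_migrate_has_error(const MockState *ms)
{
    return ms->has_error;
}

/* Condition as quoted earlier in the thread: error read from the file. */
static void mock_close_rp_file_error(MockState *ms)
{
    if (ms->to_dst_file && ms->from_dst_file &&
        ms->to_dst_file->last_error) {
        ms->from_dst_file->shut_down = true;
    }
}

/* Same condition with only the error source swapped. */
static void mock_close_rp_state_error(MockState *ms)
{
    if (ms->to_dst_file && ms->from_dst_file &&
        mock_migrate_has_error(ms)) {
        ms->from_dst_file->shut_down = true;
    }
}

/*
 * An error reported on the state but never set on the file, as could
 * happen for a failure that is not a stream error.
 */
static bool shutdown_with_state_error(bool use_state_error)
{
    MockFile to = { .last_error = 0, .shut_down = false };
    MockFile from = { .last_error = 0, .shut_down = false };
    MockState ms = { &to, &from, true };

    if (use_state_error) {
        mock_close_rp_state_error(&ms);
    } else {
        mock_close_rp_file_error(&ms);
    }
    return from.shut_down;
}

/* Sanity check of both variants. */
static void check_error_sources(void)
{
    assert(!shutdown_with_state_error(false)); /* file check misses it */
    assert(shutdown_with_state_error(true));   /* state check catches it */
}
```

Note that this sketch keeps the to_dst_file check, so on its own it does
not address the ordering problem discussed for PATCH 2.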
The prereq series adds an Error** argument to the .save_setup() and
.log_global*() handlers. I should send it this week.
Thanks,
C.
Thanks,