Hello Fabiano
On 2/8/24 14:29, Fabiano Rosas wrote:
> Cédric Le Goater <c...@redhat.com> writes:
>
>> In case of error, close_return_path_on_source() can perform a shutdown
>> to exit the return-path thread. However, in migrate_fd_cleanup(),
>> 'to_dst_file' is closed before calling close_return_path_on_source()
>> and the shutdown fails, leaving the source and destination waiting for
>> an event to occur.
>
> Hi, Cédric
>
> Are you sure this is not caused by patch 13?
It happens with upstream QEMU without any patch.
When vfio_listener_log_global_start() fails, it sets an error on the
QEMUFile. To reproduce without a VFIO device, you can inject an error
when dirty tracking is started. Something like the hack below:
@@ -2817,6 +2817,8 @@ static void ram_init_bitmaps(RAMState *rs)
      * containing all 1s to exclude any discarded pages from migration.
      */
     migration_bitmap_clear_discarded_pages(rs);
+
+    qemu_file_set_error(migrate_get_current()->to_dst_file, -EAGAIN);
 }
 
 static int ram_init_all(RAMState **rsp)
Activate return-path and migrate.
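
For instance, via HMP (assuming a TCP destination; the capability name
is "return-path"):

  (qemu) migrate_set_capability return-path on
  (qemu) migrate -d tcp:<dest>:<port>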
> That 'if (ms->to_dst_file'
> was there to avoid this sort of thing happening.
>
> Is there some reordering possibility that I'm not spotting in the code
> below? I think the data dependency on to_dst_file shouldn't allow it.
>
> migrate_fd_cleanup:
>
>     qemu_mutex_lock(&s->qemu_file_lock);
>     tmp = s->to_dst_file;
>     s->to_dst_file = NULL;
>     qemu_mutex_unlock(&s->qemu_file_lock);
>     ...
>     qemu_fclose(tmp);
>
> close_return_path_on_source:
>
>     WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>         if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>             qemu_file_get_error(ms->to_dst_file)) {
>             qemu_file_shutdown(ms->rp_state.from_dst_file);
>         }
>     }
close_return_path_on_source() is called by migrate_fd_cleanup() in
the same thread. So, when we reach the locking section, ms->to_dst_file
is already NULL and the file has already been closed by qemu_fclose() :/
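
To spell out the sequence (condensed from the code quoted above,
comments mine):

  /* in migrate_fd_cleanup(), same thread, top to bottom */
  qemu_mutex_lock(&s->qemu_file_lock);
  tmp = s->to_dst_file;
  s->to_dst_file = NULL;            /* return path can no longer see it */
  qemu_mutex_unlock(&s->qemu_file_lock);
  ...
  qemu_fclose(tmp);                 /* the file is gone */
  ...
  close_return_path_on_source(s);   /* ms->to_dst_file is NULL here, so
                                     * qemu_file_shutdown() is never called
                                     * and the return-path thread keeps
                                     * waiting */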
Maybe I misunderstood. Please try to reproduce with the little hack
above.
Thanks,
C.
> I'm thinking maybe the culprit is the close_return_path_on_source() at
> migration_completion(). It might be possible for it to race with the
> migrate_fd_cleanup_bh from migration_iteration_finish().
>
> If that's the case, then I think that one possible fix would be to hold
> the BQL at migration_completion() so the BH doesn't get dispatched until
> we properly close the return path.
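
A rough sketch of that BQL idea, for illustration only (hypothetical,
not what the patch below does); it relies on the cleanup BH running in
the main loop with the BQL held:

  /* hypothetical: hold the BQL across closing the return path in
   * migration_completion(), so migrate_fd_cleanup_bh cannot be
   * dispatched in between */
  bql_lock();
  close_return_path_on_source(s);   /* s->to_dst_file is still valid */
  bql_unlock();                     /* only now can the cleanup BH run */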
>> Close the file after calling close_return_path_on_source() so that the
>> shutdown succeeds and the return-path thread exits.
>>
>> Signed-off-by: Cédric Le Goater <c...@redhat.com>
>> ---
>>
>> This is an RFC because the correct fix implies reworking the QEMUFile
>> construct, built on top of the QEMU I/O channel.
>>
>>  migration/migration.c | 13 ++++++-------
>>  1 file changed, 6 insertions(+), 7 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 5f55af3d7624750ca416c4177781241b3e291e5d..de329f2c553288935d824748286e79e535929b8b 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1313,6 +1313,8 @@ void migrate_set_state(int *state, int old_state, int new_state)
>>
>>  static void migrate_fd_cleanup(MigrationState *s)
>>  {
>> +    QEMUFile *tmp = NULL;
>> +
>>      g_free(s->hostname);
>>      s->hostname = NULL;
>>      json_writer_free(s->vmdesc);
>> @@ -1321,8 +1323,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>>      qemu_savevm_state_cleanup();
>>
>>      if (s->to_dst_file) {
>> -        QEMUFile *tmp;
>> -
>>          trace_migrate_fd_cleanup();
>>          bql_unlock();
>>          if (s->migration_thread_running) {
>> @@ -1341,15 +1341,14 @@ static void migrate_fd_cleanup(MigrationState *s)
>>           * critical section won't block for long.
>>           */
>>          migration_ioc_unregister_yank_from_file(tmp);
>> -        qemu_fclose(tmp);
>>      }
>>
>> -    /*
>> -     * We already cleaned up to_dst_file, so errors from the return
>> -     * path might be due to that, ignore them.
>> -     */
>>      close_return_path_on_source(s);
>>
>> +    if (tmp) {
>> +        qemu_fclose(tmp);
>> +    }
>> +
>>      assert(!migration_is_active(s));
>>
>>      if (s->state == MIGRATION_STATUS_CANCELLING) {