On Tue, Mar 05, 2024 at 04:56:29PM -0300, Fabiano Rosas wrote:
> Commit bc38feddeb ("io: fsync before closing a file channel") added a
> fsync/fdatasync at the closing point of the QIOChannelFile to ensure
> integrity of the migration stream in case of QEMU crash.
>
> The decision to do the sync at qio_channel_close() was not the best
> since that function runs in the main thread and the fsync can cause
> QEMU to hang for several minutes, depending on the migration size and
> disk speed.
>
> To fix the hang, remove the fsync from qio_channel_file_close().
>
> At this moment, the migration code is the only user of the fsync and
> we're taking the tradeoff of not having a sync at all, leaving the
> responsibility to the upper layers.
>
> Fixes: bc38feddeb ("io: fsync before closing a file channel")
> Reviewed-by: Daniel P. Berrangé <berra...@redhat.com>
> Signed-off-by: Fabiano Rosas <faro...@suse.de>
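For illustration only (this is not part of the patch and not QEMU code): a
minimal sketch of what "leaving the responsibility to the upper layers" could
look like, assuming a caller that wants the data flushed but does not want the
flush to run in its main thread. It uses plain POSIX calls and pthreads only;
struct sync_job, sync_worker() and write_and_sync_offthread() are hypothetical
names invented for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical job descriptor for this sketch; not a QEMU type. */
struct sync_job {
    int fd;     /* file descriptor to flush */
    int result; /* fsync() return value, filled in by the worker */
};

/* Worker thread: the potentially slow flush runs here, off the caller's thread. */
static void *sync_worker(void *opaque)
{
    struct sync_job *job = opaque;

    job->result = fsync(job->fd);
    return NULL;
}

/*
 * Write len bytes of buf to path, flush the file from a worker thread,
 * then close it. Returns 0 on success and -1 on any failure.
 */
static int write_and_sync_offthread(const char *path, const char *buf,
                                    size_t len)
{
    struct sync_job job;
    pthread_t tid;
    int fd, ret;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return -1;
    }
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    job.fd = fd;
    job.result = -1;
    if (pthread_create(&tid, NULL, sync_worker, &job) != 0) {
        close(fd);
        return -1;
    }

    /*
     * This sketch joins right away to stay short. An event-driven caller
     * would keep servicing its main loop and only collect the result once
     * the worker signals completion.
     */
    pthread_join(tid, NULL);

    ret = job.result;
    if (close(fd) < 0) {
        ret = -1;
    }
    return ret;
}
```

The point of the sketch is only the ordering: the sync happens before the
close, and it happens in a thread that is allowed to block. Where such a sync
would actually belong in the migration code is not decided by this patch.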

Since 9.0 is approaching and it's important that we avoid such a hang, I queued
this version. However, to make sure we can still remember why we do this a
few years from now, I added a detailed comment, which I will squash into this
patch:

=======
diff --git a/migration/multifd.c b/migration/multifd.c
index 0a8fef046b..bf9d483f7a 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -714,6 +714,22 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
          * released because finalize() of the iochannel is only
          * triggered on the last reference and it's not guaranteed
          * that we always hold the last refcount when reaching here.
+         *
+         * Closing the fd explicitly has the benefit that if there is any
+         * registered I/O handler callbacks on such fd, that will get a
+         * POLLNVAL event and will further trigger the cleanup to finally
+         * release the IOC.
+         *
+         * FIXME: It should logically be guaranteed that all multifd
+         * channels have no I/O handler callback registered when reaching
+         * here, because migration thread will wait for all multifd channel
+         * establishments to complete during setup.  Since
+         * migrate_fd_cleanup() will be scheduled in main thread too, all
+         * previous callbacks should guarantee to be completed when
+         * reaching here.  See multifd_send_state.channels_created and its
+         * usage.  In the future, we could replace this with an assert
+         * making sure we're the last reference, or simply drop it if above
+         * is more clear to be justified.
          */
         qio_channel_close(p->c, &error_abort);
         object_unref(OBJECT(p->c));
========

Thanks,

-- 
Peter Xu