Peter Xu <pet...@redhat.com> writes: > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: >> On 3.02.2025 21:20, Peter Xu wrote: >> > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: >> > > On 3.02.2025 19:20, Peter Xu wrote: >> > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: >> > > > > From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com> >> > > > > >> > > > > Multifd send channels are terminated by calling >> > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in >> > > > > multifd_send_terminate_threads(), which in the TLS case essentially >> > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. >> > > > > >> > > > > Unfortunately, this does not terminate the TLS session properly and >> > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. >> > > > > >> > > > > The only reason why this wasn't causing migration failures is because >> > > > > the current migration code apparently does not check for migration >> > > > > error being set after the end of the multifd receive process. >> > > > > >> > > > > However, this will change soon so the multifd receive code has to be >> > > > > prepared to not return an error on such premature TLS session EOF. >> > > > > Use the newly introduced QIOChannelTLS method for that. >> > > > > >> > > > > It's worth noting that even if the sender were to be changed to >> > > > > terminate >> > > > > the TLS connection properly the receive side still needs to remain >> > > > > compatible with older QEMU bit stream which does not do this. >> > > > >> > > > If this is an existing bug, we could add a Fixes. >> > > >> > > It is an existing issue but only uncovered by this patch set. >> > > >> > > As far as I can see it was always there, so it would need some >> > > thought where to point that Fixes tag. >> > >> > If there's no way to trigger a real functional bug anyway, it's also ok we >> > omit the Fixes. >> > >> > > > Two pure questions.. >> > > > >> > > > - What is the correct way to terminate the TLS session without >> > > > this flag? >> > > >> > > I guess one would need to call gnutls_bye() like in this GnuTLS example: >> > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 >> > > >> > > > - Why this is only needed by multifd sessions? >> > > >> > > What uncovered the issue was switching the load threads to using >> > > migrate_set_error() instead of their own result variable >> > > (load_threads_ret) which you had requested during the previous >> > > patch set version review: >> > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ >> > > >> > > Turns out that the multifd receive code always returned >> > > error in the TLS case, just nothing was previously checking for >> > > that error presence. >> > >> > What I was curious is whether this issue also exists for the main migration >> > channel when with tls, especially when e.g. multifd not enabled at all. As >> > I don't see anywhere that qemu uses gnutls_bye() for any tls session. >> > >> > I think it's a good to find that we overlooked this before.. and IMHO it's >> > always good we could fix this. >> > >> > Does it mean we need proper gnutls_bye() somewhere? >> > >> > If we need an explicit gnutls_bye(), then I wonder if that should be done >> > on the main channel as well. >> >> That's a good question and looking at the code qemu_loadvm_state_main() exits >> on receiving "QEMU_VM_EOF" section (that's different from receiving socket >> EOF) >> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit >> size >> in qemu_loadvm_state() - so still not until channel EOF. > > I had a closer look, I do feel like such pre-mature termination is caused > by explicit shutdown()s of the iochannels, looks like that can cause issue > even after everything is sent. Then I noticed indeed multifd sender > iochannels will get explicit shutdown()s since commit 077fbb5942, while we > don't do that for the main channel. Maybe that is a major difference. > > Now I wonder whether we should shutdown() the channel at all if migration > succeeded, because looks like it can cause tls session to interrupt even if > the shutdown() is done after sent everything, and if so it'll explain why > you hit the issue with tls. > >> >> Then I can't see anything else reading the channel until it is closed in >> migration_incoming_state_destroy(). >> >> So most likely the main migration channel will never read far enough to >> reach that GNUTLS_E_PREMATURE_TERMINATION error. >> >> > If we don't need gnutls_bye(), then should we always ignore pre-mature >> > termination of tls no matter if it's multifd or non-multifd channel (or >> > even a tls session that is not migration-related)? >> >> So basically have this patch extended to calling >> qio_channel_tls_set_premature_eof_okay() also on the main migration channel? > > If above theory can stand, then eof-okay could be a workaround papering > over the real problem that we shouldn't always shutdown().. > > Could you have a look at below patch and see whether it can fix the problem > you hit too, in replace of these two patches (including the previous > iochannel change)? > > Thanks, > > ===8<=== > From 3147084174b0e0bda076ad205ae139f8fc433892 Mon Sep 17 00:00:00 2001 > From: Peter Xu <pet...@redhat.com> > Date: Mon, 3 Feb 2025 17:27:45 -0500 > Subject: [PATCH] migration: Avoid shutdown multifd channels if migration > succeeded > MIME-Version: 1.0 > Content-Type: text/plain; charset=UTF-8 > Content-Transfer-Encoding: 8bit > > Multifd channels behave differently from the main channel when shutting > down: the sender side will always shutdown() on the multifd iochannels, no > matter whether migration succeeded or not. QEMU doesn't do that on src. > > Such behavior was introduced in commit 077fbb5942 ("multifd: Shut down the > QIO channels to avoid blocking the send threads when they are terminated.") > to fix a hang issue when multifd enabled.
I'm always skeptical of "hangs" being fixed with shutdown(), when the multifd code has had multiple issues with incorrect locking and ordering of thread creation/deletion. > > This might be problematic though, especially in TLS context, because it > looks like such shutdown() on src (even if succeeded) could cause > destination multifd iochannels to receive pre-mature terminations of TLS > sessions. Speaking of TLS, enabling asan and running the TLS tests crashes QEMU in a very obscure way. May or may not be related to thread termination issues. I've been skipping postcopy/recovery/tls/psk when doing asan builds for a while now. Just FYI, I'm not asking we take action on this and I'm keeping an eye on it. > > It's debatable whether such shutdown() should be explicitly done even for a > succeeded migration. This patch moves the shutdown() instead from > finalization phase into qmp_migrate_cancel(), so that we only do the > shutdown() when cancels, and we should avoid such when it succeeds. > > When at it, keep all the shutdown() code together, moving the return path > shutdown() (which seems a bit redundant, but no harm to do) to where the > rest channels are shutdown. > > Cc: Li Zhang <lizh...@suse.de> This will bounce, she's no longer at SUSE. > Cc: Dr. David Alan Gilbert <d...@treblig.org> > Cc: Daniel P. Berrangé <berra...@redhat.com> > Reported-by: Maciej S. Szmigiero <m...@maciej.szmigiero.name> > Signed-off-by: Peter Xu <pet...@redhat.com> > --- > migration/multifd.h | 1 + > migration/migration.c | 24 +++++++++++++++++------- > migration/multifd.c | 14 +++++++++++--- > 3 files changed, 29 insertions(+), 10 deletions(-) > > diff --git a/migration/multifd.h b/migration/multifd.h > index bd785b9873..26ef94ac93 100644 > --- a/migration/multifd.h > +++ b/migration/multifd.h > @@ -340,6 +340,7 @@ static inline void > multifd_send_prepare_header(MultiFDSendParams *p) > > void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc); > bool multifd_send(MultiFDSendData **send_data); > +void multifd_send_shutdown_iochannels(void); > MultiFDSendData *multifd_send_data_alloc(void); > > static inline uint32_t multifd_ram_page_size(void) > diff --git a/migration/migration.c b/migration/migration.c > index 74c50cc72c..e43f8222dc 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -1565,13 +1565,6 @@ static void migrate_fd_cancel(MigrationState *s) > > trace_migrate_fd_cancel(); > > - WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) { > - if (s->rp_state.from_dst_file) { > - /* shutdown the rp socket, so causing the rp thread to shutdown > */ > - qemu_file_shutdown(s->rp_state.from_dst_file); > - } > - } > - > do { > old_state = s->state; > if (!migration_is_running()) { > @@ -1594,6 +1587,23 @@ static void migrate_fd_cancel(MigrationState *s) > if (s->to_dst_file) { > qemu_file_shutdown(s->to_dst_file); > } > + /* > + * Above should work already, because the iochannel is shared > + * between outgoing and return path qemufiles, however just to > + * be on the safe side to set qemufile error on return path too > + * if existed. > + */ > + if (s->rp_state.from_dst_file) { > + qemu_file_shutdown(s->rp_state.from_dst_file); > + } > + } > + > + /* > + * We need to shutdown multifd channels too if they are available, > + * to make sure no multifd send threads will be stuck at syscalls. > + */ > + if (migrate_multifd()) { > + multifd_send_shutdown_iochannels(); > } > } > > diff --git a/migration/multifd.c b/migration/multifd.c > index ab73d6d984..96bcbb1e0c 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -384,6 +384,17 @@ static void multifd_send_set_error(Error *err) > } > } > > +void multifd_send_shutdown_iochannels(void) > +{ > + QIOChannel *c; > + int i; > + > + for (i = 0; i < migrate_multifd_channels(); i++) { > + c = multifd_send_state->params[i].c; > + qio_channel_shutdown(c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL); > + } > +} > + > static void multifd_send_terminate_threads(void) > { > int i; > @@ -404,9 +415,6 @@ static void multifd_send_terminate_threads(void) > MultiFDSendParams *p = &multifd_send_state->params[i]; > > qemu_sem_post(&p->sem); > - if (p->c) { > - qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL); > - } > } > > /* > -- > 2.47.0