On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: > On 3.02.2025 23:56, Peter Xu wrote: > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: > > > On 3.02.2025 21:20, Peter Xu wrote: > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: > > > > > On 3.02.2025 19:20, Peter Xu wrote: > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com> > > > > > > > > > > > > > > Multifd send channels are terminated by calling > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > > > > > > > multifd_send_terminate_threads(), which in the TLS case > > > > > > > essentially > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. > > > > > > > > > > > > > > Unfortunately, this does not terminate the TLS session properly > > > > > > > and > > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION > > > > > > > error. > > > > > > > > > > > > > > The only reason why this wasn't causing migration failures is > > > > > > > because > > > > > > > the current migration code apparently does not check for migration > > > > > > > error being set after the end of the multifd receive process. > > > > > > > > > > > > > > However, this will change soon so the multifd receive code has to > > > > > > > be > > > > > > > prepared to not return an error on such premature TLS session EOF. > > > > > > > Use the newly introduced QIOChannelTLS method for that. > > > > > > > > > > > > > > It's worth noting that even if the sender were to be changed to > > > > > > > terminate > > > > > > > the TLS connection properly the receive side still needs to remain > > > > > > > compatible with older QEMU bit stream which does not do this. > > > > > > > > > > > > If this is an existing bug, we could add a Fixes. > > > > > > > > > > It is an existing issue but only uncovered by this patch set. > > > > > > > > > > As far as I can see it was always there, so it would need some > > > > > thought where to point that Fixes tag. > > > > > > > > If there's no way to trigger a real functional bug anyway, it's also ok > > > > we > > > > omit the Fixes. > > > > > > > > > > Two pure questions.. > > > > > > > > > > > > - What is the correct way to terminate the TLS session without > > > > > > this flag? > > > > > > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS > > > > > example: > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 > > > > > > > > > > > - Why this is only needed by multifd sessions? > > > > > > > > > > What uncovered the issue was switching the load threads to using > > > > > migrate_set_error() instead of their own result variable > > > > > (load_threads_ret) which you had requested during the previous > > > > > patch set version review: > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ > > > > > > > > > > Turns out that the multifd receive code always returned > > > > > error in the TLS case, just nothing was previously checking for > > > > > that error presence. > > > > > > > > What I was curious is whether this issue also exists for the main > > > > migration > > > > channel when with tls, especially when e.g. multifd not enabled at all. > > > > As > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session. > > > > > > > > I think it's a good to find that we overlooked this before.. and IMHO > > > > it's > > > > always good we could fix this. > > > > > > > > Does it mean we need proper gnutls_bye() somewhere? > > > > > > > > If we need an explicit gnutls_bye(), then I wonder if that should be > > > > done > > > > on the main channel as well. > > > > > > That's a good question and looking at the code qemu_loadvm_state_main() > > > exits > > > on receiving "QEMU_VM_EOF" section (that's different from receiving > > > socket EOF) > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit > > > size > > > in qemu_loadvm_state() - so still not until channel EOF. > > > > I had a closer look, I do feel like such pre-mature termination is caused > > by explicit shutdown()s of the iochannels, looks like that can cause issue > > even after everything is sent. Then I noticed indeed multifd sender > > iochannels will get explicit shutdown()s since commit 077fbb5942, while we > > don't do that for the main channel. Maybe that is a major difference. > > > > Now I wonder whether we should shutdown() the channel at all if migration > > succeeded, because looks like it can cause tls session to interrupt even if > > the shutdown() is done after sent everything, and if so it'll explain why > > you hit the issue with tls. > > > > > > > > Then I can't see anything else reading the channel until it is closed in > > > migration_incoming_state_destroy(). > > > > > > So most likely the main migration channel will never read far enough to > > > reach that GNUTLS_E_PREMATURE_TERMINATION error. > > > > > > > If we don't need gnutls_bye(), then should we always ignore pre-mature > > > > termination of tls no matter if it's multifd or non-multifd channel (or > > > > even a tls session that is not migration-related)? > > > > > > So basically have this patch extended to calling > > > qio_channel_tls_set_premature_eof_okay() also on the main migration > > > channel? > > > > If above theory can stand, then eof-okay could be a workaround papering > > over the real problem that we shouldn't always shutdown().. > > > > Could you have a look at below patch and see whether it can fix the problem > > you hit too, in replace of these two patches (including the previous > > iochannel change)? > > > > Unfortunately, the patch below does not fix the problem: > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was > > non-properly terminated. > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was > > non-properly terminated. > > I think that, even in the absence of shutdown(), if the sender does not > call gnutls_bye() the TLS session is considered improperly terminated.
Ah.. How about one more change on top of above change to disconnect properly for TLS? Something like gnutls_bye() in qio_channel_tls_close(), would that make sense to you? In general, I think it'll be good we fix this from the source rather than bypassing an error reported by gnutls facilities. We'll see what we can do to help (this includes Fabiano), but the hope is we can figure out the right way soon. Thanks, -- Peter Xu