"Maciej S. Szmigiero" <m...@maciej.szmigiero.name> writes: > On 3.02.2025 23:56, Peter Xu wrote: >> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: >>> On 3.02.2025 21:20, Peter Xu wrote: >>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: >>>>> On 3.02.2025 19:20, Peter Xu wrote: >>>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: >>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com> >>>>>>> >>>>>>> Multifd send channels are terminated by calling >>>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in >>>>>>> multifd_send_terminate_threads(), which in the TLS case essentially >>>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket. >>>>>>> >>>>>>> Unfortunately, this does not terminate the TLS session properly and >>>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. >>>>>>> >>>>>>> The only reason why this wasn't causing migration failures is because >>>>>>> the current migration code apparently does not check for migration >>>>>>> error being set after the end of the multifd receive process. >>>>>>> >>>>>>> However, this will change soon so the multifd receive code has to be >>>>>>> prepared to not return an error on such premature TLS session EOF. >>>>>>> Use the newly introduced QIOChannelTLS method for that. >>>>>>> >>>>>>> It's worth noting that even if the sender were to be changed to >>>>>>> terminate >>>>>>> the TLS connection properly the receive side still needs to remain >>>>>>> compatible with older QEMU bit stream which does not do this. >>>>>> >>>>>> If this is an existing bug, we could add a Fixes. >>>>> >>>>> It is an existing issue but only uncovered by this patch set. >>>>> >>>>> As far as I can see it was always there, so it would need some >>>>> thought where to point that Fixes tag. >>>> >>>> If there's no way to trigger a real functional bug anyway, it's also ok we >>>> omit the Fixes. >>>> >>>>>> Two pure questions.. >>>>>> >>>>>> - What is the correct way to terminate the TLS session without this >>>>>> flag? >>>>> >>>>> I guess one would need to call gnutls_bye() like in this GnuTLS example: >>>>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 >>>>> >>>>>> - Why this is only needed by multifd sessions? >>>>> >>>>> What uncovered the issue was switching the load threads to using >>>>> migrate_set_error() instead of their own result variable >>>>> (load_threads_ret) which you had requested during the previous >>>>> patch set version review: >>>>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ >>>>> >>>>> Turns out that the multifd receive code always returned >>>>> error in the TLS case, just nothing was previously checking for >>>>> that error presence. >>>> >>>> What I was curious is whether this issue also exists for the main migration >>>> channel when with tls, especially when e.g. multifd not enabled at all. As >>>> I don't see anywhere that qemu uses gnutls_bye() for any tls session. >>>> >>>> I think it's a good to find that we overlooked this before.. and IMHO it's >>>> always good we could fix this. >>>> >>>> Does it mean we need proper gnutls_bye() somewhere? >>>> >>>> If we need an explicit gnutls_bye(), then I wonder if that should be done >>>> on the main channel as well. >>> >>> That's a good question and looking at the code qemu_loadvm_state_main() >>> exits >>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket >>> EOF) >>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit >>> size >>> in qemu_loadvm_state() - so still not until channel EOF. >> >> I had a closer look, I do feel like such pre-mature termination is caused >> by explicit shutdown()s of the iochannels, looks like that can cause issue >> even after everything is sent. Then I noticed indeed multifd sender >> iochannels will get explicit shutdown()s since commit 077fbb5942, while we >> don't do that for the main channel. Maybe that is a major difference. >> >> Now I wonder whether we should shutdown() the channel at all if migration >> succeeded, because looks like it can cause tls session to interrupt even if >> the shutdown() is done after sent everything, and if so it'll explain why >> you hit the issue with tls. >> >>> >>> Then I can't see anything else reading the channel until it is closed in >>> migration_incoming_state_destroy(). >>> >>> So most likely the main migration channel will never read far enough to >>> reach that GNUTLS_E_PREMATURE_TERMINATION error. >>> >>>> If we don't need gnutls_bye(), then should we always ignore pre-mature >>>> termination of tls no matter if it's multifd or non-multifd channel (or >>>> even a tls session that is not migration-related)? >>> >>> So basically have this patch extended to calling >>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel? >> >> If above theory can stand, then eof-okay could be a workaround papering >> over the real problem that we shouldn't always shutdown().. >> >> Could you have a look at below patch and see whether it can fix the problem >> you hit too, in replace of these two patches (including the previous >> iochannel change)? >> > > Unfortunately, the patch below does not fix the problem: >> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was >> non-properly terminated. >> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was >> non-properly terminated. > > I think that, even in the absence of shutdown(), if the sender does not > call gnutls_bye() the TLS session is considered improperly terminated. >
I havent't looked much further into this, but can we craft a reproducer for it with current master code? It would help us take a look at this problem independently of this series. Even an assert somewhere would help. >> Thanks, >> > Thanks, > Maciej