"Maciej S. Szmigiero" <m...@maciej.szmigiero.name> writes:

> On 3.02.2025 23:56, Peter Xu wrote:
>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>> On 3.02.2025 21:20, Peter Xu wrote:
>>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>>>> On 3.02.2025 19:20, Peter Xu wrote:
>>>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com>
>>>>>>>
>>>>>>> Multifd send channels are terminated by calling
>>>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>>>>
>>>>>>> Unfortunately, this does not terminate the TLS session properly and
>>>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>
>>>>>>> The only reason why this wasn't causing migration failures is because
>>>>>>> the current migration code apparently does not check for migration
>>>>>>> error being set after the end of the multifd receive process.
>>>>>>>
>>>>>>> However, this will change soon so the multifd receive code has to be
>>>>>>> prepared to not return an error on such premature TLS session EOF.
>>>>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>>>>
>>>>>>> It's worth noting that even if the sender were to be changed to 
>>>>>>> terminate
>>>>>>> the TLS connection properly the receive side still needs to remain
>>>>>>> compatible with older QEMU bit stream which does not do this.
>>>>>>
>>>>>> If this is an existing bug, we could add a Fixes.
>>>>>
>>>>> It is an existing issue but only uncovered by this patch set.
>>>>>
>>>>> As far as I can see it was always there, so it would need some
>>>>> thought where to point that Fixes tag.
>>>>
>>>> If there's no way to trigger a real functional bug anyway, it's also ok we
>>>> omit the Fixes.
>>>>
>>>>>> Two pure questions..
>>>>>>
>>>>>>      - What is the correct way to terminate the TLS session without this 
>>>>>> flag?
>>>>>
>>>>> I guess one would need to call gnutls_bye() like in this GnuTLS example:
>>>>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>>>>
>>>>>>      - Why this is only needed by multifd sessions?
>>>>>
>>>>> What uncovered the issue was switching the load threads to using
>>>>> migrate_set_error() instead of their own result variable
>>>>> (load_threads_ret) which you had requested during the previous
>>>>> patch set version review:
>>>>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>>>>
>>>>> Turns out that the multifd receive code always returned
>>>>> error in the TLS case, just nothing was previously checking for
>>>>> that error presence.
>>>>
>>>> What I was curious is whether this issue also exists for the main migration
>>>> channel when with tls, especially when e.g. multifd not enabled at all.  As
>>>> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>>>>
>>>> I think it's a good to find that we overlooked this before.. and IMHO it's
>>>> always good we could fix this.
>>>>
>>>> Does it mean we need proper gnutls_bye() somewhere?
>>>>
>>>> If we need an explicit gnutls_bye(), then I wonder if that should be done
>>>> on the main channel as well.
>>>
>>> That's a good question and looking at the code qemu_loadvm_state_main() 
>>> exits
>>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket 
>>> EOF)
>>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit 
>>> size
>>> in qemu_loadvm_state() - so still not until channel EOF.
>> 
>> I had a closer look, I do feel like such pre-mature termination is caused
>> by explicit shutdown()s of the iochannels, looks like that can cause issue
>> even after everything is sent.  Then I noticed indeed multifd sender
>> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>> don't do that for the main channel.  Maybe that is a major difference.
>> 
>> Now I wonder whether we should shutdown() the channel at all if migration
>> succeeded, because looks like it can cause tls session to interrupt even if
>> the shutdown() is done after sent everything, and if so it'll explain why
>> you hit the issue with tls.
>> 
>>>
>>> Then I can't see anything else reading the channel until it is closed in
>>> migration_incoming_state_destroy().
>>>
>>> So most likely the main migration channel will never read far enough to
>>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>>
>>>> If we don't need gnutls_bye(), then should we always ignore pre-mature
>>>> termination of tls no matter if it's multifd or non-multifd channel (or
>>>> even a tls session that is not migration-related)?
>>>
>>> So basically have this patch extended to calling
>>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>> 
>> If above theory can stand, then eof-okay could be a workaround papering
>> over the real problem that we shouldn't always shutdown()..
>> 
>> Could you have a look at below patch and see whether it can fix the problem
>> you hit too, in replace of these two patches (including the previous
>> iochannel change)?
>> 
>
> Unfortunately, the patch below does not fix the problem:
>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was 
>> non-properly terminated.
>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was 
>> non-properly terminated.
>
> I think that, even in the absence of shutdown(), if the sender does not
> call gnutls_bye() the TLS session is considered improperly terminated.
>

I havent't looked much further into this, but can we craft a reproducer
for it with current master code? It would help us take a look at this
problem independently of this series. Even an assert somewhere would
help.

>> Thanks,
>> 
> Thanks,
> Maciej

Reply via email to