On Fri, Feb 07, 2025 at 10:17:19AM -0300, Fabiano Rosas wrote: > Peter Xu <pet...@redhat.com> writes: > > > On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote: > >> > In any case we'd still need some kind of a compatibility behavior for > >> > the TLS bit stream emitted by older QEMU versions (which is always > >> > improperly terminated). > >> > > >> > >> There is no compat issue. For <= 9.2, QEMU is still doing an extra > >> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on > >> the destination and it gets stuck waiting for the > >> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always > >> closes the connection before dst reaches the extra recv(). > >> > >> I test migration both ways with 2 previous QEMU versions and the > >> gnutls_bye() series passes all tests. I also put an assert at > >> tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The > >> MULTIFD_FLAG_EOS should behave the same. > > > > Which are the versions you tried? As only 9.1 and 9.2 has 637280aeb2, so I > > wonder if the same issue would hit too with 9.0 or older. > > Good point. 9.0 indeed breaks. > > > > > I'd confess I feel unreliable relying on the side effect of 637280aeb2, > > because fundamentally it works based on the fact that multifd threads need > > to be kicked out by the main load thread SYNC event on dest QEMU to avoid > > the readv() from going wrong. > > > > We're relying on the opposite: mutlifd_recv NOT getting kicked. Which is > a bug that 1d457daf86 fixed. > > > What I'm not sure here is, is it sheer luck that the main channel SYNC will > > always arrive _before_ pre-mature terminations of the multifd channels? It > > sounds like it could also happen when the multifd channels got its > > pre-mature termination early, before the main thread got the SYNC. > > You lost me here, what main channel sync? Its the MULTIFD_FLAG_SYNC that > puts the recv thread in the "won't see the termination" state and that > is serialized: > > SEND RECV > -------------------------+---------------------------- > 1 multifd_send_sync_main() > 2 pending_sync==true, > 3 send thread sends SYNC recv thread gets SYNC > 4 <some work> recv gets stuck. > 5 multifd_send_shutdown() <time passes> > 6 shutdown() multifd_recv_shutdown() > recv_terminate_threads() > recv exits without recv() > > In other words, RECV would need to see the shutdown (6) before the SYNC > (3), which I don't think it possible.
Ah yeah, I somehow remembered we sent a SYNC in the main channel but forgot to push the per-channel SYNC. I got it the other way round. Yeah if data is always ordered with shutdown() effect on recv then it seems in order. > > > > > Maybe we still need a compat property at the end.. > > This is actually similar to preempt_pre_7_2, what about: > > /* > * This variable only makes sense when set on the machine that is > * the destination of a multifd migration with TLS enabled. It > * affects the behavior of the last send->recv iteration with > * regards to termination of the TLS session. Defaults to true. > * > * When set: > * > * - the destination QEMU instance can expect to never get a > * GNUTLS_E_PREMATURE_TERMINATION error. Manifested as the error > * message: "The TLS connection was non-properly terminated". > * > * When clear: > * > * - the destination QEMU instance can expect to see a > * GNUTLS_E_PREMATURE_TERMINATION error in any multifd channel > * whenever the last recv() call of that channel happens after > * the source QEMU instance has already issued shutdown() on the > * channel. This is affected by (at least) commits 637280aeb2 > * and 1d457daf86. If we want to reference them after all, we could use another sentence to describe the effects: * Commit 637280aeb2 (since 9.1) introduced a side effect to cause * pre-mature termination not happen, while commit 1d457daf86 * (since 10.0) can unexpectedly re-expose the pre-mature * termination issue. > * > * NOTE: Regardless of the state of this option, a premature > * termination of the TLS connection might happen due to error at > * any moment prior to the last send->recv iteration. > */ > bool multifd_clean_tls_termination; > > And I think the more straight-forward implementation is to incorporate > Maciej's premature_ok patches (in some form), otherwise that option will > have to take effect on the QIOChannel which is a layering violation. If we take Dan's comment into account: https://lore.kernel.org/r/z6i86e-hzjalx...@redhat.com It means whenever multifd recv thread invokes the iochannel API it will use multifd_clean_tls_termination to decide QIO_CHANNEL_READ_RELAXED_EOF flag to pass in. I hope this is not layer violation, or I could miss something.. So if we're on the same page we need that knob, to make this series easier we could make it two steps: - Step 1: introduce the parameter and QIO_CHANNEL_READ_RELAXED_EOF, set it default to false. - Step 2: Your other RFC series to implement gnutls_bye(), at last make it a compat property and switch default true. Then Maciej only needs step 1, it looks to me. -- Peter Xu