Peter Xu <pet...@redhat.com> writes: > On Fri, Nov 10, 2023 at 09:05:41AM -0300, Fabiano Rosas wrote: > > [...] > >> > Then assuming we have a clear model with all these threads issue fixed (no >> > matter whether we'd shrink 2N threads into N threads), then what we need to >> > do, IMHO, is making sure to join() all of them before destroying anything >> > (say, per-channel MultiFDSendParams). Then when we destroy everything >> > safely, either mutex/sem/etc.. Because no one will race us anymore. >> >> This doesn't address the race. There's a data dependency between the >> multifd channels and the migration thread around the channels_ready >> semaphore. So we cannot join the migration thread because it could be >> stuck waiting for the semaphore, which means we cannot join+cleanup the >> channel thread because the semaphore is still being used. > > I think this is the major part of confusion, on why this can happen. > > The problem is afaik multifd_save_cleanup() is only called by > migrate_fd_cleanup(), which is further only called in: > > 1) migrate_fd_cleanup_bh() > 2) migrate_fd_connect() > > For 1): it's only run when migration comletes/fails/etc. (in all cases, > right before it quits..) and then kicks off migrate_fd_cleanup_schedule(). > So migration thread shouldn't be stuck, afaiu, or it won't be able to kick > that BH. > > For 2): it's called by the main thread, where migration thread should have > not yet been created. > > With that, I don't see how migrate_fd_cleanup() would need to worry about > migration thread > > Did I miss something?
There are two points: 1) multifd_new_send_channel_async() doesn't set an Error. Even if multifd_channel_connect() fails, we'll still continue with migrate_fd_connect(). I don't see any code that looks at the migration error (s->error). 2) the TLS handshake thread of one of the channels could simply not get any chance to run until something else fails and we reach multifd_save_cleanup() from the BH path. This second point in particular is why I don't think simply joining the TLS thread will avoid the race. There's nothing linking the multifd channels together, we could have 7 of them operational and a 8th one still going through the TLS handshake. That said, I'm not sure about the exact path we take to reach the bug situation. It's very hard to reproduce so I'm relying entirely on code inspection.