On Fri, Nov 10, 2023 at 09:05:41AM -0300, Fabiano Rosas wrote: [...]
> > Then assuming we have a clear model with all these threads issue fixed (no
> > matter whether we'd shrink 2N threads into N threads), then what we need to
> > do, IMHO, is making sure to join() all of them before destroying anything
> > (say, per-channel MultiFDSendParams). Then when we destroy everything
> > safely, either mutex/sem/etc.. Because no one will race us anymore.
>
> This doesn't address the race. There's a data dependency between the
> multifd channels and the migration thread around the channels_ready
> semaphore. So we cannot join the migration thread because it could be
> stuck waiting for the semaphore, which means we cannot join+cleanup the
> channel thread because the semaphore is still being used.

I think this is the main source of confusion: why this can happen at all.

The thing is, afaik multifd_save_cleanup() is only called by
migrate_fd_cleanup(), which in turn is only called from:

  1) migrate_fd_cleanup_bh()
  2) migrate_fd_connect()

For 1): it only runs when migration completes/fails/etc. (in all cases,
right before the migration thread quits) and then kicks off
migrate_fd_cleanup_schedule(). So the migration thread shouldn't be stuck,
afaiu, or it wouldn't be able to kick that BH.

For 2): it's called by the main thread, at a point where the migration
thread should not have been created yet.

With that, I don't see how migrate_fd_cleanup() would need to worry about
the migration thread.

Did I miss something?

Thanks,

--
Peter Xu