Peter Xu <pet...@redhat.com> writes:

> On Fri, Nov 10, 2023 at 09:05:41AM -0300, Fabiano Rosas wrote:
>
> [...]
>
>> > Then assuming we have a clear model with all these threads issue fixed (no
>> > matter whether we'd shrink 2N threads into N threads), then what we need to
>> > do, IMHO, is making sure to join() all of them before destroying anything
>> > (say, per-channel MultiFDSendParams).  Then when we destroy everything
>> > safely, either mutex/sem/etc..  Because no one will race us anymore.
>> 
>> This doesn't address the race. There's a data dependency between the
>> multifd channels and the migration thread around the channels_ready
>> semaphore. So we cannot join the migration thread because it could be
>> stuck waiting for the semaphore, which means we cannot join+cleanup the
>> channel thread because the semaphore is still being used.
>
> I think this is the major part of confusion, on why this can happen.
>
> The problem is afaik multifd_save_cleanup() is only called by
> migrate_fd_cleanup(), which is further only called in:
>
>   1) migrate_fd_cleanup_bh()
>   2) migrate_fd_connect()
>
> For 1): it's only run when migration comletes/fails/etc. (in all cases,
> right before it quits..) and then kicks off migrate_fd_cleanup_schedule().
> So migration thread shouldn't be stuck, afaiu, or it won't be able to kick
> that BH.
>
> For 2): it's called by the main thread, where migration thread should have
> not yet been created.
>
> With that, I don't see how migrate_fd_cleanup() would need to worry about
> migration thread
>
> Did I miss something?

There are two points:

1) multifd_new_send_channel_async() doesn't set an Error. Even if
multifd_channel_connect() fails, we'll still continue with
migrate_fd_connect(). I don't see any code that looks at the migration
error (s->error).

2) the TLS handshake thread of one of the channels could simply not get
any chance to run until something else fails and we reach
multifd_save_cleanup() from the BH path.

This second point in particular is why I don't think simply joining the
TLS thread will avoid the race. There's nothing linking the multifd
channels together, we could have 7 of them operational and a 8th one
still going through the TLS handshake.

That said, I'm not sure about the exact path we take to reach the bug
situation. It's very hard to reproduce so I'm relying entirely on code
inspection.

Reply via email to