On Wed, Jul 21, 2021 at 10:55:00AM +0100, Dr. David Alan Gilbert wrote: > * Peter Xu (pet...@redhat.com) wrote: > > We have a logic in await_return_path_close_on_source() that we will > > explicitly > > shutdown the socket when migration encounters errors. However it could be > > racy > > because from_dst_file could have been reset right after checking it but > > before > > passing it to qemu_file_shutdown() by the rp_thread. > > > > Fix it by shutdown() on the src file instead. Since they must be a pair of > > qemu files, shutdown on either of them will work the same. > > > > Since at it, drop the check for from_dst_file directly, which makes the > > behavior even more predictable. > > So while the existing code maybe racy, I'm not sure that this change > keeps the semantics; the channel may well have dup()'d the fd's for the > two directions, and I'm not convinced that a shutdown() on one will > necessarily impact the other; and if the shutdown doesn't happen the > rp_thread might not exit, and we might block on the koin.
dup() seems fine as long as the backend file/socket is the same; but I get the point here, e.g., potentially from/to channels can indeed be different ones. > > Why don't we solve this a different way - how about we move the: > ms->rp_state.from_dst_file = NULL; > qemu_fclose(rp); > > out of the source_return_path_thread and put it in > await_return_path_close_on_source, immediately after the join? > Then we *know* that the the rp thread isn't messing with it. Yes that looks working for this special case of when rp_thread quits. It's just that it'll make things a bit more complicated, previously from_dst_file is only reset in rp_thread (for either paused or completed migration), now we handle "migration completes" a bit different. This also reminded me that maybe we can also use the qemu_file_lock mutex as we use to try protect access to to_dst_file, because I think from_dst_file is potentially racy in the same way. Even if we moved the reset to migration thread, we still have e.g. migrate_fd_cancel() calling: if (s->rp_state.from_dst_file) { qemu_file_shutdown(s->rp_state.from_dst_file); } I think that is also racy too when running in the main thread, as either the migration thread or rp_thread could have reset it right after the check. So.. maybe I start to use the qemu_file_lock to cover from_dst_file cases too? Then I'll cover at leasd both migrate_fd_cancel() and this case. Thanks, -- Peter Xu