* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Mon, Nov 29, 2021 at 11:20:08AM +0000, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > > On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
> > > > When doing live migration with multifd channels 8, 16 or larger number,
> > > > the guest hangs in the presence of the network errors such as missing
> > > > TCP ACKs.
> > > >
> > > > At sender's side:
> > > > The main thread is blocked on qemu_thread_join, migration_fd_cleanup
> > > > is called because one thread fails on qio_channel_write_all when
> > > > the network problem happens and other send threads are blocked on
> > > > sendmsg.
> > > > They could not be terminated. So the main thread is blocked on
> > > > qemu_thread_join
> > > > to wait for the threads terminated.
> > >
> > > Isn't the right answer here to ensure we've called 'shutdown' on
> > > all the FDs, so that the threads get kicked out of sendmsg, before
> > > trying to join the thread ?
> >
> > I agree a timeout is wrong here; there is no way to get a good timeout
> > value.
> > However, I'm a bit confused - we should be able to try a shutdown on the
> > receive side using the 'yank' command. - that's what it's there for; Li
> > does this solve your problem?
>
> Why do we even need to use 'yank' on the receive side ? Until migration
> has switched over from src to dst, the receive side is discardable and
> the whole process can just be terminated with kill(SIGTERM/SIGKILL).

True, although it's nice to be able to quit cleanly.

> On the source side 'yank' is needed, because the QEMU process is still
> running the live workload and thus is precious and mustn't be killed.

True.

Dave

> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
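
For anyone following the thread, here is a rough standalone illustration of
the "shutdown before join" idea discussed above. It is not QEMU code and
touches none of the multifd or yank internals; it is a minimal Python sketch
that assumes Linux-like socket behaviour, where shutdown() on a socket wakes
a thread blocked in a send call on it. The names blocked_sender and demo are
made up for this example.

```python
import socket
import threading


def blocked_sender(sock, result):
    """Send until the kernel buffer fills and the call blocks, roughly
    like a send thread stuck in sendmsg on a stalled connection."""
    payload = b"\0" * 65536
    try:
        while True:
            sock.sendall(payload)
    except OSError as exc:
        # shutdown() from another thread is expected to land us here.
        result["error"] = exc


def demo():
    # The peer end (rx) never reads, so the sender blocks once the
    # socket buffer is full.
    tx, rx = socket.socketpair()
    result = {}
    sender = threading.Thread(
        target=blocked_sender, args=(tx, result), daemon=True
    )
    sender.start()

    # Give the sender time to fill the buffer; a plain join cannot
    # complete at this point because the thread is stuck in send.
    sender.join(timeout=0.5)
    still_blocked = sender.is_alive()

    # Kick the thread out of the blocking send, then join it.
    tx.shutdown(socket.SHUT_RDWR)
    sender.join(timeout=5)
    joined = not sender.is_alive()

    tx.close()
    rx.close()
    return still_blocked, joined, result.get("error")


if __name__ == "__main__":
    print(demo())
```

The ordering is the point: shutdown on every FD first, join second. The
join timeouts here only exist to keep the demo from hanging if the
assumption about shutdown() does not hold on a given platform; they are not
a suggestion to add a timeout to the real join.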