* Li Zhang (lizh...@suse.de) wrote:
>
> On 11/29/21 12:20 PM, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > > On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
> > > > When doing live migration with multifd channels 8, 16 or larger number,
> > > > the guest hangs in the presence of the network errors such as missing
> > > > TCP ACKs.
> > > >
> > > > At sender's side:
> > > > The main thread is blocked on qemu_thread_join, migration_fd_cleanup
> > > > is called because one thread fails on qio_channel_write_all when
> > > > the network problem happens and other send threads are blocked on
> > > > sendmsg.
> > > > They could not be terminated. So the main thread is blocked on
> > > > qemu_thread_join to wait for the threads terminated.
> > > Isn't the right answer here to ensure we've called 'shutdown' on
> > > all the FDs, so that the threads get kicked out of sendmsg, before
> > > trying to join the thread ?
> > I agree a timeout is wrong here; there is no way to get a good timeout
> > value.
> > However, I'm a bit confused - we should be able to try a shutdown on the
> > receive side using the 'yank' command. - that's what it's there for; Li
> > does this solve your problem?
>
> No, I tried to register 'yank' on the receive side, the receive threads are
> still waiting there.
>
> It seems that on send side, 'yank' doesn't work either when the send threads
> are blocked.
>
> This may be not the case to call yank. I am not quite sure about it.
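
For reference, a minimal standalone sketch of the "shutdown before join" idea
Daniel describes above. It is not QEMU code: struct channel, channel_worker and
shutdown_then_join are hypothetical names, it uses plain pthreads and a
socketpair, and the worker blocks in recv() rather than sendmsg() to keep the
example short. It assumes Linux-like behaviour, where shutdown() on a socket
wakes a thread blocked on that socket.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for one channel thread; not a QEMU structure. */
struct channel {
    int fd;         /* socket the worker blocks on */
    ssize_t result; /* return value of the blocking call */
};

/* Worker: blocks in recv() because the peer never sends any data. */
static void *channel_worker(void *opaque)
{
    struct channel *c = opaque;
    char buf[64];

    c->result = recv(c->fd, buf, sizeof(buf), 0);
    return NULL;
}

/* Returns 0 if shutdown() released the worker and the join completed. */
int shutdown_then_join(void)
{
    int sv[2];
    struct channel c;
    pthread_t tid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    c.fd = sv[0];
    c.result = -2; /* sentinel: recv() has not returned yet */

    if (pthread_create(&tid, NULL, channel_worker, &c) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Give the worker time to block; not needed for correctness. */
    usleep(100 * 1000);

    /* Kick the worker out of recv() first ... */
    shutdown(sv[0], SHUT_RDWR);
    /* ... and only then wait for it to exit. */
    pthread_join(tid, NULL);

    close(sv[0]);
    close(sv[1]);

    /* After shutdown(), recv() reports end-of-stream (0). */
    return c.result == 0 ? 0 : -1;
}
```

Without the shutdown() call the pthread_join() above would wait for as long as
the peer stays silent, which is the shape of the hang described in this thread.
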

We need to fix that; 'yank' should be able to recover from any network
issue. If it's not working we need to understand why.

> >
> > multifd_load_cleanup already kicks sem_sync before trying to do a
> > thread_join - so have we managed to trigger that on the receive side?
>
> There is no problem with sem_sync in function multifd_load_cleanup.
>
> But it is not called in my case, because no errors are detected on the
> receive side.

If you're getting TCP errors why aren't you seeing any errors on the
receive side?

> The problem is here:
>
> void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
> {
>     MigrationIncomingState *mis = migration_incoming_get_current();
>     Error *local_err = NULL;
>     bool start_migration;
>
>     ...
>
>     if (!mis->from_src_file) {
>
>     ...
>
>     } else {
>         /* Multiple connections */
>         assert(migrate_use_multifd());
>         start_migration = multifd_recv_new_channel(ioc, &local_err);
>         if (local_err) {
>             error_propagate(errp, local_err);
>             return;
>         }
>     }
>     if (start_migration) {
>         migration_incoming_process();
>     }
> }
>
> start_migration is always 0, and migration is not started because some
> receive threads are not created.
>
> No errors are detected here and the main process works well but receive
> threads are all waiting for semaphore.
>
> It's hard to know if the receive threads are not created. If we can find a
> way to check if any receive threads

So is this only a problem for network issues that happen during startup,
before all the threads have been created?

Dave

> are not created, we can kick the sem_sync and do cleanup.
>
> From the source code, the thread will be created when QIO channel detects
> something by GIO watch if I understand correctly.
>
> If nothing is detected, socket_accept_icoming_migration won't be called, the
> thread will not be created.
>
> socket_start_incoming_migration_internal ->
>
>     qio_net_listener_set_client_func_full(listener,
>                                           socket_accept_incoming_migration,
>                                           NULL, NULL,
>                                           g_main_context_get_thread_default());
>
> qio_net_listener_set_client_func_full ->
>
>     qio_channel_add_watch_source(
>         QIO_CHANNEL(listener->sioc[i]), G_IO_IN,
>         qio_net_listener_channel_func,
>         listener, (GDestroyNotify)object_unref, context);
>
> socket_accept_incoming_migration ->
>
>     migration_channel_process_incoming ->
>
>         migration_ioc_process_incoming ->
>
>             multifd_recv_new_channel ->
>
>                 qemu_thread_create(&p->thread, p->name,
>                                    multifd_recv_thread, p,
>                                    QEMU_THREAD_JOINABLE);
>
> >
> > Dave
> >
> > > Regards,
> > > Daniel
> > > --
> > > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > >
> >
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK