Li Zhang <lizh...@suse.de> wrote:
> When doing live migration with multifd channels 8, 16 or larger number,
> the guest hangs in the presence of the network errors such as missing TCP
> ACKs.
>
> At sender's side:
> The main thread is blocked on qemu_thread_join, migration_fd_cleanup
> is called because one thread fails on qio_channel_write_all when
> the network problem happens and other send threads are blocked on sendmsg.
> They could not be terminated. So the main thread is blocked on
> qemu_thread_join to wait for the threads terminated.
>
> (gdb) bt
> 0 0x00007f30c8dcffc0 in __pthread_clockjoin_ex () at /lib64/libpthread.so.0
> 1 0x000055cbb716084b in qemu_thread_join (thread=0x55cbb881f418) at ../util/qemu-thread-posix.c:627
> 2 0x000055cbb6b54e40 in multifd_save_cleanup () at ../migration/multifd.c:542
> 3 0x000055cbb6b4de06 in migrate_fd_cleanup (s=0x55cbb8024000) at ../migration/migration.c:1808
> 4 0x000055cbb6b4dfb4 in migrate_fd_cleanup_bh (opaque=0x55cbb8024000) at ../migration/migration.c:1850
> 5 0x000055cbb7173ac1 in aio_bh_call (bh=0x55cbb7eb98e0) at ../util/async.c:141
> 6 0x000055cbb7173bcb in aio_bh_poll (ctx=0x55cbb7ebba80) at ../util/async.c:169
> 7 0x000055cbb715ba4b in aio_dispatch (ctx=0x55cbb7ebba80) at ../util/aio-posix.c:381
> 8 0x000055cbb7173ffe in aio_ctx_dispatch (source=0x55cbb7ebba80, callback=0x0, user_data=0x0) at ../util/async.c:311
> 9 0x00007f30c9c8cdf4 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
> 10 0x000055cbb71851a2 in glib_pollfds_poll () at ../util/main-loop.c:232
> 11 0x000055cbb718521c in os_host_main_loop_wait (timeout=42251070366) at ../util/main-loop.c:255
> 12 0x000055cbb7185321 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:531
> 13 0x000055cbb6e6ba27 in qemu_main_loop () at ../softmmu/runstate.c:726
> 14 0x000055cbb6ad6fd7 in main (argc=68, argv=0x7ffc0c578888, envp=0x7ffc0c578ab0) at ../softmmu/main.c:50
>
> At receiver's side:
> Several receive threads are not created successfully and the receive threads
> which have been created are blocked on qemu_sem_wait. No semaphores are posted
> because migration is not started if not all the receive threads are created
> successfully and multifd_recv_sync_main is not called which posts the
> semaphore to receive threads. So the receive threads are waiting on the
> semaphore and never return. It shouldn't wait for the semaphore forever.
> Use qemu_sem_timedwait to wait for a while, then return and close the
> channels.
> So the guest doesn't hang anymore.
>
> (gdb) bt
> 0 0x00007fd61c43f064 in do_futex_wait.constprop () at /lib64/libpthread.so.0
> 1 0x00007fd61c43f158 in __new_sem_wait_slow.constprop.0 () at /lib64/libpthread.so.0
> 2 0x000056075916014a in qemu_sem_wait (sem=0x56075b6515f0) at ../util/qemu-thread-posix.c:358
> 3 0x0000560758b56643 in multifd_recv_thread (opaque=0x56075b651550) at ../migration/multifd.c:1112
> 4 0x0000560759160598 in qemu_thread_start (args=0x56075befad00) at ../util/qemu-thread-posix.c:556
> 5 0x00007fd61c43594a in start_thread () at /lib64/libpthread.so.0
> 6 0x00007fd61c158d0f in clone () at /lib64/libc.so.6
>
> Signed-off-by: Li Zhang <lizh...@suse.de>
> ---
>  migration/multifd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 7c9deb1921..656239ca2a 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -1109,7 +1109,7 @@ static void *multifd_recv_thread(void *opaque)
>
>          if (flags & MULTIFD_FLAG_SYNC) {
>              qemu_sem_post(&multifd_recv_state->sem_sync);
> -            qemu_sem_wait(&p->sem_sync);
> +            qemu_sem_timedwait(&p->sem_sync, 1000);
>          }
>      }

The problem happens here, but I think that the solution is not right.
We are returning from the semaphore wait without giving a single error
message.

Later, Juan.