On Tue, Feb 25, 2025 at 06:52:43PM +0100, Thomas Huth wrote:
> On 25/02/2025 18.44, Thomas Huth wrote:
> > On 25/02/2025 11.12, Kevin Wolf wrote:
> > > On 25.02.2025 at 08:20, Thomas Huth wrote:
> > > >
> > > >  Hi!
> > > >
> > > > I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When running
> > > >
> > > >  ./check -raw 233
> > > >
> > > > the test simply hangs. Looking at the log, the last message is "== check
> > > > plain client to TLS server fails ==". I added some debug messages, and it
> > > > seems like the previous NBD server is not correctly terminated here.
> > > > The test works fine again if I apply this patch:
> > > >
> > > > diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/common.nbd
> > > > --- a/tests/qemu-iotests/common.nbd
> > > > +++ b/tests/qemu-iotests/common.nbd
> > > > @@ -35,7 +35,7 @@ nbd_server_stop()
> > > >          read NBD_PID < "$nbd_pid_file"
> > > >          rm -f "$nbd_pid_file"
> > > >          if [ -n "$NBD_PID" ]; then
> > > > -            kill "$NBD_PID"
> > > > +            kill -9 "$NBD_PID"
> > > >          fi
> > > >      fi
> > > >      rm -f "$nbd_unix_socket" "$nbd_stderr_fifo"
> > > >
> > > > ... but that does not look like the right solution to me. What could prevent
> > > > the qemu-nbd from correctly shutting down when it receives a normal SIGTERM
> > > > signal?
> > >
> > > Not sure. In theory, qemu_system_killed() should set state = TERMINATE
> > > and make main_loop_wait() return through the notification, which should
> > > then make it shut down. Maybe you can attach gdb and check what 'state'
> > > is when it hangs and if it's still in the main loop?
> >
> > I attached a gdb and ran "bt", and it looks like it is hanging in an
> > exit() handler:
> >
> > (gdb) bt
> > #0  0x00007f127f8fff1d in syscall () from /lib64/libc.so.6
> > #1  0x00007f127fd32e1d in g_cond_wait () from /lib64/libglib-2.0.so.0
> > #2  0x00005583df3048b2 in flush_trace_file (wait=true) at ../../devel/qemu/trace/simple.c:140
> > #3  st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383
> > #4  0x00007f127f8296c1 in __run_exit_handlers () from /lib64/libc.so.6
> > #5  0x00007f127f82978e in exit () from /lib64/libc.so.6
> > #6  0x00005583df1ae9e1 in main (argc=<optimized out>, argv=<optimized out>) at ../../devel/qemu/qemu-nbd.c:1242
>
> Ah, now that I wrote that: I recently ran "configure" with
> --enable-trace-backends=simple ... when I remove that from "config.status"
> again, then the test works fine again 8-)
>
> Still, I think it should not hang with the simple trace backend here, should
> it?

IIUC this is waiting on trace_empty_cond. This condition should be
signalled from wait_for_trace_records_available which is in turn called
from writeout_thread. This thread is started from st_init, which is
called from trace_init_backends which should be called from qemu-nbd.

I would expect this thread to still be running when exit() handlers are
run. Does GDB show any other threads running at the time of this hang?

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
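
For illustration only, here is a rough sketch of the kind of handshake described in the reply above, modelled in Python rather than in QEMU's C. It assumes (this is not checked against trace/simple.c) that the flush path marks records as available and then waits for a writer thread to signal that the buffer is empty again. The names TraceBuffer, writer and flush are hypothetical, and the timeout is there only so the demo terminates; the backtrace shows an untimed g_cond_wait().

```python
import threading


class TraceBuffer:
    """Toy model of a flush/writeout handshake (hypothetical, not QEMU code)."""

    def __init__(self):
        self.lock = threading.Lock()
        # Two conditions sharing one lock, loosely mirroring an
        # "available" and an "empty" condition variable.
        self.available_cond = threading.Condition(self.lock)
        self.empty_cond = threading.Condition(self.lock)
        self.available = False  # records are waiting to be written out
        self.stop = False

    def writer(self):
        """Writer thread: consume records, then signal that the buffer is empty."""
        with self.lock:
            while True:
                while not self.available and not self.stop:
                    self.available_cond.wait()
                if not self.available:
                    return  # stop requested and nothing left to write
                self.available = False  # pretend the records were written
                self.empty_cond.notify_all()

    def flush(self, timeout):
        """Request a flush and wait for the writer; False means nobody answered."""
        with self.lock:
            self.available = True
            self.available_cond.notify()
            return self.empty_cond.wait_for(lambda: not self.available, timeout)


def demo():
    results = {}

    # No writer thread exists: nothing ever signals empty_cond, so the
    # flush can only time out (an untimed wait would block forever).
    results["no_writer"] = TraceBuffer().flush(timeout=0.2)

    # With a writer thread running, the same flush completes.
    buf = TraceBuffer()
    thread = threading.Thread(target=buf.writer)
    thread.start()
    results["with_writer"] = buf.flush(timeout=5)
    with buf.lock:
        buf.stop = True
        buf.available_cond.notify()
    thread.join()
    return results


if __name__ == "__main__":
    print(demo())
```

If the real code follows a similar pattern, a missing or blocked writer thread at exit() time would be enough to explain the hang, which is why the list of threads in GDB is interesting.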