On Tue, Feb 25, 2025 at 06:52:43PM +0100, Thomas Huth wrote:
> On 25/02/2025 18.44, Thomas Huth wrote:
> > On 25/02/2025 11.12, Kevin Wolf wrote:
> > > Am 25.02.2025 um 08:20 hat Thomas Huth geschrieben:
> > > > 
> > > >   Hi!
> > > > 
> > > > I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When running
> > > > 
> > > >   ./check -raw 233
> > > > 
> > > > the test simply hangs. Looking at the log, the last message is "== check
> > > > plain client to TLS server fails ==". I added some debug messages, and it
> > > > seems like the previous NBD server is not correctly terminated here.
> > > > The test works fine again if I apply this patch:
> > > > 
> > > > diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/common.nbd
> > > > --- a/tests/qemu-iotests/common.nbd
> > > > +++ b/tests/qemu-iotests/common.nbd
> > > > @@ -35,7 +35,7 @@ nbd_server_stop()
> > > >           read NBD_PID < "$nbd_pid_file"
> > > >           rm -f "$nbd_pid_file"
> > > >           if [ -n "$NBD_PID" ]; then
> > > > -            kill "$NBD_PID"
> > > > +            kill -9 "$NBD_PID"
> > > >           fi
> > > >       fi
> > > >       rm -f "$nbd_unix_socket" "$nbd_stderr_fifo"
> > > > 
> > > > ... but that does not look like the right solution to me. What could prevent
> > > > the qemu-nbd from correctly shutting down when it receives a normal SIGTERM
> > > > signal?
> > > 
> > > Not sure. In theory, qemu_system_killed() should set state = TERMINATE
> > > and make main_loop_wait() return through the notification, which should
> > > then make it shut down. Maybe you can attach gdb and check what 'state'
> > > is when it hangs and if it's still in the main loop?
> > 
> > I attached a gdb and ran "bt", and it looks like it is hanging in an
> > exit() handler:
> > 
> > (gdb) bt
> > #0  0x00007f127f8fff1d in syscall () from /lib64/libc.so.6
> > #1  0x00007f127fd32e1d in g_cond_wait () from /lib64/libglib-2.0.so.0
> > #2  0x00005583df3048b2 in flush_trace_file (wait=true) at ../../devel/qemu/trace/simple.c:140
> > #3  st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383
> > #4  0x00007f127f8296c1 in __run_exit_handlers () from /lib64/libc.so.6
> > #5  0x00007f127f82978e in exit () from /lib64/libc.so.6
> > #6  0x00005583df1ae9e1 in main (argc=<optimized out>, argv=<optimized out>) at ../../devel/qemu/qemu-nbd.c:1242
> 
> Ah, now that I wrote that: I recently ran "configure" with
> --enable-trace-backends=simple ... when I remove that from "config.status"
> again, then the test works fine again 8-)
> 
> Still, I think it should not hang with the simple trace backend here, should it?

IIUC this is waiting on trace_empty_cond.

This condition should be signalled from wait_for_trace_records_available
which is in turn called from writeout_thread.

This thread is started from st_init, which is called from trace_init_backends
which should be called from qemu-nbd. I would expect this thread to still
be running when exit() handlers are run.

Does GDB show any other threads running at the time of this hang?
"info threads" or "thread apply all bt" should list them.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

