Steps with test packages on Focal (shutdown-on-init)
---

Environment:
---
On top of the LXD VM in comments #12/#13.

Enable PPA & debug symbols:

sudo add-apt-repository -yn ppa:mfo/lp2059272
sudo sed '/^deb / s,$, main/debug,' -i /etc/apt/sources.list.d/mfo-ubuntu-lp2059272-focal.list
sudo apt update

Install packages:

sudo apt install --yes libvirt{0,-daemon{,-driver-qemu}}{,-dbgsym} libvirt-clients gdb qemu-system-x86

$ dpkg -s libvirt-daemon | grep ^Version:
Version: 6.0.0-0ubuntu8.18~ppa1

Libvirtd debug logging:

cat <<EOF | sudo tee -a /etc/libvirt/libvirtd.conf
log_filters="1:qemu 1:libvirt"
log_outputs="3:syslog:libvirtd 1:file:/var/log/libvirt/libvirtd-debug.log"
EOF

Follow `Steps to reproduce on Focal (shutdown-on-init)` in comment #13
---

Up to ...

Check the backtrace of the domain status XML save function, coming from QEMU process reconnect:

(gdb) t 20
(gdb) bt
#0  virDomainObjSave (obj=0x7fe638012540, xmlopt=0x7fe63800d4e0, statusDir=0x7fe63800cf10 "/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157
#1  0x00007fe644190545 in qemuProcessReconnect (opaque=<optimized out>) at ../../../src/qemu/qemu_process.c:8123
#2  0x00007fe64aebd54a in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:196
#3  0x00007fe64ab7e609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007fe64aaa3353 in clone () from /lib/x86_64-linux-gnu/libc.so.6

Stop libvirtd:

$ sudo kill $(pidof libvirtd)

Thread 1 "libvirtd" hit Breakpoint 1, qemuStateCleanup () at ../../../src/qemu/qemu_driver.c:1180

Check the domain status XML formatter callback; it is still set:

(gdb) t 20
(gdb) p xmlopt.privateData.format
$1 = (virDomainXMLPrivateDataFormatFunc) 0x7fe644152890 <qemuDomainObjPrivateXMLFormat>

Let the cleanup function finish:

(gdb) t 1
(gdb) finish

Notice it took a while (30 seconds).

Now the formatter callback is cleared:

(gdb) t 20
(gdb) p xmlopt.privateData.format
$3 = (virDomainXMLPrivateDataFormatFunc) 0x0

Let the save function continue, and libvirt finish shutdown:

(gdb) c &
(gdb) t 1
(gdb) c
(gdb) q

Check the VM status XML *after*:

ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
<domstatus state='running' reason='booted' pid='6817'>
  <domain type='qemu' id='1'>

And everything happened as in the reproducer, i.e., the SAME behavior happened BY DEFAULT, just with a 30-second delay.
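For illustration, a minimal C sketch of the race demonstrated above (hypothetical names and structure, not the actual libvirt code): thread 1's qemuStateCleanup() clears the private-data format callback while thread 20 is still on its way from qemuProcessReconnect() to virDomainObjSave(), so the status XML is silently written without the QEMU-private elements such as '<monitor path=.../>':

#include <stdio.h>

/* Hypothetical, simplified stand-ins for libvirt's types. */
typedef void (*privateDataFormatFunc)(FILE *out);

struct domainXMLOption {
    privateDataFormatFunc format;  /* e.g. qemuDomainObjPrivateXMLFormat */
};

/* What the status XML save effectively does: QEMU-private data is
 * emitted only if the callback is still set, so a cleared callback
 * yields an incomplete, but still well-formed, XML file. */
static void save_status_xml(struct domainXMLOption *xmlopt, FILE *out)
{
    fprintf(out, "<domstatus state='running' reason='booted'>\n");
    if (xmlopt->format)
        xmlopt->format(out);  /* would write <monitor path=.../> */
    fprintf(out, "</domstatus>\n");
}

/* Thread 1 (qemuStateCleanup): frees/clears driver data;
 * observed above as: $3 = (virDomainXMLPrivateDataFormatFunc) 0x0 */
static void state_cleanup(struct domainXMLOption *xmlopt)
{
    xmlopt->format = NULL;
}

static void monitor_path_format(FILE *out)
{
    fprintf(out, "  <monitor path='/var/lib/libvirt/qemu/domain-1-test-vm/monitor.sock' type='unix'/>\n");
}

int main(void)
{
    struct domainXMLOption xmlopt = { monitor_path_format };

    state_cleanup(&xmlopt);            /* cleanup wins the race, then... */
    save_status_xml(&xmlopt, stdout);  /* ...the XML loses <monitor path=.../> */
    return 0;
}

Because the resulting XML is still well-formed, nothing fails at save time; the problem only surfaces on the next libvirtd startup, when '<monitor path=' is missing.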
Checking the libvirtd debug logs to confirm the patch behavior:

$ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
2024-03-30 22:49:24.737+0000: 6875: debug : qemuStateCleanupWait:1144 : timeout 30, timeout_env '(null)'
2024-03-30 22:49:24.737+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 0
2024-03-30 22:49:24.737+0000: 6875: warning : qemuStateCleanupWait:1153 : Waiting for qemuProcessReconnect() threads (1) to end.
Configure with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait up to N seconds; current = 30)
2024-03-30 22:49:25.740+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 1
2024-03-30 22:49:26.740+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 2
2024-03-30 22:49:27.740+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 3
2024-03-30 22:49:28.741+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 4
2024-03-30 22:49:29.741+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 5
2024-03-30 22:49:30.741+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 6
2024-03-30 22:49:31.742+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 7
2024-03-30 22:49:32.742+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 8
2024-03-30 22:49:33.742+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 9
2024-03-30 22:49:34.742+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 10
2024-03-30 22:49:35.743+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 11
2024-03-30 22:49:36.743+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 12
2024-03-30 22:49:37.744+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 13
2024-03-30 22:49:38.744+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 14
2024-03-30 22:49:39.744+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 15
2024-03-30 22:49:40.744+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 16
2024-03-30 22:49:41.745+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 17
2024-03-30 22:49:42.745+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 18
2024-03-30 22:49:43.746+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 19
2024-03-30 22:49:44.746+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 20
2024-03-30 22:49:45.747+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 21
2024-03-30 22:49:46.747+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 22
2024-03-30 22:49:47.748+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 23
2024-03-30 22:49:48.748+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 24
2024-03-30 22:49:49.749+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 25
2024-03-30 22:49:50.749+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 26
2024-03-30 22:49:51.750+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 27
2024-03-30 22:49:52.750+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 28
2024-03-30 22:49:53.750+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 29
2024-03-30 22:49:54.751+0000: 6875: warning : qemuStateCleanupWait:1164 : Leaving qemuProcessReconnect() threads (1) per timeout (30)
2024-03-30 22:51:00.315+0000: 6906: debug : qemuDomainObjEndJob:9746 : Stopping job: modify (async=none vm=0x7fe638012540 name=test-vm)
2024-03-30 22:51:00.315+0000: 6906: debug : qemuProcessReconnect:8161 : Not decrementing qemuProcessReconnect() threads as the QEMU driver is already deallocated/freed.
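The hint message and the per-second debug lines suggest the shape of the Focal-only wait. A minimal sketch, assuming the semantics stated in the warning above (hypothetical code, not the actual patch; per the -1 scenario backtrace further below, the real patch sleeps via g_usleep()):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int nreconnect_threads = 1;  /* e.g. one qemuProcessReconnect() thread pending */

static void state_cleanup_wait(void)
{
    const char *env = getenv("LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT");
    int timeout = env ? atoi(env) : 30;  /* default: 30 seconds */
    int seconds = 0;

    /* -1 = wait forever; 0 = do not wait; N = wait up to N seconds */
    while (nreconnect_threads > 0 && (timeout < 0 || seconds < timeout)) {
        fprintf(stderr, "threads %d, seconds %d\n", nreconnect_threads, seconds);
        if (seconds == 0)
            fprintf(stderr, "Waiting for qemuProcessReconnect() threads (%d) to end.\n",
                    nreconnect_threads);
        sleep(1);  /* the real patch uses g_usleep(), per the backtrace below */
        seconds++;
    }

    if (nreconnect_threads > 0)
        fprintf(stderr, "Leaving qemuProcessReconnect() threads (%d) per timeout (%d)\n",
                nreconnect_threads, timeout);
    else
        fprintf(stderr, "All qemuProcessReconnect() threads finished\n");
}

Note how this matches the scenarios below: with timeout 0 the loop body never runs (no 'threads N, seconds N' line, straight to 'Leaving ... per timeout (0)'), and with timeout -1 the loop only exits once the thread count drops to zero.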
The warnings/errors above would be shown in libvirtd syslog/journalctl:

$ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p' | grep -e warning -e error
2024-03-30 22:49:24.737+0000: 6875: warning : qemuStateCleanupWait:1153 : Waiting for qemuProcessReconnect() threads (1) to end.
Configure with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait up to N seconds; current = 30)
2024-03-30 22:49:54.751+0000: 6875: warning : qemuStateCleanupWait:1164 : Leaving qemuProcessReconnect() threads (1) per timeout (30)

Stop the VM, and restart it with libvirt:

sudo kill $(sudo cat /run/libvirt/qemu/test-vm.pid) && sudo rm /run/libvirt/qemu/test-vm.{pid,xml}
sudo systemctl start libvirtd.service && virsh start test-vm && sudo systemctl stop 'libvirtd*'

Scenario with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=5
---

The same result happens with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=5 (i.e., wait at most 5 seconds).

Repeat, with `gdb -ex 'set environment LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT 5' -ex 'run'`:

The steps 't 1; finish' take 5 seconds, instead of 30 seconds.

ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
<domstatus state='running' reason='booted' pid='7005'>
  <domain type='qemu' id='1'>

ubuntu@lp2059272-focal:~$ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
2024-03-30 23:00:11.016+0000: 7017: debug : qemuStateCleanupWait:1144 : timeout 5, timeout_env '5'
2024-03-30 23:00:11.016+0000: 7017: debug : qemuStateCleanupWait:1150 : threads 1, seconds 0
2024-03-30 23:00:11.016+0000: 7017: warning : qemuStateCleanupWait:1153 : Waiting for qemuProcessReconnect() threads (1) to end.
Configure with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait up to N seconds; current = 5)
2024-03-30 23:00:12.017+0000: 7017: debug : qemuStateCleanupWait:1150 : threads 1, seconds 1
2024-03-30 23:00:13.018+0000: 7017: debug : qemuStateCleanupWait:1150 : threads 1, seconds 2
2024-03-30 23:00:14.018+0000: 7017: debug : qemuStateCleanupWait:1150 : threads 1, seconds 3
2024-03-30 23:00:15.018+0000: 7017: debug : qemuStateCleanupWait:1150 : threads 1, seconds 4
2024-03-30 23:00:16.018+0000: 7017: warning : qemuStateCleanupWait:1164 : Leaving qemuProcessReconnect() threads (1) per timeout (5)
2024-03-30 23:00:45.694+0000: 7048: debug : qemuDomainObjEndJob:9746 : Stopping job: modify (async=none vm=0x7f40d0052de0 name=test-vm)
2024-03-30 23:00:45.694+0000: 7048: debug : qemuProcessReconnect:8161 : Not decrementing qemuProcessReconnect() threads as the QEMU driver is already deallocated/freed.

Scenario with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=0
---

The same result happens with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=0 (i.e., do not wait).

Repeat, with `gdb -ex 'set environment LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT 0' -ex 'run'`:

The steps 't 1; finish' take 0 seconds (no wait), instead of 30 or 5 seconds.
ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
<domstatus state='running' reason='booted' pid='7113'>
  <domain type='qemu' id='1'>

ubuntu@lp2059272-focal:~$ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
2024-03-30 23:03:11.487+0000: 7124: debug : qemuStateCleanupWait:1144 : timeout 0, timeout_env '0'
2024-03-30 23:03:11.488+0000: 7124: warning : qemuStateCleanupWait:1164 : Leaving qemuProcessReconnect() threads (1) per timeout (0)
2024-03-30 23:03:15.313+0000: 7155: debug : qemuDomainObjEndJob:9746 : Stopping job: modify (async=none vm=0x7ff620052ad0 name=test-vm)
2024-03-30 23:03:15.313+0000: 7155: debug : qemuProcessReconnect:8161 : Not decrementing qemuProcessReconnect() threads as the QEMU driver is already deallocated/freed.

Scenario with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=-1
---

A different result happens with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=-1 (i.e., wait forever).

Repeat, with `gdb -ex 'set environment LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT -1' -ex 'run'`:

The steps 't 1; finish' do not finish; they keep running, waiting for the pending thread.

(gdb) t 1
(gdb) finish
... wait, wait, wait ...
ctrl-c

(gdb) bt
#0  0x00007fb29ceed23f in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fb29cef2ec7 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fb29d0bf557 in g_usleep () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007fb2906498f5 in qemuStateCleanupWait () at ../../../src/qemu/qemu_driver.c:1159
#4  qemuStateCleanup () at ../../../src/qemu/qemu_driver.c:1184
#5  0x00007fb29d4e746f in virStateCleanup () at ../../../src/libvirt.c:669
#6  0x00005569adc89bc8 in main (argc=<optimized out>, argv=<optimized out>) at ../../../src/remote/remote_daemon.c:1447

Check the formatter/options again; it is *STILL* referenced, not 0x0 as in the previous scenarios:

(gdb) t 20
(gdb) p xmlopt.privateData.format
$1 = (virDomainXMLPrivateDataFormatFunc) 0x7fb2905d8890 <qemuDomainObjPrivateXMLFormat>

Thread 1 is still in qemuStateCleanupWait(), so let it run again:

(gdb) c &

And unblock the other thread. Now libvirt finishes shutting down.

(gdb) t 20
(gdb) c
...
[Inferior 1 (process 7233) exited normally]
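The 'Decrementing' / 'Not decrementing' log messages (in the excerpt below and in the earlier scenarios) imply a simple counter protocol between the reconnect threads and the wait loop. A minimal sketch, assuming that protocol (hypothetical names; the real patch would need proper locking/atomics around the counter):

#include <stdbool.h>
#include <stdio.h>

static int nreconnect_threads;         /* polled by the wait loop sketched earlier */
static bool qemu_driver_alive = true;  /* cleared once qemuStateCleanup() frees the driver */

/* Entry: each qemuProcessReconnect() thread is counted while pending. */
static void process_reconnect_begin(void)
{
    nreconnect_threads++;
}

/* Exit: only decrement while the driver still exists; once the wait
 * loop has given up and the driver was freed, touching driver-owned
 * state would be a use-after-free, hence the "Not decrementing" path. */
static void process_reconnect_end(void)
{
    if (qemu_driver_alive) {
        fprintf(stderr, "Decrementing qemuProcessReconnect() threads.\n");
        nreconnect_threads--;
    } else {
        fprintf(stderr, "Not decrementing qemuProcessReconnect() threads as "
                        "the QEMU driver is already deallocated/freed.\n");
    }
}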
The logs show that the thread actually finished before libvirt exited:

ubuntu@lp2059272-focal:~$ sudo tail -n200 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
2024-03-30 23:06:00.512+0000: 7233: debug : qemuStateCleanupWait:1144 : timeout -1, timeout_env '-1'
2024-03-30 23:06:00.512+0000: 7233: debug : qemuStateCleanupWait:1150 : threads 1, seconds 0
2024-03-30 23:06:00.512+0000: 7233: warning : qemuStateCleanupWait:1153 : Waiting for qemuProcessReconnect() threads (1) to end.
Configure with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait up to N seconds; current = -1)
2024-03-30 23:06:01.513+0000: 7233: debug : qemuStateCleanupWait:1150 : threads 1, seconds 1
2024-03-30 23:06:02.513+0000: 7233: debug : qemuStateCleanupWait:1150 : threads 1, seconds 2
2024-03-30 23:06:03.514+0000: 7233: debug : qemuStateCleanupWait:1150 : threads 1, seconds 3
...
2024-03-30 23:09:43.994+0000: 7233: debug : qemuStateCleanupWait:1150 : threads 1, seconds 130
2024-03-30 23:09:44.994+0000: 7233: debug : qemuStateCleanupWait:1150 : threads 1, seconds 131
2024-03-30 23:09:45.994+0000: 7233: debug : qemuStateCleanupWait:1150 : threads 1, seconds 132
2024-03-30 23:09:46.075+0000: 7264: debug : qemuDomainObjEndJob:9746 : Stopping job: modify (async=none vm=0x7fb28c04c1c0 name=test-vm)
2024-03-30 23:09:46.075+0000: 7264: debug : qemuProcessReconnect:8158 : Decrementing qemuProcessReconnect() threads.
2024-03-30 23:09:46.995+0000: 7233: debug : qemuStateCleanupWait:1170 : All qemuProcessReconnect() threads finished

And the `monitor path` is still in the XML:

ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
<domstatus state='running' reason='booted' pid='7222'>
  <monitor path='/var/lib/libvirt/qemu/domain-1-test-vm/monitor.sock' type='unix'/>
  <domain type='qemu' id='1'>

Of course, the above also happens by default if the thread finishes within the default timeout (30 seconds).

Scenario: (default/real-world) no env var, and the thread finishes quickly
---

(Running the steps quickly.)

Thread 20 "libvirtd" hit Breakpoint 2, virDomainObjSave (obj=0x55c688ebbe80, xmlopt=0x55c688eb3f40, statusDir=0x55c688e78f60 "/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157

$ sudo kill $(pidof libvirtd)

Thread 1 "libvirtd" hit Breakpoint 1, qemuStateCleanup () at ../../../src/qemu/qemu_driver.c:1181

(gdb) t 1
(gdb) c &
(gdb) t 20
(gdb) c
...
[Inferior 1 (process 32761) exited normally]

ubuntu@lp2059272-focal:~$ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
2024-03-30 23:12:10.242+0000: 7281: debug : qemuStateCleanupWait:1144 : timeout 30, timeout_env '(null)'
2024-03-30 23:12:10.242+0000: 7281: debug : qemuStateCleanupWait:1150 : threads 1, seconds 0
2024-03-30 23:12:10.242+0000: 7281: warning : qemuStateCleanupWait:1153 : Waiting for qemuProcessReconnect() threads (1) to end.
Configure with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait up to N seconds; current = 30)
2024-03-30 23:12:11.242+0000: 7281: debug : qemuStateCleanupWait:1150 : threads 1, seconds 1
2024-03-30 23:12:11.484+0000: 7312: debug : qemuDomainObjEndJob:9746 : Stopping job: modify (async=none vm=0x7f7b4c04c3a0 name=test-vm)
2024-03-30 23:12:11.484+0000: 7312: debug : qemuProcessReconnect:8158 : Decrementing qemuProcessReconnect() threads.
2024-03-30 23:12:12.243+0000: 7281: debug : qemuStateCleanupWait:1170 : All qemuProcessReconnect() threads finished

ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
<domstatus state='running' reason='booted' pid='7222'>
  <monitor path='/var/lib/libvirt/qemu/domain-1-test-vm/monitor.sock' type='unix'/>
  <domain type='qemu' id='1'>
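The next step verifies the startup side. For contrast, a minimal sketch of the failure mode this verification guards against (hypothetical code, not libvirt's actual parser): when '<monitor path=' is missing from the status XML, the private-data parse fails and the whole domain config load is abandoned, so a still-running domain disappears from `virsh list`:

#include <stdio.h>

/* Hypothetical sketch: parsing the QEMU-private part of the status XML. */
static int parse_private_data(const char *monitor_path /* NULL if absent */)
{
    if (!monitor_path) {
        fprintf(stderr, "internal error: no monitor path\n");
        return -1;
    }
    return 0;
}

/* Hypothetical caller: on parse failure the domain is skipped entirely,
 * i.e. "Failed to load config for domain 'test-vm'", so libvirt no
 * longer tracks the (still running) QEMU process. */
static void load_domain_status(const char *name, const char *monitor_path)
{
    if (parse_private_data(monitor_path) < 0) {
        fprintf(stderr, "Failed to load config for domain '%s'\n", name);
        return;
    }
    /* ...otherwise the domain is tracked and appears in `virsh list` */
}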
Now, the next time libvirtd starts, it correctly parses that XML:

$ sudo systemctl start libvirtd.service

ubuntu@lp2059272-focal:~$ journalctl -b -u libvirtd.service | grep error
...
Mar 30 23:14:27 lp2059272-focal libvirtd[7325]: 7341: error : dnsmasqCapsRefreshInternal:714 : Cannot check dnsmasq binary /usr/sbin/dnsmasq: No such file or directory

And libvirt is now aware of the domain, and can manage it:

$ virsh list
 Id   Name      State
-------------------------
 1    test-vm   running

$ virsh destroy test-vm
Domain test-vm destroyed

** Description changed:

  [ Impact ]

   * If a race condition occurs on libvirtd shutdown,
     a QEMU domain status XML (/run/libvirt/qemu/*.xml)
     might lose the QEMU-driver specific information,
     such as '<monitor path=.../>'.
+    (The race condition details are in [Other Info].)

   * On the next libvirtd startup, the parsing of that
     QEMU domain's status XML fails as '<monitor path='
     is not found:

     $ journalctl -b -u libvirtd.service | tail
     ...
     ... libvirtd[2789]: internal error: no monitor path
     ... libvirtd[2789]: Failed to load config for domain 'test-vm'

   * As a result, the domain is not listed in `virsh list`,
     and `virsh` commands to it fail.

     $ virsh list
      Id   Name   State
     --------------------

   * The domain is still running, but libvirt considers
     it as shutdown, which might cause conflicts/issues
     with higher-level tools (e.g., openstack nova).

     $ virsh list --all
      Id   Name      State
     --------------------------
      -    test-vm   shut off

     $ pgrep -af qemu-system-x86_64 | cut -d, -f1
     2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,

  [ Test Plan ]

- * Synthetic reproducer with GDB in comments #1 and #2.
+ * (Focal/Jammy) shutdown-on-runtime:
+   Synthetic reproducer/verification with GDB
+   in comments #1 and #2 (Jammy) and #12 and #14 (Focal).

-   On failure, the XML is saved *without* '<monitor path='
+ * (Focal-only) shutdown-on-init:
+   Synthetic reproducer/verification with GDB
+   in comments #13 and #15.
+
+ * On failure, the XML is saved *without* '<monitor path='
    and libvirt fails to parse the domain on startup.
    The domain is *not* listed in `virsh list`.
-   (comment #1)

-   On success, the XML is saved *with* '<monitor path='
+ * On success, the XML is saved *with* '<monitor path='
    and libvirt correctly parses the domain on startup.
    The domain is listed in `virsh list`.
-   (comment #2)

   * Normal 'restart' testing in comment #5.

+ * Test packages built successfully in all architectures
+   with -proposed enabled in Launchpad PPA mfo/lp2059272 [0]
+
+ [0] https://launchpad.net/~mfo/+archive/ubuntu/lp2059272
+
+
  [ Regression Potential ]

- * The patch changes *where* in the libvirt qemu driver's
+ * One patch changes *where* in the libvirt qemu driver's
    shutdown path the worker thread pool is stopped/freed:
    from _after_ releasing other data to _before_ doing so.
+
+ * The other patch (Focal-only) introduces a bounded wait
+   (with configurable timeout via an environment variable)
+   in the (same) libvirt qemu driver's shutdown path.
+
+   By default, this waits for qemuProcessReconnect threads
+   for up to 30 seconds (expected to finish in less than
+   1 second, in practice), and gives up / continues with
+   shutdown anyway so not to introduce a behavior change
+   on this path (prevents impact in case of regressions).

   * Therefore, the potential for regression is limited to
     the libvirt qemu driver's shutdown path, and would be
     observed when stopping/restarting libvirtd.service.

   * The behavior during normal operation is not affected.

  [Other Info]

- * The fix commit [1] is included in Mantic and later,
-   and needed in Focal and Jammy.
+ * In Focal, race windows exist if libvirtd shuts down
+   _after_ initialization and _during_ initialization
+   (which is unlikely in practice, but it's possible.)
+
+   Say, 'shutdown'on-runtime' and 'shutdown-on-init'.
+
+ * In Jammy, only 'shutdown-on-runtime' might happen,
+   due to the introduction of the '.stateShutdownWait'
+   driver callback (not available in Focal), which
+   indirectly prevents the 'shutdown-on-init' race
+   due to additional synchronization with locking.
+
+ * For 'shutdown-on-init' (Focal-only), we should use a
+   downstream-only patch (with configurable behavior),
+   since upstream addressed this issue indirectly with
+   the '.stateShutdownWait' callbacks and other changes
+   (which are not SRU material, ~10 patches, redesign [2]).
+
+ * For 'shutdown-on-runtime': use upstream commit [1].
+   It's needed in Focal and Jammy (included in Mantic).

     $ git describe --contains 152770333449cd3b78b4f5a9f1148fc1f482d842
     v9.3.0-rc1~90

     $ rmadison -a source libvirt | sed -n '/focal/,$p'
     libvirt | 6.0.0-0ubuntu8    | focal           | source
     libvirt | 6.0.0-0ubuntu8.16 | focal-security  | source
     libvirt | 6.0.0-0ubuntu8.16 | focal-updates   | source
     libvirt | 6.0.0-0ubuntu8.17 | focal-proposed  | source
     libvirt | 8.0.0-1ubuntu7    | jammy           | source
     libvirt | 8.0.0-1ubuntu7.5  | jammy-security  | source
     libvirt | 8.0.0-1ubuntu7.8  | jammy-updates   | source
     libvirt | 9.6.0-1ubuntu1    | mantic          | source
     libvirt | 10.0.0-2ubuntu1   | noble           | source
     libvirt | 10.0.0-2ubuntu5   | noble-proposed  | source

  [1] https://gitlab.com/libvirt/libvirt/-/commit/152770333449cd3b78b4f5a9f1148fc1f482d842

- * Test packages built successfully in all architectures
-   with -proposed enabled in Launchpad PPA mfo/lp2059272 [2]
+ [2] https://listman.redhat.com/archives/libvir-list/2020-July/205291.html
+     PATCH 00/10] resolve hangs/crashes on libvirtd shutdown

- [2] https://launchpad.net/~mfo/+archive/ubuntu/lp2059272
+     commit 94e45d1042e21e03a15ce993f90fbef626f1ae41
+     Author: Nikolay Shirokovskiy <nshirokovs...@virtuozzo.com>
+     Date:   Thu Jul 23 09:53:04 2020 +0300
+
+         rpc: finish all threads before exiting main loop
+
+     $ git describe --contains 94e45d1042e21e03a15ce993f90fbef626f1ae41
+     v6.8.0-rc1~279
+
  [Original Description]

  There's a race condition on libvirtd shutdown that
  might cause the domain status XML file(s) to lose
  the '<monitor path=...'> tag/field.

  This causes an error on libvirtd startup, and the
  domain is not listed/managed, even though it is still running.

  $ virsh list
   Id   Name      State
  -------------------------
   1    test-vm   running

  $ sudo systemctl restart libvirtd.service

  $ journalctl -b -u libvirtd.service | tail
  ...
  ... libvirtd[2789]: internal error: no monitor path
  ... libvirtd[2789]: Failed to load config for domain 'test-vm'

  $ virsh list
   Id   Name   State
  --------------------

  $ virsh list --all
   Id   Name      State
  --------------------------
   -    test-vm   shut off

  $ pgrep -af qemu-system-x86_64 | cut -d, -f1
  2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,

** Description changed:

  [ Impact ]

   * If a race condition occurs on libvirtd shutdown,
     a QEMU domain status XML (/run/libvirt/qemu/*.xml)
     might lose the QEMU-driver specific information,
     such as '<monitor path=.../>'.
-    (The race condition details are in [Other Info].)
+    (The race condition details are in [Other Info].)

   * On the next libvirtd startup, the parsing of that
     QEMU domain's status XML fails as '<monitor path='
     is not found:

     $ journalctl -b -u libvirtd.service | tail
     ...
     ... libvirtd[2789]: internal error: no monitor path
     ... libvirtd[2789]: Failed to load config for domain 'test-vm'

   * As a result, the domain is not listed in `virsh list`,
     and `virsh` commands to it fail.
     $ virsh list
      Id   Name   State
     --------------------

   * The domain is still running, but libvirt considers
     it as shutdown, which might cause conflicts/issues
     with higher-level tools (e.g., openstack nova).

     $ virsh list --all
      Id   Name      State
     --------------------------
      -    test-vm   shut off

     $ pgrep -af qemu-system-x86_64 | cut -d, -f1
     2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,

  [ Test Plan ]

   * (Focal/Jammy) shutdown-on-runtime:
-    Synthetic reproducer/verification with GDB in comments #1 and #2 (Jammy) and #12 and #14 (Focal).
+    Synthetic reproducer/verification with GDB
+    in comments #1 and #2 (Jammy) and #12 and #14 (Focal).

- * (Focal-only) shutdown-on-init:
-   Synthetic reproducer/verification with GDB in comments #13 and #15.
+ * (Focal-only) shutdown-on-init:
+   Synthetic reproducer/verification with GDB
+   in comments #13 and #15.

- * On failure, the XML is saved *without* '<monitor path='
+ * On failure, the XML is saved *without* '<monitor path='
    and libvirt fails to parse the domain on startup.
    The domain is *not* listed in `virsh list`.

   * On success, the XML is saved *with* '<monitor path='
     and libvirt correctly parses the domain on startup.
     The domain is listed in `virsh list`.

   * Normal 'restart' testing in comment #5.

   * Test packages built successfully in all architectures
     with -proposed enabled in Launchpad PPA mfo/lp2059272 [0]

  [0] https://launchpad.net/~mfo/+archive/ubuntu/lp2059272
-
  [ Regression Potential ]

   * One patch changes *where* in the libvirt qemu driver's
     shutdown path the worker thread pool is stopped/freed:
     from _after_ releasing other data to _before_ doing so.

- * The other patch (Focal-only) introduces a bounded wait
-   (with configurable timeout via an environment variable)
-   in the (same) libvirt qemu driver's shutdown path.
+ * The other patch (Focal-only) introduces a bounded wait
+   (with configurable timeout via an environment variable)
+   in the (same) libvirt qemu driver's shutdown path.

-   By default, this waits for qemuProcessReconnect threads
-   for up to 30 seconds (expected to finish in less than
-   1 second, in practice), and gives up / continues with
-   shutdown anyway so not to introduce a behavior change
-   on this path (prevents impact in case of regressions).
+   By default, this waits for qemuProcessReconnect threads
+   for up to 30 seconds (expected to finish in less than
+   1 second, in practice), and gives up / continues with
+   shutdown anyway so not to introduce a behavior change
+   on this path (prevents impact in case of regressions).

   * Therefore, the potential for regression is limited to
     the libvirt qemu driver's shutdown path, and would be
     observed when stopping/restarting libvirtd.service.

   * The behavior during normal operation is not affected.

  [Other Info]

- * In Focal, race windows exist if libvirtd shuts down
-   _after_ initialization and _during_ initialization
-   (which is unlikely in practice, but it's possible.)
+ * In Focal, race windows exist if libvirtd shuts down
+   _after_ initialization and _during_ initialization
+   (which is unlikely in practice, but it's possible.)

-   Say, 'shutdown'on-runtime' and 'shutdown-on-init'.
+   Say, 'shutdown'on-runtime' and 'shutdown-on-init'.

- * In Jammy, only 'shutdown-on-runtime' might happen,
-   due to the introduction of the '.stateShutdownWait'
-   driver callback (not available in Focal), which
-   indirectly prevents the 'shutdown-on-init' race
-   due to additional synchronization with locking.
-
- * For 'shutdown-on-init' (Focal-only), we should use a
-   downstream-only patch (with configurable behavior),
-   since upstream addressed this issue indirectly with
-   the '.stateShutdownWait' callbacks and other changes
-   (which are not SRU material, ~10 patches, redesign [2]).
+ * In Jammy, only 'shutdown-on-runtime' might happen,
+   due to the introduction of the '.stateShutdownWait'
+   driver callback (not available in Focal), which
+   indirectly prevents the 'shutdown-on-init' race
+   due to additional synchronization with locking.

   * For 'shutdown-on-runtime': use upstream commit [1].
-   It's needed in Focal and Jammy (included in Mantic).
+   It's needed in Focal and Jammy (included in Mantic).
+
+ * For 'shutdown-on-init' (Focal-only), we should use a
+   downstream-only patch (with configurable behavior),
+   since upstream addressed this issue indirectly with
+   the '.stateShutdownWait' callbacks and other changes
+   (which are not SRU material, ~10 patches, redesign [2])
+   in 6.8.0.
+
+ [1]
+ https://gitlab.com/libvirt/libvirt/-/commit/152770333449cd3b78b4f5a9f1148fc1f482d842

     $ git describe --contains 152770333449cd3b78b4f5a9f1148fc1f482d842
     v9.3.0-rc1~90

     $ rmadison -a source libvirt | sed -n '/focal/,$p'
     libvirt | 6.0.0-0ubuntu8    | focal           | source
     libvirt | 6.0.0-0ubuntu8.16 | focal-security  | source
     libvirt | 6.0.0-0ubuntu8.16 | focal-updates   | source
     libvirt | 6.0.0-0ubuntu8.17 | focal-proposed  | source
     libvirt | 8.0.0-1ubuntu7    | jammy           | source
     libvirt | 8.0.0-1ubuntu7.5  | jammy-security  | source
     libvirt | 8.0.0-1ubuntu7.8  | jammy-updates   | source
     libvirt | 9.6.0-1ubuntu1    | mantic          | source
     libvirt | 10.0.0-2ubuntu1   | noble           | source
     libvirt | 10.0.0-2ubuntu5   | noble-proposed  | source

- [1]
- https://gitlab.com/libvirt/libvirt/-/commit/152770333449cd3b78b4f5a9f1148fc1f482d842
-
  [2] https://listman.redhat.com/archives/libvir-list/2020-July/205291.html
-     PATCH 00/10] resolve hangs/crashes on libvirtd shutdown
+     [PATCH 00/10] resolve hangs/crashes on libvirtd shutdown

      commit 94e45d1042e21e03a15ce993f90fbef626f1ae41
      Author: Nikolay Shirokovskiy <nshirokovs...@virtuozzo.com>
      Date:   Thu Jul 23 09:53:04 2020 +0300

          rpc: finish all threads before exiting main loop

      $ git describe --contains 94e45d1042e21e03a15ce993f90fbef626f1ae41
      v6.8.0-rc1~279
-
  [Original Description]

  There's a race condition on libvirtd shutdown that
  might cause the domain status XML file(s) to lose
  the '<monitor path=...'> tag/field.

  This causes an error on libvirtd startup, and the
  domain is not listed/managed, even though it is still running.

  $ virsh list
   Id   Name      State
  -------------------------
   1    test-vm   running

  $ sudo systemctl restart libvirtd.service

  $ journalctl -b -u libvirtd.service | tail
  ...
  ... libvirtd[2789]: internal error: no monitor path
  ... libvirtd[2789]: Failed to load config for domain 'test-vm'

  $ virsh list
   Id   Name   State
  --------------------

  $ virsh list --all
   Id   Name      State
  --------------------------
   -    test-vm   shut off

  $ pgrep -af qemu-system-x86_64 | cut -d, -f1
  2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,

--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2059272

Title:
  libvirt domain is not listed/managed after libvirt restart with
  messages "internal error: no monitor path" and "Failed to load config
  for domain"