Public bug reported: This bug leads to data loss and corrupted filesystems!
Over the past few months/years, we've had occasional reports that users see their systems stuck on a 10min "stopping LXD snap" type message from systemd. We always thought it was our fault so added a lot of fallback logic and logging to our shutdown path to figure out what's going on. However in those cases, no sign of the shutdown logic was really reported, making it very odd. Because that usually happens when no more shells or SSH connections are running, figuring out what was going on took a LONG time. As mentioned in https://github.com/lxc/lxd-pkg-snap/issues/39, I've finally succeeded yesterday, having a system where this is reproducible AND where I can keep a shell. The process list looked like: ``` root 1 0.1 0.0 161464 8680 ? Ss 17:30 0:18 /sbin/init root 262 0.0 0.0 26296 8384 ? S<s 17:30 0:00 /lib/systemd/systemd-journald root 279 0.0 0.0 14476 3644 ? Ss 17:30 0:02 /lib/systemd/systemd-udevd systemd+ 300 0.0 0.0 11636 5176 ? Ss 17:30 0:00 /lib/systemd/systemd-networkd systemd+ 410 0.0 0.0 10436 4228 ? Ss 17:30 0:00 /lib/systemd/systemd-resolved systemd+ 411 0.0 0.0 83752 2512 ? Ssl 17:30 0:00 /lib/systemd/systemd-timesyncd root 412 0.0 0.1 1823860 17352 ? Ssl 17:30 0:09 /run/lxd_config/9p/lxd-agent root 10716 0.0 0.0 8468 3400 pts/0 Ss 20:28 0:00 \_ bash root 10891 0.0 0.0 10420 3236 pts/0 R+ 20:34 0:00 \_ ps fauxww message+ 419 0.0 0.0 6980 4076 ? Ss 17:30 0:03 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only root 7104 0.0 0.0 7680 4448 ? Ss 18:01 0:00 /usr/sbin/haveged --Foreground --verbose=1 -w 1024 root 7777 0.0 0.0 77944 1096 ? Ss 18:02 0:00 /sbin/lvmetad -f root 7707 0.0 0.0 1872 1272 ? Ss 18:36 0:00 /bin/sh /snap/lxd/14335/commands/daemon.start root 7863 0.0 0.3 1631048 47756 ? Sl 18:36 0:03 \_ lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd lxd 8189 0.0 0.0 6940 3028 ? Ss 18:36 0:00 \_ dnsmasq --keep-in-foreground --strict-order --bind-interfaces --except-interface=lo --no-ping --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.0.3.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.0.3.2,10.0.3.254,1h --listen-address=2001:470:b368:4242::1 --enable-ra --dhcp-range ::,constructor:lxdbr0,ra-stateless,ra-names -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd root 7858 0.0 0.0 85040 1232 ? Sl 18:36 0:00 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid root 10729 0.4 0.1 1282204 20464 ? Ssl 20:28 0:01 /usr/bin/snap run --command=stop lxd.daemon ``` Note that LXD is still running and that "snap run --command=stop lxd.daemon" has been invoked. Also note that no "snapd" processes are running. That "snap run" will hang there for 10min until systemd kills everything, including any running containers, causing any unflushed data to be lost and in some cases the entire LXD partition to be corrupted. Tracing the "snap run" process, it's attempting to connect to snapd through /run/snapd.socket. This is simply impossible and will never succeed as snapd isn't running anymore, so it just hangs there indefinitely. Now on the reproducing side of things, that's what took us ages, it's very hard to get a system in the right conditions AND have a shell when it happens. LXD VMs are the way to retain that shell, thanks to our agent not going away on shutdown until the kernel kills it. This appears to need to be combined with another condition though, zyga on IRC suggests that this code path would hit if system-id changed and/or the kernel got updated. In my case, our arm64 test VMs seem to be showing this behavior every time. This may be due to an issue with system-id in those, but it does make reproducing the issue and debugging it a fair bit easier. ** Affects: snapd (Ubuntu) Importance: Critical Status: Triaged ** Tags: champagne -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1871652 Title: Daemon snaps not properly stopped in some cases To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1871652/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs