Public bug reported:

This bug leads to data loss and corrupted filesystems!

Over the past few months/years, we've had occasional reports that users see 
their systems stuck on a 10min "stopping LXD snap" type message from systemd.

We always thought it was our fault so added a lot of fallback logic and
logging to our shutdown path to figure out what's going on. However in
those cases, no sign of the shutdown logic was really reported, making
it very odd.

Because that usually happens when no more shells or SSH connections are
running, figuring out what was going on took a LONG time.


As mentioned in https://github.com/lxc/lxd-pkg-snap/issues/39, I've finally 
succeeded yesterday, having a system where this is reproducible AND where I can 
keep a shell.

The process list looked like:

```
root         1  0.1  0.0 161464  8680 ?        Ss   17:30   0:18 /sbin/init
root       262  0.0  0.0  26296  8384 ?        S<s  17:30   0:00 
/lib/systemd/systemd-journald
root       279  0.0  0.0  14476  3644 ?        Ss   17:30   0:02 
/lib/systemd/systemd-udevd
systemd+   300  0.0  0.0  11636  5176 ?        Ss   17:30   0:00 
/lib/systemd/systemd-networkd
systemd+   410  0.0  0.0  10436  4228 ?        Ss   17:30   0:00 
/lib/systemd/systemd-resolved
systemd+   411  0.0  0.0  83752  2512 ?        Ssl  17:30   0:00 
/lib/systemd/systemd-timesyncd
root       412  0.0  0.1 1823860 17352 ?       Ssl  17:30   0:09 
/run/lxd_config/9p/lxd-agent
root     10716  0.0  0.0   8468  3400 pts/0    Ss   20:28   0:00  \_ bash
root     10891  0.0  0.0  10420  3236 pts/0    R+   20:34   0:00      \_ ps 
fauxww
message+   419  0.0  0.0   6980  4076 ?        Ss   17:30   0:03 
/usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile 
--systemd-activation --syslog-only
root      7104  0.0  0.0   7680  4448 ?        Ss   18:01   0:00 
/usr/sbin/haveged --Foreground --verbose=1 -w 1024
root      7777  0.0  0.0  77944  1096 ?        Ss   18:02   0:00 /sbin/lvmetad 
-f
root      7707  0.0  0.0   1872  1272 ?        Ss   18:36   0:00 /bin/sh 
/snap/lxd/14335/commands/daemon.start
root      7863  0.0  0.3 1631048 47756 ?       Sl   18:36   0:03  \_ lxd 
--logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
lxd       8189  0.0  0.0   6940  3028 ?        Ss   18:36   0:00      \_ 
dnsmasq --keep-in-foreground --strict-order --bind-interfaces 
--except-interface=lo --no-ping --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 
--quiet-ra --listen-address=10.0.3.1 --dhcp-no-override --dhcp-authoritative 
--dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases 
--dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts 
--dhcp-range 10.0.3.2,10.0.3.254,1h --listen-address=2001:470:b368:4242::1 
--enable-ra --dhcp-range ::,constructor:lxdbr0,ra-stateless,ra-names -s lxd -S 
/lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd
root      7858  0.0  0.0  85040  1232 ?        Sl   18:36   0:00 lxcfs 
/var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
root     10729  0.4  0.1 1282204 20464 ?       Ssl  20:28   0:01 /usr/bin/snap 
run --command=stop lxd.daemon
```

Note that LXD is still running and that "snap run --command=stop lxd.daemon" 
has been invoked.
Also note that no "snapd" processes are running.

That "snap run" will hang there for 10min until systemd kills
everything, including any running containers, causing any unflushed data
to be lost and in some cases the entire LXD partition to be corrupted.


Tracing the "snap run" process, it's attempting to connect to snapd through 
/run/snapd.socket.
This is simply impossible and will never succeed as snapd isn't running 
anymore, so it just hangs there indefinitely.


Now on the reproducing side of things, that's what took us ages, it's very hard 
to get a system in the right conditions AND have a shell when it happens.

LXD VMs are the way to retain that shell, thanks to our agent not going
away on shutdown until the kernel kills it. This appears to need to be
combined with another condition though, zyga on IRC suggests that this
code path would hit if system-id changed and/or the kernel got updated.

In my case, our arm64 test VMs seem to be showing this behavior every
time. This may be due to an issue with system-id in those, but it does
make reproducing the issue and debugging it a fair bit easier.

** Affects: snapd (Ubuntu)
     Importance: Critical
         Status: Triaged


** Tags: champagne

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1871652

Title:
  Daemon snaps not properly stopped in some cases

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1871652/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to